eland.csv_to_eland#

eland.csv_to_eland(filepath_or_buffer, es_client: Union[str, List[str], Tuple[str, ...], Elasticsearch], es_dest_index: str, es_if_exists: str = 'fail', es_refresh: bool = False, es_dropna: bool = False, es_type_overrides: Optional[Mapping[str, str]] = None, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, chunksize=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, warn_bad_lines: bool = True, error_bad_lines: bool = True, on_bad_lines: str = 'error', delim_whitespace=False, low_memory: bool = True, memory_map=False, float_precision=None) → DataFrame#

Read a comma-separated values (csv) file into eland.DataFrame (i.e. an Elasticsearch index).

Modifies an Elasticsearch index.

Note: pandas iteration options are not supported.

Parameters#

es_client: Elasticsearch client argument(s)
  • elasticsearch-py parameters (a host URL string, or a list or tuple of them), or

  • an elasticsearch-py Elasticsearch instance

es_dest_index: str

Name of the Elasticsearch index to write to.

es_if_exists: {‘fail’, ‘replace’, ‘append’}, default ‘fail’

How to behave if the index already exists.

  • fail: Raise a ValueError.

  • replace: Delete the index before inserting new values.

  • append: Insert new values into the existing index. Create it if it does not exist.
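
The three modes can be pictured with a small in-memory model. This is only an illustrative sketch: `resolve_if_exists` and the `indices` dict are hypothetical stand-ins for an Elasticsearch cluster, not part of eland or elasticsearch-py.

```python
def resolve_if_exists(indices, name, docs, es_if_exists="fail"):
    """Hypothetical model of es_if_exists.

    `indices` maps an index name to a list of documents.
    """
    if es_if_exists not in ("fail", "replace", "append"):
        raise ValueError(f"unknown es_if_exists: {es_if_exists!r}")
    if name in indices:
        if es_if_exists == "fail":
            # fail: refuse to touch an existing index
            raise ValueError(f"index {name!r} already exists")
        if es_if_exists == "replace":
            # replace: delete the index before inserting new values
            del indices[name]
    # append, or the index is missing: create if needed, then insert
    indices.setdefault(name, []).extend(docs)
    return indices[name]
```

In this model every mode creates the index when it is missing; the modes differ only in how they treat an index that already exists.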

es_dropna: bool, default False
  • True: Remove missing values (see pandas.Series.dropna)

  • False: Include missing values; this may cause the bulk request to fail
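
As a rough picture of what removing missing values means for a single row, the sketch below drops `None` and NaN entries from a dict. It is a standard-library approximation written for illustration; `drop_missing` is a hypothetical helper, and eland's own handling is the one described by pandas.Series.dropna.

```python
import math


def drop_missing(row):
    """Return a copy of `row` without None or NaN values (hypothetical helper)."""
    return {
        key: value
        for key, value in row.items()
        if value is not None
        and not (isinstance(value, float) and math.isnan(value))
    }


row = {"state": "KS", "total intl minutes": float("nan"), "churn": 0}
print(drop_missing(row))  # {'state': 'KS', 'churn': 0}
```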

es_type_overrides: dict, default None

Dict of column name: es_type pairs that override the default Elasticsearch data type mappings.
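
A possible use, assuming the churn.csv columns from the Examples section: map `area code` and `phone number` to the Elasticsearch `keyword` type so they are treated as identifiers, not numbers or analyzed text. The `load_churn` wrapper and the choice of columns are assumptions made for illustration; the call needs eland installed and a reachable cluster.

```python
# Assumed overrides for churn.csv; "keyword" is a standard Elasticsearch
# field type for exact-match values.
type_overrides = {
    "area code": "keyword",
    "phone number": "keyword",
}


def load_churn(es_url="http://localhost:9200"):
    """Hypothetical wrapper around csv_to_eland using the overrides above."""
    import eland as ed  # imported lazily; requires eland to be installed

    return ed.csv_to_eland(
        "churn.csv",
        es_client=es_url,
        es_dest_index="churn",
        es_if_exists="replace",
        es_type_overrides=type_overrides,
        index_col=0,
    )
```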

chunksize

Number of CSV rows to read before each bulk index into Elasticsearch.
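
The idea of reading a fixed number of rows and then indexing them can be sketched with the standard library alone. This is a simplified model, not eland's implementation: `iter_row_chunks` is a hypothetical helper, and each yielded chunk stands in for one round of bulk indexing.

```python
import csv
import io
from itertools import islice


def iter_row_chunks(csv_text, chunksize):
    """Yield lists of at most `chunksize` rows (as dicts) from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        yield chunk  # in this model, one chunk is one round of bulk indexing


sample = "id,state\n0,KS\n1,OH\n2,NJ\n"
sizes = [len(chunk) for chunk in iter_row_chunks(sample, chunksize=2)]
print(sizes)  # [2, 1]
```

A smaller chunk size keeps fewer rows in memory at once, at the cost of more rounds.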

Other Parameters#

Parameters derived from pandas.read_csv.

See Also#

pandas.read_csv

Notes#

The pandas.read_csv iterator option is not supported.

Examples#

Check whether the ‘churn’ index exists in Elasticsearch:

>>> from elasticsearch import Elasticsearch 
>>> es = Elasticsearch() 
>>> es.indices.exists(index="churn") 
False

Read ‘churn.csv’ and use the first column as _id (and as the eland.DataFrame index):

# churn.csv
,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,0
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,0
...
>>> import eland as ed
>>> ed.csv_to_eland(
...     "churn.csv",
...     es_client='http://localhost:9200',
...     es_dest_index='churn',
...     es_refresh=True,
...     index_col=0
... )
          account length  area code  churn  customer service calls  ... total night calls  total night charge total night minutes voice mail plan
0                128        415      0                       1  ...                91               11.01               244.7             yes
1                107        415      0                       1  ...               103               11.45               254.4             yes
2                137        415      0                       0  ...               104                7.32               162.6              no
3                 84        408      0                       2  ...                89                8.86               196.9              no
4                 75        415      0                       3  ...               121                8.41               186.9              no
...              ...        ...    ...                     ...  ...               ...                 ...                 ...             ...
3328             192        415      0                       2  ...                83               12.56               279.1             yes
3329              68        415      0                       3  ...               123                8.61               191.3              no
3330              28        510      0                       2  ...                91                8.64               191.9              no
3331             184        510      0                       2  ...               137                6.26               139.2              no
3332              74        415      0                       0  ...                77               10.86               241.4             yes

[3333 rows x 21 columns]

Validate that the data now exists in the ‘churn’ index:

>>> es.search(index="churn", size=1) 
{'took': 1, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 3333, 'relation': 'eq'}, 'max_score': 1.0, 'hits': [{'_index': 'churn', '_id': '0', '_score': 1.0, '_source': {'state': 'KS', 'account length': 128, 'area code': 415, 'phone number': '382-4657', 'international plan': 'no', 'voice mail plan': 'yes', 'number vmail messages': 25, 'total day minutes': 265.1, 'total day calls': 110, 'total day charge': 45.07, 'total eve minutes': 197.4, 'total eve calls': 99, 'total eve charge': 16.78, 'total night minutes': 244.7, 'total night calls': 91, 'total night charge': 11.01, 'total intl minutes': 10.0, 'total intl calls': 3, 'total intl charge': 2.7, 'customer service calls': 1, 'churn': 0}}]}}

TODO: currently, the eland.DataFrame may not retain the row order of the data in the CSV.