eland.DataFrame

Two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns) that references data stored in Elasticsearch indices. Where possible, APIs mirror pandas.DataFrame APIs. The underlying data is stored in Elasticsearch rather than in core memory.

Parameters

es_client: Elasticsearch client argument(s) (e.g. ‘http://localhost:9200’)

Either elasticsearch-py parameters or an elasticsearch-py instance.

es_index_pattern: str

Elasticsearch index pattern. This can contain wildcards. (e.g. ‘flights’)

columns: list of str, optional

List of DataFrame columns. A subset of the Elasticsearch index’s fields.

es_index_field: str, optional

The Elasticsearch index field to use as the DataFrame index. Defaults to _id if None is used.
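To make the wildcard behaviour of es_index_pattern concrete, the following is a minimal sketch that approximates a `*` wildcard with Python's `fnmatch` module. The index names and the helper `matching_indices` are hypothetical, and `fnmatch` is only a stand-in: real Elasticsearch index resolution has additional rules (aliases, exclusions, comma-separated targets) that are not modelled here.

```python
from fnmatch import fnmatchcase

# Hypothetical index names, not taken from a real cluster.
INDICES = ["flights", "flights-2018", "flights-2019", "ecommerce"]


def matching_indices(pattern, indices):
    """Return the index names that a simple `*` wildcard pattern would select.

    This is an approximation using fnmatch, not Elasticsearch's own resolver.
    """
    return [name for name in indices if fnmatchcase(name, pattern)]


# An exact name selects a single index; a trailing `*` selects a family.
print(matching_indices("flights", INDICES))
print(matching_indices("flights*", INDICES))
```

A pattern without wildcards behaves like an exact index name, which is why the examples on this page can pass 'flights' directly.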

Examples

Constructing a DataFrame from Elasticsearch configuration arguments and an Elasticsearch index

>>> import eland as ed
>>> df = ed.DataFrame('http://localhost:9200', 'flights')
>>> df.head()
   AvgTicketPrice  Cancelled  ... dayOfWeek           timestamp
0      841.265642      False  ...         0 2018-01-01 00:00:00
1      882.982662      False  ...         0 2018-01-01 18:27:00
2      190.636904      False  ...         0 2018-01-01 17:11:14
3      181.694216       True  ...         0 2018-01-01 10:33:28
4      730.041778      False  ...         0 2018-01-01 05:13:00

[5 rows x 28 columns]

Constructing a DataFrame from an Elasticsearch client and an Elasticsearch index

>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch("http://localhost:9200")
>>> df = ed.DataFrame(es_client=es, es_index_pattern='flights', columns=['AvgTicketPrice', 'Cancelled'])
>>> df.head()
   AvgTicketPrice  Cancelled
0      841.265642      False
1      882.982662      False
2      190.636904      False
3      181.694216       True
4      730.041778      False

[5 rows x 2 columns]

Constructing a DataFrame from Elasticsearch client arguments and an Elasticsearch index, with ‘timestamp’ as the DataFrame index field (TODO: currently the index field must also be a field if it is not _id)

>>> df = ed.DataFrame(
...     es_client='http://localhost:9200',
...     es_index_pattern='flights',
...     columns=['AvgTicketPrice', 'timestamp'],
...     es_index_field='timestamp'
... )
>>> df.head()
                     AvgTicketPrice           timestamp
2018-01-01T00:00:00      841.265642 2018-01-01 00:00:00
2018-01-01T00:02:06      772.100846 2018-01-01 00:02:06
2018-01-01T00:06:27      159.990962 2018-01-01 00:06:27
2018-01-01T00:33:31      800.217104 2018-01-01 00:33:31
2018-01-01T00:36:51      803.015200 2018-01-01 00:36:51

[5 rows x 2 columns]

There are effectively 2 constructors:

es_client, es_index_pattern, columns, es_index_field
query_compiler (eland.QueryCompiler)

The constructor with ‘query_compiler’ is for internal use only.
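The idea of one `__init__` acting as two constructors can be sketched as follows. This is a simplified illustration, not eland's implementation: the class `FrameSketch`, the plain dict standing in for a query compiler, and the error message are all assumptions made for the example.

```python
class FrameSketch:
    """Hypothetical stand-in showing one __init__ serving two construction paths."""

    def __init__(self, es_client=None, es_index_pattern=None, columns=None,
                 es_index_field=None, query_compiler=None):
        if query_compiler is not None:
            # Internal path: reuse an existing compiler object as-is.
            self._query_compiler = query_compiler
            return
        if es_client is None or es_index_pattern is None:
            raise ValueError("es_client and es_index_pattern must be provided")
        # Public path: a plain dict stands in for a real query compiler here.
        self._query_compiler = {
            "client": es_client,
            "index_pattern": es_index_pattern,
            "columns": columns,
            # Mirrors the documented default: _id is used when None is given.
            "index_field": es_index_field if es_index_field is not None else "_id",
        }


# Public path: built from client arguments and an index pattern.
public = FrameSketch("http://localhost:9200", "flights")
# Internal path: a derived frame shares the existing compiler.
derived = FrameSketch(query_compiler=public._query_compiler)
```

Dispatching on a single optional argument keeps the public signature simple while letting internal operations hand an already-built compiler to a new object.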

Methods

`__init__`([es_client, es_index_pattern, ...])	There are effectively 2 constructors.
`agg`(func[, axis, numeric_only])	Aggregate using one or more operations over the specified axis.
`aggregate`(func[, axis, numeric_only])	Aggregate using one or more operations over the specified axis.
`count`()	Count non-NA cells for each column.
`describe`()	Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
`drop`([labels, axis, index, columns, level, ...])	Return new object with labels in requested axis removed.
`es_info`()	A debug summary of an eland DataFrame internals.
`es_match`(text, *[, columns, match_phrase, ...])	Filters data with an Elasticsearch `match`, `match_phrase`, or `multi_match` query depending on the given parameters and columns.
`es_query`(query)	Applies an Elasticsearch DSL query to the current DataFrame.
`filter`([items, like, regex, axis])	Subset the dataframe rows or columns according to the specified index labels.
`get`(key[, default])	Get item from object for given key (ex: DataFrame column).
`groupby`([by, dropna])	Used to perform groupby operations.
`head`([n])	Return the first n rows.
`hist`([column, by, grid, xlabelsize, xrot, ...])	Make a histogram of the DataFrame's columns.
`idxmax`([axis])	Return index of first occurrence of maximum over requested axis.
`idxmin`([axis])	Return index of first occurrence of minimum over requested axis.
`info`([verbose, buf, max_cols, memory_usage, ...])	Print a concise summary of a DataFrame.
`iterrows`()	Iterate over eland.DataFrame rows as (index, pandas.Series) pairs.
`itertuples`([index, name])	Iterate over eland.DataFrame rows as namedtuples.
`keys`()	Return columns
`mad`([numeric_only])	Return median absolute deviation for each numeric column
`max`([numeric_only])	Return the maximum value for each numeric column
`mean`([numeric_only])	Return mean value for each numeric column
`median`([numeric_only])	Return the median value for each numeric column
`min`([numeric_only])	Return the minimum value for each numeric column
`mode`([numeric_only, dropna, es_size])	Calculate mode of a DataFrame
`nunique`()	Return cardinality of each field.
`quantile`([q, numeric_only])	Used to calculate quantile for a given DataFrame.
`query`(expr)	Query the columns of a DataFrame with a boolean expression.
`sample`([n, frac, random_state])	Return n randomly sampled rows or the specified fraction of rows.
`select_dtypes`([include, exclude])	Return a subset of the DataFrame's columns based on the column dtypes.
`std`([numeric_only])	Return standard deviation for each numeric column
`sum`([numeric_only])	Return sum for each numeric column
`tail`([n])	Return the last n rows.
`to_csv`([path_or_buf, sep, na_rep, ...])	Write Elasticsearch data to a comma-separated values (csv) file.
`to_html`([buf, columns, col_space, header, ...])	Render Elasticsearch data as an HTML table.
`to_json`([path_or_buf, orient, date_format, ...])	Write Elasticsearch data to a JSON file.
`to_numpy`()	Not implemented.
`to_pandas`([show_progress])	Utility method to convert eland.DataFrame to pandas.DataFrame
`to_string`([buf, columns, col_space, header, ...])	Render a DataFrame to a console-friendly tabular output.
`var`([numeric_only])	Return variance for each numeric column
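The `es_match` summary above says the query type depends on the parameters and columns given. The sketch below shows one plausible way such a choice could map onto standard Elasticsearch query DSL. It is not eland's code: the helper `build_match_query`, its rule of requiring at least one column, and the column names are assumptions for illustration only.

```python
def build_match_query(text, columns=None, match_phrase=False):
    """Build an Elasticsearch query body for a full-text match (illustrative only)."""
    if isinstance(columns, str):
        columns = [columns]
    if not columns:
        # Assumption of this sketch: a column must be named explicitly.
        raise ValueError("at least one column is required in this sketch")
    if len(columns) == 1:
        # One field: a plain match or match_phrase query.
        query_type = "match_phrase" if match_phrase else "match"
        return {query_type: {columns[0]: {"query": text}}}
    # Several fields: multi_match, switched to phrase matching when requested.
    options = {"query": text, "fields": list(columns)}
    if match_phrase:
        options["type"] = "phrase"
    return {"multi_match": options}


# Hypothetical column names used purely as examples.
print(build_match_query("Amsterdam", columns="DestCityName"))
print(build_match_query("New York", columns=["OriginCityName", "DestCityName"],
                        match_phrase=True))
```

The dictionaries produced are ordinary Elasticsearch query DSL, which makes it easier to see what a full-text filter amounts to on the Elasticsearch side.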

Attributes

`columns`	The column labels of the DataFrame.
`dtypes`	Return the pandas dtypes in the DataFrame.
`empty`	Determines if the DataFrame is empty.
`es_dtypes`	Return the Elasticsearch dtypes in the index
`index`	Return eland index referencing Elasticsearch field to index a DataFrame/Series
`ndim`	Returns 2 by definition of a DataFrame
`shape`	Return a tuple representing the dimensionality of the DataFrame.
`size`	Return an int representing the number of elements in this object.
`values`	Not implemented.
