eland.DataFrame

class eland.DataFrame(es_client: Optional[Union[str, List[str], Tuple[str, ...], Elasticsearch]] = None, es_index_pattern: Optional[str] = None, columns: Optional[List[str]] = None, es_index_field: Optional[str] = None, _query_compiler: Optional[QueryCompiler] = None)

Two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns) referencing data stored in Elasticsearch indices. Where possible, APIs mirror pandas.DataFrame APIs. The underlying data is stored in Elasticsearch rather than in local memory.

Parameters

es_client: Elasticsearch client argument(s) (e.g. ‘http://localhost:9200’)
  • elasticsearch-py parameters or
  • elasticsearch-py instance

es_index_pattern: str

Elasticsearch index pattern (e.g. ‘flights’). This can contain wildcards.

columns: list of str, optional

List of DataFrame columns. A subset of the Elasticsearch index’s fields.

es_index_field: str, optional

The Elasticsearch index field to use as the DataFrame index. Defaults to _id if None is used.

See Also

pandas.DataFrame

Examples

Constructing a DataFrame from Elasticsearch configuration arguments and an Elasticsearch index

>>> import eland as ed
>>> df = ed.DataFrame('http://localhost:9200', 'flights')
>>> df.head()
   AvgTicketPrice  Cancelled  ... dayOfWeek           timestamp
0      841.265642      False  ...         0 2018-01-01 00:00:00
1      882.982662      False  ...         0 2018-01-01 18:27:00
2      190.636904      False  ...         0 2018-01-01 17:11:14
3      181.694216       True  ...         0 2018-01-01 10:33:28
4      730.041778      False  ...         0 2018-01-01 05:13:00

[5 rows x 27 columns]

Constructing a DataFrame from an Elasticsearch client and an Elasticsearch index

>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch("http://localhost:9200")
>>> df = ed.DataFrame(es_client=es, es_index_pattern='flights', columns=['AvgTicketPrice', 'Cancelled'])
>>> df.head()
   AvgTicketPrice  Cancelled
0      841.265642      False
1      882.982662      False
2      190.636904      False
3      181.694216       True
4      730.041778      False

[5 rows x 2 columns]

Constructing a DataFrame from an Elasticsearch client argument and an Elasticsearch index, with ‘timestamp’ as the DataFrame index field (TODO: currently, es_index_field must also be a field if it is not _id)

>>> df = ed.DataFrame(
...     es_client='http://localhost:9200',
...     es_index_pattern='flights',
...     columns=['AvgTicketPrice', 'timestamp'],
...     es_index_field='timestamp'
... )
>>> df.head()
                     AvgTicketPrice           timestamp
2018-01-01T00:00:00      841.265642 2018-01-01 00:00:00
2018-01-01T00:02:06      772.100846 2018-01-01 00:02:06
2018-01-01T00:06:27      159.990962 2018-01-01 00:06:27
2018-01-01T00:33:31      800.217104 2018-01-01 00:33:31
2018-01-01T00:36:51      803.015200 2018-01-01 00:36:51

[5 rows x 2 columns]
__init__(es_client: Optional[Union[str, List[str], Tuple[str, ...], Elasticsearch]] = None, es_index_pattern: Optional[str] = None, columns: Optional[List[str]] = None, es_index_field: Optional[str] = None, _query_compiler: Optional[QueryCompiler] = None) -> None

There are effectively two constructors:

  1. es_client, es_index_pattern, columns, es_index_field

  2. _query_compiler (eland.QueryCompiler)

The constructor with ‘_query_compiler’ is for internal use only.
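The public constructor described above can be wrapped so that calling code never touches the internal query-compiler path. The following is a minimal sketch, not part of eland: `frame_kwargs` and `open_frame` are hypothetical helper names, and the only eland call used is the documented `ed.DataFrame(es_client=..., es_index_pattern=..., columns=..., es_index_field=...)` constructor.

```python
from typing import Any, Dict, List, Optional


def frame_kwargs(
    es_index_pattern: str,
    columns: Optional[List[str]] = None,
    es_index_field: Optional[str] = None,
) -> Dict[str, Any]:
    """Collect keyword arguments for the public eland.DataFrame constructor.

    Hypothetical helper, not part of eland. Optional values left as None are
    omitted so that eland applies its own defaults (for example, _id as the
    index field).
    """
    if not es_index_pattern:
        raise ValueError("es_index_pattern must be a non-empty string")
    kwargs: Dict[str, Any] = {"es_index_pattern": es_index_pattern}
    if columns is not None:
        kwargs["columns"] = list(columns)
    if es_index_field is not None:
        kwargs["es_index_field"] = es_index_field
    return kwargs


def open_frame(es_client: Any, **kwargs: Any) -> Any:
    """Build a DataFrame through the public constructor only."""
    # Imported lazily so that frame_kwargs can be used without eland installed.
    import eland as ed

    return ed.DataFrame(es_client=es_client, **frame_kwargs(**kwargs))
```

With a reachable cluster, `open_frame('http://localhost:9200', es_index_pattern='flights', columns=['AvgTicketPrice', 'timestamp'], es_index_field='timestamp')` would be equivalent to the third example in the Examples section.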

Methods

__init__([es_client, es_index_pattern, ...])

There are effectively two constructors: a public one and an internal one that takes a query compiler.

agg(func[, axis, numeric_only])

Aggregate using one or more operations over the specified axis.

aggregate(func[, axis, numeric_only])

Aggregate using one or more operations over the specified axis.

count()

Count non-NA cells for each column.

describe()

Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

drop([labels, axis, index, columns, level, ...])

Return new object with labels in requested axis removed.

es_info()

A debug summary of an eland DataFrame's internals.

es_match(text, *[, columns, match_phrase, ...])

Filters data with an Elasticsearch match, match_phrase, or multi_match query depending on the given parameters and columns.

es_query(query)

Applies an Elasticsearch DSL query to the current DataFrame.

filter([items, like, regex, axis])

Subset the dataframe rows or columns according to the specified index labels.

get(key[, default])

Get item from object for given key (ex: DataFrame column).

groupby([by, dropna])

Perform groupby operations.

head([n])

Return the first n rows.

hist([column, by, grid, xlabelsize, xrot, ...])

Make a histogram of the DataFrame's columns.

idxmax([axis])

Return index of first occurrence of maximum over requested axis.

idxmin([axis])

Return index of first occurrence of minimum over requested axis.

info([verbose, buf, max_cols, memory_usage, ...])

Print a concise summary of a DataFrame.

iterrows()

Iterate over eland.DataFrame rows as (index, pandas.Series) pairs.

itertuples([index, name])

Iterate over eland.DataFrame rows as namedtuples.

keys()

Return the column labels.

mad([numeric_only])

Return median absolute deviation for each numeric column

max([numeric_only])

Return the maximum value for each numeric column

mean([numeric_only])

Return mean value for each numeric column

median([numeric_only])

Return the median value for each numeric column

min([numeric_only])

Return the minimum value for each numeric column

mode([numeric_only, dropna, es_size])

Calculate mode of a DataFrame

nunique()

Return cardinality of each field.

quantile([q, numeric_only])

Calculate quantiles for the DataFrame.

query(expr)

Query the columns of a DataFrame with a boolean expression.

sample([n, frac, random_state])

Return n randomly sampled rows or the specified fraction of rows.

select_dtypes([include, exclude])

Return a subset of the DataFrame's columns based on the column dtypes.

std([numeric_only])

Return standard deviation for each numeric column

sum([numeric_only])

Return sum for each numeric column

tail([n])

Return the last n rows.

to_csv([path_or_buf, sep, na_rep, ...])

Write Elasticsearch data to a comma-separated values (csv) file.

to_html([buf, columns, col_space, header, ...])

Render Elasticsearch data as an HTML table.

to_json([path_or_buf, orient, date_format, ...])

Write Elasticsearch data to a JSON file.

to_numpy()

Not implemented.

to_pandas([show_progress])

Utility method to convert an eland.DataFrame to a pandas.DataFrame.

to_string([buf, columns, col_space, header, ...])

Render a DataFrame to a console-friendly tabular output.

var([numeric_only])

Return variance for each numeric column
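Several of the methods listed above are typically chained: filter on the Elasticsearch side with `es_query`, then reduce with `agg`. The sketch below illustrates that pattern under stated assumptions: `price_range_query` and `summarize` are hypothetical helper names, the field name `AvgTicketPrice` comes from the flights examples on this page, and `df` is assumed to be an eland.DataFrame.

```python
from typing import Any, Dict, List


def price_range_query(field: str, low: float, high: float) -> Dict[str, Any]:
    """Build an Elasticsearch DSL range query as a plain dict."""
    return {"range": {field: {"gte": low, "lte": high}}}


def summarize(df: Any, query: Dict[str, Any], funcs: List[str]) -> Any:
    """Filter with es_query, then aggregate the remaining rows.

    `df` is expected to be an eland.DataFrame; only the documented
    es_query(query) and agg(func, numeric_only=...) methods are used.
    """
    filtered = df.es_query(query)
    return filtered.agg(funcs, numeric_only=True)


# Usage sketch (requires eland and a reachable cluster, so it is not run here):
# import eland as ed
# df = ed.DataFrame("http://localhost:9200", "flights")
# stats = summarize(df, price_range_query("AvgTicketPrice", 500, 1000),
#                   ["min", "max", "mean"])
```

Keeping the query builder separate from the DataFrame call means the query dict can be inspected or unit-tested without a running Elasticsearch instance.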

Attributes

columns

The column labels of the DataFrame.

dtypes

Return the pandas dtypes in the DataFrame.

empty

Determines if the DataFrame is empty.

es_dtypes

Return the Elasticsearch dtypes in the index

index

Return the eland index, which references the Elasticsearch field used to index a DataFrame/Series.

ndim

Return 2, by definition of a DataFrame.

shape

Return a tuple representing the dimensionality of the DataFrame.

size

Return an int representing the number of elements in this object.

values

Not implemented.
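The attributes above can be gathered into a quick overview before pulling any rows into memory. This is a minimal sketch: `frame_overview` is a hypothetical helper, `df` is assumed to be an eland.DataFrame, and only attributes listed in this table are read. Because `values` is listed as not implemented, the sketch avoids it; converting with `to_pandas()` first is one way to get at the underlying array.

```python
from typing import Any, Dict


def frame_overview(df: Any) -> Dict[str, Any]:
    """Collect the documented shape-related attributes into a plain dict.

    Works with any object exposing shape, columns, ndim, size and empty,
    which is the subset of the eland.DataFrame attributes used here.
    """
    rows, cols = df.shape
    return {
        "rows": rows,
        "column_count": cols,
        "columns": list(df.columns),
        "ndim": df.ndim,
        "size": df.size,
        "empty": df.empty,
    }


# Usage sketch (requires eland and a reachable cluster, so it is not run here):
# import eland as ed
# df = ed.DataFrame("http://localhost:9200", "flights")
# print(frame_overview(df))
```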