eland.DataFrame
- class eland.DataFrame(es_client: Optional[Union[str, List[str], Tuple[str, ...], Elasticsearch]] = None, es_index_pattern: Optional[str] = None, columns: Optional[List[str]] = None, es_index_field: Optional[str] = None, _query_compiler: Optional[QueryCompiler] = None)
Two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns), referencing data stored in Elasticsearch indices. Where possible, APIs mirror pandas.DataFrame APIs. The underlying data is stored in Elasticsearch rather than in core memory.
Parameters
- es_client: Elasticsearch client argument(s) (e.g. ‘http://localhost:9200’)
Either elasticsearch-py parameters or an elasticsearch-py instance.
- es_index_pattern: str
Elasticsearch index pattern. This can contain wildcards. (e.g. ‘flights’)
- columns: list of str, optional
List of DataFrame columns. A subset of the Elasticsearch index’s fields.
- es_index_field: str, optional
The Elasticsearch index field to use as the DataFrame index. Defaults to _id if None is used.
Examples
Constructing a DataFrame from Elasticsearch configuration arguments and an Elasticsearch index
>>> import eland as ed
>>> df = ed.DataFrame('http://localhost:9200', 'flights')
>>> df.head()
   AvgTicketPrice  Cancelled  ... dayOfWeek           timestamp
0      841.265642      False  ...         0 2018-01-01 00:00:00
1      882.982662      False  ...         0 2018-01-01 18:27:00
2      190.636904      False  ...         0 2018-01-01 17:11:14
3      181.694216       True  ...         0 2018-01-01 10:33:28
4      730.041778      False  ...         0 2018-01-01 05:13:00

[5 rows x 28 columns]
Constructing a DataFrame from an Elasticsearch client and an Elasticsearch index
>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch("http://localhost:9200")
>>> df = ed.DataFrame(es_client=es, es_index_pattern='flights', columns=['AvgTicketPrice', 'Cancelled'])
>>> df.head()
   AvgTicketPrice  Cancelled
0      841.265642      False
1      882.982662      False
2      190.636904      False
3      181.694216       True
4      730.041778      False

[5 rows x 2 columns]
Constructing a DataFrame from an Elasticsearch client and an Elasticsearch index, with ‘timestamp’ as the DataFrame index field (TODO: currently es_index_field must also be a field if it is not _id)
>>> df = ed.DataFrame(
...     es_client='http://localhost:9200',
...     es_index_pattern='flights',
...     columns=['AvgTicketPrice', 'timestamp'],
...     es_index_field='timestamp'
... )
>>> df.head()
                     AvgTicketPrice           timestamp
2018-01-01T00:00:00      841.265642 2018-01-01 00:00:00
2018-01-01T00:02:06      772.100846 2018-01-01 00:02:06
2018-01-01T00:06:27      159.990962 2018-01-01 00:06:27
2018-01-01T00:33:31      800.217104 2018-01-01 00:33:31
2018-01-01T00:36:51      803.015200 2018-01-01 00:36:51

[5 rows x 2 columns]
- __init__(es_client: Optional[Union[str, List[str], Tuple[str, ...], Elasticsearch]] = None, es_index_pattern: Optional[str] = None, columns: Optional[List[str]] = None, es_index_field: Optional[str] = None, _query_compiler: Optional[QueryCompiler] = None) -> None
There are effectively 2 constructors:
- client, index_pattern, columns, index_field
- query_compiler (eland.QueryCompiler)
The constructor with ‘query_compiler’ is for internal use only.
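As a rough illustration of how a single `__init__` can serve both construction paths, here is a minimal sketch. `FrameSketch` and `QueryCompilerSketch` are hypothetical stand-ins invented for this example; they are not eland classes, and the validation and error message are assumptions rather than eland's actual behavior.

```python
from typing import List, Optional


class QueryCompilerSketch:
    """Hypothetical stand-in for a query compiler (not eland.QueryCompiler)."""

    def __init__(
        self,
        client: object,
        index_pattern: str,
        columns: Optional[List[str]] = None,
        index_field: Optional[str] = None,
    ) -> None:
        self.client = client
        self.index_pattern = index_pattern
        self.columns = columns
        # Mirrors the documented default: fall back to _id when None is given.
        self.index_field = index_field if index_field is not None else "_id"


class FrameSketch:
    """Hypothetical class whose __init__ dispatches between two constructors."""

    def __init__(
        self,
        es_client: Optional[object] = None,
        es_index_pattern: Optional[str] = None,
        columns: Optional[List[str]] = None,
        es_index_field: Optional[str] = None,
        _query_compiler: Optional[QueryCompilerSketch] = None,
    ) -> None:
        if _query_compiler is not None:
            # Internal path: wrap an existing compiler without touching it.
            self._query_compiler = _query_compiler
            return
        # Public path: both client and index pattern are required (assumed rule).
        if es_client is None or es_index_pattern is None:
            raise ValueError("es_client and es_index_pattern are both required")
        self._query_compiler = QueryCompilerSketch(
            es_client, es_index_pattern, columns, es_index_field
        )


public = FrameSketch("http://localhost:9200", "flights", columns=["AvgTicketPrice"])
internal = FrameSketch(_query_compiler=public._query_compiler)
```

In this sketch the internal path shares the same compiler object, which is one plausible reason such a constructor would be reserved for internal use.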
Methods
- __init__([es_client, es_index_pattern, ...]): There are effectively 2 constructors.
- agg(func[, axis, numeric_only]): Aggregate using one or more operations over the specified axis.
- aggregate(func[, axis, numeric_only]): Aggregate using one or more operations over the specified axis.
- count(): Count non-NA cells for each column.
- describe(): Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
- drop([labels, axis, index, columns, level, ...]): Return new object with labels in requested axis removed.
- es_info(): A debug summary of an eland DataFrame's internals.
- es_match(text, *[, columns, match_phrase, ...]): Filters data with an Elasticsearch match, match_phrase, or multi_match query depending on the given parameters and columns.
- es_query(query): Applies an Elasticsearch DSL query to the current DataFrame.
- filter([items, like, regex, axis]): Subset the DataFrame rows or columns according to the specified index labels.
- get(key[, default]): Get item from object for given key (e.g. a DataFrame column).
- groupby([by, dropna]): Used to perform groupby operations.
- head([n]): Return the first n rows.
- hist([column, by, grid, xlabelsize, xrot, ...]): Make a histogram of the DataFrame's columns.
- idxmax([axis]): Return index of first occurrence of maximum over requested axis.
- idxmin([axis]): Return index of first occurrence of minimum over requested axis.
- info([verbose, buf, max_cols, memory_usage, ...]): Print a concise summary of a DataFrame.
- iterrows(): Iterate over eland.DataFrame rows as (index, pandas.Series) pairs.
- itertuples([index, name]): Iterate over eland.DataFrame rows as namedtuples.
- keys(): Return columns.
- mad([numeric_only]): Return median absolute deviation for each numeric column.
- max([numeric_only]): Return the maximum value for each numeric column.
- mean([numeric_only]): Return mean value for each numeric column.
- median([numeric_only]): Return the median value for each numeric column.
- min([numeric_only]): Return the minimum value for each numeric column.
- mode([numeric_only, dropna, es_size]): Calculate mode of a DataFrame.
- nunique(): Return cardinality of each field.
- quantile([q, numeric_only]): Used to calculate quantiles for a given DataFrame.
- query(expr): Query the columns of a DataFrame with a boolean expression.
- sample([n, frac, random_state]): Return n randomly sampled rows or the specified fraction of rows.
- select_dtypes([include, exclude]): Return a subset of the DataFrame's columns based on the column dtypes.
- std([numeric_only]): Return standard deviation for each numeric column.
- sum([numeric_only]): Return sum for each numeric column.
- tail([n]): Return the last n rows.
- to_csv([path_or_buf, sep, na_rep, ...]): Write Elasticsearch data to a comma-separated values (csv) file.
- to_html([buf, columns, col_space, header, ...]): Render Elasticsearch data as an HTML table.
- to_json([path_or_buf, orient, date_format, ...]): Write Elasticsearch data to a JSON file.
- to_numpy(): Not implemented.
- to_pandas([show_progress]): Utility method to convert an eland.DataFrame to a pandas.DataFrame.
- to_string([buf, columns, col_space, header, ...]): Render a DataFrame to a console-friendly tabular output.
- var([numeric_only]): Return variance for each numeric column.
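The es_match entry above says the query type depends on the given parameters and columns. The sketch below shows one plausible way such a choice could be made, producing standard Elasticsearch query DSL bodies. `build_match_query` is a hypothetical helper written for this example, not part of eland, and the selection rule (one column versus several) is an assumption.

```python
from typing import Any, Dict, List, Union


def build_match_query(
    text: str,
    columns: Union[str, List[str]],
    match_phrase: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: pick match, match_phrase, or multi_match."""
    # Accept a single column name as shorthand for a one-element list.
    if isinstance(columns, str):
        columns = [columns]
    if not columns:
        raise ValueError("at least one column is required")

    if len(columns) == 1:
        # A single field maps onto a plain match or match_phrase query.
        kind = "match_phrase" if match_phrase else "match"
        return {kind: {columns[0]: {"query": text}}}

    # Several fields map onto multi_match; phrase matching becomes its type.
    body: Dict[str, Any] = {"query": text, "fields": list(columns)}
    if match_phrase:
        body["type"] = "phrase"
    return {"multi_match": body}


single = build_match_query("joe", "title")
phrase = build_match_query("joe smith", ["title"], match_phrase=True)
multi = build_match_query("joe", ["title", "author"])
```

Each returned dictionary has the shape of an Elasticsearch query clause, which is the kind of DSL body the es_query entry refers to.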
Attributes
- columns: The column labels of the DataFrame.
- dtypes: Return the pandas dtypes in the DataFrame.
- empty: Determines if the DataFrame is empty.
- es_dtypes: Return the Elasticsearch dtypes in the index.
- index: Return eland index referencing Elasticsearch field to index a DataFrame/Series.
- ndim: Returns 2 by definition of a DataFrame.
- shape: Return a tuple representing the dimensionality of the DataFrame.
- size: Return an int representing the number of elements in this object.
- values: Not implemented.
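The dimension-related descriptions above (a dimensionality tuple, an element count, the fixed value 2, and an emptiness check) relate to one another in a simple way under pandas conventions. The sketch below assumes those conventions apply: `DimsSketch` and `dims_for` are hypothetical names for this example, and the emptiness rule (any axis of length zero) is the pandas definition, assumed here rather than confirmed for eland.

```python
from typing import NamedTuple, Tuple


class DimsSketch(NamedTuple):
    """Hypothetical container for the dimension attributes of a 2-D table."""

    shape: Tuple[int, int]
    size: int
    ndim: int
    empty: bool


def dims_for(n_rows: int, n_cols: int) -> DimsSketch:
    """Derive the dimension attributes from row and column counts."""
    return DimsSketch(
        shape=(n_rows, n_cols),   # (rows, columns)
        size=n_rows * n_cols,     # total number of cells
        ndim=2,                   # a DataFrame always has two axes
        empty=n_rows == 0 or n_cols == 0,  # assumed pandas-style rule
    )


# The 5-row, 28-column head() output from the first example on this page.
head_dims = dims_for(5, 28)
```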