Eland: DataFrames and Machine Learning backed by Elasticsearch¶
Date: Dec 16, 2021 Version: 8.0.0b1
Useful links: Source Repository | Issues & Ideas | Q&A Support
Eland is a Python Elasticsearch client for exploring and analyzing data in Elasticsearch with a familiar Pandas-compatible API.
Where possible, the package uses existing Python APIs and data structures to make it easy to switch from numpy, pandas, and scikit-learn to their Elasticsearch-powered equivalents. In general, the data resides in Elasticsearch and not in memory, which allows Eland to access large datasets stored in Elasticsearch.
Installing Eland¶
Eland can be installed from PyPI via pip:
$ python -m pip install eland
Eland can also be installed from Conda Forge with Conda:
$ conda install -c conda-forge eland
Getting Started¶
If it’s your first time using Eland, we recommend looking through the Examples documentation for ideas on what Eland is capable of.
If you’re new to Elasticsearch, we recommend reading the documentation.
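As a quick first taste, the session below is a minimal sketch; it assumes a local Elasticsearch cluster at http://localhost:9200 with the sample "flights" index loaded (the same setup used in the Examples):

import eland as ed

# Wrap the "flights" index in a DataFrame. The data stays in
# Elasticsearch; rows and aggregations are fetched on demand.
df = ed.DataFrame('http://localhost:9200', 'flights')

print(df.shape)                      # (number of documents, number of fields)
print(df['AvgTicketPrice'].mean())   # computed as an Elasticsearch aggregation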
API Reference¶
This page gives an overview of all public eland objects, functions and methods. All classes and functions exposed in the eland.* namespace are public.
Supported Pandas APIs¶
The table below lists each method or property in the first column; the second column flags whether Eland supports it on the corresponding object.
Note
Even if an interface is listed here as “supported”, that doesn’t mean all parameters are implemented; sometimes only a subset of parameters is supported.
If you need an operation that is listed as not implemented, feel free to open an issue or add a thumbs up to an existing one. Contributions are also welcome!
Note
Some Pandas methods are not implementable due to the constraints of the library using Elasticsearch as the backend for the majority of data processing. Functions like DataFrame.iloc[i] or DataFrame.transpose() likely won’t be implementable in Eland due to this constraint.
This list should be generated automatically with the utils/generate-supported-apis.py script instead of being modified manually.
Method or Property | Supported? |
---|---|
ed.DataFrame.abs() | No |
ed.DataFrame.add() | No |
ed.DataFrame.add_prefix() | No |
ed.DataFrame.add_suffix() | No |
ed.DataFrame.agg() | Yes |
ed.DataFrame.aggregate() | Yes |
ed.DataFrame.align() | No |
ed.DataFrame.all() | No |
ed.DataFrame.any() | No |
ed.DataFrame.append() | No |
ed.DataFrame.apply() | No |
ed.DataFrame.applymap() | No |
ed.DataFrame.asfreq() | No |
ed.DataFrame.asof() | No |
ed.DataFrame.assign() | No |
ed.DataFrame.astype() | No |
ed.DataFrame.at | No |
ed.DataFrame.at_time() | No |
ed.DataFrame.attrs | No |
ed.DataFrame.axes | No |
ed.DataFrame.backfill() | No |
ed.DataFrame.between_time() | No |
ed.DataFrame.bfill() | No |
ed.DataFrame.bool() | No |
ed.DataFrame.boxplot() | No |
ed.DataFrame.clip() | No |
ed.DataFrame.columns | Yes |
ed.DataFrame.combine() | No |
ed.DataFrame.combine_first() | No |
ed.DataFrame.compare() | No |
ed.DataFrame.convert_dtypes() | No |
ed.DataFrame.copy() | No |
ed.DataFrame.corr() | No |
ed.DataFrame.corrwith() | No |
ed.DataFrame.count() | Yes |
ed.DataFrame.cov() | No |
ed.DataFrame.cummax() | No |
ed.DataFrame.cummin() | No |
ed.DataFrame.cumprod() | No |
ed.DataFrame.cumsum() | No |
ed.DataFrame.describe() | Yes |
ed.DataFrame.diff() | No |
ed.DataFrame.div() | No |
ed.DataFrame.divide() | No |
ed.DataFrame.dot() | No |
ed.DataFrame.drop() | Yes |
ed.DataFrame.drop_duplicates() | No |
ed.DataFrame.droplevel() | No |
ed.DataFrame.dropna() | No |
ed.DataFrame.dtypes | Yes |
ed.DataFrame.duplicated() | No |
ed.DataFrame.empty | Yes |
ed.DataFrame.eq() | No |
ed.DataFrame.equals() | No |
ed.DataFrame.eval() | No |
ed.DataFrame.ewm() | No |
ed.DataFrame.expanding() | No |
ed.DataFrame.explode() | No |
ed.DataFrame.ffill() | No |
ed.DataFrame.fillna() | No |
ed.DataFrame.filter() | Yes |
ed.DataFrame.first() | No |
ed.DataFrame.first_valid_index() | No |
ed.DataFrame.flags | No |
ed.DataFrame.floordiv() | No |
ed.DataFrame.from_dict() | No |
ed.DataFrame.from_records() | No |
ed.DataFrame.ge() | No |
ed.DataFrame.get() | Yes |
ed.DataFrame.groupby() | Yes |
ed.DataFrame.gt() | No |
ed.DataFrame.head() | Yes |
ed.DataFrame.hist() | Yes |
ed.DataFrame.iat | No |
ed.DataFrame.idxmax() | Yes |
ed.DataFrame.idxmin() | Yes |
ed.DataFrame.iloc | No |
ed.DataFrame.index | Yes |
ed.DataFrame.infer_objects() | No |
ed.DataFrame.info() | Yes |
ed.DataFrame.insert() | No |
ed.DataFrame.interpolate() | No |
ed.DataFrame.isin() | No |
ed.DataFrame.isna() | No |
ed.DataFrame.isnull() | No |
ed.DataFrame.items() | No |
ed.DataFrame.iteritems() | No |
ed.DataFrame.iterrows() | Yes |
ed.DataFrame.itertuples() | Yes |
ed.DataFrame.join() | No |
ed.DataFrame.keys() | Yes |
ed.DataFrame.kurt() | No |
ed.DataFrame.kurtosis() | No |
ed.DataFrame.last() | No |
ed.DataFrame.last_valid_index() | No |
ed.DataFrame.le() | No |
ed.DataFrame.loc | No |
ed.DataFrame.lookup() | No |
ed.DataFrame.lt() | No |
ed.DataFrame.mad() | Yes |
ed.DataFrame.mask() | No |
ed.DataFrame.max() | Yes |
ed.DataFrame.mean() | Yes |
ed.DataFrame.median() | Yes |
ed.DataFrame.melt() | No |
ed.DataFrame.memory_usage() | No |
ed.DataFrame.merge() | No |
ed.DataFrame.min() | Yes |
ed.DataFrame.mod() | No |
ed.DataFrame.mode() | Yes |
ed.DataFrame.mul() | No |
ed.DataFrame.multiply() | No |
ed.DataFrame.ndim | Yes |
ed.DataFrame.ne() | No |
ed.DataFrame.nlargest() | No |
ed.DataFrame.notna() | No |
ed.DataFrame.notnull() | No |
ed.DataFrame.nsmallest() | No |
ed.DataFrame.nunique() | Yes |
ed.DataFrame.pad() | No |
ed.DataFrame.pct_change() | No |
ed.DataFrame.pipe() | No |
ed.DataFrame.pivot() | No |
ed.DataFrame.pivot_table() | No |
ed.DataFrame.pop() | No |
ed.DataFrame.pow() | No |
ed.DataFrame.prod() | No |
ed.DataFrame.product() | No |
ed.DataFrame.quantile() | Yes |
ed.DataFrame.query() | Yes |
ed.DataFrame.radd() | No |
ed.DataFrame.rank() | No |
ed.DataFrame.rdiv() | No |
ed.DataFrame.reindex() | No |
ed.DataFrame.reindex_like() | No |
ed.DataFrame.rename() | No |
ed.DataFrame.rename_axis() | No |
ed.DataFrame.reorder_levels() | No |
ed.DataFrame.replace() | No |
ed.DataFrame.resample() | No |
ed.DataFrame.reset_index() | No |
ed.DataFrame.rfloordiv() | No |
ed.DataFrame.rmod() | No |
ed.DataFrame.rmul() | No |
ed.DataFrame.rolling() | No |
ed.DataFrame.round() | No |
ed.DataFrame.rpow() | No |
ed.DataFrame.rsub() | No |
ed.DataFrame.rtruediv() | No |
ed.DataFrame.sample() | Yes |
ed.DataFrame.select_dtypes() | Yes |
ed.DataFrame.sem() | No |
ed.DataFrame.set_axis() | No |
ed.DataFrame.set_flags() | No |
ed.DataFrame.set_index() | No |
ed.DataFrame.shape | Yes |
ed.DataFrame.shift() | No |
ed.DataFrame.size | Yes |
ed.DataFrame.skew() | No |
ed.DataFrame.slice_shift() | No |
ed.DataFrame.sort_index() | No |
ed.DataFrame.sort_values() | No |
ed.DataFrame.squeeze() | No |
ed.DataFrame.stack() | No |
ed.DataFrame.std() | Yes |
ed.DataFrame.style | No |
ed.DataFrame.sub() | No |
ed.DataFrame.subtract() | No |
ed.DataFrame.sum() | Yes |
ed.DataFrame.swapaxes() | No |
ed.DataFrame.swaplevel() | No |
ed.DataFrame.T | No |
ed.DataFrame.tail() | Yes |
ed.DataFrame.take() | No |
ed.DataFrame.to_clipboard() | No |
ed.DataFrame.to_csv() | Yes |
ed.DataFrame.to_dict() | No |
ed.DataFrame.to_excel() | No |
ed.DataFrame.to_feather() | No |
ed.DataFrame.to_gbq() | No |
ed.DataFrame.to_hdf() | No |
ed.DataFrame.to_html() | Yes |
ed.DataFrame.to_json() | No |
ed.DataFrame.to_latex() | No |
ed.DataFrame.to_markdown() | No |
ed.DataFrame.to_numpy() | Yes |
ed.DataFrame.to_parquet() | No |
ed.DataFrame.to_period() | No |
ed.DataFrame.to_pickle() | No |
ed.DataFrame.to_records() | No |
ed.DataFrame.to_sql() | No |
ed.DataFrame.to_stata() | No |
ed.DataFrame.to_string() | Yes |
ed.DataFrame.to_timestamp() | No |
ed.DataFrame.to_xarray() | No |
ed.DataFrame.to_xml() | No |
ed.DataFrame.transform() | No |
ed.DataFrame.transpose() | No |
ed.DataFrame.truediv() | No |
ed.DataFrame.truncate() | No |
ed.DataFrame.tshift() | No |
ed.DataFrame.tz_convert() | No |
ed.DataFrame.tz_localize() | No |
ed.DataFrame.unstack() | No |
ed.DataFrame.update() | No |
ed.DataFrame.value_counts() | No |
ed.DataFrame.values | Yes |
ed.DataFrame.var() | Yes |
ed.DataFrame.where() | No |
ed.DataFrame.xs() | No |
ed.DataFrame.__abs__() | No |
ed.DataFrame.__add__() | No |
ed.DataFrame.__and__() | No |
ed.DataFrame.__annotations__ | No |
ed.DataFrame.__array__() | No |
ed.DataFrame.__array_priority__ | No |
ed.DataFrame.__array_ufunc__() | No |
ed.DataFrame.__array_wrap__() | No |
ed.DataFrame.__bool__() | No |
ed.DataFrame.__contains__() | No |
ed.DataFrame.__copy__() | No |
ed.DataFrame.__deepcopy__() | No |
ed.DataFrame.__delattr__ | Yes |
ed.DataFrame.__delitem__() | No |
ed.DataFrame.__dict__ | Yes |
ed.DataFrame.__dir__() | Yes |
ed.DataFrame.__divmod__() | No |
ed.DataFrame.__doc__ | Yes |
ed.DataFrame.__eq__() | Yes |
ed.DataFrame.__finalize__() | No |
ed.DataFrame.__floordiv__() | No |
ed.DataFrame.__format__ | Yes |
ed.DataFrame.__ge__() | Yes |
ed.DataFrame.__getattr__() | Yes |
ed.DataFrame.__getattribute__ | Yes |
ed.DataFrame.__getitem__() | Yes |
ed.DataFrame.__getstate__() | No |
ed.DataFrame.__gt__() | Yes |
ed.DataFrame.__hash__ | Yes |
ed.DataFrame.__iadd__() | No |
ed.DataFrame.__iand__() | No |
ed.DataFrame.__ifloordiv__() | No |
ed.DataFrame.__imod__() | No |
ed.DataFrame.__imul__() | No |
ed.DataFrame.__init__() | Yes |
ed.DataFrame.__init_subclass__ | Yes |
ed.DataFrame.__invert__() | No |
ed.DataFrame.__ior__() | No |
ed.DataFrame.__ipow__() | No |
ed.DataFrame.__isub__() | No |
ed.DataFrame.__iter__() | No |
ed.DataFrame.__itruediv__() | No |
ed.DataFrame.__ixor__() | No |
ed.DataFrame.__le__() | Yes |
ed.DataFrame.__len__() | Yes |
ed.DataFrame.__lt__() | Yes |
ed.DataFrame.__matmul__() | No |
ed.DataFrame.__mod__() | No |
ed.DataFrame.__module__ | Yes |
ed.DataFrame.__mul__() | No |
ed.DataFrame.__ne__() | Yes |
ed.DataFrame.__neg__() | No |
ed.DataFrame.__new__ | Yes |
ed.DataFrame.__nonzero__() | No |
ed.DataFrame.__or__() | No |
ed.DataFrame.__pos__() | No |
ed.DataFrame.__pow__() | No |
ed.DataFrame.__radd__() | No |
ed.DataFrame.__rand__() | No |
ed.DataFrame.__rdivmod__() | No |
ed.DataFrame.__reduce__ | Yes |
ed.DataFrame.__reduce_ex__ | Yes |
ed.DataFrame.__repr__() | Yes |
ed.DataFrame.__rfloordiv__() | No |
ed.DataFrame.__rmatmul__() | No |
ed.DataFrame.__rmod__() | No |
ed.DataFrame.__rmul__() | No |
ed.DataFrame.__ror__() | No |
ed.DataFrame.__round__() | No |
ed.DataFrame.__rpow__() | No |
ed.DataFrame.__rsub__() | No |
ed.DataFrame.__rtruediv__() | No |
ed.DataFrame.__rxor__() | No |
ed.DataFrame.__setattr__() | Yes |
ed.DataFrame.__setitem__() | No |
ed.DataFrame.__setstate__() | No |
ed.DataFrame.__sizeof__() | Yes |
ed.DataFrame.__str__ | Yes |
ed.DataFrame.__sub__() | No |
ed.DataFrame.__subclasshook__ | Yes |
ed.DataFrame.__truediv__() | No |
ed.DataFrame.__weakref__ | Yes |
ed.DataFrame.__xor__() | No |
ed.Series.abs() | No |
ed.Series.add() | Yes |
ed.Series.add_prefix() | No |
ed.Series.add_suffix() | No |
ed.Series.agg() | No |
ed.Series.aggregate() | No |
ed.Series.align() | No |
ed.Series.all() | No |
ed.Series.any() | No |
ed.Series.append() | No |
ed.Series.apply() | No |
ed.Series.argmax() | No |
ed.Series.argmin() | No |
ed.Series.argsort() | No |
ed.Series.array | No |
ed.Series.asfreq() | No |
ed.Series.asof() | No |
ed.Series.astype() | No |
ed.Series.at | No |
ed.Series.at_time() | No |
ed.Series.attrs | No |
ed.Series.autocorr() | No |
ed.Series.axes | No |
ed.Series.backfill() | No |
ed.Series.between() | No |
ed.Series.between_time() | No |
ed.Series.bfill() | No |
ed.Series.bool() | No |
ed.Series.clip() | No |
ed.Series.combine() | No |
ed.Series.combine_first() | No |
ed.Series.compare() | No |
ed.Series.convert_dtypes() | No |
ed.Series.copy() | No |
ed.Series.corr() | No |
ed.Series.count() | No |
ed.Series.cov() | No |
ed.Series.cummax() | No |
ed.Series.cummin() | No |
ed.Series.cumprod() | No |
ed.Series.cumsum() | No |
ed.Series.describe() | Yes |
ed.Series.diff() | No |
ed.Series.div() | Yes |
ed.Series.divide() | Yes |
ed.Series.divmod() | No |
ed.Series.dot() | No |
ed.Series.drop() | No |
ed.Series.drop_duplicates() | No |
ed.Series.droplevel() | No |
ed.Series.dropna() | No |
ed.Series.dtype | Yes |
ed.Series.dtypes | Yes |
ed.Series.duplicated() | No |
ed.Series.empty | Yes |
ed.Series.eq() | No |
ed.Series.equals() | No |
ed.Series.ewm() | No |
ed.Series.expanding() | No |
ed.Series.explode() | No |
ed.Series.factorize() | No |
ed.Series.ffill() | No |
ed.Series.fillna() | No |
ed.Series.filter() | Yes |
ed.Series.first() | No |
ed.Series.first_valid_index() | No |
ed.Series.flags | No |
ed.Series.floordiv() | Yes |
ed.Series.ge() | No |
ed.Series.get() | No |
ed.Series.groupby() | No |
ed.Series.gt() | No |
ed.Series.hasnans | No |
ed.Series.head() | Yes |
ed.Series.hist() | Yes |
ed.Series.iat | No |
ed.Series.idxmax() | No |
ed.Series.idxmin() | No |
ed.Series.iloc | No |
ed.Series.index | Yes |
ed.Series.infer_objects() | No |
ed.Series.interpolate() | No |
ed.Series.is_monotonic | No |
ed.Series.is_monotonic_decreasing | No |
ed.Series.is_monotonic_increasing | No |
ed.Series.is_unique | No |
ed.Series.isin() | Yes |
ed.Series.isna() | Yes |
ed.Series.isnull() | Yes |
ed.Series.item() | No |
ed.Series.items() | No |
ed.Series.iteritems() | No |
ed.Series.keys() | No |
ed.Series.kurt() | No |
ed.Series.kurtosis() | No |
ed.Series.last() | No |
ed.Series.last_valid_index() | No |
ed.Series.le() | No |
ed.Series.loc | No |
ed.Series.lt() | No |
ed.Series.mad() | Yes |
ed.Series.map() | No |
ed.Series.mask() | No |
ed.Series.max() | Yes |
ed.Series.mean() | Yes |
ed.Series.median() | Yes |
ed.Series.memory_usage() | No |
ed.Series.min() | Yes |
ed.Series.mod() | Yes |
ed.Series.mode() | Yes |
ed.Series.mul() | Yes |
ed.Series.multiply() | Yes |
ed.Series.name | Yes |
ed.Series.nbytes | No |
ed.Series.ndim | Yes |
ed.Series.ne() | No |
ed.Series.nlargest() | No |
ed.Series.notna() | Yes |
ed.Series.notnull() | Yes |
ed.Series.nsmallest() | No |
ed.Series.nunique() | Yes |
ed.Series.pad() | No |
ed.Series.pct_change() | No |
ed.Series.pipe() | No |
ed.Series.pop() | No |
ed.Series.pow() | Yes |
ed.Series.prod() | No |
ed.Series.product() | No |
ed.Series.quantile() | Yes |
ed.Series.radd() | Yes |
ed.Series.rank() | No |
ed.Series.ravel() | No |
ed.Series.rdiv() | Yes |
ed.Series.rdivmod() | No |
ed.Series.reindex() | No |
ed.Series.reindex_like() | No |
ed.Series.rename() | Yes |
ed.Series.rename_axis() | No |
ed.Series.reorder_levels() | No |
ed.Series.repeat() | No |
ed.Series.replace() | No |
ed.Series.resample() | No |
ed.Series.reset_index() | No |
ed.Series.rfloordiv() | Yes |
ed.Series.rmod() | Yes |
ed.Series.rmul() | Yes |
ed.Series.rolling() | No |
ed.Series.round() | No |
ed.Series.rpow() | Yes |
ed.Series.rsub() | Yes |
ed.Series.rtruediv() | Yes |
ed.Series.sample() | Yes |
ed.Series.searchsorted() | No |
ed.Series.sem() | No |
ed.Series.set_axis() | No |
ed.Series.set_flags() | No |
ed.Series.shape | Yes |
ed.Series.shift() | No |
ed.Series.size | Yes |
ed.Series.skew() | No |
ed.Series.slice_shift() | No |
ed.Series.sort_index() | No |
ed.Series.sort_values() | No |
ed.Series.squeeze() | No |
ed.Series.std() | Yes |
ed.Series.sub() | Yes |
ed.Series.subtract() | Yes |
ed.Series.sum() | Yes |
ed.Series.swapaxes() | No |
ed.Series.swaplevel() | No |
ed.Series.T | No |
ed.Series.tail() | Yes |
ed.Series.take() | No |
ed.Series.to_clipboard() | No |
ed.Series.to_csv() | No |
ed.Series.to_dict() | No |
ed.Series.to_excel() | No |
ed.Series.to_frame() | No |
ed.Series.to_hdf() | No |
ed.Series.to_json() | No |
ed.Series.to_latex() | No |
ed.Series.to_list() | No |
ed.Series.to_markdown() | No |
ed.Series.to_numpy() | Yes |
ed.Series.to_period() | No |
ed.Series.to_pickle() | No |
ed.Series.to_sql() | No |
ed.Series.to_string() | Yes |
ed.Series.to_timestamp() | No |
ed.Series.to_xarray() | No |
ed.Series.tolist() | No |
ed.Series.transform() | No |
ed.Series.transpose() | No |
ed.Series.truediv() | Yes |
ed.Series.truncate() | No |
ed.Series.tshift() | No |
ed.Series.tz_convert() | No |
ed.Series.tz_localize() | No |
ed.Series.unique() | No |
ed.Series.unstack() | No |
ed.Series.update() | No |
ed.Series.value_counts() | Yes |
ed.Series.values | No |
ed.Series.var() | Yes |
ed.Series.view() | No |
ed.Series.where() | No |
ed.Series.xs() | No |
ed.Series.__abs__() | No |
ed.Series.__add__() | Yes |
ed.Series.__and__() | No |
ed.Series.__annotations__ | No |
ed.Series.__array__() | No |
ed.Series.__array_priority__ | No |
ed.Series.__array_ufunc__() | No |
ed.Series.__array_wrap__() | No |
ed.Series.__bool__() | No |
ed.Series.__contains__() | No |
ed.Series.__copy__() | No |
ed.Series.__deepcopy__() | No |
ed.Series.__delattr__ | Yes |
ed.Series.__delitem__() | No |
ed.Series.__dict__ | Yes |
ed.Series.__dir__() | Yes |
ed.Series.__divmod__() | No |
ed.Series.__doc__ | Yes |
ed.Series.__eq__() | Yes |
ed.Series.__finalize__() | No |
ed.Series.__float__() | No |
ed.Series.__floordiv__() | Yes |
ed.Series.__format__ | Yes |
ed.Series.__ge__() | Yes |
ed.Series.__getattr__() | No |
ed.Series.__getattribute__ | Yes |
ed.Series.__getitem__() | No |
ed.Series.__getstate__() | No |
ed.Series.__gt__() | Yes |
ed.Series.__hash__ | Yes |
ed.Series.__iadd__() | No |
ed.Series.__iand__() | No |
ed.Series.__ifloordiv__() | No |
ed.Series.__imod__() | No |
ed.Series.__imul__() | No |
ed.Series.__init__() | Yes |
ed.Series.__init_subclass__ | Yes |
ed.Series.__int__() | No |
ed.Series.__invert__() | No |
ed.Series.__ior__() | No |
ed.Series.__ipow__() | No |
ed.Series.__isub__() | No |
ed.Series.__iter__() | No |
ed.Series.__itruediv__() | No |
ed.Series.__ixor__() | No |
ed.Series.__le__() | Yes |
ed.Series.__len__() | Yes |
ed.Series.__long__() | No |
ed.Series.__lt__() | Yes |
ed.Series.__matmul__() | No |
ed.Series.__mod__() | Yes |
ed.Series.__module__ | Yes |
ed.Series.__mul__() | Yes |
ed.Series.__ne__() | Yes |
ed.Series.__neg__() | No |
ed.Series.__new__ | Yes |
ed.Series.__nonzero__() | No |
ed.Series.__or__() | No |
ed.Series.__pos__() | No |
ed.Series.__pow__() | Yes |
ed.Series.__radd__() | Yes |
ed.Series.__rand__() | No |
ed.Series.__rdivmod__() | No |
ed.Series.__reduce__ | Yes |
ed.Series.__reduce_ex__ | Yes |
ed.Series.__repr__() | Yes |
ed.Series.__rfloordiv__() | Yes |
ed.Series.__rmatmul__() | No |
ed.Series.__rmod__() | Yes |
ed.Series.__rmul__() | Yes |
ed.Series.__ror__() | No |
ed.Series.__round__() | No |
ed.Series.__rpow__() | Yes |
ed.Series.__rsub__() | Yes |
ed.Series.__rtruediv__() | Yes |
ed.Series.__rxor__() | No |
ed.Series.__setattr__() | Yes |
ed.Series.__setitem__() | No |
ed.Series.__setstate__() | No |
ed.Series.__sizeof__() | Yes |
ed.Series.__str__ | Yes |
ed.Series.__sub__() | Yes |
ed.Series.__subclasshook__ | Yes |
ed.Series.__truediv__() | Yes |
ed.Series.__weakref__ | Yes |
ed.Series.__xor__() | No |
DataFrame¶
Constructor¶
DataFrame (es_client, List[str], Tuple[str, …) | Two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns) referencing data stored in Elasticsearch indices. |
Attributes and Underlying Data¶
DataFrame.index | Return eland index referencing Elasticsearch field to index a DataFrame/Series. |
DataFrame.columns | The column labels of the DataFrame. |
DataFrame.dtypes | Return the pandas dtypes in the DataFrame. |
DataFrame.select_dtypes ([include, exclude]) | Return a subset of the DataFrame’s columns based on the column dtypes. |
DataFrame.values | Not implemented. |
DataFrame.empty | Determines if the DataFrame is empty. |
DataFrame.shape | Return a tuple representing the dimensionality of the DataFrame. |
DataFrame.ndim | Returns 2 by definition of a DataFrame. |
DataFrame.size | Return an int representing the number of elements in this object. |
Indexing, Iteration¶
DataFrame.head (n) | Return the first n rows. |
DataFrame.keys () | Return columns. |
DataFrame.tail (n) | Return the last n rows. |
DataFrame.get (key, default) | Get item from object for given key (ex: DataFrame column). |
DataFrame.query (expr) | Query the columns of a DataFrame with a boolean expression. |
DataFrame.sample (n, frac, random_state) | Return n randomly sampled rows or the specified fraction of rows. |
DataFrame.iterrows () | Iterate over eland.DataFrame rows as (index, pandas.Series) pairs. |
DataFrame.itertuples (index, name) | Iterate over eland.DataFrame rows as namedtuples. |
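As a sketch of the iteration helpers, assuming the sample flights index from the Examples (iterrows() and itertuples() fetch documents in batches, so they also work for indices larger than local memory):

import eland as ed

ed_df = ed.DataFrame('http://localhost:9200', 'flights')

# (index, pandas.Series) pairs for the first three rows.
for index, row in ed_df.head(3).iterrows():
    print(index, row['Carrier'])

# Namedtuples; the tuple class name is user-chosen.
for tup in ed_df.head(3).itertuples(name='Flight'):
    print(tup)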
Function Application, GroupBy & Window¶
Note
Elasticsearch aggregations using cardinality (count) are accurate approximations using the HyperLogLog++ algorithm, so may not be exact.
DataFrame.agg (func, List[str]], axis, …) | Aggregate using one or more operations over the specified axis. |
DataFrame.aggregate (func, List[str]], axis, …) | Aggregate using one or more operations over the specified axis. |
DataFrame.groupby (by, List[str], …) | Used to perform groupby operations. |
DataFrameGroupBy (by, query_compiler, dropna) | This holds all the groupby methods for eland.DataFrame.groupby(). |
DataFrameGroupBy.agg (func, List[str]], …) | Used to groupby and aggregate. |
DataFrameGroupBy.aggregate (func, List[str]], …) | Used to groupby and aggregate. |
DataFrameGroupBy.count () | Compute the count value for each group. |
DataFrameGroupBy.mad (numeric_only) | Compute the median absolute deviation value for each group. |
DataFrameGroupBy.max (numeric_only) | Compute the max value for each group. |
DataFrameGroupBy.mean (numeric_only) | Compute the mean value for each group. |
DataFrameGroupBy.median (numeric_only) | Compute the median value for each group. |
DataFrameGroupBy.min (numeric_only) | Compute the min value for each group. |
DataFrameGroupBy.nunique () | Compute the nunique value for each group. |
DataFrameGroupBy.std (numeric_only) | Compute the standard deviation value for each group. |
DataFrameGroupBy.sum (numeric_only) | Compute the sum value for each group. |
DataFrameGroupBy.var (numeric_only) | Compute the variance value for each group. |
DataFrameGroupBy.quantile (q, float, …) | Used to groupby and calculate quantile for a given DataFrame. |
GroupBy (by, query_compiler, dropna) | Base class for calls to eland.DataFrame.groupby(). |
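For example, a groupby-aggregation over the sample flights index might be sketched as follows (Eland translates the groupby into Elasticsearch composite aggregations, so only the aggregated result is transferred):

import eland as ed

ed_df = ed.DataFrame('http://localhost:9200', 'flights')

# One aggregated row per carrier; only numeric columns are aggregated.
ed_df.groupby('Carrier').mean(numeric_only=True)

# Several aggregations at once via agg()/aggregate().
ed_df.groupby('Carrier').agg(['min', 'max'])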
Computations / Descriptive Stats¶
DataFrame.count () | Count non-NA cells for each column. |
DataFrame.describe () | Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values. |
DataFrame.info (verbose, buf, max_cols, …) | Print a concise summary of a DataFrame. |
DataFrame.max (numeric_only) | Return the maximum value for each numeric column. |
DataFrame.mean (numeric_only) | Return the mean value for each numeric column. |
DataFrame.min (numeric_only) | Return the minimum value for each numeric column. |
DataFrame.median (numeric_only) | Return the median value for each numeric column. |
DataFrame.mad (numeric_only) | Return the median absolute deviation for each numeric column. |
DataFrame.std (numeric_only) | Return the standard deviation for each numeric column. |
DataFrame.var (numeric_only) | Return the variance for each numeric column. |
DataFrame.sum (numeric_only) | Return the sum for each numeric column. |
DataFrame.nunique () | Return the cardinality of each field. |
DataFrame.mode (numeric_only, dropna, es_size) | Calculate the mode of a DataFrame. |
DataFrame.quantile (q, float, List[int], …) | Used to calculate quantiles for a given DataFrame. |
DataFrame.idxmax (axis) | Return index of first occurrence of maximum over requested axis. |
DataFrame.idxmin (axis) | Return index of first occurrence of minimum over requested axis. |
Reindexing / Selection / Label Manipulation¶
DataFrame.drop ([labels, axis, index, …]) | Return new object with labels in requested axis removed. |
DataFrame.filter (items, like, regex, axis, …) | Subset the DataFrame rows or columns according to the specified index labels. |
Plotting¶
DataFrame.hist ([column, by, grid, …]) | Make a histogram of the DataFrame’s columns. |
Elasticsearch Functions¶
DataFrame.es_info () | A debug summary of an eland DataFrame’s internals. |
DataFrame.es_match (text, *, columns, …) | Filters data with an Elasticsearch match, match_phrase, or multi_match query depending on the given parameters and columns. |
DataFrame.es_query (query) | Applies an Elasticsearch DSL query to the current DataFrame. |
DataFrame.es_dtypes | Return the Elasticsearch dtypes in the index. |
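A short sketch of these Elasticsearch-specific entry points (the query body below is ordinary Elasticsearch query DSL, and the column names come from the sample flights index):

import eland as ed

ed_df = ed.DataFrame('http://localhost:9200', 'flights')

# Full-text search restricted to the given columns
# (match / multi_match depending on how many columns are passed).
ed_df.es_match('Logstash Airways', columns=['Carrier'])

# Apply a raw Elasticsearch DSL query as a filter on the DataFrame.
ed_df.es_query({'bool': {'filter': {'term': {'Cancelled': True}}}})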
Serialization / IO / Conversion¶
DataFrame.info (verbose, buf, max_cols, …) | Print a concise summary of a DataFrame. |
DataFrame.to_numpy () | Not implemented. |
DataFrame.to_csv ([path_or_buf, sep, na_rep, …]) | Write Elasticsearch data to a comma-separated values (csv) file. |
DataFrame.to_html ([buf, columns, col_space, …]) | Render Elasticsearch data as an HTML table. |
DataFrame.to_string ([buf, columns, …]) | Render a DataFrame to a console-friendly tabular output. |
DataFrame.to_pandas (show_progress) | Utility method to convert an eland.DataFrame to a pandas.DataFrame. |
Series¶
Constructor¶
Series (es_client, es_index_pattern, name, …) | pandas.Series like API that proxies into Elasticsearch index(es). |
Attributes and Underlying Data¶
Series.index | Return eland index referencing Elasticsearch field to index a DataFrame/Series. |
Series.dtype | Return the dtype object of the underlying data. |
Series.dtypes | Return the pandas dtypes in the DataFrame. |
Series.shape | Return a tuple representing the dimensionality of the Series. |
Series.name | |
Series.empty | Determines if the Series is empty. |
Series.ndim | Returns 1 by definition of a Series. |
Series.size | Return an int representing the number of elements in this object. |
Indexing, Iteration¶
Series.head (n) | Return the first n rows. |
Series.tail (n) | Return the last n rows. |
Series.sample (n, frac, random_state) | Return n randomly sampled rows or the specified fraction of rows. |
Binary Operator Functions¶
Series.add (right) | Return addition of series and right, element-wise (binary operator add). |
Series.sub (right) | Return subtraction of series and right, element-wise (binary operator sub). |
Series.subtract (right) | Return subtraction of series and right, element-wise (binary operator sub). |
Series.mul (right) | Return multiplication of series and right, element-wise (binary operator mul). |
Series.multiply (right) | Return multiplication of series and right, element-wise (binary operator mul). |
Series.div (right) | Return floating division of series and right, element-wise (binary operator truediv). |
Series.divide (right) | Return floating division of series and right, element-wise (binary operator truediv). |
Series.truediv (right) | Return floating division of series and right, element-wise (binary operator truediv). |
Series.floordiv (right) | Return integer division of series and right, element-wise (binary operator floordiv //). |
Series.mod (right) | Return modulo of series and right, element-wise (binary operator mod %). |
Series.pow (right) | Return exponential power of series and right, element-wise (binary operator pow). |
Series.radd (left) | Return addition of series and left, element-wise (binary operator add). |
Series.rsub (left) | Return subtraction of series and left, element-wise (binary operator sub). |
Series.rsubtract (left) | Return subtraction of series and left, element-wise (binary operator sub). |
Series.rmul (left) | Return multiplication of series and left, element-wise (binary operator mul). |
Series.rmultiply (left) | Return multiplication of series and left, element-wise (binary operator mul). |
Series.rdiv (left) | Return division of series and left, element-wise (binary operator div). |
Series.rdivide (left) | Return division of series and left, element-wise (binary operator div). |
Series.rtruediv (left) | Return division of series and left, element-wise (binary operator div). |
Series.rfloordiv (left) | Return integer division of series and left, element-wise (binary operator floordiv //). |
Series.rmod (left) | Return modulo of series and left, element-wise (binary operator mod %). |
Series.rpow (left) | Return exponential power of series and left, element-wise (binary operator pow). |
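For instance (a sketch against the sample flights index; the arithmetic is pushed down to Elasticsearch rather than evaluated in local memory):

import eland as ed

ed_df = ed.DataFrame('http://localhost:9200', 'flights')
price = ed_df['AvgTicketPrice']

price + 100.0                        # Series.add with a scalar
price - ed_df['FlightDelayMin']      # Series.sub with another Series
2.0 * price                          # right-hand variant (Series.rmul)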
Computations / Descriptive Stats¶
Series.describe () | Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values. |
Series.max (numeric_only) | Return the maximum of the Series values. |
Series.mean (numeric_only) | Return the mean of the Series values. |
Series.min (numeric_only) | Return the minimum of the Series values. |
Series.sum (numeric_only) | Return the sum of the Series values. |
Series.median (numeric_only) | Return the median of the Series values. |
Series.mad (numeric_only) | Return the median absolute deviation for a Series. |
Series.std (numeric_only) | Return the standard deviation for a Series. |
Series.var (numeric_only) | Return the variance for a Series. |
Series.nunique () | Return the number of unique values in a Series. |
Series.value_counts (es_size) | Return the value counts for the specified field. |
Series.mode (es_size) | Calculate the mode of a Series. |
Series.quantile (q, float, List[int], …) | Used to calculate quantiles for a given Series. |
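For instance (a sketch; es_size is an Eland-specific parameter bounding how many term buckets Elasticsearch returns):

import eland as ed

ed_df = ed.DataFrame('http://localhost:9200', 'flights')

# Backed by a terms aggregation on the keyword field 'Carrier'.
ed_df['Carrier'].value_counts(es_size=10)

# Quartiles computed by an approximate percentiles aggregation.
ed_df['AvgTicketPrice'].quantile([0.25, 0.5, 0.75])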
Reindexing / Selection / Label Manipulation¶
Series.rename (new_name) | Rename the name of the Series. |
Series.isna () | Detect missing values. |
Series.notna () | Detect existing (non-missing) values. |
Series.isnull () | Detect missing values. |
Series.notnull () | Detect existing (non-missing) values. |
Series.isin (other, pandas.core.series.Series]) | Check whether values are contained in the Series. |
Series.filter (items, like, regex, axis, str, …) | Subset the Series rows or columns according to the specified index labels. |
Plotting¶
Series.hist ([by, ax, grid, xlabelsize, …]) | Draw histogram of the input series using matplotlib. |
Serialization / IO / Conversion¶
Series.to_string ([buf, na_rep, …]) | Render a string representation of the Series. |
Series.to_numpy () | Not implemented. |
Series.to_pandas (show_progress) | Utility method to convert an eland.Series to a pandas.Series. |
Elasticsearch Functions¶
Series.es_info () | A debug summary of an eland Series’ internals. |
Series.es_match (text, *, match_phrase, …) | Filters data with an Elasticsearch match or match_phrase query depending on the given parameters. |
Series.es_dtype | Return the Elasticsearch type of the underlying data. |
Series.es_dtypes | Return the Elasticsearch dtypes in the index. |
Machine Learning¶
Machine learning is built into the Elastic Stack and enables users to gain insights into their Elasticsearch data. There is a wide range of capabilities, from identifying anomalies in your data to training and deploying regression or classification models based on Elasticsearch data.
To use the Elastic Stack machine learning features, you must have the appropriate license and at least one machine learning node in your Elasticsearch cluster. If Elastic Stack security features are enabled, you must also ensure your users have the necessary privileges.
The fastest way to get started with machine learning features is to start a free 14-day trial of Elastic Cloud.
See the Elasticsearch Machine Learning documentation for more details.
MLModel¶
Constructor¶
MLModel (es_client, List[str], Tuple[str, …) | A machine learning model managed by Elasticsearch. |
Predictions¶
MLModel.predict (X, List[float], …) | Make a prediction using a trained model stored in Elasticsearch. |
Manage Models¶
MLModel.import_model (es_client, List[str], …) | Transform and serialize a trained 3rd party model into Elasticsearch. |
MLModel.exists_model () | Check if the model already exists in Elasticsearch. |
MLModel.delete_model () | Delete an inference model saved in Elasticsearch. |
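A minimal end-to-end sketch: train a scikit-learn model locally, serialize it into Elasticsearch, and predict server-side. It assumes a cluster with a machine learning node; the model ID, feature names, and toy data are illustrative only:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from eland.ml import MLModel

# Toy training data: two numeric features (hypothetical).
X = np.random.rand(100, 2)
y = (X[:, 0] > 0.5).astype(int)
model = DecisionTreeClassifier().fit(X, y)

# Serialize the trained model into Elasticsearch under an arbitrary ID.
es_model = MLModel.import_model(
    es_client='http://localhost:9200',
    model_id='demo-decision-tree',   # hypothetical model ID
    model=model,
    feature_names=['f0', 'f1'],
    es_if_exists='replace',
)

print(es_model.predict(X[:5]))   # inference runs inside Elasticsearch
print(es_model.exists_model())   # True
es_model.delete_model()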
Index¶
Many of these methods, or variants thereof, are available on the objects that contain an index (Series/DataFrame), and those should most likely be used before calling these methods directly.
General Utility Functions¶
Pandas and Eland¶
pandas_to_eland (pd_df, es_client, List[str], …) | Append a pandas DataFrame to an Elasticsearch index. |
eland_to_pandas (ed_df, show_progress) | Convert an eland.DataFrame to a pandas.DataFrame. |
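A round-trip sketch (the destination index name 'demo-index' is arbitrary):

import pandas as pd
import eland as ed

pd_df = pd.DataFrame({'name': ['a', 'b'], 'value': [1.0, 2.0]})

# Index the pandas DataFrame into Elasticsearch and get back an
# eland.DataFrame referencing the new index.
ed_df = ed.pandas_to_eland(
    pd_df,
    es_client='http://localhost:9200',
    es_dest_index='demo-index',
    es_if_exists='replace',
    es_refresh=True,   # make the documents searchable immediately
)

# Pull the data back into local memory as a pandas DataFrame.
round_tripped = ed.eland_to_pandas(ed_df)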
Input/Output¶
Flat File¶
csv_to_eland (filepath_or_buffer, es_client, …) | Read a comma-separated values (csv) file into an eland.DataFrame. |
Examples¶
Eland Demo Notebook¶
[1]:
import eland as ed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from elasticsearch import Elasticsearch
# Import standard test settings for consistent results
from eland.conftest import *
Compare Eland DataFrame vs pandas DataFrame¶
Create an eland.DataFrame from a flights index
[2]:
ed_flights = ed.DataFrame('http://localhost:9200', 'flights')
[3]:
type(ed_flights)
[3]:
eland.dataframe.DataFrame
Compare to pandas DataFrame (created from the same data)
[4]:
pd_flights = ed.eland_to_pandas(ed_flights)
[5]:
type(pd_flights)
[5]:
pandas.core.frame.DataFrame
Attributes and underlying data¶
DataFrame.columns¶
[6]:
pd_flights.columns
[6]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
'timestamp'],
dtype='object')
[7]:
ed_flights.columns
[7]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
'timestamp'],
dtype='object')
DataFrame.dtypes¶
[8]:
pd_flights.dtypes
[8]:
AvgTicketPrice float64
Cancelled bool
Carrier object
Dest object
DestAirportID object
...
OriginLocation object
OriginRegion object
OriginWeather object
dayOfWeek int64
timestamp datetime64[ns]
Length: 27, dtype: object
[9]:
ed_flights.dtypes
[9]:
AvgTicketPrice float64
Cancelled bool
Carrier object
Dest object
DestAirportID object
...
OriginLocation object
OriginRegion object
OriginWeather object
dayOfWeek int64
timestamp datetime64[ns]
Length: 27, dtype: object
DataFrame.select_dtypes¶
[10]:
pd_flights.select_dtypes(include=np.number)
[10]:
AvgTicketPrice | DistanceKilometers | ... | FlightTimeMin | dayOfWeek | |
---|---|---|---|---|---|
0 | 841.265642 | 16492.326654 | ... | 1030.770416 | 0 |
1 | 882.982662 | 8823.400140 | ... | 464.389481 | 0 |
2 | 190.636904 | 0.000000 | ... | 0.000000 | 0 |
3 | 181.694216 | 555.737767 | ... | 222.749059 | 0 |
4 | 730.041778 | 13358.244200 | ... | 785.779071 | 0 |
... | ... | ... | ... | ... | ... |
13054 | 1080.446279 | 8058.581753 | ... | 402.929088 | 6 |
13055 | 646.612941 | 7088.598322 | ... | 644.418029 | 6 |
13056 | 997.751876 | 10920.652972 | ... | 937.540811 | 6 |
13057 | 1102.814465 | 18748.859647 | ... | 1697.404971 | 6 |
13058 | 858.144337 | 16809.141923 | ... | 1610.761827 | 6 |
13059 rows × 7 columns
[11]:
ed_flights.select_dtypes(include=np.number)
[11]:
AvgTicketPrice | DistanceKilometers | ... | FlightTimeMin | dayOfWeek | |
---|---|---|---|---|---|
0 | 841.265642 | 16492.326654 | ... | 1030.770416 | 0 |
1 | 882.982662 | 8823.400140 | ... | 464.389481 | 0 |
2 | 190.636904 | 0.000000 | ... | 0.000000 | 0 |
3 | 181.694216 | 555.737767 | ... | 222.749059 | 0 |
4 | 730.041778 | 13358.244200 | ... | 785.779071 | 0 |
... | ... | ... | ... | ... | ... |
13054 | 1080.446279 | 8058.581753 | ... | 402.929088 | 6 |
13055 | 646.612941 | 7088.598322 | ... | 644.418029 | 6 |
13056 | 997.751876 | 10920.652972 | ... | 937.540811 | 6 |
13057 | 1102.814465 | 18748.859647 | ... | 1697.404971 | 6 |
13058 | 858.144337 | 16809.141923 | ... | 1610.761827 | 6 |
13059 rows × 7 columns
DataFrame.index¶
Note, eland.DataFrame.index does not mirror pandas.DataFrame.index.
[16]:
pd_flights.index
[16]:
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
...
'13049', '13050', '13051', '13052', '13053', '13054', '13055', '13056', '13057', '13058'],
dtype='object', length=13059)
[17]:
# NBVAL_IGNORE_OUTPUT
ed_flights.index
[17]:
<eland.index.Index at 0x7ff1a25debb0>
[18]:
ed_flights.index.es_index_field
[18]:
'_id'
DataFrame.values¶
Note, eland.DataFrame.values is not supported.
[19]:
pd_flights.values
[19]:
array([[841.2656419677076, False, 'Kibana Airlines', ..., 'Sunny', 0,
Timestamp('2018-01-01 00:00:00')],
[882.9826615595518, False, 'Logstash Airways', ..., 'Clear', 0,
Timestamp('2018-01-01 18:27:00')],
[190.6369038508356, False, 'Logstash Airways', ..., 'Rain', 0,
Timestamp('2018-01-01 17:11:14')],
...,
[997.7518761454494, False, 'Logstash Airways', ..., 'Sunny', 6,
Timestamp('2018-02-11 04:09:27')],
[1102.8144645388556, False, 'JetBeats', ..., 'Hail', 6,
Timestamp('2018-02-11 08:28:21')],
[858.1443369038839, False, 'JetBeats', ..., 'Rain', 6,
Timestamp('2018-02-11 14:54:34')]], dtype=object)
[20]:
try:
    ed_flights.values
except AttributeError as e:
    print(e)
This method would scan/scroll the entire Elasticsearch index(s) into memory. If this is explicitly required, and there is sufficient memory, call `ed.eland_to_pandas(ed_df).values`
Indexing, iteration¶
DataFrame.head¶
[21]:
pd_flights.head()
[21]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
0 | 841.265642 | False | ... | 0 | 2018-01-01 00:00:00 |
1 | 882.982662 | False | ... | 0 | 2018-01-01 18:27:00 |
2 | 190.636904 | False | ... | 0 | 2018-01-01 17:11:14 |
3 | 181.694216 | True | ... | 0 | 2018-01-01 10:33:28 |
4 | 730.041778 | False | ... | 0 | 2018-01-01 05:13:00 |
5 rows × 27 columns
[22]:
ed_flights.head()
[22]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
0 | 841.265642 | False | ... | 0 | 2018-01-01 00:00:00 |
1 | 882.982662 | False | ... | 0 | 2018-01-01 18:27:00 |
2 | 190.636904 | False | ... | 0 | 2018-01-01 17:11:14 |
3 | 181.694216 | True | ... | 0 | 2018-01-01 10:33:28 |
4 | 730.041778 | False | ... | 0 | 2018-01-01 05:13:00 |
5 rows × 27 columns
DataFrame.tail¶
[23]:
pd_flights.tail()
[23]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
13054 | 1080.446279 | False | ... | 6 | 2018-02-11 20:42:25 |
13055 | 646.612941 | False | ... | 6 | 2018-02-11 01:41:57 |
13056 | 997.751876 | False | ... | 6 | 2018-02-11 04:09:27 |
13057 | 1102.814465 | False | ... | 6 | 2018-02-11 08:28:21 |
13058 | 858.144337 | False | ... | 6 | 2018-02-11 14:54:34 |
5 rows × 27 columns
[24]:
ed_flights.tail()
[24]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
13054 | 1080.446279 | False | ... | 6 | 2018-02-11 20:42:25 |
13055 | 646.612941 | False | ... | 6 | 2018-02-11 01:41:57 |
13056 | 997.751876 | False | ... | 6 | 2018-02-11 04:09:27 |
13057 | 1102.814465 | False | ... | 6 | 2018-02-11 08:28:21 |
13058 | 858.144337 | False | ... | 6 | 2018-02-11 14:54:34 |
5 rows × 27 columns
DataFrame.keys¶
[25]:
pd_flights.keys()
[25]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
'timestamp'],
dtype='object')
[26]:
ed_flights.keys()
[26]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
'timestamp'],
dtype='object')
DataFrame.get¶
[27]:
pd_flights.get('Carrier')
[27]:
0 Kibana Airlines
1 Logstash Airways
2 Logstash Airways
3 Kibana Airlines
4 Kibana Airlines
...
13054 Logstash Airways
13055 Logstash Airways
13056 Logstash Airways
13057 JetBeats
13058 JetBeats
Name: Carrier, Length: 13059, dtype: object
[28]:
ed_flights.get('Carrier')
[28]:
0 Kibana Airlines
1 Logstash Airways
2 Logstash Airways
3 Kibana Airlines
4 Kibana Airlines
...
13054 Logstash Airways
13055 Logstash Airways
13056 Logstash Airways
13057 JetBeats
13058 JetBeats
Name: Carrier, Length: 13059, dtype: object
[29]:
pd_flights.get(['Carrier', 'Origin'])
[29]:
Carrier | Origin | |
---|---|---|
0 | Kibana Airlines | Frankfurt am Main Airport |
1 | Logstash Airways | Cape Town International Airport |
2 | Logstash Airways | Venice Marco Polo Airport |
3 | Kibana Airlines | Naples International Airport |
4 | Kibana Airlines | Licenciado Benito Juarez International Airport |
... | ... | ... |
13054 | Logstash Airways | Pisa International Airport |
13055 | Logstash Airways | Winnipeg / James Armstrong Richardson Internat... |
13056 | Logstash Airways | Licenciado Benito Juarez International Airport |
13057 | JetBeats | Itami Airport |
13058 | JetBeats | Adelaide International Airport |
13059 rows × 2 columns
List input is not currently supported by eland.DataFrame.get:
[30]:
try:
    ed_flights.get(['Carrier', 'Origin'])
except TypeError as e:
    print(e)
unhashable type: 'list'
DataFrame.query¶
[31]:
pd_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled == True')
[31]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
8 | 960.869736 | True | ... | 0 | 2018-01-01 12:09:35 |
26 | 975.812632 | True | ... | 0 | 2018-01-01 15:38:32 |
311 | 946.358410 | True | ... | 0 | 2018-01-01 11:51:12 |
651 | 975.383864 | True | ... | 2 | 2018-01-03 21:13:17 |
950 | 907.836523 | True | ... | 2 | 2018-01-03 05:14:51 |
... | ... | ... | ... | ... | ... |
12820 | 909.973606 | True | ... | 5 | 2018-02-10 05:11:35 |
12906 | 983.429244 | True | ... | 6 | 2018-02-11 06:19:58 |
12918 | 1136.678150 | True | ... | 6 | 2018-02-11 16:03:10 |
12919 | 1105.211803 | True | ... | 6 | 2018-02-11 05:36:05 |
13013 | 1055.350213 | True | ... | 6 | 2018-02-11 13:20:16 |
68 rows × 27 columns
eland.DataFrame.query requires a qualifier on booleans, i.e. ed_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled') fails; the boolean column must be compared explicitly, as in Cancelled == True.
[32]:
ed_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled == True')
[32]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
8 | 960.869736 | True | ... | 0 | 2018-01-01 12:09:35 |
26 | 975.812632 | True | ... | 0 | 2018-01-01 15:38:32 |
311 | 946.358410 | True | ... | 0 | 2018-01-01 11:51:12 |
651 | 975.383864 | True | ... | 2 | 2018-01-03 21:13:17 |
950 | 907.836523 | True | ... | 2 | 2018-01-03 05:14:51 |
... | ... | ... | ... | ... | ... |
12820 | 909.973606 | True | ... | 5 | 2018-02-10 05:11:35 |
12906 | 983.429244 | True | ... | 6 | 2018-02-11 06:19:58 |
12918 | 1136.678150 | True | ... | 6 | 2018-02-11 16:03:10 |
12919 | 1105.211803 | True | ... | 6 | 2018-02-11 05:36:05 |
13013 | 1055.350213 | True | ... | 6 | 2018-02-11 13:20:16 |
68 rows × 27 columns
[33]:
pd_flights[(pd_flights.Carrier=="Kibana Airlines") &
           (pd_flights.AvgTicketPrice > 900.0) &
           (pd_flights.Cancelled == True)]
[33]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
8 | 960.869736 | True | ... | 0 | 2018-01-01 12:09:35 |
26 | 975.812632 | True | ... | 0 | 2018-01-01 15:38:32 |
311 | 946.358410 | True | ... | 0 | 2018-01-01 11:51:12 |
651 | 975.383864 | True | ... | 2 | 2018-01-03 21:13:17 |
950 | 907.836523 | True | ... | 2 | 2018-01-03 05:14:51 |
... | ... | ... | ... | ... | ... |
12820 | 909.973606 | True | ... | 5 | 2018-02-10 05:11:35 |
12906 | 983.429244 | True | ... | 6 | 2018-02-11 06:19:58 |
12918 | 1136.678150 | True | ... | 6 | 2018-02-11 16:03:10 |
12919 | 1105.211803 | True | ... | 6 | 2018-02-11 05:36:05 |
13013 | 1055.350213 | True | ... | 6 | 2018-02-11 13:20:16 |
68 rows × 27 columns
[34]:
ed_flights[(ed_flights.Carrier=="Kibana Airlines") &
           (ed_flights.AvgTicketPrice > 900.0) &
           (ed_flights.Cancelled == True)]
[34]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
8 | 960.869736 | True | ... | 0 | 2018-01-01 12:09:35 |
26 | 975.812632 | True | ... | 0 | 2018-01-01 15:38:32 |
311 | 946.358410 | True | ... | 0 | 2018-01-01 11:51:12 |
651 | 975.383864 | True | ... | 2 | 2018-01-03 21:13:17 |
950 | 907.836523 | True | ... | 2 | 2018-01-03 05:14:51 |
... | ... | ... | ... | ... | ... |
12820 | 909.973606 | True | ... | 5 | 2018-02-10 05:11:35 |
12906 | 983.429244 | True | ... | 6 | 2018-02-11 06:19:58 |
12918 | 1136.678150 | True | ... | 6 | 2018-02-11 16:03:10 |
12919 | 1105.211803 | True | ... | 6 | 2018-02-11 05:36:05 |
13013 | 1055.350213 | True | ... | 6 | 2018-02-11 13:20:16 |
68 rows × 27 columns
Function application, GroupBy & window¶
DataFrame.aggs¶
[35]:
pd_flights[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
[35]:
DistanceKilometers | AvgTicketPrice | |
---|---|---|
sum | 9.261629e+07 | 8.204365e+06 |
min | 0.000000e+00 | 1.000205e+02 |
std | 4.578438e+03 | 2.663969e+02 |
eland.DataFrame.aggregate currently only supports numeric columns.
[36]:
ed_flights[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
[36]:
DistanceKilometers | AvgTicketPrice | |
---|---|---|
sum | 9.261629e+07 | 8.204365e+06 |
min | 0.000000e+00 | 1.000205e+02 |
std | 4.578614e+03 | 2.664071e+02 |
Computations / descriptive stats¶
DataFrame.count¶
[37]:
pd_flights.count()
[37]:
AvgTicketPrice 13059
Cancelled 13059
Carrier 13059
Dest 13059
DestAirportID 13059
...
OriginLocation 13059
OriginRegion 13059
OriginWeather 13059
dayOfWeek 13059
timestamp 13059
Length: 27, dtype: int64
[38]:
ed_flights.count()
[38]:
AvgTicketPrice 13059
Cancelled 13059
Carrier 13059
Dest 13059
DestAirportID 13059
...
OriginLocation 13059
OriginRegion 13059
OriginWeather 13059
dayOfWeek 13059
timestamp 13059
Length: 27, dtype: int64
DataFrame.describe¶
[39]:
pd_flights.describe()
[39]:
AvgTicketPrice | DistanceKilometers | ... | FlightTimeMin | dayOfWeek | |
---|---|---|---|---|---|
count | 13059.000000 | 13059.000000 | ... | 13059.000000 | 13059.000000 |
mean | 628.253689 | 7092.142455 | ... | 511.127842 | 2.835975 |
std | 266.396861 | 4578.438497 | ... | 334.753952 | 1.939439 |
min | 100.020528 | 0.000000 | ... | 0.000000 | 0.000000 |
25% | 409.893816 | 2459.705673 | ... | 252.333192 | 1.000000 |
50% | 640.556668 | 7610.330866 | ... | 503.045170 | 3.000000 |
75% | 842.185470 | 9736.637600 | ... | 720.416036 | 4.000000 |
max | 1199.729053 | 19881.482315 | ... | 1902.902032 | 6.000000 |
8 rows × 7 columns
Values returned from eland.DataFrame.describe may vary because they are computed from Elasticsearch aggregations, some of which are approximate.
[40]:
# NBVAL_IGNORE_OUTPUT
ed_flights.describe()
[40]:
AvgTicketPrice | Cancelled | ... | FlightTimeMin | dayOfWeek | |
---|---|---|---|---|---|
count | 13059.000000 | 13059.000000 | ... | 13059.000000 | 13059.000000 |
mean | 628.253689 | 0.128494 | ... | 511.127842 | 2.835975 |
std | 266.407061 | 0.334664 | ... | 334.766770 | 1.939513 |
min | 100.020531 | 0.000000 | ... | 0.000000 | 0.000000 |
25% | 410.008918 | 0.000000 | ... | 251.938710 | 1.000000 |
50% | 640.387285 | 0.000000 | ... | 503.148975 | 3.000000 |
75% | 842.213490 | 0.000000 | ... | 720.505705 | 4.000000 |
max | 1199.729004 | 1.000000 | ... | 1902.901978 | 6.000000 |
8 rows × 9 columns
DataFrame.info¶
[41]:
pd_flights.info()
<class 'pandas.core.frame.DataFrame'>
Index: 13059 entries, 0 to 13058
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AvgTicketPrice 13059 non-null float64
1 Cancelled 13059 non-null bool
2 Carrier 13059 non-null object
3 Dest 13059 non-null object
4 DestAirportID 13059 non-null object
5 DestCityName 13059 non-null object
6 DestCountry 13059 non-null object
7 DestLocation 13059 non-null object
8 DestRegion 13059 non-null object
9 DestWeather 13059 non-null object
10 DistanceKilometers 13059 non-null float64
11 DistanceMiles 13059 non-null float64
12 FlightDelay 13059 non-null bool
13 FlightDelayMin 13059 non-null int64
14 FlightDelayType 13059 non-null object
15 FlightNum 13059 non-null object
16 FlightTimeHour 13059 non-null float64
17 FlightTimeMin 13059 non-null float64
18 Origin 13059 non-null object
19 OriginAirportID 13059 non-null object
20 OriginCityName 13059 non-null object
21 OriginCountry 13059 non-null object
22 OriginLocation 13059 non-null object
23 OriginRegion 13059 non-null object
24 OriginWeather 13059 non-null object
25 dayOfWeek 13059 non-null int64
26 timestamp 13059 non-null datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17)
memory usage: 3.1+ MB
[42]:
ed_flights.info()
<class 'eland.dataframe.DataFrame'>
Index: 13059 entries, 0 to 13058
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AvgTicketPrice 13059 non-null float64
1 Cancelled 13059 non-null bool
2 Carrier 13059 non-null object
3 Dest 13059 non-null object
4 DestAirportID 13059 non-null object
5 DestCityName 13059 non-null object
6 DestCountry 13059 non-null object
7 DestLocation 13059 non-null object
8 DestRegion 13059 non-null object
9 DestWeather 13059 non-null object
10 DistanceKilometers 13059 non-null float64
11 DistanceMiles 13059 non-null float64
12 FlightDelay 13059 non-null bool
13 FlightDelayMin 13059 non-null int64
14 FlightDelayType 13059 non-null object
15 FlightNum 13059 non-null object
16 FlightTimeHour 13059 non-null float64
17 FlightTimeMin 13059 non-null float64
18 Origin 13059 non-null object
19 OriginAirportID 13059 non-null object
20 OriginCityName 13059 non-null object
21 OriginCountry 13059 non-null object
22 OriginLocation 13059 non-null object
23 OriginRegion 13059 non-null object
24 OriginWeather 13059 non-null object
25 dayOfWeek 13059 non-null int64
26 timestamp 13059 non-null datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17)
memory usage: 64.000 bytes
Elasticsearch storage usage: 5.786 MB
DataFrame.max, DataFrame.min, DataFrame.mean, DataFrame.sum¶
[43]:
pd_flights.max(numeric_only=True)
[43]:
AvgTicketPrice 1199.729053
Cancelled True
DistanceKilometers 19881.482315
DistanceMiles 12353.780369
FlightDelay True
FlightDelayMin 360
FlightTimeHour 31.715034
FlightTimeMin 1902.902032
dayOfWeek 6
dtype: object
eland.DataFrame.max, min, mean and sum only aggregate numeric columns.
[44]:
ed_flights.max(numeric_only=True)
[44]:
AvgTicketPrice 1199.729004
Cancelled 1.000000
DistanceKilometers 19881.482422
DistanceMiles 12353.780273
FlightDelay 1.000000
FlightDelayMin 360.000000
FlightTimeHour 31.715034
FlightTimeMin 1902.901978
dayOfWeek 6.000000
dtype: float64
[45]:
pd_flights.min(numeric_only=True)
[45]:
AvgTicketPrice 100.020528
Cancelled False
DistanceKilometers 0.0
DistanceMiles 0.0
FlightDelay False
FlightDelayMin 0
FlightTimeHour 0.0
FlightTimeMin 0.0
dayOfWeek 0
dtype: object
[46]:
ed_flights.min(numeric_only=True)
[46]:
AvgTicketPrice 100.020531
Cancelled 0.000000
DistanceKilometers 0.000000
DistanceMiles 0.000000
FlightDelay 0.000000
FlightDelayMin 0.000000
FlightTimeHour 0.000000
FlightTimeMin 0.000000
dayOfWeek 0.000000
dtype: float64
[47]:
pd_flights.mean(numeric_only=True)
[47]:
AvgTicketPrice 628.253689
Cancelled 0.128494
DistanceKilometers 7092.142455
DistanceMiles 4406.853013
FlightDelay 0.251168
FlightDelayMin 47.335171
FlightTimeHour 8.518797
FlightTimeMin 511.127842
dayOfWeek 2.835975
dtype: float64
[48]:
ed_flights.mean(numeric_only=True)
[48]:
AvgTicketPrice 628.253689
Cancelled 0.128494
DistanceKilometers 7092.142457
DistanceMiles 4406.853010
FlightDelay 0.251168
FlightDelayMin 47.335171
FlightTimeHour 8.518797
FlightTimeMin 511.127842
dayOfWeek 2.835975
dtype: float64
[49]:
pd_flights.sum(numeric_only=True)
[49]:
AvgTicketPrice 8.204365e+06
Cancelled 1.678000e+03
DistanceKilometers 9.261629e+07
DistanceMiles 5.754909e+07
FlightDelay 3.280000e+03
FlightDelayMin 6.181500e+05
FlightTimeHour 1.112470e+05
FlightTimeMin 6.674818e+06
dayOfWeek 3.703500e+04
dtype: float64
[50]:
ed_flights.sum(numeric_only=True)
[50]:
AvgTicketPrice 8.204365e+06
Cancelled 1.678000e+03
DistanceKilometers 9.261629e+07
DistanceMiles 5.754909e+07
FlightDelay 3.280000e+03
FlightDelayMin 6.181500e+05
FlightTimeHour 1.112470e+05
FlightTimeMin 6.674818e+06
dayOfWeek 3.703500e+04
dtype: float64
DataFrame.nunique¶
[51]:
pd_flights[['Carrier', 'Origin', 'Dest']].nunique()
[51]:
Carrier 4
Origin 156
Dest 156
dtype: int64
[52]:
ed_flights[['Carrier', 'Origin', 'Dest']].nunique()
[52]:
Carrier 4
Origin 156
Dest 156
dtype: int64
DataFrame.drop¶
[53]:
pd_flights.drop(columns=['AvgTicketPrice',
                         'Cancelled',
                         'DestLocation',
                         'Dest',
                         'DestAirportID',
                         'DestCityName',
                         'DestCountry'])
[53]:
Carrier | DestRegion | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
0 | Kibana Airlines | SE-BD | ... | 0 | 2018-01-01 00:00:00 |
1 | Logstash Airways | IT-34 | ... | 0 | 2018-01-01 18:27:00 |
2 | Logstash Airways | IT-34 | ... | 0 | 2018-01-01 17:11:14 |
3 | Kibana Airlines | IT-34 | ... | 0 | 2018-01-01 10:33:28 |
4 | Kibana Airlines | SE-BD | ... | 0 | 2018-01-01 05:13:00 |
... | ... | ... | ... | ... | ... |
13054 | Logstash Airways | SE-BD | ... | 6 | 2018-02-11 20:42:25 |
13055 | Logstash Airways | CH-ZH | ... | 6 | 2018-02-11 01:41:57 |
13056 | Logstash Airways | RU-AMU | ... | 6 | 2018-02-11 04:09:27 |
13057 | JetBeats | SE-BD | ... | 6 | 2018-02-11 08:28:21 |
13058 | JetBeats | US-DC | ... | 6 | 2018-02-11 14:54:34 |
13059 rows × 20 columns
[54]:
ed_flights.drop(columns=['AvgTicketPrice',
                         'Cancelled',
                         'DestLocation',
                         'Dest',
                         'DestAirportID',
                         'DestCityName',
                         'DestCountry'])
[54]:
Carrier | DestRegion | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
0 | Kibana Airlines | SE-BD | ... | 0 | 2018-01-01 00:00:00 |
1 | Logstash Airways | IT-34 | ... | 0 | 2018-01-01 18:27:00 |
2 | Logstash Airways | IT-34 | ... | 0 | 2018-01-01 17:11:14 |
3 | Kibana Airlines | IT-34 | ... | 0 | 2018-01-01 10:33:28 |
4 | Kibana Airlines | SE-BD | ... | 0 | 2018-01-01 05:13:00 |
... | ... | ... | ... | ... | ... |
13054 | Logstash Airways | SE-BD | ... | 6 | 2018-02-11 20:42:25 |
13055 | Logstash Airways | CH-ZH | ... | 6 | 2018-02-11 01:41:57 |
13056 | Logstash Airways | RU-AMU | ... | 6 | 2018-02-11 04:09:27 |
13057 | JetBeats | SE-BD | ... | 6 | 2018-02-11 08:28:21 |
13058 | JetBeats | US-DC | ... | 6 | 2018-02-11 14:54:34 |
13059 rows × 20 columns
Plotting¶
[55]:
pd_flights.select_dtypes(include=np.number).hist(figsize=[10,10])
plt.show()

[56]:
ed_flights.select_dtypes(include=np.number).hist(figsize=[10,10])
plt.show()

Elasticsearch utilities¶
[57]:
ed_flights2 = ed_flights[(ed_flights.OriginAirportID == 'AMS') & (ed_flights.FlightDelayMin > 60)]
ed_flights2 = ed_flights2[['timestamp', 'OriginAirportID', 'DestAirportID', 'FlightDelayMin']]
ed_flights2 = ed_flights2.tail()
[58]:
print(ed_flights2.es_info())
es_index_pattern: flights
Index:
es_index_field: _id
is_source_field: False
Mappings:
capabilities:
es_field_name is_source es_dtype es_date_format pd_dtype is_searchable is_aggregatable is_scripted aggregatable_es_field_name
timestamp timestamp True date strict_date_hour_minute_second datetime64[ns] True True False timestamp
OriginAirportID OriginAirportID True keyword None object True True False OriginAirportID
DestAirportID DestAirportID True keyword None object True True False DestAirportID
FlightDelayMin FlightDelayMin True integer None int64 True True False FlightDelayMin
Operations:
tasks: [('boolean_filter': ('boolean_filter': {'bool': {'must': [{'term': {'OriginAirportID': 'AMS'}}, {'range': {'FlightDelayMin': {'gt': 60}}}]}})), ('tail': ('sort_field': '_doc', 'count': 5))]
size: 5
sort_params: {'_doc': 'desc'}
_source: ['timestamp', 'OriginAirportID', 'DestAirportID', 'FlightDelayMin']
body: {'query': {'bool': {'must': [{'term': {'OriginAirportID': 'AMS'}}, {'range': {'FlightDelayMin': {'gt': 60}}}]}}}
post_processing: [('sort_index')]
Introduction to Eland Webinar¶
- Webinar Recording on Youtube
- Eland Documentation
- Source Code on GitHub
- Elastic Cloud
- NYC Open Data dataset
This Jupyter Notebook goes along with the webinar ‘Introduction to Eland’, which is available on YouTube. To follow along, either create an Elasticsearch deployment on Elastic Cloud (a free trial is available) or start your own Elasticsearch cluster locally.
You’ll need to install the following libraries:
$ python -m pip install eland numpy pandas
DataFrame Demo¶
[19]:
# Standard imports
import eland as ed
import pandas as pd
import numpy as np
from elasticsearch import Elasticsearch
# Function for pretty-printing JSON
def json(x):
    import json
    print(json.dumps(x, indent=2, sort_keys=True))
[20]:
# Connect to an Elastic Cloud instance,
# or your own Elasticsearch cluster below
ELASTIC_CLOUD_ID = "<cloud-id>"
ELASTIC_CLOUD_PASSWORD = "<password>"
es = Elasticsearch(
    cloud_id=ELASTIC_CLOUD_ID,
    basic_auth=("elastic", ELASTIC_CLOUD_PASSWORD)
)
json(es.info())
{
"cluster_name": "167e473c7bba4bae85004385d4e0ce46",
"cluster_uuid": "4Y2FwBhRSsWq9uGedb1DmQ",
"name": "instance-0000000000",
"tagline": "You Know, for Search",
"version": {
"build_date": "2020-06-14T19:35:50.234439Z",
"build_flavor": "default",
"build_hash": "757314695644ea9a1dc2fecd26d1a43856725e65",
"build_snapshot": false,
"build_type": "docker",
"lucene_version": "8.5.1",
"minimum_index_compatibility_version": "6.0.0-beta1",
"minimum_wire_compatibility_version": "6.8.0",
"number": "7.8.0"
}
}
[21]:
# Load the dataset from NYC Open Data and take a look
pd_df = pd.read_csv("nyc-restaurants.csv").dropna()
pd_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 193197 entries, 0 to 400255
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CAMIS 193197 non-null int64
1 DBA 193197 non-null object
2 BORO 193197 non-null object
3 BUILDING 193197 non-null object
4 STREET 193197 non-null object
5 ZIPCODE 193197 non-null float64
6 PHONE 193197 non-null object
7 CUISINE DESCRIPTION 193197 non-null object
8 INSPECTION DATE 193197 non-null object
9 ACTION 193197 non-null object
10 VIOLATION CODE 193197 non-null object
11 VIOLATION DESCRIPTION 193197 non-null object
12 CRITICAL FLAG 193197 non-null object
13 SCORE 193197 non-null float64
14 GRADE 193197 non-null object
15 GRADE DATE 193197 non-null object
16 RECORD DATE 193197 non-null object
17 INSPECTION TYPE 193197 non-null object
18 Latitude 193197 non-null float64
19 Longitude 193197 non-null float64
20 Community Board 193197 non-null float64
21 Council District 193197 non-null float64
22 Census Tract 193197 non-null float64
23 BIN 193197 non-null float64
24 BBL 193197 non-null float64
25 NTA 193197 non-null object
dtypes: float64(9), int64(1), object(16)
memory usage: 39.8+ MB
[22]:
# Rename the columns to be snake_case
pd_df.columns = [x.lower().replace(" ", "_") for x in pd_df.columns]
# Combine the 'latitude' and 'longitude' columns into one column 'location' for 'geo_point'
pd_df["location"] = pd_df[["latitude", "longitude"]].apply(lambda x: ",".join(str(item) for item in x), axis=1)
# Drop the old columns in favor of 'location'
pd_df.drop(["latitude", "longitude"], axis=1, inplace=True)
pd_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 193197 entries, 0 to 400255
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 camis 193197 non-null int64
1 dba 193197 non-null object
2 boro 193197 non-null object
3 building 193197 non-null object
4 street 193197 non-null object
5 zipcode 193197 non-null float64
6 phone 193197 non-null object
7 cuisine_description 193197 non-null object
8 inspection_date 193197 non-null object
9 action 193197 non-null object
10 violation_code 193197 non-null object
11 violation_description 193197 non-null object
12 critical_flag 193197 non-null object
13 score 193197 non-null float64
14 grade 193197 non-null object
15 grade_date 193197 non-null object
16 record_date 193197 non-null object
17 inspection_type 193197 non-null object
18 community_board 193197 non-null float64
19 council_district 193197 non-null float64
20 census_tract 193197 non-null float64
21 bin 193197 non-null float64
22 bbl 193197 non-null float64
23 nta 193197 non-null object
24 location 193197 non-null object
dtypes: float64(7), int64(1), object(17)
memory usage: 38.3+ MB
[23]:
df = ed.pandas_to_eland(
    pd_df=pd_df,
    es_client=es,
    # Where the data will live in Elasticsearch
    es_dest_index="nyc-restaurants",
    # Type overrides for certain columns: 'location' would be detected
    # automatically as 'keyword', but we want it interpreted as 'geo_point'.
    es_type_overrides={
        "location": "geo_point",
        "dba": "text",
        "zipcode": "short"
    },
    # If the index already exists, what should we do?
    es_if_exists="replace",
    # Wait for the data to be indexed before returning
    es_refresh=True,
)
df.info()
<class 'eland.dataframe.DataFrame'>
Index: 193197 entries, 10388 to 398749
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 camis 193197 non-null int64
1 dba 193197 non-null object
2 boro 193197 non-null object
3 building 193197 non-null object
4 street 193197 non-null object
5 zipcode 193197 non-null int64
6 phone 193197 non-null object
7 cuisine_description 193197 non-null object
8 inspection_date 193197 non-null object
9 action 193197 non-null object
10 violation_code 193197 non-null object
11 violation_description 193197 non-null object
12 critical_flag 193197 non-null object
13 score 193197 non-null float64
14 grade 193197 non-null object
15 grade_date 193197 non-null object
16 record_date 193197 non-null object
17 inspection_type 193197 non-null object
18 community_board 193197 non-null float64
19 council_district 193197 non-null float64
20 census_tract 193197 non-null float64
21 bin 193197 non-null float64
22 bbl 193197 non-null float64
23 nta 193197 non-null object
24 location 193197 non-null object
dtypes: float64(6), int64(2), object(17)
memory usage: 80.0 bytes
[24]:
json(es.indices.get_mapping(index="nyc-restaurants"))
{
"nyc-restaurants": {
"mappings": {
"properties": {
"action": {
"type": "keyword"
},
"bbl": {
"type": "double"
},
"bin": {
"type": "double"
},
"boro": {
"type": "keyword"
},
"building": {
"type": "keyword"
},
"camis": {
"type": "long"
},
"census_tract": {
"type": "double"
},
"community_board": {
"type": "double"
},
"council_district": {
"type": "double"
},
"critical_flag": {
"type": "keyword"
},
"cuisine_description": {
"type": "keyword"
},
"dba": {
"type": "text"
},
"grade": {
"type": "keyword"
},
"grade_date": {
"type": "keyword"
},
"inspection_date": {
"type": "keyword"
},
"inspection_type": {
"type": "keyword"
},
"location": {
"type": "geo_point"
},
"nta": {
"type": "keyword"
},
"phone": {
"type": "keyword"
},
"record_date": {
"type": "keyword"
},
"score": {
"type": "double"
},
"street": {
"type": "keyword"
},
"violation_code": {
"type": "keyword"
},
"violation_description": {
"type": "keyword"
},
"zipcode": {
"type": "short"
}
}
}
}
}
[25]:
# Shape is determined by using the count API
df.shape
[25]:
(193197, 25)
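For reference, the row count above maps to the Elasticsearch count API; a rough equivalent using the low-level client connected earlier:
es.count(index="nyc-restaurants")["count"]  # 193197, same as df.shape[0]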
[34]:
# DataFrame has many APIs compatible with Pandas
#df.head(10)
#df.columns
#df.dba
#df["grade"]
#df[df.grade.isin(["A", "B"])]
#print(df[df.grade.isin(["A", "B"])].es_info())
print(df.tail(10).es_info())
es_index_pattern: nyc-restaurants
Index:
es_index_field: _id
is_source_field: False
Mappings:
capabilities:
es_field_name is_source es_dtype es_date_format pd_dtype is_searchable is_aggregatable is_scripted aggregatable_es_field_name
camis camis True long None int64 True True False camis
dba dba True text None object True False False None
boro boro True keyword None object True True False boro
building building True keyword None object True True False building
street street True keyword None object True True False street
zipcode zipcode True short None int64 True True False zipcode
phone phone True keyword None object True True False phone
cuisine_description cuisine_description True keyword None object True True False cuisine_description
inspection_date inspection_date True keyword None object True True False inspection_date
action action True keyword None object True True False action
violation_code violation_code True keyword None object True True False violation_code
violation_description violation_description True keyword None object True True False violation_description
critical_flag critical_flag True keyword None object True True False critical_flag
score score True double None float64 True True False score
grade grade True keyword None object True True False grade
grade_date grade_date True keyword None object True True False grade_date
record_date record_date True keyword None object True True False record_date
inspection_type inspection_type True keyword None object True True False inspection_type
community_board community_board True double None float64 True True False community_board
council_district council_district True double None float64 True True False council_district
census_tract census_tract True double None float64 True True False census_tract
bin bin True double None float64 True True False bin
bbl bbl True double None float64 True True False bbl
nta nta True keyword None object True True False nta
location location True geo_point None object True True False location
Operations:
tasks: [('tail': ('sort_field': '_doc', 'count': 10))]
size: 10
sort_params: _doc:desc
_source: ['camis', 'dba', 'boro', 'building', 'street', 'zipcode', 'phone', 'cuisine_description', 'inspection_date', 'action', 'violation_code', 'violation_description', 'critical_flag', 'score', 'grade', 'grade_date', 'record_date', 'inspection_type', 'community_board', 'council_district', 'census_tract', 'bin', 'bbl', 'nta', 'location']
body: {}
post_processing: [('sort_index')]
[39]:
# Aggregating values
df.describe()
[39]:
camis | zipcode | score | community_board | council_district | census_tract | bin | bbl | |
---|---|---|---|---|---|---|---|---|
count | 1.931970e+05 | 193197.000000 | 193197.000000 | 193197.000000 | 193197.000000 | 193197.000000 | 1.931970e+05 | 1.931970e+05 |
mean | 4.605010e+07 | 10677.212540 | 12.947680 | 248.602603 | 20.020715 | 28796.048298 | 2.513373e+06 | 2.450622e+09 |
std | 4.415232e+06 | 595.142246 | 8.180244 | 130.697014 | 15.809664 | 30672.683469 | 1.351134e+06 | 1.313578e+09 |
min | 3.011234e+07 | 10000.000000 | -1.000000 | 101.000000 | 1.000000 | 100.000000 | 1.000000e+06 | 1.000000e+09 |
25% | 4.138051e+07 | 10022.000000 | 9.000000 | 105.000000 | 4.000000 | 7895.605691 | 1.042708e+06 | 1.011024e+09 |
50% | 5.000527e+07 | 10468.006114 | 12.000000 | 301.000000 | 19.747529 | 16022.917106 | 3.007191e+06 | 3.002924e+09 |
75% | 5.005661e+07 | 11228.624535 | 13.000000 | 401.000000 | 34.000000 | 40246.000337 | 4.002294e+06 | 4.003343e+09 |
max | 5.010416e+07 | 12345.000000 | 99.000000 | 503.000000 | 51.000000 | 162100.000000 | 5.799501e+06 | 5.270001e+09 |
[40]:
# Plotting with matplotlib
from matplotlib import pyplot as plt
df[["score"]].hist(figsize=[10,10])
plt.show()

[42]:
# es_query() allows for the full Elasticsearch querying capabilities
df.es_query({
    "geo_distance": {
        "distance": "50m",
        "location": {
            "lat": 40.643852716573,
            "lon": -74.011628212186
        }
    }
})
[42]:
camis | dba | boro | building | street | zipcode | phone | cuisine_description | inspection_date | action | ... | grade_date | record_date | inspection_type | community_board | council_district | census_tract | bin | bbl | nta | location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
53127 | 41144258 | BURGER KING | Brooklyn | 5212 | 5 AVENUE | 11220.0 | 7187650844 | Hamburgers | 12/26/2018 | Violations were cited in the following area(s). | ... | 12/26/2018 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 7600.0 | 3329902.0 | 3.008070e+09 | BK32 | 40.643852716573,-74.011628212186 |
61268 | 41144258 | BURGER KING | Brooklyn | 5212 | 5 AVENUE | 11220.0 | 7187650844 | Hamburgers | 07/20/2017 | Violations were cited in the following area(s). | ... | 07/20/2017 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 7600.0 | 3329902.0 | 3.008070e+09 | BK32 | 40.643852716573,-74.011628212186 |
20717 | 41144258 | BURGER KING | Brooklyn | 5212 | 5 AVENUE | 11220.0 | 7187650844 | Hamburgers | 03/04/2020 | Violations were cited in the following area(s). | ... | 03/04/2020 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 7600.0 | 3329902.0 | 3.008070e+09 | BK32 | 40.643852716573,-74.011628212186 |
4648 | 41271801 | PINO'S | Brooklyn | 5201 | 5 AVENUE | 11220.0 | 7184396012 | Pizza | 05/25/2019 | Violations were cited in the following area(s). | ... | 05/25/2019 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 10000.0 | 3013942.0 | 3.008080e+09 | BK34 | 40.643888405293005,-74.011563356969 |
224 | 41271801 | PINO'S | Brooklyn | 5201 | 5 AVENUE | 11220.0 | 7184396012 | Pizza | 05/25/2019 | Violations were cited in the following area(s). | ... | 05/25/2019 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 10000.0 | 3013942.0 | 3.008080e+09 | BK34 | 40.643888405293005,-74.011563356969 |
9465 | 41144258 | BURGER KING | Brooklyn | 5212 | 5 AVENUE | 11220.0 | 7187650844 | Hamburgers | 03/04/2020 | Violations were cited in the following area(s). | ... | 03/04/2020 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 7600.0 | 3329902.0 | 3.008070e+09 | BK32 | 40.643852716573,-74.011628212186 |
104512 | 40396492 | ROYAL KING'S PIZZA | Brooklyn | 5211 | 5 AVENUE | 11220.0 | 7184923846 | Pizza | 12/19/2018 | Violations were cited in the following area(s). | ... | 12/19/2018 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 10000.0 | 3013939.0 | 3.008080e+09 | BK34 | 40.643849974348996,-74.01160298782 |
106728 | 41271801 | PINO'S | Brooklyn | 5201 | 5 AVENUE | 11220.0 | 7184396012 | Pizza | 01/25/2018 | Violations were cited in the following area(s). | ... | 01/25/2018 | 07/07/2020 | Cycle Inspection / Re-inspection | 307.0 | 38.0 | 10000.0 | 3013942.0 | 3.008080e+09 | BK34 | 40.643888405293005,-74.011563356969 |
62748 | 50004330 | KFC | Brooklyn | 5223 | 5 AVENUE | 11220.0 | 7184922813 | Chicken | 05/28/2019 | Violations were cited in the following area(s). | ... | 05/28/2019 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 10000.0 | 3013937.0 | 3.008080e+09 | BK34 | 40.643800563168,-74.01165342693001 |
79211 | 41271801 | PINO'S | Brooklyn | 5201 | 5 AVENUE | 11220.0 | 7184396012 | Pizza | 11/05/2016 | Violations were cited in the following area(s). | ... | 11/05/2016 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 10000.0 | 3013942.0 | 3.008080e+09 | BK34 | 40.643888405293005,-74.011563356969 |
218545 | 50004330 | KFC | Brooklyn | 5223 | 5 AVENUE | 11220.0 | 7184922813 | Chicken | 01/10/2018 | Violations were cited in the following area(s). | ... | 01/10/2018 | 07/07/2020 | Cycle Inspection / Re-inspection | 307.0 | 38.0 | 10000.0 | 3013937.0 | 3.008080e+09 | BK34 | 40.643800563168,-74.01165342693001 |
238663 | 41271801 | PINO'S | Brooklyn | 5201 | 5 AVENUE | 11220.0 | 7184396012 | Pizza | 11/05/2016 | Violations were cited in the following area(s). | ... | 11/05/2016 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 10000.0 | 3013942.0 | 3.008080e+09 | BK34 | 40.643888405293005,-74.011563356969 |
245205 | 40396492 | ROYAL KING'S PIZZA | Brooklyn | 5211 | 5 AVENUE | 11220.0 | 7184923846 | Pizza | 12/19/2018 | Violations were cited in the following area(s). | ... | 12/19/2018 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 10000.0 | 3013939.0 | 3.008080e+09 | BK34 | 40.643849974348996,-74.01160298782 |
245233 | 41271801 | PINO'S | Brooklyn | 5201 | 5 AVENUE | 11220.0 | 7184396012 | Pizza | 01/25/2018 | Violations were cited in the following area(s). | ... | 01/25/2018 | 07/07/2020 | Cycle Inspection / Re-inspection | 307.0 | 38.0 | 10000.0 | 3013942.0 | 3.008080e+09 | BK34 | 40.643888405293005,-74.011563356969 |
247417 | 50004330 | KFC | Brooklyn | 5223 | 5 AVENUE | 11220.0 | 7184922813 | Chicken | 05/05/2017 | Violations were cited in the following area(s). | ... | 05/05/2017 | 07/07/2020 | Cycle Inspection / Re-inspection | 307.0 | 38.0 | 10000.0 | 3013937.0 | 3.008080e+09 | BK34 | 40.643800563168,-74.01165342693001 |
186874 | 50099704 | MASTER'S PIZZERIA | Brooklyn | 5201 | 5 AVENUE | 11220.0 | 7184396012 | Pizza | 11/18/2019 | Violations were cited in the following area(s). | ... | 11/18/2019 | 07/07/2020 | Pre-permit (Operational) / Initial Inspection | 307.0 | 38.0 | 10000.0 | 3013942.0 | 3.008080e+09 | BK34 | 40.643888405293005,-74.011563356969 |
198104 | 40396492 | ROYAL KING'S PIZZA | Brooklyn | 5211 | 5 AVENUE | 11220.0 | 7184923846 | Pizza | 12/28/2017 | Violations were cited in the following area(s). | ... | 12/28/2017 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 10000.0 | 3013939.0 | 3.008080e+09 | BK34 | 40.643849974348996,-74.01160298782 |
213425 | 40396492 | ROYAL KING'S PIZZA | Brooklyn | 5211 | 5 AVENUE | 11220.0 | 7184923846 | Pizza | 12/19/2018 | Violations were cited in the following area(s). | ... | 12/19/2018 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 10000.0 | 3013939.0 | 3.008080e+09 | BK34 | 40.643849974348996,-74.01160298782 |
202363 | 50004330 | KFC | Brooklyn | 5223 | 5 AVENUE | 11220.0 | 7184922813 | Chicken | 05/28/2019 | Violations were cited in the following area(s). | ... | 05/28/2019 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 10000.0 | 3013937.0 | 3.008080e+09 | BK34 | 40.643800563168,-74.01165342693001 |
158059 | 40396492 | ROYAL KING'S PIZZA | Brooklyn | 5211 | 5 AVENUE | 11220.0 | 7184923846 | Pizza | 12/19/2018 | Violations were cited in the following area(s). | ... | 12/19/2018 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 10000.0 | 3013939.0 | 3.008080e+09 | BK34 | 40.643849974348996,-74.01160298782 |
163672 | 41144258 | BURGER KING | Brooklyn | 5212 | 5 AVENUE | 11220.0 | 7187650844 | Hamburgers | 08/13/2018 | Violations were cited in the following area(s). | ... | 08/13/2018 | 07/07/2020 | Cycle Inspection / Re-inspection | 307.0 | 38.0 | 7600.0 | 3329902.0 | 3.008070e+09 | BK32 | 40.643852716573,-74.011628212186 |
138508 | 40396492 | ROYAL KING'S PIZZA | Brooklyn | 5211 | 5 AVENUE | 11220.0 | 7184923846 | Pizza | 01/29/2020 | Violations were cited in the following area(s). | ... | 01/29/2020 | 07/07/2020 | Cycle Inspection / Re-inspection | 307.0 | 38.0 | 10000.0 | 3013939.0 | 3.008080e+09 | BK34 | 40.643849974348996,-74.01160298782 |
140940 | 41144258 | BURGER KING | Brooklyn | 5212 | 5 AVENUE | 11220.0 | 7187650844 | Hamburgers | 07/20/2017 | Violations were cited in the following area(s). | ... | 07/20/2017 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 7600.0 | 3329902.0 | 3.008070e+09 | BK32 | 40.643852716573,-74.011628212186 |
143157 | 50004330 | KFC | Brooklyn | 5223 | 5 AVENUE | 11220.0 | 7184922813 | Chicken | 01/10/2018 | Violations were cited in the following area(s). | ... | 01/10/2018 | 07/07/2020 | Cycle Inspection / Re-inspection | 307.0 | 38.0 | 10000.0 | 3013937.0 | 3.008080e+09 | BK34 | 40.643800563168,-74.01165342693001 |
149548 | 41144258 | BURGER KING | Brooklyn | 5212 | 5 AVENUE | 11220.0 | 7187650844 | Hamburgers | 07/20/2017 | Violations were cited in the following area(s). | ... | 07/20/2017 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 7600.0 | 3329902.0 | 3.008070e+09 | BK32 | 40.643852716573,-74.011628212186 |
149742 | 50004330 | KFC | Brooklyn | 5223 | 5 AVENUE | 11220.0 | 7184922813 | Chicken | 05/31/2018 | Violations were cited in the following area(s). | ... | 05/31/2018 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 10000.0 | 3013937.0 | 3.008080e+09 | BK34 | 40.643800563168,-74.01165342693001 |
249994 | 41271801 | PINO'S | Brooklyn | 5201 | 5 AVENUE | 11220.0 | 7184396012 | Pizza | 01/25/2018 | Violations were cited in the following area(s). | ... | 01/25/2018 | 07/07/2020 | Cycle Inspection / Re-inspection | 307.0 | 38.0 | 10000.0 | 3013942.0 | 3.008080e+09 | BK34 | 40.643888405293005,-74.011563356969 |
257603 | 41144258 | BURGER KING | Brooklyn | 5212 | 5 AVENUE | 11220.0 | 7187650844 | Hamburgers | 08/13/2018 | Violations were cited in the following area(s). | ... | 08/13/2018 | 07/07/2020 | Cycle Inspection / Re-inspection | 307.0 | 38.0 | 7600.0 | 3329902.0 | 3.008070e+09 | BK32 | 40.643852716573,-74.011628212186 |
268823 | 50004330 | KFC | Brooklyn | 5223 | 5 AVENUE | 11220.0 | 7184922813 | Chicken | 01/10/2018 | Violations were cited in the following area(s). | ... | 01/10/2018 | 07/07/2020 | Cycle Inspection / Re-inspection | 307.0 | 38.0 | 10000.0 | 3013937.0 | 3.008080e+09 | BK34 | 40.643800563168,-74.01165342693001 |
269521 | 41144258 | BURGER KING | Brooklyn | 5212 | 5 AVENUE | 11220.0 | 7187650844 | Hamburgers | 12/17/2019 | Violations were cited in the following area(s). | ... | 12/17/2019 | 07/07/2020 | Cycle Inspection / Re-inspection | 307.0 | 38.0 | 7600.0 | 3329902.0 | 3.008070e+09 | BK32 | 40.643852716573,-74.011628212186 |
277500 | 50099704 | MASTER'S PIZZERIA | Brooklyn | 5201 | 5 AVENUE | 11220.0 | 7184396012 | Pizza | 11/18/2019 | Violations were cited in the following area(s). | ... | 11/18/2019 | 07/07/2020 | Pre-permit (Operational) / Initial Inspection | 307.0 | 38.0 | 10000.0 | 3013942.0 | 3.008080e+09 | BK34 | 40.643888405293005,-74.011563356969 |
279503 | 40396492 | ROYAL KING'S PIZZA | Brooklyn | 5211 | 5 AVENUE | 11220.0 | 7184923846 | Pizza | 01/29/2020 | Violations were cited in the following area(s). | ... | 01/29/2020 | 07/07/2020 | Cycle Inspection / Re-inspection | 307.0 | 38.0 | 10000.0 | 3013939.0 | 3.008080e+09 | BK34 | 40.643849974348996,-74.01160298782 |
299863 | 41144258 | BURGER KING | Brooklyn | 5212 | 5 AVENUE | 11220.0 | 7187650844 | Hamburgers | 12/26/2018 | Violations were cited in the following area(s). | ... | 12/26/2018 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 7600.0 | 3329902.0 | 3.008070e+09 | BK32 | 40.643852716573,-74.011628212186 |
319787 | 41271801 | PINO'S | Brooklyn | 5201 | 5 AVENUE | 11220.0 | 7184396012 | Pizza | 05/25/2019 | Violations were cited in the following area(s). | ... | 05/25/2019 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 10000.0 | 3013942.0 | 3.008080e+09 | BK34 | 40.643888405293005,-74.011563356969 |
336570 | 50004330 | KFC | Brooklyn | 5223 | 5 AVENUE | 11220.0 | 7184922813 | Chicken | 01/10/2018 | Violations were cited in the following area(s). | ... | 01/10/2018 | 07/07/2020 | Cycle Inspection / Re-inspection | 307.0 | 38.0 | 10000.0 | 3013937.0 | 3.008080e+09 | BK34 | 40.643800563168,-74.01165342693001 |
340551 | 50004330 | KFC | Brooklyn | 5223 | 5 AVENUE | 11220.0 | 7184922813 | Chicken | 04/10/2017 | Establishment re-opened by DOHMH | ... | 04/10/2017 | 07/07/2020 | Cycle Inspection / Reopening Inspection | 307.0 | 38.0 | 10000.0 | 3013937.0 | 3.008080e+09 | BK34 | 40.643800563168,-74.01165342693001 |
395508 | 41144258 | BURGER KING | Brooklyn | 5212 | 5 AVENUE | 11220.0 | 7187650844 | Hamburgers | 12/17/2019 | Violations were cited in the following area(s). | ... | 12/17/2019 | 07/07/2020 | Cycle Inspection / Re-inspection | 307.0 | 38.0 | 7600.0 | 3329902.0 | 3.008070e+09 | BK32 | 40.643852716573,-74.011628212186 |
309366 | 40396492 | ROYAL KING'S PIZZA | Brooklyn | 5211 | 5 AVENUE | 11220.0 | 7184923846 | Pizza | 12/28/2017 | Violations were cited in the following area(s). | ... | 12/28/2017 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 10000.0 | 3013939.0 | 3.008080e+09 | BK34 | 40.643849974348996,-74.01160298782 |
340857 | 40396492 | ROYAL KING'S PIZZA | Brooklyn | 5211 | 5 AVENUE | 11220.0 | 7184923846 | Pizza | 01/29/2020 | Violations were cited in the following area(s). | ... | 01/29/2020 | 07/07/2020 | Cycle Inspection / Re-inspection | 307.0 | 38.0 | 10000.0 | 3013939.0 | 3.008080e+09 | BK34 | 40.643849974348996,-74.01160298782 |
358660 | 50004330 | KFC | Brooklyn | 5223 | 5 AVENUE | 11220.0 | 7184922813 | Chicken | 05/31/2018 | Violations were cited in the following area(s). | ... | 05/31/2018 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 10000.0 | 3013937.0 | 3.008080e+09 | BK34 | 40.643800563168,-74.01165342693001 |
393451 | 41271801 | PINO'S | Brooklyn | 5201 | 5 AVENUE | 11220.0 | 7184396012 | Pizza | 06/05/2018 | Violations were cited in the following area(s). | ... | 06/05/2018 | 07/07/2020 | Cycle Inspection / Initial Inspection | 307.0 | 38.0 | 10000.0 | 3013942.0 | 3.008080e+09 | BK34 | 40.643888405293005,-74.011563356969 |
41 rows × 25 columns
[43]:
# Full-text search example
df.es_query({"match": {"dba": "red"}})
[43]:
camis | dba | boro | building | street | zipcode | phone | cuisine_description | inspection_date | action | ... | grade_date | record_date | inspection_type | community_board | council_district | census_tract | bin | bbl | nta | location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5765 | 50033781 | RED HOOK LOBSTER POUND | Brooklyn | 284 | VAN BRUNT STREET | 11231.0 | 7188587650 | Seafood | 04/19/2018 | Violations were cited in the following area(s). | ... | 04/19/2018 | 07/07/2020 | Cycle Inspection / Initial Inspection | 306.0 | 38.0 | 5900.0 | 3008365.0 | 3.005290e+09 | BK33 | 40.67974632809,-74.010098611838 |
12379 | 50058053 | RED HOT II | Brooklyn | 349 | 7 AVENUE | 11215.0 | 7183692577 | Chinese | 05/17/2018 | Violations were cited in the following area(s). | ... | 05/17/2018 | 07/07/2020 | Cycle Inspection / Re-inspection | 306.0 | 39.0 | 15100.0 | 3026127.0 | 3.010940e+09 | BK37 | 40.666194419994,-73.98214269199799 |
12978 | 50059700 | RED POKE | Manhattan | 600 | 9 AVENUE | 10036.0 | 2129748100 | Hawaiian | 03/21/2017 | Violations were cited in the following area(s). | ... | 03/21/2017 | 07/07/2020 | Pre-permit (Operational) / Re-inspection | 104.0 | 3.0 | 12100.0 | 1088997.0 | 1.010330e+09 | MN15 | 40.758993434643,-73.992203122611 |
16759 | 40365239 | DORRIAN'S RED HAND RESTAURANT | Manhattan | 1616 | 2 AVENUE | 10028.0 | 2127726660 | Irish | 11/08/2018 | Violations were cited in the following area(s). | ... | 11/08/2018 | 07/07/2020 | Cycle Inspection / Initial Inspection | 108.0 | 5.0 | 13800.0 | 1049947.0 | 1.015460e+09 | MN32 | 40.776404966262,-73.952802065662 |
18624 | 50095340 | RED PEONY CHINESE CUISINE | Manhattan | 24 | WEST 56 STREET | 10019.0 | 2123808883 | Chinese | 11/21/2019 | Violations were cited in the following area(s). | ... | 11/21/2019 | 07/07/2020 | Pre-permit (Operational) / Re-inspection | 105.0 | 4.0 | 10400.0 | 1034840.0 | 1.012710e+09 | MN17 | 40.762699245064,-73.975463733228 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
391229 | 50061162 | CODE RED | Bronx | 1320 | EAST GUN HILL ROAD | 10469.0 | 7188811808 | Caribbean | 05/14/2018 | Violations were cited in the following area(s). | ... | 05/14/2018 | 07/07/2020 | Cycle Inspection / Initial Inspection | 211.0 | 12.0 | 35000.0 | 2056100.0 | 2.045890e+09 | BX31 | 40.871378316318996,-73.848028279305 |
393531 | 50014078 | RED LOBSTER | Manhattan | 5 | TIMES SQ | 10036.0 | 2127306706 | Seafood | 11/08/2017 | Violations were cited in the following area(s). | ... | 11/08/2017 | 07/07/2020 | Cycle Inspection / Re-inspection | 105.0 | 3.0 | 11300.0 | 1024656.0 | 1.010130e+09 | MN17 | 40.755702020307005,-73.987207980138 |
396171 | 40368313 | RED FLAME DINER | Manhattan | 67 | WEST 44 STREET | 10036.0 | 2128693965 | American | 02/16/2018 | Violations were cited in the following area(s). | ... | 02/16/2018 | 07/07/2020 | Cycle Inspection / Initial Inspection | 105.0 | 4.0 | 9600.0 | 1034217.0 | 1.012600e+09 | MN17 | 40.755627203336,-73.981938150269 |
396501 | 50068499 | RED GINGER | Staten Island | 1650 | RICHMOND AVENUE | 10314.0 | 7189828808 | Other | 09/19/2017 | Violations were cited in the following area(s). | ... | 09/19/2017 | 07/07/2020 | Pre-permit (Operational) / Initial Inspection | 502.0 | 50.0 | 29103.0 | 5037014.0 | 5.022360e+09 | SI05 | 40.608078102502,-74.162260908042 |
398950 | 50059700 | RED POKE | Manhattan | 600 | 9 AVENUE | 10036.0 | 2129748100 | Hawaiian | 12/08/2017 | Violations were cited in the following area(s). | ... | 12/08/2017 | 07/07/2020 | Cycle Inspection / Re-inspection | 104.0 | 3.0 | 12100.0 | 1088997.0 | 1.010330e+09 | MN15 | 40.758993434643,-73.992203122611 |
573 rows × 25 columns
[44]:
# Pull a subset of your data for building graphs / operations locally.
sample_df = df[df.grade == "B"].sample(100).to_pandas()
sample_df.info()
print(type(sample_df))
<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 107677 to 96813
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 camis 100 non-null int64
1 dba 100 non-null object
2 boro 100 non-null object
3 building 100 non-null object
4 street 100 non-null object
5 zipcode 100 non-null float64
6 phone 100 non-null object
7 cuisine_description 100 non-null object
8 inspection_date 100 non-null object
9 action 100 non-null object
10 violation_code 100 non-null object
11 violation_description 100 non-null object
12 critical_flag 100 non-null object
13 score 100 non-null float64
14 grade 100 non-null object
15 grade_date 100 non-null object
16 record_date 100 non-null object
17 inspection_type 100 non-null object
18 community_board 100 non-null float64
19 council_district 100 non-null float64
20 census_tract 100 non-null float64
21 bin 100 non-null float64
22 bbl 100 non-null float64
23 nta 100 non-null object
24 location 100 non-null object
dtypes: float64(7), int64(1), object(17)
memory usage: 20.3+ KB
<class 'pandas.core.frame.DataFrame'>
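Once the sample has been pulled with to_pandas(), the full in-memory pandas API applies to it; for example:
sample_df['score'].describe()  # regular local pandas, no Elasticsearch round-trip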
Machine Learning Demo¶
[45]:
# Import scikit-learn and train a model locally
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
# Train on the wine dataset locally
wine = datasets.load_wine()
print("Feature Names:", wine.feature_names)
print("Data example:", wine.data[0])
# Hold out samples 10, 80, and 140 for testing the model
data = [x for i, x in enumerate(wine.data) if i not in (10, 80, 140)]
target = [x for i, x in enumerate(wine.target) if i not in (10, 80, 140)]
sk_classifier = DecisionTreeClassifier()
sk_classifier.fit(data, target)
# Test the model against the three held-out samples
print(sk_classifier.predict(wine.data[[10, 80, 140]]))
print(wine.target[[10, 80, 140]])
Feature Names: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Data example: [1.423e+01 1.710e+00 2.430e+00 1.560e+01 1.270e+02 2.800e+00 3.060e+00
2.800e-01 2.290e+00 5.640e+00 1.040e+00 3.920e+00 1.065e+03]
[0 1 2]
[0 1 2]
[46]:
from eland.ml import MLModel
# Serialize the scikit-learn model into Elasticsearch
ed_classifier = MLModel.import_model(
    es_client=es,
    model_id="wine-classifier",
    model=sk_classifier,
    feature_names=wine.feature_names,
    overwrite=True
)
# Capture the Elasticsearch API calls with logging
import logging
logger = logging.getLogger("elasticsearch")
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())
# Use the same data as before, but now with the model in Elasticsearch
print(ed_classifier.predict(wine.data[[10, 80, 140]].tolist()))
print(wine.target[[10, 80, 140]])
logger.handlers = []
POST https://167e473c7bba4bae85004385d4e0ce46.us-central1.gcp.cloud.es.io/_ingest/pipeline/_simulate [status:200 request:0.053s]
> {"pipeline":{"processors":[{"inference":{"model_id":"wine-classifier","inference_config":{"classification":{}},"field_map":{}}}]},"docs":[{"_source":{"alcohol":14.1,"malic_acid":2.16,"ash":2.3,"alcalinity_of_ash":18.0,"magnesium":105.0,"total_phenols":2.95,"flavanoids":3.32,"nonflavanoid_phenols":0.22,"proanthocyanins":2.38,"color_intensity":5.75,"hue":1.25,"od280/od315_of_diluted_wines":3.17,"proline":1510.0}},{"_source":{"alcohol":12.0,"malic_acid":0.92,"ash":2.0,"alcalinity_of_ash":19.0,"magnesium":86.0,"total_phenols":2.42,"flavanoids":2.26,"nonflavanoid_phenols":0.3,"proanthocyanins":1.43,"color_intensity":2.5,"hue":1.38,"od280/od315_of_diluted_wines":3.12,"proline":278.0}},{"_source":{"alcohol":12.93,"malic_acid":2.81,"ash":2.7,"alcalinity_of_ash":21.0,"magnesium":96.0,"total_phenols":1.54,"flavanoids":0.5,"nonflavanoid_phenols":0.53,"proanthocyanins":0.75,"color_intensity":4.6,"hue":0.77,"od280/od315_of_diluted_wines":2.31,"proline":600.0}}]}
< {"docs":[{"doc":{"_index":"_index","_type":"_doc","_id":"_id","_source":{"alcohol":14.1,"alcalinity_of_ash":18.0,"proanthocyanins":2.38,"od280/od315_of_diluted_wines":3.17,"total_phenols":2.95,"magnesium":105.0,"flavanoids":3.32,"proline":1510.0,"malic_acid":2.16,"ash":2.3,"nonflavanoid_phenols":0.22,"hue":1.25,"color_intensity":5.75,"ml":{"inference":{"predicted_value":"0","model_id":"wine-classifier"}}},"_ingest":{"timestamp":"2020-07-08T15:35:49.98965Z"}}},{"doc":{"_index":"_index","_type":"_doc","_id":"_id","_source":{"alcohol":12.0,"alcalinity_of_ash":19.0,"proanthocyanins":1.43,"od280/od315_of_diluted_wines":3.12,"total_phenols":2.42,"magnesium":86.0,"flavanoids":2.26,"proline":278.0,"malic_acid":0.92,"ash":2.0,"nonflavanoid_phenols":0.3,"hue":1.38,"color_intensity":2.5,"ml":{"inference":{"predicted_value":"1","model_id":"wine-classifier"}}},"_ingest":{"timestamp":"2020-07-08T15:35:49.98966Z"}}},{"doc":{"_index":"_index","_type":"_doc","_id":"_id","_source":{"alcohol":12.93,"alcalinity_of_ash":21.0,"proanthocyanins":0.75,"od280/od315_of_diluted_wines":2.31,"total_phenols":1.54,"magnesium":96.0,"flavanoids":0.5,"proline":600.0,"malic_acid":2.81,"ash":2.7,"nonflavanoid_phenols":0.53,"hue":0.77,"color_intensity":4.6,"ml":{"inference":{"predicted_value":"2","model_id":"wine-classifier"}}},"_ingest":{"timestamp":"2020-07-08T15:35:49.989672Z"}}}]}
[0 1 2]
[0 1 2]
[47]:
json({"pipeline":{"processors":[{"inference":{"model_id":"wine-classifier","inference_config":{"classification":{}},"field_map":{}}}]},"docs":[{"_source":{"alcohol":14.1,"malic_acid":2.16,"ash":2.3,"alcalinity_of_ash":18.0,"magnesium":105.0,"total_phenols":2.95,"flavanoids":3.32,"nonflavanoid_phenols":0.22,"proanthocyanins":2.38,"color_intensity":5.75,"hue":1.25,"od280/od315_of_diluted_wines":3.17,"proline":1510.0}},{"_source":{"alcohol":12.0,"malic_acid":0.92,"ash":2.0,"alcalinity_of_ash":19.0,"magnesium":86.0,"total_phenols":2.42,"flavanoids":2.26,"nonflavanoid_phenols":0.3,"proanthocyanins":1.43,"color_intensity":2.5,"hue":1.38,"od280/od315_of_diluted_wines":3.12,"proline":278.0}},{"_source":{"alcohol":12.93,"malic_acid":2.81,"ash":2.7,"alcalinity_of_ash":21.0,"magnesium":96.0,"total_phenols":1.54,"flavanoids":0.5,"nonflavanoid_phenols":0.53,"proanthocyanins":0.75,"color_intensity":4.6,"hue":0.77,"od280/od315_of_diluted_wines":2.31,"proline":600.0}}]})
{
"docs": [
{
"_source": {
"alcalinity_of_ash": 18.0,
"alcohol": 14.1,
"ash": 2.3,
"color_intensity": 5.75,
"flavanoids": 3.32,
"hue": 1.25,
"magnesium": 105.0,
"malic_acid": 2.16,
"nonflavanoid_phenols": 0.22,
"od280/od315_of_diluted_wines": 3.17,
"proanthocyanins": 2.38,
"proline": 1510.0,
"total_phenols": 2.95
}
},
{
"_source": {
"alcalinity_of_ash": 19.0,
"alcohol": 12.0,
"ash": 2.0,
"color_intensity": 2.5,
"flavanoids": 2.26,
"hue": 1.38,
"magnesium": 86.0,
"malic_acid": 0.92,
"nonflavanoid_phenols": 0.3,
"od280/od315_of_diluted_wines": 3.12,
"proanthocyanins": 1.43,
"proline": 278.0,
"total_phenols": 2.42
}
},
{
"_source": {
"alcalinity_of_ash": 21.0,
"alcohol": 12.93,
"ash": 2.7,
"color_intensity": 4.6,
"flavanoids": 0.5,
"hue": 0.77,
"magnesium": 96.0,
"malic_acid": 2.81,
"nonflavanoid_phenols": 0.53,
"od280/od315_of_diluted_wines": 2.31,
"proanthocyanins": 0.75,
"proline": 600.0,
"total_phenols": 1.54
}
}
],
"pipeline": {
"processors": [
{
"inference": {
"field_map": {},
"inference_config": {
"classification": {}
},
"model_id": "wine-classifier"
}
}
]
}
}
[48]:
json({"docs":[{"doc":{"_index":"_index","_type":"_doc","_id":"_id","_source":{"alcohol":14.1,"alcalinity_of_ash":18.0,"proanthocyanins":2.38,"od280/od315_of_diluted_wines":3.17,"total_phenols":2.95,"magnesium":105.0,"flavanoids":3.32,"proline":1510.0,"malic_acid":2.16,"ash":2.3,"nonflavanoid_phenols":0.22,"hue":1.25,"color_intensity":5.75,"ml":{"inference":{"predicted_value":"0","model_id":"wine-classifier"}}},"_ingest":{"timestamp":"2020-07-08T15:35:49.98965Z"}}},{"doc":{"_index":"_index","_type":"_doc","_id":"_id","_source":{"alcohol":12.0,"alcalinity_of_ash":19.0,"proanthocyanins":1.43,"od280/od315_of_diluted_wines":3.12,"total_phenols":2.42,"magnesium":86.0,"flavanoids":2.26,"proline":278.0,"malic_acid":0.92,"ash":2.0,"nonflavanoid_phenols":0.3,"hue":1.38,"color_intensity":2.5,"ml":{"inference":{"predicted_value":"1","model_id":"wine-classifier"}}},"_ingest":{"timestamp":"2020-07-08T15:35:49.98966Z"}}},{"doc":{"_index":"_index","_type":"_doc","_id":"_id","_source":{"alcohol":12.93,"alcalinity_of_ash":21.0,"proanthocyanins":0.75,"od280/od315_of_diluted_wines":2.31,"total_phenols":1.54,"magnesium":96.0,"flavanoids":0.5,"proline":600.0,"malic_acid":2.81,"ash":2.7,"nonflavanoid_phenols":0.53,"hue":0.77,"color_intensity":4.6,"ml":{"inference":{"predicted_value":"2","model_id":"wine-classifier"}}},"_ingest":{"timestamp":"2020-07-08T15:35:49.989672Z"}}}]})
{
"docs": [
{
"doc": {
"_id": "_id",
"_index": "_index",
"_ingest": {
"timestamp": "2020-07-08T15:35:49.98965Z"
},
"_source": {
"alcalinity_of_ash": 18.0,
"alcohol": 14.1,
"ash": 2.3,
"color_intensity": 5.75,
"flavanoids": 3.32,
"hue": 1.25,
"magnesium": 105.0,
"malic_acid": 2.16,
"ml": {
"inference": {
"model_id": "wine-classifier",
"predicted_value": "0"
}
},
"nonflavanoid_phenols": 0.22,
"od280/od315_of_diluted_wines": 3.17,
"proanthocyanins": 2.38,
"proline": 1510.0,
"total_phenols": 2.95
},
"_type": "_doc"
}
},
{
"doc": {
"_id": "_id",
"_index": "_index",
"_ingest": {
"timestamp": "2020-07-08T15:35:49.98966Z"
},
"_source": {
"alcalinity_of_ash": 19.0,
"alcohol": 12.0,
"ash": 2.0,
"color_intensity": 2.5,
"flavanoids": 2.26,
"hue": 1.38,
"magnesium": 86.0,
"malic_acid": 0.92,
"ml": {
"inference": {
"model_id": "wine-classifier",
"predicted_value": "1"
}
},
"nonflavanoid_phenols": 0.3,
"od280/od315_of_diluted_wines": 3.12,
"proanthocyanins": 1.43,
"proline": 278.0,
"total_phenols": 2.42
},
"_type": "_doc"
}
},
{
"doc": {
"_id": "_id",
"_index": "_index",
"_ingest": {
"timestamp": "2020-07-08T15:35:49.989672Z"
},
"_source": {
"alcalinity_of_ash": 21.0,
"alcohol": 12.93,
"ash": 2.7,
"color_intensity": 4.6,
"flavanoids": 0.5,
"hue": 0.77,
"magnesium": 96.0,
"malic_acid": 2.81,
"ml": {
"inference": {
"model_id": "wine-classifier",
"predicted_value": "2"
}
},
"nonflavanoid_phenols": 0.53,
"od280/od315_of_diluted_wines": 2.31,
"proanthocyanins": 0.75,
"proline": 600.0,
"total_phenols": 1.54
},
"_type": "_doc"
}
}
]
}
[50]:
print(df[df["zipcode"] > df["score"]].es_info())
es_index_pattern: nyc-restaurants
Index:
es_index_field: _id
is_source_field: False
Mappings:
capabilities:
es_field_name is_source es_dtype es_date_format pd_dtype is_searchable is_aggregatable is_scripted aggregatable_es_field_name
camis camis True long None int64 True True False camis
dba dba True text None object True False False None
boro boro True keyword None object True True False boro
building building True keyword None object True True False building
street street True keyword None object True True False street
zipcode zipcode True short None int64 True True False zipcode
phone phone True keyword None object True True False phone
cuisine_description cuisine_description True keyword None object True True False cuisine_description
inspection_date inspection_date True keyword None object True True False inspection_date
action action True keyword None object True True False action
violation_code violation_code True keyword None object True True False violation_code
violation_description violation_description True keyword None object True True False violation_description
critical_flag critical_flag True keyword None object True True False critical_flag
score score True double None float64 True True False score
grade grade True keyword None object True True False grade
grade_date grade_date True keyword None object True True False grade_date
record_date record_date True keyword None object True True False record_date
inspection_type inspection_type True keyword None object True True False inspection_type
community_board community_board True double None float64 True True False community_board
council_district council_district True double None float64 True True False council_district
census_tract census_tract True double None float64 True True False census_tract
bin bin True double None float64 True True False bin
bbl bbl True double None float64 True True False bbl
nta nta True keyword None object True True False nta
location location True geo_point None object True True False location
Operations:
tasks: [('boolean_filter': ('boolean_filter': {'script': {'script': {'source': "doc['zipcode'].value > doc['score'].value", 'lang': 'painless'}}}))]
size: None
sort_params: None
_source: ['camis', 'dba', 'boro', 'building', 'street', 'zipcode', 'phone', 'cuisine_description', 'inspection_date', 'action', 'violation_code', 'violation_description', 'critical_flag', 'score', 'grade', 'grade_date', 'record_date', 'inspection_type', 'community_board', 'council_district', 'census_tract', 'bin', 'bbl', 'nta', 'location']
body: {'query': {'script': {'script': {'source': "doc['zipcode'].value > doc['score'].value", 'lang': 'painless'}}}}
post_processing: []
[1]:
import eland as ed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Fix console size for consistent test results
from eland.conftest import *
Online Retail Analysis¶
Getting Started¶
To get started, let’s create an eland.DataFrame
by reading a CSV file. This creates and populates the online-retail
index in the local Elasticsearch cluster.
[2]:
df = ed.csv_to_eland("data/online-retail.csv.gz",
                     es_client='http://localhost:9200',
                     es_dest_index='online-retail',
                     es_if_exists='replace',
                     es_dropna=True,
                     es_refresh=True,
                     compression='gzip',
                     index_col=0)
Here we see that the "_id"
field was used to index our data frame.
[3]:
df.index.es_index_field
[3]:
'_id'
Next, we can check which fields from Elasticsearch are available to our eland data frame. columns
is available as a parameter when instantiating the data frame, which allows you to include only a subset of the fields from your index. Since we didn’t set this parameter, we have access to all fields.
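For example, you could create a data frame limited to just two fields; a minimal sketch (not run in this notebook):
df_subset = ed.DataFrame('http://localhost:9200', 'online-retail',
                         columns=['Country', 'UnitPrice'])
df_subset.columns  # Index(['Country', 'UnitPrice'], dtype='object')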
[4]:
df.columns
[4]:
Index(['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode',
'UnitPrice'],
dtype='object')
Now, let’s see the data types of our fields. Running df.dtypes
, we can see that Elasticsearch field types are mapped to pandas field types.
[5]:
df.dtypes
[5]:
Country object
CustomerID float64
Description object
InvoiceDate object
InvoiceNo object
Quantity int64
StockCode object
UnitPrice float64
dtype: object
We also offer a .es_info()
data frame method that shows all the information about the underlying index. It also shows the operations being passed from data frame methods to Elasticsearch. More on this later.
[6]:
print(df.es_info())
es_index_pattern: online-retail
Index:
es_index_field: _id
is_source_field: False
Mappings:
capabilities:
es_field_name is_source es_dtype es_date_format pd_dtype is_searchable is_aggregatable is_scripted aggregatable_es_field_name
Country Country True keyword None object True True False Country
CustomerID CustomerID True double None float64 True True False CustomerID
Description Description True keyword None object True True False Description
InvoiceDate InvoiceDate True keyword None object True True False InvoiceDate
InvoiceNo InvoiceNo True keyword None object True True False InvoiceNo
Quantity Quantity True long None int64 True True False Quantity
StockCode StockCode True keyword None object True True False StockCode
UnitPrice UnitPrice True double None float64 True True False UnitPrice
Operations:
tasks: []
size: None
sort_params: None
_source: ['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode', 'UnitPrice']
body: {}
post_processing: []
Selecting and Indexing Data¶
Now that we understand how to create a data frame and get access to its underlying attributes, let’s see how we can select subsets of our data.
head and tail¶
Much like pandas, eland data frames offer .head(n)
and .tail(n)
methods that return the first and last n rows, respectively.
[7]:
df.head(2)
[7]:
Country | CustomerID | ... | StockCode | UnitPrice | |
---|---|---|---|---|---|
0 | United Kingdom | 17850.0 | ... | 85123A | 2.55 |
1 | United Kingdom | 17850.0 | ... | 71053 | 3.39 |
2 rows × 8 columns
[8]:
print(df.tail(2).head(2).tail(2).es_info())
es_index_pattern: online-retail
Index:
es_index_field: _id
is_source_field: False
Mappings:
capabilities:
es_field_name is_source es_dtype es_date_format pd_dtype is_searchable is_aggregatable is_scripted aggregatable_es_field_name
Country Country True keyword None object True True False Country
CustomerID CustomerID True double None float64 True True False CustomerID
Description Description True keyword None object True True False Description
InvoiceDate InvoiceDate True keyword None object True True False InvoiceDate
InvoiceNo InvoiceNo True keyword None object True True False InvoiceNo
Quantity Quantity True long None int64 True True False Quantity
StockCode StockCode True keyword None object True True False StockCode
UnitPrice UnitPrice True double None float64 True True False UnitPrice
Operations:
tasks: [('tail': ('sort_field': '_doc', 'count': 2)), ('head': ('sort_field': '_doc', 'count': 2)), ('tail': ('sort_field': '_doc', 'count': 2))]
size: 2
sort_params: {'_doc': 'desc'}
_source: ['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode', 'UnitPrice']
body: {}
post_processing: [('sort_index'), ('head': ('count': 2)), ('tail': ('count': 2))]
[9]:
df.tail(2)
[9]:
Country | CustomerID | ... | StockCode | UnitPrice | |
---|---|---|---|---|---|
12498 | United Kingdom | 16710.0 | ... | 20975 | 0.65 |
12499 | United Kingdom | 16710.0 | ... | 22445 | 2.95 |
2 rows × 8 columns
Selecting columns¶
You can also pass a list of column names to select columns from the data frame in a specified order.
[10]:
df[['Country', 'InvoiceDate']].head(5)
[10]:
Country | InvoiceDate | |
---|---|---|
0 | United Kingdom | 2010-12-01 08:26:00 |
1 | United Kingdom | 2010-12-01 08:26:00 |
2 | United Kingdom | 2010-12-01 08:26:00 |
3 | United Kingdom | 2010-12-01 08:26:00 |
4 | United Kingdom | 2010-12-01 08:26:00 |
5 rows × 2 columns
Boolean Indexing¶
We also allow you to filter the data frame using boolean indexing. Under the hood, a boolean index maps to a term
query that is then passed to Elasticsearch to filter the index.
[11]:
# the construction of a boolean vector maps directly to an elasticsearch query
print(df['Country']=='Germany')
df[(df['Country']=='Germany')].head(5)
{'term': {'Country': 'Germany'}}
[11]:
Country | CustomerID | ... | StockCode | UnitPrice | |
---|---|---|---|---|---|
1109 | Germany | 12662.0 | ... | 22809 | 2.95 |
1110 | Germany | 12662.0 | ... | 84347 | 2.55 |
1111 | Germany | 12662.0 | ... | 84945 | 0.85 |
1112 | Germany | 12662.0 | ... | 22242 | 1.65 |
1113 | Germany | 12662.0 | ... | 22244 | 1.95 |
5 rows × 8 columns
We can also filter the data frame using a list of values.
[12]:
print(df['Country'].isin(['Germany', 'United States']))
df[df['Country'].isin(['Germany', 'United Kingdom'])].head(5)
{'terms': {'Country': ['Germany', 'United States']}}
[12]:
Country | CustomerID | ... | StockCode | UnitPrice | |
---|---|---|---|---|---|
0 | United Kingdom | 17850.0 | ... | 85123A | 2.55 |
1 | United Kingdom | 17850.0 | ... | 71053 | 3.39 |
2 | United Kingdom | 17850.0 | ... | 84406B | 2.75 |
3 | United Kingdom | 17850.0 | ... | 84029G | 3.39 |
4 | United Kingdom | 17850.0 | ... | 84029E | 3.39 |
5 rows × 8 columns
We can also combine boolean vectors to further filter the data frame.
[13]:
df[(df['Country']=='Germany') & (df['Quantity']>90)]
[13]:
Country | CustomerID | ... | StockCode | UnitPrice |
---|
0 rows × 8 columns
Using this example, let’s see how eland translates this boolean filter to an Elasticsearch bool
query.
[14]:
print(df[(df['Country']=='Germany') & (df['Quantity']>90)].es_info())
es_index_pattern: online-retail
Index:
es_index_field: _id
is_source_field: False
Mappings:
capabilities:
es_field_name is_source es_dtype es_date_format pd_dtype is_searchable is_aggregatable is_scripted aggregatable_es_field_name
Country Country True keyword None object True True False Country
CustomerID CustomerID True double None float64 True True False CustomerID
Description Description True keyword None object True True False Description
InvoiceDate InvoiceDate True keyword None object True True False InvoiceDate
InvoiceNo InvoiceNo True keyword None object True True False InvoiceNo
Quantity Quantity True long None int64 True True False Quantity
StockCode StockCode True keyword None object True True False StockCode
UnitPrice UnitPrice True double None float64 True True False UnitPrice
Operations:
tasks: [('boolean_filter': ('boolean_filter': {'bool': {'must': [{'term': {'Country': 'Germany'}}, {'range': {'Quantity': {'gt': 90}}}]}}))]
size: None
sort_params: None
_source: ['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode', 'UnitPrice']
body: {'query': {'bool': {'must': [{'term': {'Country': 'Germany'}}, {'range': {'Quantity': {'gt': 90}}}]}}}
post_processing: []
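Since the body above is plain Elasticsearch query DSL, the same filter can be issued directly with the low-level client; a minimal sketch, assuming a recent elasticsearch-py client connected to the same cluster:
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')
# The same bool query that eland generated above
resp = es.search(
    index='online-retail',
    query={'bool': {'must': [
        {'term': {'Country': 'Germany'}},
        {'range': {'Quantity': {'gt': 90}}}
    ]}},
    size=5,
)
print(resp['hits']['total'])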
Aggregation and Descriptive Statistics¶
Let’s begin to ask some questions of our data and use eland to get the answers.
How many different countries are there?
[15]:
df['Country'].nunique()
[15]:
16
What is the total sum of products ordered?
[16]:
df['Quantity'].sum()
[16]:
111960
Show me the sum, mean, min, and max of the Quantity and UnitPrice fields
[17]:
df[['Quantity','UnitPrice']].agg(['sum', 'mean', 'max', 'min'])
[17]:
Quantity | UnitPrice | |
---|---|---|
sum | 111960.000 | 61548.490000 |
mean | 7.464 | 4.103233 |
max | 2880.000 | 950.990000 |
min | -9360.000 | 0.000000 |
Give me descriptive statistics for the entire data frame
[18]:
# NBVAL_IGNORE_OUTPUT
df.describe()
[18]:
CustomerID | Quantity | UnitPrice | |
---|---|---|---|
count | 10729.000000 | 15000.000000 | 15000.000000 |
mean | 15590.776680 | 7.464000 | 4.103233 |
std | 1764.189592 | 85.930116 | 20.106214 |
min | 12347.000000 | -9360.000000 | 0.000000 |
25% | 14222.249164 | 1.000000 | 1.250000 |
50% | 15663.037856 | 2.000000 | 2.510000 |
75% | 17219.040670 | 6.425347 | 4.210000 |
max | 18239.000000 | 2880.000000 | 950.990000 |
Show me a histogram of numeric columns
[19]:
df[(df['Quantity']>-50) &
   (df['Quantity']<50) &
   (df['UnitPrice']>0) &
   (df['UnitPrice']<100)][['Quantity', 'UnitPrice']].hist(figsize=[12,4], bins=30)
plt.show()

[20]:
df[(df['Quantity']>-50) &
   (df['Quantity']<50) &
   (df['UnitPrice']>0) &
   (df['UnitPrice']<100)][['Quantity', 'UnitPrice']].hist(figsize=[12,4], bins=30, log=True)
plt.show()

[21]:
df.query('Quantity>50 & UnitPrice<100')
[21]:
Country | CustomerID | ... | StockCode | UnitPrice | |
---|---|---|---|---|---|
46 | United Kingdom | 13748.0 | ... | 22086 | 2.55 |
83 | United Kingdom | 15291.0 | ... | 21733 | 2.55 |
96 | United Kingdom | 14688.0 | ... | 21212 | 0.42 |
102 | United Kingdom | 14688.0 | ... | 85071B | 0.38 |
176 | United Kingdom | 16029.0 | ... | 85099C | 1.65 |
... | ... | ... | ... | ... | ... |
11924 | United Kingdom | 14708.0 | ... | 84945 | 0.72 |
12007 | United Kingdom | 18113.0 | ... | 84946 | 1.06 |
12015 | United Kingdom | 17596.0 | ... | 17003 | 0.21 |
12058 | United Kingdom | 17596.0 | ... | 84536A | 0.42 |
12448 | EIRE | 14911.0 | ... | 84945 | 0.85 |
258 rows × 8 columns
Arithmetic Operations¶
Numeric values
[22]:
df['Quantity'].head()
[22]:
0 6
1 6
2 8
3 6
4 6
Name: Quantity, dtype: int64
[23]:
df['UnitPrice'].head()
[23]:
0 2.55
1 3.39
2 2.75
3 3.39
4 3.39
Name: UnitPrice, dtype: float64
[24]:
product = df['Quantity'] * df['UnitPrice']
[25]:
product.head()
[25]:
0 15.30
1 20.34
2 22.00
3 20.34
4 20.34
dtype: float64
String concatenation
[26]:
df['Country'] + df['StockCode']
[26]:
0 United Kingdom85123A
1 United Kingdom71053
2 United Kingdom84406B
3 United Kingdom84029G
4 United Kingdom84029E
...
12495 United Kingdom84692
12496 United Kingdom22075
12497 United Kingdom20979
12498 United Kingdom20975
12499 United Kingdom22445
Length: 15000, dtype: object
Development¶
Contributing to Eland¶
Eland is an open source project and we love to receive contributions from our community — you! There are many ways to contribute, from writing tutorials or blog posts, improving the documentation, and submitting bug reports and feature requests, to writing code that can be incorporated into eland itself.
Bug reports¶
If you think you have found a bug in eland, first make sure that you are testing against the latest version of eland - your issue may already have been fixed. If not, search our issues list on GitHub in case a similar issue has already been opened.
It is very helpful if you can prepare a reproduction of the bug. In other words, provide a small test case which we can run to confirm your bug. It makes it easier to find the problem and to fix it. Test cases should be provided as python scripts, ideally with some details of your Elasticsearch environment and index mappings, and (where appropriate) a pandas example.
Provide as much information as you can. You may think that the problem lies with your query, when actually it depends on how your data is indexed. The easier it is for us to recreate your problem, the faster it is likely to be fixed.
Feature requests¶
If you find yourself wishing for a feature that doesn’t exist in eland, you are probably not alone. There are bound to be others out there with similar needs. Many of the features that eland has today have been added because our users saw the need. Open an issue on our issues list on GitHub which describes the feature you would like to see, why you need it, and how it should work.
Contributing code and documentation changes¶
If you have a bugfix or new feature that you would like to contribute to eland, please find or open an issue about it first. Talk about what you would like to do. It may be that somebody is already working on it, or that there are particular issues that you should know about before implementing the change.
We enjoy working with contributors to get their code accepted. There are many approaches to fixing a problem and it is important to find the best approach before writing too much code.
Note that it is unlikely the project will merge refactors for the sake of refactoring. These types of pull requests have a high cost to maintainers in reviewing and testing with little to no tangible benefit. This especially includes changes generated by tools.
The process for contributing to any of the Elastic repositories is similar. Details for individual projects can be found below.
Fork and clone the repository¶
You will need to fork the main eland code or documentation repository and clone it to your local machine. See the GitHub fork a repo page for help.
Further instructions for specific projects are given below.
Submitting your changes¶
Once your changes and tests are ready to submit for review:
Run the linter and test suite to ensure your changes do not break the existing code: (TODO Add link to the testing document)
# Run auto-format, lint, and the mypy type checker for your changes
$ nox -s format
# Run the test suite
$ pytest --doctest-modules eland/ tests/
$ pytest --nbval tests/notebook/
Sign the Contributor License Agreement
Please make sure you have signed our Contributor License Agreement. We are not asking you to assign copyright to us, but to give us the right to distribute your code without restriction. We ask this of all contributors in order to assure our users of the origin and continuing existence of the code. You only need to sign the CLA once.
Rebase your changes
Update your local repository with the most recent code from the main eland repository, and rebase your branch on top of the latest main branch. We prefer your initial changes to be squashed into a single commit. Later, if we ask you to make changes, add them as separate commits. This makes them easier to review. As a final step before merging we will either ask you to squash all commits yourself or we’ll do it for you.
Submit a pull request
Push your local changes to your forked copy of the repository and submit a pull request. In the pull request, choose a title which sums up the changes that you have made, and in the body provide more details about what your changes do. Also mention the number of the issue where discussion has taken place, e.g. “Closes #123”.
Then sit back and wait. There will probably be discussion about the pull request and, if any changes are needed, we would love to work with you to get your pull request merged into eland.
Please adhere to the general guideline that you should never force push to a publicly shared branch. Once you have opened your pull request, you should consider your branch publicly shared. Instead of force pushing, you can add incremental commits; this is generally easier on your reviewers. If you need to pick up changes from main, you can merge main into your branch. A reviewer might ask you to rebase a long-running pull request, in which case force pushing is okay for that request. Note that squashing at the end of the review process should also not be done; that can happen when the pull request is integrated via GitHub.
Contributing to the eland codebase¶
Repository: https://github.com/elastic/eland
We develop internally using the PyCharm IDE, currently requiring a minimum version of PyCharm 2019.2.4.
Configuring PyCharm And Running Tests¶
(All commands should be run from the module root.)
- Create a new project via 'Check out from Version Control'->'Git' on the “Welcome to PyCharm” page (or other)
- Enter the URL to your fork of eland (e.g. git@github.com:stevedodson/eland.git) and click 'Yes' for 'Checkout from Version Control'
- Configure the PyCharm environment:
  - In 'Preferences', configure 'Project: eland'->'Project Interpreter'. Generally, we recommend creating a virtual environment. (TODO link to installing for python version support)
  - In 'Preferences', set 'Tools'->'Python Integrated Tools'->'Default test runner' to pytest
  - In 'Preferences', set 'Tools'->'Python Integrated Tools'->'Docstring format' to numpy
- Install the development requirements. Open a terminal in the virtual environment and run
  pip install -r requirements-dev.txt
- Set up an Elasticsearch instance with Docker:
  ELASTICSEARCH_VERSION=elasticsearch:7.x-SNAPSHOT .ci/run-elasticsearch.sh
  Check http://localhost:9200 to verify that the Elasticsearch instance is running.
- Install the local eland module (required to execute notebook tests):
  python setup.py install
- To set up the test environment (note: this modifies Elasticsearch indices), run
  python -m tests.setup_tests
- To validate the installation, open a Python console and run
  import eland as ed
  ed_df = ed.DataFrame('localhost', 'flights')
- To run the automatic formatter and check for lint issues, run
  nox -s format
- To test specific versions of Python, run
  nox -s test-3.8
Documentation¶
- Install pandoc on your system:
  # For Ubuntu or Debian
  sudo apt-get install -y pandoc
- Install the documentation requirements. Open a terminal in the virtual environment and run
  pip install -r docs/requirements-docs.txt
- To verify or generate the documentation, run
  nox -s docs
Implementation Details¶
The goal of an eland.DataFrame is to enable users who are familiar with pandas.DataFrame to access, explore and manipulate data that resides in Elasticsearch.
Ideally, all data should reside in Elasticsearch and not in memory. This restricts the API, but allows access to huge datasets that do not fit into memory, and allows use of powerful Elasticsearch features such as aggregations.
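As a minimal sketch of what this looks like in practice, assuming a local cluster loaded with the example flights index used elsewhere in these docs:

import eland as ed

# A view over the Elasticsearch index; no documents are loaded into memory.
df = ed.DataFrame('localhost', 'flights')

# Computed as an Elasticsearch aggregation on the cluster, so only the
# aggregated result is transferred to the client.
print(df['AvgTicketPrice'].mean())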
Pandas and 3rd Party Storage Systems¶
Generally, integrations with 3rd party storage systems (SQL, Google BigQuery, etc.) involve accessing these systems and reading all external data into an in-core pandas data structure. This also applies to Apache Arrow structures.
Whilst this provides access to data in these systems, for large datasets this can require significant in-core memory, and for systems such as Elasticsearch, bulk export of data can be an inefficient way of exploring the data.
An alternative option is to create an API that proxies pandas.DataFrame-like calls to Elasticsearch queries and operations. This could allow the Elasticsearch cluster to perform operations such as aggregations rather than exporting all the data and performing this operation in-core.
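To make the contrast concrete, here is a sketch of what such a proxied call could translate to, using the low-level elasticsearch-py client directly. The host, index, and field names are placeholders, and the keyword-argument style assumes an 8.x client:

from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')  # placeholder host

# size=0 requests no documents, only the aggregation result, so the
# cluster does the work and a single value is returned to the client.
resp = es.search(
    index='flights',
    size=0,
    aggs={'avg_price': {'avg': {'field': 'AvgTicketPrice'}}},
)
print(resp['aggregations']['avg_price']['value'])

A bulk export of the same index would instead page every document through the client before computing the result in-core.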
Implementation Options¶
An option would be to replace the pandas.DataFrame backend in-core memory structures with Elasticsearch accessors. This would allow full access to the pandas.DataFrame APIs. However, this has issues:
- If a pandas.DataFrame instance maps to an index, typical manipulation of a pandas.DataFrame may involve creating many derived pandas.DataFrame instances. Constructing an index per pandas.DataFrame may result in many Elasticsearch indexes and significant load on Elasticsearch. For example, df_a = df['a'] should not require separate Elasticsearch indices for df and df_a.
- Not all pandas.DataFrame APIs map to things we may want to do in Elasticsearch. In particular, API calls that involve exporting all data from Elasticsearch into memory, e.g. df.to_dict().
- The backend pandas.DataFrame structures are not easily abstractable and are deeply embedded in the implementation.
Another option is to create an eland.DataFrame API that mimics appropriate aspects of the pandas.DataFrame API. This resolves some of the issues above:
- df_a = df['a'] could be implemented as a change to the Elasticsearch query used, rather than as a new index (see the sketch after this list).
- Instead of supporting the entire pandas.DataFrame API, we can support a subset appropriate for Elasticsearch. If additional calls are required, we could create an eland.DataFrame._to_pandas() method which would explicitly export all data to a pandas.DataFrame.
- Creating a new eland.DataFrame API gives us full flexibility in terms of implementation. However, it does create a large amount of work which may duplicate a lot of the pandas code (for example, printing objects), and this creates maintenance issues.
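To illustrate the query-proxying idea, the following is a deliberately simplified, hypothetical sketch and not eland's actual implementation: selecting a column changes the query state the object carries, rather than creating an index or fetching any data.

from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class QueryDataFrame:
    """Hypothetical sketch: a DataFrame-like view holding only query state."""

    index: str
    columns: Optional[List[str]] = None  # None means "all fields"

    def __getitem__(self, column: str) -> 'QueryDataFrame':
        # df['a'] narrows the _source filter of the eventual query;
        # no Elasticsearch index is created and no data is fetched.
        return replace(self, columns=[column])

    def to_es_query(self) -> dict:
        # Build the search body lazily, only when results are needed.
        body: dict = {'query': {'match_all': {}}}
        if self.columns is not None:
            body['_source'] = self.columns
        return body


df = QueryDataFrame(index='flights')
df_a = df['a']
print(df_a.to_es_query())  # {'query': {'match_all': {}}, '_source': ['a']}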