Eland: DataFrames and Machine Learning backed by Elasticsearch#

Date: Jul 11, 2022 Version: 8.3.0

Useful links: Source Repository | Issues & Ideas | Q&A Support

Eland is a Python Elasticsearch client for exploring and analyzing data in Elasticsearch with a familiar Pandas-compatible API.

Where possible, the package uses existing Python APIs and data structures to make it easy to switch from numpy, pandas, and scikit-learn to their Elasticsearch-powered equivalents. In general, the data resides in Elasticsearch and not in memory, which allows Eland to access large datasets stored in Elasticsearch.
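For example, the following minimal sketch (assuming a local Elasticsearch cluster at http://localhost:9200 that contains the flights demo index used in the examples below) creates an Eland DataFrame and runs pandas-style operations that Elasticsearch executes server-side:

import eland as ed

# Connect to a local cluster and wrap the 'flights' index in a DataFrame
df = ed.DataFrame('http://localhost:9200', 'flights')

# These calls are translated into Elasticsearch queries and aggregations;
# only the (small) results are pulled back into Python memory
print(df.shape)                      # row/column counts via the count API
print(df['AvgTicketPrice'].mean())   # mean computed by an aggregation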

Installing Eland#

Eland can be installed from PyPI via pip:

$ python -m pip install eland

Eland can also be installed from Conda Forge with Conda:

$ conda install -c conda-forge eland
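A quick way to check that the installation worked (assuming eland exposes its version string, as recent releases do) is to import the package and print its version:

$ python -c "import eland; print(eland.__version__)"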

Getting Started#

If it’s your first time using Eland, we recommend looking through the Examples documentation for ideas on what Eland is capable of.

If you’re new to Elasticsearch we recommend reading the documentation.

API Reference#

This page gives an overview of all public Eland objects, functions and methods. All classes and functions exposed in the eland.* namespace are public.

Supported Pandas APIs#

The following table lists each method or property in the first column and, in the second column, whether Eland supports it on the corresponding object.

Note

Even if an interface is listed here as “supported”, that doesn’t mean all of its parameters are implemented; sometimes only a subset of parameters can be supported.

If you need an operation that is listed as not implemented, feel free to open an issue or give a thumbs up to an existing one. Contributions are also welcome!

Note

Some pandas methods cannot be implemented because Eland uses Elasticsearch as the backend for the majority of data processing. Methods like DataFrame.iloc[i] or DataFrame.transpose() likely won’t be implementable in Eland because of this constraint.
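If you only need such an operation on a manageable subset of your data, one workaround (a sketch, assuming the flights demo index used in the examples below) is to reduce the Eland DataFrame first and then convert that subset to a pandas DataFrame with eland.eland_to_pandas, where the full pandas API is available:

import eland as ed

ed_df = ed.DataFrame('http://localhost:9200', 'flights')

# Filter and truncate in Elasticsearch first, then pull the subset into memory
pd_subset = ed.eland_to_pandas(ed_df[ed_df.Cancelled == True].head(100))

# Operations that Eland cannot implement work on the in-memory pandas copy
print(pd_subset.iloc[0])
print(pd_subset.transpose())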

This list should be automatically generated with the utils/generate-supported-apis.py script rather than modified manually.

Method or Property

Supported?

ed.DataFrame.abs()

No

ed.DataFrame.add()

No

ed.DataFrame.add_prefix()

No

ed.DataFrame.add_suffix()

No

ed.DataFrame.agg()

Yes

ed.DataFrame.aggregate()

Yes

ed.DataFrame.align()

No

ed.DataFrame.all()

No

ed.DataFrame.any()

No

ed.DataFrame.append()

No

ed.DataFrame.apply()

No

ed.DataFrame.applymap()

No

ed.DataFrame.asfreq()

No

ed.DataFrame.asof()

No

ed.DataFrame.assign()

No

ed.DataFrame.astype()

No

ed.DataFrame.at

No

ed.DataFrame.at_time()

No

ed.DataFrame.attrs

No

ed.DataFrame.axes

No

ed.DataFrame.backfill()

No

ed.DataFrame.between_time()

No

ed.DataFrame.bfill()

No

ed.DataFrame.bool()

No

ed.DataFrame.boxplot()

No

ed.DataFrame.clip()

No

ed.DataFrame.columns

Yes

ed.DataFrame.combine()

No

ed.DataFrame.combine_first()

No

ed.DataFrame.compare()

No

ed.DataFrame.convert_dtypes()

No

ed.DataFrame.copy()

No

ed.DataFrame.corr()

No

ed.DataFrame.corrwith()

No

ed.DataFrame.count()

Yes

ed.DataFrame.cov()

No

ed.DataFrame.cummax()

No

ed.DataFrame.cummin()

No

ed.DataFrame.cumprod()

No

ed.DataFrame.cumsum()

No

ed.DataFrame.describe()

Yes

ed.DataFrame.diff()

No

ed.DataFrame.div()

No

ed.DataFrame.divide()

No

ed.DataFrame.dot()

No

ed.DataFrame.drop()

Yes

ed.DataFrame.drop_duplicates()

No

ed.DataFrame.droplevel()

No

ed.DataFrame.dropna()

No

ed.DataFrame.dtypes

Yes

ed.DataFrame.duplicated()

No

ed.DataFrame.empty

Yes

ed.DataFrame.eq()

No

ed.DataFrame.equals()

No

ed.DataFrame.eval()

No

ed.DataFrame.ewm()

No

ed.DataFrame.expanding()

No

ed.DataFrame.explode()

No

ed.DataFrame.ffill()

No

ed.DataFrame.fillna()

No

ed.DataFrame.filter()

Yes

ed.DataFrame.first()

No

ed.DataFrame.first_valid_index()

No

ed.DataFrame.flags

No

ed.DataFrame.floordiv()

No

ed.DataFrame.from_dict()

No

ed.DataFrame.from_records()

No

ed.DataFrame.ge()

No

ed.DataFrame.get()

Yes

ed.DataFrame.groupby()

Yes

ed.DataFrame.gt()

No

ed.DataFrame.head()

Yes

ed.DataFrame.hist()

Yes

ed.DataFrame.iat

No

ed.DataFrame.idxmax()

Yes

ed.DataFrame.idxmin()

Yes

ed.DataFrame.iloc

No

ed.DataFrame.index

Yes

ed.DataFrame.infer_objects()

No

ed.DataFrame.info()

Yes

ed.DataFrame.insert()

No

ed.DataFrame.interpolate()

No

ed.DataFrame.isin()

No

ed.DataFrame.isna()

No

ed.DataFrame.isnull()

No

ed.DataFrame.items()

No

ed.DataFrame.iteritems()

No

ed.DataFrame.iterrows()

Yes

ed.DataFrame.itertuples()

Yes

ed.DataFrame.join()

No

ed.DataFrame.keys()

Yes

ed.DataFrame.kurt()

No

ed.DataFrame.kurtosis()

No

ed.DataFrame.last()

No

ed.DataFrame.last_valid_index()

No

ed.DataFrame.le()

No

ed.DataFrame.loc

No

ed.DataFrame.lookup()

No

ed.DataFrame.lt()

No

ed.DataFrame.mad()

Yes

ed.DataFrame.mask()

No

ed.DataFrame.max()

Yes

ed.DataFrame.mean()

Yes

ed.DataFrame.median()

Yes

ed.DataFrame.melt()

No

ed.DataFrame.memory_usage()

No

ed.DataFrame.merge()

No

ed.DataFrame.min()

Yes

ed.DataFrame.mod()

No

ed.DataFrame.mode()

Yes

ed.DataFrame.mul()

No

ed.DataFrame.multiply()

No

ed.DataFrame.ndim

Yes

ed.DataFrame.ne()

No

ed.DataFrame.nlargest()

No

ed.DataFrame.notna()

No

ed.DataFrame.notnull()

No

ed.DataFrame.nsmallest()

No

ed.DataFrame.nunique()

Yes

ed.DataFrame.pad()

No

ed.DataFrame.pct_change()

No

ed.DataFrame.pipe()

No

ed.DataFrame.pivot()

No

ed.DataFrame.pivot_table()

No

ed.DataFrame.pop()

No

ed.DataFrame.pow()

No

ed.DataFrame.prod()

No

ed.DataFrame.product()

No

ed.DataFrame.quantile()

Yes

ed.DataFrame.query()

Yes

ed.DataFrame.radd()

No

ed.DataFrame.rank()

No

ed.DataFrame.rdiv()

No

ed.DataFrame.reindex()

No

ed.DataFrame.reindex_like()

No

ed.DataFrame.rename()

No

ed.DataFrame.rename_axis()

No

ed.DataFrame.reorder_levels()

No

ed.DataFrame.replace()

No

ed.DataFrame.resample()

No

ed.DataFrame.reset_index()

No

ed.DataFrame.rfloordiv()

No

ed.DataFrame.rmod()

No

ed.DataFrame.rmul()

No

ed.DataFrame.rolling()

No

ed.DataFrame.round()

No

ed.DataFrame.rpow()

No

ed.DataFrame.rsub()

No

ed.DataFrame.rtruediv()

No

ed.DataFrame.sample()

Yes

ed.DataFrame.select_dtypes()

Yes

ed.DataFrame.sem()

No

ed.DataFrame.set_axis()

No

ed.DataFrame.set_flags()

No

ed.DataFrame.set_index()

No

ed.DataFrame.shape

Yes

ed.DataFrame.shift()

No

ed.DataFrame.size

Yes

ed.DataFrame.skew()

No

ed.DataFrame.slice_shift()

No

ed.DataFrame.sort_index()

No

ed.DataFrame.sort_values()

No

ed.DataFrame.squeeze()

No

ed.DataFrame.stack()

No

ed.DataFrame.std()

Yes

ed.DataFrame.style

No

ed.DataFrame.sub()

No

ed.DataFrame.subtract()

No

ed.DataFrame.sum()

Yes

ed.DataFrame.swapaxes()

No

ed.DataFrame.swaplevel()

No

ed.DataFrame.T

No

ed.DataFrame.tail()

Yes

ed.DataFrame.take()

No

ed.DataFrame.to_clipboard()

No

ed.DataFrame.to_csv()

Yes

ed.DataFrame.to_dict()

No

ed.DataFrame.to_excel()

No

ed.DataFrame.to_feather()

No

ed.DataFrame.to_gbq()

No

ed.DataFrame.to_hdf()

No

ed.DataFrame.to_html()

Yes

ed.DataFrame.to_json()

No

ed.DataFrame.to_latex()

No

ed.DataFrame.to_markdown()

No

ed.DataFrame.to_numpy()

Yes

ed.DataFrame.to_parquet()

No

ed.DataFrame.to_period()

No

ed.DataFrame.to_pickle()

No

ed.DataFrame.to_records()

No

ed.DataFrame.to_sql()

No

ed.DataFrame.to_stata()

No

ed.DataFrame.to_string()

Yes

ed.DataFrame.to_timestamp()

No

ed.DataFrame.to_xarray()

No

ed.DataFrame.to_xml()

No

ed.DataFrame.transform()

No

ed.DataFrame.transpose()

No

ed.DataFrame.truediv()

No

ed.DataFrame.truncate()

No

ed.DataFrame.tshift()

No

ed.DataFrame.tz_convert()

No

ed.DataFrame.tz_localize()

No

ed.DataFrame.unstack()

No

ed.DataFrame.update()

No

ed.DataFrame.value_counts()

No

ed.DataFrame.values

Yes

ed.DataFrame.var()

Yes

ed.DataFrame.where()

No

ed.DataFrame.xs()

No

ed.DataFrame.__abs__()

No

ed.DataFrame.__add__()

No

ed.DataFrame.__and__()

No

ed.DataFrame.__annotations__

No

ed.DataFrame.__array__()

No

ed.DataFrame.__array_priority__

No

ed.DataFrame.__array_ufunc__()

No

ed.DataFrame.__array_wrap__()

No

ed.DataFrame.__bool__()

No

ed.DataFrame.__contains__()

No

ed.DataFrame.__copy__()

No

ed.DataFrame.__deepcopy__()

No

ed.DataFrame.__delattr__

Yes

ed.DataFrame.__delitem__()

No

ed.DataFrame.__dict__

Yes

ed.DataFrame.__dir__()

Yes

ed.DataFrame.__divmod__()

No

ed.DataFrame.__doc__

Yes

ed.DataFrame.__eq__()

Yes

ed.DataFrame.__finalize__()

No

ed.DataFrame.__floordiv__()

No

ed.DataFrame.__format__

Yes

ed.DataFrame.__ge__()

Yes

ed.DataFrame.__getattr__()

Yes

ed.DataFrame.__getattribute__

Yes

ed.DataFrame.__getitem__()

Yes

ed.DataFrame.__getstate__()

No

ed.DataFrame.__gt__()

Yes

ed.DataFrame.__hash__

Yes

ed.DataFrame.__iadd__()

No

ed.DataFrame.__iand__()

No

ed.DataFrame.__ifloordiv__()

No

ed.DataFrame.__imod__()

No

ed.DataFrame.__imul__()

No

ed.DataFrame.__init__()

Yes

ed.DataFrame.__init_subclass__

Yes

ed.DataFrame.__invert__()

No

ed.DataFrame.__ior__()

No

ed.DataFrame.__ipow__()

No

ed.DataFrame.__isub__()

No

ed.DataFrame.__iter__()

No

ed.DataFrame.__itruediv__()

No

ed.DataFrame.__ixor__()

No

ed.DataFrame.__le__()

Yes

ed.DataFrame.__len__()

Yes

ed.DataFrame.__lt__()

Yes

ed.DataFrame.__matmul__()

No

ed.DataFrame.__mod__()

No

ed.DataFrame.__module__

Yes

ed.DataFrame.__mul__()

No

ed.DataFrame.__ne__()

Yes

ed.DataFrame.__neg__()

No

ed.DataFrame.__new__

Yes

ed.DataFrame.__nonzero__()

No

ed.DataFrame.__or__()

No

ed.DataFrame.__pos__()

No

ed.DataFrame.__pow__()

No

ed.DataFrame.__radd__()

No

ed.DataFrame.__rand__()

No

ed.DataFrame.__rdivmod__()

No

ed.DataFrame.__reduce__

Yes

ed.DataFrame.__reduce_ex__

Yes

ed.DataFrame.__repr__()

Yes

ed.DataFrame.__rfloordiv__()

No

ed.DataFrame.__rmatmul__()

No

ed.DataFrame.__rmod__()

No

ed.DataFrame.__rmul__()

No

ed.DataFrame.__ror__()

No

ed.DataFrame.__round__()

No

ed.DataFrame.__rpow__()

No

ed.DataFrame.__rsub__()

No

ed.DataFrame.__rtruediv__()

No

ed.DataFrame.__rxor__()

No

ed.DataFrame.__setattr__()

Yes

ed.DataFrame.__setitem__()

No

ed.DataFrame.__setstate__()

No

ed.DataFrame.__sizeof__()

Yes

ed.DataFrame.__str__

Yes

ed.DataFrame.__sub__()

No

ed.DataFrame.__subclasshook__

Yes

ed.DataFrame.__truediv__()

No

ed.DataFrame.__weakref__

Yes

ed.DataFrame.__xor__()

No

ed.Series.abs()

No

ed.Series.add()

Yes

ed.Series.add_prefix()

No

ed.Series.add_suffix()

No

ed.Series.agg()

No

ed.Series.aggregate()

No

ed.Series.align()

No

ed.Series.all()

No

ed.Series.any()

No

ed.Series.append()

No

ed.Series.apply()

No

ed.Series.argmax()

No

ed.Series.argmin()

No

ed.Series.argsort()

No

ed.Series.array

No

ed.Series.asfreq()

No

ed.Series.asof()

No

ed.Series.astype()

No

ed.Series.at

No

ed.Series.at_time()

No

ed.Series.attrs

No

ed.Series.autocorr()

No

ed.Series.axes

No

ed.Series.backfill()

No

ed.Series.between()

No

ed.Series.between_time()

No

ed.Series.bfill()

No

ed.Series.bool()

No

ed.Series.clip()

No

ed.Series.combine()

No

ed.Series.combine_first()

No

ed.Series.compare()

No

ed.Series.convert_dtypes()

No

ed.Series.copy()

No

ed.Series.corr()

No

ed.Series.count()

No

ed.Series.cov()

No

ed.Series.cummax()

No

ed.Series.cummin()

No

ed.Series.cumprod()

No

ed.Series.cumsum()

No

ed.Series.describe()

Yes

ed.Series.diff()

No

ed.Series.div()

Yes

ed.Series.divide()

Yes

ed.Series.divmod()

No

ed.Series.dot()

No

ed.Series.drop()

No

ed.Series.drop_duplicates()

No

ed.Series.droplevel()

No

ed.Series.dropna()

No

ed.Series.dtype

Yes

ed.Series.dtypes

Yes

ed.Series.duplicated()

No

ed.Series.empty

Yes

ed.Series.eq()

No

ed.Series.equals()

No

ed.Series.ewm()

No

ed.Series.expanding()

No

ed.Series.explode()

No

ed.Series.factorize()

No

ed.Series.ffill()

No

ed.Series.fillna()

No

ed.Series.filter()

Yes

ed.Series.first()

No

ed.Series.first_valid_index()

No

ed.Series.flags

No

ed.Series.floordiv()

Yes

ed.Series.ge()

No

ed.Series.get()

No

ed.Series.groupby()

No

ed.Series.gt()

No

ed.Series.hasnans

No

ed.Series.head()

Yes

ed.Series.hist()

Yes

ed.Series.iat

No

ed.Series.idxmax()

No

ed.Series.idxmin()

No

ed.Series.iloc

No

ed.Series.index

Yes

ed.Series.infer_objects()

No

ed.Series.info()

No

ed.Series.interpolate()

No

ed.Series.is_monotonic

No

ed.Series.is_monotonic_decreasing

No

ed.Series.is_monotonic_increasing

No

ed.Series.is_unique

No

ed.Series.isin()

Yes

ed.Series.isna()

Yes

ed.Series.isnull()

Yes

ed.Series.item()

No

ed.Series.items()

No

ed.Series.iteritems()

No

ed.Series.keys()

No

ed.Series.kurt()

No

ed.Series.kurtosis()

No

ed.Series.last()

No

ed.Series.last_valid_index()

No

ed.Series.le()

No

ed.Series.loc

No

ed.Series.lt()

No

ed.Series.mad()

Yes

ed.Series.map()

No

ed.Series.mask()

No

ed.Series.max()

Yes

ed.Series.mean()

Yes

ed.Series.median()

Yes

ed.Series.memory_usage()

No

ed.Series.min()

Yes

ed.Series.mod()

Yes

ed.Series.mode()

Yes

ed.Series.mul()

Yes

ed.Series.multiply()

Yes

ed.Series.name

Yes

ed.Series.nbytes

No

ed.Series.ndim

Yes

ed.Series.ne()

No

ed.Series.nlargest()

No

ed.Series.notna()

Yes

ed.Series.notnull()

Yes

ed.Series.nsmallest()

No

ed.Series.nunique()

Yes

ed.Series.pad()

No

ed.Series.pct_change()

No

ed.Series.pipe()

No

ed.Series.pop()

No

ed.Series.pow()

Yes

ed.Series.prod()

No

ed.Series.product()

No

ed.Series.quantile()

Yes

ed.Series.radd()

Yes

ed.Series.rank()

No

ed.Series.ravel()

No

ed.Series.rdiv()

Yes

ed.Series.rdivmod()

No

ed.Series.reindex()

No

ed.Series.reindex_like()

No

ed.Series.rename()

Yes

ed.Series.rename_axis()

No

ed.Series.reorder_levels()

No

ed.Series.repeat()

No

ed.Series.replace()

No

ed.Series.resample()

No

ed.Series.reset_index()

No

ed.Series.rfloordiv()

Yes

ed.Series.rmod()

Yes

ed.Series.rmul()

Yes

ed.Series.rolling()

No

ed.Series.round()

No

ed.Series.rpow()

Yes

ed.Series.rsub()

Yes

ed.Series.rtruediv()

Yes

ed.Series.sample()

Yes

ed.Series.searchsorted()

No

ed.Series.sem()

No

ed.Series.set_axis()

No

ed.Series.set_flags()

No

ed.Series.shape

Yes

ed.Series.shift()

No

ed.Series.size

Yes

ed.Series.skew()

No

ed.Series.slice_shift()

No

ed.Series.sort_index()

No

ed.Series.sort_values()

No

ed.Series.squeeze()

No

ed.Series.std()

Yes

ed.Series.sub()

Yes

ed.Series.subtract()

Yes

ed.Series.sum()

Yes

ed.Series.swapaxes()

No

ed.Series.swaplevel()

No

ed.Series.T

No

ed.Series.tail()

Yes

ed.Series.take()

No

ed.Series.to_clipboard()

No

ed.Series.to_csv()

No

ed.Series.to_dict()

No

ed.Series.to_excel()

No

ed.Series.to_frame()

No

ed.Series.to_hdf()

No

ed.Series.to_json()

No

ed.Series.to_latex()

No

ed.Series.to_list()

No

ed.Series.to_markdown()

No

ed.Series.to_numpy()

Yes

ed.Series.to_period()

No

ed.Series.to_pickle()

No

ed.Series.to_sql()

No

ed.Series.to_string()

Yes

ed.Series.to_timestamp()

No

ed.Series.to_xarray()

No

ed.Series.tolist()

No

ed.Series.transform()

No

ed.Series.transpose()

No

ed.Series.truediv()

Yes

ed.Series.truncate()

No

ed.Series.tshift()

No

ed.Series.tz_convert()

No

ed.Series.tz_localize()

No

ed.Series.unique()

Yes

ed.Series.unstack()

No

ed.Series.update()

No

ed.Series.value_counts()

Yes

ed.Series.values

No

ed.Series.var()

Yes

ed.Series.view()

No

ed.Series.where()

No

ed.Series.xs()

No

ed.Series.__abs__()

No

ed.Series.__add__()

Yes

ed.Series.__and__()

No

ed.Series.__annotations__

No

ed.Series.__array__()

No

ed.Series.__array_priority__

No

ed.Series.__array_ufunc__()

No

ed.Series.__array_wrap__()

No

ed.Series.__bool__()

No

ed.Series.__contains__()

No

ed.Series.__copy__()

No

ed.Series.__deepcopy__()

No

ed.Series.__delattr__

Yes

ed.Series.__delitem__()

No

ed.Series.__dict__

Yes

ed.Series.__dir__()

Yes

ed.Series.__divmod__()

No

ed.Series.__doc__

Yes

ed.Series.__eq__()

Yes

ed.Series.__finalize__()

No

ed.Series.__float__()

No

ed.Series.__floordiv__()

Yes

ed.Series.__format__

Yes

ed.Series.__ge__()

Yes

ed.Series.__getattr__()

No

ed.Series.__getattribute__

Yes

ed.Series.__getitem__()

No

ed.Series.__getstate__()

No

ed.Series.__gt__()

Yes

ed.Series.__hash__

Yes

ed.Series.__iadd__()

No

ed.Series.__iand__()

No

ed.Series.__ifloordiv__()

No

ed.Series.__imod__()

No

ed.Series.__imul__()

No

ed.Series.__init__()

Yes

ed.Series.__init_subclass__

Yes

ed.Series.__int__()

No

ed.Series.__invert__()

No

ed.Series.__ior__()

No

ed.Series.__ipow__()

No

ed.Series.__isub__()

No

ed.Series.__iter__()

No

ed.Series.__itruediv__()

No

ed.Series.__ixor__()

No

ed.Series.__le__()

Yes

ed.Series.__len__()

Yes

ed.Series.__long__()

No

ed.Series.__lt__()

Yes

ed.Series.__matmul__()

No

ed.Series.__mod__()

Yes

ed.Series.__module__

Yes

ed.Series.__mul__()

Yes

ed.Series.__ne__()

Yes

ed.Series.__neg__()

No

ed.Series.__new__

Yes

ed.Series.__nonzero__()

No

ed.Series.__or__()

No

ed.Series.__pos__()

No

ed.Series.__pow__()

Yes

ed.Series.__radd__()

Yes

ed.Series.__rand__()

No

ed.Series.__rdivmod__()

No

ed.Series.__reduce__

Yes

ed.Series.__reduce_ex__

Yes

ed.Series.__repr__()

Yes

ed.Series.__rfloordiv__()

Yes

ed.Series.__rmatmul__()

No

ed.Series.__rmod__()

Yes

ed.Series.__rmul__()

Yes

ed.Series.__ror__()

No

ed.Series.__round__()

No

ed.Series.__rpow__()

Yes

ed.Series.__rsub__()

Yes

ed.Series.__rtruediv__()

Yes

ed.Series.__rxor__()

No

ed.Series.__setattr__()

Yes

ed.Series.__setitem__()

No

ed.Series.__setstate__()

No

ed.Series.__sizeof__()

Yes

ed.Series.__str__

Yes

ed.Series.__sub__()

Yes

ed.Series.__subclasshook__

Yes

ed.Series.__truediv__()

Yes

ed.Series.__weakref__

Yes

ed.Series.__xor__()

No

DataFrame#

Constructor#
Attributes and Underlying Data#
Indexing, Iteration#
Function Application, GroupBy & Window#

Note

Elasticsearch cardinality (distinct count) aggregations are approximations based on the HyperLogLog++ algorithm, so results may not be exact.
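For example, distinct counts in Eland are served by a cardinality aggregation, so they may differ slightly from an exact pandas count on high-cardinality fields (a sketch using the flights demo index from the examples below):

import eland as ed

ed_flights = ed.DataFrame('http://localhost:9200', 'flights')

# nunique() is backed by a cardinality aggregation (HyperLogLog++),
# so the reported counts are approximate for high-cardinality columns
print(ed_flights[['Carrier', 'Origin', 'Dest']].nunique())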

Computations / Descriptive Stats#
Reindexing / Selection / Label Manipulation#
Plotting#
Elasticsearch Functions#
Serialization / IO / Conversion#

Series#

Constructor#
Attributes and Underlying Data#
Indexing, Iteration#
Binary Operator Functions#
Computations / Descriptive Stats#
Reindexing / Selection / Label Manipulation#
Plotting#
Serialization / IO / Conversion#
Elasticsearch Functions#

Machine Learning#

Machine learning is built into the Elastic Stack and enables users to gain insights into their Elasticsearch data. There is a wide range of capabilities, from identifying anomalies in your data to training and deploying regression or classification models based on Elasticsearch data.

To use the Elastic Stack machine learning features, you must have the appropriate license and at least one machine learning node in your Elasticsearch cluster. If Elastic Stack security features are enabled, you must also ensure your users have the necessary privileges.

The fastest way to get started with machine learning features is to start a free 14-day trial of Elastic Cloud.

See the Elasticsearch Machine Learning documentation for more details.
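As a sketch of the workflow (parameter names follow recent Eland releases and may differ by version; the cluster URL, model ID and feature names here are hypothetical), a trained scikit-learn model can be serialized into Elasticsearch with eland.ml.MLModel and then used for inference:

from sklearn.tree import DecisionTreeClassifier
from eland.ml import MLModel

# Train a small scikit-learn classifier on in-memory data
training_data = [[0.1, 0.2], [1.1, 1.5], [0.2, 0.1], [1.3, 1.2]]
labels = [0, 1, 0, 1]
sk_model = DecisionTreeClassifier()
sk_model.fit(training_data, labels)

# Serialize the trained model into the Elasticsearch cluster
es_model = MLModel.import_model(
    es_client='http://localhost:9200',        # or an Elasticsearch client instance
    model_id='example-decision-tree',         # hypothetical model ID
    model=sk_model,
    feature_names=['feature_0', 'feature_1'],
    es_if_exists='replace',
)

# Run predictions with the deployed model, then remove it
print(es_model.predict(training_data))
es_model.delete_model()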

MLModel#
Constructor#
Predictions#
Manage Models#

Index#

Many of these methods, or variants thereof, are also available on the objects that contain an index (Series/DataFrame), and those should generally be preferred over calling these methods directly.
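For example, the index metadata is usually reached through an existing DataFrame rather than by constructing an Index directly (a sketch using the flights demo index from the examples below):

import eland as ed

ed_flights = ed.DataFrame('http://localhost:9200', 'flights')

# The Eland Index records which Elasticsearch field identifies each row
print(ed_flights.index.es_index_field)   # '_id' by default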

Constructor#

General Utility Functions#

Pandas and Eland#
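The utilities in this section move data between pandas and Elasticsearch. A minimal round-trip sketch (the client URL and destination index name are illustrative):

import pandas as pd
import eland as ed

pd_df = pd.DataFrame({'name': ['a', 'b'], 'value': [1.0, 2.5]})

# Write the pandas DataFrame into an Elasticsearch index
ed_df = ed.pandas_to_eland(
    pd_df=pd_df,
    es_client='http://localhost:9200',
    es_dest_index='example-index',   # hypothetical index name
    es_if_exists='replace',          # overwrite the index if it already exists
    es_refresh=True,                 # wait for the data to be searchable
)

# Read the data back into memory as a pandas DataFrame
round_trip = ed.eland_to_pandas(ed_df)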

Input/Output#

Flat File#

Examples#

Eland Demo Notebook#

[1]:
import eland as ed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from elasticsearch import Elasticsearch

# Import standard test settings for consistent results
from eland.conftest import *
Compare Eland DataFrame vs pandas DataFrame#

Create an eland.DataFrame from a flights index

[2]:
ed_flights = ed.DataFrame('http://localhost:9200', 'flights')
[3]:
type(ed_flights)
[3]:
eland.dataframe.DataFrame

Compare to pandas DataFrame (created from the same data)

[4]:
pd_flights = ed.eland_to_pandas(ed_flights)
[5]:
type(pd_flights)
[5]:
pandas.core.frame.DataFrame
Attributes and underlying data#
DataFrame.columns#
[6]:
pd_flights.columns
[6]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
       'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
       'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
       'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
       'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
       'timestamp'],
      dtype='object')
[7]:
ed_flights.columns
[7]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
       'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
       'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
       'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
       'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
       'timestamp'],
      dtype='object')
DataFrame.dtypes#
[8]:
pd_flights.dtypes
[8]:
AvgTicketPrice           float64
Cancelled                   bool
Carrier                   object
Dest                      object
DestAirportID             object
                       ...
OriginLocation            object
OriginRegion              object
OriginWeather             object
dayOfWeek                  int64
timestamp         datetime64[ns]
Length: 27, dtype: object
[9]:
ed_flights.dtypes
[9]:
AvgTicketPrice           float64
Cancelled                   bool
Carrier                   object
Dest                      object
DestAirportID             object
                       ...
OriginLocation            object
OriginRegion              object
OriginWeather             object
dayOfWeek                  int64
timestamp         datetime64[ns]
Length: 27, dtype: object
DataFrame.select_dtypes#
[10]:
pd_flights.select_dtypes(include=np.number)
[10]:
AvgTicketPrice DistanceKilometers ... FlightTimeMin dayOfWeek
0 841.265642 16492.326654 ... 1030.770416 0
1 882.982662 8823.400140 ... 464.389481 0
2 190.636904 0.000000 ... 0.000000 0
3 181.694216 555.737767 ... 222.749059 0
4 730.041778 13358.244200 ... 785.779071 0
... ... ... ... ... ...
13054 1080.446279 8058.581753 ... 402.929088 6
13055 646.612941 7088.598322 ... 644.418029 6
13056 997.751876 10920.652972 ... 937.540811 6
13057 1102.814465 18748.859647 ... 1697.404971 6
13058 858.144337 16809.141923 ... 1610.761827 6

13059 rows × 7 columns

[11]:
ed_flights.select_dtypes(include=np.number)
[11]:
AvgTicketPrice DistanceKilometers ... FlightTimeMin dayOfWeek
0 841.265642 16492.326654 ... 1030.770416 0
1 882.982662 8823.400140 ... 464.389481 0
2 190.636904 0.000000 ... 0.000000 0
3 181.694216 555.737767 ... 222.749059 0
4 730.041778 13358.244200 ... 785.779071 0
... ... ... ... ... ...
13054 1080.446279 8058.581753 ... 402.929088 6
13055 646.612941 7088.598322 ... 644.418029 6
13056 997.751876 10920.652972 ... 937.540811 6
13057 1102.814465 18748.859647 ... 1697.404971 6
13058 858.144337 16809.141923 ... 1610.761827 6

13059 rows × 7 columns

DataFrame.empty#
[12]:
pd_flights.empty
[12]:
False
[13]:
ed_flights.empty
[13]:
False
DataFrame.shape#
[14]:
pd_flights.shape
[14]:
(13059, 27)
[15]:
ed_flights.shape
[15]:
(13059, 27)
DataFrame.index#

Note, eland.DataFrame.index does not mirror pandas.DataFrame.index.

[16]:
pd_flights.index
[16]:
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
       ...
       '13049', '13050', '13051', '13052', '13053', '13054', '13055', '13056', '13057', '13058'],
      dtype='object', length=13059)
[17]:
# NBVAL_IGNORE_OUTPUT
ed_flights.index
[17]:
<eland.index.Index at 0x7ff1a25debb0>
[18]:
ed_flights.index.es_index_field
[18]:
'_id'
DataFrame.values#

Note, eland.DataFrame.values is not supported.

[19]:
pd_flights.values
[19]:
array([[841.2656419677076, False, 'Kibana Airlines', ..., 'Sunny', 0,
        Timestamp('2018-01-01 00:00:00')],
       [882.9826615595518, False, 'Logstash Airways', ..., 'Clear', 0,
        Timestamp('2018-01-01 18:27:00')],
       [190.6369038508356, False, 'Logstash Airways', ..., 'Rain', 0,
        Timestamp('2018-01-01 17:11:14')],
       ...,
       [997.7518761454494, False, 'Logstash Airways', ..., 'Sunny', 6,
        Timestamp('2018-02-11 04:09:27')],
       [1102.8144645388556, False, 'JetBeats', ..., 'Hail', 6,
        Timestamp('2018-02-11 08:28:21')],
       [858.1443369038839, False, 'JetBeats', ..., 'Rain', 6,
        Timestamp('2018-02-11 14:54:34')]], dtype=object)
[20]:
try:
    ed_flights.values
except AttributeError as e:
    print(e)
This method would scan/scroll the entire Elasticsearch index(s) into memory. If this is explicitly required, and there is sufficient memory, call `ed.eland_to_pandas(ed_df).values`
Indexing, iteration#
DataFrame.head#
[21]:
pd_flights.head()
[21]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
0 841.265642 False ... 0 2018-01-01 00:00:00
1 882.982662 False ... 0 2018-01-01 18:27:00
2 190.636904 False ... 0 2018-01-01 17:11:14
3 181.694216 True ... 0 2018-01-01 10:33:28
4 730.041778 False ... 0 2018-01-01 05:13:00

5 rows × 27 columns

[22]:
ed_flights.head()
[22]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
0 841.265642 False ... 0 2018-01-01 00:00:00
1 882.982662 False ... 0 2018-01-01 18:27:00
2 190.636904 False ... 0 2018-01-01 17:11:14
3 181.694216 True ... 0 2018-01-01 10:33:28
4 730.041778 False ... 0 2018-01-01 05:13:00

5 rows × 27 columns

DataFrame.tail#
[23]:
pd_flights.tail()
[23]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
13054 1080.446279 False ... 6 2018-02-11 20:42:25
13055 646.612941 False ... 6 2018-02-11 01:41:57
13056 997.751876 False ... 6 2018-02-11 04:09:27
13057 1102.814465 False ... 6 2018-02-11 08:28:21
13058 858.144337 False ... 6 2018-02-11 14:54:34

5 rows × 27 columns

[24]:
ed_flights.tail()
[24]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
13054 1080.446279 False ... 6 2018-02-11 20:42:25
13055 646.612941 False ... 6 2018-02-11 01:41:57
13056 997.751876 False ... 6 2018-02-11 04:09:27
13057 1102.814465 False ... 6 2018-02-11 08:28:21
13058 858.144337 False ... 6 2018-02-11 14:54:34

5 rows × 27 columns

DataFrame.keys#
[25]:
pd_flights.keys()
[25]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
       'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
       'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
       'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
       'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
       'timestamp'],
      dtype='object')
[26]:
ed_flights.keys()
[26]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
       'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
       'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
       'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
       'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
       'timestamp'],
      dtype='object')
DataFrame.get#
[27]:
pd_flights.get('Carrier')
[27]:
0         Kibana Airlines
1        Logstash Airways
2        Logstash Airways
3         Kibana Airlines
4         Kibana Airlines
               ...
13054    Logstash Airways
13055    Logstash Airways
13056    Logstash Airways
13057            JetBeats
13058            JetBeats
Name: Carrier, Length: 13059, dtype: object
[28]:
ed_flights.get('Carrier')
[28]:
0         Kibana Airlines
1        Logstash Airways
2        Logstash Airways
3         Kibana Airlines
4         Kibana Airlines
               ...
13054    Logstash Airways
13055    Logstash Airways
13056    Logstash Airways
13057            JetBeats
13058            JetBeats
Name: Carrier, Length: 13059, dtype: object
[29]:
pd_flights.get(['Carrier', 'Origin'])
[29]:
Carrier Origin
0 Kibana Airlines Frankfurt am Main Airport
1 Logstash Airways Cape Town International Airport
2 Logstash Airways Venice Marco Polo Airport
3 Kibana Airlines Naples International Airport
4 Kibana Airlines Licenciado Benito Juarez International Airport
... ... ...
13054 Logstash Airways Pisa International Airport
13055 Logstash Airways Winnipeg / James Armstrong Richardson Internat...
13056 Logstash Airways Licenciado Benito Juarez International Airport
13057 JetBeats Itami Airport
13058 JetBeats Adelaide International Airport

13059 rows × 2 columns

List input is not currently supported by eland.DataFrame.get:

[30]:
try:
    ed_flights.get(['Carrier', 'Origin'])
except TypeError as e:
    print(e)
unhashable type: 'list'
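A workaround (a short sketch, not part of the notebook output) is to select multiple columns with the indexing operator, which Eland does support:

ed_flights[['Carrier', 'Origin']]   # returns an eland.DataFrame with both columns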
DataFrame.query#
[31]:
pd_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled == True')
[31]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
8 960.869736 True ... 0 2018-01-01 12:09:35
26 975.812632 True ... 0 2018-01-01 15:38:32
311 946.358410 True ... 0 2018-01-01 11:51:12
651 975.383864 True ... 2 2018-01-03 21:13:17
950 907.836523 True ... 2 2018-01-03 05:14:51
... ... ... ... ... ...
12820 909.973606 True ... 5 2018-02-10 05:11:35
12906 983.429244 True ... 6 2018-02-11 06:19:58
12918 1136.678150 True ... 6 2018-02-11 16:03:10
12919 1105.211803 True ... 6 2018-02-11 05:36:05
13013 1055.350213 True ... 6 2018-02-11 13:20:16

68 rows × 27 columns

eland.DataFrame.query requires an explicit qualifier on boolean columns, i.e.

ed_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled') fails

[32]:
ed_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled == True')
[32]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
8 960.869736 True ... 0 2018-01-01 12:09:35
26 975.812632 True ... 0 2018-01-01 15:38:32
311 946.358410 True ... 0 2018-01-01 11:51:12
651 975.383864 True ... 2 2018-01-03 21:13:17
950 907.836523 True ... 2 2018-01-03 05:14:51
... ... ... ... ... ...
12820 909.973606 True ... 5 2018-02-10 05:11:35
12906 983.429244 True ... 6 2018-02-11 06:19:58
12918 1136.678150 True ... 6 2018-02-11 16:03:10
12919 1105.211803 True ... 6 2018-02-11 05:36:05
13013 1055.350213 True ... 6 2018-02-11 13:20:16

68 rows × 27 columns

Boolean indexing query#
[33]:
pd_flights[(pd_flights.Carrier=="Kibana Airlines") &
           (pd_flights.AvgTicketPrice > 900.0) &
           (pd_flights.Cancelled == True)]
[33]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
8 960.869736 True ... 0 2018-01-01 12:09:35
26 975.812632 True ... 0 2018-01-01 15:38:32
311 946.358410 True ... 0 2018-01-01 11:51:12
651 975.383864 True ... 2 2018-01-03 21:13:17
950 907.836523 True ... 2 2018-01-03 05:14:51
... ... ... ... ... ...
12820 909.973606 True ... 5 2018-02-10 05:11:35
12906 983.429244 True ... 6 2018-02-11 06:19:58
12918 1136.678150 True ... 6 2018-02-11 16:03:10
12919 1105.211803 True ... 6 2018-02-11 05:36:05
13013 1055.350213 True ... 6 2018-02-11 13:20:16

68 rows × 27 columns

[34]:
ed_flights[(ed_flights.Carrier=="Kibana Airlines") &
           (ed_flights.AvgTicketPrice > 900.0) &
           (ed_flights.Cancelled == True)]
[34]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
8 960.869736 True ... 0 2018-01-01 12:09:35
26 975.812632 True ... 0 2018-01-01 15:38:32
311 946.358410 True ... 0 2018-01-01 11:51:12
651 975.383864 True ... 2 2018-01-03 21:13:17
950 907.836523 True ... 2 2018-01-03 05:14:51
... ... ... ... ... ...
12820 909.973606 True ... 5 2018-02-10 05:11:35
12906 983.429244 True ... 6 2018-02-11 06:19:58
12918 1136.678150 True ... 6 2018-02-11 16:03:10
12919 1105.211803 True ... 6 2018-02-11 05:36:05
13013 1055.350213 True ... 6 2018-02-11 13:20:16

68 rows × 27 columns

Function application, GroupBy & window#
DataFrame.aggs#
[35]:
pd_flights[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
[35]:
DistanceKilometers AvgTicketPrice
sum 9.261629e+07 8.204365e+06
min 0.000000e+00 1.000205e+02
std 4.578438e+03 2.663969e+02

eland.DataFrame.aggregate currently only supports numeric columns

[36]:
ed_flights[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
[36]:
DistanceKilometers AvgTicketPrice
sum 9.261629e+07 8.204365e+06
min 0.000000e+00 1.000205e+02
std 4.578614e+03 2.664071e+02
Computations / descriptive stats#
DataFrame.count#
[37]:
pd_flights.count()
[37]:
AvgTicketPrice    13059
Cancelled         13059
Carrier           13059
Dest              13059
DestAirportID     13059
                  ...
OriginLocation    13059
OriginRegion      13059
OriginWeather     13059
dayOfWeek         13059
timestamp         13059
Length: 27, dtype: int64
[38]:
ed_flights.count()
[38]:
AvgTicketPrice    13059
Cancelled         13059
Carrier           13059
Dest              13059
DestAirportID     13059
                  ...
OriginLocation    13059
OriginRegion      13059
OriginWeather     13059
dayOfWeek         13059
timestamp         13059
Length: 27, dtype: int64
DataFrame.describe#
[39]:
pd_flights.describe()
[39]:
AvgTicketPrice DistanceKilometers ... FlightTimeMin dayOfWeek
count 13059.000000 13059.000000 ... 13059.000000 13059.000000
mean 628.253689 7092.142455 ... 511.127842 2.835975
std 266.396861 4578.438497 ... 334.753952 1.939439
min 100.020528 0.000000 ... 0.000000 0.000000
25% 409.893816 2459.705673 ... 252.333192 1.000000
50% 640.556668 7610.330866 ... 503.045170 3.000000
75% 842.185470 9736.637600 ... 720.416036 4.000000
max 1199.729053 19881.482315 ... 1902.902032 6.000000

8 rows × 7 columns

Values returned from eland.DataFrame.describe may vary slightly because they are computed from (approximate) Elasticsearch aggregations.

[40]:
# NBVAL_IGNORE_OUTPUT
ed_flights.describe()
[40]:
AvgTicketPrice Cancelled ... FlightTimeMin dayOfWeek
count 13059.000000 13059.000000 ... 13059.000000 13059.000000
mean 628.253689 0.128494 ... 511.127842 2.835975
std 266.407061 0.334664 ... 334.766770 1.939513
min 100.020531 0.000000 ... 0.000000 0.000000
25% 410.008918 0.000000 ... 251.938710 1.000000
50% 640.387285 0.000000 ... 503.148975 3.000000
75% 842.213490 0.000000 ... 720.505705 4.000000
max 1199.729004 1.000000 ... 1902.901978 6.000000

8 rows × 9 columns

DataFrame.info#
[41]:
pd_flights.info()
<class 'pandas.core.frame.DataFrame'>
Index: 13059 entries, 0 to 13058
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   AvgTicketPrice      13059 non-null  float64
 1   Cancelled           13059 non-null  bool
 2   Carrier             13059 non-null  object
 3   Dest                13059 non-null  object
 4   DestAirportID       13059 non-null  object
 5   DestCityName        13059 non-null  object
 6   DestCountry         13059 non-null  object
 7   DestLocation        13059 non-null  object
 8   DestRegion          13059 non-null  object
 9   DestWeather         13059 non-null  object
 10  DistanceKilometers  13059 non-null  float64
 11  DistanceMiles       13059 non-null  float64
 12  FlightDelay         13059 non-null  bool
 13  FlightDelayMin      13059 non-null  int64
 14  FlightDelayType     13059 non-null  object
 15  FlightNum           13059 non-null  object
 16  FlightTimeHour      13059 non-null  float64
 17  FlightTimeMin       13059 non-null  float64
 18  Origin              13059 non-null  object
 19  OriginAirportID     13059 non-null  object
 20  OriginCityName      13059 non-null  object
 21  OriginCountry       13059 non-null  object
 22  OriginLocation      13059 non-null  object
 23  OriginRegion        13059 non-null  object
 24  OriginWeather       13059 non-null  object
 25  dayOfWeek           13059 non-null  int64
 26  timestamp           13059 non-null  datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17)
memory usage: 3.1+ MB
[42]:
ed_flights.info()
<class 'eland.dataframe.DataFrame'>
Index: 13059 entries, 0 to 13058
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   AvgTicketPrice      13059 non-null  float64
 1   Cancelled           13059 non-null  bool
 2   Carrier             13059 non-null  object
 3   Dest                13059 non-null  object
 4   DestAirportID       13059 non-null  object
 5   DestCityName        13059 non-null  object
 6   DestCountry         13059 non-null  object
 7   DestLocation        13059 non-null  object
 8   DestRegion          13059 non-null  object
 9   DestWeather         13059 non-null  object
 10  DistanceKilometers  13059 non-null  float64
 11  DistanceMiles       13059 non-null  float64
 12  FlightDelay         13059 non-null  bool
 13  FlightDelayMin      13059 non-null  int64
 14  FlightDelayType     13059 non-null  object
 15  FlightNum           13059 non-null  object
 16  FlightTimeHour      13059 non-null  float64
 17  FlightTimeMin       13059 non-null  float64
 18  Origin              13059 non-null  object
 19  OriginAirportID     13059 non-null  object
 20  OriginCityName      13059 non-null  object
 21  OriginCountry       13059 non-null  object
 22  OriginLocation      13059 non-null  object
 23  OriginRegion        13059 non-null  object
 24  OriginWeather       13059 non-null  object
 25  dayOfWeek           13059 non-null  int64
 26  timestamp           13059 non-null  datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17)
memory usage: 64.000 bytes
Elasticsearch storage usage: 5.786 MB
DataFrame.max, DataFrame.min, DataFrame.mean, DataFrame.sum#
max#
[43]:
pd_flights.max(numeric_only=True)
[43]:
AvgTicketPrice         1199.729053
Cancelled                     True
DistanceKilometers    19881.482315
DistanceMiles         12353.780369
FlightDelay                   True
FlightDelayMin                 360
FlightTimeHour           31.715034
FlightTimeMin          1902.902032
dayOfWeek                        6
dtype: object

eland.DataFrame.max, min, mean, and sum only aggregate numeric columns

[44]:
ed_flights.max(numeric_only=True)
[44]:
AvgTicketPrice         1199.729004
Cancelled                 1.000000
DistanceKilometers    19881.482422
DistanceMiles         12353.780273
FlightDelay               1.000000
FlightDelayMin          360.000000
FlightTimeHour           31.715034
FlightTimeMin          1902.901978
dayOfWeek                 6.000000
dtype: float64
min#
[45]:
pd_flights.min(numeric_only=True)
[45]:
AvgTicketPrice        100.020528
Cancelled                  False
DistanceKilometers           0.0
DistanceMiles                0.0
FlightDelay                False
FlightDelayMin                 0
FlightTimeHour               0.0
FlightTimeMin                0.0
dayOfWeek                      0
dtype: object
[46]:
ed_flights.min(numeric_only=True)
[46]:
AvgTicketPrice        100.020531
Cancelled               0.000000
DistanceKilometers      0.000000
DistanceMiles           0.000000
FlightDelay             0.000000
FlightDelayMin          0.000000
FlightTimeHour          0.000000
FlightTimeMin           0.000000
dayOfWeek               0.000000
dtype: float64
mean#
[47]:
pd_flights.mean(numeric_only=True)
[47]:
AvgTicketPrice         628.253689
Cancelled                0.128494
DistanceKilometers    7092.142455
DistanceMiles         4406.853013
FlightDelay              0.251168
FlightDelayMin          47.335171
FlightTimeHour           8.518797
FlightTimeMin          511.127842
dayOfWeek                2.835975
dtype: float64
[48]:
ed_flights.mean(numeric_only=True)
[48]:
AvgTicketPrice         628.253689
Cancelled                0.128494
DistanceKilometers    7092.142457
DistanceMiles         4406.853010
FlightDelay              0.251168
FlightDelayMin          47.335171
FlightTimeHour           8.518797
FlightTimeMin          511.127842
dayOfWeek                2.835975
dtype: float64
sum#
[49]:
pd_flights.sum(numeric_only=True)
[49]:
AvgTicketPrice        8.204365e+06
Cancelled             1.678000e+03
DistanceKilometers    9.261629e+07
DistanceMiles         5.754909e+07
FlightDelay           3.280000e+03
FlightDelayMin        6.181500e+05
FlightTimeHour        1.112470e+05
FlightTimeMin         6.674818e+06
dayOfWeek             3.703500e+04
dtype: float64
[50]:
ed_flights.sum(numeric_only=True)
[50]:
AvgTicketPrice        8.204365e+06
Cancelled             1.678000e+03
DistanceKilometers    9.261629e+07
DistanceMiles         5.754909e+07
FlightDelay           3.280000e+03
FlightDelayMin        6.181500e+05
FlightTimeHour        1.112470e+05
FlightTimeMin         6.674818e+06
dayOfWeek             3.703500e+04
dtype: float64
DataFrame.nunique#
[51]:
pd_flights[['Carrier', 'Origin', 'Dest']].nunique()
[51]:
Carrier      4
Origin     156
Dest       156
dtype: int64
[52]:
ed_flights[['Carrier', 'Origin', 'Dest']].nunique()
[52]:
Carrier      4
Origin     156
Dest       156
dtype: int64
DataFrame.drop#
[53]:
pd_flights.drop(columns=['AvgTicketPrice',
                         'Cancelled',
                         'DestLocation',
                         'Dest',
                         'DestAirportID',
                         'DestCityName',
                         'DestCountry'])
[53]:
Carrier DestRegion ... dayOfWeek timestamp
0 Kibana Airlines SE-BD ... 0 2018-01-01 00:00:00
1 Logstash Airways IT-34 ... 0 2018-01-01 18:27:00
2 Logstash Airways IT-34 ... 0 2018-01-01 17:11:14
3 Kibana Airlines IT-34 ... 0 2018-01-01 10:33:28
4 Kibana Airlines SE-BD ... 0 2018-01-01 05:13:00
... ... ... ... ... ...
13054 Logstash Airways SE-BD ... 6 2018-02-11 20:42:25
13055 Logstash Airways CH-ZH ... 6 2018-02-11 01:41:57
13056 Logstash Airways RU-AMU ... 6 2018-02-11 04:09:27
13057 JetBeats SE-BD ... 6 2018-02-11 08:28:21
13058 JetBeats US-DC ... 6 2018-02-11 14:54:34

13059 rows × 20 columns

[54]:
ed_flights.drop(columns=['AvgTicketPrice',
                         'Cancelled',
                         'DestLocation',
                         'Dest',
                         'DestAirportID',
                         'DestCityName',
                         'DestCountry'])
[54]:
Carrier DestRegion ... dayOfWeek timestamp
0 Kibana Airlines SE-BD ... 0 2018-01-01 00:00:00
1 Logstash Airways IT-34 ... 0 2018-01-01 18:27:00
2 Logstash Airways IT-34 ... 0 2018-01-01 17:11:14
3 Kibana Airlines IT-34 ... 0 2018-01-01 10:33:28
4 Kibana Airlines SE-BD ... 0 2018-01-01 05:13:00
... ... ... ... ... ...
13054 Logstash Airways SE-BD ... 6 2018-02-11 20:42:25
13055 Logstash Airways CH-ZH ... 6 2018-02-11 01:41:57
13056 Logstash Airways RU-AMU ... 6 2018-02-11 04:09:27
13057 JetBeats SE-BD ... 6 2018-02-11 08:28:21
13058 JetBeats US-DC ... 6 2018-02-11 14:54:34

13059 rows × 20 columns

Plotting#
[55]:
pd_flights.select_dtypes(include=np.number).hist(figsize=[10,10])
plt.show()
[Image: histograms of the numeric columns of pd_flights (_images/examples_demo_notebook_92_0.png)]
[56]:
ed_flights.select_dtypes(include=np.number).hist(figsize=[10,10])
plt.show()
[Image: histograms of the numeric columns of ed_flights (_images/examples_demo_notebook_93_0.png)]
Elasticsearch utilities#
[57]:
ed_flights2 = ed_flights[(ed_flights.OriginAirportID == 'AMS') & (ed_flights.FlightDelayMin > 60)]
ed_flights2 = ed_flights2[['timestamp', 'OriginAirportID', 'DestAirportID', 'FlightDelayMin']]
ed_flights2 = ed_flights2.tail()
[58]:
print(ed_flights2.es_info())
es_index_pattern: flights
Index:
 es_index_field: _id
 is_source_field: False
Mappings:
 capabilities:
                   es_field_name  is_source es_dtype                  es_date_format        pd_dtype  is_searchable  is_aggregatable  is_scripted aggregatable_es_field_name
timestamp              timestamp       True     date  strict_date_hour_minute_second  datetime64[ns]           True             True        False                  timestamp
OriginAirportID  OriginAirportID       True  keyword                            None          object           True             True        False            OriginAirportID
DestAirportID      DestAirportID       True  keyword                            None          object           True             True        False              DestAirportID
FlightDelayMin    FlightDelayMin       True  integer                            None           int64           True             True        False             FlightDelayMin
Operations:
 tasks: [('boolean_filter': ('boolean_filter': {'bool': {'must': [{'term': {'OriginAirportID': 'AMS'}}, {'range': {'FlightDelayMin': {'gt': 60}}}]}})), ('tail': ('sort_field': '_doc', 'count': 5))]
 size: 5
 sort_params: {'_doc': 'desc'}
 _source: ['timestamp', 'OriginAirportID', 'DestAirportID', 'FlightDelayMin']
 body: {'query': {'bool': {'must': [{'term': {'OriginAirportID': 'AMS'}}, {'range': {'FlightDelayMin': {'gt': 60}}}]}}}
 post_processing: [('sort_index')]

Introduction to Eland Webinar#

This Jupyter Notebook goes along with the webinar ‘Introduction to Eland’, which is available on YouTube. To follow along, either create an Elasticsearch deployment on Elastic Cloud (free trial available) or start your own Elasticsearch cluster locally.

You’ll need to install the following libraries:

$ python -m pip install eland numpy pandas
DataFrame Demo#
[19]:
# Standard imports
import eland as ed
import pandas as pd
import numpy as np
from elasticsearch import Elasticsearch

# Function for pretty-printing JSON
def json(x):
    import json
    print(json.dumps(x, indent=2, sort_keys=True))
[20]:
# Connect to an Elastic Cloud instance
# or another Elasticsearch index below

ELASTIC_CLOUD_ID = "<cloud-id>"
ELASTIC_CLOUD_PASSWORD = "<password>"

es = Elasticsearch(
  cloud_id=ELASTIC_CLOUD_ID,
  basic_auth=("elastic", ELASTIC_CLOUD_PASSWORD)
)
json(es.info())
{
  "cluster_name": "167e473c7bba4bae85004385d4e0ce46",
  "cluster_uuid": "4Y2FwBhRSsWq9uGedb1DmQ",
  "name": "instance-0000000000",
  "tagline": "You Know, for Search",
  "version": {
    "build_date": "2020-06-14T19:35:50.234439Z",
    "build_flavor": "default",
    "build_hash": "757314695644ea9a1dc2fecd26d1a43856725e65",
    "build_snapshot": false,
    "build_type": "docker",
    "lucene_version": "8.5.1",
    "minimum_index_compatibility_version": "6.0.0-beta1",
    "minimum_wire_compatibility_version": "6.8.0",
    "number": "7.8.0"
  }
}
[21]:
# Load the dataset from NYC Open Data and take a look
pd_df = pd.read_csv("nyc-restaurants.csv").dropna()
pd_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 193197 entries, 0 to 400255
Data columns (total 26 columns):
 #   Column                 Non-Null Count   Dtype
---  ------                 --------------   -----
 0   CAMIS                  193197 non-null  int64
 1   DBA                    193197 non-null  object
 2   BORO                   193197 non-null  object
 3   BUILDING               193197 non-null  object
 4   STREET                 193197 non-null  object
 5   ZIPCODE                193197 non-null  float64
 6   PHONE                  193197 non-null  object
 7   CUISINE DESCRIPTION    193197 non-null  object
 8   INSPECTION DATE        193197 non-null  object
 9   ACTION                 193197 non-null  object
 10  VIOLATION CODE         193197 non-null  object
 11  VIOLATION DESCRIPTION  193197 non-null  object
 12  CRITICAL FLAG          193197 non-null  object
 13  SCORE                  193197 non-null  float64
 14  GRADE                  193197 non-null  object
 15  GRADE DATE             193197 non-null  object
 16  RECORD DATE            193197 non-null  object
 17  INSPECTION TYPE        193197 non-null  object
 18  Latitude               193197 non-null  float64
 19  Longitude              193197 non-null  float64
 20  Community Board        193197 non-null  float64
 21  Council District       193197 non-null  float64
 22  Census Tract           193197 non-null  float64
 23  BIN                    193197 non-null  float64
 24  BBL                    193197 non-null  float64
 25  NTA                    193197 non-null  object
dtypes: float64(9), int64(1), object(16)
memory usage: 39.8+ MB
[22]:
# Rename the columns to be snake_case
pd_df.columns = [x.lower().replace(" ", "_") for x in pd_df.columns]

# Combine the 'latitude' and 'longitude' columns into one column 'location' for 'geo_point'
pd_df["location"] = pd_df[["latitude", "longitude"]].apply(lambda x: ",".join(str(item) for item in x), axis=1)

# Drop the old columns in favor of 'location'
pd_df.drop(["latitude", "longitude"], axis=1, inplace=True)

pd_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 193197 entries, 0 to 400255
Data columns (total 25 columns):
 #   Column                 Non-Null Count   Dtype
---  ------                 --------------   -----
 0   camis                  193197 non-null  int64
 1   dba                    193197 non-null  object
 2   boro                   193197 non-null  object
 3   building               193197 non-null  object
 4   street                 193197 non-null  object
 5   zipcode                193197 non-null  float64
 6   phone                  193197 non-null  object
 7   cuisine_description    193197 non-null  object
 8   inspection_date        193197 non-null  object
 9   action                 193197 non-null  object
 10  violation_code         193197 non-null  object
 11  violation_description  193197 non-null  object
 12  critical_flag          193197 non-null  object
 13  score                  193197 non-null  float64
 14  grade                  193197 non-null  object
 15  grade_date             193197 non-null  object
 16  record_date            193197 non-null  object
 17  inspection_type        193197 non-null  object
 18  community_board        193197 non-null  float64
 19  council_district       193197 non-null  float64
 20  census_tract           193197 non-null  float64
 21  bin                    193197 non-null  float64
 22  bbl                    193197 non-null  float64
 23  nta                    193197 non-null  object
 24  location               193197 non-null  object
dtypes: float64(7), int64(1), object(17)
memory usage: 38.3+ MB
[23]:
df = ed.pandas_to_eland(
    pd_df=pd_df,
    es_client=es,

    # Where the data will live in Elasticsearch
    es_dest_index="nyc-restaurants",

    # Type overrides for certain columns, 'location' detected
    # automatically as 'keyword' but we want these interpreted as 'geo_point'.
    es_type_overrides={
        "location": "geo_point",
        "dba": "text",
        "zipcode": "short"
    },

    # If the index already exists what should we do?
    es_if_exists="replace",

    # Wait for data to be indexed before returning
    es_refresh=True,
)
df.info()
<class 'eland.dataframe.DataFrame'>
Index: 193197 entries, 10388 to 398749
Data columns (total 25 columns):
 #   Column                 Non-Null Count   Dtype
---  ------                 --------------   -----
 0   camis                  193197 non-null  int64
 1   dba                    193197 non-null  object
 2   boro                   193197 non-null  object
 3   building               193197 non-null  object
 4   street                 193197 non-null  object
 5   zipcode                193197 non-null  int64
 6   phone                  193197 non-null  object
 7   cuisine_description    193197 non-null  object
 8   inspection_date        193197 non-null  object
 9   action                 193197 non-null  object
 10  violation_code         193197 non-null  object
 11  violation_description  193197 non-null  object
 12  critical_flag          193197 non-null  object
 13  score                  193197 non-null  float64
 14  grade                  193197 non-null  object
 15  grade_date             193197 non-null  object
 16  record_date            193197 non-null  object
 17  inspection_type        193197 non-null  object
 18  community_board        193197 non-null  float64
 19  council_district       193197 non-null  float64
 20  census_tract           193197 non-null  float64
 21  bin                    193197 non-null  float64
 22  bbl                    193197 non-null  float64
 23  nta                    193197 non-null  object
 24  location               193197 non-null  object
dtypes: float64(6), int64(2), object(17)
memory usage: 80.0 bytes
[24]:
json(es.indices.get_mapping(index="nyc-restaurants"))
{
  "nyc-restaurants": {
    "mappings": {
      "properties": {
        "action": {
          "type": "keyword"
        },
        "bbl": {
          "type": "double"
        },
        "bin": {
          "type": "double"
        },
        "boro": {
          "type": "keyword"
        },
        "building": {
          "type": "keyword"
        },
        "camis": {
          "type": "long"
        },
        "census_tract": {
          "type": "double"
        },
        "community_board": {
          "type": "double"
        },
        "council_district": {
          "type": "double"
        },
        "critical_flag": {
          "type": "keyword"
        },
        "cuisine_description": {
          "type": "keyword"
        },
        "dba": {
          "type": "text"
        },
        "grade": {
          "type": "keyword"
        },
        "grade_date": {
          "type": "keyword"
        },
        "inspection_date": {
          "type": "keyword"
        },
        "inspection_type": {
          "type": "keyword"
        },
        "location": {
          "type": "geo_point"
        },
        "nta": {
          "type": "keyword"
        },
        "phone": {
          "type": "keyword"
        },
        "record_date": {
          "type": "keyword"
        },
        "score": {
          "type": "double"
        },
        "street": {
          "type": "keyword"
        },
        "violation_code": {
          "type": "keyword"
        },
        "violation_description": {
          "type": "keyword"
        },
        "zipcode": {
          "type": "short"
        }
      }
    }
  }
}
[25]:
# Shape is determined by using count API
df.shape
[25]:
(193197, 25)
[34]:
# DataFrame has many APIs compatible with Pandas

#df.head(10)
#df.columns
#df.dba
#df["grade"]
#df[df.grade.isin(["A", "B"])]
#print(df[df.grade.isin(["A", "B"])].es_info())
#print(df.tail(10).es_info())
es_index_pattern: nyc-restaurants
Index:
 es_index_field: _id
 is_source_field: False
Mappings:
 capabilities:
                               es_field_name  is_source   es_dtype es_date_format pd_dtype  is_searchable  is_aggregatable  is_scripted aggregatable_es_field_name
camis                                  camis       True       long           None    int64           True             True        False                      camis
dba                                      dba       True       text           None   object           True            False        False                       None
boro                                    boro       True    keyword           None   object           True             True        False                       boro
building                            building       True    keyword           None   object           True             True        False                   building
street                                street       True    keyword           None   object           True             True        False                     street
zipcode                              zipcode       True      short           None    int64           True             True        False                    zipcode
phone                                  phone       True    keyword           None   object           True             True        False                      phone
cuisine_description      cuisine_description       True    keyword           None   object           True             True        False        cuisine_description
inspection_date              inspection_date       True    keyword           None   object           True             True        False            inspection_date
action                                action       True    keyword           None   object           True             True        False                     action
violation_code                violation_code       True    keyword           None   object           True             True        False             violation_code
violation_description  violation_description       True    keyword           None   object           True             True        False      violation_description
critical_flag                  critical_flag       True    keyword           None   object           True             True        False              critical_flag
score                                  score       True     double           None  float64           True             True        False                      score
grade                                  grade       True    keyword           None   object           True             True        False                      grade
grade_date                        grade_date       True    keyword           None   object           True             True        False                 grade_date
record_date                      record_date       True    keyword           None   object           True             True        False                record_date
inspection_type              inspection_type       True    keyword           None   object           True             True        False            inspection_type
community_board              community_board       True     double           None  float64           True             True        False            community_board
council_district            council_district       True     double           None  float64           True             True        False           council_district
census_tract                    census_tract       True     double           None  float64           True             True        False               census_tract
bin                                      bin       True     double           None  float64           True             True        False                        bin
bbl                                      bbl       True     double           None  float64           True             True        False                        bbl
nta                                      nta       True    keyword           None   object           True             True        False                        nta
location                            location       True  geo_point           None   object           True             True        False                   location
Operations:
 tasks: [('tail': ('sort_field': '_doc', 'count': 10))]
 size: 10
 sort_params: _doc:desc
 _source: ['camis', 'dba', 'boro', 'building', 'street', 'zipcode', 'phone', 'cuisine_description', 'inspection_date', 'action', 'violation_code', 'violation_description', 'critical_flag', 'score', 'grade', 'grade_date', 'record_date', 'inspection_type', 'community_board', 'council_district', 'census_tract', 'bin', 'bbl', 'nta', 'location']
 body: {}
 post_processing: [('sort_index')]

[39]:
# Aggregating values
df.describe()
[39]:
camis zipcode score community_board council_district census_tract bin bbl
count 1.931970e+05 193197.000000 193197.000000 193197.000000 193197.000000 193197.000000 1.931970e+05 1.931970e+05
mean 4.605010e+07 10677.212540 12.947680 248.602603 20.020715 28796.048298 2.513373e+06 2.450622e+09
std 4.415232e+06 595.142246 8.180244 130.697014 15.809664 30672.683469 1.351134e+06 1.313578e+09
min 3.011234e+07 10000.000000 -1.000000 101.000000 1.000000 100.000000 1.000000e+06 1.000000e+09
25% 4.138051e+07 10022.000000 9.000000 105.000000 4.000000 7895.605691 1.042708e+06 1.011024e+09
50% 5.000527e+07 10468.006114 12.000000 301.000000 19.747529 16022.917106 3.007191e+06 3.002924e+09
75% 5.005661e+07 11228.624535 13.000000 401.000000 34.000000 40246.000337 4.002294e+06 4.003343e+09
max 5.010416e+07 12345.000000 99.000000 503.000000 51.000000 162100.000000 5.799501e+06 5.270001e+09
[40]:
# Plotting with matplotlib
from matplotlib import pyplot as plt

df[["score"]].hist(figsize=[10,10])
plt.show()
_images/examples_introduction_to_eland_webinar_11_0.png
[42]:
# es_query() allows for the full Elasticsearch querying capabilities
df.es_query({
  "geo_distance": {
    "distance": "50m",
    "location": {
      "lat": 40.643852716573,
      "lon": -74.011628212186
    }
  }
})
[42]:
camis dba boro building street zipcode phone cuisine_description inspection_date action ... grade_date record_date inspection_type community_board council_district census_tract bin bbl nta location
53127 41144258 BURGER KING Brooklyn 5212 5 AVENUE 11220.0 7187650844 Hamburgers 12/26/2018 Violations were cited in the following area(s). ... 12/26/2018 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 7600.0 3329902.0 3.008070e+09 BK32 40.643852716573,-74.011628212186
61268 41144258 BURGER KING Brooklyn 5212 5 AVENUE 11220.0 7187650844 Hamburgers 07/20/2017 Violations were cited in the following area(s). ... 07/20/2017 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 7600.0 3329902.0 3.008070e+09 BK32 40.643852716573,-74.011628212186
20717 41144258 BURGER KING Brooklyn 5212 5 AVENUE 11220.0 7187650844 Hamburgers 03/04/2020 Violations were cited in the following area(s). ... 03/04/2020 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 7600.0 3329902.0 3.008070e+09 BK32 40.643852716573,-74.011628212186
4648 41271801 PINO'S Brooklyn 5201 5 AVENUE 11220.0 7184396012 Pizza 05/25/2019 Violations were cited in the following area(s). ... 05/25/2019 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 10000.0 3013942.0 3.008080e+09 BK34 40.643888405293005,-74.011563356969
224 41271801 PINO'S Brooklyn 5201 5 AVENUE 11220.0 7184396012 Pizza 05/25/2019 Violations were cited in the following area(s). ... 05/25/2019 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 10000.0 3013942.0 3.008080e+09 BK34 40.643888405293005,-74.011563356969
9465 41144258 BURGER KING Brooklyn 5212 5 AVENUE 11220.0 7187650844 Hamburgers 03/04/2020 Violations were cited in the following area(s). ... 03/04/2020 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 7600.0 3329902.0 3.008070e+09 BK32 40.643852716573,-74.011628212186
104512 40396492 ROYAL KING'S PIZZA Brooklyn 5211 5 AVENUE 11220.0 7184923846 Pizza 12/19/2018 Violations were cited in the following area(s). ... 12/19/2018 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 10000.0 3013939.0 3.008080e+09 BK34 40.643849974348996,-74.01160298782
106728 41271801 PINO'S Brooklyn 5201 5 AVENUE 11220.0 7184396012 Pizza 01/25/2018 Violations were cited in the following area(s). ... 01/25/2018 07/07/2020 Cycle Inspection / Re-inspection 307.0 38.0 10000.0 3013942.0 3.008080e+09 BK34 40.643888405293005,-74.011563356969
62748 50004330 KFC Brooklyn 5223 5 AVENUE 11220.0 7184922813 Chicken 05/28/2019 Violations were cited in the following area(s). ... 05/28/2019 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 10000.0 3013937.0 3.008080e+09 BK34 40.643800563168,-74.01165342693001
79211 41271801 PINO'S Brooklyn 5201 5 AVENUE 11220.0 7184396012 Pizza 11/05/2016 Violations were cited in the following area(s). ... 11/05/2016 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 10000.0 3013942.0 3.008080e+09 BK34 40.643888405293005,-74.011563356969
218545 50004330 KFC Brooklyn 5223 5 AVENUE 11220.0 7184922813 Chicken 01/10/2018 Violations were cited in the following area(s). ... 01/10/2018 07/07/2020 Cycle Inspection / Re-inspection 307.0 38.0 10000.0 3013937.0 3.008080e+09 BK34 40.643800563168,-74.01165342693001
238663 41271801 PINO'S Brooklyn 5201 5 AVENUE 11220.0 7184396012 Pizza 11/05/2016 Violations were cited in the following area(s). ... 11/05/2016 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 10000.0 3013942.0 3.008080e+09 BK34 40.643888405293005,-74.011563356969
245205 40396492 ROYAL KING'S PIZZA Brooklyn 5211 5 AVENUE 11220.0 7184923846 Pizza 12/19/2018 Violations were cited in the following area(s). ... 12/19/2018 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 10000.0 3013939.0 3.008080e+09 BK34 40.643849974348996,-74.01160298782
245233 41271801 PINO'S Brooklyn 5201 5 AVENUE 11220.0 7184396012 Pizza 01/25/2018 Violations were cited in the following area(s). ... 01/25/2018 07/07/2020 Cycle Inspection / Re-inspection 307.0 38.0 10000.0 3013942.0 3.008080e+09 BK34 40.643888405293005,-74.011563356969
247417 50004330 KFC Brooklyn 5223 5 AVENUE 11220.0 7184922813 Chicken 05/05/2017 Violations were cited in the following area(s). ... 05/05/2017 07/07/2020 Cycle Inspection / Re-inspection 307.0 38.0 10000.0 3013937.0 3.008080e+09 BK34 40.643800563168,-74.01165342693001
186874 50099704 MASTER'S PIZZERIA Brooklyn 5201 5 AVENUE 11220.0 7184396012 Pizza 11/18/2019 Violations were cited in the following area(s). ... 11/18/2019 07/07/2020 Pre-permit (Operational) / Initial Inspection 307.0 38.0 10000.0 3013942.0 3.008080e+09 BK34 40.643888405293005,-74.011563356969
198104 40396492 ROYAL KING'S PIZZA Brooklyn 5211 5 AVENUE 11220.0 7184923846 Pizza 12/28/2017 Violations were cited in the following area(s). ... 12/28/2017 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 10000.0 3013939.0 3.008080e+09 BK34 40.643849974348996,-74.01160298782
213425 40396492 ROYAL KING'S PIZZA Brooklyn 5211 5 AVENUE 11220.0 7184923846 Pizza 12/19/2018 Violations were cited in the following area(s). ... 12/19/2018 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 10000.0 3013939.0 3.008080e+09 BK34 40.643849974348996,-74.01160298782
202363 50004330 KFC Brooklyn 5223 5 AVENUE 11220.0 7184922813 Chicken 05/28/2019 Violations were cited in the following area(s). ... 05/28/2019 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 10000.0 3013937.0 3.008080e+09 BK34 40.643800563168,-74.01165342693001
158059 40396492 ROYAL KING'S PIZZA Brooklyn 5211 5 AVENUE 11220.0 7184923846 Pizza 12/19/2018 Violations were cited in the following area(s). ... 12/19/2018 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 10000.0 3013939.0 3.008080e+09 BK34 40.643849974348996,-74.01160298782
163672 41144258 BURGER KING Brooklyn 5212 5 AVENUE 11220.0 7187650844 Hamburgers 08/13/2018 Violations were cited in the following area(s). ... 08/13/2018 07/07/2020 Cycle Inspection / Re-inspection 307.0 38.0 7600.0 3329902.0 3.008070e+09 BK32 40.643852716573,-74.011628212186
138508 40396492 ROYAL KING'S PIZZA Brooklyn 5211 5 AVENUE 11220.0 7184923846 Pizza 01/29/2020 Violations were cited in the following area(s). ... 01/29/2020 07/07/2020 Cycle Inspection / Re-inspection 307.0 38.0 10000.0 3013939.0 3.008080e+09 BK34 40.643849974348996,-74.01160298782
140940 41144258 BURGER KING Brooklyn 5212 5 AVENUE 11220.0 7187650844 Hamburgers 07/20/2017 Violations were cited in the following area(s). ... 07/20/2017 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 7600.0 3329902.0 3.008070e+09 BK32 40.643852716573,-74.011628212186
143157 50004330 KFC Brooklyn 5223 5 AVENUE 11220.0 7184922813 Chicken 01/10/2018 Violations were cited in the following area(s). ... 01/10/2018 07/07/2020 Cycle Inspection / Re-inspection 307.0 38.0 10000.0 3013937.0 3.008080e+09 BK34 40.643800563168,-74.01165342693001
149548 41144258 BURGER KING Brooklyn 5212 5 AVENUE 11220.0 7187650844 Hamburgers 07/20/2017 Violations were cited in the following area(s). ... 07/20/2017 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 7600.0 3329902.0 3.008070e+09 BK32 40.643852716573,-74.011628212186
149742 50004330 KFC Brooklyn 5223 5 AVENUE 11220.0 7184922813 Chicken 05/31/2018 Violations were cited in the following area(s). ... 05/31/2018 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 10000.0 3013937.0 3.008080e+09 BK34 40.643800563168,-74.01165342693001
249994 41271801 PINO'S Brooklyn 5201 5 AVENUE 11220.0 7184396012 Pizza 01/25/2018 Violations were cited in the following area(s). ... 01/25/2018 07/07/2020 Cycle Inspection / Re-inspection 307.0 38.0 10000.0 3013942.0 3.008080e+09 BK34 40.643888405293005,-74.011563356969
257603 41144258 BURGER KING Brooklyn 5212 5 AVENUE 11220.0 7187650844 Hamburgers 08/13/2018 Violations were cited in the following area(s). ... 08/13/2018 07/07/2020 Cycle Inspection / Re-inspection 307.0 38.0 7600.0 3329902.0 3.008070e+09 BK32 40.643852716573,-74.011628212186
268823 50004330 KFC Brooklyn 5223 5 AVENUE 11220.0 7184922813 Chicken 01/10/2018 Violations were cited in the following area(s). ... 01/10/2018 07/07/2020 Cycle Inspection / Re-inspection 307.0 38.0 10000.0 3013937.0 3.008080e+09 BK34 40.643800563168,-74.01165342693001
269521 41144258 BURGER KING Brooklyn 5212 5 AVENUE 11220.0 7187650844 Hamburgers 12/17/2019 Violations were cited in the following area(s). ... 12/17/2019 07/07/2020 Cycle Inspection / Re-inspection 307.0 38.0 7600.0 3329902.0 3.008070e+09 BK32 40.643852716573,-74.011628212186
277500 50099704 MASTER'S PIZZERIA Brooklyn 5201 5 AVENUE 11220.0 7184396012 Pizza 11/18/2019 Violations were cited in the following area(s). ... 11/18/2019 07/07/2020 Pre-permit (Operational) / Initial Inspection 307.0 38.0 10000.0 3013942.0 3.008080e+09 BK34 40.643888405293005,-74.011563356969
279503 40396492 ROYAL KING'S PIZZA Brooklyn 5211 5 AVENUE 11220.0 7184923846 Pizza 01/29/2020 Violations were cited in the following area(s). ... 01/29/2020 07/07/2020 Cycle Inspection / Re-inspection 307.0 38.0 10000.0 3013939.0 3.008080e+09 BK34 40.643849974348996,-74.01160298782
299863 41144258 BURGER KING Brooklyn 5212 5 AVENUE 11220.0 7187650844 Hamburgers 12/26/2018 Violations were cited in the following area(s). ... 12/26/2018 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 7600.0 3329902.0 3.008070e+09 BK32 40.643852716573,-74.011628212186
319787 41271801 PINO'S Brooklyn 5201 5 AVENUE 11220.0 7184396012 Pizza 05/25/2019 Violations were cited in the following area(s). ... 05/25/2019 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 10000.0 3013942.0 3.008080e+09 BK34 40.643888405293005,-74.011563356969
336570 50004330 KFC Brooklyn 5223 5 AVENUE 11220.0 7184922813 Chicken 01/10/2018 Violations were cited in the following area(s). ... 01/10/2018 07/07/2020 Cycle Inspection / Re-inspection 307.0 38.0 10000.0 3013937.0 3.008080e+09 BK34 40.643800563168,-74.01165342693001
340551 50004330 KFC Brooklyn 5223 5 AVENUE 11220.0 7184922813 Chicken 04/10/2017 Establishment re-opened by DOHMH ... 04/10/2017 07/07/2020 Cycle Inspection / Reopening Inspection 307.0 38.0 10000.0 3013937.0 3.008080e+09 BK34 40.643800563168,-74.01165342693001
395508 41144258 BURGER KING Brooklyn 5212 5 AVENUE 11220.0 7187650844 Hamburgers 12/17/2019 Violations were cited in the following area(s). ... 12/17/2019 07/07/2020 Cycle Inspection / Re-inspection 307.0 38.0 7600.0 3329902.0 3.008070e+09 BK32 40.643852716573,-74.011628212186
309366 40396492 ROYAL KING'S PIZZA Brooklyn 5211 5 AVENUE 11220.0 7184923846 Pizza 12/28/2017 Violations were cited in the following area(s). ... 12/28/2017 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 10000.0 3013939.0 3.008080e+09 BK34 40.643849974348996,-74.01160298782
340857 40396492 ROYAL KING'S PIZZA Brooklyn 5211 5 AVENUE 11220.0 7184923846 Pizza 01/29/2020 Violations were cited in the following area(s). ... 01/29/2020 07/07/2020 Cycle Inspection / Re-inspection 307.0 38.0 10000.0 3013939.0 3.008080e+09 BK34 40.643849974348996,-74.01160298782
358660 50004330 KFC Brooklyn 5223 5 AVENUE 11220.0 7184922813 Chicken 05/31/2018 Violations were cited in the following area(s). ... 05/31/2018 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 10000.0 3013937.0 3.008080e+09 BK34 40.643800563168,-74.01165342693001
393451 41271801 PINO'S Brooklyn 5201 5 AVENUE 11220.0 7184396012 Pizza 06/05/2018 Violations were cited in the following area(s). ... 06/05/2018 07/07/2020 Cycle Inspection / Initial Inspection 307.0 38.0 10000.0 3013942.0 3.008080e+09 BK34 40.643888405293005,-74.011563356969

41 rows × 25 columns

[43]:
# Full-text search example
df.es_query({"match": {"dba": "red"}})
[43]:
camis dba boro building street zipcode phone cuisine_description inspection_date action ... grade_date record_date inspection_type community_board council_district census_tract bin bbl nta location
5765 50033781 RED HOOK LOBSTER POUND Brooklyn 284 VAN BRUNT STREET 11231.0 7188587650 Seafood 04/19/2018 Violations were cited in the following area(s). ... 04/19/2018 07/07/2020 Cycle Inspection / Initial Inspection 306.0 38.0 5900.0 3008365.0 3.005290e+09 BK33 40.67974632809,-74.010098611838
12379 50058053 RED HOT II Brooklyn 349 7 AVENUE 11215.0 7183692577 Chinese 05/17/2018 Violations were cited in the following area(s). ... 05/17/2018 07/07/2020 Cycle Inspection / Re-inspection 306.0 39.0 15100.0 3026127.0 3.010940e+09 BK37 40.666194419994,-73.98214269199799
12978 50059700 RED POKE Manhattan 600 9 AVENUE 10036.0 2129748100 Hawaiian 03/21/2017 Violations were cited in the following area(s). ... 03/21/2017 07/07/2020 Pre-permit (Operational) / Re-inspection 104.0 3.0 12100.0 1088997.0 1.010330e+09 MN15 40.758993434643,-73.992203122611
16759 40365239 DORRIAN'S RED HAND RESTAURANT Manhattan 1616 2 AVENUE 10028.0 2127726660 Irish 11/08/2018 Violations were cited in the following area(s). ... 11/08/2018 07/07/2020 Cycle Inspection / Initial Inspection 108.0 5.0 13800.0 1049947.0 1.015460e+09 MN32 40.776404966262,-73.952802065662
18624 50095340 RED PEONY CHINESE CUISINE Manhattan 24 WEST 56 STREET 10019.0 2123808883 Chinese 11/21/2019 Violations were cited in the following area(s). ... 11/21/2019 07/07/2020 Pre-permit (Operational) / Re-inspection 105.0 4.0 10400.0 1034840.0 1.012710e+09 MN17 40.762699245064,-73.975463733228
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
391229 50061162 CODE RED Bronx 1320 EAST GUN HILL ROAD 10469.0 7188811808 Caribbean 05/14/2018 Violations were cited in the following area(s). ... 05/14/2018 07/07/2020 Cycle Inspection / Initial Inspection 211.0 12.0 35000.0 2056100.0 2.045890e+09 BX31 40.871378316318996,-73.848028279305
393531 50014078 RED LOBSTER Manhattan 5 TIMES SQ 10036.0 2127306706 Seafood 11/08/2017 Violations were cited in the following area(s). ... 11/08/2017 07/07/2020 Cycle Inspection / Re-inspection 105.0 3.0 11300.0 1024656.0 1.010130e+09 MN17 40.755702020307005,-73.987207980138
396171 40368313 RED FLAME DINER Manhattan 67 WEST 44 STREET 10036.0 2128693965 American 02/16/2018 Violations were cited in the following area(s). ... 02/16/2018 07/07/2020 Cycle Inspection / Initial Inspection 105.0 4.0 9600.0 1034217.0 1.012600e+09 MN17 40.755627203336,-73.981938150269
396501 50068499 RED GINGER Staten Island 1650 RICHMOND AVENUE 10314.0 7189828808 Other 09/19/2017 Violations were cited in the following area(s). ... 09/19/2017 07/07/2020 Pre-permit (Operational) / Initial Inspection 502.0 50.0 29103.0 5037014.0 5.022360e+09 SI05 40.608078102502,-74.162260908042
398950 50059700 RED POKE Manhattan 600 9 AVENUE 10036.0 2129748100 Hawaiian 12/08/2017 Violations were cited in the following area(s). ... 12/08/2017 07/07/2020 Cycle Inspection / Re-inspection 104.0 3.0 12100.0 1088997.0 1.010330e+09 MN15 40.758993434643,-73.992203122611

573 rows × 25 columns

[44]:
# Pull a subset of your data for building graphs / operations locally.
sample_df = df[df.grade == "B"].sample(100).to_pandas()
sample_df.info()
print(type(sample_df))
<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 107677 to 96813
Data columns (total 25 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   camis                  100 non-null    int64
 1   dba                    100 non-null    object
 2   boro                   100 non-null    object
 3   building               100 non-null    object
 4   street                 100 non-null    object
 5   zipcode                100 non-null    float64
 6   phone                  100 non-null    object
 7   cuisine_description    100 non-null    object
 8   inspection_date        100 non-null    object
 9   action                 100 non-null    object
 10  violation_code         100 non-null    object
 11  violation_description  100 non-null    object
 12  critical_flag          100 non-null    object
 13  score                  100 non-null    float64
 14  grade                  100 non-null    object
 15  grade_date             100 non-null    object
 16  record_date            100 non-null    object
 17  inspection_type        100 non-null    object
 18  community_board        100 non-null    float64
 19  council_district       100 non-null    float64
 20  census_tract           100 non-null    float64
 21  bin                    100 non-null    float64
 22  bbl                    100 non-null    float64
 23  nta                    100 non-null    object
 24  location               100 non-null    object
dtypes: float64(7), int64(1), object(17)
memory usage: 20.3+ KB
<class 'pandas.core.frame.DataFrame'>
Machine Learning Demo#
[45]:
# Import scikit-learn and train a model locally
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

# Load the wine dataset and train locally
wine = datasets.load_wine()
print("Feature Names:", wine.feature_names)
print("Data example:", wine.data[0])

# Hold out samples 10, 80, and 140 for testing our model
data = [x for i, x in enumerate(wine.data) if i not in (10, 80, 140)]
target = [x for i, x in enumerate(wine.target) if i not in (10, 80, 140)]

sk_classifier = DecisionTreeClassifier()
sk_classifier.fit(data, target)

# Test our model against the three held-out samples
print(sk_classifier.predict(wine.data[[10, 80, 140]]))
print(wine.target[[10, 80, 140]])
Feature Names: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Data example: [1.423e+01 1.710e+00 2.430e+00 1.560e+01 1.270e+02 2.800e+00 3.060e+00
 2.800e-01 2.290e+00 5.640e+00 1.040e+00 3.920e+00 1.065e+03]
[0 1 2]
[0 1 2]
[46]:
from eland.ml import MLModel

# Serialize the scikit-learn model into Elasticsearch
ed_classifier = MLModel.import_model(
  es_client=es,
  model_id="wine-classifier",
  model=sk_classifier,
  feature_names=wine.feature_names,
  overwrite=True
)

# Capture the Elasticsearch API call w/ logging
import logging
logger = logging.getLogger("elasticsearch")
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

# Use the same data as before, but now with the model in Elasticsearch
print(ed_classifier.predict(wine.data[[10, 80, 140]].tolist()))
print(wine.target[[10, 80, 140]])

logger.handlers = []
POST https://167e473c7bba4bae85004385d4e0ce46.us-central1.gcp.cloud.es.io/_ingest/pipeline/_simulate [status:200 request:0.053s]
> {"pipeline":{"processors":[{"inference":{"model_id":"wine-classifier","inference_config":{"classification":{}},"field_map":{}}}]},"docs":[{"_source":{"alcohol":14.1,"malic_acid":2.16,"ash":2.3,"alcalinity_of_ash":18.0,"magnesium":105.0,"total_phenols":2.95,"flavanoids":3.32,"nonflavanoid_phenols":0.22,"proanthocyanins":2.38,"color_intensity":5.75,"hue":1.25,"od280/od315_of_diluted_wines":3.17,"proline":1510.0}},{"_source":{"alcohol":12.0,"malic_acid":0.92,"ash":2.0,"alcalinity_of_ash":19.0,"magnesium":86.0,"total_phenols":2.42,"flavanoids":2.26,"nonflavanoid_phenols":0.3,"proanthocyanins":1.43,"color_intensity":2.5,"hue":1.38,"od280/od315_of_diluted_wines":3.12,"proline":278.0}},{"_source":{"alcohol":12.93,"malic_acid":2.81,"ash":2.7,"alcalinity_of_ash":21.0,"magnesium":96.0,"total_phenols":1.54,"flavanoids":0.5,"nonflavanoid_phenols":0.53,"proanthocyanins":0.75,"color_intensity":4.6,"hue":0.77,"od280/od315_of_diluted_wines":2.31,"proline":600.0}}]}
< {"docs":[{"doc":{"_index":"_index","_type":"_doc","_id":"_id","_source":{"alcohol":14.1,"alcalinity_of_ash":18.0,"proanthocyanins":2.38,"od280/od315_of_diluted_wines":3.17,"total_phenols":2.95,"magnesium":105.0,"flavanoids":3.32,"proline":1510.0,"malic_acid":2.16,"ash":2.3,"nonflavanoid_phenols":0.22,"hue":1.25,"color_intensity":5.75,"ml":{"inference":{"predicted_value":"0","model_id":"wine-classifier"}}},"_ingest":{"timestamp":"2020-07-08T15:35:49.98965Z"}}},{"doc":{"_index":"_index","_type":"_doc","_id":"_id","_source":{"alcohol":12.0,"alcalinity_of_ash":19.0,"proanthocyanins":1.43,"od280/od315_of_diluted_wines":3.12,"total_phenols":2.42,"magnesium":86.0,"flavanoids":2.26,"proline":278.0,"malic_acid":0.92,"ash":2.0,"nonflavanoid_phenols":0.3,"hue":1.38,"color_intensity":2.5,"ml":{"inference":{"predicted_value":"1","model_id":"wine-classifier"}}},"_ingest":{"timestamp":"2020-07-08T15:35:49.98966Z"}}},{"doc":{"_index":"_index","_type":"_doc","_id":"_id","_source":{"alcohol":12.93,"alcalinity_of_ash":21.0,"proanthocyanins":0.75,"od280/od315_of_diluted_wines":2.31,"total_phenols":1.54,"magnesium":96.0,"flavanoids":0.5,"proline":600.0,"malic_acid":2.81,"ash":2.7,"nonflavanoid_phenols":0.53,"hue":0.77,"color_intensity":4.6,"ml":{"inference":{"predicted_value":"2","model_id":"wine-classifier"}}},"_ingest":{"timestamp":"2020-07-08T15:35:49.989672Z"}}}]}
[0 1 2]
[0 1 2]
[47]:
json({"pipeline":{"processors":[{"inference":{"model_id":"wine-classifier","inference_config":{"classification":{}},"field_map":{}}}]},"docs":[{"_source":{"alcohol":14.1,"malic_acid":2.16,"ash":2.3,"alcalinity_of_ash":18.0,"magnesium":105.0,"total_phenols":2.95,"flavanoids":3.32,"nonflavanoid_phenols":0.22,"proanthocyanins":2.38,"color_intensity":5.75,"hue":1.25,"od280/od315_of_diluted_wines":3.17,"proline":1510.0}},{"_source":{"alcohol":12.0,"malic_acid":0.92,"ash":2.0,"alcalinity_of_ash":19.0,"magnesium":86.0,"total_phenols":2.42,"flavanoids":2.26,"nonflavanoid_phenols":0.3,"proanthocyanins":1.43,"color_intensity":2.5,"hue":1.38,"od280/od315_of_diluted_wines":3.12,"proline":278.0}},{"_source":{"alcohol":12.93,"malic_acid":2.81,"ash":2.7,"alcalinity_of_ash":21.0,"magnesium":96.0,"total_phenols":1.54,"flavanoids":0.5,"nonflavanoid_phenols":0.53,"proanthocyanins":0.75,"color_intensity":4.6,"hue":0.77,"od280/od315_of_diluted_wines":2.31,"proline":600.0}}]})
{
  "docs": [
    {
      "_source": {
        "alcalinity_of_ash": 18.0,
        "alcohol": 14.1,
        "ash": 2.3,
        "color_intensity": 5.75,
        "flavanoids": 3.32,
        "hue": 1.25,
        "magnesium": 105.0,
        "malic_acid": 2.16,
        "nonflavanoid_phenols": 0.22,
        "od280/od315_of_diluted_wines": 3.17,
        "proanthocyanins": 2.38,
        "proline": 1510.0,
        "total_phenols": 2.95
      }
    },
    {
      "_source": {
        "alcalinity_of_ash": 19.0,
        "alcohol": 12.0,
        "ash": 2.0,
        "color_intensity": 2.5,
        "flavanoids": 2.26,
        "hue": 1.38,
        "magnesium": 86.0,
        "malic_acid": 0.92,
        "nonflavanoid_phenols": 0.3,
        "od280/od315_of_diluted_wines": 3.12,
        "proanthocyanins": 1.43,
        "proline": 278.0,
        "total_phenols": 2.42
      }
    },
    {
      "_source": {
        "alcalinity_of_ash": 21.0,
        "alcohol": 12.93,
        "ash": 2.7,
        "color_intensity": 4.6,
        "flavanoids": 0.5,
        "hue": 0.77,
        "magnesium": 96.0,
        "malic_acid": 2.81,
        "nonflavanoid_phenols": 0.53,
        "od280/od315_of_diluted_wines": 2.31,
        "proanthocyanins": 0.75,
        "proline": 600.0,
        "total_phenols": 1.54
      }
    }
  ],
  "pipeline": {
    "processors": [
      {
        "inference": {
          "field_map": {},
          "inference_config": {
            "classification": {}
          },
          "model_id": "wine-classifier"
        }
      }
    ]
  }
}
[48]:
json({"docs":[{"doc":{"_index":"_index","_type":"_doc","_id":"_id","_source":{"alcohol":14.1,"alcalinity_of_ash":18.0,"proanthocyanins":2.38,"od280/od315_of_diluted_wines":3.17,"total_phenols":2.95,"magnesium":105.0,"flavanoids":3.32,"proline":1510.0,"malic_acid":2.16,"ash":2.3,"nonflavanoid_phenols":0.22,"hue":1.25,"color_intensity":5.75,"ml":{"inference":{"predicted_value":"0","model_id":"wine-classifier"}}},"_ingest":{"timestamp":"2020-07-08T15:35:49.98965Z"}}},{"doc":{"_index":"_index","_type":"_doc","_id":"_id","_source":{"alcohol":12.0,"alcalinity_of_ash":19.0,"proanthocyanins":1.43,"od280/od315_of_diluted_wines":3.12,"total_phenols":2.42,"magnesium":86.0,"flavanoids":2.26,"proline":278.0,"malic_acid":0.92,"ash":2.0,"nonflavanoid_phenols":0.3,"hue":1.38,"color_intensity":2.5,"ml":{"inference":{"predicted_value":"1","model_id":"wine-classifier"}}},"_ingest":{"timestamp":"2020-07-08T15:35:49.98966Z"}}},{"doc":{"_index":"_index","_type":"_doc","_id":"_id","_source":{"alcohol":12.93,"alcalinity_of_ash":21.0,"proanthocyanins":0.75,"od280/od315_of_diluted_wines":2.31,"total_phenols":1.54,"magnesium":96.0,"flavanoids":0.5,"proline":600.0,"malic_acid":2.81,"ash":2.7,"nonflavanoid_phenols":0.53,"hue":0.77,"color_intensity":4.6,"ml":{"inference":{"predicted_value":"2","model_id":"wine-classifier"}}},"_ingest":{"timestamp":"2020-07-08T15:35:49.989672Z"}}}]})
{
  "docs": [
    {
      "doc": {
        "_id": "_id",
        "_index": "_index",
        "_ingest": {
          "timestamp": "2020-07-08T15:35:49.98965Z"
        },
        "_source": {
          "alcalinity_of_ash": 18.0,
          "alcohol": 14.1,
          "ash": 2.3,
          "color_intensity": 5.75,
          "flavanoids": 3.32,
          "hue": 1.25,
          "magnesium": 105.0,
          "malic_acid": 2.16,
          "ml": {
            "inference": {
              "model_id": "wine-classifier",
              "predicted_value": "0"
            }
          },
          "nonflavanoid_phenols": 0.22,
          "od280/od315_of_diluted_wines": 3.17,
          "proanthocyanins": 2.38,
          "proline": 1510.0,
          "total_phenols": 2.95
        },
        "_type": "_doc"
      }
    },
    {
      "doc": {
        "_id": "_id",
        "_index": "_index",
        "_ingest": {
          "timestamp": "2020-07-08T15:35:49.98966Z"
        },
        "_source": {
          "alcalinity_of_ash": 19.0,
          "alcohol": 12.0,
          "ash": 2.0,
          "color_intensity": 2.5,
          "flavanoids": 2.26,
          "hue": 1.38,
          "magnesium": 86.0,
          "malic_acid": 0.92,
          "ml": {
            "inference": {
              "model_id": "wine-classifier",
              "predicted_value": "1"
            }
          },
          "nonflavanoid_phenols": 0.3,
          "od280/od315_of_diluted_wines": 3.12,
          "proanthocyanins": 1.43,
          "proline": 278.0,
          "total_phenols": 2.42
        },
        "_type": "_doc"
      }
    },
    {
      "doc": {
        "_id": "_id",
        "_index": "_index",
        "_ingest": {
          "timestamp": "2020-07-08T15:35:49.989672Z"
        },
        "_source": {
          "alcalinity_of_ash": 21.0,
          "alcohol": 12.93,
          "ash": 2.7,
          "color_intensity": 4.6,
          "flavanoids": 0.5,
          "hue": 0.77,
          "magnesium": 96.0,
          "malic_acid": 2.81,
          "ml": {
            "inference": {
              "model_id": "wine-classifier",
              "predicted_value": "2"
            }
          },
          "nonflavanoid_phenols": 0.53,
          "od280/od315_of_diluted_wines": 2.31,
          "proanthocyanins": 0.75,
          "proline": 600.0,
          "total_phenols": 1.54
        },
        "_type": "_doc"
      }
    }
  ]
}
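Once imported, the model lives in Elasticsearch as a trained model; a hedged sketch (assuming the same es client and the wine-classifier model id from above) of inspecting it and cleaning it up afterwards:

# List the stored trained model configuration
resp = es.ml.get_trained_models(model_id="wine-classifier")
print(resp["trained_model_configs"][0]["model_id"])

# Remove the model from the cluster when finished experimenting
ed_classifier.delete_model()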
[50]:
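# Comparing two fields is translated into a Painless script query (see the query body below)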
print(df[df["zipcode"] > df["score"]].es_info())
es_index_pattern: nyc-restaurants
Index:
 es_index_field: _id
 is_source_field: False
Mappings:
 capabilities:
                               es_field_name  is_source   es_dtype es_date_format pd_dtype  is_searchable  is_aggregatable  is_scripted aggregatable_es_field_name
camis                                  camis       True       long           None    int64           True             True        False                      camis
dba                                      dba       True       text           None   object           True            False        False                       None
boro                                    boro       True    keyword           None   object           True             True        False                       boro
building                            building       True    keyword           None   object           True             True        False                   building
street                                street       True    keyword           None   object           True             True        False                     street
zipcode                              zipcode       True      short           None    int64           True             True        False                    zipcode
phone                                  phone       True    keyword           None   object           True             True        False                      phone
cuisine_description      cuisine_description       True    keyword           None   object           True             True        False        cuisine_description
inspection_date              inspection_date       True    keyword           None   object           True             True        False            inspection_date
action                                action       True    keyword           None   object           True             True        False                     action
violation_code                violation_code       True    keyword           None   object           True             True        False             violation_code
violation_description  violation_description       True    keyword           None   object           True             True        False      violation_description
critical_flag                  critical_flag       True    keyword           None   object           True             True        False              critical_flag
score                                  score       True     double           None  float64           True             True        False                      score
grade                                  grade       True    keyword           None   object           True             True        False                      grade
grade_date                        grade_date       True    keyword           None   object           True             True        False                 grade_date
record_date                      record_date       True    keyword           None   object           True             True        False                record_date
inspection_type              inspection_type       True    keyword           None   object           True             True        False            inspection_type
community_board              community_board       True     double           None  float64           True             True        False            community_board
council_district            council_district       True     double           None  float64           True             True        False           council_district
census_tract                    census_tract       True     double           None  float64           True             True        False               census_tract
bin                                      bin       True     double           None  float64           True             True        False                        bin
bbl                                      bbl       True     double           None  float64           True             True        False                        bbl
nta                                      nta       True    keyword           None   object           True             True        False                        nta
location                            location       True  geo_point           None   object           True             True        False                   location
Operations:
 tasks: [('boolean_filter': ('boolean_filter': {'script': {'script': {'source': "doc['zipcode'].value > doc['score'].value", 'lang': 'painless'}}}))]
 size: None
 sort_params: None
 _source: ['camis', 'dba', 'boro', 'building', 'street', 'zipcode', 'phone', 'cuisine_description', 'inspection_date', 'action', 'violation_code', 'violation_description', 'critical_flag', 'score', 'grade', 'grade_date', 'record_date', 'inspection_type', 'community_board', 'council_district', 'census_tract', 'bin', 'bbl', 'nta', 'location']
 body: {'query': {'script': {'script': {'source': "doc['zipcode'].value > doc['score'].value", 'lang': 'painless'}}}}
 post_processing: []

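For comparison, a rough sketch (using the same nyc-restaurants frame) of issuing that script query explicitly through es_query(), mirroring the body shown above:

df.es_query({
    "script": {
        "script": {
            "source": "doc['zipcode'].value > doc['score'].value",
            "lang": "painless",
        }
    }
})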

Online Retail Analysis#

[1]:
import eland as ed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Fix console size for consistent test results
from eland.conftest import *

Getting Started#

To get started, let’s create an eland.DataFrame by reading a csv file. This creates and populates the online-retail index in the local Elasticsearch cluster.

[2]:
df = ed.csv_to_eland("data/online-retail.csv.gz",
                     es_client='http://localhost:9200',
                     es_dest_index='online-retail',
                     es_if_exists='replace',
                     es_dropna=True,
                     es_refresh=True,
                     compression='gzip',
                     index_col=0)

Here we see that the "_id" field was used to index our data frame.

[3]:
df.index.es_index_field
[3]:
'_id'

Next, we can check which fields from Elasticsearch are available to our eland data frame. columns is available as a parameter when instantiating the data frame, which allows one to choose only a subset of fields from your index to be included in the data frame. Since we didn’t set this parameter, we have access to all fields.
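
For example, a minimal sketch (assuming the online-retail index created above and the same local cluster) of constructing a frame limited to a few fields via the columns parameter:

# Only these fields are mapped into the eland DataFrame
df_subset = ed.DataFrame(
    es_client='http://localhost:9200',
    es_index_pattern='online-retail',
    columns=['Country', 'Quantity', 'UnitPrice'],
)
df_subset.columns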

[4]:
df.columns
[4]:
Index(['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode',
       'UnitPrice'],
      dtype='object')

Now, let’s see the data types of our fields. Running df.dtypes, we can see that Elasticsearch field types are mapped to pandas field types.

[5]:
df.dtypes
[5]:
Country         object
CustomerID     float64
Description     object
InvoiceDate     object
InvoiceNo       object
Quantity         int64
StockCode       object
UnitPrice      float64
dtype: object

We also offer a .es_info() data frame method that shows all the information about the underlying index. It also contains information about operations being passed from data frame methods to Elasticsearch. More on this later.

[6]:
print(df.es_info())
es_index_pattern: online-retail
Index:
 es_index_field: _id
 is_source_field: False
Mappings:
 capabilities:
            es_field_name  is_source es_dtype es_date_format pd_dtype  is_searchable  is_aggregatable  is_scripted aggregatable_es_field_name
Country           Country       True  keyword           None   object           True             True        False                    Country
CustomerID     CustomerID       True   double           None  float64           True             True        False                 CustomerID
Description   Description       True  keyword           None   object           True             True        False                Description
InvoiceDate   InvoiceDate       True  keyword           None   object           True             True        False                InvoiceDate
InvoiceNo       InvoiceNo       True  keyword           None   object           True             True        False                  InvoiceNo
Quantity         Quantity       True     long           None    int64           True             True        False                   Quantity
StockCode       StockCode       True  keyword           None   object           True             True        False                  StockCode
UnitPrice       UnitPrice       True   double           None  float64           True             True        False                  UnitPrice
Operations:
 tasks: []
 size: None
 sort_params: None
 _source: ['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode', 'UnitPrice']
 body: {}
 post_processing: []

Selecting and Indexing Data#

Now that we understand how to create a data frame and get access to its underlying attributes, let’s see how we can select subsets of our data.

head and tail#

Much like pandas, eland data frames offer .head(n) and .tail(n) methods that return the first and last n rows, respectively.

[7]:
df.head(2)
[7]:
Country CustomerID ... StockCode UnitPrice
0 United Kingdom 17850.0 ... 85123A 2.55
1 United Kingdom 17850.0 ... 71053 3.39

2 rows × 8 columns

[8]:
print(df.tail(2).head(2).tail(2).es_info())
es_index_pattern: online-retail
Index:
 es_index_field: _id
 is_source_field: False
Mappings:
 capabilities:
            es_field_name  is_source es_dtype es_date_format pd_dtype  is_searchable  is_aggregatable  is_scripted aggregatable_es_field_name
Country           Country       True  keyword           None   object           True             True        False                    Country
CustomerID     CustomerID       True   double           None  float64           True             True        False                 CustomerID
Description   Description       True  keyword           None   object           True             True        False                Description
InvoiceDate   InvoiceDate       True  keyword           None   object           True             True        False                InvoiceDate
InvoiceNo       InvoiceNo       True  keyword           None   object           True             True        False                  InvoiceNo
Quantity         Quantity       True     long           None    int64           True             True        False                   Quantity
StockCode       StockCode       True  keyword           None   object           True             True        False                  StockCode
UnitPrice       UnitPrice       True   double           None  float64           True             True        False                  UnitPrice
Operations:
 tasks: [('tail': ('sort_field': '_doc', 'count': 2)), ('head': ('sort_field': '_doc', 'count': 2)), ('tail': ('sort_field': '_doc', 'count': 2))]
 size: 2
 sort_params: {'_doc': 'desc'}
 _source: ['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode', 'UnitPrice']
 body: {}
 post_processing: [('sort_index'), ('head': ('count': 2)), ('tail': ('count': 2))]

[9]:
df.tail(2)
[9]:
Country CustomerID ... StockCode UnitPrice
12498 United Kingdom 16710.0 ... 20975 0.65
12499 United Kingdom 16710.0 ... 22445 2.95

2 rows × 8 columns

Selecting columns#

You can also pass a list of columns to select columns from the data frame in a specified order.

[10]:
df[['Country', 'InvoiceDate']].head(5)
[10]:
Country InvoiceDate
0 United Kingdom 2010-12-01 08:26:00
1 United Kingdom 2010-12-01 08:26:00
2 United Kingdom 2010-12-01 08:26:00
3 United Kingdom 2010-12-01 08:26:00
4 United Kingdom 2010-12-01 08:26:00

5 rows × 2 columns

Boolean Indexing#

We also allow you to filter the data frame using boolean indexing. Under the hood, a boolean index maps to an Elasticsearch term (or terms) query that is then used to filter the index.

[11]:
# the construction of a boolean vector maps directly to an elasticsearch query
print(df['Country']=='Germany')
df[(df['Country']=='Germany')].head(5)
{'term': {'Country': 'Germany'}}
[11]:
Country CustomerID ... StockCode UnitPrice
1109 Germany 12662.0 ... 22809 2.95
1110 Germany 12662.0 ... 84347 2.55
1111 Germany 12662.0 ... 84945 0.85
1112 Germany 12662.0 ... 22242 1.65
1113 Germany 12662.0 ... 22244 1.95

5 rows × 8 columns

We can also filter the data frame using a list of values.

[12]:
print(df['Country'].isin(['Germany', 'United States']))
df[df['Country'].isin(['Germany', 'United Kingdom'])].head(5)
{'terms': {'Country': ['Germany', 'United States']}}
[12]:
Country CustomerID ... StockCode UnitPrice
0 United Kingdom 17850.0 ... 85123A 2.55
1 United Kingdom 17850.0 ... 71053 3.39
2 United Kingdom 17850.0 ... 84406B 2.75
3 United Kingdom 17850.0 ... 84029G 3.39
4 United Kingdom 17850.0 ... 84029E 3.39

5 rows × 8 columns

We can also combine boolean vectors to further filter the data frame.

[13]:
df[(df['Country']=='Germany') & (df['Quantity']>90)]
[13]:
Country CustomerID ... StockCode UnitPrice

0 rows × 8 columns

Using this example, let’s see how eland translates this boolean filter into an Elasticsearch bool query.

[14]:
print(df[(df['Country']=='Germany') & (df['Quantity']>90)].es_info())
es_index_pattern: online-retail
Index:
 es_index_field: _id
 is_source_field: False
Mappings:
 capabilities:
            es_field_name  is_source es_dtype es_date_format pd_dtype  is_searchable  is_aggregatable  is_scripted aggregatable_es_field_name
Country           Country       True  keyword           None   object           True             True        False                    Country
CustomerID     CustomerID       True   double           None  float64           True             True        False                 CustomerID
Description   Description       True  keyword           None   object           True             True        False                Description
InvoiceDate   InvoiceDate       True  keyword           None   object           True             True        False                InvoiceDate
InvoiceNo       InvoiceNo       True  keyword           None   object           True             True        False                  InvoiceNo
Quantity         Quantity       True     long           None    int64           True             True        False                   Quantity
StockCode       StockCode       True  keyword           None   object           True             True        False                  StockCode
UnitPrice       UnitPrice       True   double           None  float64           True             True        False                  UnitPrice
Operations:
 tasks: [('boolean_filter': ('boolean_filter': {'bool': {'must': [{'term': {'Country': 'Germany'}}, {'range': {'Quantity': {'gt': 90}}}]}}))]
 size: None
 sort_params: None
 _source: ['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode', 'UnitPrice']
 body: {'query': {'bool': {'must': [{'term': {'Country': 'Germany'}}, {'range': {'Quantity': {'gt': 90}}}]}}}
 post_processing: []

Aggregation and Descriptive Statistics#

Let’s begin to ask some questions of our data and use eland to get the answers.

How many different countries are there?

[15]:
df['Country'].nunique()
[15]:
16

What is the total sum of products ordered?

[16]:
df['Quantity'].sum()
[16]:
111960

Show me the sum, mean, min, and max of the Quantity and UnitPrice fields

[17]:
df[['Quantity','UnitPrice']].agg(['sum', 'mean', 'max', 'min'])
[17]:
Quantity UnitPrice
sum 111960.000 61548.490000
mean 7.464 4.103233
max 2880.000 950.990000
min -9360.000 0.000000

Give me descriptive statistics for the entire data frame

[18]:
# NBVAL_IGNORE_OUTPUT
df.describe()
[18]:
CustomerID Quantity UnitPrice
count 10729.000000 15000.000000 15000.000000
mean 15590.776680 7.464000 4.103233
std 1764.189592 85.930116 20.106214
min 12347.000000 -9360.000000 0.000000
25% 14222.249164 1.000000 1.250000
50% 15663.037856 2.000000 2.510000
75% 17219.040670 6.425347 4.210000
max 18239.000000 2880.000000 950.990000

Show me a histogram of numeric columns

[19]:
df[(df['Quantity']>-50) &
   (df['Quantity']<50) &
   (df['UnitPrice']>0) &
   (df['UnitPrice']<100)][['Quantity', 'UnitPrice']].hist(figsize=[12,4], bins=30)
plt.show()
_images/examples_online_retail_analysis_37_0.png
[20]:
df[(df['Quantity']>-50) &
   (df['Quantity']<50) &
   (df['UnitPrice']>0) &
   (df['UnitPrice']<100)][['Quantity', 'UnitPrice']].hist(figsize=[12,4], bins=30, log=True)
plt.show()
_images/examples_online_retail_analysis_38_0.png
[21]:
df.query('Quantity>50 & UnitPrice<100')
[21]:
Country CustomerID ... StockCode UnitPrice
46 United Kingdom 13748.0 ... 22086 2.55
83 United Kingdom 15291.0 ... 21733 2.55
96 United Kingdom 14688.0 ... 21212 0.42
102 United Kingdom 14688.0 ... 85071B 0.38
176 United Kingdom 16029.0 ... 85099C 1.65
... ... ... ... ... ...
11924 United Kingdom 14708.0 ... 84945 0.72
12007 United Kingdom 18113.0 ... 84946 1.06
12015 United Kingdom 17596.0 ... 17003 0.21
12058 United Kingdom 17596.0 ... 84536A 0.42
12448 EIRE 14911.0 ... 84945 0.85

258 rows × 8 columns

Arithmetic Operations#

Numeric values

[22]:
df['Quantity'].head()
[22]:
0    6
1    6
2    8
3    6
4    6
Name: Quantity, dtype: int64
[23]:
df['UnitPrice'].head()
[23]:
0    2.55
1    3.39
2    2.75
3    3.39
4    3.39
Name: UnitPrice, dtype: float64
[24]:
product = df['Quantity'] * df['UnitPrice']
[25]:
product.head()
[25]:
0    15.30
1    20.34
2    22.00
3    20.34
4    20.34
dtype: float64

String concatenation

[26]:
df['Country'] + df['StockCode']
[26]:
0        United Kingdom85123A
1         United Kingdom71053
2        United Kingdom84406B
3        United Kingdom84029G
4        United Kingdom84029E
                 ...
12495     United Kingdom84692
12496     United Kingdom22075
12497     United Kingdom20979
12498     United Kingdom20975
12499     United Kingdom22445
Length: 15000, dtype: object

Development#

Contributing to Eland#

Eland is an open source project and we love to receive contributions from our community — you! There are many ways to contribute, from writing tutorials or blog posts, improving the documentation, submitting bug reports and feature requests or writing code which can be incorporated into eland itself.

Bug reports#

If you think you have found a bug in eland, first make sure that you are testing against the latest version of eland - your issue may already have been fixed. If not, search our issues list on GitHub in case a similar issue has already been opened.

It is very helpful if you can prepare a reproduction of the bug. In other words, provide a small test case which we can run to confirm your bug. It makes it easier to find the problem and to fix it. Test cases should be provided as Python scripts, ideally with some details of your Elasticsearch environment and index mappings, and (where appropriate) a pandas example.

Provide as much information as you can. You may think that the problem lies with your query, when actually it depends on how your data is indexed. The easier it is for us to recreate your problem, the faster it is likely to be fixed.

Feature requests#

If you find yourself wishing for a feature that doesn’t exist in eland, you are probably not alone. There are bound to be others out there with similar needs. Many of the features that eland has today have been added because our users saw the need. Open an issue on our issues list on GitHub which describes the feature you would like to see, why you need it, and how it should work.

Contributing code and documentation changes#

If you have a bugfix or new feature that you would like to contribute to eland, please find or open an issue about it first. Talk about what you would like to do. It may be that somebody is already working on it, or that there are particular issues that you should know about before implementing the change.

We enjoy working with contributors to get their code accepted. There are many approaches to fixing a problem and it is important to find the best approach before writing too much code.

Note that it is unlikely the project will merge refactors for the sake of refactoring. These types of pull requests have a high cost to maintainers in reviewing and testing with little to no tangible benefit. This especially includes changes generated by tools.

The process for contributing to any of the Elastic repositories is similar. Details for individual projects can be found below.

Fork and clone the repository#

You will need to fork the main eland code or documentation repository and clone it to your local machine. See the GitHub "fork a repo" page for help.

Further instructions for specific projects are given below.

Submitting your changes#

Once your changes and tests are ready to submit for review:

  1. Run the linter and test suite to ensure your changes do not break the existing code: (TODO Add link to the testing document)

    # Run Auto-format, lint, mypy type checker for your changes
    $ nox -s format
    
    # Run the test suite
    $ pytest --doctest-modules eland/ tests/
    $ pytest --nbval tests/notebook/
    
  2. Sign the Contributor License Agreement

    Please make sure you have signed our Contributor License Agreement. We are not asking you to assign copyright to us, but to give us the right to distribute your code without restriction. We ask this of all contributors in order to assure our users of the origin and continuing existence of the code. You only need to sign the CLA once.

  3. Rebase your changes

    Update your local repository with the most recent code from the main eland repository, and rebase your branch on top of the latest main branch. We prefer your initial changes to be squashed into a single commit. Later, if we ask you to make changes, add them as separate commits. This makes them easier to review. As a final step before merging we will either ask you to squash all commits yourself or we’ll do it for you.

  4. Submit a pull request

    Push your local changes to your forked copy of the repository and submit a pull request. In the pull request, choose a title which sums up the changes that you have made, and in the body provide more details about what your changes do. Also mention the number of the issue where discussion has taken place, eg “Closes #123”.

Then sit back and wait. There will probably be discussion about the pull request and, if any changes are needed, we would love to work with you to get your pull request merged into eland.

Please adhere to the general guideline that you should never force push to a publicly shared branch. Once you have opened your pull request, you should consider your branch publicly shared. Instead of force pushing you can just add incremental commits; this is generally easier on your reviewers. If you need to pick up changes from main, you can merge main into your branch. A reviewer might ask you to rebase a long-running pull request, in which case force pushing is okay for that request. Note that squashing at the end of the review process should also not be done; that can happen when the pull request is integrated via GitHub.

Contributing to the eland codebase#

Repository: https://github.com/elastic/eland

We develop internally using the PyCharm IDE, currently with a minimum version of PyCharm 2019.2.4.

Configuring PyCharm And Running Tests#

(All commands should be run from module root)

  • Create a new project via 'Check out from Version Control'->'Git' on the "Welcome to PyCharm" page <or other>

  • Enter the URL to your fork of eland
    <e.g. git@github.com:stevedodson/eland.git>

  • Click 'Yes' for 'Checkout from Version Control'

  • Configure PyCharm environment:

    In 'Preferences' configure a 'Project: eland'->'Project Interpreter'. Generally, we recommend creating a virtual environment. (TODO link to installing for python version support)

  • In 'Preferences' set 'Tools'->'Python Integrated Tools'->'Default test runner' to pytest

  • In 'Preferences' set 'Tools'->'Python Integrated Tools'->'Docstring format' to numpy

  • Install development requirements. Open a terminal in the virtual environment and run
    pip install -r requirements-dev.txt

  • Set up an Elasticsearch instance with Docker
    ELASTICSEARCH_VERSION=elasticsearch:7.x-SNAPSHOT .ci/run-elasticsearch.sh

  • Check http://localhost:9200 to verify that the Elasticsearch instance is running.

  • Install the local eland module <required to execute notebook tests>
    python setup.py install

  • To set up the test environment (note: this modifies Elasticsearch indices), run
    python -m tests.setup_tests

  • To validate the installation, open a Python console and run
    import eland as ed
    ed_df = ed.DataFrame('localhost', 'flights')

  • To run the automatic formatter and check for lint issues, run
    nox -s format

  • To test specific versions of Python, run
    nox -s test-3.8
    
Documentation#
  • Install pandoc on your system
    # For Ubuntu or Debian
    sudo apt-get install -y pandoc

  • Install documentation requirements. Open a terminal in the virtual environment and run
    pip install -r docs/requirements-docs.txt

  • To verify/generate the documentation, run
    nox -s docs
    

Implementation Details#

The goal of an eland.DataFrame is to enable users who are familiar with pandas.DataFrame to access, explore and manipulate data that resides in Elasticsearch.

Ideally, all data should reside in Elasticsearch and not in memory. This restricts the API, but allows access to huge data sets that do not fit into memory, and allows use of powerful Elasticsearch features such as aggregations.

Pandas and 3rd Party Storage Systems#

Generally, integrations with 3rd party storage systems (SQL, Google BigQuery, etc.) involve accessing these systems and reading all external data into an in-core pandas data structure. This also applies to Apache Arrow structures.

Whilst this provides access to data in these systems, for large datasets this can require significant in-core memory, and for systems such as Elasticsearch, bulk export of data can be an inefficient way of exploring the data.

An alternative option is to create an API that proxies pandas.DataFrame-like calls to Elasticsearch queries and operations. This could allow the Elasticsearch cluster to perform operations such as aggregations rather than exporting all the data and performing this operation in-core.
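
As a rough illustration of this proxying idea (a sketch under assumptions, not Eland's internal code; it assumes the online-retail index from the examples above and an elasticsearch-py 8.x client on a local cluster), a pandas-style mean can be answered by a single aggregation with no rows exported:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Equivalent of df['UnitPrice'].mean(): one aggregation, zero documents returned
resp = es.search(
    index="online-retail",
    size=0,
    aggs={"unit_price_avg": {"avg": {"field": "UnitPrice"}}},
)
print(resp["aggregations"]["unit_price_avg"]["value"])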

Implementation Options#

An option would be to replace the pandas.DataFrame backend in-core memory structures with Elasticsearch accessors. This would allow full access to the pandas.DataFrame APIs. However, this has issues:

  • If a pandas.DataFrame instance maps to an index, typical manipulation of a pandas.DataFrame may involve creating many derived pandas.DataFrame instances. Constructing an index per pandas.DataFrame may result in many Elasticsearch indices and a significant load on Elasticsearch. For example, df_a = df['a'] should not require separate Elasticsearch indices for df and df_a.

  • Not all pandas.DataFrame APIs map to things we may want to do in Elasticsearch. In particular, API calls that involve exporting all data from Elasticsearch into memory, e.g. df.to_dict(), are a poor fit.

  • The backend pandas.DataFrame structures are not easily abstractable and are deeply embedded in the implementation.

Another option is to create an eland.DataFrame API that mimics appropriate aspects of the pandas.DataFrame API. This resolves some of the issues above:

  • df_a = df['a'] could be implemented as a change to the Elasticsearch query used, rather than a new index.

  • Instead of supporting the entire pandas.DataFrame API we can support a subset appropriate for Elasticsearch. If additional calls are required, we could create an eland.DataFrame._to_pandas() method which would explicitly export all data to a pandas.DataFrame.

  • Creating a new eland.DataFrame API gives us full flexibility in terms of implementation. However, it does create a large amount of work which may duplicate a lot of the pandas code (for example, printing objects), and this creates maintenance issues.
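
A brief usage sketch of the resulting split (assuming the online-retail index from the examples above, and using the public eland_to_pandas export rather than the internal method named above): lazy operations only change the underlying Elasticsearch query, while an explicit export materializes data locally.

import eland as ed

df = ed.DataFrame('http://localhost:9200', 'online-retail')

# Lazy: selecting columns only narrows the query/_source; nothing is exported
df_subset = df[['Country', 'UnitPrice']]

# Explicit export: pull a small selection into an in-memory pandas.DataFrame
local_df = ed.eland_to_pandas(df_subset.head(100))
print(type(local_df))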