Date: Oct 27, 2020 Version: 7.9.1a1
Useful links: Source Repository | Issues & Ideas | Q&A Support
Eland is a Python Elasticsearch client for exploring and analyzing data in Elasticsearch with a familiar Pandas-compatible API.
Where possible, the package uses existing Python APIs and data structures to make it easy to switch from numpy, pandas, and scikit-learn to their Elasticsearch-powered equivalents. In general, the data resides in Elasticsearch and not in memory, which allows Eland to access large datasets stored in Elasticsearch.
Eland can be installed from PyPI via pip:
$ python -m pip install eland
Eland can also be installed from Conda Forge with Conda:
$ conda install -c conda-forge eland
If it’s your first time using Eland, we recommend looking through the Examples documentation for ideas on what Eland is capable of.
If you’re new to Elasticsearch, we recommend reading the documentation.
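As a quick sketch of the workflow (assuming a local Elasticsearch cluster containing a flights index, as in the demo notebooks later on this page):
import eland as ed

# Create a DataFrame backed by the 'flights' index; the data stays in
# Elasticsearch and is only queried on demand.
df = ed.DataFrame('localhost', 'flights')

# Familiar pandas-style calls are translated into Elasticsearch queries
# and aggregations under the hood.
print(df.shape)
print(df['AvgTicketPrice'].mean())
df.head()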
This page gives an overview of all public eland objects, functions and methods. All classes and functions exposed in eland.* namespace are public.
eland.*
The following table is structured as follows: the first column contains the method name, and the second column flags whether or not Eland supports the method on the corresponding object.
Note
Even if an interface is listed here as “supported”, that doesn’t mean all parameters are implemented. Sometimes only a subset of parameters is supported.
If you need an operation that is listed as not implemented, feel free to open an issue or give a thumbs-up to an existing issue. Contributions are also welcome!
Note
Some Pandas methods are not implementable due to the constraint of using Elasticsearch as the backend for the majority of data processing. Functions like DataFrame.iloc[i] or DataFrame.transpose() likely won’t be implementable in Eland because of this constraint.
For prioritization purposes, there is a list of methods ordered by usage, gathered from Kaggle.
This list should be generated automatically with the utils/generate-supported-apis.py script rather than modified manually.
ed.DataFrame.abs()
ed.DataFrame.add()
ed.DataFrame.add_prefix()
ed.DataFrame.add_suffix()
ed.DataFrame.agg()
ed.DataFrame.aggregate()
ed.DataFrame.align()
ed.DataFrame.all()
ed.DataFrame.any()
ed.DataFrame.append()
ed.DataFrame.apply()
ed.DataFrame.applymap()
ed.DataFrame.asfreq()
ed.DataFrame.asof()
ed.DataFrame.assign()
ed.DataFrame.astype()
ed.DataFrame.at
ed.DataFrame.at_time()
ed.DataFrame.attrs
ed.DataFrame.axes
ed.DataFrame.backfill()
ed.DataFrame.between_time()
ed.DataFrame.bfill()
ed.DataFrame.bool()
ed.DataFrame.boxplot()
ed.DataFrame.clip()
ed.DataFrame.columns
ed.DataFrame.combine()
ed.DataFrame.combine_first()
ed.DataFrame.compare()
ed.DataFrame.convert_dtypes()
ed.DataFrame.copy()
ed.DataFrame.corr()
ed.DataFrame.corrwith()
ed.DataFrame.count()
ed.DataFrame.cov()
ed.DataFrame.cummax()
ed.DataFrame.cummin()
ed.DataFrame.cumprod()
ed.DataFrame.cumsum()
ed.DataFrame.describe()
ed.DataFrame.diff()
ed.DataFrame.div()
ed.DataFrame.divide()
ed.DataFrame.dot()
ed.DataFrame.drop()
ed.DataFrame.drop_duplicates()
ed.DataFrame.droplevel()
ed.DataFrame.dropna()
ed.DataFrame.dtypes
ed.DataFrame.duplicated()
ed.DataFrame.empty
ed.DataFrame.eq()
ed.DataFrame.equals()
ed.DataFrame.eval()
ed.DataFrame.ewm()
ed.DataFrame.expanding()
ed.DataFrame.explode()
ed.DataFrame.ffill()
ed.DataFrame.fillna()
ed.DataFrame.filter()
ed.DataFrame.first()
ed.DataFrame.first_valid_index()
ed.DataFrame.floordiv()
ed.DataFrame.from_dict()
ed.DataFrame.from_records()
ed.DataFrame.ge()
ed.DataFrame.get()
ed.DataFrame.groupby()
ed.DataFrame.gt()
ed.DataFrame.head()
ed.DataFrame.hist()
ed.DataFrame.iat
ed.DataFrame.idxmax()
ed.DataFrame.idxmin()
ed.DataFrame.iloc
ed.DataFrame.index
ed.DataFrame.infer_objects()
ed.DataFrame.info()
ed.DataFrame.insert()
ed.DataFrame.interpolate()
ed.DataFrame.isin()
ed.DataFrame.isna()
ed.DataFrame.isnull()
ed.DataFrame.items()
ed.DataFrame.iteritems()
ed.DataFrame.iterrows()
ed.DataFrame.itertuples()
ed.DataFrame.join()
ed.DataFrame.keys()
ed.DataFrame.kurt()
ed.DataFrame.kurtosis()
ed.DataFrame.last()
ed.DataFrame.last_valid_index()
ed.DataFrame.le()
ed.DataFrame.loc
ed.DataFrame.lookup()
ed.DataFrame.lt()
ed.DataFrame.mad()
ed.DataFrame.mask()
ed.DataFrame.max()
ed.DataFrame.mean()
ed.DataFrame.median()
ed.DataFrame.melt()
ed.DataFrame.memory_usage()
ed.DataFrame.merge()
ed.DataFrame.min()
ed.DataFrame.mod()
ed.DataFrame.mode()
ed.DataFrame.mul()
ed.DataFrame.multiply()
ed.DataFrame.ndim
ed.DataFrame.ne()
ed.DataFrame.nlargest()
ed.DataFrame.notna()
ed.DataFrame.notnull()
ed.DataFrame.nsmallest()
ed.DataFrame.nunique()
ed.DataFrame.pad()
ed.DataFrame.pct_change()
ed.DataFrame.pipe()
ed.DataFrame.pivot()
ed.DataFrame.pivot_table()
ed.DataFrame.pop()
ed.DataFrame.pow()
ed.DataFrame.prod()
ed.DataFrame.product()
ed.DataFrame.quantile()
ed.DataFrame.query()
ed.DataFrame.radd()
ed.DataFrame.rank()
ed.DataFrame.rdiv()
ed.DataFrame.reindex()
ed.DataFrame.reindex_like()
ed.DataFrame.rename()
ed.DataFrame.rename_axis()
ed.DataFrame.reorder_levels()
ed.DataFrame.replace()
ed.DataFrame.resample()
ed.DataFrame.reset_index()
ed.DataFrame.rfloordiv()
ed.DataFrame.rmod()
ed.DataFrame.rmul()
ed.DataFrame.rolling()
ed.DataFrame.round()
ed.DataFrame.rpow()
ed.DataFrame.rsub()
ed.DataFrame.rtruediv()
ed.DataFrame.sample()
ed.DataFrame.select_dtypes()
ed.DataFrame.sem()
ed.DataFrame.set_axis()
ed.DataFrame.set_index()
ed.DataFrame.shape
ed.DataFrame.shift()
ed.DataFrame.size
ed.DataFrame.skew()
ed.DataFrame.slice_shift()
ed.DataFrame.sort_index()
ed.DataFrame.sort_values()
ed.DataFrame.squeeze()
ed.DataFrame.stack()
ed.DataFrame.std()
ed.DataFrame.style
ed.DataFrame.sub()
ed.DataFrame.subtract()
ed.DataFrame.sum()
ed.DataFrame.swapaxes()
ed.DataFrame.swaplevel()
ed.DataFrame.T
ed.DataFrame.tail()
ed.DataFrame.take()
ed.DataFrame.to_clipboard()
ed.DataFrame.to_csv()
ed.DataFrame.to_dict()
ed.DataFrame.to_excel()
ed.DataFrame.to_feather()
ed.DataFrame.to_gbq()
ed.DataFrame.to_hdf()
ed.DataFrame.to_html()
ed.DataFrame.to_json()
ed.DataFrame.to_latex()
ed.DataFrame.to_markdown()
ed.DataFrame.to_numpy()
ed.DataFrame.to_parquet()
ed.DataFrame.to_period()
ed.DataFrame.to_pickle()
ed.DataFrame.to_records()
ed.DataFrame.to_sql()
ed.DataFrame.to_stata()
ed.DataFrame.to_string()
ed.DataFrame.to_timestamp()
ed.DataFrame.to_xarray()
ed.DataFrame.transform()
ed.DataFrame.transpose()
ed.DataFrame.truediv()
ed.DataFrame.truncate()
ed.DataFrame.tshift()
ed.DataFrame.tz_convert()
ed.DataFrame.tz_localize()
ed.DataFrame.unstack()
ed.DataFrame.update()
ed.DataFrame.value_counts()
ed.DataFrame.values
ed.DataFrame.var()
ed.DataFrame.where()
ed.DataFrame.xs()
ed.DataFrame.__abs__()
ed.DataFrame.__add__()
ed.DataFrame.__and__()
ed.DataFrame.__annotations__
ed.DataFrame.__array__()
ed.DataFrame.__array_priority__
ed.DataFrame.__array_wrap__()
ed.DataFrame.__bool__()
ed.DataFrame.__contains__()
ed.DataFrame.__copy__()
ed.DataFrame.__deepcopy__()
ed.DataFrame.__delattr__
ed.DataFrame.__delitem__()
ed.DataFrame.__dict__
ed.DataFrame.__dir__()
ed.DataFrame.__div__()
ed.DataFrame.__doc__
ed.DataFrame.__eq__()
ed.DataFrame.__finalize__()
ed.DataFrame.__floordiv__()
ed.DataFrame.__format__
ed.DataFrame.__ge__()
ed.DataFrame.__getattr__()
ed.DataFrame.__getattribute__
ed.DataFrame.__getitem__()
ed.DataFrame.__getstate__()
ed.DataFrame.__gt__()
ed.DataFrame.__hash__()
ed.DataFrame.__iadd__()
ed.DataFrame.__iand__()
ed.DataFrame.__ifloordiv__()
ed.DataFrame.__imod__()
ed.DataFrame.__imul__()
ed.DataFrame.__init__()
ed.DataFrame.__init_subclass__
ed.DataFrame.__invert__()
ed.DataFrame.__ior__()
ed.DataFrame.__ipow__()
ed.DataFrame.__isub__()
ed.DataFrame.__iter__()
ed.DataFrame.__itruediv__()
ed.DataFrame.__ixor__()
ed.DataFrame.__le__()
ed.DataFrame.__len__()
ed.DataFrame.__lt__()
ed.DataFrame.__matmul__()
ed.DataFrame.__mod__()
ed.DataFrame.__module__
ed.DataFrame.__mul__()
ed.DataFrame.__ne__()
ed.DataFrame.__neg__()
ed.DataFrame.__new__
ed.DataFrame.__nonzero__()
ed.DataFrame.__or__()
ed.DataFrame.__pos__()
ed.DataFrame.__pow__()
ed.DataFrame.__radd__()
ed.DataFrame.__rand__()
ed.DataFrame.__rdiv__()
ed.DataFrame.__reduce__
ed.DataFrame.__reduce_ex__
ed.DataFrame.__repr__()
ed.DataFrame.__rfloordiv__()
ed.DataFrame.__rmatmul__()
ed.DataFrame.__rmod__()
ed.DataFrame.__rmul__()
ed.DataFrame.__ror__()
ed.DataFrame.__round__()
ed.DataFrame.__rpow__()
ed.DataFrame.__rsub__()
ed.DataFrame.__rtruediv__()
ed.DataFrame.__rxor__()
ed.DataFrame.__setattr__()
ed.DataFrame.__setitem__()
ed.DataFrame.__setstate__()
ed.DataFrame.__sizeof__()
ed.DataFrame.__str__
ed.DataFrame.__sub__()
ed.DataFrame.__subclasshook__
ed.DataFrame.__truediv__()
ed.DataFrame.__weakref__
ed.DataFrame.__xor__()
ed.Series.abs()
ed.Series.add()
ed.Series.add_prefix()
ed.Series.add_suffix()
ed.Series.agg()
ed.Series.aggregate()
ed.Series.align()
ed.Series.all()
ed.Series.any()
ed.Series.append()
ed.Series.apply()
ed.Series.argmax()
ed.Series.argmin()
ed.Series.argsort()
ed.Series.array
ed.Series.asfreq()
ed.Series.asof()
ed.Series.astype()
ed.Series.at
ed.Series.at_time()
ed.Series.attrs
ed.Series.autocorr()
ed.Series.axes
ed.Series.backfill()
ed.Series.between()
ed.Series.between_time()
ed.Series.bfill()
ed.Series.bool()
ed.Series.clip()
ed.Series.combine()
ed.Series.combine_first()
ed.Series.compare()
ed.Series.convert_dtypes()
ed.Series.copy()
ed.Series.corr()
ed.Series.count()
ed.Series.cov()
ed.Series.cummax()
ed.Series.cummin()
ed.Series.cumprod()
ed.Series.cumsum()
ed.Series.describe()
ed.Series.diff()
ed.Series.div()
ed.Series.divide()
ed.Series.divmod()
ed.Series.dot()
ed.Series.drop()
ed.Series.drop_duplicates()
ed.Series.droplevel()
ed.Series.dropna()
ed.Series.dtype
ed.Series.dtypes
ed.Series.duplicated()
ed.Series.empty
ed.Series.eq()
ed.Series.equals()
ed.Series.ewm()
ed.Series.expanding()
ed.Series.explode()
ed.Series.factorize()
ed.Series.ffill()
ed.Series.fillna()
ed.Series.filter()
ed.Series.first()
ed.Series.first_valid_index()
ed.Series.floordiv()
ed.Series.ge()
ed.Series.get()
ed.Series.groupby()
ed.Series.gt()
ed.Series.hasnans
ed.Series.head()
ed.Series.hist()
ed.Series.iat
ed.Series.idxmax()
ed.Series.idxmin()
ed.Series.iloc
ed.Series.index
ed.Series.infer_objects()
ed.Series.interpolate()
ed.Series.is_monotonic
ed.Series.is_monotonic_decreasing
ed.Series.is_monotonic_increasing
ed.Series.is_unique
ed.Series.isin()
ed.Series.isna()
ed.Series.isnull()
ed.Series.item()
ed.Series.items()
ed.Series.iteritems()
ed.Series.keys()
ed.Series.kurt()
ed.Series.kurtosis()
ed.Series.last()
ed.Series.last_valid_index()
ed.Series.le()
ed.Series.loc
ed.Series.lt()
ed.Series.mad()
ed.Series.map()
ed.Series.mask()
ed.Series.max()
ed.Series.mean()
ed.Series.median()
ed.Series.memory_usage()
ed.Series.min()
ed.Series.mod()
ed.Series.mode()
ed.Series.mul()
ed.Series.multiply()
ed.Series.name
ed.Series.nbytes
ed.Series.ndim
ed.Series.ne()
ed.Series.nlargest()
ed.Series.notna()
ed.Series.notnull()
ed.Series.nsmallest()
ed.Series.nunique()
ed.Series.pad()
ed.Series.pct_change()
ed.Series.pipe()
ed.Series.pop()
ed.Series.pow()
ed.Series.prod()
ed.Series.product()
ed.Series.quantile()
ed.Series.radd()
ed.Series.rank()
ed.Series.ravel()
ed.Series.rdiv()
ed.Series.rdivmod()
ed.Series.reindex()
ed.Series.reindex_like()
ed.Series.rename()
ed.Series.rename_axis()
ed.Series.reorder_levels()
ed.Series.repeat()
ed.Series.replace()
ed.Series.resample()
ed.Series.reset_index()
ed.Series.rfloordiv()
ed.Series.rmod()
ed.Series.rmul()
ed.Series.rolling()
ed.Series.round()
ed.Series.rpow()
ed.Series.rsub()
ed.Series.rtruediv()
ed.Series.sample()
ed.Series.searchsorted()
ed.Series.sem()
ed.Series.set_axis()
ed.Series.shape
ed.Series.shift()
ed.Series.size
ed.Series.skew()
ed.Series.slice_shift()
ed.Series.sort_index()
ed.Series.sort_values()
ed.Series.squeeze()
ed.Series.std()
ed.Series.sub()
ed.Series.subtract()
ed.Series.sum()
ed.Series.swapaxes()
ed.Series.swaplevel()
ed.Series.T
ed.Series.tail()
ed.Series.take()
ed.Series.to_clipboard()
ed.Series.to_csv()
ed.Series.to_dict()
ed.Series.to_excel()
ed.Series.to_frame()
ed.Series.to_hdf()
ed.Series.to_json()
ed.Series.to_latex()
ed.Series.to_list()
ed.Series.to_markdown()
ed.Series.to_numpy()
ed.Series.to_period()
ed.Series.to_pickle()
ed.Series.to_sql()
ed.Series.to_string()
ed.Series.to_timestamp()
ed.Series.to_xarray()
ed.Series.tolist()
ed.Series.transform()
ed.Series.transpose()
ed.Series.truediv()
ed.Series.truncate()
ed.Series.tshift()
ed.Series.tz_convert()
ed.Series.tz_localize()
ed.Series.unique()
ed.Series.unstack()
ed.Series.update()
ed.Series.value_counts()
ed.Series.values
ed.Series.var()
ed.Series.view()
ed.Series.where()
ed.Series.xs()
ed.Series.__abs__()
ed.Series.__add__()
ed.Series.__and__()
ed.Series.__annotations__
ed.Series.__array__()
ed.Series.__array_priority__
ed.Series.__array_ufunc__()
ed.Series.__array_wrap__()
ed.Series.__bool__()
ed.Series.__contains__()
ed.Series.__copy__()
ed.Series.__deepcopy__()
ed.Series.__delattr__
ed.Series.__delitem__()
ed.Series.__dict__
ed.Series.__dir__()
ed.Series.__div__()
ed.Series.__divmod__()
ed.Series.__doc__
ed.Series.__eq__()
ed.Series.__finalize__()
ed.Series.__float__()
ed.Series.__floordiv__()
ed.Series.__format__
ed.Series.__ge__()
ed.Series.__getattr__()
ed.Series.__getattribute__
ed.Series.__getitem__()
ed.Series.__getstate__()
ed.Series.__gt__()
ed.Series.__hash__()
ed.Series.__iadd__()
ed.Series.__iand__()
ed.Series.__ifloordiv__()
ed.Series.__imod__()
ed.Series.__imul__()
ed.Series.__init__()
ed.Series.__init_subclass__
ed.Series.__int__()
ed.Series.__invert__()
ed.Series.__ior__()
ed.Series.__ipow__()
ed.Series.__isub__()
ed.Series.__iter__()
ed.Series.__itruediv__()
ed.Series.__ixor__()
ed.Series.__le__()
ed.Series.__len__()
ed.Series.__long__()
ed.Series.__lt__()
ed.Series.__matmul__()
ed.Series.__mod__()
ed.Series.__module__
ed.Series.__mul__()
ed.Series.__ne__()
ed.Series.__neg__()
ed.Series.__new__
ed.Series.__nonzero__()
ed.Series.__or__()
ed.Series.__pos__()
ed.Series.__pow__()
ed.Series.__radd__()
ed.Series.__rand__()
ed.Series.__rdiv__()
ed.Series.__rdivmod__()
ed.Series.__reduce__
ed.Series.__reduce_ex__
ed.Series.__repr__()
ed.Series.__rfloordiv__()
ed.Series.__rmatmul__()
ed.Series.__rmod__()
ed.Series.__rmul__()
ed.Series.__ror__()
ed.Series.__round__()
ed.Series.__rpow__()
ed.Series.__rsub__()
ed.Series.__rtruediv__()
ed.Series.__rxor__()
ed.Series.__setattr__()
ed.Series.__setitem__()
ed.Series.__setstate__()
ed.Series.__sizeof__()
ed.Series.__str__
ed.Series.__sub__()
ed.Series.__subclasshook__
ed.Series.__truediv__()
ed.Series.__weakref__
ed.Series.__xor__()
DataFrame
DataFrame.index
DataFrame.columns
DataFrame.dtypes
DataFrame.select_dtypes
DataFrame.values
DataFrame.empty
DataFrame.shape
DataFrame.ndim
DataFrame.size
DataFrame.head
DataFrame.keys
DataFrame.tail
DataFrame.get
DataFrame.query
DataFrame.sample
DataFrame.agg
DataFrame.aggregate
DataFrame.count
DataFrame.describe
DataFrame.info
DataFrame.max
DataFrame.mean
DataFrame.min
DataFrame.median
DataFrame.mad
DataFrame.std
DataFrame.var
DataFrame.sum
DataFrame.nunique
DataFrame.drop
DataFrame.filter
DataFrame.hist
DataFrame.es_info
DataFrame.es_query
DataFrame.to_numpy
DataFrame.to_csv
DataFrame.to_html
DataFrame.to_string
DataFrame.to_pandas
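As an illustrative (not exhaustive) sketch, a few of the supported DataFrame methods above in action, again assuming the flights demo index:
import eland as ed

df = ed.DataFrame('localhost', 'flights')

print(df.shape)                          # resolved via the count API, not by fetching documents
print(df[['DistanceKilometers', 'AvgTicketPrice']].describe())   # server-side aggregations
small = df.head(5).to_pandas()           # materialize only a small slice locally
print(type(small))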
Series
Series.index
Series.dtype
Series.dtypes
Series.shape
Series.name
Series.empty
Series.ndim
Series.size
Series.head
Series.tail
Series.sample
Series.add
Series.sub
Series.subtract
Series.mul
Series.multiply
Series.div
Series.divide
Series.truediv
Series.floordiv
Series.mod
Series.pow
Series.radd
Series.rsub
Series.rsubtract
Series.rmul
Series.rmultiply
Series.rdiv
Series.rdivide
Series.rtruediv
Series.rfloordiv
Series.rmod
Series.rpow
Series.describe
Series.max
Series.mean
Series.min
Series.sum
Series.median
Series.mad
Series.std
Series.var
Series.nunique
Series.value_counts
Series.rename
Series.isna
Series.notna
Series.isnull
Series.notnull
Series.isin
Series.filter
Series.hist
Series.to_string
Series.to_numpy
Series.to_pandas
Series.es_info
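And a similar hedged sketch for Series (column names taken from the flights demo index):
import eland as ed

df = ed.DataFrame('localhost', 'flights')

print(df['Carrier'].value_counts())      # terms aggregation run in Elasticsearch

price = df['AvgTicketPrice']
print(price.min(), price.max(), price.mean())   # numeric aggregations

# Arithmetic between Series is evaluated server-side (as a scripted field).
speed = df['DistanceKilometers'] / df['FlightTimeMin']
print(speed.head())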
Machine learning is built into the Elastic Stack and enables users to gain insights into their Elasticsearch data. There is a wide range of capabilities, from identifying anomalies in your data to training and deploying regression or classification models based on Elasticsearch data.
To use the Elastic Stack machine learning features, you must have the appropriate license and at least one machine learning node in your Elasticsearch cluster. If Elastic Stack security features are enabled, you must also ensure your users have the necessary privileges.
The fastest way to get started with machine learning features is to start a free 14-day trial of Elastic Cloud.
See the Elasticsearch Machine Learning documentation for more details.
MLModel
MLModel.predict
MLModel.import_model
MLModel.exists_model
MLModel.delete_model
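A minimal sketch of that lifecycle, assuming an Elasticsearch client es, a fitted scikit-learn model sk_classifier, its feature_names, and sample feature rows samples (all as in the webinar notebook later on this page; the model ID here is hypothetical):
from eland.ml import MLModel

# Serialize a locally trained scikit-learn model into Elasticsearch.
es_model = MLModel.import_model(
    es_client=es,
    model_id="my-classifier",    # hypothetical model ID
    model=sk_classifier,
    feature_names=feature_names,
    overwrite=True
)

# Inference now runs inside Elasticsearch (via the inference ingest processor).
# 'samples' is assumed: a list of feature-value rows.
print(es_model.predict(samples))

# Remove the stored model when finished.
es_model.delete_model()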
Many of these methods, or variants thereof, are available on the objects that contain an index (Series/DataFrame), and those should most likely be used before calling these methods directly.
Index
pandas_to_eland
eland_to_pandas
csv_to_eland
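A short sketch of the round trip these helpers provide (client, index name, and toy data are illustrative):
import pandas as pd
import eland as ed
from elasticsearch import Elasticsearch

es = Elasticsearch('localhost')
pd_df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# Index a pandas DataFrame into Elasticsearch, returning an eland DataFrame.
ed_df = ed.pandas_to_eland(
    pd_df,
    es_client=es,
    es_dest_index='demo-index',    # hypothetical index name
    es_if_exists='replace',
    es_refresh=True
)

# Pull the data back into memory as a pandas DataFrame.
print(ed.eland_to_pandas(ed_df))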
[1]:
import eland as ed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from elasticsearch import Elasticsearch

# Import standard test settings for consistent results
from eland.conftest import *
Create an eland.DataFrame from a flights index
[2]:
ed_flights = ed.DataFrame('localhost', 'flights')
[3]:
type(ed_flights)
eland.dataframe.DataFrame
Compare to pandas DataFrame (created from the same data)
[4]:
pd_flights = ed.eland_to_pandas(ed_flights)
[5]:
type(pd_flights)
pandas.core.frame.DataFrame
[6]:
pd_flights.columns
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName', 'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers', 'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum', 'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName', 'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek', 'timestamp'], dtype='object')
[7]:
ed_flights.columns
[8]:
pd_flights.dtypes
AvgTicketPrice           float64
Cancelled                   bool
Carrier                   object
Dest                      object
DestAirportID             object
                           ...
OriginLocation            object
OriginRegion              object
OriginWeather             object
dayOfWeek                  int64
timestamp         datetime64[ns]
Length: 27, dtype: object
[9]:
ed_flights.dtypes
[10]:
pd_flights.select_dtypes(include=np.number)
13059 rows × 7 columns
[11]:
ed_flights.select_dtypes(include=np.number)
[12]:
pd_flights.empty
False
[13]:
ed_flights.empty
[14]:
pd_flights.shape
(13059, 27)
[15]:
ed_flights.shape
Note, eland.DataFrame.index does not mirror pandas.DataFrame.index.
[16]:
pd_flights.index
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ... '13049', '13050', '13051', '13052', '13053', '13054', '13055', '13056', '13057', '13058'], dtype='object', length=13059)
[17]:
# NBVAL_IGNORE_OUTPUT
ed_flights.index
<eland.index.Index at 0x7fa97243fe80>
[18]:
ed_flights.index.es_index_field
'_id'
Note, eland.DataFrame.values is not supported.
[19]:
pd_flights.values
array([[841.2656419677076, False, 'Kibana Airlines', ..., 'Sunny', 0,
        Timestamp('2018-01-01 00:00:00')],
       [882.9826615595518, False, 'Logstash Airways', ..., 'Clear', 0,
        Timestamp('2018-01-01 18:27:00')],
       [190.6369038508356, False, 'Logstash Airways', ..., 'Rain', 0,
        Timestamp('2018-01-01 17:11:14')],
       ...,
       [997.7518761454494, False, 'Logstash Airways', ..., 'Sunny', 6,
        Timestamp('2018-02-11 04:09:27')],
       [1102.8144645388556, False, 'JetBeats', ..., 'Hail', 6,
        Timestamp('2018-02-11 08:28:21')],
       [858.1443369038839, False, 'JetBeats', ..., 'Rain', 6,
        Timestamp('2018-02-11 14:54:34')]], dtype=object)
[20]:
try:
    ed_flights.values
except AttributeError as e:
    print(e)
This method would scan/scroll the entire Elasticsearch index(s) into memory. If this is explicitly required, and there is sufficient memory, call `ed.eland_to_pandas(ed_df).values`
[21]:
pd_flights.head()
5 rows × 27 columns
[22]:
ed_flights.head()
[23]:
pd_flights.tail()
[24]:
ed_flights.tail()
[25]:
pd_flights.keys()
[26]:
ed_flights.keys()
[27]:
pd_flights.get('Carrier')
0         Kibana Airlines
1        Logstash Airways
2        Logstash Airways
3         Kibana Airlines
4         Kibana Airlines
               ...
13054    Logstash Airways
13055    Logstash Airways
13056    Logstash Airways
13057            JetBeats
13058            JetBeats
Name: Carrier, Length: 13059, dtype: object
[28]:
ed_flights.get('Carrier')
[29]:
pd_flights.get(['Carrier', 'Origin'])
13059 rows × 2 columns
List input is not currently supported by eland.DataFrame.get.
[30]:
try:
    ed_flights.get(['Carrier', 'Origin'])
except TypeError as e:
    print(e)
unhashable type: 'list'
[31]:
pd_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled == True')
68 rows × 27 columns
eland.DataFrame.query requires a qualifier on bool columns, i.e. ed_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled') fails.
[32]:
ed_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled == True')
[33]:
pd_flights[(pd_flights.Carrier=="Kibana Airlines") & (pd_flights.AvgTicketPrice > 900.0) & (pd_flights.Cancelled == True)]
[34]:
ed_flights[(ed_flights.Carrier=="Kibana Airlines") & (ed_flights.AvgTicketPrice > 900.0) & (ed_flights.Cancelled == True)]
[35]:
pd_flights[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
eland.DataFrame.aggregate currently only supports numeric columns.
[36]:
ed_flights[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
[37]:
pd_flights.count()
AvgTicketPrice    13059
Cancelled         13059
Carrier           13059
Dest              13059
DestAirportID     13059
                  ...
OriginLocation    13059
OriginRegion      13059
OriginWeather     13059
dayOfWeek         13059
timestamp         13059
Length: 27, dtype: int64
[38]:
ed_flights.count()
[39]:
pd_flights.describe()
8 rows × 7 columns
Values returned from eland.DataFrame.describe may vary depending on the results of the underlying Elasticsearch aggregations.
[40]:
# NBVAL_IGNORE_OUTPUT
ed_flights.describe()
[41]:
pd_flights.info()
<class 'pandas.core.frame.DataFrame'>
Index: 13059 entries, 0 to 13058
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   AvgTicketPrice      13059 non-null  float64
 1   Cancelled           13059 non-null  bool
 2   Carrier             13059 non-null  object
 3   Dest                13059 non-null  object
 4   DestAirportID       13059 non-null  object
 5   DestCityName        13059 non-null  object
 6   DestCountry         13059 non-null  object
 7   DestLocation        13059 non-null  object
 8   DestRegion          13059 non-null  object
 9   DestWeather         13059 non-null  object
 10  DistanceKilometers  13059 non-null  float64
 11  DistanceMiles       13059 non-null  float64
 12  FlightDelay         13059 non-null  bool
 13  FlightDelayMin      13059 non-null  int64
 14  FlightDelayType     13059 non-null  object
 15  FlightNum           13059 non-null  object
 16  FlightTimeHour      13059 non-null  float64
 17  FlightTimeMin       13059 non-null  float64
 18  Origin              13059 non-null  object
 19  OriginAirportID     13059 non-null  object
 20  OriginCityName      13059 non-null  object
 21  OriginCountry       13059 non-null  object
 22  OriginLocation      13059 non-null  object
 23  OriginRegion        13059 non-null  object
 24  OriginWeather       13059 non-null  object
 25  dayOfWeek           13059 non-null  int64
 26  timestamp           13059 non-null  datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17)
memory usage: 3.2+ MB
[42]:
ed_flights.info()
<class 'eland.dataframe.DataFrame'>
Index: 13059 entries, 0 to 13058
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   AvgTicketPrice      13059 non-null  float64
 1   Cancelled           13059 non-null  bool
 2   Carrier             13059 non-null  object
 3   Dest                13059 non-null  object
 4   DestAirportID       13059 non-null  object
 5   DestCityName        13059 non-null  object
 6   DestCountry         13059 non-null  object
 7   DestLocation        13059 non-null  object
 8   DestRegion          13059 non-null  object
 9   DestWeather         13059 non-null  object
 10  DistanceKilometers  13059 non-null  float64
 11  DistanceMiles       13059 non-null  float64
 12  FlightDelay         13059 non-null  bool
 13  FlightDelayMin      13059 non-null  int64
 14  FlightDelayType     13059 non-null  object
 15  FlightNum           13059 non-null  object
 16  FlightTimeHour      13059 non-null  float64
 17  FlightTimeMin       13059 non-null  float64
 18  Origin              13059 non-null  object
 19  OriginAirportID     13059 non-null  object
 20  OriginCityName      13059 non-null  object
 21  OriginCountry       13059 non-null  object
 22  OriginLocation      13059 non-null  object
 23  OriginRegion        13059 non-null  object
 24  OriginWeather       13059 non-null  object
 25  dayOfWeek           13059 non-null  int64
 26  timestamp           13059 non-null  datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17)
memory usage: 80.0 bytes
[43]:
pd_flights.max(numeric_only=True)
AvgTicketPrice        1199.73
Cancelled                True
DistanceKilometers    19881.5
DistanceMiles         12353.8
FlightDelay              True
FlightDelayMin            360
FlightTimeHour         31.715
FlightTimeMin          1902.9
dayOfWeek                   6
dtype: object
eland.DataFrame.max, min, mean, and sum only aggregate numeric columns.
[44]:
ed_flights.max(numeric_only=True)
[45]:
pd_flights.min(numeric_only=True)
AvgTicketPrice        100.021
Cancelled               False
DistanceKilometers          0
DistanceMiles               0
FlightDelay             False
FlightDelayMin              0
FlightTimeHour              0
FlightTimeMin               0
dayOfWeek                   0
dtype: object
[46]:
ed_flights.min(numeric_only=True)
[47]:
pd_flights.mean(numeric_only=True)
AvgTicketPrice         628.253689
Cancelled                0.128494
DistanceKilometers    7092.142455
DistanceMiles         4406.853013
FlightDelay              0.251168
FlightDelayMin          47.335171
FlightTimeHour           8.518797
FlightTimeMin          511.127842
dayOfWeek                2.835975
dtype: float64
[48]:
ed_flights.mean(numeric_only=True)
AvgTicketPrice         628.253689
Cancelled                0.128494
DistanceKilometers    7092.142457
DistanceMiles         4406.853010
FlightDelay              0.251168
FlightDelayMin          47.335171
FlightTimeHour           8.518797
FlightTimeMin          511.127842
dayOfWeek                2.835975
dtype: float64
[49]:
pd_flights.sum(numeric_only=True)
AvgTicketPrice        8.204365e+06
Cancelled             1.678000e+03
DistanceKilometers    9.261629e+07
DistanceMiles         5.754909e+07
FlightDelay           3.280000e+03
FlightDelayMin        6.181500e+05
FlightTimeHour        1.112470e+05
FlightTimeMin         6.674818e+06
dayOfWeek             3.703500e+04
dtype: float64
[50]:
ed_flights.sum(numeric_only=True)
[51]:
pd_flights[['Carrier', 'Origin', 'Dest']].nunique()
Carrier      4
Origin     156
Dest       156
dtype: int64
[52]:
ed_flights[['Carrier', 'Origin', 'Dest']].nunique()
[53]:
pd_flights.drop(columns=['AvgTicketPrice', 'Cancelled', 'DestLocation', 'Dest', 'DestAirportID', 'DestCityName', 'DestCountry'])
13059 rows × 20 columns
[54]:
ed_flights.drop(columns=['AvgTicketPrice', 'Cancelled', 'DestLocation', 'Dest', 'DestAirportID', 'DestCityName', 'DestCountry'])
[55]:
pd_flights.select_dtypes(include=np.number).hist(figsize=[10,10])
plt.show()
[56]:
ed_flights.select_dtypes(include=np.number).hist(figsize=[10,10])
plt.show()
[57]:
ed_flights2 = ed_flights[(ed_flights.OriginAirportID == 'AMS') & (ed_flights.FlightDelayMin > 60)]
ed_flights2 = ed_flights2[['timestamp', 'OriginAirportID', 'DestAirportID', 'FlightDelayMin']]
ed_flights2 = ed_flights2.tail()
[58]:
print(ed_flights2.es_info())
es_index_pattern: flights
Index:
 es_index_field: _id
 is_source_field: False
Mappings:
 capabilities:
                  es_field_name  is_source  es_dtype  es_date_format  pd_dtype  is_searchable  is_aggregatable  is_scripted  aggregatable_es_field_name
 timestamp        timestamp        True  date       strict_date_hour_minute_second  datetime64[ns]  True  True  False  timestamp
 OriginAirportID  OriginAirportID  True  keyword    None                            object          True  True  False  OriginAirportID
 DestAirportID    DestAirportID    True  keyword    None                            object          True  True  False  DestAirportID
 FlightDelayMin   FlightDelayMin   True  integer    None                            int64           True  True  False  FlightDelayMin
Operations:
 tasks: [('boolean_filter': ('boolean_filter': {'bool': {'must': [{'term': {'OriginAirportID': 'AMS'}}, {'range': {'FlightDelayMin': {'gt': 60}}}]}})), ('tail': ('sort_field': '_doc', 'count': 5))]
 size: 5
 sort_params: _doc:desc
 _source: ['timestamp', 'OriginAirportID', 'DestAirportID', 'FlightDelayMin']
 body: {'query': {'bool': {'must': [{'term': {'OriginAirportID': 'AMS'}}, {'range': {'FlightDelayMin': {'gt': 60}}}]}}}
 post_processing: [('sort_index')]
This Jupyter Notebook accompanies the webinar ‘Introduction to Eland’, which is available on YouTube. To follow along, either create an Elasticsearch deployment on Elastic Cloud (free trial available) or start your own Elasticsearch cluster locally.
You’ll need to install the following libraries:
$ python -m pip install eland numpy pandas
# Standard imports
import eland as ed
import pandas as pd
import numpy as np
from elasticsearch import Elasticsearch

# Function for pretty-printing JSON
def json(x):
    import json
    print(json.dumps(x, indent=2, sort_keys=True))
# Connect to an Elastic Cloud instance
# or another Elasticsearch index below
ELASTIC_CLOUD_ID = "<cloud-id>"
ELASTIC_CLOUD_PASSWORD = "<password>"

es = Elasticsearch(
    cloud_id=ELASTIC_CLOUD_ID,
    http_auth=("elastic", ELASTIC_CLOUD_PASSWORD)
)
json(es.info())
{ "cluster_name": "167e473c7bba4bae85004385d4e0ce46", "cluster_uuid": "4Y2FwBhRSsWq9uGedb1DmQ", "name": "instance-0000000000", "tagline": "You Know, for Search", "version": { "build_date": "2020-06-14T19:35:50.234439Z", "build_flavor": "default", "build_hash": "757314695644ea9a1dc2fecd26d1a43856725e65", "build_snapshot": false, "build_type": "docker", "lucene_version": "8.5.1", "minimum_index_compatibility_version": "6.0.0-beta1", "minimum_wire_compatibility_version": "6.8.0", "number": "7.8.0" } }
# Load the dataset from NYC Open Data and take a look
pd_df = pd.read_csv("nyc-restaurants.csv").dropna()
pd_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 193197 entries, 0 to 400255
Data columns (total 26 columns):
 #   Column                 Non-Null Count   Dtype
---  ------                 --------------   -----
 0   CAMIS                  193197 non-null  int64
 1   DBA                    193197 non-null  object
 2   BORO                   193197 non-null  object
 3   BUILDING               193197 non-null  object
 4   STREET                 193197 non-null  object
 5   ZIPCODE                193197 non-null  float64
 6   PHONE                  193197 non-null  object
 7   CUISINE DESCRIPTION    193197 non-null  object
 8   INSPECTION DATE        193197 non-null  object
 9   ACTION                 193197 non-null  object
 10  VIOLATION CODE         193197 non-null  object
 11  VIOLATION DESCRIPTION  193197 non-null  object
 12  CRITICAL FLAG          193197 non-null  object
 13  SCORE                  193197 non-null  float64
 14  GRADE                  193197 non-null  object
 15  GRADE DATE             193197 non-null  object
 16  RECORD DATE            193197 non-null  object
 17  INSPECTION TYPE        193197 non-null  object
 18  Latitude               193197 non-null  float64
 19  Longitude              193197 non-null  float64
 20  Community Board        193197 non-null  float64
 21  Council District       193197 non-null  float64
 22  Census Tract           193197 non-null  float64
 23  BIN                    193197 non-null  float64
 24  BBL                    193197 non-null  float64
 25  NTA                    193197 non-null  object
dtypes: float64(9), int64(1), object(16)
memory usage: 39.8+ MB
# Rename the columns to be snake_case
pd_df.columns = [x.lower().replace(" ", "_") for x in pd_df.columns]

# Combine the 'latitude' and 'longitude' columns into one column 'location' for 'geo_point'
pd_df["location"] = pd_df[["latitude", "longitude"]].apply(lambda x: ",".join(str(item) for item in x), axis=1)

# Drop the old columns in favor of 'location'
pd_df.drop(["latitude", "longitude"], axis=1, inplace=True)
pd_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 193197 entries, 0 to 400255
Data columns (total 25 columns):
 #   Column                 Non-Null Count   Dtype
---  ------                 --------------   -----
 0   camis                  193197 non-null  int64
 1   dba                    193197 non-null  object
 2   boro                   193197 non-null  object
 3   building               193197 non-null  object
 4   street                 193197 non-null  object
 5   zipcode                193197 non-null  float64
 6   phone                  193197 non-null  object
 7   cuisine_description    193197 non-null  object
 8   inspection_date        193197 non-null  object
 9   action                 193197 non-null  object
 10  violation_code         193197 non-null  object
 11  violation_description  193197 non-null  object
 12  critical_flag          193197 non-null  object
 13  score                  193197 non-null  float64
 14  grade                  193197 non-null  object
 15  grade_date             193197 non-null  object
 16  record_date            193197 non-null  object
 17  inspection_type        193197 non-null  object
 18  community_board        193197 non-null  float64
 19  council_district       193197 non-null  float64
 20  census_tract           193197 non-null  float64
 21  bin                    193197 non-null  float64
 22  bbl                    193197 non-null  float64
 23  nta                    193197 non-null  object
 24  location               193197 non-null  object
dtypes: float64(7), int64(1), object(17)
memory usage: 38.3+ MB
df = ed.pandas_to_eland(
    pd_df=pd_df,
    es_client=es,

    # Where the data will live in Elasticsearch
    es_dest_index="nyc-restaurants",

    # Type overrides for certain columns, 'location' detected
    # automatically as 'keyword' but we want these interpreted as 'geo_point'.
    es_type_overrides={
        "location": "geo_point",
        "dba": "text",
        "zipcode": "short"
    },

    # If the index already exists what should we do?
    es_if_exists="replace",

    # Wait for data to be indexed before returning
    es_refresh=True,
)
df.info()
<class 'eland.dataframe.DataFrame'>
Index: 193197 entries, 10388 to 398749
Data columns (total 25 columns):
 #   Column                 Non-Null Count   Dtype
---  ------                 --------------   -----
 0   camis                  193197 non-null  int64
 1   dba                    193197 non-null  object
 2   boro                   193197 non-null  object
 3   building               193197 non-null  object
 4   street                 193197 non-null  object
 5   zipcode                193197 non-null  int64
 6   phone                  193197 non-null  object
 7   cuisine_description    193197 non-null  object
 8   inspection_date        193197 non-null  object
 9   action                 193197 non-null  object
 10  violation_code         193197 non-null  object
 11  violation_description  193197 non-null  object
 12  critical_flag          193197 non-null  object
 13  score                  193197 non-null  float64
 14  grade                  193197 non-null  object
 15  grade_date             193197 non-null  object
 16  record_date            193197 non-null  object
 17  inspection_type        193197 non-null  object
 18  community_board        193197 non-null  float64
 19  council_district       193197 non-null  float64
 20  census_tract           193197 non-null  float64
 21  bin                    193197 non-null  float64
 22  bbl                    193197 non-null  float64
 23  nta                    193197 non-null  object
 24  location               193197 non-null  object
dtypes: float64(6), int64(2), object(17)
memory usage: 80.0 bytes
json(es.indices.get_mapping(index="nyc-restaurants"))
{ "nyc-restaurants": { "mappings": { "properties": { "action": { "type": "keyword" }, "bbl": { "type": "double" }, "bin": { "type": "double" }, "boro": { "type": "keyword" }, "building": { "type": "keyword" }, "camis": { "type": "long" }, "census_tract": { "type": "double" }, "community_board": { "type": "double" }, "council_district": { "type": "double" }, "critical_flag": { "type": "keyword" }, "cuisine_description": { "type": "keyword" }, "dba": { "type": "text" }, "grade": { "type": "keyword" }, "grade_date": { "type": "keyword" }, "inspection_date": { "type": "keyword" }, "inspection_type": { "type": "keyword" }, "location": { "type": "geo_point" }, "nta": { "type": "keyword" }, "phone": { "type": "keyword" }, "record_date": { "type": "keyword" }, "score": { "type": "double" }, "street": { "type": "keyword" }, "violation_code": { "type": "keyword" }, "violation_description": { "type": "keyword" }, "zipcode": { "type": "short" } } } } }
# Shape is determined by using count API
df.shape
(193197, 25)
# DataFrame has many APIs compatible with Pandas
#df.head(10)
#df.columns
#df.dba
#df["grade"]
#df[df.grade.isin(["A", "B"])]
#print(df[df.grade.isin(["A", "B"])].es_info())
#print(df.tail(10).es_info())
es_index_pattern: nyc-restaurants
Index:
 es_index_field: _id
 is_source_field: False
Mappings:
 capabilities:
                        es_field_name  is_source  es_dtype  es_date_format  pd_dtype  is_searchable  is_aggregatable  is_scripted  aggregatable_es_field_name
 camis                  camis                  True  long       None  int64    True  True   False  camis
 dba                    dba                    True  text       None  object   True  False  False  None
 boro                   boro                   True  keyword    None  object   True  True   False  boro
 building               building               True  keyword    None  object   True  True   False  building
 street                 street                 True  keyword    None  object   True  True   False  street
 zipcode                zipcode                True  short      None  int64    True  True   False  zipcode
 phone                  phone                  True  keyword    None  object   True  True   False  phone
 cuisine_description    cuisine_description    True  keyword    None  object   True  True   False  cuisine_description
 inspection_date        inspection_date        True  keyword    None  object   True  True   False  inspection_date
 action                 action                 True  keyword    None  object   True  True   False  action
 violation_code         violation_code         True  keyword    None  object   True  True   False  violation_code
 violation_description  violation_description  True  keyword    None  object   True  True   False  violation_description
 critical_flag          critical_flag          True  keyword    None  object   True  True   False  critical_flag
 score                  score                  True  double     None  float64  True  True   False  score
 grade                  grade                  True  keyword    None  object   True  True   False  grade
 grade_date             grade_date             True  keyword    None  object   True  True   False  grade_date
 record_date            record_date            True  keyword    None  object   True  True   False  record_date
 inspection_type        inspection_type        True  keyword    None  object   True  True   False  inspection_type
 community_board        community_board        True  double     None  float64  True  True   False  community_board
 council_district       council_district       True  double     None  float64  True  True   False  council_district
 census_tract           census_tract           True  double     None  float64  True  True   False  census_tract
 bin                    bin                    True  double     None  float64  True  True   False  bin
 bbl                    bbl                    True  double     None  float64  True  True   False  bbl
 nta                    nta                    True  keyword    None  object   True  True   False  nta
 location               location               True  geo_point  None  object   True  True   False  location
Operations:
 tasks: [('tail': ('sort_field': '_doc', 'count': 10))]
 size: 10
 sort_params: _doc:desc
 _source: ['camis', 'dba', 'boro', 'building', 'street', 'zipcode', 'phone', 'cuisine_description', 'inspection_date', 'action', 'violation_code', 'violation_description', 'critical_flag', 'score', 'grade', 'grade_date', 'record_date', 'inspection_type', 'community_board', 'council_district', 'census_tract', 'bin', 'bbl', 'nta', 'location']
 body: {}
 post_processing: [('sort_index')]
# Aggregating values
df.describe()
# Plotting with matplotlib
from matplotlib import pyplot as plt

df[["score"]].hist(figsize=[10,10])
plt.show()
# es_query() allows for the full Elasticsearch querying capabilities
df.es_query({
    "geo_distance": {
        "distance": "50m",
        "location": {
            "lat": 40.643852716573,
            "lon": -74.011628212186
        }
    }
})
41 rows × 25 columns
# Full-text search example
df.es_query({"match": {"dba": "red"}})
573 rows × 25 columns
# Pull a subset of your data for building graphs / operations locally.
sample_df = df[df.grade == "B"].sample(100).to_pandas()
sample_df.info()
print(type(sample_df))
<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 107677 to 96813
Data columns (total 25 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   camis                  100 non-null    int64
 1   dba                    100 non-null    object
 2   boro                   100 non-null    object
 3   building               100 non-null    object
 4   street                 100 non-null    object
 5   zipcode                100 non-null    float64
 6   phone                  100 non-null    object
 7   cuisine_description    100 non-null    object
 8   inspection_date        100 non-null    object
 9   action                 100 non-null    object
 10  violation_code         100 non-null    object
 11  violation_description  100 non-null    object
 12  critical_flag          100 non-null    object
 13  score                  100 non-null    float64
 14  grade                  100 non-null    object
 15  grade_date             100 non-null    object
 16  record_date            100 non-null    object
 17  inspection_type        100 non-null    object
 18  community_board        100 non-null    float64
 19  council_district       100 non-null    float64
 20  census_tract           100 non-null    float64
 21  bin                    100 non-null    float64
 22  bbl                    100 non-null    float64
 23  nta                    100 non-null    object
 24  location               100 non-null    object
dtypes: float64(7), int64(1), object(17)
memory usage: 20.3+ KB
<class 'pandas.core.frame.DataFrame'>
# Import scikit-learn and train a dataset locally
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

# Train the data locally
digits = datasets.load_wine()
print("Feature Names:", digits.feature_names)
print("Data example:", digits.data[0])

# Save 10, 80, and 140 for testing our model
data = [x for i, x in enumerate(digits.data) if i not in (10, 80, 140)]
target = [x for i, x in enumerate(digits.target) if i not in (10, 80, 140)]

sk_classifier = DecisionTreeClassifier()
sk_classifier.fit(data, target)

# Test out our model against the three targets
print(sk_classifier.predict(digits.data[[10, 80, 140]]))
print(digits.target[[10, 80, 140]])
Feature Names: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Data example: [1.423e+01 1.710e+00 2.430e+00 1.560e+01 1.270e+02 2.800e+00 3.060e+00 2.800e-01 2.290e+00 5.640e+00 1.040e+00 3.920e+00 1.065e+03]
[0 1 2]
[0 1 2]
from eland.ml import MLModel

# Serialize the scikit-learn model into Elasticsearch
ed_classifier = MLModel.import_model(
    es_client=es,
    model_id="wine-classifier",
    model=sk_classifier,
    feature_names=digits.feature_names,
    overwrite=True
)

# Capture the Elasticsearch API call w/ logging
import logging
logger = logging.getLogger("elasticsearch")
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

# Use the same data as before, but now with the model in Elasticsearch
print(ed_classifier.predict(digits.data[[10, 80, 140]].tolist()))
print(digits.target[[10, 80, 140]])

logger.handlers = []
POST https://167e473c7bba4bae85004385d4e0ce46.us-central1.gcp.cloud.es.io/_ingest/pipeline/_simulate [status:200 request:0.053s]
> {"pipeline":{"processors":[{"inference":{"model_id":"wine-classifier","inference_config":{"classification":{}},"field_map":{}}}]},"docs":[{"_source":{"alcohol":14.1,"malic_acid":2.16,"ash":2.3,"alcalinity_of_ash":18.0,"magnesium":105.0,"total_phenols":2.95,"flavanoids":3.32,"nonflavanoid_phenols":0.22,"proanthocyanins":2.38,"color_intensity":5.75,"hue":1.25,"od280/od315_of_diluted_wines":3.17,"proline":1510.0}},{"_source":{"alcohol":12.0,"malic_acid":0.92,"ash":2.0,"alcalinity_of_ash":19.0,"magnesium":86.0,"total_phenols":2.42,"flavanoids":2.26,"nonflavanoid_phenols":0.3,"proanthocyanins":1.43,"color_intensity":2.5,"hue":1.38,"od280/od315_of_diluted_wines":3.12,"proline":278.0}},{"_source":{"alcohol":12.93,"malic_acid":2.81,"ash":2.7,"alcalinity_of_ash":21.0,"magnesium":96.0,"total_phenols":1.54,"flavanoids":0.5,"nonflavanoid_phenols":0.53,"proanthocyanins":0.75,"color_intensity":4.6,"hue":0.77,"od280/od315_of_diluted_wines":2.31,"proline":600.0}}]}
< {"docs":[{"doc":{"_index":"_index","_type":"_doc","_id":"_id","_source":{"alcohol":14.1,"alcalinity_of_ash":18.0,"proanthocyanins":2.38,"od280/od315_of_diluted_wines":3.17,"total_phenols":2.95,"magnesium":105.0,"flavanoids":3.32,"proline":1510.0,"malic_acid":2.16,"ash":2.3,"nonflavanoid_phenols":0.22,"hue":1.25,"color_intensity":5.75,"ml":{"inference":{"predicted_value":"0","model_id":"wine-classifier"}}},"_ingest":{"timestamp":"2020-07-08T15:35:49.98965Z"}}},{"doc":{"_index":"_index","_type":"_doc","_id":"_id","_source":{"alcohol":12.0,"alcalinity_of_ash":19.0,"proanthocyanins":1.43,"od280/od315_of_diluted_wines":3.12,"total_phenols":2.42,"magnesium":86.0,"flavanoids":2.26,"proline":278.0,"malic_acid":0.92,"ash":2.0,"nonflavanoid_phenols":0.3,"hue":1.38,"color_intensity":2.5,"ml":{"inference":{"predicted_value":"1","model_id":"wine-classifier"}}},"_ingest":{"timestamp":"2020-07-08T15:35:49.98966Z"}}},{"doc":{"_index":"_index","_type":"_doc","_id":"_id","_source":{"alcohol":12.93,"alcalinity_of_ash":21.0,"proanthocyanins":0.75,"od280/od315_of_diluted_wines":2.31,"total_phenols":1.54,"magnesium":96.0,"flavanoids":0.5,"proline":600.0,"malic_acid":2.81,"ash":2.7,"nonflavanoid_phenols":0.53,"hue":0.77,"color_intensity":4.6,"ml":{"inference":{"predicted_value":"2","model_id":"wine-classifier"}}},"_ingest":{"timestamp":"2020-07-08T15:35:49.989672Z"}}}]}
[0 1 2]
[0 1 2]
json({"pipeline":{"processors":[{"inference":{"model_id":"wine-classifier","inference_config":{"classification":{}},"field_map":{}}}]},"docs":[{"_source":{"alcohol":14.1,"malic_acid":2.16,"ash":2.3,"alcalinity_of_ash":18.0,"magnesium":105.0,"total_phenols":2.95,"flavanoids":3.32,"nonflavanoid_phenols":0.22,"proanthocyanins":2.38,"color_intensity":5.75,"hue":1.25,"od280/od315_of_diluted_wines":3.17,"proline":1510.0}},{"_source":{"alcohol":12.0,"malic_acid":0.92,"ash":2.0,"alcalinity_of_ash":19.0,"magnesium":86.0,"total_phenols":2.42,"flavanoids":2.26,"nonflavanoid_phenols":0.3,"proanthocyanins":1.43,"color_intensity":2.5,"hue":1.38,"od280/od315_of_diluted_wines":3.12,"proline":278.0}},{"_source":{"alcohol":12.93,"malic_acid":2.81,"ash":2.7,"alcalinity_of_ash":21.0,"magnesium":96.0,"total_phenols":1.54,"flavanoids":0.5,"nonflavanoid_phenols":0.53,"proanthocyanins":0.75,"color_intensity":4.6,"hue":0.77,"od280/od315_of_diluted_wines":2.31,"proline":600.0}}]})
{ "docs": [ { "_source": { "alcalinity_of_ash": 18.0, "alcohol": 14.1, "ash": 2.3, "color_intensity": 5.75, "flavanoids": 3.32, "hue": 1.25, "magnesium": 105.0, "malic_acid": 2.16, "nonflavanoid_phenols": 0.22, "od280/od315_of_diluted_wines": 3.17, "proanthocyanins": 2.38, "proline": 1510.0, "total_phenols": 2.95 } }, { "_source": { "alcalinity_of_ash": 19.0, "alcohol": 12.0, "ash": 2.0, "color_intensity": 2.5, "flavanoids": 2.26, "hue": 1.38, "magnesium": 86.0, "malic_acid": 0.92, "nonflavanoid_phenols": 0.3, "od280/od315_of_diluted_wines": 3.12, "proanthocyanins": 1.43, "proline": 278.0, "total_phenols": 2.42 } }, { "_source": { "alcalinity_of_ash": 21.0, "alcohol": 12.93, "ash": 2.7, "color_intensity": 4.6, "flavanoids": 0.5, "hue": 0.77, "magnesium": 96.0, "malic_acid": 2.81, "nonflavanoid_phenols": 0.53, "od280/od315_of_diluted_wines": 2.31, "proanthocyanins": 0.75, "proline": 600.0, "total_phenols": 1.54 } } ], "pipeline": { "processors": [ { "inference": { "field_map": {}, "inference_config": { "classification": {} }, "model_id": "wine-classifier" } } ] } }
json({"docs":[{"doc":{"_index":"_index","_type":"_doc","_id":"_id","_source":{"alcohol":14.1,"alcalinity_of_ash":18.0,"proanthocyanins":2.38,"od280/od315_of_diluted_wines":3.17,"total_phenols":2.95,"magnesium":105.0,"flavanoids":3.32,"proline":1510.0,"malic_acid":2.16,"ash":2.3,"nonflavanoid_phenols":0.22,"hue":1.25,"color_intensity":5.75,"ml":{"inference":{"predicted_value":"0","model_id":"wine-classifier"}}},"_ingest":{"timestamp":"2020-07-08T15:35:49.98965Z"}}},{"doc":{"_index":"_index","_type":"_doc","_id":"_id","_source":{"alcohol":12.0,"alcalinity_of_ash":19.0,"proanthocyanins":1.43,"od280/od315_of_diluted_wines":3.12,"total_phenols":2.42,"magnesium":86.0,"flavanoids":2.26,"proline":278.0,"malic_acid":0.92,"ash":2.0,"nonflavanoid_phenols":0.3,"hue":1.38,"color_intensity":2.5,"ml":{"inference":{"predicted_value":"1","model_id":"wine-classifier"}}},"_ingest":{"timestamp":"2020-07-08T15:35:49.98966Z"}}},{"doc":{"_index":"_index","_type":"_doc","_id":"_id","_source":{"alcohol":12.93,"alcalinity_of_ash":21.0,"proanthocyanins":0.75,"od280/od315_of_diluted_wines":2.31,"total_phenols":1.54,"magnesium":96.0,"flavanoids":0.5,"proline":600.0,"malic_acid":2.81,"ash":2.7,"nonflavanoid_phenols":0.53,"hue":0.77,"color_intensity":4.6,"ml":{"inference":{"predicted_value":"2","model_id":"wine-classifier"}}},"_ingest":{"timestamp":"2020-07-08T15:35:49.989672Z"}}}]})
{ "docs": [ { "doc": { "_id": "_id", "_index": "_index", "_ingest": { "timestamp": "2020-07-08T15:35:49.98965Z" }, "_source": { "alcalinity_of_ash": 18.0, "alcohol": 14.1, "ash": 2.3, "color_intensity": 5.75, "flavanoids": 3.32, "hue": 1.25, "magnesium": 105.0, "malic_acid": 2.16, "ml": { "inference": { "model_id": "wine-classifier", "predicted_value": "0" } }, "nonflavanoid_phenols": 0.22, "od280/od315_of_diluted_wines": 3.17, "proanthocyanins": 2.38, "proline": 1510.0, "total_phenols": 2.95 }, "_type": "_doc" } }, { "doc": { "_id": "_id", "_index": "_index", "_ingest": { "timestamp": "2020-07-08T15:35:49.98966Z" }, "_source": { "alcalinity_of_ash": 19.0, "alcohol": 12.0, "ash": 2.0, "color_intensity": 2.5, "flavanoids": 2.26, "hue": 1.38, "magnesium": 86.0, "malic_acid": 0.92, "ml": { "inference": { "model_id": "wine-classifier", "predicted_value": "1" } }, "nonflavanoid_phenols": 0.3, "od280/od315_of_diluted_wines": 3.12, "proanthocyanins": 1.43, "proline": 278.0, "total_phenols": 2.42 }, "_type": "_doc" } }, { "doc": { "_id": "_id", "_index": "_index", "_ingest": { "timestamp": "2020-07-08T15:35:49.989672Z" }, "_source": { "alcalinity_of_ash": 21.0, "alcohol": 12.93, "ash": 2.7, "color_intensity": 4.6, "flavanoids": 0.5, "hue": 0.77, "magnesium": 96.0, "malic_acid": 2.81, "ml": { "inference": { "model_id": "wine-classifier", "predicted_value": "2" } }, "nonflavanoid_phenols": 0.53, "od280/od315_of_diluted_wines": 2.31, "proanthocyanins": 0.75, "proline": 600.0, "total_phenols": 1.54 }, "_type": "_doc" } } ] }
print(df[df["zipcode"] > df["score"]].es_info())
es_index_pattern: nyc-restaurants
Index:
 es_index_field: _id
 is_source_field: False
Mappings:
 capabilities:
                        es_field_name  is_source  es_dtype  es_date_format  pd_dtype  is_searchable  is_aggregatable  is_scripted  aggregatable_es_field_name
 camis                  camis                  True  long       None  int64    True  True   False  camis
 dba                    dba                    True  text       None  object   True  False  False  None
 boro                   boro                   True  keyword    None  object   True  True   False  boro
 building               building               True  keyword    None  object   True  True   False  building
 street                 street                 True  keyword    None  object   True  True   False  street
 zipcode                zipcode                True  short      None  int64    True  True   False  zipcode
 phone                  phone                  True  keyword    None  object   True  True   False  phone
 cuisine_description    cuisine_description    True  keyword    None  object   True  True   False  cuisine_description
 inspection_date        inspection_date        True  keyword    None  object   True  True   False  inspection_date
 action                 action                 True  keyword    None  object   True  True   False  action
 violation_code         violation_code         True  keyword    None  object   True  True   False  violation_code
 violation_description  violation_description  True  keyword    None  object   True  True   False  violation_description
 critical_flag          critical_flag          True  keyword    None  object   True  True   False  critical_flag
 score                  score                  True  double     None  float64  True  True   False  score
 grade                  grade                  True  keyword    None  object   True  True   False  grade
 grade_date             grade_date             True  keyword    None  object   True  True   False  grade_date
 record_date            record_date            True  keyword    None  object   True  True   False  record_date
 inspection_type        inspection_type        True  keyword    None  object   True  True   False  inspection_type
 community_board        community_board        True  double     None  float64  True  True   False  community_board
 council_district       council_district       True  double     None  float64  True  True   False  council_district
 census_tract           census_tract           True  double     None  float64  True  True   False  census_tract
 bin                    bin                    True  double     None  float64  True  True   False  bin
 bbl                    bbl                    True  double     None  float64  True  True   False  bbl
 nta                    nta                    True  keyword    None  object   True  True   False  nta
 location               location               True  geo_point  None  object   True  True   False  location
Operations:
 tasks: [('boolean_filter': ('boolean_filter': {'script': {'script': {'source': "doc['zipcode'].value > doc['score'].value", 'lang': 'painless'}}}))]
 size: None
 sort_params: None
 _source: ['camis', 'dba', 'boro', 'building', 'street', 'zipcode', 'phone', 'cuisine_description', 'inspection_date', 'action', 'violation_code', 'violation_description', 'critical_flag', 'score', 'grade', 'grade_date', 'record_date', 'inspection_type', 'community_board', 'council_district', 'census_tract', 'bin', 'bbl', 'nta', 'location']
 body: {'query': {'script': {'script': {'source': "doc['zipcode'].value > doc['score'].value", 'lang': 'painless'}}}}
 post_processing: []
[ ]:
import eland as ed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Fix console size for consistent test results
from eland.conftest import *
To get started, let’s create an eland.DataFrame by reading a csv file. This creates and populates the online-retail index in the local Elasticsearch cluster.
df = ed.csv_to_eland(
    "data/online-retail.csv.gz",
    es_client='localhost',
    es_dest_index='online-retail',
    es_if_exists='replace',
    es_dropna=True,
    es_refresh=True,
    compression='gzip',
    index_col=0
)
Here we see that the "_id" field was used to index our data frame.
"_id"
df.index.es_index_field
Next, we can check which fields from Elasticsearch are available to our eland data frame. columns is available as a parameter when instantiating the data frame, which allows one to choose only a subset of fields from the index to include in the data frame. Since we didn’t set this parameter, we have access to all fields (a short sketch of restricting columns follows the output below).
df.columns
Index(['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode', 'UnitPrice'], dtype='object')
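For example, a sketch of restricting the data frame to a subset of fields with the columns parameter:
# Only expose two of the index's fields to the data frame.
df_subset = ed.DataFrame('localhost', 'online-retail', columns=['Country', 'Quantity'])
df_subset.columns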
Now, let’s see the data types of our fields. Running df.dtypes, we can see that Elasticsearch field types are mapped to pandas dtypes.
df.dtypes
Country        object
CustomerID    float64
Description    object
InvoiceDate    object
InvoiceNo      object
Quantity        int64
StockCode      object
UnitPrice     float64
dtype: object
We also offer an .es_info() data frame method that shows all info about the underlying index. It also contains information about operations being passed from data frame methods to Elasticsearch. More on this later.
print(df.es_info())
es_index_pattern: online-retail
Index:
 es_index_field: _id
 is_source_field: False
Mappings:
 capabilities:
              es_field_name  is_source  es_dtype  es_date_format  pd_dtype  is_searchable  is_aggregatable  is_scripted  aggregatable_es_field_name
 Country      Country      True  keyword  None  object   True  True  False  Country
 CustomerID   CustomerID   True  double   None  float64  True  True  False  CustomerID
 Description  Description  True  keyword  None  object   True  True  False  Description
 InvoiceDate  InvoiceDate  True  keyword  None  object   True  True  False  InvoiceDate
 InvoiceNo    InvoiceNo    True  keyword  None  object   True  True  False  InvoiceNo
 Quantity     Quantity     True  long     None  int64    True  True  False  Quantity
 StockCode    StockCode    True  keyword  None  object   True  True  False  StockCode
 UnitPrice    UnitPrice    True  double   None  float64  True  True  False  UnitPrice
Operations:
 tasks: []
 size: None
 sort_params: None
 _source: ['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode', 'UnitPrice']
 body: {}
 post_processing: []
Now that we understand how to create a data frame and access its underlying attributes, let’s see how we can select subsets of our data.
Much like pandas, eland data frames offer .head(n) and .tail(n) methods that return the first and last n rows, respectively.
df.head(2)
2 rows × 8 columns
print(df.tail(2).head(2).tail(2).es_info())
es_index_pattern: online-retail
Index:
 es_index_field: _id
 is_source_field: False
Mappings:
 capabilities:
              es_field_name  is_source  es_dtype  es_date_format  pd_dtype  is_searchable  is_aggregatable  is_scripted  aggregatable_es_field_name
 Country      Country      True  keyword  None  object   True  True  False  Country
 CustomerID   CustomerID   True  double   None  float64  True  True  False  CustomerID
 Description  Description  True  keyword  None  object   True  True  False  Description
 InvoiceDate  InvoiceDate  True  keyword  None  object   True  True  False  InvoiceDate
 InvoiceNo    InvoiceNo    True  keyword  None  object   True  True  False  InvoiceNo
 Quantity     Quantity     True  long     None  int64    True  True  False  Quantity
 StockCode    StockCode    True  keyword  None  object   True  True  False  StockCode
 UnitPrice    UnitPrice    True  double   None  float64  True  True  False  UnitPrice
Operations:
 tasks: [('tail': ('sort_field': '_doc', 'count': 2)), ('head': ('sort_field': '_doc', 'count': 2)), ('tail': ('sort_field': '_doc', 'count': 2))]
 size: 2
 sort_params: _doc:desc
 _source: ['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode', 'UnitPrice']
 body: {}
 post_processing: [('sort_index'), ('head': ('count': 2)), ('tail': ('count': 2))]
df.tail(2)
You can also pass a list of columns to select columns from the data frame in a specified order.
df[['Country', 'InvoiceDate']].head(5)
5 rows × 2 columns
We also allow you to filter the data frame using boolean indexing. Under the hood, a boolean index maps to a terms query that is then passed to Elasticsearch to filter the index.
# the construction of a boolean vector maps directly to an elasticsearch query
print(df['Country']=='Germany')
df[(df['Country']=='Germany')].head(5)
{'term': {'Country': 'Germany'}}
5 rows × 8 columns
We can also filter the data frame using a list of values.
print(df['Country'].isin(['Germany', 'United States']))
df[df['Country'].isin(['Germany', 'United Kingdom'])].head(5)
{'terms': {'Country': ['Germany', 'United States']}}
We can also combine boolean vectors to further filter the data frame.
df[(df['Country']=='Germany') & (df['Quantity']>90)]
0 rows × 8 columns
Using this example, let’s see how Eland translates this boolean filter into an Elasticsearch bool query.
print(df[(df['Country']=='Germany') & (df['Quantity']>90)].es_info())
es_index_pattern: online-retail
Index:
 es_index_field: _id
 is_source_field: False
Mappings:
 capabilities:
            es_field_name  is_source es_dtype es_date_format pd_dtype  is_searchable  is_aggregatable  is_scripted aggregatable_es_field_name
Country           Country       True  keyword           None   object           True             True        False                    Country
CustomerID     CustomerID       True   double           None  float64           True             True        False                 CustomerID
Description   Description       True  keyword           None   object           True             True        False                Description
InvoiceDate   InvoiceDate       True  keyword           None   object           True             True        False                InvoiceDate
InvoiceNo       InvoiceNo       True  keyword           None   object           True             True        False                  InvoiceNo
Quantity         Quantity       True     long           None    int64           True             True        False                   Quantity
StockCode       StockCode       True  keyword           None   object           True             True        False                  StockCode
UnitPrice       UnitPrice       True   double           None  float64           True             True        False                  UnitPrice
Operations:
 tasks: [('boolean_filter': ('boolean_filter': {'bool': {'must': [{'term': {'Country': 'Germany'}}, {'range': {'Quantity': {'gt': 90}}}]}}))]
 size: None
 sort_params: None
 _source: ['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode', 'UnitPrice']
 body: {'query': {'bool': {'must': [{'term': {'Country': 'Germany'}}, {'range': {'Quantity': {'gt': 90}}}]}}}
 post_processing: []
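For illustration, the same bool query can be sent by hand with the low-level Python Elasticsearch client. A sketch, assuming a local cluster at localhost:9200:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# The bool query Eland generated above, issued directly
query = {
    "bool": {
        "must": [
            {"term": {"Country": "Germany"}},
            {"range": {"Quantity": {"gt": 90}}},
        ]
    }
}
resp = es.search(index="online-retail", body={"query": query}, size=5)
print(resp["hits"]["total"])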
Let’s begin to ask some questions of our data and use eland to get the answers.
How many different countries are there?
df['Country'].nunique()
16
What is the total sum of products ordered?
df['Quantity'].sum()
111960.0
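Both of these answers are computed server-side. As a rough illustration, nunique() corresponds to a cardinality aggregation (which is approximate by design) and sum() to a sum aggregation. A sketch, assuming a local cluster at localhost:9200:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

body = {
    "size": 0,  # aggregation results only, no documents
    "aggs": {
        "n_countries": {"cardinality": {"field": "Country"}},
        "total_quantity": {"sum": {"field": "Quantity"}},
    },
}
resp = es.search(index="online-retail", body=body)
print(resp["aggregations"]["n_countries"]["value"])
print(resp["aggregations"]["total_quantity"]["value"])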
Show me the sum, mean, min, and max of the Quantity and UnitPrice fields
df[['Quantity','UnitPrice']].agg(['sum', 'mean', 'max', 'min'])
Give me descriptive statistics for the entire data frame
# NBVAL_IGNORE_OUTPUT
df.describe()
Show me a histogram of numeric columns
df[(df['Quantity']>-50) & (df['Quantity']<50) & (df['UnitPrice']>0) & (df['UnitPrice']<100)][['Quantity', 'UnitPrice']].hist(figsize=[12,4], bins=30)
plt.show()
df[(df['Quantity']>-50) & (df['Quantity']<50) & (df['UnitPrice']>0) & (df['UnitPrice']<100)][['Quantity', 'UnitPrice']].hist(figsize=[12,4], bins=30, log=True)
plt.show()
Pandas-style query strings are also supported via .query():
df.query('Quantity>50 & UnitPrice<100')
258 rows × 8 columns
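For comparison, the same filter can be written with the boolean indexing shown earlier:

# Equivalent to df.query('Quantity>50 & UnitPrice<100')
df[(df['Quantity'] > 50) & (df['UnitPrice'] < 100)]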
Numeric values
df['Quantity'].head()
1000     1
1001     1
1002     1
1003     1
1004    12
Name: Quantity, dtype: int64
df['UnitPrice'].head()
1000    1.25
1001    1.25
1002    1.25
1003    1.25
1004    0.29
Name: UnitPrice, dtype: float64
Eland series support arithmetic between numeric fields; the computation is pushed down to Elasticsearch rather than performed in memory:
product = df['Quantity'] * df['UnitPrice']
product.head()
1000    1.25
1001    1.25
1002    1.25
1003    1.25
1004    3.48
dtype: float64
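A derived value like this can also be aggregated entirely server-side by putting a script inside an aggregation. The sketch below computes total revenue directly against Elasticsearch, assuming a local cluster at localhost:9200:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Total revenue: sum of Quantity * UnitPrice, computed by a script server-side
body = {
    "size": 0,
    "aggs": {
        "revenue": {
            "sum": {
                "script": {
                    "source": "doc['Quantity'].value * doc['UnitPrice'].value"
                }
            }
        }
    },
}
resp = es.search(index="online-retail", body=body)
print(resp["aggregations"]["revenue"]["value"])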
String concatenation
df['Country'] + df['StockCode']
1000      United Kingdom21123
1001      United Kingdom21124
1002      United Kingdom21122
1003      United Kingdom84378
1004      United Kingdom21985
                 ...
14995    United Kingdom72349B
14996     United Kingdom72741
14997     United Kingdom22762
14998     United Kingdom21773
14999     United Kingdom22149
Length: 15000, dtype: object
Eland is an open source project and we love to receive contributions from our community — you! There are many ways to contribute, from writing tutorials or blog posts, improving the documentation, submitting bug reports and feature requests or writing code which can be incorporated into eland itself.
If you think you have found a bug in eland, first make sure that you are testing against the latest version of eland - your issue may already have been fixed. If not, search our issues list on GitHub in case a similar issue has already been opened.
It is very helpful if you can prepare a reproduction of the bug. In other words, provide a small test case which we can run to confirm your bug. It makes it easier to find the problem and to fix it. Test cases should be provided as python scripts, ideally with some details of your Elasticsearch environment and index mappings, and (where appropriate) a pandas example.
Provide as much information as you can. You may think that the problem lies with your query, when actually it depends on how your data is indexed. The easier it is for us to recreate your problem, the faster it is likely to be fixed.
If you find yourself wishing for a feature that doesn’t exist in eland, you are probably not alone. There are bound to be others out there with similar needs. Many of the features that eland has today have been added because our users saw the need. Open an issue on our issues list on GitHub which describes the feature you would like to see, why you need it, and how it should work.
If you have a bugfix or new feature that you would like to contribute to eland, please find or open an issue about it first. Talk about what you would like to do. It may be that somebody is already working on it, or that there are particular issues that you should know about before implementing the change.
We enjoy working with contributors to get their code accepted. There are many approaches to fixing a problem and it is important to find the best approach before writing too much code.
Note that it is unlikely the project will merge refactors for the sake of refactoring. These types of pull requests have a high cost to maintainers in reviewing and testing with little to no tangible benefit. This especially includes changes generated by tools.
The process for contributing to any of the Elastic repositories is similar. Details for individual projects can be found below.
You will need to fork the main eland code or documentation repository and clone it to your local machine. See the GitHub help pages if you need guidance.
Further instructions for specific projects are given below.
Once your changes and tests are ready to submit for review:
Test your changes
Run the test suite to make sure that nothing is broken (TODO add link to testing doc).
Sign the Contributor License Agreement
Please make sure you have signed our Contributor License Agreement. We are not asking you to assign copyright to us, but to give us the right to distribute your code without restriction. We ask this of all contributors in order to assure our users of the origin and continuing existence of the code. You only need to sign the CLA once.
Rebase your changes
Update your local repository with the most recent code from the main eland repository, and rebase your branch on top of the latest master branch. We prefer your initial changes to be squashed into a single commit. Later, if we ask you to make changes, add them as separate commits. This makes them easier to review. As a final step before merging we will either ask you to squash all commits yourself or we’ll do it for you.
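For example, assuming the main repository is configured as a remote named upstream (a common convention, not something the project mandates):

$ git fetch upstream
$ git rebase upstream/master
$ git rebase -i upstream/master   # optionally squash your commits into one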
Submit a pull request
Push your local changes to your forked copy of the repository and submit a pull request. In the pull request, choose a title which sums up the changes that you have made, and in the body provide more details about what your changes do. Also mention the number of the issue where discussion has taken place, e.g. “Closes #123”.
Then sit back and wait. There will probably be discussion about the pull request and, if any changes are needed, we would love to work with you to get your pull request merged into eland.
Please adhere to the general guideline that you should never force push to a publicly shared branch. Once you have opened your pull request, you should consider your branch publicly shared. Instead of force pushing, you can add incremental commits; this is generally easier on your reviewers. If you need to pick up changes from master, you can merge master into your branch. A reviewer might ask you to rebase a long-running pull request, in which case force pushing is okay for that request. Note that you should also not squash manually at the end of the review process; that can be done when the pull request is integrated via GitHub.
Repository: https://github.com/elastic/eland
We develop internally using the PyCharm IDE, currently with a minimum version of PyCharm 2019.2.4.
(All commands should be run from the module root.)

1. Create a new project by checking out your fork of eland from version control (e.g. git@github.com:stevedodson/eland.git).
2. In Preferences, set 'Tools' -> 'Python Integrated Tools' -> 'Default test runner' to pytest, and 'Docstring format' to numpy.
3. Install the development requirements: pip install -r requirements-dev.txt
4. Start a local Elasticsearch instance (assumed to be at localhost:9200) and run python -m eland.tests.setup_tests to create the test indices. Note that this modifies Elasticsearch indices.
5. Run pytest --doctest-modules to validate the installation.
6. Alternatively, run the test suite and the code formatter via nox: nox -s test-3.8 and nox -s blacken.
The goal of an eland.DataFrame is to enable users who are familiar with pandas.DataFrame to access, explore and manipulate data that resides in Elasticsearch.
Ideally, all data should reside in Elasticsearch and not in memory. This restricts the API, but allows access to huge data sets that do not fit into memory, and allows use of powerful Elasticsearch features such as aggregations.
Generally, integrations with 3rd party storage systems (SQL, Google Big Query etc.) involve accessing these systems and reading all external data into an in-core pandas data structure. This also applies to Apache Arrow structures.
Whilst this provides access to data in these systems, for large datasets this can require significant in-core memory, and for systems such as Elasticsearch, bulk export of data can be an inefficient way of exploring the data.
An alternative option is to create an API that proxies pandas.DataFrame-like calls to Elasticsearch queries and operations. This could allow the Elasticsearch cluster to perform operations such as aggregations rather than exporting all the data and performing this operation in-core.
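To make the idea concrete, here is a minimal sketch of such a proxy (not Eland's actual implementation): a pandas-like mean() that is answered by an Elasticsearch aggregation instead of exporting documents. The index and field names follow the earlier examples; the local cluster at localhost:9200 is an assumption.

from elasticsearch import Elasticsearch

class ColumnProxy:
    # Proxies a single numeric field of an index behind a pandas-like API
    def __init__(self, es, index, field):
        self.es = es
        self.index = index
        self.field = field

    def mean(self):
        # size=0: return only the aggregation result, no documents
        body = {"size": 0, "aggs": {"m": {"avg": {"field": self.field}}}}
        resp = self.es.search(index=self.index, body=body)
        return resp["aggregations"]["m"]["value"]

es = Elasticsearch("http://localhost:9200")  # assumed local cluster
print(ColumnProxy(es, "online-retail", "Quantity").mean())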
An option would be to replace the pandas.DataFrame backend in-core memory structures with Elasticsearch accessors. This would allow full access to the pandas.DataFrame APIs. However, this has issues:
A typical pandas workflow creates many derived data frames, so an expression such as df_a = df['a'] would require backing Elasticsearch state for both df and df_a, which could place significant load on the cluster.
Not every pandas.DataFrame API is appropriate for Elasticsearch; in particular, calls such as df.to_dict() would export the entire data set into memory, which is exactly what this design tries to avoid.
Another option is to create an eland.DataFrame API that mimics appropriate aspects of the pandas.DataFrame API. This resolves some of the issues above: only the subset of the API that maps well to Elasticsearch needs to be supported, and where in-memory processing is genuinely required, an explicit export method such as eland.DataFrame._to_pandas() can hand the data over to pandas.