eland: pandas-like data analysis toolkit backed by Elasticsearch

Date: Apr 14, 2020 Version: 7.6.0a5

Useful links: Source Repository | Issues & Ideas | Q&A Support

eland is an open source, Apache2-licensed Elasticsearch Python client to analyse, explore and manipulate data that resides in Elasticsearch. Where possible, the package uses existing Python APIs and data structures to make it easy to switch from NumPy, pandas and scikit-learn to their Elasticsearch-powered equivalents. In general, the data resides in Elasticsearch and not in memory, which allows eland to access large datasets stored in Elasticsearch.

API reference

This page gives an overview of all public eland objects, functions and methods. All classes and functions exposed in eland.* namespace are public.

Input/Output

Flat File

read_csv(filepath_or_buffer, es_client, …) Read a comma-separated values (csv) file into an eland.DataFrame.
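
For comparison, a plain pandas read_csv sketch of the behaviour eland mirrors. Per the signature above, eland.read_csv additionally takes an Elasticsearch client, writes the parsed rows into an index, and returns an eland.DataFrame backed by that index rather than an in-memory frame.

```python
import io
import pandas as pd

# Plain pandas: the whole file is parsed into memory.
# eland.read_csv(filepath_or_buffer, es_client, ...) instead indexes the
# rows into Elasticsearch and returns an index-backed eland.DataFrame.
csv_data = io.StringIO("Carrier,Cancelled\nJetBeats,False\nKibana Airlines,True\n")
df = pd.read_csv(csv_data)
print(df.shape)
```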

General utility functions

Elasticsearch access

read_es(es_client, es_index_pattern) Utility method to create an eland.DataFrame from an Elasticsearch index_pattern.

Pandas and Eland

pandas_to_eland(pd_df, es_client, es_dest_index) Append a pandas DataFrame to an Elasticsearch index.
eland_to_pandas(ed_df[, show_progress]) Convert an eland.DataFrame to a pandas.DataFrame.

DataFrame

Constructor

DataFrame([client, index_pattern, columns, …]) Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns) referencing data stored in Elasticsearch indices.

Attributes and underlying data

Axes

DataFrame.index Return eland index referencing Elasticsearch field to index a DataFrame/Series
DataFrame.columns The column labels of the DataFrame.
DataFrame.dtypes Return the pandas dtypes in the DataFrame.
DataFrame.select_dtypes(self[, include, exclude]) Return a subset of the DataFrame’s columns based on the column dtypes.
DataFrame.values Not implemented.
DataFrame.empty Determines if the DataFrame is empty.
DataFrame.shape Return a tuple representing the dimensionality of the DataFrame.

Indexing, iteration

DataFrame.head(self[, n]) Return the first n rows.
DataFrame.keys(self) Return columns
DataFrame.tail(self[, n]) Return the last n rows.
DataFrame.get(self, key[, default]) Get item from object for given key (ex: DataFrame column).
DataFrame.query(self, expr) Query the columns of a DataFrame with a boolean expression.
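
DataFrame.query accepts the same boolean-expression syntax as pandas; the pandas version is shown below for illustration. Against an eland DataFrame, the expression becomes part of the Elasticsearch query rather than being evaluated over in-memory rows.

```python
import pandas as pd

# pandas-style boolean expression; eland translates the equivalent call
# into an Elasticsearch query so filtering happens in the cluster.
df = pd.DataFrame({"AvgTicketPrice": [841.3, 190.6, 730.0],
                   "Cancelled": [False, False, True]})
cheap = df.query("AvgTicketPrice < 500 & Cancelled == False")
print(len(cheap))
```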

Function application, GroupBy & window

DataFrame.agg(self, func[, axis]) Aggregate using one or more operations over the specified axis.
DataFrame.aggregate(self, func[, axis]) Aggregate using one or more operations over the specified axis.
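
agg and aggregate mirror the pandas API, shown here with plain pandas. With an eland DataFrame these calls map onto Elasticsearch aggregations, so only the aggregated results (not the underlying rows) are transferred to the client.

```python
import pandas as pd

# Multiple reductions in one call; eland computes the equivalent via
# Elasticsearch aggregations instead of scanning rows in memory.
df = pd.DataFrame({"DistanceKilometers": [100.0, 200.0, 300.0]})
result = df.agg(["min", "max"])
print(result)
```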

Computations / descriptive stats

DataFrame.count(self) Count non-NA cells for each column.
DataFrame.describe(self) Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
DataFrame.info(self[, verbose, buf, …]) Print a concise summary of a DataFrame.
DataFrame.max(self[, numeric_only]) Return the maximum value for each numeric column
DataFrame.mean(self[, numeric_only]) Return mean value for each numeric column
DataFrame.min(self[, numeric_only]) Return the minimum value for each numeric column
DataFrame.sum(self[, numeric_only]) Return sum for each numeric column
DataFrame.nunique(self) Return cardinality of each field.
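
The descriptive statistics above follow pandas semantics; in eland, nunique is computed via a cardinality aggregation (as noted above) so no documents are exported. A pandas sketch of the shared behaviour:

```python
import pandas as pd

# count() and nunique() behave as in pandas; eland pushes the work to
# Elasticsearch (value counts and a cardinality aggregation).
df = pd.DataFrame({"Carrier": ["JetBeats", "JetBeats", "Kibana Airlines"],
                   "FlightDelayMin": [0, 15, 0]})
print(df["Carrier"].nunique())
print(int(df.count()["FlightDelayMin"]))
```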

Reindexing / selection / label manipulation

DataFrame.drop(self[, labels, axis, index, …]) Return new object with labels in requested axis removed.

Plotting

DataFrame.hist(data[, column, by, grid, …]) Make a histogram of the DataFrame's columns.

Elasticsearch Functions

DataFrame.info_es(self) A debug summary of an eland DataFrame's internals.
DataFrame.es_query(self, query) Applies an Elasticsearch DSL query to the current DataFrame.

Serialization / IO / conversion

DataFrame.info(self[, verbose, buf, …]) Print a concise summary of a DataFrame.
DataFrame.to_numpy(self) Not implemented.
DataFrame.to_csv(self[, path_or_buf, sep, …]) Write Elasticsearch data to a comma-separated values (csv) file.
DataFrame.to_html(self[, buf, columns, …]) Render Elasticsearch data as an HTML table.
DataFrame.to_string(self[, buf, columns, …]) Render a DataFrame to a console-friendly tabular output.

Series

Constructor

Series([client, index_pattern, name, …]) pandas.Series like API that proxies into Elasticsearch index(es).

Attributes and underlying data

Axes

Series.index Return eland index referencing Elasticsearch field to index a DataFrame/Series
Series.shape Return a tuple representing the dimensionality of the Series.
Series.name
Series.empty Determines if the Series is empty.

Indexing, iteration

Series.head(self[, n])
Series.tail(self[, n])

Binary operator functions

Series.add(self, right) Return addition of series and right, element-wise (binary operator add).
Series.sub(self, right) Return subtraction of series and right, element-wise (binary operator sub).
Series.mul(self, right) Return multiplication of series and right, element-wise (binary operator mul).
Series.div(self, right) Return floating division of series and right, element-wise (binary operator truediv).
Series.truediv(self, right) Return floating division of series and right, element-wise (binary operator truediv).
Series.floordiv(self, right) Return integer division of series and right, element-wise (binary operator floordiv //).
Series.mod(self, right) Return modulo of series and right, element-wise (binary operator mod %).
Series.pow(self, right) Return exponential power of series and right, element-wise (binary operator pow).
Series.radd(self, left) Return addition of series and left, element-wise (binary operator add).
Series.rsub(self, left) Return subtraction of series and left, element-wise (binary operator sub).
Series.rmul(self, left) Return multiplication of series and left, element-wise (binary operator mul).
Series.rdiv(self, left) Return division of series and left, element-wise (binary operator div).
Series.rtruediv(self, left) Return division of series and left, element-wise (binary operator div).
Series.rfloordiv(self, left) Return integer division of series and left, element-wise (binary operator floordiv //).
Series.rmod(self, left) Return modulo of series and left, element-wise (binary operator mod %).
Series.rpow(self, left) Return exponential power of series and left, element-wise (binary operator pow).
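
These operators follow pandas element-wise arithmetic semantics (plain pandas shown below); with an eland Series the equivalent operation is expressed against the Elasticsearch index rather than computed over in-memory arrays.

```python
import pandas as pd

# Element-wise arithmetic and its reflected form, as in pandas.
km = pd.Series([100.0, 250.0], name="DistanceKilometers")
miles = km.mul(0.621371)      # same as km * 0.621371
ratio = miles.rdiv(62.1371)   # same as 62.1371 / miles
print(miles.tolist())
```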

Computations / descriptive stats

Series.describe(self) Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
Series.max(self[, numeric_only]) Return the maximum of the Series values
Series.mean(self[, numeric_only]) Return the mean of the Series values
Series.min(self[, numeric_only]) Return the minimum of the Series values
Series.sum(self[, numeric_only]) Return the sum of the Series values
Series.nunique(self) Return the number of unique values in the Series.
Series.value_counts(self[, es_size]) Return the value counts for the specified field.
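
value_counts behaves as in pandas, illustrated below with plain pandas; in eland the es_size parameter listed above limits how many distinct terms the underlying Elasticsearch aggregation returns.

```python
import pandas as pd

# Frequency of each distinct value, most common first -- the behaviour
# eland reproduces via an Elasticsearch terms aggregation.
carriers = pd.Series(["JetBeats", "Logstash Airways", "JetBeats"], name="Carrier")
counts = carriers.value_counts()
print(counts.to_dict())
```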

Reindexing / selection / label manipulation

Series.rename(self, new_name) Rename name of series.

Plotting

Series.hist(self[, by, ax, grid, …]) Draw histogram of the input series using matplotlib.

Serialization / IO / conversion

Series.to_string(self[, buf, na_rep, …]) Render a string representation of the Series.
Series.to_numpy(self) Not implemented.

Elasticsearch utilities

Series.info_es(self)

Index

Many of these methods or variants thereof are available on the objects that contain an index (Series/DataFrame) and those should most likely be used before calling these methods directly.

Constructor

Index(query_compiler[, index_field]) The index for an eland.DataFrame.

Machine Learning

Machine learning is built into the Elastic Stack and enables users to gain insights into their Elasticsearch data. Capabilities range from identifying anomalies in your data to training and deploying regression or classification models based on Elasticsearch data.

To use the Elastic Stack machine learning features, you must have the appropriate license and at least one machine learning node in your Elasticsearch cluster. If Elastic Stack security features are enabled, you must also ensure your users have the necessary privileges.

The fastest way to get started with machine learning features is to start a free 14-day trial of Elasticsearch Service in the cloud.

See https://www.elastic.co/guide/en/machine-learning/current/setup.html and other documentation for more detail.

ImportedMLModel

Constructor
ImportedMLModel(es_client, model_id, model, …) Transform and serialize a trained 3rd party model into Elasticsearch.
Learning API
ImportedMLModel.predict(self, X) Make a prediction using a trained model stored in Elasticsearch.
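
A sketch of the kind of third-party model ImportedMLModel transforms and serializes. Uploading and predicting require a live Elasticsearch cluster with ML enabled, so only the scikit-learn training step runs here; the commented call and its feature_names argument are illustrative only, standing in for the elided constructor parameters.

```python
# Train a small scikit-learn model of the sort eland can serialize.
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [1, 1], [1, 0], [0, 1]]
y = [0, 1, 1, 0]
model = DecisionTreeClassifier().fit(X, y)

# With a cluster available, one would then (sketch -- parameter names are
# illustrative, not a definitive signature):
#   es_model = ImportedMLModel(es_client, "my_model", model,
#                              feature_names=["f0", "f1"])
#   es_model.predict(X)
print(model.predict([[1, 1]]))
```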

Implementation Notes

Implementation Details

The goal of an eland.DataFrame is to enable users who are familiar with pandas.DataFrame to access, explore and manipulate data that resides in Elasticsearch.

Ideally, all data should reside in Elasticsearch and not in memory. This restricts the API, but allows access to huge data sets that do not fit into memory, and allows use of powerful Elasticsearch features such as aggregations.

Pandas and 3rd Party Storage Systems

Generally, integrations with [3rd party storage systems](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) (SQL, Google Big Query etc.) involve accessing these systems and reading all external data into an in-core pandas data structure. This also applies to [Apache Arrow](https://arrow.apache.org/docs/python/pandas.html) structures.

Whilst this provides access to data in these systems, for large datasets this can require significant in-core memory, and for systems such as Elasticsearch, bulk export of data can be an inefficient way of exploring the data.

An alternative option is to create an API that proxies pandas.DataFrame-like calls to Elasticsearch queries and operations. This could allow the Elasticsearch cluster to perform operations such as aggregations rather than exporting all the data and performing this operation in-core.

Implementation Options

One option would be to replace the in-core memory structures backing pandas.DataFrame with Elasticsearch accessors. This would allow full access to the pandas.DataFrame APIs. However, this has issues:

  • If a pandas.DataFrame instance maps to an index, typical manipulation of a pandas.DataFrame may involve creating many derived pandas.DataFrame instances. Constructing an index per pandas.DataFrame may result in many Elasticsearch indices and a significant load on Elasticsearch. For example, df_a = df['a'] should not require separate Elasticsearch indices for df and df_a.
  • Not all pandas.DataFrame APIs map to things we may want to do in Elasticsearch. In particular, API calls that involve exporting all data from Elasticsearch into memory e.g. df.to_dict().
  • The backend pandas.DataFrame structures are not easily abstractable and are deeply embedded in the implementation.

Another option is to create an eland.DataFrame API that mimics appropriate aspects of the pandas.DataFrame API. This resolves some of the issues above:

  • df_a = df['a'] could be implemented as a change to the Elasticsearch query used, rather than a new index
  • Instead of supporting the entire pandas.DataFrame API we can support a subset appropriate for Elasticsearch. If additional calls are required, we could create an eland.DataFrame._to_pandas() method which would explicitly export all data to a pandas.DataFrame
  • Creating a new eland.DataFrame API gives us full flexibility in terms of implementation. However, it does create a large amount of work which may duplicate a lot of the pandas code - for example, printing objects etc. - this creates maintenance issues etc.
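
A minimal sketch (not eland's actual implementation) of the proxying idea above: selecting columns produces a new lightweight object that narrows the Elasticsearch query, rather than creating a new index or moving data.

```python
# Hypothetical proxy: df_a = df['a'] only changes query state.
class QueryProxy:
    def __init__(self, index, columns=None):
        self.index = index
        self.columns = columns  # None means "all fields"

    def __getitem__(self, cols):
        # Return a new proxy over the same index with a narrowed
        # _source filter -- no new Elasticsearch index, no data movement.
        cols = cols if isinstance(cols, list) else [cols]
        return QueryProxy(self.index, cols)

    def to_es_query(self, size=10):
        body = {"size": size, "query": {"match_all": {}}}
        if self.columns is not None:
            body["_source"] = self.columns
        return body

df = QueryProxy("flights")
df_a = df["Carrier"]
print(df_a.to_es_query())
```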

pandas.DataFrame supported APIs

The following table lists both implemented and not implemented methods. If you have need of an operation that is listed as not implemented, feel free to open an issue on https://github.com/elastic/eland, or give a thumbs up to already created issues. Contributions are also welcome!

The following table is structured as follows: The first column contains the method name. The second column is the method's usage count in the corpus below, and the third is a flag for whether or not there is an implementation in eland. Y stands for yes, N stands for no.

https://github.com/adgirish/kaggleScape/blob/master/results/annotResults.csv represents a prioritised list.

Method Count Implemented? (y/n)
pd.read_csv 1422 y
pd.DataFrame 886 y
df.append 792 n
df.mean 783 y
df.head 783 y
df.drop 761 y
df.sum 755 y
df.to_csv 693 y
df.get 669 y
df.mode 653 n
df.astype 649 n
df.sub 637 n
pd.concat 582 n
df.apply 577 n
df.groupby 557 n
df.join 544 n
df.fillna 543 n
df.max 508 y
df.reset_index 434 n
pd.unique 433 n
df.le 405 n
df.count 399 y
pd.value_counts 397 y
df.sort_values 390 n
df.transform 387 n
df.merge 376 n
df.add 346 n
df.isnull 338 n
df.min 321 y
df.copy 314 n
df.replace 300 n
df.std 261 n
df.hist 246 y
df.filter 234 n
df.describe 220 y
df.ne 218 n
df.corr 217 n
df.median 217 n
df.items 212 n
pd.to_datetime 204 n
df.isin 203 n
df.dropna 195 n
pd.get_dummies 190 n
df.rename 185 n
df.info 180 y
df.set_index 166 n
df.keys 159 y
df.sample 155 n
df.agg 140 y
df.where 138 n
df.boxplot 134 n
df.clip 116 n
df.round 116 n
df.abs 101 n
df.stack 97 n
df.tail 94 y
df.update 92 n
df.iterrows 90 n
df.transpose 87 n
df.any 85 n
df.pipe 80 n
pd.eval 73 n
df.eval 73 n
pd.read_json 72 n
df.nunique 70 y
df.pivot 70 n
df.select 68 n
df.as_matrix 67 n
df.notnull 66 n
df.cumsum 66 n
df.prod 64 n
df.unstack 64 n
df.drop_duplicates 63 n
df.div 63 n
pd.crosstab 59 n
df.select_dtypes 57 y
df.pow 56 n
df.sort_index 56 n
df.product 52 n
df.isna 51 n
df.dot 46 n
pd.cut 45 n
df.bool 44 n
df.to_dict 44 n
df.diff 44 n
df.insert 44 n
df.pop 44 n
df.query 43 y
df.var 43 n
df.__init__ 41 y
pd.to_numeric 39 n
df.squeeze 39 n
df.ge 37 n
df.quantile 37 n
df.reindex 37 n
df.rolling 35 n
pd.factorize 32 n
pd.melt 31 n
df.melt 31 n
df.rank 31 n
pd.read_table 30 n
pd.pivot_table 30 n
df.idxmax 30 n
pd.test 29 n
df.iteritems 29 n
df.shift 28 n
df.mul 28 n
pd.qcut 25 n
df.set_value 25 n
df.all 24 n
df.skew 24 n
df.aggregate 23 y
pd.match 22 n
df.nlargest 22 n
df.multiply 21 n
df.set_axis 19 n
df.eq 18 n
df.resample 18 n
pd.read_sql 17 n
df.duplicated 16 n
pd.date_range 16 n
df.interpolate 15 n
df.memory_usage 15 n
df.divide 14 n
df.cov 13 n
df.assign 12 n
df.subtract 12 n
pd.read_pickle 11 n
df.applymap 11 n
df.first 11 n
df.kurt 10 n
df.truncate 10 n
df.get_value 9 n
pd.read_hdf 9 n
df.to_html 9 y
pd.read_sql_query 9 n
df.take 8 n
df.to_pickle 7 n
df.itertuples 7 n
df.to_string 7 y
df.last 7 n
df.sem 7 n
pd.to_pickle 7 n
df.to_json 7 n
df.idxmin 7 n
df.xs 6 n
df.combine 6 n
pd.rolling_mean 6 n
df.to_period 6 n
df.convert_objects 5 n
df.mask 4 n
df.pct_change 4 n
df.add_prefix 4 n
pd.read_excel 4 n
pd.rolling_std 3 n
df.to_records 3 n
df.corrwith 3 n
df.swapaxes 3 n
df.__iter__ 3 n
df.to_sql 3 n
pd.read_feather 3 n
df.to_feather 3 n
df.__len__ 3 n
df.kurtosis 3 n
df.mod 2 n
df.to_sparse 2 n
df.get_values 2 n
df.__eq__ 2 n
pd.bdate_range 2 n
df.get_dtype_counts 2 n
df.combine_first 2 n
df._get_numeric_data 2 n
df.nsmallest 2 n
pd.scatter_matrix 2 n
df.rename_axis 2 n
df.__setstate__ 2 n
df.cumprod 2 n
df.__getstate__ 2 n
df.equals 2 n
df.__getitem__ 2 y
df.clip_upper 2 n
df.floordiv 2 n
df.to_excel 2 n
df.reindex_axis 1 n
pd.to_timedelta 1 n
df.ewm 1 n
df.tz_localize 1 n
df.tz_convert 1 n
df.to_hdf 1 n
df.lookup 1 n
pd.merge_ordered 1 n
df.swaplevel 1 n
df.first_valid_index 1 n
df.lt 1 n
df.add_suffix 1 n
pd.rolling_median 1 n
df.to_dense 1 n
df.mad 1 n
df.align 1 n
df.__copy__ 1 n
pd.set_eng_float_format 1 n
DataFrame method Eland Implementation? (Y/N/P/D) Notes for Current implementation
T N  
abs N  
add N  
add_prefix N  
add_suffix N  
agg aggregate Y  
align N  
all N  
any N  
append N  
apply N See agg
applymap N  
as_blocks N  
as_matrix N  
asfreq N  
asof N  
assign N  
astype N  
at N  
at_time N  
axes N  
between_time N  
bfill N  
blocks N  
bool N  
boxplot N  
clip N  
clip_lower N  
clip_upper N  
combine N  
combine_first N  
compound N  
consolidate N  
convert_objects N  
copy N  
corr N  
corrwith N  
count Y  
cov N  
cummax N  
cummin N  
cumprod N  
cumsum N  
describe Y  
diff N  
div N  
divide N  
dot N  
drop Y  
drop_duplicates N  
dropna N  
dtypes Y  
duplicated N  
empty Y  
eq N  
equals N  
eval N  
ewm N  
expanding N  
ffill N  
fillna N  
filter N  
first N  
first_valid_index N  
floordiv N  
from_csv N  
from_dict N  
from_items N  
from_records N  
ftypes N  
ge N  
get Y  
get_dtype_counts N  
get_ftype_counts N  
get_value N  
get_values N  
groupby N  
gt N  
head Y  
hist Y  
iat N  
idxmax N  
idxmin N  
iloc N  
infer_objects N  
info Y  
insert N  
interpolate N  
is_copy N  
isin N  
isna N  
isnull N  
items N  
iteritems N  
iterrows N  
itertuples N  
ix N  
join N  
keys Y  
kurt N  
kurtosis N  
last N  
last_valid_index N  
le N  
loc N  
lookup N  
lt N  
mad N  
mask N  
max Y  
mean Y  
median N  
melt N  
memory_usage N  
merge N  
min Y  
mod N  
mode N  
mul N  
multiply N  
ndim N  
ne N  
nlargest N  
notna N  
notnull N  
nsmallest N  
nunique Y  
pct_change N  
pipe N  
pivot N  
pivot_table N  
plot N  
pop N  
pow N  
prod N  
product N  
quantile N  
query Y  
radd N  
rank N  
rdiv N  
reindex N  
reindex_axis N  
reindex_like N  
rename N  
rename_axis N  
reorder_levels N  
replace N  
resample N  
reset_index N  
rfloordiv N  
rmod N  
rmul N  
rolling N  
round N  
rpow N  
rsub N  
rtruediv N  
sample N  
select N  
select_dtypes Y  
sem N  
set_axis N  
set_index N  
set_value N  
shape Y  
shift N  
size N  
skew N  
slice_shift N  
sort_index N  
sort_values N  
sortlevel N  
squeeze N  
stack N  
std N  
style N  
sub N  
subtract N  
sum Y  
swapaxes N  
swaplevel N  
tail Y  
take N  
to_clipboard N  
to_csv Y  
to_dense N  
to_dict N  
to_excel N  
to_feather N  
to_gbq N  
to_hdf N  
to_html Y  
to_json N  
to_latex N  
to_msgpack N  
to_panel N  
to_parquet N  
to_period N  
to_pickle N  
to_records N  
to_sparse N  
to_sql N  
to_stata N  
to_string Y Default sets max_rows=60
to_timestamp N  
to_xarray N  
transform N  
transpose N  
truediv N  
truncate N  
tshift N  
tz_convert N  
tz_localize N  
unstack N  
update N  
values N  
var N  
where N  
xs N Deprecated in pandas

Development

Contributing to eland

Eland is an open source project and we love to receive contributions from our community — you! There are many ways to contribute, from writing tutorials or blog posts, improving the documentation, submitting bug reports and feature requests or writing code which can be incorporated into eland itself.

Bug reports

If you think you have found a bug in eland, first make sure that you are testing against the latest version of eland - your issue may already have been fixed. If not, search our issues list on GitHub in case a similar issue has already been opened.

It is very helpful if you can prepare a reproduction of the bug. In other words, provide a small test case which we can run to confirm your bug. It makes it easier to find the problem and to fix it. Test cases should be provided as python scripts, ideally with some details of your Elasticsearch environment and index mappings, and (where appropriate) a pandas example.

Provide as much information as you can. You may think that the problem lies with your query, when actually it depends on how your data is indexed. The easier it is for us to recreate your problem, the faster it is likely to be fixed.

Feature requests

If you find yourself wishing for a feature that doesn’t exist in eland, you are probably not alone. There are bound to be others out there with similar needs. Many of the features that eland has today have been added because our users saw the need. Open an issue on our issues list on GitHub which describes the feature you would like to see, why you need it, and how it should work.

Contributing code and documentation changes

If you have a bugfix or new feature that you would like to contribute to eland, please find or open an issue about it first. Talk about what you would like to do. It may be that somebody is already working on it, or that there are particular issues that you should know about before implementing the change.

We enjoy working with contributors to get their code accepted. There are many approaches to fixing a problem and it is important to find the best approach before writing too much code.

Note that it is unlikely the project will merge refactors for the sake of refactoring. These types of pull requests have a high cost to maintainers in reviewing and testing with little to no tangible benefit. This especially includes changes generated by tools.

The process for contributing to any of the Elastic repositories is similar. Details for individual projects can be found below.

Fork and clone the repository

You will need to fork the main eland code or documentation repository and clone it to your local machine. See the GitHub help pages for instructions.

Further instructions for specific projects are given below.

Submitting your changes

Once your changes and tests are ready to submit for review:

  1. Test your changes

    Run the test suite to make sure that nothing is broken (TODO add link to testing doc).

  2. Sign the Contributor License Agreement

    Please make sure you have signed our Contributor License Agreement. We are not asking you to assign copyright to us, but to give us the right to distribute your code without restriction. We ask this of all contributors in order to assure our users of the origin and continuing existence of the code. You only need to sign the CLA once.

  3. Rebase your changes

    Update your local repository with the most recent code from the main eland repository, and rebase your branch on top of the latest master branch. We prefer your initial changes to be squashed into a single commit. Later, if we ask you to make changes, add them as separate commits. This makes them easier to review. As a final step before merging we will either ask you to squash all commits yourself or we’ll do it for you.

  4. Submit a pull request

    Push your local changes to your forked copy of the repository and submit a pull request. In the pull request, choose a title which sums up the changes that you have made, and in the body provide more details about what your changes do. Also mention the number of the issue where discussion has taken place, eg “Closes #123”.

Then sit back and wait. There will probably be discussion about the pull request and, if any changes are needed, we would love to work with you to get your pull request merged into eland.

Please adhere to the general guideline that you should never force push to a publicly shared branch. Once you have opened your pull request, you should consider your branch publicly shared. Instead of force pushing you can just add incremental commits; this is generally easier on your reviewers. If you need to pick up changes from master, you can merge master into your branch. A reviewer might ask you to rebase a long-running pull request in which case force pushing is okay for that request. Note that squashing at the end of the review process should also not be done, that can be done when the pull request is integrated via GitHub.

Contributing to the eland codebase

Repository: https://github.com/elastic/eland

We internally develop using the PyCharm IDE, currently with a minimum version of PyCharm 2019.2.4.

Configuring PyCharm And Running Tests

(All commands should be run from module root)

  • Create a new project via ‘Check out from Version Control’->’Git’ on the “Welcome to PyCharm” page (or other)
  • Enter the URL to your fork of eland (e.g. git@github.com:stevedodson/eland.git)
  • Click ‘Yes’ for ‘Checkout from Version Control’
  • Configure PyCharm environment:
  • In ‘Preferences’ configure a ‘Project: eland’->’Project Interpreter’. Generally, we recommend creating a virtual environment (TODO link to installing for python version support).
  • In ‘Preferences’ set ‘Tools’->’Python Integrated Tools’->’Default test runner’ to pytest
  • In ‘Preferences’ set ‘Tools’->’Python Integrated Tools’->’Docstring format’ to numpy
  • Install development requirements. Open terminal in virtual environment and run pip install -r requirements-dev.txt
  • Set up an Elasticsearch instance (assumes localhost:9200), and run python -m eland.tests.setup_tests to set up the test environment - note this modifies Elasticsearch indices
  • Run pytest --doctest-modules to validate install
  • To test specific versions of Python use nox -s test-3.8
  • To run the automatic formatter and check for lint issues run nox -s blacken

Documentation

  • Install documentation requirements. Open terminal in virtual environment and run pip install -r requirements-dev.txt

Examples

Eland Demo Notebook

[1]:
import eland as ed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from elasticsearch import Elasticsearch

# Import standard test settings for consistent results
from eland.conftest import *

Compare eland DataFrame vs pandas DataFrame

Create an eland.DataFrame from a flights index

[2]:
ed_flights = ed.read_es('localhost', 'flights')
[3]:
type(ed_flights)
[3]:
eland.dataframe.DataFrame

Compare to pandas DataFrame (created from the same data)

[4]:
pd_flights = ed.eland_to_pandas(ed_flights)
[5]:
type(pd_flights)
[5]:
pandas.core.frame.DataFrame

Attributes and underlying data

DataFrame.columns
[6]:
pd_flights.columns
[6]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
       'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
       'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
       'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
       'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
       'timestamp'],
      dtype='object')
[7]:
ed_flights.columns
[7]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
       'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
       'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
       'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
       'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
       'timestamp'],
      dtype='object')
DataFrame.dtypes
[8]:
pd_flights.dtypes
[8]:
AvgTicketPrice           float64
Cancelled                   bool
Carrier                   object
Dest                      object
DestAirportID             object
                       ...
OriginLocation            object
OriginRegion              object
OriginWeather             object
dayOfWeek                  int64
timestamp         datetime64[ns]
Length: 27, dtype: object
[9]:
ed_flights.dtypes
[9]:
AvgTicketPrice           float64
Cancelled                   bool
Carrier                   object
Dest                      object
DestAirportID             object
                       ...
OriginLocation            object
OriginRegion              object
OriginWeather             object
dayOfWeek                  int64
timestamp         datetime64[ns]
Length: 27, dtype: object
DataFrame.select_dtypes
[10]:
pd_flights.select_dtypes(include=np.number)
[10]:
AvgTicketPrice DistanceKilometers ... FlightTimeMin dayOfWeek
0 841.265642 16492.326654 ... 1030.770416 0
1 882.982662 8823.400140 ... 464.389481 0
2 190.636904 0.000000 ... 0.000000 0
3 181.694216 555.737767 ... 222.749059 0
4 730.041778 13358.244200 ... 785.779071 0
... ... ... ... ... ...
13054 1080.446279 8058.581753 ... 402.929088 6
13055 646.612941 7088.598322 ... 644.418029 6
13056 997.751876 10920.652972 ... 937.540811 6
13057 1102.814465 18748.859647 ... 1697.404971 6
13058 858.144337 16809.141923 ... 1610.761827 6

13059 rows × 7 columns

[11]:
ed_flights.select_dtypes(include=np.number)
[11]:
AvgTicketPrice DistanceKilometers ... FlightTimeMin dayOfWeek
0 841.265642 16492.326654 ... 1030.770416 0
1 882.982662 8823.400140 ... 464.389481 0
2 190.636904 0.000000 ... 0.000000 0
3 181.694216 555.737767 ... 222.749059 0
4 730.041778 13358.244200 ... 785.779071 0
... ... ... ... ... ...
13054 1080.446279 8058.581753 ... 402.929088 6
13055 646.612941 7088.598322 ... 644.418029 6
13056 997.751876 10920.652972 ... 937.540811 6
13057 1102.814465 18748.859647 ... 1697.404971 6
13058 858.144337 16809.141923 ... 1610.761827 6

13059 rows × 7 columns

DataFrame.empty
[12]:
pd_flights.empty
[12]:
False
[13]:
ed_flights.empty
[13]:
False
DataFrame.shape
[14]:
pd_flights.shape
[14]:
(13059, 27)
[15]:
ed_flights.shape
[15]:
(13059, 27)
DataFrame.index

Note, eland.DataFrame.index does not mirror pandas.DataFrame.index.

[16]:
pd_flights.index
[16]:
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
       ...
       '13049', '13050', '13051', '13052', '13053', '13054', '13055', '13056', '13057', '13058'],
      dtype='object', length=13059)
[17]:
# NBVAL_IGNORE_OUTPUT
ed_flights.index
[17]:
<eland.index.Index at 0x7fc6765aefd0>
[18]:
ed_flights.index.index_field
[18]:
'_id'
DataFrame.values

Note, eland.DataFrame.values is not supported.

[19]:
pd_flights.values
[19]:
array([[841.2656419677076, False, 'Kibana Airlines', ..., 'Sunny', 0,
        Timestamp('2018-01-01 00:00:00')],
       [882.9826615595518, False, 'Logstash Airways', ..., 'Clear', 0,
        Timestamp('2018-01-01 18:27:00')],
       [190.6369038508356, False, 'Logstash Airways', ..., 'Rain', 0,
        Timestamp('2018-01-01 17:11:14')],
       ...,
       [997.7518761454494, False, 'Logstash Airways', ..., 'Sunny', 6,
        Timestamp('2018-02-11 04:09:27')],
       [1102.8144645388556, False, 'JetBeats', ..., 'Hail', 6,
        Timestamp('2018-02-11 08:28:21')],
       [858.1443369038839, False, 'JetBeats', ..., 'Rain', 6,
        Timestamp('2018-02-11 14:54:34')]], dtype=object)
[20]:
try:
    ed_flights.values
except AttributeError as e:
    print(e)
This method would scan/scroll the entire Elasticsearch index(s) into memory. If this is explicitly required, and there is sufficient memory, call `ed.eland_to_pandas(ed_df).values`

Indexing, iteration

DataFrame.head
[21]:
pd_flights.head()
[21]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
0 841.265642 False ... 0 2018-01-01 00:00:00
1 882.982662 False ... 0 2018-01-01 18:27:00
2 190.636904 False ... 0 2018-01-01 17:11:14
3 181.694216 True ... 0 2018-01-01 10:33:28
4 730.041778 False ... 0 2018-01-01 05:13:00

5 rows × 27 columns

[22]:
ed_flights.head()
[22]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
0 841.265642 False ... 0 2018-01-01 00:00:00
1 882.982662 False ... 0 2018-01-01 18:27:00
2 190.636904 False ... 0 2018-01-01 17:11:14
3 181.694216 True ... 0 2018-01-01 10:33:28
4 730.041778 False ... 0 2018-01-01 05:13:00

5 rows × 27 columns

DataFrame.tail
[23]:
pd_flights.tail()
[23]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
13054 1080.446279 False ... 6 2018-02-11 20:42:25
13055 646.612941 False ... 6 2018-02-11 01:41:57
13056 997.751876 False ... 6 2018-02-11 04:09:27
13057 1102.814465 False ... 6 2018-02-11 08:28:21
13058 858.144337 False ... 6 2018-02-11 14:54:34

5 rows × 27 columns

[24]:
ed_flights.tail()
[24]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
13054 1080.446279 False ... 6 2018-02-11 20:42:25
13055 646.612941 False ... 6 2018-02-11 01:41:57
13056 997.751876 False ... 6 2018-02-11 04:09:27
13057 1102.814465 False ... 6 2018-02-11 08:28:21
13058 858.144337 False ... 6 2018-02-11 14:54:34

5 rows × 27 columns

DataFrame.keys
[25]:
pd_flights.keys()
[25]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
       'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
       'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
       'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
       'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
       'timestamp'],
      dtype='object')
[26]:
ed_flights.keys()
[26]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
       'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
       'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
       'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
       'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
       'timestamp'],
      dtype='object')
DataFrame.get
[27]:
pd_flights.get('Carrier')
[27]:
0         Kibana Airlines
1        Logstash Airways
2        Logstash Airways
3         Kibana Airlines
4         Kibana Airlines
               ...
13054    Logstash Airways
13055    Logstash Airways
13056    Logstash Airways
13057            JetBeats
13058            JetBeats
Name: Carrier, Length: 13059, dtype: object
[28]:
ed_flights.get('Carrier')
[28]:
0         Kibana Airlines
1        Logstash Airways
2        Logstash Airways
3         Kibana Airlines
4         Kibana Airlines
               ...
13054    Logstash Airways
13055    Logstash Airways
13056    Logstash Airways
13057            JetBeats
13058            JetBeats
Name: Carrier, Length: 13059, dtype: object
[29]:
pd_flights.get(['Carrier', 'Origin'])
[29]:
Carrier Origin
0 Kibana Airlines Frankfurt am Main Airport
1 Logstash Airways Cape Town International Airport
2 Logstash Airways Venice Marco Polo Airport
3 Kibana Airlines Naples International Airport
4 Kibana Airlines Licenciado Benito Juarez International Airport
... ... ...
13054 Logstash Airways Pisa International Airport
13055 Logstash Airways Winnipeg / James Armstrong Richardson Internat...
13056 Logstash Airways Licenciado Benito Juarez International Airport
13057 JetBeats Itami Airport
13058 JetBeats Adelaide International Airport

13059 rows × 2 columns

List input is not currently supported by eland.DataFrame.get

[30]:
try:
    ed_flights.get(['Carrier', 'Origin'])
except TypeError as e:
    print(e)
unhashable type: 'list'
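As a workaround, select multiple columns with bracket indexing, which eland does support. A minimal sketch using pandas (with a toy frame standing in for the flights data) shows that in pandas the two forms are equivalent:

```python
import pandas as pd

# Toy frame standing in for the flights data.
df = pd.DataFrame({"Carrier": ["JetBeats", "Kibana Airlines"],
                   "Origin": ["Itami Airport", "Frankfurt am Main Airport"],
                   "Dest": ["A", "B"]})

# In pandas these two selections are equivalent; eland currently
# supports only the bracket form for multiple columns.
via_get = df.get(["Carrier", "Origin"])
via_brackets = df[["Carrier", "Origin"]]
assert via_get.equals(via_brackets)
```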
DataFrame.query
[31]:
pd_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled == True')
[31]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
8 960.869736 True ... 0 2018-01-01 12:09:35
26 975.812632 True ... 0 2018-01-01 15:38:32
311 946.358410 True ... 0 2018-01-01 11:51:12
651 975.383864 True ... 2 2018-01-03 21:13:17
950 907.836523 True ... 2 2018-01-03 05:14:51
... ... ... ... ... ...
12820 909.973606 True ... 5 2018-02-10 05:11:35
12906 983.429244 True ... 6 2018-02-11 06:19:58
12918 1136.678150 True ... 6 2018-02-11 16:03:10
12919 1105.211803 True ... 6 2018-02-11 05:36:05
13013 1055.350213 True ... 6 2018-02-11 13:20:16

68 rows × 27 columns

eland.DataFrame.query requires an explicit qualifier on boolean columns, i.e.

ed_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled') fails
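The distinction can be checked in pandas itself (a minimal sketch with a toy frame): pandas accepts both the bare boolean column and the qualified form, while eland requires the qualified form.

```python
import pandas as pd

df = pd.DataFrame({"Cancelled": [True, False, True],
                   "AvgTicketPrice": [960.87, 100.02, 1105.21]})

bare = df.query('Cancelled')               # pandas-only shorthand
qualified = df.query('Cancelled == True')  # accepted by both pandas and eland
assert bare.equals(qualified)
```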

[32]:
ed_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled == True')
[32]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
8 960.869736 True ... 0 2018-01-01 12:09:35
26 975.812632 True ... 0 2018-01-01 15:38:32
311 946.358410 True ... 0 2018-01-01 11:51:12
651 975.383864 True ... 2 2018-01-03 21:13:17
950 907.836523 True ... 2 2018-01-03 05:14:51
... ... ... ... ... ...
12820 909.973606 True ... 5 2018-02-10 05:11:35
12906 983.429244 True ... 6 2018-02-11 06:19:58
12918 1136.678150 True ... 6 2018-02-11 16:03:10
12919 1105.211803 True ... 6 2018-02-11 05:36:05
13013 1055.350213 True ... 6 2018-02-11 13:20:16

68 rows × 27 columns

Boolean indexing query
[33]:
pd_flights[(pd_flights.Carrier=="Kibana Airlines") &
           (pd_flights.AvgTicketPrice > 900.0) &
           (pd_flights.Cancelled == True)]
[33]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
8 960.869736 True ... 0 2018-01-01 12:09:35
26 975.812632 True ... 0 2018-01-01 15:38:32
311 946.358410 True ... 0 2018-01-01 11:51:12
651 975.383864 True ... 2 2018-01-03 21:13:17
950 907.836523 True ... 2 2018-01-03 05:14:51
... ... ... ... ... ...
12820 909.973606 True ... 5 2018-02-10 05:11:35
12906 983.429244 True ... 6 2018-02-11 06:19:58
12918 1136.678150 True ... 6 2018-02-11 16:03:10
12919 1105.211803 True ... 6 2018-02-11 05:36:05
13013 1055.350213 True ... 6 2018-02-11 13:20:16

68 rows × 27 columns

[34]:
ed_flights[(ed_flights.Carrier=="Kibana Airlines") &
           (ed_flights.AvgTicketPrice > 900.0) &
           (ed_flights.Cancelled == True)]
[34]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
8 960.869736 True ... 0 2018-01-01 12:09:35
26 975.812632 True ... 0 2018-01-01 15:38:32
311 946.358410 True ... 0 2018-01-01 11:51:12
651 975.383864 True ... 2 2018-01-03 21:13:17
950 907.836523 True ... 2 2018-01-03 05:14:51
... ... ... ... ... ...
12820 909.973606 True ... 5 2018-02-10 05:11:35
12906 983.429244 True ... 6 2018-02-11 06:19:58
12918 1136.678150 True ... 6 2018-02-11 16:03:10
12919 1105.211803 True ... 6 2018-02-11 05:36:05
13013 1055.350213 True ... 6 2018-02-11 13:20:16

68 rows × 27 columns

Function application, GroupBy & window

DataFrame.aggs
[35]:
pd_flights[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
[35]:
DistanceKilometers AvgTicketPrice
sum 9.261629e+07 8.204365e+06
min 0.000000e+00 1.000205e+02
std 4.578438e+03 2.663969e+02

eland.DataFrame.aggregate currently only supports numeric columns

[36]:
ed_flights[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
[36]:
DistanceKilometers AvgTicketPrice
sum 9.261629e+07 8.204365e+06
min 0.000000e+00 1.000205e+02
std 4.578263e+03 2.663867e+02

Computations / descriptive stats

DataFrame.count
[37]:
pd_flights.count()
[37]:
AvgTicketPrice    13059
Cancelled         13059
Carrier           13059
Dest              13059
DestAirportID     13059
                  ...
OriginLocation    13059
OriginRegion      13059
OriginWeather     13059
dayOfWeek         13059
timestamp         13059
Length: 27, dtype: int64
[38]:
ed_flights.count()
[38]:
AvgTicketPrice    13059
Cancelled         13059
Carrier           13059
Dest              13059
DestAirportID     13059
                  ...
OriginLocation    13059
OriginRegion      13059
OriginWeather     13059
dayOfWeek         13059
timestamp         13059
Length: 27, dtype: int64
DataFrame.describe
[39]:
pd_flights.describe()
[39]:
AvgTicketPrice DistanceKilometers ... FlightTimeMin dayOfWeek
count 13059.000000 13059.000000 ... 13059.000000 13059.000000
mean 628.253689 7092.142455 ... 511.127842 2.835975
std 266.396861 4578.438497 ... 334.753952 1.939439
min 100.020528 0.000000 ... 0.000000 0.000000
25% 409.893816 2459.705673 ... 252.333192 1.000000
50% 640.556668 7610.330866 ... 503.045170 3.000000
75% 842.185470 9736.637600 ... 720.416036 4.000000
max 1199.729053 19881.482315 ... 1902.902032 6.000000

8 rows × 7 columns

Values returned from eland.DataFrame.describe may vary slightly, because they are computed from Elasticsearch aggregations, some of which (e.g. percentiles) are approximate.

[40]:
# NBVAL_IGNORE_OUTPUT
ed_flights.describe()
[40]:
AvgTicketPrice DistanceKilometers ... FlightTimeMin dayOfWeek
count 13059.000000 13059.000000 ... 13059.000000 13059.000000
mean 628.253689 7092.142457 ... 511.127842 2.835975
std 266.386661 4578.263193 ... 334.741135 1.939365
min 100.020531 0.000000 ... 0.000000 0.000000
25% 410.012798 2470.545974 ... 251.682199 1.000000
50% 640.362667 7612.072403 ... 503.148975 3.000000
75% 842.233478 9735.660463 ... 720.572969 4.271242
max 1199.729004 19881.482422 ... 1902.901978 6.000000

8 rows × 7 columns

DataFrame.info
[41]:
pd_flights.info()
<class 'pandas.core.frame.DataFrame'>
Index: 13059 entries, 0 to 13058
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   AvgTicketPrice      13059 non-null  float64
 1   Cancelled           13059 non-null  bool
 2   Carrier             13059 non-null  object
 3   Dest                13059 non-null  object
 4   DestAirportID       13059 non-null  object
 5   DestCityName        13059 non-null  object
 6   DestCountry         13059 non-null  object
 7   DestLocation        13059 non-null  object
 8   DestRegion          13059 non-null  object
 9   DestWeather         13059 non-null  object
 10  DistanceKilometers  13059 non-null  float64
 11  DistanceMiles       13059 non-null  float64
 12  FlightDelay         13059 non-null  bool
 13  FlightDelayMin      13059 non-null  int64
 14  FlightDelayType     13059 non-null  object
 15  FlightNum           13059 non-null  object
 16  FlightTimeHour      13059 non-null  float64
 17  FlightTimeMin       13059 non-null  float64
 18  Origin              13059 non-null  object
 19  OriginAirportID     13059 non-null  object
 20  OriginCityName      13059 non-null  object
 21  OriginCountry       13059 non-null  object
 22  OriginLocation      13059 non-null  object
 23  OriginRegion        13059 non-null  object
 24  OriginWeather       13059 non-null  object
 25  dayOfWeek           13059 non-null  int64
 26  timestamp           13059 non-null  datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17)
memory usage: 3.2+ MB
[42]:
ed_flights.info()
<class 'eland.dataframe.DataFrame'>
Index: 13059 entries, 0 to 13058
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   AvgTicketPrice      13059 non-null  float64
 1   Cancelled           13059 non-null  bool
 2   Carrier             13059 non-null  object
 3   Dest                13059 non-null  object
 4   DestAirportID       13059 non-null  object
 5   DestCityName        13059 non-null  object
 6   DestCountry         13059 non-null  object
 7   DestLocation        13059 non-null  object
 8   DestRegion          13059 non-null  object
 9   DestWeather         13059 non-null  object
 10  DistanceKilometers  13059 non-null  float64
 11  DistanceMiles       13059 non-null  float64
 12  FlightDelay         13059 non-null  bool
 13  FlightDelayMin      13059 non-null  int64
 14  FlightDelayType     13059 non-null  object
 15  FlightNum           13059 non-null  object
 16  FlightTimeHour      13059 non-null  float64
 17  FlightTimeMin       13059 non-null  float64
 18  Origin              13059 non-null  object
 19  OriginAirportID     13059 non-null  object
 20  OriginCityName      13059 non-null  object
 21  OriginCountry       13059 non-null  object
 22  OriginLocation      13059 non-null  object
 23  OriginRegion        13059 non-null  object
 24  OriginWeather       13059 non-null  object
 25  dayOfWeek           13059 non-null  int64
 26  timestamp           13059 non-null  datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17)
memory usage: 80.0 bytes
DataFrame.max, DataFrame.min, DataFrame.mean, DataFrame.sum
max
[43]:
pd_flights.max(numeric_only=True)
[43]:
AvgTicketPrice        1199.73
Cancelled                True
DistanceKilometers    19881.5
DistanceMiles         12353.8
FlightDelay              True
FlightDelayMin            360
FlightTimeHour         31.715
FlightTimeMin          1902.9
dayOfWeek                   6
dtype: object

eland.DataFrame.max, min, mean and sum only aggregate numeric columns

[44]:
ed_flights.max(numeric_only=True)
[44]:
AvgTicketPrice        1199.73
Cancelled                True
DistanceKilometers    19881.5
DistanceMiles         12353.8
FlightDelay              True
FlightDelayMin            360
FlightTimeHour         31.715
FlightTimeMin          1902.9
dayOfWeek                   6
dtype: object
min
[45]:
pd_flights.min(numeric_only=True)
[45]:
AvgTicketPrice        100.021
Cancelled               False
DistanceKilometers          0
DistanceMiles               0
FlightDelay             False
FlightDelayMin              0
FlightTimeHour              0
FlightTimeMin               0
dayOfWeek                   0
dtype: object
[46]:
ed_flights.min(numeric_only=True)
[46]:
AvgTicketPrice        100.021
Cancelled               False
DistanceKilometers          0
DistanceMiles               0
FlightDelay             False
FlightDelayMin              0
FlightTimeHour              0
FlightTimeMin               0
dayOfWeek                   0
dtype: object
mean
[47]:
pd_flights.mean(numeric_only=True)
[47]:
AvgTicketPrice         628.253689
Cancelled                0.128494
DistanceKilometers    7092.142455
DistanceMiles         4406.853013
FlightDelay              0.251168
FlightDelayMin          47.335171
FlightTimeHour           8.518797
FlightTimeMin          511.127842
dayOfWeek                2.835975
dtype: float64
[48]:
ed_flights.mean(numeric_only=True)
[48]:
AvgTicketPrice         628.253689
Cancelled                0.128494
DistanceKilometers    7092.142457
DistanceMiles         4406.853010
FlightDelay              0.251168
FlightDelayMin          47.335171
FlightTimeHour           8.518797
FlightTimeMin          511.127842
dayOfWeek                2.835975
dtype: float64
sum
[49]:
pd_flights.sum(numeric_only=True)
[49]:
AvgTicketPrice        8.204365e+06
Cancelled             1.678000e+03
DistanceKilometers    9.261629e+07
DistanceMiles         5.754909e+07
FlightDelay           3.280000e+03
FlightDelayMin        6.181500e+05
FlightTimeHour        1.112470e+05
FlightTimeMin         6.674818e+06
dayOfWeek             3.703500e+04
dtype: float64
[50]:
ed_flights.sum(numeric_only=True)
[50]:
AvgTicketPrice        8.204365e+06
Cancelled             1.678000e+03
DistanceKilometers    9.261629e+07
DistanceMiles         5.754909e+07
FlightDelay           3.280000e+03
FlightDelayMin        6.181500e+05
FlightTimeHour        1.112470e+05
FlightTimeMin         6.674818e+06
dayOfWeek             3.703500e+04
dtype: float64
DataFrame.nunique
[51]:
pd_flights[['Carrier', 'Origin', 'Dest']].nunique()
[51]:
Carrier      4
Origin     156
Dest       156
dtype: int64
[52]:
ed_flights[['Carrier', 'Origin', 'Dest']].nunique()
[52]:
Carrier      4
Origin     156
Dest       156
dtype: int64
DataFrame.drop
[53]:
pd_flights.drop(columns=['AvgTicketPrice',
                         'Cancelled',
                         'DestLocation',
                         'Dest',
                         'DestAirportID',
                         'DestCityName',
                         'DestCountry'])
[53]:
Carrier DestRegion ... dayOfWeek timestamp
0 Kibana Airlines SE-BD ... 0 2018-01-01 00:00:00
1 Logstash Airways IT-34 ... 0 2018-01-01 18:27:00
2 Logstash Airways IT-34 ... 0 2018-01-01 17:11:14
3 Kibana Airlines IT-34 ... 0 2018-01-01 10:33:28
4 Kibana Airlines SE-BD ... 0 2018-01-01 05:13:00
... ... ... ... ... ...
13054 Logstash Airways SE-BD ... 6 2018-02-11 20:42:25
13055 Logstash Airways CH-ZH ... 6 2018-02-11 01:41:57
13056 Logstash Airways RU-AMU ... 6 2018-02-11 04:09:27
13057 JetBeats SE-BD ... 6 2018-02-11 08:28:21
13058 JetBeats US-DC ... 6 2018-02-11 14:54:34

13059 rows × 20 columns

[54]:
ed_flights.drop(columns=['AvgTicketPrice',
                         'Cancelled',
                         'DestLocation',
                         'Dest',
                         'DestAirportID',
                         'DestCityName',
                         'DestCountry'])
[54]:
Carrier DestRegion ... dayOfWeek timestamp
0 Kibana Airlines SE-BD ... 0 2018-01-01 00:00:00
1 Logstash Airways IT-34 ... 0 2018-01-01 18:27:00
2 Logstash Airways IT-34 ... 0 2018-01-01 17:11:14
3 Kibana Airlines IT-34 ... 0 2018-01-01 10:33:28
4 Kibana Airlines SE-BD ... 0 2018-01-01 05:13:00
... ... ... ... ... ...
13054 Logstash Airways SE-BD ... 6 2018-02-11 20:42:25
13055 Logstash Airways CH-ZH ... 6 2018-02-11 01:41:57
13056 Logstash Airways RU-AMU ... 6 2018-02-11 04:09:27
13057 JetBeats SE-BD ... 6 2018-02-11 08:28:21
13058 JetBeats US-DC ... 6 2018-02-11 14:54:34

13059 rows × 20 columns

Plotting
[55]:
pd_flights.select_dtypes(include=np.number).hist(figsize=[10,10])
plt.show()
_images/examples_demo_notebook_92_1.png
[56]:
ed_flights.select_dtypes(include=np.number).hist(figsize=[10,10])
plt.show()
_images/examples_demo_notebook_93_0.png
Elasticsearch utilities
[57]:
ed_flights2 = ed_flights[(ed_flights.OriginAirportID == 'AMS') & (ed_flights.FlightDelayMin > 60)]
ed_flights2 = ed_flights2[['timestamp', 'OriginAirportID', 'DestAirportID', 'FlightDelayMin']]
ed_flights2 = ed_flights2.tail()
[59]:
print(ed_flights2.info_es())
index_pattern: flights
Index:
 index_field: _id
 is_source_field: False
Mappings:
 capabilities:
                   es_field_name  is_source es_dtype                  es_date_format        pd_dtype  is_searchable  is_aggregatable  is_scripted aggregatable_es_field_name
timestamp              timestamp       True     date  strict_date_hour_minute_second  datetime64[ns]           True             True        False                  timestamp
OriginAirportID  OriginAirportID       True  keyword                            None          object           True             True        False            OriginAirportID
DestAirportID      DestAirportID       True  keyword                            None          object           True             True        False              DestAirportID
FlightDelayMin    FlightDelayMin       True  integer                            None           int64           True             True        False             FlightDelayMin
Operations:
 tasks: [('boolean_filter': ('boolean_filter': {'bool': {'must': [{'term': {'OriginAirportID': 'AMS'}}, {'range': {'FlightDelayMin': {'gt': 60}}}]}})), ('tail': ('sort_field': '_doc', 'count': 5))]
 size: 5
 sort_params: _doc:desc
 _source: ['timestamp', 'OriginAirportID', 'DestAirportID', 'FlightDelayMin']
 body: {'query': {'bool': {'must': [{'term': {'OriginAirportID': 'AMS'}}, {'range': {'FlightDelayMin': {'gt': 60}}}]}}}
 post_processing: [('sort_index')]

Online Retail Analysis

[1]:
import eland as ed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Fix console size for consistent test results
from eland.conftest import *

Getting Started

To get started, let’s create an eland.DataFrame by reading a csv file. This creates and populates the online-retail index in the local Elasticsearch cluster.

[2]:
df = ed.read_csv("data/online-retail.csv.gz",
                 es_client='localhost',
                 es_dest_index='online-retail',
                 es_if_exists='replace',
                 es_dropna=True,
                 es_refresh=True,
                 compression='gzip',
                 index_col=0)

Here we see that the "_id" field was used to index our data frame.

[3]:
df.index.index_field
[3]:
'_id'

Next, we can check which fields from elasticsearch are available to our eland data frame. columns is available as a parameter when instantiating the data frame, which allows you to include only a subset of fields from your index. Since we didn’t set this parameter, we have access to all fields.
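The effect of the columns parameter can be sketched with pandas (whose own DataFrame constructor accepts a columns argument with the same effect of restricting the frame to a subset of fields):

```python
import pandas as pd

data = {"Country": ["United Kingdom"], "UnitPrice": [1.25], "Quantity": [1]}

# Restrict the frame to two columns at construction time, analogous
# to passing columns= when instantiating an eland.DataFrame.
df_subset = pd.DataFrame(data, columns=["Country", "UnitPrice"])
print(list(df_subset.columns))  # ['Country', 'UnitPrice']
```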

[4]:
df.columns
[4]:
Index(['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode',
       'UnitPrice'],
      dtype='object')

Now, let’s see the data types of our fields. Running df.dtypes, we can see that elasticsearch field types are mapped to pandas field types.

[5]:
df.dtypes
[5]:
Country         object
CustomerID     float64
Description     object
InvoiceDate     object
InvoiceNo       object
Quantity         int64
StockCode       object
UnitPrice      float64
dtype: object

We also offer a .info_es() data frame method that shows detailed information about the underlying index. It also shows the operations that data frame methods pass through to elasticsearch. More on this later.

[6]:
print(df.info_es())
index_pattern: online-retail
Index:
 index_field: _id
 is_source_field: False
Mappings:
 capabilities:
            es_field_name  is_source es_dtype es_date_format pd_dtype  is_searchable  is_aggregatable  is_scripted aggregatable_es_field_name
Country           Country       True  keyword           None   object           True             True        False                    Country
CustomerID     CustomerID       True   double           None  float64           True             True        False                 CustomerID
Description   Description       True  keyword           None   object           True             True        False                Description
InvoiceDate   InvoiceDate       True  keyword           None   object           True             True        False                InvoiceDate
InvoiceNo       InvoiceNo       True  keyword           None   object           True             True        False                  InvoiceNo
Quantity         Quantity       True     long           None    int64           True             True        False                   Quantity
StockCode       StockCode       True  keyword           None   object           True             True        False                  StockCode
UnitPrice       UnitPrice       True   double           None  float64           True             True        False                  UnitPrice
Operations:
 tasks: []
 size: None
 sort_params: None
 _source: ['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode', 'UnitPrice']
 body: {}
 post_processing: []

Selecting and Indexing Data

Now that we understand how to create a data frame and access its underlying attributes, let’s see how we can select subsets of our data.

head and tail

Much like pandas, eland data frames offer .head(n) and .tail(n) methods that return the first and last n rows, respectively.

[7]:
df.head(2)
[7]:
Country CustomerID ... StockCode UnitPrice
1000 United Kingdom 14729.0 ... 21123 1.25
1001 United Kingdom 14729.0 ... 21124 1.25

2 rows × 8 columns

[8]:
print(df.tail(2).head(2).tail(2).info_es())
index_pattern: online-retail
Index:
 index_field: _id
 is_source_field: False
Mappings:
 capabilities:
            es_field_name  is_source es_dtype es_date_format pd_dtype  is_searchable  is_aggregatable  is_scripted aggregatable_es_field_name
Country           Country       True  keyword           None   object           True             True        False                    Country
CustomerID     CustomerID       True   double           None  float64           True             True        False                 CustomerID
Description   Description       True  keyword           None   object           True             True        False                Description
InvoiceDate   InvoiceDate       True  keyword           None   object           True             True        False                InvoiceDate
InvoiceNo       InvoiceNo       True  keyword           None   object           True             True        False                  InvoiceNo
Quantity         Quantity       True     long           None    int64           True             True        False                   Quantity
StockCode       StockCode       True  keyword           None   object           True             True        False                  StockCode
UnitPrice       UnitPrice       True   double           None  float64           True             True        False                  UnitPrice
Operations:
 tasks: [('tail': ('sort_field': '_doc', 'count': 2)), ('head': ('sort_field': '_doc', 'count': 2)), ('tail': ('sort_field': '_doc', 'count': 2))]
 size: 2
 sort_params: _doc:desc
 _source: ['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode', 'UnitPrice']
 body: {}
 post_processing: [('sort_index'), ('head': ('count': 2)), ('tail': ('count': 2))]

[9]:
df.tail(2)
[9]:
Country CustomerID ... StockCode UnitPrice
14998 United Kingdom 17419.0 ... 21773 1.25
14999 United Kingdom 17419.0 ... 22149 2.10

2 rows × 8 columns

selecting columns

You can also pass a list of columns to select columns from the data frame in a specified order.

[10]:
df[['Country', 'InvoiceDate']].head(5)
[10]:
Country InvoiceDate
1000 United Kingdom 2010-12-01 12:43:00
1001 United Kingdom 2010-12-01 12:43:00
1002 United Kingdom 2010-12-01 12:43:00
1003 United Kingdom 2010-12-01 12:43:00
1004 United Kingdom 2010-12-01 12:43:00

5 rows × 2 columns

Boolean Indexing

We also allow you to filter the data frame using boolean indexing. Under the hood, a boolean index maps to a term (or terms) query that is then passed to elasticsearch to filter the index.

[11]:
# the construction of a boolean vector maps directly to an elasticsearch query
print(df['Country']=='Germany')
df[(df['Country']=='Germany')].head(5)
{'term': {'Country': 'Germany'}}
[11]:
Country CustomerID ... StockCode UnitPrice
1109 Germany 12662.0 ... 22809 2.95
1110 Germany 12662.0 ... 84347 2.55
1111 Germany 12662.0 ... 84945 0.85
1112 Germany 12662.0 ... 22242 1.65
1113 Germany 12662.0 ... 22244 1.95

5 rows × 8 columns

We can also filter the data frame using a list of values.

[12]:
print(df['Country'].isin(['Germany', 'United States']))
df[df['Country'].isin(['Germany', 'United Kingdom'])].head(5)
{'terms': {'Country': ['Germany', 'United States']}}
[12]:
Country CustomerID ... StockCode UnitPrice
1000 United Kingdom 14729.0 ... 21123 1.25
1001 United Kingdom 14729.0 ... 21124 1.25
1002 United Kingdom 14729.0 ... 21122 1.25
1003 United Kingdom 14729.0 ... 84378 1.25
1004 United Kingdom 14729.0 ... 21985 0.29

5 rows × 8 columns

We can also combine boolean vectors to further filter the data frame.

[13]:
df[(df['Country']=='Germany') & (df['Quantity']>90)]
[13]:
Country CustomerID ... StockCode UnitPrice

0 rows × 8 columns

Using this example, let’s see how eland translates this boolean filter into an elasticsearch bool query.

[14]:
print(df[(df['Country']=='Germany') & (df['Quantity']>90)].info_es())
index_pattern: online-retail
Index:
 index_field: _id
 is_source_field: False
Mappings:
 capabilities:
            es_field_name  is_source es_dtype es_date_format pd_dtype  is_searchable  is_aggregatable  is_scripted aggregatable_es_field_name
Country           Country       True  keyword           None   object           True             True        False                    Country
CustomerID     CustomerID       True   double           None  float64           True             True        False                 CustomerID
Description   Description       True  keyword           None   object           True             True        False                Description
InvoiceDate   InvoiceDate       True  keyword           None   object           True             True        False                InvoiceDate
InvoiceNo       InvoiceNo       True  keyword           None   object           True             True        False                  InvoiceNo
Quantity         Quantity       True     long           None    int64           True             True        False                   Quantity
StockCode       StockCode       True  keyword           None   object           True             True        False                  StockCode
UnitPrice       UnitPrice       True   double           None  float64           True             True        False                  UnitPrice
Operations:
 tasks: [('boolean_filter': ('boolean_filter': {'bool': {'must': [{'term': {'Country': 'Germany'}}, {'range': {'Quantity': {'gt': 90}}}]}}))]
 size: None
 sort_params: None
 _source: ['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode', 'UnitPrice']
 body: {'query': {'bool': {'must': [{'term': {'Country': 'Germany'}}, {'range': {'Quantity': {'gt': 90}}}]}}}
 post_processing: []

Aggregation and Descriptive Statistics

Let’s begin to ask some questions of our data and use eland to get the answers.

How many different countries are there?

[15]:
df['Country'].nunique()
[15]:
16

What is the total sum of products ordered?

[16]:
df['Quantity'].sum()
[16]:
111960.0

Show me the sum, mean, min, and max of the Quantity and UnitPrice fields

[17]:
df[['Quantity','UnitPrice']].agg(['sum', 'mean', 'max', 'min'])
[17]:
Quantity UnitPrice
sum 111960.000 61548.490000
mean 7.464 4.103233
max 2880.000 950.990000
min -9360.000 0.000000

Give me descriptive statistics for the entire data frame

[18]:
# NBVAL_IGNORE_OUTPUT
df.describe()
[18]:
CustomerID Quantity UnitPrice
count 10729.000000 15000.000000 15000.000000
mean 15590.776680 7.464000 4.103233
std 1764.025160 85.924387 20.104873
min 12347.000000 -9360.000000 0.000000
25% 14227.934845 1.000000 1.250000
50% 15669.138235 2.000000 2.510000
75% 17212.690092 6.610262 4.211297
max 18239.000000 2880.000000 950.990000

Show me a histogram of numeric columns

[19]:
df[(df['Quantity']>-50) &
   (df['Quantity']<50) &
   (df['UnitPrice']>0) &
   (df['UnitPrice']<100)][['Quantity', 'UnitPrice']].hist(figsize=[12,4], bins=30)
plt.show()
_images/examples_online_retail_analysis_37_0.png
[20]:
df[(df['Quantity']>-50) &
   (df['Quantity']<50) &
   (df['UnitPrice']>0) &
   (df['UnitPrice']<100)][['Quantity', 'UnitPrice']].hist(figsize=[12,4], bins=30, log=True)
plt.show()
_images/examples_online_retail_analysis_38_0.png
[21]:
df.query('Quantity>50 & UnitPrice<100')
[21]:
Country CustomerID ... StockCode UnitPrice
1228 United Kingdom 15485.0 ... 22086 2.55
1237 Norway 12433.0 ... 22444 1.06
1286 Norway 12433.0 ... 84050 1.25
1293 Norway 12433.0 ... 22197 0.85
1333 United Kingdom 18144.0 ... 84879 1.69
... ... ... ... ... ...
14784 United Kingdom 15061.0 ... 22423 10.95
14785 United Kingdom 15061.0 ... 22075 1.45
14788 United Kingdom 15061.0 ... 17038 0.07
14974 United Kingdom 14739.0 ... 21704 0.72
14980 United Kingdom 14739.0 ... 22178 1.06

258 rows × 8 columns

Arithmetic Operations

Numeric values

[22]:
df['Quantity'].head()
[22]:
1000     1
1001     1
1002     1
1003     1
1004    12
Name: Quantity, dtype: int64
[23]:
df['UnitPrice'].head()
[23]:
1000    1.25
1001    1.25
1002    1.25
1003    1.25
1004    0.29
Name: UnitPrice, dtype: float64
[24]:
product = df['Quantity'] * df['UnitPrice']
[25]:
product.head()
[25]:
1000    1.25
1001    1.25
1002    1.25
1003    1.25
1004    3.48
dtype: float64

String concatenation

[26]:
df['Country'] + df['StockCode']
[26]:
1000      United Kingdom21123
1001      United Kingdom21124
1002      United Kingdom21122
1003      United Kingdom84378
1004      United Kingdom21985
                 ...
14995    United Kingdom72349B
14996     United Kingdom72741
14997     United Kingdom22762
14998     United Kingdom21773
14999     United Kingdom22149
Length: 15000, dtype: object