Date: Feb 15, 2020 Version: 7.6.0a3
Useful links: Source Repository | Issues & Ideas | Q&A Support
eland is an open source, Apache2-licensed Elasticsearch Python client to analyse, explore and manipulate data that resides in Elasticsearch. Where possible, the package uses existing Python APIs and data structures to make it easy to switch between NumPy, pandas or scikit-learn and their Elasticsearch-powered equivalents. In general, the data resides in Elasticsearch and not in memory, which allows eland to access large datasets stored in Elasticsearch.
eland
This page gives an overview of all public eland objects, functions and methods. All classes and functions exposed in eland.* namespace are public.
eland.*
read_csv
read_es
pandas_to_eland
eland_to_pandas
DataFrame
Axes
DataFrame.index
DataFrame.columns
DataFrame.dtypes
DataFrame.select_dtypes
DataFrame.values
DataFrame.empty
DataFrame.shape
DataFrame.head
DataFrame.keys
DataFrame.tail
DataFrame.get
DataFrame.query
DataFrame.agg
DataFrame.aggregate
DataFrame.count
DataFrame.describe
DataFrame.info
DataFrame.max
DataFrame.mean
DataFrame.min
DataFrame.sum
DataFrame.nunique
DataFrame.drop
DataFrame.hist
DataFrame.to_numpy
DataFrame.to_csv
DataFrame.to_html
DataFrame.to_string
DataFrame.info_es
Series
Series.index
Series.shape
Series.name
Series.empty
Series.head
Series.tail
Series.add
Series.sub
Series.mul
Series.div
Series.truediv
Series.floordiv
Series.mod
Series.pow
Series.radd
Series.rsub
Series.rmul
Series.rdiv
Series.rtruediv
Series.rfloordiv
Series.rmod
Series.rpow
Series.describe
Series.max
Series.mean
Series.min
Series.sum
Series.nunique
Series.value_counts
Series.rename
Series.hist
Series.to_string
Series.to_numpy
Series.info_es
Many of these methods or variants thereof are available on the objects that contain an index (Series/DataFrame) and those should most likely be used before calling these methods directly.
Index
Machine learning is built into the Elastic Stack and enables users to gain insights into their Elasticsearch data. There is a wide range of capabilities, from identifying anomalies in your data to training and deploying regression or classification models based on Elasticsearch data.
To use the Elastic Stack machine learning features, you must have the appropriate license and at least one machine learning node in your Elasticsearch cluster. If Elastic Stack security features are enabled, you must also ensure your users have the necessary privileges.
The fastest way to get started with machine learning features is to start a free 14-day trial of Elasticsearch Service in the cloud.
See https://www.elastic.co/guide/en/machine-learning/current/setup.html and other documentation for more detail.
ImportedMLModel
ImportedMLModel.predict
The goal of an eland.DataFrame is to enable users who are familiar with pandas.DataFrame to access, explore and manipulate data that resides in Elasticsearch.
eland.DataFrame
pandas.DataFrame
Ideally, all data should reside in Elasticsearch and not in memory. This restricts the API, but allows access to huge data sets that do not fit into memory, and allows use of powerful Elasticsearch features such as aggregations.
Generally, integrations with [3rd party storage systems](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) (SQL, Google Big Query etc.) involve accessing these systems and reading all external data into an in-core pandas data structure. This also applies to [Apache Arrow](https://arrow.apache.org/docs/python/pandas.html) structures.
Whilst this provides access to data in these systems, for large datasets this can require significant in-core memory, and for systems such as Elasticsearch, bulk export of data can be an inefficient way of exploring the data.
An alternative option is to create an API that proxies pandas.DataFrame-like calls to Elasticsearch queries and operations. This could allow the Elasticsearch cluster to perform operations such as aggregations rather than exporting all the data and performing this operation in-core.
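The proxying idea can be illustrated with a minimal sketch. Note this is not eland's internal implementation; the helper names below are invented for illustration, but the query body they produce matches the `bool`/`must` structure shown in the `info_es()` outputs later on this page.

```python
# Hypothetical helpers that translate DataFrame-style filters into an
# Elasticsearch query body, rather than fetching all data in-core.

def term(field, value):
    # Exact-match filter, e.g. df['Carrier'] == 'Kibana Airlines'
    return {"term": {field: value}}

def gt(field, value):
    # Range filter, e.g. df['AvgTicketPrice'] > 900.0
    return {"range": {field: {"gt": value}}}

def bool_must(*clauses):
    # AND-combine clauses into a bool query, as '&' would
    return {"bool": {"must": list(clauses)}}

query = bool_must(term("Carrier", "Kibana Airlines"), gt("AvgTicketPrice", 900.0))
print(query)
```

The resulting dictionary could be sent as the `query` section of a search request, letting the cluster do the filtering.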
An option would be to replace the pandas.DataFrame backend in-core memory structures with Elasticsearch accessors. This would allow full access to the pandas.DataFrame APIs. However, this has issues:
df_a = df['a']
df
df_a
df.to_dict()
Another option is to create an eland.DataFrame API that mimics appropriate aspects of the pandas.DataFrame API. This resolves some of the issues above as:
eland.DataFrame._to_pandas()
pandas
The following table lists both implemented and not implemented methods. If you need an operation that is listed as not implemented, feel free to open an issue at https://github.com/elastic/eland, or give a thumbs up to an already created issue. Contributions are also welcome!
The following table is structured as follows: the first column contains the method name; the second column is a flag for whether or not there is an implementation in eland for the method in the left column. Y stands for yes, N stands for no.
Y
N
https://github.com/adgirish/kaggleScape/blob/master/results/annotResults.csv represents a prioritised list.
T
abs
add
add_prefix
add_suffix
agg
aggregate
align
all
any
append
apply
applymap
as_blocks
as_matrix
asfreq
asof
assign
astype
at
at_time
axes
between_time
bfill
blocks
bool
boxplot
clip
clip_lower
clip_upper
combine
combine_first
compound
consolidate
convert_objects
copy
corr
corrwith
count
cov
cummax
cummin
cumprod
cumsum
describe
diff
div
divide
dot
drop
drop_duplicates
dropna
dtypes
duplicated
empty
eq
equals
eval
ewm
expanding
ffill
fillna
filter
first
first_valid_index
floordiv
from_csv
from_dict
from_items
from_records
ftypes
ge
get
get_dtype_counts
get_ftype_counts
get_value
get_values
groupby
gt
head
hist
iat
idxmax
idxmin
iloc
infer_objects
info
insert
interpolate
is_copy
isin
isna
isnull
items
iteritems
iterrows
itertuples
ix
join
keys
kurt
kurtosis
last
last_valid_index
le
loc
lookup
lt
mad
mask
max
mean
median
melt
memory_usage
merge
min
mod
mode
mul
multiply
ndim
ne
nlargest
notna
notnull
nsmallest
nunique
pct_change
pipe
pivot
pivot_table
plot
pop
pow
prod
product
quantile
query
radd
rank
rdiv
reindex
reindex_axis
reindex_like
rename
rename_axis
reorder_levels
replace
resample
reset_index
rfloordiv
rmod
rmul
rolling
round
rpow
rsub
rtruediv
sample
select
select_dtypes
sem
set_axis
set_index
set_value
shape
shift
size
skew
slice_shift
sort_index
sort_values
sortlevel
squeeze
stack
std
style
sub
subtract
sum
swapaxes
swaplevel
tail
take
to_clipboard
to_csv
to_dense
to_dict
to_excel
to_feather
to_gbq
to_hdf
to_html
to_json
to_latex
to_msgpack
to_panel
to_parquet
to_period
to_pickle
to_records
to_sparse
to_sql
to_stata
to_string
to_timestamp
to_xarray
transform
transpose
truediv
truncate
tshift
tz_convert
tz_localize
unstack
update
values
var
where
xs
Eland is an open source project and we love to receive contributions from our community — you! There are many ways to contribute, from writing tutorials or blog posts, improving the documentation, submitting bug reports and feature requests or writing code which can be incorporated into eland itself.
If you think you have found a bug in eland, first make sure that you are testing against the latest version of eland - your issue may already have been fixed. If not, search our issues list on GitHub in case a similar issue has already been opened.
It is very helpful if you can prepare a reproduction of the bug. In other words, provide a small test case which we can run to confirm your bug. It makes it easier to find the problem and to fix it. Test cases should be provided as Python scripts, ideally with some details of your Elasticsearch environment and index mappings, and (where appropriate) a pandas example.
Provide as much information as you can. You may think that the problem lies with your query, when actually it depends on how your data is indexed. The easier it is for us to recreate your problem, the faster it is likely to be fixed.
If you find yourself wishing for a feature that doesn’t exist in eland, you are probably not alone. There are bound to be others out there with similar needs. Many of the features that eland has today have been added because our users saw the need. Open an issue on our issues list on GitHub which describes the feature you would like to see, why you need it, and how it should work.
If you have a bugfix or new feature that you would like to contribute to eland, please find or open an issue about it first. Talk about what you would like to do. It may be that somebody is already working on it, or that there are particular issues that you should know about before implementing the change.
We enjoy working with contributors to get their code accepted. There are many approaches to fixing a problem and it is important to find the best approach before writing too much code.
Note that it is unlikely the project will merge refactors for the sake of refactoring. These types of pull requests have a high cost to maintainers in reviewing and testing with little to no tangible benefit. This especially includes changes generated by tools.
The process for contributing to any of the Elastic repositories is similar. Details for individual projects can be found below.
You will need to fork the main eland code or documentation repository and clone it to your local machine. See the GitHub help page for assistance.
Further instructions for specific projects are given below.
Once your changes and tests are ready to submit for review:
Test your changes
Run the test suite to make sure that nothing is broken (TODO add link to testing doc).
Sign the Contributor License Agreement
Please make sure you have signed our Contributor License Agreement. We are not asking you to assign copyright to us, but to give us the right to distribute your code without restriction. We ask this of all contributors in order to assure our users of the origin and continuing existence of the code. You only need to sign the CLA once.
Rebase your changes
Update your local repository with the most recent code from the main eland repository, and rebase your branch on top of the latest master branch. We prefer your initial changes to be squashed into a single commit. Later, if we ask you to make changes, add them as separate commits. This makes them easier to review. As a final step before merging we will either ask you to squash all commits yourself or we’ll do it for you.
Submit a pull request
Push your local changes to your forked copy of the repository and submit a pull request. In the pull request, choose a title which sums up the changes that you have made, and in the body provide more details about what your changes do. Also mention the number of the issue where discussion has taken place, eg “Closes #123”.
Then sit back and wait. There will probably be discussion about the pull request and, if any changes are needed, we would love to work with you to get your pull request merged into eland.
Please adhere to the general guideline that you should never force push to a publicly shared branch. Once you have opened your pull request, you should consider your branch publicly shared. Instead of force pushing you can just add incremental commits; this is generally easier on your reviewers. If you need to pick up changes from master, you can merge master into your branch. A reviewer might ask you to rebase a long-running pull request in which case force pushing is okay for that request. Note that squashing at the end of the review process should also not be done, that can be done when the pull request is integrated via GitHub.
Repository: https://github.com/elastic/eland
We internally develop using the PyCharm IDE. For PyCharm, we are currently using a minimum version of PyCharm 2019.2.4.
(All commands should be run from module root)
git@github.com:stevedodson/eland.git
pytest
numpy
pip install -r requirements-dev.txt
localhost:9200
python -m eland.tests.setup_tests
pytest --doctest-modules
[1]:
import eland as ed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from elasticsearch import Elasticsearch

# Import standard test settings for consistent results
from eland.conftest import *
Create an eland.DataFrame from a flights index
flights
[2]:
ed_flights = ed.read_es('localhost', 'flights')
[3]:
type(ed_flights)
eland.dataframe.DataFrame
Compare to pandas DataFrame (created from the same data)
[4]:
pd_flights = ed.eland_to_pandas(ed_flights)
[5]:
type(pd_flights)
pandas.core.frame.DataFrame
[6]:
pd_flights.columns
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName', 'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers', 'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum', 'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName', 'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek', 'timestamp'], dtype='object')
[7]:
ed_flights.columns
[8]:
pd_flights.dtypes
AvgTicketPrice float64 Cancelled bool Carrier object Dest object DestAirportID object ... OriginLocation object OriginRegion object OriginWeather object dayOfWeek int64 timestamp datetime64[ns] Length: 27, dtype: object
[9]:
ed_flights.dtypes
[10]:
pd_flights.select_dtypes(include=np.number)
13059 rows × 7 columns
[11]:
ed_flights.select_dtypes(include=np.number)
[12]:
pd_flights.empty
False
[13]:
ed_flights.empty
[14]:
pd_flights.shape
(13059, 27)
[15]:
ed_flights.shape
Note, eland.DataFrame.index does not mirror pandas.DataFrame.index.
eland.DataFrame.index
pandas.DataFrame.index
[16]:
pd_flights.index
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ... '13049', '13050', '13051', '13052', '13053', '13054', '13055', '13056', '13057', '13058'], dtype='object', length=13059)
[17]:
# NBVAL_IGNORE_OUTPUT
ed_flights.index
<eland.index.Index at 0x116b3efd0>
[18]:
ed_flights.index.index_field
'_id'
Note, eland.DataFrame.values is not supported.
eland.DataFrame.values
[19]:
pd_flights.values
array([[841.2656419677076, False, 'Kibana Airlines', ..., 'Sunny', 0, Timestamp('2018-01-01 00:00:00')], [882.9826615595518, False, 'Logstash Airways', ..., 'Clear', 0, Timestamp('2018-01-01 18:27:00')], [190.6369038508356, False, 'Logstash Airways', ..., 'Rain', 0, Timestamp('2018-01-01 17:11:14')], ..., [997.7518761454494, False, 'Logstash Airways', ..., 'Sunny', 6, Timestamp('2018-02-11 04:09:27')], [1102.8144645388556, False, 'JetBeats', ..., 'Hail', 6, Timestamp('2018-02-11 08:28:21')], [858.1443369038839, False, 'JetBeats', ..., 'Rain', 6, Timestamp('2018-02-11 14:54:34')]], dtype=object)
[20]:
try:
    ed_flights.values
except AttributeError as e:
    print(e)
This method would scan/scroll the entire Elasticsearch index(s) into memory. If this is explicitly required, and there is sufficient memory, call `ed.eland_to_pandas(ed_df).values`
[21]:
pd_flights.head()
5 rows × 27 columns
[22]:
ed_flights.head()
[23]:
pd_flights.tail()
[24]:
ed_flights.tail()
[25]:
pd_flights.keys()
[26]:
ed_flights.keys()
[27]:
pd_flights.get('Carrier')
0 Kibana Airlines 1 Logstash Airways 2 Logstash Airways 3 Kibana Airlines 4 Kibana Airlines ... 13054 Logstash Airways 13055 Logstash Airways 13056 Logstash Airways 13057 JetBeats 13058 JetBeats Name: Carrier, Length: 13059, dtype: object
[28]:
ed_flights.get('Carrier')
[29]:
pd_flights.get(['Carrier', 'Origin'])
13059 rows × 2 columns
List input not currently supported by eland.DataFrame.get
eland.DataFrame.get
[30]:
try:
    ed_flights.get(['Carrier', 'Origin'])
except TypeError as e:
    print(e)
unhashable type: 'list'
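This is the same TypeError a plain Python dict raises, since a list is not hashable and therefore cannot be used as a single lookup key. A quick local illustration:

```python
# A list cannot be used as a single key in dict-style .get() lookups,
# which is the behaviour eland.DataFrame.get currently mirrors.
d = {'Carrier': 1, 'Origin': 2}
try:
    d.get(['Carrier', 'Origin'])
except TypeError as e:
    msg = str(e)
print(msg)  # unhashable type: 'list'
```

To select multiple columns, the indexing form `df[['Carrier', 'Origin']]` (shown elsewhere on this page) works instead.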
[31]:
pd_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled == True')
68 rows × 27 columns
eland.DataFrame.query requires an explicit qualifier on boolean columns, i.e.
eland.DataFrame.query
ed_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled') fails
ed_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled')
[32]:
ed_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled == True')
[33]:
pd_flights[(pd_flights.Carrier=="Kibana Airlines") & (pd_flights.AvgTicketPrice > 900.0) & (pd_flights.Cancelled == True)]
[34]:
ed_flights[(ed_flights.Carrier=="Kibana Airlines") & (ed_flights.AvgTicketPrice > 900.0) & (ed_flights.Cancelled == True)]
[35]:
pd_flights[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
eland.DataFrame.aggregate currently only supports numeric columns
eland.DataFrame.aggregate
[36]:
ed_flights[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
[37]:
pd_flights.count()
AvgTicketPrice 13059 Cancelled 13059 Carrier 13059 Dest 13059 DestAirportID 13059 ... OriginLocation 13059 OriginRegion 13059 OriginWeather 13059 dayOfWeek 13059 timestamp 13059 Length: 27, dtype: int64
[38]:
ed_flights.count()
[39]:
pd_flights.describe()
8 rows × 7 columns
Values returned from eland.DataFrame.describe may vary due to results of Elasticsearch aggregations.
eland.DataFrame.describe
[40]:
# NBVAL_IGNORE_OUTPUT
ed_flights.describe()
[41]:
pd_flights.info()
<class 'pandas.core.frame.DataFrame'> Index: 13059 entries, 0 to 13058 Data columns (total 27 columns): AvgTicketPrice 13059 non-null float64 Cancelled 13059 non-null bool Carrier 13059 non-null object Dest 13059 non-null object DestAirportID 13059 non-null object DestCityName 13059 non-null object DestCountry 13059 non-null object DestLocation 13059 non-null object DestRegion 13059 non-null object DestWeather 13059 non-null object DistanceKilometers 13059 non-null float64 DistanceMiles 13059 non-null float64 FlightDelay 13059 non-null bool FlightDelayMin 13059 non-null int64 FlightDelayType 13059 non-null object FlightNum 13059 non-null object FlightTimeHour 13059 non-null float64 FlightTimeMin 13059 non-null float64 Origin 13059 non-null object OriginAirportID 13059 non-null object OriginCityName 13059 non-null object OriginCountry 13059 non-null object OriginLocation 13059 non-null object OriginRegion 13059 non-null object OriginWeather 13059 non-null object dayOfWeek 13059 non-null int64 timestamp 13059 non-null datetime64[ns] dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17) memory usage: 3.2+ MB
[42]:
ed_flights.info()
<class 'eland.dataframe.DataFrame'> Index: 13059 entries, 0 to 13058 Data columns (total 27 columns): AvgTicketPrice 13059 non-null float64 Cancelled 13059 non-null bool Carrier 13059 non-null object Dest 13059 non-null object DestAirportID 13059 non-null object DestCityName 13059 non-null object DestCountry 13059 non-null object DestLocation 13059 non-null object DestRegion 13059 non-null object DestWeather 13059 non-null object DistanceKilometers 13059 non-null float64 DistanceMiles 13059 non-null float64 FlightDelay 13059 non-null bool FlightDelayMin 13059 non-null int64 FlightDelayType 13059 non-null object FlightNum 13059 non-null object FlightTimeHour 13059 non-null float64 FlightTimeMin 13059 non-null float64 Origin 13059 non-null object OriginAirportID 13059 non-null object OriginCityName 13059 non-null object OriginCountry 13059 non-null object OriginLocation 13059 non-null object OriginRegion 13059 non-null object OriginWeather 13059 non-null object dayOfWeek 13059 non-null int64 timestamp 13059 non-null datetime64[ns] dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17) memory usage: 96.0 bytes
[43]:
pd_flights.max(numeric_only=True)
AvgTicketPrice 1199.729053 Cancelled 1.000000 DistanceKilometers 19881.482315 DistanceMiles 12353.780369 FlightDelay 1.000000 FlightDelayMin 360.000000 FlightTimeHour 31.715034 FlightTimeMin 1902.902032 dayOfWeek 6.000000 dtype: float64
eland.DataFrame.max, min, mean and sum only aggregate numeric columns
eland.DataFrame.max,min,mean,sum
[44]:
ed_flights.max(numeric_only=True)
AvgTicketPrice 1199.729004 Cancelled 1.000000 DistanceKilometers 19881.482422 DistanceMiles 12353.780273 FlightDelay 1.000000 FlightDelayMin 360.000000 FlightTimeHour 31.715034 FlightTimeMin 1902.901978 dayOfWeek 6.000000 dtype: float64
[45]:
pd_flights.min(numeric_only=True)
AvgTicketPrice 100.020528 Cancelled 0.000000 DistanceKilometers 0.000000 DistanceMiles 0.000000 FlightDelay 0.000000 FlightDelayMin 0.000000 FlightTimeHour 0.000000 FlightTimeMin 0.000000 dayOfWeek 0.000000 dtype: float64
[46]:
ed_flights.min(numeric_only=True)
AvgTicketPrice 100.020531 Cancelled 0.000000 DistanceKilometers 0.000000 DistanceMiles 0.000000 FlightDelay 0.000000 FlightDelayMin 0.000000 FlightTimeHour 0.000000 FlightTimeMin 0.000000 dayOfWeek 0.000000 dtype: float64
[47]:
pd_flights.mean(numeric_only=True)
AvgTicketPrice 628.253689 Cancelled 0.128494 DistanceKilometers 7092.142455 DistanceMiles 4406.853013 FlightDelay 0.251168 FlightDelayMin 47.335171 FlightTimeHour 8.518797 FlightTimeMin 511.127842 dayOfWeek 2.835975 dtype: float64
[48]:
ed_flights.mean(numeric_only=True)
AvgTicketPrice 628.253689 Cancelled 0.128494 DistanceKilometers 7092.142457 DistanceMiles 4406.853010 FlightDelay 0.251168 FlightDelayMin 47.335171 FlightTimeHour 8.518797 FlightTimeMin 511.127842 dayOfWeek 2.835975 dtype: float64
[49]:
pd_flights.sum(numeric_only=True)
AvgTicketPrice 8.204365e+06 Cancelled 1.678000e+03 DistanceKilometers 9.261629e+07 DistanceMiles 5.754909e+07 FlightDelay 3.280000e+03 FlightDelayMin 6.181500e+05 FlightTimeHour 1.112470e+05 FlightTimeMin 6.674818e+06 dayOfWeek 3.703500e+04 dtype: float64
[50]:
ed_flights.sum(numeric_only=True)
[51]:
pd_flights[['Carrier', 'Origin', 'Dest']].nunique()
Carrier 4 Origin 156 Dest 156 dtype: int64
[52]:
ed_flights[['Carrier', 'Origin', 'Dest']].nunique()
[53]:
pd_flights.drop(columns=['AvgTicketPrice', 'Cancelled', 'DestLocation', 'Dest', 'DestAirportID', 'DestCityName', 'DestCountry'])
13059 rows × 20 columns
[54]:
ed_flights.drop(columns=['AvgTicketPrice', 'Cancelled', 'DestLocation', 'Dest', 'DestAirportID', 'DestCityName', 'DestCountry'])
[55]:
pd_flights.select_dtypes(include=np.number).hist(figsize=[10,10])
plt.show()
[56]:
ed_flights.select_dtypes(include=np.number).hist(figsize=[10,10])
plt.show()
[57]:
ed_flights2 = ed_flights[(ed_flights.OriginAirportID == 'AMS') & (ed_flights.FlightDelayMin > 60)]
ed_flights2 = ed_flights2[['timestamp', 'OriginAirportID', 'DestAirportID', 'FlightDelayMin']]
ed_flights2 = ed_flights2.tail()
[58]:
print(ed_flights2.info_es())
index_pattern: flights Index: index_field: _id is_source_field: False Mappings: capabilities: es_field_name is_source es_dtype es_date_format pd_dtype is_searchable is_aggregatable is_scripted aggregatable_es_field_name timestamp timestamp True date strict_date_hour_minute_second datetime64[ns] True True False timestamp OriginAirportID OriginAirportID True keyword None object True True False OriginAirportID DestAirportID DestAirportID True keyword None object True True False DestAirportID FlightDelayMin FlightDelayMin True integer None int64 True True False FlightDelayMin Operations: tasks: [('boolean_filter': ('boolean_filter': {'bool': {'must': [{'term': {'OriginAirportID': 'AMS'}}, {'range': {'FlightDelayMin': {'gt': 60}}}]}})), ('tail': ('sort_field': '_doc', 'count': 5))] size: 5 sort_params: _doc:desc _source: ['timestamp', 'OriginAirportID', 'DestAirportID', 'FlightDelayMin'] body: {'query': {'bool': {'must': [{'term': {'OriginAirportID': 'AMS'}}, {'range': {'FlightDelayMin': {'gt': 60}}}]}}} post_processing: [('sort_index')]
import eland as ed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Fix console size for consistent test results
from eland.conftest import *
To get started, let’s create an eland.DataFrame by reading a CSV file. This creates and populates the online-retail index in the local Elasticsearch cluster.
online-retail
df = ed.read_csv("data/online-retail.csv.gz",
                 es_client='localhost',
                 es_dest_index='online-retail',
                 es_if_exists='replace',
                 es_dropna=True,
                 es_refresh=True,
                 compression='gzip',
                 index_col=0)
Here we see that the "_id" field was used to index our data frame.
"_id"
df.index.index_field
Next, we can check which fields from Elasticsearch are available to our eland data frame. columns is available as a parameter when instantiating the data frame, which allows one to choose only a subset of fields from your index to be included in the data frame. Since we didn’t set this parameter, we have access to all fields.
columns
df.columns
Index(['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode', 'UnitPrice'], dtype='object')
Now, let’s see the data types of our fields. Running df.dtypes, we can see that Elasticsearch field types are mapped to pandas field types.
df.dtypes
Country object CustomerID float64 Description object InvoiceDate object InvoiceNo object Quantity int64 StockCode object UnitPrice float64 dtype: object
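The mappings visible in the output above (and in the `info_es()` output below) can be summarised as a small lookup table. This is only the subset of type pairs that appears on this page, written out for illustration:

```python
# Elasticsearch field type -> pandas dtype, as seen in the dtypes and
# info_es() outputs on this page (illustrative subset, not exhaustive).
ES_TO_PANDAS = {
    "keyword": "object",          # e.g. Country, Description
    "double": "float64",          # e.g. CustomerID, UnitPrice
    "long": "int64",              # e.g. Quantity
    "integer": "int64",           # e.g. FlightDelayMin
    "date": "datetime64[ns]",     # e.g. timestamp
}

print(ES_TO_PANDAS["double"])
```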
We also offer a .info_es() data frame method that shows all the information about the underlying index. It also contains information about the operations being passed from data frame methods to Elasticsearch. More on this later.
.info_es()
print(df.info_es())
index_pattern: online-retail Index: index_field: _id is_source_field: False Mappings: capabilities: es_field_name is_source es_dtype es_date_format pd_dtype is_searchable is_aggregatable is_scripted aggregatable_es_field_name Country Country True keyword None object True True False Country CustomerID CustomerID True double None float64 True True False CustomerID Description Description True keyword None object True True False Description InvoiceDate InvoiceDate True keyword None object True True False InvoiceDate InvoiceNo InvoiceNo True keyword None object True True False InvoiceNo Quantity Quantity True long None int64 True True False Quantity StockCode StockCode True keyword None object True True False StockCode UnitPrice UnitPrice True double None float64 True True False UnitPrice Operations: tasks: [] size: None sort_params: None _source: ['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode', 'UnitPrice'] body: {} post_processing: []
Now that we understand how to create a data frame and get access to its underlying attributes, let’s see how we can select subsets of our data.
Much like pandas, eland data frames offer .head(n) and .tail(n) methods that return the first and last n rows, respectively.
.head(n)
.tail(n)
df.head(2)
2 rows × 8 columns
print(df.tail(2).head(2).tail(2).info_es())
index_pattern: online-retail Index: index_field: _id is_source_field: False Mappings: capabilities: es_field_name is_source es_dtype es_date_format pd_dtype is_searchable is_aggregatable is_scripted aggregatable_es_field_name Country Country True keyword None object True True False Country CustomerID CustomerID True double None float64 True True False CustomerID Description Description True keyword None object True True False Description InvoiceDate InvoiceDate True keyword None object True True False InvoiceDate InvoiceNo InvoiceNo True keyword None object True True False InvoiceNo Quantity Quantity True long None int64 True True False Quantity StockCode StockCode True keyword None object True True False StockCode UnitPrice UnitPrice True double None float64 True True False UnitPrice Operations: tasks: [('tail': ('sort_field': '_doc', 'count': 2)), ('head': ('sort_field': '_doc', 'count': 2)), ('tail': ('sort_field': '_doc', 'count': 2))] size: 2 sort_params: _doc:desc _source: ['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode', 'UnitPrice'] body: {} post_processing: [('sort_index'), ('head': ('count': 2)), ('tail': ('count': 2))]
df.tail(2)
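The chained tail(2).head(2).tail(2) call above can be simulated on a plain Python list to see which rows survive. This mirrors the pandas semantics that eland reproduces, not eland's actual query planning (which, as the info_es() output shows, pushes some steps to Elasticsearch and some to post-processing):

```python
rows = list(range(10))  # stand-in for 10 indexed documents

step1 = rows[-2:]   # tail(2): last two rows
step2 = step1[:2]   # head(2): first two of those
step3 = step2[-2:]  # tail(2): last two of those
print(step3)
```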
You can also pass a list of columns to select columns from the data frame in a specified order.
df[['Country', 'InvoiceDate']].head(5)
5 rows × 2 columns
We also allow you to filter the data frame using boolean indexing. Under the hood, a boolean index maps to a terms query that is then passed to Elasticsearch to filter the index.
terms
# the construction of a boolean vector maps directly to an Elasticsearch query
print(df['Country']=='Germany')
df[(df['Country']=='Germany')].head(5)
{'term': {'Country': 'Germany'}}
5 rows × 8 columns
We can also filter the data frame using a list of values.
print(df['Country'].isin(['Germany', 'United States']))
df[df['Country'].isin(['Germany', 'United Kingdom'])].head(5)
{'terms': {'Country': ['Germany', 'United States']}}
We can also combine boolean vectors to further filter the data frame.
df[(df['Country']=='Germany') & (df['Quantity']>90)]
0 rows × 8 columns
Using this example, let’s see how eland translates this boolean filter into an Elasticsearch bool query.
print(df[(df['Country']=='Germany') & (df['Quantity']>90)].info_es())
index_pattern: online-retail Index: index_field: _id is_source_field: False Mappings: capabilities: es_field_name is_source es_dtype es_date_format pd_dtype is_searchable is_aggregatable is_scripted aggregatable_es_field_name Country Country True keyword None object True True False Country CustomerID CustomerID True double None float64 True True False CustomerID Description Description True keyword None object True True False Description InvoiceDate InvoiceDate True keyword None object True True False InvoiceDate InvoiceNo InvoiceNo True keyword None object True True False InvoiceNo Quantity Quantity True long None int64 True True False Quantity StockCode StockCode True keyword None object True True False StockCode UnitPrice UnitPrice True double None float64 True True False UnitPrice Operations: tasks: [('boolean_filter': ('boolean_filter': {'bool': {'must': [{'term': {'Country': 'Germany'}}, {'range': {'Quantity': {'gt': 90}}}]}}))] size: None sort_params: None _source: ['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode', 'UnitPrice'] body: {'query': {'bool': {'must': [{'term': {'Country': 'Germany'}}, {'range': {'Quantity': {'gt': 90}}}]}}} post_processing: []
Let’s begin to ask some questions of our data and use eland to get the answers.
How many different countries are there?
df['Country'].nunique()
16
What is the total sum of products ordered?
df['Quantity'].sum()
111960.0
Show me the sum, mean, min, and max of the Quantity and UnitPrice fields
df[['Quantity','UnitPrice']].agg(['sum', 'mean', 'max', 'min'])
Give me descriptive statistics for the entire data frame
# NBVAL_IGNORE_OUTPUT
df.describe()
Show me a histogram of numeric columns
df[(df['Quantity']>-50) & (df['Quantity']<50) & (df['UnitPrice']>0) & (df['UnitPrice']<100)][['Quantity', 'UnitPrice']].hist(figsize=[12,4], bins=30)
plt.show()
df[(df['Quantity']>-50) & (df['Quantity']<50) & (df['UnitPrice']>0) & (df['UnitPrice']<100)][['Quantity', 'UnitPrice']].hist(figsize=[12,4], bins=30, log=True)
plt.show()
df.query('Quantity>50 & UnitPrice<100')
258 rows × 8 columns
Numeric values
df['Quantity'].head()
1000 1 1001 1 1002 1 1003 1 1004 12 Name: Quantity, dtype: int64
df['UnitPrice'].head()
1000 1.25 1001 1.25 1002 1.25 1003 1.25 1004 0.29 Name: UnitPrice, dtype: float64
product = df['Quantity'] * df['UnitPrice']
product.head()
1000 1.25 1001 1.25 1002 1.25 1003 1.25 1004 3.48 dtype: float64
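The elementwise product above can be checked by hand: each result is Quantity multiplied by UnitPrice for the corresponding row (values copied from the two head() outputs shown earlier).

```python
# Recomputing the first five rows of df['Quantity'] * df['UnitPrice']
# locally, using the values printed above.
quantity = [1, 1, 1, 1, 12]
unit_price = [1.25, 1.25, 1.25, 1.25, 0.29]

product = [round(q * p, 2) for q, p in zip(quantity, unit_price)]
print(product)  # [1.25, 1.25, 1.25, 1.25, 3.48]
```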
String concatenation
df['Country'] + df['StockCode']
1000 United Kingdom21123 1001 United Kingdom21124 1002 United Kingdom21122 1003 United Kingdom84378 1004 United Kingdom21985 ... 14995 United Kingdom72349B 14996 United Kingdom72741 14997 United Kingdom22762 14998 United Kingdom21773 14999 United Kingdom22149 Length: 15000, dtype: object