[1]:
import eland as ed import pandas as pd import numpy as np import matplotlib.pyplot as plt # Fix console size for consistent test results from eland.conftest import *
To get started, let’s create an eland.DataFrame by reading a csv file. This creates and populates the online-retail index in the local Elasticsearch cluster.
eland.DataFrame
online-retail
[2]:
df = ed.read_csv("data/online-retail.csv.gz", es_client='localhost', es_dest_index='online-retail', es_if_exists='replace', es_dropna=True, es_refresh=True, compression='gzip', index_col=0)
Here we see that the "_id" field was used to index our data frame.
"_id"
[3]:
df.index.index_field
'_id'
Next, we can check which field from elasticsearch are available to our eland data frame. columns is available as a parameter when instantiating the data frame which allows one to choose only a subset of fields from your index to be included in the data frame. Since we didn’t set this parameter, we have access to all fields.
columns
[4]:
df.columns
Index(['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode', 'UnitPrice'], dtype='object')
Now, let’s see the data types of our fields. Running df.dtypes, we can see that elasticsearch field types are mapped to pandas field types.
df.dtypes
[5]:
Country object CustomerID float64 Description object InvoiceDate object InvoiceNo object Quantity int64 StockCode object UnitPrice float64 dtype: object
We also offer a .info_es() data frame method that shows all info about the underlying index. It also contains information about operations being passed from data frame methods to elasticsearch. More on this later.
.info_es()
[6]:
print(df.info_es())
index_pattern: online-retail Index: index_field: _id is_source_field: False Mappings: capabilities: es_field_name is_source es_dtype es_date_format pd_dtype is_searchable is_aggregatable is_scripted aggregatable_es_field_name Country Country True keyword None object True True False Country CustomerID CustomerID True double None float64 True True False CustomerID Description Description True keyword None object True True False Description InvoiceDate InvoiceDate True keyword None object True True False InvoiceDate InvoiceNo InvoiceNo True keyword None object True True False InvoiceNo Quantity Quantity True long None int64 True True False Quantity StockCode StockCode True keyword None object True True False StockCode UnitPrice UnitPrice True double None float64 True True False UnitPrice Operations: tasks: [] size: None sort_params: None _source: ['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode', 'UnitPrice'] body: {} post_processing: []
Now that we understand how to create a data frame and get access to it’s underlying attributes, let’s see how we can select subsets of our data.
much like pandas, eland data frames offer .head(n) and .tail(n) methods that return the first and last n rows, respectively.
.head(n)
.tail(n)
[7]:
df.head(2)
2 rows × 8 columns
[8]:
print(df.tail(2).head(2).tail(2).info_es())
index_pattern: online-retail Index: index_field: _id is_source_field: False Mappings: capabilities: es_field_name is_source es_dtype es_date_format pd_dtype is_searchable is_aggregatable is_scripted aggregatable_es_field_name Country Country True keyword None object True True False Country CustomerID CustomerID True double None float64 True True False CustomerID Description Description True keyword None object True True False Description InvoiceDate InvoiceDate True keyword None object True True False InvoiceDate InvoiceNo InvoiceNo True keyword None object True True False InvoiceNo Quantity Quantity True long None int64 True True False Quantity StockCode StockCode True keyword None object True True False StockCode UnitPrice UnitPrice True double None float64 True True False UnitPrice Operations: tasks: [('tail': ('sort_field': '_doc', 'count': 2)), ('head': ('sort_field': '_doc', 'count': 2)), ('tail': ('sort_field': '_doc', 'count': 2))] size: 2 sort_params: _doc:desc _source: ['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode', 'UnitPrice'] body: {} post_processing: [('sort_index'), ('head': ('count': 2)), ('tail': ('count': 2))]
[9]:
df.tail(2)
you can also pass a list of columns to select columns from the data frame in a specified order.
[10]:
df[['Country', 'InvoiceDate']].head(5)
5 rows × 2 columns
we also allow you to filter the data frame using boolean indexing. Under the hood, a boolean index maps to a terms query that is then passed to elasticsearch to filter the index.
terms
[11]:
# the construction of a boolean vector maps directly to an elasticsearch query print(df['Country']=='Germany') df[(df['Country']=='Germany')].head(5)
{'term': {'Country': 'Germany'}}
5 rows × 8 columns
we can also filter the data frame using a list of values.
[12]:
print(df['Country'].isin(['Germany', 'United States'])) df[df['Country'].isin(['Germany', 'United Kingdom'])].head(5)
{'terms': {'Country': ['Germany', 'United States']}}
We can also combine boolean vectors to further filter the data frame.
[13]:
df[(df['Country']=='Germany') & (df['Quantity']>90)]
0 rows × 8 columns
Using this example, let see how eland translates this boolean filter to an elasticsearch bool query.
bool
[14]:
print(df[(df['Country']=='Germany') & (df['Quantity']>90)].info_es())
index_pattern: online-retail Index: index_field: _id is_source_field: False Mappings: capabilities: es_field_name is_source es_dtype es_date_format pd_dtype is_searchable is_aggregatable is_scripted aggregatable_es_field_name Country Country True keyword None object True True False Country CustomerID CustomerID True double None float64 True True False CustomerID Description Description True keyword None object True True False Description InvoiceDate InvoiceDate True keyword None object True True False InvoiceDate InvoiceNo InvoiceNo True keyword None object True True False InvoiceNo Quantity Quantity True long None int64 True True False Quantity StockCode StockCode True keyword None object True True False StockCode UnitPrice UnitPrice True double None float64 True True False UnitPrice Operations: tasks: [('boolean_filter': ('boolean_filter': {'bool': {'must': [{'term': {'Country': 'Germany'}}, {'range': {'Quantity': {'gt': 90}}}]}}))] size: None sort_params: None _source: ['Country', 'CustomerID', 'Description', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'StockCode', 'UnitPrice'] body: {'query': {'bool': {'must': [{'term': {'Country': 'Germany'}}, {'range': {'Quantity': {'gt': 90}}}]}}} post_processing: []
Let’s begin to ask some questions of our data and use eland to get the answers.
How many different countries are there?
[15]:
df['Country'].nunique()
16
What is the total sum of products ordered?
[16]:
df['Quantity'].sum()
111960.0
Show me the sum, mean, min, and max of the qunatity and unit_price fields
[17]:
df[['Quantity','UnitPrice']].agg(['sum', 'mean', 'max', 'min'])
Give me descriptive statistics for the entire data frame
[18]:
# NBVAL_IGNORE_OUTPUT df.describe()
Show me a histogram of numeric columns
[19]:
df[(df['Quantity']>-50) & (df['Quantity']<50) & (df['UnitPrice']>0) & (df['UnitPrice']<100)][['Quantity', 'UnitPrice']].hist(figsize=[12,4], bins=30) plt.show()
[20]:
df[(df['Quantity']>-50) & (df['Quantity']<50) & (df['UnitPrice']>0) & (df['UnitPrice']<100)][['Quantity', 'UnitPrice']].hist(figsize=[12,4], bins=30, log=True) plt.show()
[21]:
df.query('Quantity>50 & UnitPrice<100')
258 rows × 8 columns
Numeric values
[22]:
df['Quantity'].head()
1000 1 1001 1 1002 1 1003 1 1004 12 Name: Quantity, dtype: int64
[23]:
df['UnitPrice'].head()
1000 1.25 1001 1.25 1002 1.25 1003 1.25 1004 0.29 Name: UnitPrice, dtype: float64
[24]:
product = df['Quantity'] * df['UnitPrice']
[25]:
product.head()
1000 1.25 1001 1.25 1002 1.25 1003 1.25 1004 3.48 dtype: float64
String concatenation
[26]:
df['Country'] + df['StockCode']
1000 United Kingdom21123 1001 United Kingdom21124 1002 United Kingdom21122 1003 United Kingdom84378 1004 United Kingdom21985 ... 14995 United Kingdom72349B 14996 United Kingdom72741 14997 United Kingdom22762 14998 United Kingdom21773 14999 United Kingdom22149 Length: 15000, dtype: object