Eland Demo Notebook#
In this notebook we鈥檒l show you how to explore and analyze data in Elasticsearch using Eland鈥檚 familiar Pandas-compatible API.
Requirements 馃О#
For this example, you will need:
Python 3.8 or later
An Elastic deployment
We鈥檒l be using Elastic Cloud for this example (available with a free trial)
Create Elastic Cloud deployment#
If you don鈥檛 have an Elastic Cloud deployment, follow these steps to create one.
Go to https://cloud.elastic.co/registration and sign up for a free trial
Select Create Deployment and follow the instructions
Install packages 馃摝#
First we pip install
and import the packages we need for this example.
[8]:
!pip install pandas numpy matplotlib requests eland elasticsearch
# Standard library imports
import os
import tempfile
import csv
import gzip
import json
from getpass import getpass
# Elasticsearch client
from elasticsearch import Elasticsearch
from elasticsearch import helpers
from elasticsearch.helpers import bulk
# Eland imports for Elasticsearch and pandas integration
import eland as ed
from eland.conftest import *
# Third-party imports for data handling
import pandas as pd
import numpy as np
# Visualization library
import matplotlib.pyplot as plt
# HTTP requests library
import requests
from google.colab import data_table
data_table.enable_dataframe_formatter()
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (1.5.3)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (1.23.5)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (3.7.1)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (2.31.0)
Requirement already satisfied: eland in /usr/local/lib/python3.10/dist-packages (8.11.0)
Requirement already satisfied: elasticsearch in /usr/local/lib/python3.10/dist-packages (8.10.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2023.3.post1)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (4.44.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.4.5)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (23.2)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (9.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (3.1.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests) (2023.7.22)
Requirement already satisfied: elastic-transport<9,>=8 in /usr/local/lib/python3.10/dist-packages (from elasticsearch) (8.10.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
Initialize client 馃攲#
Next we input credentials with getpass
. getpass
is part of the Python standard library and is used to securely prompt for credentials.
[9]:
if os.environ.get("DOCS_BUILD"):
# Building the docs relies on a local cluster
# This is for CI purposes, unrelated to notebook
es_kwargs = {"hosts": "http://localhost:9200"}
else:
cloud_id = getpass("Elastic Cloud ID: ")
username = getpass("Elastic username: ")
password = getpass("Elastic password: ")
es_kwargs = {
"cloud_id": cloud_id,
"basic_auth": (username, password)
}
es = Elasticsearch(**es_kwargs)
es_info = es.info()
print(f"Connected to cluster named '{es_info['cluster_name']}' (version: {es_info['version']['number']})")
Elastic Cloud ID: 路路路路路路路路路路
Elastic username: 路路路路路路路路路路
Elastic password: 路路路路路路路路路路
Connected to cluster named 'db96688aa6f340828c41d9bc3f61684e' (version: 8.11.0)
Load and process documents 馃搫#
Time to load some data! We鈥檒l be using the Eland test data, which is a fake flights dataset. First let鈥檚 create an Elasticsearch index with the correct mappings.
Create index with mappings#
[10]:
FLIGHTS_INDEX_NAME = "flights"
FLIGHTS_MAPPING = {
"mappings": {
"properties": {
"AvgTicketPrice": {"type": "float"},
"Cancelled": {"type": "boolean"},
"Carrier": {"type": "keyword"},
"Dest": {"type": "keyword"},
"DestAirportID": {"type": "keyword"},
"DestCityName": {"type": "keyword"},
"DestCountry": {"type": "keyword"},
"DestLocation": {"type": "geo_point"},
"DestRegion": {"type": "keyword"},
"DestWeather": {"type": "keyword"},
"DistanceKilometers": {"type": "float"},
"DistanceMiles": {"type": "float"},
"FlightDelay": {"type": "boolean"},
"FlightDelayMin": {"type": "integer"},
"FlightDelayType": {"type": "keyword"},
"FlightNum": {"type": "keyword"},
"FlightTimeHour": {"type": "float"},
"FlightTimeMin": {"type": "float"},
"Origin": {"type": "keyword"},
"OriginAirportID": {"type": "keyword"},
"OriginCityName": {"type": "keyword"},
"OriginCountry": {"type": "keyword"},
"OriginLocation": {"type": "geo_point"},
"OriginRegion": {"type": "keyword"},
"OriginWeather": {"type": "keyword"},
"dayOfWeek": {"type": "byte"},
"timestamp": {"type": "date", "format": "strict_date_hour_minute_second"},
}
}
}
# Check if the index already exists
if not es.indices.exists(index=FLIGHTS_INDEX_NAME):
# Create the index with the specified mappings
es.indices.create(index=FLIGHTS_INDEX_NAME, body=FLIGHTS_MAPPING)
print(f"Index '{FLIGHTS_INDEX_NAME}' created.")
else:
print(f"Index '{FLIGHTS_INDEX_NAME}' already exists.")
Index 'flights' created.
Download sample data and index into Elasticsearch#
We鈥檒l manage indexing errors to ensure that a single failure won鈥檛 halt the entire batch. Errors will be logged and the process will continue. In this case a subset of our documents have an incorrect timestamp format. If this was your dataset, you could update the source data or else change the mapping.
[11]:
if not os.environ.get("DOCS_BUILD"):
with tempfile.TemporaryDirectory() as tmpdir:
r = requests.get("https://github.com/elastic/eland/raw/main/tests/flights.json.gz")
path = os.path.join(tmpdir, "flights.json.gz")
with open(path, "wb") as f:
f.write(r.content)
with gzip.open(path, 'rt', encoding='utf-8') as gfile:
# Read the JSON objects into a list
# Each line in the file is a separate JSON object
data = [json.loads(line) for line in gfile]
# Create dataframe
pd_flights = pd.DataFrame(data)
# Function to yield formatted actions to index into Elasticsearch with incremental integer IDs
def generate_actions(data):
for idx, entry in enumerate(data, start=0): # Use monotonically increasing IDs starting from 0
yield {
"_index": FLIGHTS_INDEX_NAME,
"_id": idx, # Use the current index as the document ID
"_source": entry
}
# Perform bulk indexing
try:
# Set raise_on_error to False so the bulk operation continues after errors
success, errors = helpers.bulk(es, generate_actions(data), raise_on_error=False, stats_only=True)
print(f"Indexing completed. Successfully indexed {success} documents.")
if errors:
print(f"Some errors occurred. {errors} documents were not indexed.")
else:
print("All documents were indexed successfully without errors.")
except Exception as e:
print(f"An unexpected error occurred while indexing: {e}. Please check the Elasticsearch logs for more details.")
Indexing completed. Successfully indexed 13014 documents.
Some errors occurred. 45 documents were not indexed.
Compare Eland DataFrame vs pandas DataFrame#
Create an eland.DataFrame from a flights
index
[12]:
ed_flights = ed.DataFrame(es, 'flights')
[13]:
type(ed_flights)
[13]:
eland.dataframe.DataFrame
Compare to pandas DataFrame (created from the same data)
[14]:
pd_flights = ed.eland_to_pandas(ed_flights)
[15]:
type(pd_flights)
[15]:
pandas.core.frame.DataFrame
Attributes and underlying data#
DataFrame.columns#
[16]:
pd_flights.columns
[16]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
'timestamp'],
dtype='object')
[17]:
ed_flights.columns
[17]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
'timestamp'],
dtype='object')
DataFrame.dtypes#
[18]:
pd_flights.dtypes
[18]:
AvgTicketPrice float64
Cancelled bool
Carrier object
Dest object
DestAirportID object
...
OriginLocation object
OriginRegion object
OriginWeather object
dayOfWeek int64
timestamp datetime64[ns]
Length: 27, dtype: object
[19]:
ed_flights.dtypes
[19]:
AvgTicketPrice float64
Cancelled bool
Carrier object
Dest object
DestAirportID object
...
OriginLocation object
OriginRegion object
OriginWeather object
dayOfWeek int64
timestamp datetime64[ns]
Length: 27, dtype: object
DataFrame.select_dtypes#
[20]:
pd_flights.select_dtypes(include=np.number)
[20]:
AvgTicketPrice | DistanceKilometers | ... | FlightTimeMin | dayOfWeek | |
---|---|---|---|---|---|
0 | 841.265642 | 16492.326654 | ... | 1030.770416 | 0 |
1 | 882.982662 | 8823.400140 | ... | 464.389481 | 0 |
2 | 190.636904 | 0.000000 | ... | 0.000000 | 0 |
3 | 181.694216 | 555.737767 | ... | 222.749059 | 0 |
4 | 730.041778 | 13358.244200 | ... | 785.779071 | 0 |
... | ... | ... | ... | ... | ... |
12995 | 210.208911 | 0.000000 | ... | 0.000000 | 6 |
12996 | 830.500355 | 8551.966642 | ... | 1122.451513 | 6 |
12997 | 867.231029 | 9287.129863 | ... | 515.951659 | 6 |
12998 | 594.054018 | 3566.185736 | ... | 589.321980 | 6 |
12999 | 781.065911 | 11142.680199 | ... | 619.037789 | 6 |
12955 rows 脳 7 columns
[21]:
ed_flights.select_dtypes(include=np.number)
[21]:
AvgTicketPrice | DistanceKilometers | ... | FlightTimeMin | dayOfWeek | |
---|---|---|---|---|---|
0 | 841.265642 | 16492.326654 | ... | 1030.770416 | 0 |
1 | 882.982662 | 8823.400140 | ... | 464.389481 | 0 |
2 | 190.636904 | 0.000000 | ... | 0.000000 | 0 |
3 | 181.694216 | 555.737767 | ... | 222.749059 | 0 |
4 | 730.041778 | 13358.244200 | ... | 785.779071 | 0 |
... | ... | ... | ... | ... | ... |
13054 | 1080.446279 | 8058.581753 | ... | 402.929088 | 6 |
13055 | 646.612941 | 7088.598322 | ... | 644.418029 | 6 |
13056 | 997.751876 | 10920.652972 | ... | 937.540811 | 6 |
13057 | 1102.814465 | 18748.859647 | ... | 1697.404971 | 6 |
13058 | 858.144337 | 16809.141923 | ... | 1610.761827 | 6 |
13014 rows 脳 7 columns
DataFrame.empty#
[22]:
pd_flights.empty
[22]:
False
[23]:
ed_flights.empty
[23]:
False
DataFrame.shape#
[24]:
pd_flights.shape
[24]:
(12955, 27)
[25]:
ed_flights.shape
[25]:
(13014, 27)
DataFrame.index#
Note, eland.DataFrame.index
does not mirror pandas.DataFrame.index
.
[26]:
pd_flights.index
[26]:
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
...
'12990', '12991', '12992', '12993', '12994', '12995', '12996', '12997', '12998', '12999'],
dtype='object', length=12955)
[27]:
# NBVAL_IGNORE_OUTPUT
ed_flights.index
[27]:
<eland.index.Index at 0x79e8c1ef9d50>
[28]:
ed_flights.index.es_index_field
[28]:
'_id'
DataFrame.values#
Note, eland.DataFrame.values
is not supported.
[29]:
pd_flights.values
[29]:
array([[841.2656419677076, False, 'Kibana Airlines', ..., 'Sunny', 0,
Timestamp('2018-01-01 00:00:00')],
[882.9826615595518, False, 'Logstash Airways', ..., 'Clear', 0,
Timestamp('2018-01-01 18:27:00')],
[190.6369038508356, False, 'Logstash Airways', ..., 'Rain', 0,
Timestamp('2018-01-01 17:11:14')],
...,
[867.231028821183, False, 'ES-Air', ..., 'Clear', 6,
Timestamp('2018-02-11 20:51:31')],
[594.0540183976791, False, 'Kibana Airlines', ..., 'Clear', 6,
Timestamp('2018-02-11 12:27:52')],
[781.0659105148637, False, 'Kibana Airlines', ..., 'Clear', 6,
Timestamp('2018-02-11 03:54:56')]], dtype=object)
[30]:
try:
ed_flights.values
except AttributeError as e:
print(e)
This method would scan/scroll the entire Elasticsearch index(s) into memory. If this is explicitly required, and there is sufficient memory, call `ed.eland_to_pandas(ed_df).values`
Indexing, iteration#
DataFrame.head#
[31]:
pd_flights.head()
Warning: Total number of columns (27) exceeds max_columns (20). Falling back to pandas display.
[31]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
0 | 841.265642 | False | ... | 0 | 2018-01-01 00:00:00 |
1 | 882.982662 | False | ... | 0 | 2018-01-01 18:27:00 |
2 | 190.636904 | False | ... | 0 | 2018-01-01 17:11:14 |
3 | 181.694216 | True | ... | 0 | 2018-01-01 10:33:28 |
4 | 730.041778 | False | ... | 0 | 2018-01-01 05:13:00 |
5 rows 脳 27 columns
[32]:
ed_flights.head()
[32]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
0 | 841.265642 | False | ... | 0 | 2018-01-01 00:00:00 |
1 | 882.982662 | False | ... | 0 | 2018-01-01 18:27:00 |
2 | 190.636904 | False | ... | 0 | 2018-01-01 17:11:14 |
3 | 181.694216 | True | ... | 0 | 2018-01-01 10:33:28 |
4 | 730.041778 | False | ... | 0 | 2018-01-01 05:13:00 |
5 rows 脳 27 columns
DataFrame.tail#
[33]:
pd_flights.tail()
Warning: Total number of columns (27) exceeds max_columns (20). Falling back to pandas display.
[33]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
12995 | 210.208911 | False | ... | 6 | 2018-02-11 02:26:04 |
12996 | 830.500355 | False | ... | 6 | 2018-02-11 03:52:31 |
12997 | 867.231029 | False | ... | 6 | 2018-02-11 20:51:31 |
12998 | 594.054018 | False | ... | 6 | 2018-02-11 12:27:52 |
12999 | 781.065911 | False | ... | 6 | 2018-02-11 03:54:56 |
5 rows 脳 27 columns
[34]:
ed_flights.tail()
[34]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
13054 | 1080.446279 | False | ... | 6 | 2018-02-11 20:42:25 |
13055 | 646.612941 | False | ... | 6 | 2018-02-11 01:41:57 |
13056 | 997.751876 | False | ... | 6 | 2018-02-11 04:09:27 |
13057 | 1102.814465 | False | ... | 6 | 2018-02-11 08:28:21 |
13058 | 858.144337 | False | ... | 6 | 2018-02-11 14:54:34 |
5 rows 脳 27 columns
DataFrame.keys#
[35]:
pd_flights.keys()
[35]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
'timestamp'],
dtype='object')
[36]:
ed_flights.keys()
[36]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
'timestamp'],
dtype='object')
DataFrame.get#
[37]:
pd_flights.get('Carrier')
[37]:
0 Kibana Airlines
1 Logstash Airways
2 Logstash Airways
3 Kibana Airlines
4 Kibana Airlines
...
12995 JetBeats
12996 Logstash Airways
12997 ES-Air
12998 Kibana Airlines
12999 Kibana Airlines
Name: Carrier, Length: 12955, dtype: object
[38]:
ed_flights.get('Carrier')
[38]:
0 Kibana Airlines
1 Logstash Airways
2 Logstash Airways
3 Kibana Airlines
4 Kibana Airlines
...
13054 Logstash Airways
13055 Logstash Airways
13056 Logstash Airways
13057 JetBeats
13058 JetBeats
Name: Carrier, Length: 13014, dtype: object
[39]:
pd_flights.get(['Carrier', 'Origin'])
[39]:
Carrier | Origin | |
---|---|---|
0 | Kibana Airlines | Frankfurt am Main Airport |
1 | Logstash Airways | Cape Town International Airport |
2 | Logstash Airways | Venice Marco Polo Airport |
3 | Kibana Airlines | Naples International Airport |
4 | Kibana Airlines | Licenciado Benito Juarez International Airport |
... | ... | ... |
12995 | JetBeats | Ottawa Macdonald-Cartier International Airport |
12996 | Logstash Airways | Denver International Airport |
12997 | ES-Air | Charles de Gaulle International Airport |
12998 | Kibana Airlines | Rajiv Gandhi International Airport |
12999 | Kibana Airlines | Abu Dhabi International Airport |
12955 rows 脳 2 columns
List input not currently supported by eland.DataFrame.get
[40]:
try:
ed_flights.get(['Carrier', 'Origin'])
except TypeError as e:
print(e)
unhashable type: 'list'
DataFrame.query#
[41]:
pd_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled == True')
Warning: Total number of columns (27) exceeds max_columns (20). Falling back to pandas display.
[41]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
8 | 960.869736 | True | ... | 0 | 2018-01-01 12:09:35 |
26 | 975.812632 | True | ... | 0 | 2018-01-01 15:38:32 |
311 | 946.358410 | True | ... | 0 | 2018-01-01 11:51:12 |
651 | 975.383864 | True | ... | 2 | 2018-01-03 21:13:17 |
950 | 907.836523 | True | ... | 2 | 2018-01-03 05:14:51 |
... | ... | ... | ... | ... | ... |
12798 | 1083.517299 | True | ... | 5 | 2018-02-10 19:20:10 |
12820 | 909.973606 | True | ... | 5 | 2018-02-10 05:11:35 |
12906 | 983.429244 | True | ... | 6 | 2018-02-11 06:19:58 |
12918 | 1136.678150 | True | ... | 6 | 2018-02-11 16:03:10 |
12919 | 1105.211803 | True | ... | 6 | 2018-02-11 05:36:05 |
67 rows 脳 27 columns
eland.DataFrame.query
requires qualifier on bool i.e.
ed_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled')
fails
[42]:
try:
ed_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled == True')
except Exception as e:
print(f"query() failed as expected: {repr(e)}.")
query() failed as expected: TypeError("Cannot interpret '{'terms': {'Carrier': ['Kibana Airlines']}}' as a data type").
Boolean indexing query#
[43]:
pd_flights[(pd_flights.Carrier=="Kibana Airlines") &
(pd_flights.AvgTicketPrice > 900.0) &
(pd_flights.Cancelled == True)]
Warning: Total number of columns (27) exceeds max_columns (20). Falling back to pandas display.
[43]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
8 | 960.869736 | True | ... | 0 | 2018-01-01 12:09:35 |
26 | 975.812632 | True | ... | 0 | 2018-01-01 15:38:32 |
311 | 946.358410 | True | ... | 0 | 2018-01-01 11:51:12 |
651 | 975.383864 | True | ... | 2 | 2018-01-03 21:13:17 |
950 | 907.836523 | True | ... | 2 | 2018-01-03 05:14:51 |
... | ... | ... | ... | ... | ... |
12798 | 1083.517299 | True | ... | 5 | 2018-02-10 19:20:10 |
12820 | 909.973606 | True | ... | 5 | 2018-02-10 05:11:35 |
12906 | 983.429244 | True | ... | 6 | 2018-02-11 06:19:58 |
12918 | 1136.678150 | True | ... | 6 | 2018-02-11 16:03:10 |
12919 | 1105.211803 | True | ... | 6 | 2018-02-11 05:36:05 |
67 rows 脳 27 columns
[44]:
ed_flights[(ed_flights.Carrier=="Kibana Airlines") &
(ed_flights.AvgTicketPrice > 900.0) &
(ed_flights.Cancelled == True)]
[44]:
AvgTicketPrice | Cancelled | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
8 | 960.869736 | True | ... | 0 | 2018-01-01 12:09:35 |
26 | 975.812632 | True | ... | 0 | 2018-01-01 15:38:32 |
311 | 946.358410 | True | ... | 0 | 2018-01-01 11:51:12 |
651 | 975.383864 | True | ... | 2 | 2018-01-03 21:13:17 |
950 | 907.836523 | True | ... | 2 | 2018-01-03 05:14:51 |
... | ... | ... | ... | ... | ... |
12820 | 909.973606 | True | ... | 5 | 2018-02-10 05:11:35 |
12906 | 983.429244 | True | ... | 6 | 2018-02-11 06:19:58 |
12918 | 1136.678150 | True | ... | 6 | 2018-02-11 16:03:10 |
12919 | 1105.211803 | True | ... | 6 | 2018-02-11 05:36:05 |
13013 | 1055.350213 | True | ... | 6 | 2018-02-11 13:20:16 |
68 rows 脳 27 columns
Function application, GroupBy & window#
DataFrame.aggs#
[45]:
pd_flights[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
[45]:
DistanceKilometers | AvgTicketPrice | |
---|---|---|
sum | 9.185214e+07 | 8.130904e+06 |
min | 0.000000e+00 | 1.000205e+02 |
std | 4.575501e+03 | 2.663747e+02 |
eland.DataFrame.aggregate
currently only supported numeric columns
[46]:
ed_flights[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
[46]:
DistanceKilometers | AvgTicketPrice | |
---|---|---|
sum | 9.229730e+07 | 8.177881e+06 |
min | 0.000000e+00 | 1.000205e+02 |
std | 4.578132e+03 | 2.664956e+02 |
Computations / descriptive stats#
DataFrame.count#
[47]:
pd_flights.count()
[47]:
AvgTicketPrice 12955
Cancelled 12955
Carrier 12955
Dest 12955
DestAirportID 12955
...
OriginLocation 12955
OriginRegion 12955
OriginWeather 12955
dayOfWeek 12955
timestamp 12955
Length: 27, dtype: int64
[48]:
ed_flights.count()
[48]:
AvgTicketPrice 13014
Cancelled 13014
Carrier 13014
Dest 13014
DestAirportID 13014
...
OriginLocation 13014
OriginRegion 13014
OriginWeather 13014
dayOfWeek 13014
timestamp 13014
Length: 27, dtype: int64
DataFrame.describe#
[49]:
pd_flights.describe()
[49]:
AvgTicketPrice | DistanceKilometers | ... | FlightTimeMin | dayOfWeek | |
---|---|---|---|---|---|
count | 12955.000000 | 12955.000000 | ... | 12955.000000 | 12955.000000 |
mean | 627.626743 | 7090.092034 | ... | 510.931002 | 2.820455 |
std | 266.374679 | 4575.500916 | ... | 334.378895 | 1.932079 |
min | 100.020528 | 0.000000 | ... | 0.000000 | 0.000000 |
25% | 409.522973 | 2458.696332 | ... | 252.658936 | 1.000000 |
50% | 640.169085 | 7610.330866 | ... | 503.142831 | 3.000000 |
75% | 841.437465 | 9733.565957 | ... | 720.183608 | 4.000000 |
max | 1199.729053 | 19881.482315 | ... | 1902.902032 | 6.000000 |
8 rows 脳 7 columns
Values returned from eland.DataFrame.describe
may vary due to results of Elasticsearch aggregations.
[50]:
# NBVAL_IGNORE_OUTPUT
ed_flights.describe()
[50]:
AvgTicketPrice | Cancelled | ... | FlightTimeMin | dayOfWeek | |
---|---|---|---|---|---|
count | 13014.000000 | 13014.000000 | ... | 13014.000000 | 13014.000000 |
mean | 628.391013 | 0.128323 | ... | 511.106340 | 2.834870 |
std | 266.495605 | 0.334475 | ... | 334.674968 | 1.939568 |
min | 100.020531 | 0.000000 | ... | 0.000000 | 0.000000 |
25% | 411.187230 | 0.000000 | ... | 250.141903 | 1.061057 |
50% | 639.776657 | 0.000000 | ... | 502.902212 | 2.932961 |
75% | 842.653771 | 0.000000 | ... | 721.186244 | 4.445456 |
max | 1199.729004 | 1.000000 | ... | 1902.901978 | 6.000000 |
8 rows 脳 9 columns
DataFrame.info#
[51]:
pd_flights.info()
<class 'pandas.core.frame.DataFrame'>
Index: 12955 entries, 0 to 12999
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AvgTicketPrice 12955 non-null float64
1 Cancelled 12955 non-null bool
2 Carrier 12955 non-null object
3 Dest 12955 non-null object
4 DestAirportID 12955 non-null object
5 DestCityName 12955 non-null object
6 DestCountry 12955 non-null object
7 DestLocation 12955 non-null object
8 DestRegion 12955 non-null object
9 DestWeather 12955 non-null object
10 DistanceKilometers 12955 non-null float64
11 DistanceMiles 12955 non-null float64
12 FlightDelay 12955 non-null bool
13 FlightDelayMin 12955 non-null int64
14 FlightDelayType 12955 non-null object
15 FlightNum 12955 non-null object
16 FlightTimeHour 12955 non-null float64
17 FlightTimeMin 12955 non-null float64
18 Origin 12955 non-null object
19 OriginAirportID 12955 non-null object
20 OriginCityName 12955 non-null object
21 OriginCountry 12955 non-null object
22 OriginLocation 12955 non-null object
23 OriginRegion 12955 non-null object
24 OriginWeather 12955 non-null object
25 dayOfWeek 12955 non-null int64
26 timestamp 12955 non-null datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17)
memory usage: 3.1+ MB
[52]:
ed_flights.info()
<class 'eland.dataframe.DataFrame'>
Index: 13014 entries, 0 to 13058
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AvgTicketPrice 13014 non-null float64
1 Cancelled 13014 non-null bool
2 Carrier 13014 non-null object
3 Dest 13014 non-null object
4 DestAirportID 13014 non-null object
5 DestCityName 13014 non-null object
6 DestCountry 13014 non-null object
7 DestLocation 13014 non-null object
8 DestRegion 13014 non-null object
9 DestWeather 13014 non-null object
10 DistanceKilometers 13014 non-null float64
11 DistanceMiles 13014 non-null float64
12 FlightDelay 13014 non-null bool
13 FlightDelayMin 13014 non-null int64
14 FlightDelayType 13014 non-null object
15 FlightNum 13014 non-null object
16 FlightTimeHour 13014 non-null float64
17 FlightTimeMin 13014 non-null float64
18 Origin 13014 non-null object
19 OriginAirportID 13014 non-null object
20 OriginCityName 13014 non-null object
21 OriginCountry 13014 non-null object
22 OriginLocation 13014 non-null object
23 OriginRegion 13014 non-null object
24 OriginWeather 13014 non-null object
25 dayOfWeek 13014 non-null int64
26 timestamp 13014 non-null datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17)
memory usage: 64.000 bytes
Elasticsearch storage usage: 11.097 MB
DataFrame.max, DataFrame.min, DataFrame.mean, DataFrame.sum#
max#
[53]:
pd_flights.max(numeric_only=True)
[53]:
AvgTicketPrice 1199.729053
Cancelled True
DistanceKilometers 19881.482315
DistanceMiles 12353.780369
FlightDelay True
FlightDelayMin 360
FlightTimeHour 31.715034
FlightTimeMin 1902.902032
dayOfWeek 6
dtype: object
eland.DataFrame.max,min,mean,sum
only aggregate numeric columns
[54]:
ed_flights.max(numeric_only=True)
[54]:
AvgTicketPrice 1199.729004
Cancelled 1.000000
DistanceKilometers 19881.482422
DistanceMiles 12353.780273
FlightDelay 1.000000
FlightDelayMin 360.000000
FlightTimeHour 31.715034
FlightTimeMin 1902.901978
dayOfWeek 6.000000
dtype: float64
min#
[55]:
pd_flights.min(numeric_only=True)
[55]:
AvgTicketPrice 100.020528
Cancelled False
DistanceKilometers 0.0
DistanceMiles 0.0
FlightDelay False
FlightDelayMin 0
FlightTimeHour 0.0
FlightTimeMin 0.0
dayOfWeek 0
dtype: object
[56]:
ed_flights.min(numeric_only=True)
[56]:
AvgTicketPrice 100.020531
Cancelled 0.000000
DistanceKilometers 0.000000
DistanceMiles 0.000000
FlightDelay 0.000000
FlightDelayMin 0.000000
FlightTimeHour 0.000000
FlightTimeMin 0.000000
dayOfWeek 0.000000
dtype: float64
mean#
[57]:
pd_flights.mean(numeric_only=True)
[57]:
AvgTicketPrice 627.626743
Cancelled 0.128059
DistanceKilometers 7090.092034
DistanceMiles 4405.578940
FlightDelay 0.251023
FlightDelayMin 47.247395
FlightTimeHour 8.515517
FlightTimeMin 510.931002
dayOfWeek 2.820455
dtype: float64
[58]:
ed_flights.mean(numeric_only=True)
[58]:
AvgTicketPrice 628.391013
Cancelled 0.128323
DistanceKilometers 7092.154679
DistanceMiles 4406.860604
FlightDelay 0.251268
FlightDelayMin 47.349009
FlightTimeHour 8.518439
FlightTimeMin 511.106340
dayOfWeek 2.834870
dtype: float64
sum#
[59]:
pd_flights.sum(numeric_only=True)
[59]:
AvgTicketPrice 8.130904e+06
Cancelled 1.659000e+03
DistanceKilometers 9.185214e+07
DistanceMiles 5.707428e+07
FlightDelay 3.252000e+03
FlightDelayMin 6.120900e+05
FlightTimeHour 1.103185e+05
FlightTimeMin 6.619111e+06
dayOfWeek 3.653900e+04
dtype: float64
[60]:
ed_flights.sum(numeric_only=True)
[60]:
AvgTicketPrice 8.177881e+06
Cancelled 1.670000e+03
DistanceKilometers 9.229730e+07
DistanceMiles 5.735088e+07
FlightDelay 3.270000e+03
FlightDelayMin 6.162000e+05
FlightTimeHour 1.108590e+05
FlightTimeMin 6.651538e+06
dayOfWeek 3.689300e+04
dtype: float64
DataFrame.nunique#
[61]:
pd_flights[['Carrier', 'Origin', 'Dest']].nunique()
[61]:
Carrier 4
Origin 156
Dest 156
dtype: int64
[62]:
ed_flights[['Carrier', 'Origin', 'Dest']].nunique()
[62]:
Carrier 4
Origin 156
Dest 156
dtype: int64
DataFrame.drop#
[63]:
pd_flights.drop(columns=['AvgTicketPrice',
'Cancelled',
'DestLocation',
'Dest',
'DestAirportID',
'DestCityName',
'DestCountry'])
[63]:
Carrier | DestRegion | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
0 | Kibana Airlines | SE-BD | ... | 0 | 2018-01-01 00:00:00 |
1 | Logstash Airways | IT-34 | ... | 0 | 2018-01-01 18:27:00 |
2 | Logstash Airways | IT-34 | ... | 0 | 2018-01-01 17:11:14 |
3 | Kibana Airlines | IT-34 | ... | 0 | 2018-01-01 10:33:28 |
4 | Kibana Airlines | SE-BD | ... | 0 | 2018-01-01 05:13:00 |
... | ... | ... | ... | ... | ... |
12995 | JetBeats | CA-ON | ... | 6 | 2018-02-11 02:26:04 |
12996 | Logstash Airways | IT-42 | ... | 6 | 2018-02-11 03:52:31 |
12997 | ES-Air | SE-BD | ... | 6 | 2018-02-11 20:51:31 |
12998 | Kibana Airlines | SE-BD | ... | 6 | 2018-02-11 12:27:52 |
12999 | Kibana Airlines | CA-ON | ... | 6 | 2018-02-11 03:54:56 |
12955 rows 脳 20 columns
[64]:
ed_flights.drop(columns=['AvgTicketPrice',
'Cancelled',
'DestLocation',
'Dest',
'DestAirportID',
'DestCityName',
'DestCountry'])
[64]:
Carrier | DestRegion | ... | dayOfWeek | timestamp | |
---|---|---|---|---|---|
0 | Kibana Airlines | SE-BD | ... | 0 | 2018-01-01 00:00:00 |
1 | Logstash Airways | IT-34 | ... | 0 | 2018-01-01 18:27:00 |
2 | Logstash Airways | IT-34 | ... | 0 | 2018-01-01 17:11:14 |
3 | Kibana Airlines | IT-34 | ... | 0 | 2018-01-01 10:33:28 |
4 | Kibana Airlines | SE-BD | ... | 0 | 2018-01-01 05:13:00 |
... | ... | ... | ... | ... | ... |
13054 | Logstash Airways | SE-BD | ... | 6 | 2018-02-11 20:42:25 |
13055 | Logstash Airways | CH-ZH | ... | 6 | 2018-02-11 01:41:57 |
13056 | Logstash Airways | RU-AMU | ... | 6 | 2018-02-11 04:09:27 |
13057 | JetBeats | SE-BD | ... | 6 | 2018-02-11 08:28:21 |
13058 | JetBeats | US-DC | ... | 6 | 2018-02-11 14:54:34 |
13014 rows 脳 20 columns
Plotting#
[65]:
pd_flights.select_dtypes(include=np.number).hist(figsize=[10,10])
plt.show()
[66]:
ed_flights.select_dtypes(include=np.number).hist(figsize=[10,10])
plt.show()
Elasticsearch utilities#
[67]:
ed_flights2 = ed_flights[(ed_flights.OriginAirportID == 'AMS') & (ed_flights.FlightDelayMin > 60)]
ed_flights2 = ed_flights2[['timestamp', 'OriginAirportID', 'DestAirportID', 'FlightDelayMin']]
ed_flights2 = ed_flights2.tail()
[68]:
print(ed_flights2.es_info())
es_index_pattern: flights
Index:
es_index_field: _id
is_source_field: False
Mappings:
capabilities:
es_field_name is_source es_dtype es_date_format pd_dtype is_searchable is_aggregatable is_scripted aggregatable_es_field_name
timestamp timestamp True date strict_date_hour_minute_second datetime64[ns] True True False timestamp
OriginAirportID OriginAirportID True keyword None object True True False OriginAirportID
DestAirportID DestAirportID True keyword None object True True False DestAirportID
FlightDelayMin FlightDelayMin True integer None int64 True True False FlightDelayMin
Operations:
tasks: [('boolean_filter': ('boolean_filter': {'bool': {'must': [{'term': {'OriginAirportID': 'AMS'}}, {'range': {'FlightDelayMin': {'gt': 60}}}]}})), ('tail': ('sort_field': '_doc', 'count': 5))]
size: 5
sort_params: {'_doc': 'desc'}
_source: ['timestamp', 'OriginAirportID', 'DestAirportID', 'FlightDelayMin']
body: {'query': {'bool': {'must': [{'term': {'OriginAirportID': 'AMS'}}, {'range': {'FlightDelayMin': {'gt': 60}}}]}}}
post_processing: [('sort_index')]