Eland Demo Notebook#

Open in Colab

In this notebook we鈥檒l show you how to explore and analyze data in Elasticsearch using Eland鈥檚 familiar Pandas-compatible API.

Requirements 馃О#

For this example, you will need:

  • Python 3.8 or later

  • An Elastic deployment

Create Elastic Cloud deployment#

If you don鈥檛 have an Elastic Cloud deployment, follow these steps to create one.

  1. Go to https://cloud.elastic.co/registration and sign up for a free trial

  2. Select Create Deployment and follow the instructions

Install packages 馃摝#

First we pip install and import the packages we need for this example.

[8]:
!pip install pandas numpy matplotlib requests eland elasticsearch

# Standard library imports
import os
import tempfile
import csv
import gzip
import json
from getpass import getpass

# Elasticsearch client
from elasticsearch import Elasticsearch
from elasticsearch import helpers
from elasticsearch.helpers import bulk

# Eland imports for Elasticsearch and pandas integration
import eland as ed
from eland.conftest import *

# Third-party imports for data handling
import pandas as pd
import numpy as np

# Visualization library
import matplotlib.pyplot as plt

# HTTP requests library
import requests

from google.colab import data_table
data_table.enable_dataframe_formatter()

Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (1.5.3)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (1.23.5)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (3.7.1)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (2.31.0)
Requirement already satisfied: eland in /usr/local/lib/python3.10/dist-packages (8.11.0)
Requirement already satisfied: elasticsearch in /usr/local/lib/python3.10/dist-packages (8.10.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2023.3.post1)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (4.44.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.4.5)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (23.2)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (9.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (3.1.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests) (2023.7.22)
Requirement already satisfied: elastic-transport<9,>=8 in /usr/local/lib/python3.10/dist-packages (from elasticsearch) (8.10.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)

Initialize client 馃攲#

Next we input credentials with getpass. getpass is part of the Python standard library and is used to securely prompt for credentials.

[9]:
if os.environ.get("DOCS_BUILD"):
  # Building the docs relies on a local cluster
  # This is for CI purposes, unrelated to notebook
  es_kwargs = {"hosts": "http://localhost:9200"}
else:
  cloud_id = getpass("Elastic Cloud ID: ")
  username = getpass("Elastic username: ")
  password = getpass("Elastic password: ")
  es_kwargs = {
      "cloud_id": cloud_id,
      "basic_auth": (username, password)
  }

es = Elasticsearch(**es_kwargs)
es_info = es.info()
print(f"Connected to cluster named '{es_info['cluster_name']}' (version: {es_info['version']['number']})")

Elastic Cloud ID: 路路路路路路路路路路
Elastic username: 路路路路路路路路路路
Elastic password: 路路路路路路路路路路
Connected to cluster named 'db96688aa6f340828c41d9bc3f61684e' (version: 8.11.0)

Load and process documents 馃搫#

Time to load some data! We鈥檒l be using the Eland test data, which is a fake flights dataset. First let鈥檚 create an Elasticsearch index with the correct mappings.

Create index with mappings#

[10]:
FLIGHTS_INDEX_NAME = "flights"
FLIGHTS_MAPPING = {
    "mappings": {
        "properties": {
            "AvgTicketPrice": {"type": "float"},
            "Cancelled": {"type": "boolean"},
            "Carrier": {"type": "keyword"},
            "Dest": {"type": "keyword"},
            "DestAirportID": {"type": "keyword"},
            "DestCityName": {"type": "keyword"},
            "DestCountry": {"type": "keyword"},
            "DestLocation": {"type": "geo_point"},
            "DestRegion": {"type": "keyword"},
            "DestWeather": {"type": "keyword"},
            "DistanceKilometers": {"type": "float"},
            "DistanceMiles": {"type": "float"},
            "FlightDelay": {"type": "boolean"},
            "FlightDelayMin": {"type": "integer"},
            "FlightDelayType": {"type": "keyword"},
            "FlightNum": {"type": "keyword"},
            "FlightTimeHour": {"type": "float"},
            "FlightTimeMin": {"type": "float"},
            "Origin": {"type": "keyword"},
            "OriginAirportID": {"type": "keyword"},
            "OriginCityName": {"type": "keyword"},
            "OriginCountry": {"type": "keyword"},
            "OriginLocation": {"type": "geo_point"},
            "OriginRegion": {"type": "keyword"},
            "OriginWeather": {"type": "keyword"},
            "dayOfWeek": {"type": "byte"},
            "timestamp": {"type": "date", "format": "strict_date_hour_minute_second"},
        }
    }
}

# Check if the index already exists
if not es.indices.exists(index=FLIGHTS_INDEX_NAME):
    # Create the index with the specified mappings
    es.indices.create(index=FLIGHTS_INDEX_NAME, body=FLIGHTS_MAPPING)
    print(f"Index '{FLIGHTS_INDEX_NAME}' created.")
else:
    print(f"Index '{FLIGHTS_INDEX_NAME}' already exists.")

Index 'flights' created.

Download sample data and index into Elasticsearch#

We鈥檒l manage indexing errors to ensure that a single failure won鈥檛 halt the entire batch. Errors will be logged and the process will continue. In this case a subset of our documents have an incorrect timestamp format. If this was your dataset, you could update the source data or else change the mapping.

[11]:
if not os.environ.get("DOCS_BUILD"):
  with tempfile.TemporaryDirectory() as tmpdir:
    r = requests.get("https://github.com/elastic/eland/raw/main/tests/flights.json.gz")
    path = os.path.join(tmpdir, "flights.json.gz")
    with open(path, "wb") as f:
      f.write(r.content)

      with gzip.open(path, 'rt', encoding='utf-8') as gfile:
            # Read the JSON objects into a list
            # Each line in the file is a separate JSON object
            data = [json.loads(line) for line in gfile]

# Create dataframe
pd_flights = pd.DataFrame(data)

# Function to yield formatted actions to index into Elasticsearch with incremental integer IDs
def generate_actions(data):
    for idx, entry in enumerate(data, start=0):  # Use monotonically increasing IDs starting from 0
        yield {
            "_index": FLIGHTS_INDEX_NAME,
            "_id": idx,  # Use the current index as the document ID
            "_source": entry
        }

# Perform bulk indexing
try:
    # Set raise_on_error to False so the bulk operation continues after errors
    success, errors = helpers.bulk(es, generate_actions(data), raise_on_error=False, stats_only=True)
    print(f"Indexing completed. Successfully indexed {success} documents.")

    if errors:
        print(f"Some errors occurred. {errors} documents were not indexed.")
    else:
        print("All documents were indexed successfully without errors.")
except Exception as e:
    print(f"An unexpected error occurred while indexing: {e}. Please check the Elasticsearch logs for more details.")


Indexing completed. Successfully indexed 13014 documents.
Some errors occurred. 45 documents were not indexed.

Compare Eland DataFrame vs pandas DataFrame#

Create an eland.DataFrame from a flights index

[12]:
ed_flights = ed.DataFrame(es, 'flights')
[13]:
type(ed_flights)
[13]:
eland.dataframe.DataFrame

Compare to pandas DataFrame (created from the same data)

[14]:
pd_flights = ed.eland_to_pandas(ed_flights)
[15]:
type(pd_flights)
[15]:
pandas.core.frame.DataFrame

Attributes and underlying data#

DataFrame.columns#

[16]:
pd_flights.columns
[16]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
       'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
       'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
       'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
       'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
       'timestamp'],
      dtype='object')
[17]:
ed_flights.columns
[17]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
       'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
       'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
       'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
       'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
       'timestamp'],
      dtype='object')

DataFrame.dtypes#

[18]:
pd_flights.dtypes
[18]:
AvgTicketPrice           float64
Cancelled                   bool
Carrier                   object
Dest                      object
DestAirportID             object
                       ...
OriginLocation            object
OriginRegion              object
OriginWeather             object
dayOfWeek                  int64
timestamp         datetime64[ns]
Length: 27, dtype: object
[19]:
ed_flights.dtypes
[19]:
AvgTicketPrice           float64
Cancelled                   bool
Carrier                   object
Dest                      object
DestAirportID             object
                       ...
OriginLocation            object
OriginRegion              object
OriginWeather             object
dayOfWeek                  int64
timestamp         datetime64[ns]
Length: 27, dtype: object

DataFrame.select_dtypes#

[20]:
pd_flights.select_dtypes(include=np.number)
[20]:
AvgTicketPrice DistanceKilometers ... FlightTimeMin dayOfWeek
0 841.265642 16492.326654 ... 1030.770416 0
1 882.982662 8823.400140 ... 464.389481 0
2 190.636904 0.000000 ... 0.000000 0
3 181.694216 555.737767 ... 222.749059 0
4 730.041778 13358.244200 ... 785.779071 0
... ... ... ... ... ...
12995 210.208911 0.000000 ... 0.000000 6
12996 830.500355 8551.966642 ... 1122.451513 6
12997 867.231029 9287.129863 ... 515.951659 6
12998 594.054018 3566.185736 ... 589.321980 6
12999 781.065911 11142.680199 ... 619.037789 6

12955 rows 脳 7 columns

[21]:
ed_flights.select_dtypes(include=np.number)
[21]:
AvgTicketPrice DistanceKilometers ... FlightTimeMin dayOfWeek
0 841.265642 16492.326654 ... 1030.770416 0
1 882.982662 8823.400140 ... 464.389481 0
2 190.636904 0.000000 ... 0.000000 0
3 181.694216 555.737767 ... 222.749059 0
4 730.041778 13358.244200 ... 785.779071 0
... ... ... ... ... ...
13054 1080.446279 8058.581753 ... 402.929088 6
13055 646.612941 7088.598322 ... 644.418029 6
13056 997.751876 10920.652972 ... 937.540811 6
13057 1102.814465 18748.859647 ... 1697.404971 6
13058 858.144337 16809.141923 ... 1610.761827 6

13014 rows 脳 7 columns

DataFrame.empty#

[22]:
pd_flights.empty
[22]:
False
[23]:
ed_flights.empty
[23]:
False

DataFrame.shape#

[24]:
pd_flights.shape
[24]:
(12955, 27)
[25]:
ed_flights.shape
[25]:
(13014, 27)

DataFrame.index#

Note, eland.DataFrame.index does not mirror pandas.DataFrame.index.

[26]:
pd_flights.index
[26]:
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
       ...
       '12990', '12991', '12992', '12993', '12994', '12995', '12996', '12997', '12998', '12999'],
      dtype='object', length=12955)
[27]:
# NBVAL_IGNORE_OUTPUT
ed_flights.index
[27]:
<eland.index.Index at 0x79e8c1ef9d50>
[28]:
ed_flights.index.es_index_field
[28]:
'_id'

DataFrame.values#

Note, eland.DataFrame.values is not supported.

[29]:
pd_flights.values
[29]:
array([[841.2656419677076, False, 'Kibana Airlines', ..., 'Sunny', 0,
        Timestamp('2018-01-01 00:00:00')],
       [882.9826615595518, False, 'Logstash Airways', ..., 'Clear', 0,
        Timestamp('2018-01-01 18:27:00')],
       [190.6369038508356, False, 'Logstash Airways', ..., 'Rain', 0,
        Timestamp('2018-01-01 17:11:14')],
       ...,
       [867.231028821183, False, 'ES-Air', ..., 'Clear', 6,
        Timestamp('2018-02-11 20:51:31')],
       [594.0540183976791, False, 'Kibana Airlines', ..., 'Clear', 6,
        Timestamp('2018-02-11 12:27:52')],
       [781.0659105148637, False, 'Kibana Airlines', ..., 'Clear', 6,
        Timestamp('2018-02-11 03:54:56')]], dtype=object)
[30]:
try:
    ed_flights.values
except AttributeError as e:
    print(e)
This method would scan/scroll the entire Elasticsearch index(s) into memory. If this is explicitly required, and there is sufficient memory, call `ed.eland_to_pandas(ed_df).values`

Indexing, iteration#

DataFrame.head#

[31]:
pd_flights.head()
Warning: Total number of columns (27) exceeds max_columns (20). Falling back to pandas display.
[31]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
0 841.265642 False ... 0 2018-01-01 00:00:00
1 882.982662 False ... 0 2018-01-01 18:27:00
2 190.636904 False ... 0 2018-01-01 17:11:14
3 181.694216 True ... 0 2018-01-01 10:33:28
4 730.041778 False ... 0 2018-01-01 05:13:00

5 rows 脳 27 columns

[32]:
ed_flights.head()
[32]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
0 841.265642 False ... 0 2018-01-01 00:00:00
1 882.982662 False ... 0 2018-01-01 18:27:00
2 190.636904 False ... 0 2018-01-01 17:11:14
3 181.694216 True ... 0 2018-01-01 10:33:28
4 730.041778 False ... 0 2018-01-01 05:13:00

5 rows 脳 27 columns

DataFrame.tail#

[33]:
pd_flights.tail()
Warning: Total number of columns (27) exceeds max_columns (20). Falling back to pandas display.
[33]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
12995 210.208911 False ... 6 2018-02-11 02:26:04
12996 830.500355 False ... 6 2018-02-11 03:52:31
12997 867.231029 False ... 6 2018-02-11 20:51:31
12998 594.054018 False ... 6 2018-02-11 12:27:52
12999 781.065911 False ... 6 2018-02-11 03:54:56

5 rows 脳 27 columns

[34]:
ed_flights.tail()
[34]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
13054 1080.446279 False ... 6 2018-02-11 20:42:25
13055 646.612941 False ... 6 2018-02-11 01:41:57
13056 997.751876 False ... 6 2018-02-11 04:09:27
13057 1102.814465 False ... 6 2018-02-11 08:28:21
13058 858.144337 False ... 6 2018-02-11 14:54:34

5 rows 脳 27 columns

DataFrame.keys#

[35]:
pd_flights.keys()
[35]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
       'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
       'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
       'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
       'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
       'timestamp'],
      dtype='object')
[36]:
ed_flights.keys()
[36]:
Index(['AvgTicketPrice', 'Cancelled', 'Carrier', 'Dest', 'DestAirportID', 'DestCityName',
       'DestCountry', 'DestLocation', 'DestRegion', 'DestWeather', 'DistanceKilometers',
       'DistanceMiles', 'FlightDelay', 'FlightDelayMin', 'FlightDelayType', 'FlightNum',
       'FlightTimeHour', 'FlightTimeMin', 'Origin', 'OriginAirportID', 'OriginCityName',
       'OriginCountry', 'OriginLocation', 'OriginRegion', 'OriginWeather', 'dayOfWeek',
       'timestamp'],
      dtype='object')

DataFrame.get#

[37]:
pd_flights.get('Carrier')
[37]:
0         Kibana Airlines
1        Logstash Airways
2        Logstash Airways
3         Kibana Airlines
4         Kibana Airlines
               ...
12995            JetBeats
12996    Logstash Airways
12997              ES-Air
12998     Kibana Airlines
12999     Kibana Airlines
Name: Carrier, Length: 12955, dtype: object
[38]:
ed_flights.get('Carrier')
[38]:
0         Kibana Airlines
1        Logstash Airways
2        Logstash Airways
3         Kibana Airlines
4         Kibana Airlines
               ...
13054    Logstash Airways
13055    Logstash Airways
13056    Logstash Airways
13057            JetBeats
13058            JetBeats
Name: Carrier, Length: 13014, dtype: object
[39]:
pd_flights.get(['Carrier', 'Origin'])
[39]:
Carrier Origin
0 Kibana Airlines Frankfurt am Main Airport
1 Logstash Airways Cape Town International Airport
2 Logstash Airways Venice Marco Polo Airport
3 Kibana Airlines Naples International Airport
4 Kibana Airlines Licenciado Benito Juarez International Airport
... ... ...
12995 JetBeats Ottawa Macdonald-Cartier International Airport
12996 Logstash Airways Denver International Airport
12997 ES-Air Charles de Gaulle International Airport
12998 Kibana Airlines Rajiv Gandhi International Airport
12999 Kibana Airlines Abu Dhabi International Airport

12955 rows 脳 2 columns

List input not currently supported by eland.DataFrame.get

[40]:
try:
    ed_flights.get(['Carrier', 'Origin'])
except TypeError as e:
    print(e)
unhashable type: 'list'

DataFrame.query#

[41]:
pd_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled == True')
Warning: Total number of columns (27) exceeds max_columns (20). Falling back to pandas display.
[41]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
8 960.869736 True ... 0 2018-01-01 12:09:35
26 975.812632 True ... 0 2018-01-01 15:38:32
311 946.358410 True ... 0 2018-01-01 11:51:12
651 975.383864 True ... 2 2018-01-03 21:13:17
950 907.836523 True ... 2 2018-01-03 05:14:51
... ... ... ... ... ...
12798 1083.517299 True ... 5 2018-02-10 19:20:10
12820 909.973606 True ... 5 2018-02-10 05:11:35
12906 983.429244 True ... 6 2018-02-11 06:19:58
12918 1136.678150 True ... 6 2018-02-11 16:03:10
12919 1105.211803 True ... 6 2018-02-11 05:36:05

67 rows 脳 27 columns

eland.DataFrame.query requires qualifier on bool i.e.

ed_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled') fails

[42]:
try:
  ed_flights.query('Carrier == "Kibana Airlines" & AvgTicketPrice > 900.0 & Cancelled == True')
except Exception as e:
  print(f"query() failed as expected: {repr(e)}.")
query() failed as expected: TypeError("Cannot interpret '{'terms': {'Carrier': ['Kibana Airlines']}}' as a data type").

Boolean indexing query#

[43]:
pd_flights[(pd_flights.Carrier=="Kibana Airlines") &
           (pd_flights.AvgTicketPrice > 900.0) &
           (pd_flights.Cancelled == True)]
Warning: Total number of columns (27) exceeds max_columns (20). Falling back to pandas display.
[43]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
8 960.869736 True ... 0 2018-01-01 12:09:35
26 975.812632 True ... 0 2018-01-01 15:38:32
311 946.358410 True ... 0 2018-01-01 11:51:12
651 975.383864 True ... 2 2018-01-03 21:13:17
950 907.836523 True ... 2 2018-01-03 05:14:51
... ... ... ... ... ...
12798 1083.517299 True ... 5 2018-02-10 19:20:10
12820 909.973606 True ... 5 2018-02-10 05:11:35
12906 983.429244 True ... 6 2018-02-11 06:19:58
12918 1136.678150 True ... 6 2018-02-11 16:03:10
12919 1105.211803 True ... 6 2018-02-11 05:36:05

67 rows 脳 27 columns

[44]:
ed_flights[(ed_flights.Carrier=="Kibana Airlines") &
           (ed_flights.AvgTicketPrice > 900.0) &
           (ed_flights.Cancelled == True)]
[44]:
AvgTicketPrice Cancelled ... dayOfWeek timestamp
8 960.869736 True ... 0 2018-01-01 12:09:35
26 975.812632 True ... 0 2018-01-01 15:38:32
311 946.358410 True ... 0 2018-01-01 11:51:12
651 975.383864 True ... 2 2018-01-03 21:13:17
950 907.836523 True ... 2 2018-01-03 05:14:51
... ... ... ... ... ...
12820 909.973606 True ... 5 2018-02-10 05:11:35
12906 983.429244 True ... 6 2018-02-11 06:19:58
12918 1136.678150 True ... 6 2018-02-11 16:03:10
12919 1105.211803 True ... 6 2018-02-11 05:36:05
13013 1055.350213 True ... 6 2018-02-11 13:20:16

68 rows 脳 27 columns

Function application, GroupBy & window#

DataFrame.aggs#

[45]:
pd_flights[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
[45]:
DistanceKilometers AvgTicketPrice
sum 9.185214e+07 8.130904e+06
min 0.000000e+00 1.000205e+02
std 4.575501e+03 2.663747e+02

eland.DataFrame.aggregate currently only supported numeric columns

[46]:
ed_flights[['DistanceKilometers', 'AvgTicketPrice']].aggregate(['sum', 'min', 'std'])
[46]:
DistanceKilometers AvgTicketPrice
sum 9.229730e+07 8.177881e+06
min 0.000000e+00 1.000205e+02
std 4.578132e+03 2.664956e+02

Computations / descriptive stats#

DataFrame.count#

[47]:
pd_flights.count()
[47]:
AvgTicketPrice    12955
Cancelled         12955
Carrier           12955
Dest              12955
DestAirportID     12955
                  ...
OriginLocation    12955
OriginRegion      12955
OriginWeather     12955
dayOfWeek         12955
timestamp         12955
Length: 27, dtype: int64
[48]:
ed_flights.count()
[48]:
AvgTicketPrice    13014
Cancelled         13014
Carrier           13014
Dest              13014
DestAirportID     13014
                  ...
OriginLocation    13014
OriginRegion      13014
OriginWeather     13014
dayOfWeek         13014
timestamp         13014
Length: 27, dtype: int64

DataFrame.describe#

[49]:
pd_flights.describe()
[49]:
AvgTicketPrice DistanceKilometers ... FlightTimeMin dayOfWeek
count 12955.000000 12955.000000 ... 12955.000000 12955.000000
mean 627.626743 7090.092034 ... 510.931002 2.820455
std 266.374679 4575.500916 ... 334.378895 1.932079
min 100.020528 0.000000 ... 0.000000 0.000000
25% 409.522973 2458.696332 ... 252.658936 1.000000
50% 640.169085 7610.330866 ... 503.142831 3.000000
75% 841.437465 9733.565957 ... 720.183608 4.000000
max 1199.729053 19881.482315 ... 1902.902032 6.000000

8 rows 脳 7 columns

Values returned from eland.DataFrame.describe may vary due to results of Elasticsearch aggregations.

[50]:
# NBVAL_IGNORE_OUTPUT
ed_flights.describe()
[50]:
AvgTicketPrice Cancelled ... FlightTimeMin dayOfWeek
count 13014.000000 13014.000000 ... 13014.000000 13014.000000
mean 628.391013 0.128323 ... 511.106340 2.834870
std 266.495605 0.334475 ... 334.674968 1.939568
min 100.020531 0.000000 ... 0.000000 0.000000
25% 411.187230 0.000000 ... 250.141903 1.061057
50% 639.776657 0.000000 ... 502.902212 2.932961
75% 842.653771 0.000000 ... 721.186244 4.445456
max 1199.729004 1.000000 ... 1902.901978 6.000000

8 rows 脳 9 columns

DataFrame.info#

[51]:
pd_flights.info()
<class 'pandas.core.frame.DataFrame'>
Index: 12955 entries, 0 to 12999
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   AvgTicketPrice      12955 non-null  float64
 1   Cancelled           12955 non-null  bool
 2   Carrier             12955 non-null  object
 3   Dest                12955 non-null  object
 4   DestAirportID       12955 non-null  object
 5   DestCityName        12955 non-null  object
 6   DestCountry         12955 non-null  object
 7   DestLocation        12955 non-null  object
 8   DestRegion          12955 non-null  object
 9   DestWeather         12955 non-null  object
 10  DistanceKilometers  12955 non-null  float64
 11  DistanceMiles       12955 non-null  float64
 12  FlightDelay         12955 non-null  bool
 13  FlightDelayMin      12955 non-null  int64
 14  FlightDelayType     12955 non-null  object
 15  FlightNum           12955 non-null  object
 16  FlightTimeHour      12955 non-null  float64
 17  FlightTimeMin       12955 non-null  float64
 18  Origin              12955 non-null  object
 19  OriginAirportID     12955 non-null  object
 20  OriginCityName      12955 non-null  object
 21  OriginCountry       12955 non-null  object
 22  OriginLocation      12955 non-null  object
 23  OriginRegion        12955 non-null  object
 24  OriginWeather       12955 non-null  object
 25  dayOfWeek           12955 non-null  int64
 26  timestamp           12955 non-null  datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17)
memory usage: 3.1+ MB
[52]:
ed_flights.info()
<class 'eland.dataframe.DataFrame'>
Index: 13014 entries, 0 to 13058
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   AvgTicketPrice      13014 non-null  float64
 1   Cancelled           13014 non-null  bool
 2   Carrier             13014 non-null  object
 3   Dest                13014 non-null  object
 4   DestAirportID       13014 non-null  object
 5   DestCityName        13014 non-null  object
 6   DestCountry         13014 non-null  object
 7   DestLocation        13014 non-null  object
 8   DestRegion          13014 non-null  object
 9   DestWeather         13014 non-null  object
 10  DistanceKilometers  13014 non-null  float64
 11  DistanceMiles       13014 non-null  float64
 12  FlightDelay         13014 non-null  bool
 13  FlightDelayMin      13014 non-null  int64
 14  FlightDelayType     13014 non-null  object
 15  FlightNum           13014 non-null  object
 16  FlightTimeHour      13014 non-null  float64
 17  FlightTimeMin       13014 non-null  float64
 18  Origin              13014 non-null  object
 19  OriginAirportID     13014 non-null  object
 20  OriginCityName      13014 non-null  object
 21  OriginCountry       13014 non-null  object
 22  OriginLocation      13014 non-null  object
 23  OriginRegion        13014 non-null  object
 24  OriginWeather       13014 non-null  object
 25  dayOfWeek           13014 non-null  int64
 26  timestamp           13014 non-null  datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(5), int64(2), object(17)
memory usage: 64.000 bytes
Elasticsearch storage usage: 11.097 MB

DataFrame.max, DataFrame.min, DataFrame.mean, DataFrame.sum#

max#

[53]:
pd_flights.max(numeric_only=True)
[53]:
AvgTicketPrice         1199.729053
Cancelled                     True
DistanceKilometers    19881.482315
DistanceMiles         12353.780369
FlightDelay                   True
FlightDelayMin                 360
FlightTimeHour           31.715034
FlightTimeMin          1902.902032
dayOfWeek                        6
dtype: object

eland.DataFrame.max,min,mean,sum only aggregate numeric columns

[54]:
ed_flights.max(numeric_only=True)
[54]:
AvgTicketPrice         1199.729004
Cancelled                 1.000000
DistanceKilometers    19881.482422
DistanceMiles         12353.780273
FlightDelay               1.000000
FlightDelayMin          360.000000
FlightTimeHour           31.715034
FlightTimeMin          1902.901978
dayOfWeek                 6.000000
dtype: float64

min#

[55]:
pd_flights.min(numeric_only=True)
[55]:
AvgTicketPrice        100.020528
Cancelled                  False
DistanceKilometers           0.0
DistanceMiles                0.0
FlightDelay                False
FlightDelayMin                 0
FlightTimeHour               0.0
FlightTimeMin                0.0
dayOfWeek                      0
dtype: object
[56]:
ed_flights.min(numeric_only=True)
[56]:
AvgTicketPrice        100.020531
Cancelled               0.000000
DistanceKilometers      0.000000
DistanceMiles           0.000000
FlightDelay             0.000000
FlightDelayMin          0.000000
FlightTimeHour          0.000000
FlightTimeMin           0.000000
dayOfWeek               0.000000
dtype: float64

mean#

[57]:
pd_flights.mean(numeric_only=True)
[57]:
AvgTicketPrice         627.626743
Cancelled                0.128059
DistanceKilometers    7090.092034
DistanceMiles         4405.578940
FlightDelay              0.251023
FlightDelayMin          47.247395
FlightTimeHour           8.515517
FlightTimeMin          510.931002
dayOfWeek                2.820455
dtype: float64
[58]:
ed_flights.mean(numeric_only=True)
[58]:
AvgTicketPrice         628.391013
Cancelled                0.128323
DistanceKilometers    7092.154679
DistanceMiles         4406.860604
FlightDelay              0.251268
FlightDelayMin          47.349009
FlightTimeHour           8.518439
FlightTimeMin          511.106340
dayOfWeek                2.834870
dtype: float64

sum#

[59]:
pd_flights.sum(numeric_only=True)
[59]:
AvgTicketPrice        8.130904e+06
Cancelled             1.659000e+03
DistanceKilometers    9.185214e+07
DistanceMiles         5.707428e+07
FlightDelay           3.252000e+03
FlightDelayMin        6.120900e+05
FlightTimeHour        1.103185e+05
FlightTimeMin         6.619111e+06
dayOfWeek             3.653900e+04
dtype: float64
[60]:
ed_flights.sum(numeric_only=True)
[60]:
AvgTicketPrice        8.177881e+06
Cancelled             1.670000e+03
DistanceKilometers    9.229730e+07
DistanceMiles         5.735088e+07
FlightDelay           3.270000e+03
FlightDelayMin        6.162000e+05
FlightTimeHour        1.108590e+05
FlightTimeMin         6.651538e+06
dayOfWeek             3.689300e+04
dtype: float64

DataFrame.nunique#

[61]:
pd_flights[['Carrier', 'Origin', 'Dest']].nunique()
[61]:
Carrier      4
Origin     156
Dest       156
dtype: int64
[62]:
ed_flights[['Carrier', 'Origin', 'Dest']].nunique()
[62]:
Carrier      4
Origin     156
Dest       156
dtype: int64

DataFrame.drop#

[63]:
pd_flights.drop(columns=['AvgTicketPrice',
                         'Cancelled',
                         'DestLocation',
                         'Dest',
                         'DestAirportID',
                         'DestCityName',
                         'DestCountry'])
[63]:
Carrier DestRegion ... dayOfWeek timestamp
0 Kibana Airlines SE-BD ... 0 2018-01-01 00:00:00
1 Logstash Airways IT-34 ... 0 2018-01-01 18:27:00
2 Logstash Airways IT-34 ... 0 2018-01-01 17:11:14
3 Kibana Airlines IT-34 ... 0 2018-01-01 10:33:28
4 Kibana Airlines SE-BD ... 0 2018-01-01 05:13:00
... ... ... ... ... ...
12995 JetBeats CA-ON ... 6 2018-02-11 02:26:04
12996 Logstash Airways IT-42 ... 6 2018-02-11 03:52:31
12997 ES-Air SE-BD ... 6 2018-02-11 20:51:31
12998 Kibana Airlines SE-BD ... 6 2018-02-11 12:27:52
12999 Kibana Airlines CA-ON ... 6 2018-02-11 03:54:56

12955 rows 脳 20 columns

[64]:
ed_flights.drop(columns=['AvgTicketPrice',
                         'Cancelled',
                         'DestLocation',
                         'Dest',
                         'DestAirportID',
                         'DestCityName',
                         'DestCountry'])
[64]:
Carrier DestRegion ... dayOfWeek timestamp
0 Kibana Airlines SE-BD ... 0 2018-01-01 00:00:00
1 Logstash Airways IT-34 ... 0 2018-01-01 18:27:00
2 Logstash Airways IT-34 ... 0 2018-01-01 17:11:14
3 Kibana Airlines IT-34 ... 0 2018-01-01 10:33:28
4 Kibana Airlines SE-BD ... 0 2018-01-01 05:13:00
... ... ... ... ... ...
13054 Logstash Airways SE-BD ... 6 2018-02-11 20:42:25
13055 Logstash Airways CH-ZH ... 6 2018-02-11 01:41:57
13056 Logstash Airways RU-AMU ... 6 2018-02-11 04:09:27
13057 JetBeats SE-BD ... 6 2018-02-11 08:28:21
13058 JetBeats US-DC ... 6 2018-02-11 14:54:34

13014 rows 脳 20 columns

Plotting#

[65]:
pd_flights.select_dtypes(include=np.number).hist(figsize=[10,10])
plt.show()
../_images/examples_demo_notebook_100_0.png
[66]:
ed_flights.select_dtypes(include=np.number).hist(figsize=[10,10])
plt.show()
../_images/examples_demo_notebook_101_0.png

Elasticsearch utilities#

[67]:
ed_flights2 = ed_flights[(ed_flights.OriginAirportID == 'AMS') & (ed_flights.FlightDelayMin > 60)]
ed_flights2 = ed_flights2[['timestamp', 'OriginAirportID', 'DestAirportID', 'FlightDelayMin']]
ed_flights2 = ed_flights2.tail()
[68]:
print(ed_flights2.es_info())
es_index_pattern: flights
Index:
 es_index_field: _id
 is_source_field: False
Mappings:
 capabilities:
                   es_field_name  is_source es_dtype                  es_date_format        pd_dtype  is_searchable  is_aggregatable  is_scripted aggregatable_es_field_name
timestamp              timestamp       True     date  strict_date_hour_minute_second  datetime64[ns]           True             True        False                  timestamp
OriginAirportID  OriginAirportID       True  keyword                            None          object           True             True        False            OriginAirportID
DestAirportID      DestAirportID       True  keyword                            None          object           True             True        False              DestAirportID
FlightDelayMin    FlightDelayMin       True  integer                            None           int64           True             True        False             FlightDelayMin
Operations:
 tasks: [('boolean_filter': ('boolean_filter': {'bool': {'must': [{'term': {'OriginAirportID': 'AMS'}}, {'range': {'FlightDelayMin': {'gt': 60}}}]}})), ('tail': ('sort_field': '_doc', 'count': 5))]
 size: 5
 sort_params: {'_doc': 'desc'}
 _source: ['timestamp', 'OriginAirportID', 'DestAirportID', 'FlightDelayMin']
 body: {'query': {'bool': {'must': [{'term': {'OriginAirportID': 'AMS'}}, {'range': {'FlightDelayMin': {'gt': 60}}}]}}}
 post_processing: [('sort_index')]