The goal of an eland.DataFrame is to enable users who are familiar with pandas.DataFrame to access, explore and manipulate data that resides in Elasticsearch.
Ideally, all data should reside in Elasticsearch, not in client memory. This restricts the API, but allows access to huge data sets that do not fit into memory, and allows use of powerful Elasticsearch features such as aggregations.
Generally, integrations with [third-party storage systems](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) (SQL, Google BigQuery etc.) involve accessing these systems and reading all external data into an in-core pandas data structure. This also applies to [Apache Arrow](https://arrow.apache.org/docs/python/pandas.html) structures.
Whilst this provides access to data in these systems, for large datasets this can require significant in-core memory, and for systems such as Elasticsearch, bulk export of data can be an inefficient way of exploring the data.
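For example, pandas' SQL integration materializes the entire query result in memory before any analysis can begin. A minimal sketch, using an in-memory SQLite database to stand in for an external store:

```python
import sqlite3

import pandas as pd

# Build a tiny SQLite database to stand in for an external storage system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (status INTEGER, bytes INTEGER)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?)",
    [(200, 512), (404, 0), (200, 1024)],
)
conn.commit()

# read_sql pulls every row into an in-core pandas.DataFrame --
# fine for 3 rows, prohibitive when the table holds billions.
df = pd.read_sql("SELECT * FROM logs", conn)
print(len(df))  # -> 3; all rows now live in client memory
```

The same pattern applies to a bulk export from Elasticsearch: the client pays the full memory cost up front, even if the analysis only needs an aggregate.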
An alternative option is to create an API that proxies pandas.DataFrame-like calls to Elasticsearch queries and operations. This allows the Elasticsearch cluster to perform operations such as aggregations itself, rather than exporting all the data and performing them in-core.
An option would be to replace the pandas.DataFrame backend in-core memory structures with Elasticsearch accessors. This would allow full access to the pandas.DataFrame APIs. However, this has issues:

- Typical manipulation of a pandas.DataFrame creates many derived DataFrame instances. If each instance required its own Elasticsearch index, an expression such as df_a = df['a'] would need indexes for both df and df_a, placing significant load on the cluster.
- Not all pandas.DataFrame APIs map to operations we would want to run against Elasticsearch. In particular, calls such as df.to_dict() export all data from Elasticsearch into memory.
Another option is to create an eland.DataFrame API that mimics appropriate aspects of the pandas.DataFrame API. This resolves some of the issues above:

- df_a = df['a'] can be implemented as a shallow copy of the Elasticsearch accessor, so no additional index is required.
- Only methods that map sensibly to Elasticsearch operations are exposed.
- An explicit export method, eland.DataFrame._to_pandas(), converts the data to an in-core pandas structure only when the user genuinely needs it.
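To illustrate how a mimicking API avoids the per-DataFrame index problem, here is a sketch with hypothetical names (eland's actual internals differ): selecting a column copies only lightweight accessor metadata, so df_a = df['a'] creates no new Elasticsearch index and moves no data.

```python
import copy

class ESAccessor:
    """Metadata describing how to query an index; holds no row data."""
    def __init__(self, index, columns):
        self.index = index
        self.columns = columns

class LazyFrame:
    """DataFrame-like wrapper over an accessor (illustrative only)."""
    def __init__(self, accessor):
        self._accessor = accessor

    def __getitem__(self, column):
        # Shallow copy: same backing index, narrower column list,
        # zero data movement and no new Elasticsearch index.
        acc = copy.copy(self._accessor)
        acc.columns = [column]
        return LazyFrame(acc)

df = LazyFrame(ESAccessor("flights", ["Carrier", "AvgTicketPrice"]))
df_a = df["Carrier"]
# Both frames share the same backing index; no export has occurred.
```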