-
Notifications
You must be signed in to change notification settings - Fork 433
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
docs: delta lake arrow integration page
- Loading branch information
1 parent
733b5ff
commit 17071a6
Showing
2 changed files
with
109 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,108 @@ | ||
# Delta Lake Arrow Integrations | ||
|
||
Delta Lake tables can be exposed as Arrow tables and Arrow datasets, which allows for interoperability with a variety of query engines. | ||
|
||
This page shows you how to convert Delta tables to Arrow data structures and teaches you the difference between Arrow tables and Arrow datasets. | ||
|
||
## Delta Lake to Arrow Dataset | ||
|
||
Delta tables can easily be exposed as Arrow datasets. This makes it easy for any query engine that can read Arrow datasets to read a Delta table. | ||
|
||
Let's take a look at the h2o groupby dataset that contains 9 columns of data. Here are three representative rows of data: | ||
|
||
``` | ||
+-------+-------+--------------+-------+-------+--------+------+------+---------+ | ||
| id1 | id2 | id3 | id4 | id5 | id6 | v1 | v2 | v3 | | ||
|-------+-------+--------------+-------+-------+--------+------+------+---------| | ||
| id016 | id046 | id0000109363 | 88 | 13 | 146094 | 4 | 6 | 18.8377 | | ||
| id039 | id087 | id0000466766 | 14 | 30 | 111330 | 4 | 14 | 46.7973 | | ||
| id047 | id098 | id0000307804 | 85 | 23 | 187639 | 3 | 5 | 47.5773 | | ||
+-------+-------+--------------+-------+-------+--------+------+------+---------+ | ||
``` | ||
|
||
Here's how to expose the Delta table as a PyArrow dataset and run a query with DuckDB: | ||
|
||
```python | ||
import duckdb | ||
from deltalake import DeltaTable | ||
|
||
table = DeltaTable("delta/G1_1e9_1e2_0_0") | ||
dataset = table.to_pyarrow_dataset() | ||
quack = duckdb.arrow(dataset) | ||
quack.filter("id1 = 'id016' and v2 > 10") | ||
``` | ||
|
||
Here's the result: | ||
|
||
``` | ||
┌─────────┬─────────┬──────────────┬───────┬───────┬─────────┬───────┬───────┬───────────┐ | ||
│ id1 │ id2 │ id3 │ id4 │ id5 │ id6 │ v1 │ v2 │ v3 │ | ||
│ varchar │ varchar │ varchar │ int32 │ int32 │ int32 │ int32 │ int32 │ double │ | ||
├─────────┼─────────┼──────────────┼───────┼───────┼─────────┼───────┼───────┼───────────┤ | ||
│ id016 │ id054 │ id0002309114 │ 62 │ 95 │ 7180859 │ 4 │ 13 │ 7.750173 │ | ||
│ id016 │ id044 │ id0003968533 │ 63 │ 98 │ 2356363 │ 4 │ 14 │ 3.942417 │ | ||
│ id016 │ id034 │ id0001082839 │ 58 │ 73 │ 8039808 │ 5 │ 12 │ 76.820135 │ | ||
├─────────┴─────────┴──────────────┴───────┴───────┴─────────┴───────┴───────┴───────────┤ | ||
│ ? rows (>9999 rows, 3 shown) 9 columns │ | ||
└────────────────────────────────────────────────────────────────────────────────────────┘ | ||
``` | ||
|
||
Arrow datasets allow for the predicates to get pushed down to the query engine, so the query is executed quickly. | ||
|
||
## Delta Lake to Arrow Table | ||
|
||
You can also run the same query with DuckDB on an Arrow table: | ||
|
||
```python | ||
quack = duckdb.arrow(table.to_pyarrow_table()) | ||
quack.filter("id1 = 'id016' and v2 > 10") | ||
``` | ||
|
||
This returns the same result, but it runs slower. | ||
|
||
## Difference between Arrow Dataset and Arrow Table | ||
|
||
Arrow Datasets are lazy and allow for full predicate pushdown unlike Arrow tables which are eagerly loaded into memory. | ||
|
||
The previous DuckDB queries were run on a 1 billion row dataset that's roughly 50 GB when stored as an uncompressed CSV file. Here are the runtimes when the data is stored in a Delta table and the queries are executed on a 2021 Macbook M1 with 64 GB of RAM: | ||
|
||
* Arrow table: 17.1 seconds | ||
* Arrow dataset: 0.01 seconds | ||
|
||
The query runs much faster on an Arrow dataset because the predicates can be pushed down to the query engine and lots of data can be skipped. | ||
|
||
Arrow tables are eagerly materialized in memory and don't allow for the same amount of data skipping. | ||
|
||
## Multiple query engines can query Arrow Datasets | ||
|
||
Other query engines like DataFusion can also query Arrow datasets, see the following example: | ||
|
||
```python | ||
from datafusion import SessionContext | ||
|
||
ctx = SessionContext() | ||
ctx.register_dataset("my_dataset", table.to_pyarrow_dataset()) | ||
ctx.sql("select * from my_dataset where v2 > 5") | ||
``` | ||
|
||
Here's the result: | ||
|
||
``` | ||
+-------+-------+--------------+-----+-----+--------+----+----+-----------+ | ||
| id1 | id2 | id3 | id4 | id5 | id6 | v1 | v2 | v3 | | ||
+-------+-------+--------------+-----+-----+--------+----+----+-----------+ | ||
| id082 | id049 | id0000022715 | 97 | 55 | 756924 | 2 | 11 | 74.161136 | | ||
| id053 | id052 | id0000113549 | 19 | 56 | 139048 | 1 | 10 | 95.178444 | | ||
| id090 | id043 | id0000637409 | 94 | 50 | 12448 | 3 | 12 | 60.21896 | | ||
+-------+-------+--------------+-----+-----+--------+----+----+-----------+ | ||
``` | ||
|
||
Any query engine that's capable of reading an Arrow table/dataset can read a Delta table. | ||
|
||
## Conclusion | ||
|
||
Delta tables can easily be exposed as Arrow tables/datasets. | ||
|
||
Therefore any query engine that can read an Arrow table/dataset can also read a Delta table. | ||
|
||
Arrow datasets allow for more predicates to be pushed down to the query engine, so they can perform better performance than Arrow tables. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters