Extension to read and ingest Delta Lake tables #15755
Conversation
Web console changes look good.
Some nits but docs otherwise look good!
* something
* test commit
* compilation fix
* more compilation fixes (fixme placeholders)
* Comment out druid-kereberos build since it conflicts with newly added transitive deps from delta-lake Will need to sort out the dependencies later.
* checkpoint
* remove snapshot schema since we can get schema from the row
* iterator bug fix
* json json json
* sampler flow
* empty impls for read(InputStats) and sample()
* conversion?
* conversion, without timestamp
* Web console changes to show Delta Lake
* Asset bug fix and tile load
* Add missing pieces to input source info, etc.
* fix stuff
* Use a different delta lake asset
* Delta lake extension dependencies
* Cleanup
* Add InputSource, module init and helper code to process delta files.
* Test init
* Checkpoint changes
* Test resources and updates
* some fixes
* move to the correct package
* More tests
* Test cleanup
* TODOs
* Test updates
* requirements and javadocs
* Adjust dependencies
* Update readme
* Bump up version
* fixup typo in deps
* forbidden api and checkstyle checks
* Trim down dependencies
* new lines
* Fixup Intellij inspections.
* Add equals() and hashCode()
* chain splits, intellij inspections
* review comments and todo placeholder
* fix up some docs
* null table path and test dependencies. Fixup broken link.
* run prettify
* Different test; fixes
* Upgrade pyspark and delta-spark to latest (3.5.0 and 3.0.0) and regenerate tests
* yank the old test resource.
* add a couple of sad path tests
* Updates to readme based on latest.
* Version support
* Extract Delta DateTime converstions to DeltaTimeUtils class and add test
* More comprehensive split tests.
* Some test renames.
* Cleanup and update instructions.
* add pruneSchema() optimization for table scans.
* Oops, missed the parquet files.
* Update default table and rename schema constants.
* Test setup and misc changes.
* Add class loader logic as the context class loader is unaware about extension classes
* change some table client creation logic.
* Add hadoop-aws, hadoop-common and related exclusions.
* Remove org.apache.hadoop:hadoop-common
* Apply suggestions from code review Co-authored-by: Victoria Lim <[email protected]>
* Add entry to .spelling to fix docs static check

---------

Co-authored-by: abhishekagarwal87 <[email protected]>
Co-authored-by: Laksh Singla <[email protected]>
Co-authored-by: Victoria Lim <[email protected]>
@abhishekrb19 thanks for this. I was just going through this and wondering whether the Delta Lake kernel API supports any partitioning by time that we can leverage during indexing. An ingestion spec generally has an interval, so can that be used to prune the number of files read from Delta Lake during indexing? With my limited understanding of the Delta Lake kernel API, it supports filtering on columns while building a scan. I have not seen many Delta Lake data lakes, but one I have seen had a base directory which hosted the metadata in |
Hi @pjain1, yes, Delta tables can be partitioned by date. Coincidentally, Delta 3.1.0 was released today. In 3.1.0, the kernel supports data skipping in addition to partition pruning using filter predicates, wherever applicable. I will try to find time to add this functionality soon. In its current form, the connector does full table scans, so I think upgrading to 3.1.0 and adding support for filters will be a good next step.
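As a rough sketch of that follow-up: a filter predicate could be supplied while the Kernel scan is built, so only matching files are returned. This is illustrative only — the class and method names reflect my reading of the Delta Kernel 3.x Java API and may not be exact, and the table path and partition column are made up:

```java
import org.apache.hadoop.conf.Configuration;

import io.delta.kernel.Scan;
import io.delta.kernel.Snapshot;
import io.delta.kernel.Table;
import io.delta.kernel.client.TableClient;
import io.delta.kernel.defaults.client.DefaultTableClient;
import io.delta.kernel.expressions.Column;
import io.delta.kernel.expressions.Literal;
import io.delta.kernel.expressions.Predicate;

public class DeltaScanFilterSketch
{
  public static Scan scanWithFilter() throws Exception
  {
    // Hypothetical: read only files whose "date" partition is on or after 2024-01-01.
    TableClient tableClient = DefaultTableClient.create(new Configuration());
    Table table = Table.forPath(tableClient, "/path/to/delta/table");
    Snapshot snapshot = table.getLatestSnapshot(tableClient);

    Predicate filter = new Predicate(
        ">=",
        new Column("date"),
        Literal.ofString("2024-01-01")
    );

    // The kernel can use the filter for partition pruning and, with 3.1.0, data skipping.
    return snapshot.getScanBuilder(tableClient)
        .withFilter(tableClient, filter)
        .build();
  }
}
```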
@abhishekrb19 thanks
Delta Lake is an open source storage layer that brings reliability to data lakes. Users can ingest data stored in a Delta Lake table into Apache Druid via a new input source, `delta`. The input source is provided by the new Delta Lake extension; to use it, add `druid-deltalake-extensions` to the list of loaded extensions.
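For example, loading the extension typically means adding it to `druid.extensions.loadList` in the common runtime properties (illustrative; keep whatever other extensions your deployment already loads):

```properties
# common.runtime.properties
druid.extensions.loadList=["druid-deltalake-extensions"]
```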
The Delta input source reads the configured Delta Lake table and extracts all the underlying delta files in the table's latest snapshot. Delta Lake files are versioned Parquet files.

This is joint work with @abhishekagarwal87, @LakshSingla and @AmatyaAvadhanula.
Example of a DML query using MSQ:
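For illustration, such a query could look roughly like the sketch below. The datasource name, columns, and table path are placeholders, and it assumes the `delta` input source needs only a `tablePath`; the exact `EXTERN` arguments (for example, whether an input format must be supplied) may differ.

```sql
-- Illustrative only: replace the table path, columns, and datasource name with real values.
REPLACE INTO "delta_sample" OVERWRITE ALL
SELECT
  TIME_PARSE("created_at") AS "__time",
  "id",
  "name",
  "value"
FROM TABLE(
  EXTERN(
    '{"type": "delta", "tablePath": "/path/to/delta/table"}',
    '{"type": "parquet"}'
  )
) EXTEND ("created_at" VARCHAR, "id" BIGINT, "name" VARCHAR, "value" DOUBLE)
PARTITIONED BY DAY
```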
Delta `ioConfig` in a native batch spec:
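A minimal sketch of that `ioConfig`, assuming the `delta` input source is configured with just a `tablePath` (the path is a placeholder):

```json
{
  "type": "index_parallel",
  "inputSource": {
    "type": "delta",
    "tablePath": "/path/to/delta/table"
  }
}
```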
Web-console screenshots:
Sampling and ingestion of a mock dataset from the cloud:
Delta 3.1.0 was just released. We can look into upgrading the Delta dependency as a follow-up.
Release note
Added the `druid-deltalake-extensions` extension, which provides a new `delta` input source for ingesting Delta Lake tables.
This PR has: