Extension to read and ingest Delta Lake tables #15755
Conversation
Web console changes look good.
Some nits but docs otherwise look good!
* something
* test commit
* compilation fix
* more compilation fixes (fixme placeholders)
* Comment out druid-kereberos build since it conflicts with newly added transitive deps from delta-lake Will need to sort out the dependencies later.
* checkpoint
* remove snapshot schema since we can get schema from the row
* iterator bug fix
* json json json
* sampler flow
* empty impls for read(InputStats) and sample()
* conversion?
* conversion, without timestamp
* Web console changes to show Delta Lake
* Asset bug fix and tile load
* Add missing pieces to input source info, etc.
* fix stuff
* Use a different delta lake asset
* Delta lake extension dependencies
* Cleanup
* Add InputSource, module init and helper code to process delta files.
* Test init
* Checkpoint changes
* Test resources and updates
* some fixes
* move to the correct package
* More tests
* Test cleanup
* TODOs
* Test updates
* requirements and javadocs
* Adjust dependencies
* Update readme
* Bump up version
* fixup typo in deps
* forbidden api and checkstyle checks
* Trim down dependencies
* new lines
* Fixup Intellij inspections.
* Add equals() and hashCode()
* chain splits, intellij inspections
* review comments and todo placeholder
* fix up some docs
* null table path and test dependencies. Fixup broken link.
* run prettify
* Different test; fixes
* Upgrade pyspark and delta-spark to latest (3.5.0 and 3.0.0) and regenerate tests
* yank the old test resource.
* add a couple of sad path tests
* Updates to readme based on latest.
* Version support
* Extract Delta DateTime converstions to DeltaTimeUtils class and add test
* More comprehensive split tests.
* Some test renames.
* Cleanup and update instructions.
* add pruneSchema() optimization for table scans.
* Oops, missed the parquet files.
* Update default table and rename schema constants.
* Test setup and misc changes.
* Add class loader logic as the context class loader is unaware about extension classes
* change some table client creation logic.
* Add hadoop-aws, hadoop-common and related exclusions.
* Remove org.apache.hadoop:hadoop-common
* Apply suggestions from code review Co-authored-by: Victoria Lim <[email protected]>
* Add entry to .spelling to fix docs static check

---------

Co-authored-by: abhishekagarwal87 <[email protected]>
Co-authored-by: Laksh Singla <[email protected]>
Co-authored-by: Victoria Lim <[email protected]>
@abhishekrb19 thanks for this. I was just going through this and wondering whether the Delta Lake kernel API supports any partitioning by time that we can leverage during indexing. An ingestion spec generally has an interval, so can that be used to prune the number of files read from Delta Lake during indexing? With my limited understanding of the Delta Lake kernel API, it supports filtering on columns while building a scan. I have not seen many Delta Lake data lakes, but one I have seen had a base directory which hosted the metadata in |
Hi @pjain1, yes, Delta tables can be partitioned by date. Coincidentally, Delta 3.1.0 was released today. In 3.1.0, the kernel supports data skipping in addition to partition pruning using filter predicates, wherever applicable. I will try to find time to add this functionality soon. In its current form, the connector does full table scans, so I think upgrading to 3.1.0 and adding support for filters will be a good next step.
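As a rough sketch of that follow-up: a filter predicate could be supplied while the Kernel scan is built, so only matching files are returned. This is illustrative only — the class and method names reflect my reading of the Delta Kernel 3.x Java API and may not be exact, and the table path and partition column are made up:

```java
import org.apache.hadoop.conf.Configuration;

import io.delta.kernel.Scan;
import io.delta.kernel.Snapshot;
import io.delta.kernel.Table;
import io.delta.kernel.client.TableClient;
import io.delta.kernel.defaults.client.DefaultTableClient;
import io.delta.kernel.expressions.Column;
import io.delta.kernel.expressions.Literal;
import io.delta.kernel.expressions.Predicate;

public class DeltaScanFilterSketch
{
  public static Scan scanWithFilter() throws Exception
  {
    // Hypothetical: read only files whose "date" partition is on or after 2024-01-01.
    TableClient tableClient = DefaultTableClient.create(new Configuration());
    Table table = Table.forPath(tableClient, "/path/to/delta/table");
    Snapshot snapshot = table.getLatestSnapshot(tableClient);

    Predicate filter = new Predicate(
        ">=",
        new Column("date"),
        Literal.ofString("2024-01-01")
    );

    // The kernel can use the filter for partition pruning and, with 3.1.0, data skipping.
    return snapshot.getScanBuilder(tableClient)
        .withFilter(tableClient, filter)
        .build();
  }
}
```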
@abhishekrb19 thanks
Delta Lake is an open source storage layer that brings reliability to data lakes. Users can ingest data stored in a Delta Lake table into Apache Druid via a new input source, `delta`. The input source is provided by the new Delta Lake extension; to use it, add `druid-deltalake-extensions` to the list of loaded extensions.
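For example, loading the extension typically means adding it to `druid.extensions.loadList` in the common runtime properties (illustrative; keep whatever other extensions your deployment already loads):

```properties
# common.runtime.properties
druid.extensions.loadList=["druid-deltalake-extensions"]
```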
The Delta input source reads the configured Delta Lake table and extracts all the underlying delta files in the table's latest snapshot. Delta Lake files are versioned Parquet files.

This is joint work with @abhishekagarwal87, @LakshSingla and @AmatyaAvadhanula.
Example of a DML query using MSQ:
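For illustration, such a query could look roughly like the sketch below. The datasource name, columns, and table path are placeholders, and it assumes the `delta` input source needs only a `tablePath`; the exact `EXTERN` arguments (for example, whether an input format must be supplied) may differ.

```sql
-- Illustrative only: replace the table path, columns, and datasource name with real values.
REPLACE INTO "delta_sample" OVERWRITE ALL
SELECT
  TIME_PARSE("created_at") AS "__time",
  "id",
  "name",
  "value"
FROM TABLE(
  EXTERN(
    '{"type": "delta", "tablePath": "/path/to/delta/table"}',
    '{"type": "parquet"}'
  )
) EXTEND ("created_at" VARCHAR, "id" BIGINT, "name" VARCHAR, "value" DOUBLE)
PARTITIONED BY DAY
```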
Delta `ioConfig` in a native batch spec:
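A minimal sketch of that `ioConfig`, assuming the `delta` input source is configured with just a `tablePath` (the path is a placeholder):

```json
{
  "type": "index_parallel",
  "inputSource": {
    "type": "delta",
    "tablePath": "/path/to/delta/table"
  }
}
```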
Web-console screenshots:
Sampling and ingestion of a mock dataset from the cloud:
Delta 3.1.0 was just released. We can look into upgrading the Delta dependency as a follow-up.
Release note
Added the `druid-deltalake-extensions` extension, which provides a new `delta` input source for ingesting Delta Lake tables.
This PR has: