Add support for parallelizing processing parquet files across workers and nodes. #19400
⚡ Required checks status: All passing 🟢
Groups summary:
🟢 lightning_data: CPU workflow
🟢 mypy
🟢 install
Thank you for your contribution! 💜
LGTM, added two comments
What does this PR do?
The main challenge with parquet files is making sure each worker across all nodes processes the same quantity of data. This PR introduces a lazy slice so that all workers process exactly the same number of rows.
This PR introduces a ParquetReader to distribute processing parquet files across workers and nodes with ease.
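To illustrate the equal-rows idea, here is a minimal sketch of how per-worker slices could be computed. It is not the `ParquetReader` implementation from this PR: the function name `equal_row_slices`, its parameters, and the choice to drop remainder rows are all assumptions made for illustration.

```python
from typing import List, Tuple


def equal_row_slices(
    total_rows: int, num_nodes: int, workers_per_node: int
) -> List[Tuple[int, int]]:
    """Return one (offset, length) pair per global worker, all with the same length.

    Hypothetical helper, not part of the PR. Assumption: rows that do not
    divide evenly across workers are dropped so every worker sees the
    exact same number of rows.
    """
    world_size = num_nodes * workers_per_node
    if world_size <= 0:
        raise ValueError("need at least one worker")
    rows_per_worker = total_rows // world_size  # remainder rows are dropped
    # Global worker rank r owns rows [r * rows_per_worker, (r + 1) * rows_per_worker)
    return [(rank * rows_per_worker, rows_per_worker) for rank in range(world_size)]


# Example: 2 nodes with 4 workers each over 1,000,003 rows
slices = equal_row_slices(total_rows=1_000_003, num_nodes=2, workers_per_node=4)
```

Each `(offset, length)` pair could then be handed to a lazy reader (for example a lazy slice over a scanned parquet file) so that a worker only materializes its own rows.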
Fixes #<issue_number>
PR review
Anyone in the community is welcome to review the PR.
cc @Borda