-
Notifications
You must be signed in to change notification settings - Fork 433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rust engine consume a lot of memory compared to pyarrow #2968
Comments
@djouallah 👋 in the attached notebook, which write_deltalake call is resulting in memory pressure? I'm less familiar with duckdb, but I assume the |
it is an Arrow RecordBatchReader I think |
Rust engine materalizes everything to memory prior to starting the whole process. Pyarrow probably writes batch by batch when you pass a reader |
just for my own understanding, is this something that can be fixed by datafusion ? |
@djouallah there is a PR to address this but the contributor didn't have time to finish it yet: #2289 |
I have been analyzing the performance around this issue, even with today's latest Rust engine Pyarrow engine The root cause of this behavior is, in my opinion, the collection of There is some trickiness around Nothing required from anybody, I just wanted to share the analysis and progress here. 🤔 🏃 |
Yeah, I saw the "pyarrow" deprecation message and swapped to the rust engine for one of my tables. It is only about 2.6 million rows per partition but it is very wide and all text data (200+ columns). The pods were failing due to OOM errors after doing this so I need to use the pyarrow engine and overwrite the partition still. |
I'm returning to this and I have some exciting developments! Datafusion 44 introduced LazyMemoryExec which is the missing piece to make this a lot simpler. My prior attempts at solving this problem were effectively in a similar solution domain as Conceptually I am building from @aersam's great work in #2289 and using channels to safely bring On the Rust side, the I hope to have things cleaned up and ready to rock this weekend |
Environment
Delta-rs version:
0.21.0
Binding:
Environment:
Linux
Bug
switching from pyarrow engine to rust increase memory usage by nearly 3X, the job used to works fine, but now, getting OOM errors.
I added a reproducible example with only 60 input files to demo the issue
https://colab.research.google.com/drive/1fahlV0FgKSAS8sQvRMu47s3bDP1ekLbb#scrollTo=333a177b-f075-412e-8ca1-32d44f8c07eb
The text was updated successfully, but these errors were encountered: