fix: Update line-splitting logic in batched CSV reader #19508
Conversation
Force-pushed from 199c382 to a06730e
Force-pushed from 8ee589f to a06730e
CodSpeed Performance Report: Merging #19508 will degrade performance by 61.98%.
Benchmarks breakdown
@@ -63,10 +63,7 @@ fn test_streaming_csv() -> PolarsResult<()> {

#[test]
fn test_streaming_glob() -> PolarsResult<()> {
    let q = get_csv_glob();
    let q = q.sort(["sugars_g"], Default::default());
I'm not sure how this was passing before, but there was sort instability for this query
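For context, a minimal sketch of why the test now sorts before comparing: a glob scan reads several files, so the row order of the result is not deterministic until an explicit sort is applied. This is not the test code itself; the path and the function name `stable_glob_result` are illustrative, while the `sort(["sugars_g"], Default::default())` call mirrors the diff above.

```rust
use polars::prelude::*;

// Hedged sketch, not the actual test: make a glob-scan result deterministic
// before asserting on it.
fn stable_glob_result() -> PolarsResult<DataFrame> {
    // Illustrative path; the real test uses the `get_csv_glob` helper.
    let q = LazyCsvReader::new("datasets/foods*.csv").finish()?;
    // Without this sort, the row order depends on which file/chunk finishes
    // first, which is what made the assertion unstable.
    q.sort(["sugars_g"], Default::default()).collect()
}
```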
@@ -2129,7 +2131,7 @@ def test_read_csv_only_loads_selected_columns(
            break
        result += next_batch
    del result
-   assert 8_000_000 < memory_usage_without_pyarrow.get_peak() < 13_000_000
+   assert 8_000_000 < memory_usage_without_pyarrow.get_peak() < 20_000_000
We use up to a 16MB chunk size now, so the previous 13MB peak-memory bound is too tight.
get_file_chunks_iterator(
    &mut self.offsets,
    &mut self.last_offset,
    self.n_chunks,
    self.rows_per_batch * bytes_first_row,
    &mut self.chunk_size,
I've changed the (initial) chunk size to the default value copied from `read_impl` (the in-memory CSV reader), ignoring the user-provided `batch_size` option.
Yes, good one. I benchmarked that one; I think it's a good cache-friendly default.
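To illustrate what "use the in-memory reader's default and ignore the user-provided `batch_size`" means, here is a hypothetical sketch. The struct, field, and constant names are illustrative only and do not come from the polars source; the 16MB figure is the one mentioned in the review thread above.

```rust
// Hypothetical sketch only; names and the constant are illustrative.
const DEFAULT_CHUNK_SIZE: usize = 16 * 1024 * 1024; // the ~16MB figure from the review

struct BatchedCsvState {
    offsets: Vec<(usize, usize)>,
    last_offset: usize,
    n_chunks: usize,
    rows_per_batch: usize,
    chunk_size: usize,
}

impl BatchedCsvState {
    fn new(n_chunks: usize, rows_per_batch: usize) -> Self {
        Self {
            offsets: Vec::new(),
            last_offset: 0,
            n_chunks,
            rows_per_batch,
            // Initial chunk size: the same cache-friendly default as the
            // in-memory reader; the user-provided `batch_size` no longer
            // influences it.
            chunk_size: DEFAULT_CHUNK_SIZE,
        }
    }
}
```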
Force-pushed from 87f9aeb to f10394c
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #19508      +/-   ##
==========================================
- Coverage   79.92%   79.92%   -0.01%
==========================================
  Files        1536     1536
  Lines      211686   211697      +11
  Branches     2445     2445
==========================================
+ Hits       169192   169198       +6
- Misses      41939    41944       +5
  Partials      555      555

☔ View full report in Codecov by Sentry.
I think the CodSpeed report shows that we are executing more instructions, but from local testing the wall times show a slight improvement (before/after timings omitted). Note that the reason this affects the timings of …
Nice one @nameexhaustion.
Updates to use the new `CountLines` instead of the previous `accept_line`-based logic, which had some incorrect edge cases.
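To illustrate why line splitting needs more than a byte search for `\n`, here is a hedged, simplified sketch. It is not the actual polars `CountLines` implementation (escaped quotes and configurable eol characters are deliberately ignored); it only shows the core idea that a batch boundary must fall on a newline outside any quoted field, otherwise a record is cut in half.

```rust
/// Simplified sketch of quote-aware line counting. Returns the number of
/// complete rows in `bytes` and the offset just past the last complete row,
/// so a batch boundary never splits a record.
fn count_complete_rows(bytes: &[u8], quote: u8) -> (usize, usize) {
    let mut in_quotes = false;
    let mut rows = 0;
    let mut end_of_last_row = 0;

    for (i, &b) in bytes.iter().enumerate() {
        if b == quote {
            // A newline inside an open quote is part of the field, not a row
            // terminator; this is the edge case a naive splitter gets wrong.
            in_quotes = !in_quotes;
        } else if b == b'\n' && !in_quotes {
            rows += 1;
            end_of_last_row = i + 1;
        }
    }
    (rows, end_of_last_row)
}

fn main() {
    // The embedded newline inside the quoted field must not count as a split.
    let chunk = b"a,\"multi\nline\",c\nd,e,f\npartial,row";
    let (rows, offset) = count_complete_rows(chunk, b'"');
    assert_eq!(rows, 2);
    assert_eq!(&chunk[offset..], b"partial,row");
    println!("{rows} complete rows, trailing bytes start at {offset}");
}
```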