Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Update line-splitting logic in batched CSV reader #19508

Merged
merged 1 commit into from
Nov 6, 2024

Conversation

nameexhaustion
Copy link
Collaborator

@nameexhaustion nameexhaustion commented Oct 29, 2024

Updates to use the new CountLines instead of the previous accept_line based logic, which had some incorrect edge cases.

@nameexhaustion nameexhaustion changed the title fix: fix: Fix line-splitting in batched CSV reader fix: Fix line-splitting in batched CSV reader Oct 29, 2024
@github-actions github-actions bot added title needs formatting fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Oct 29, 2024
@nameexhaustion nameexhaustion changed the title fix: Fix line-splitting in batched CSV reader fix: Fix line-splitting in batched CSV reader. Oct 29, 2024
@nameexhaustion nameexhaustion changed the title fix: Fix line-splitting in batched CSV reader. fix: Fix line-splitting in batched CSV reader Oct 29, 2024
@nameexhaustion nameexhaustion changed the title fix: Fix line-splitting in batched CSV reader fix: Update line-splitting logic in batched CSV reader Oct 30, 2024
Copy link

codspeed-hq bot commented Oct 30, 2024

CodSpeed Performance Report

Merging #19508 will degrade performances by 61.98%

Comparing nameexhaustion:batched-csv (f10394c) with main (0f64785)

Summary

❌ 10 regressions
✅ 31 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main nameexhaustion:batched-csv Change
test_pdsh_q1 16.4 ms 24.8 ms -33.71%
test_pdsh_q10 6 ms 8.4 ms -28.31%
test_pdsh_q14 2.1 ms 3.3 ms -37.33%
test_pdsh_q15 2.4 ms 3.2 ms -25.64%
test_pdsh_q22 6.1 ms 16 ms -61.98%
test_pdsh_q3 5.7 ms 7.8 ms -27.28%
test_pdsh_q4 4.3 ms 6.2 ms -30.11%
test_pdsh_q5 4.5 ms 6.3 ms -28.23%
test_pdsh_q6 1.9 ms 3.6 ms -48.23%
test_pdsh_q8 5 ms 7.2 ms -31.24%

@@ -63,10 +63,7 @@ fn test_streaming_csv() -> PolarsResult<()> {

#[test]
fn test_streaming_glob() -> PolarsResult<()> {
let q = get_csv_glob();
let q = q.sort(["sugars_g"], Default::default());
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how this was passing before, but there was sort instability for this query

@@ -2129,7 +2131,7 @@ def test_read_csv_only_loads_selected_columns(
break
result += next_batch
del result
assert 8_000_000 < memory_usage_without_pyarrow.get_peak() < 13_000_000
assert 8_000_000 < memory_usage_without_pyarrow.get_peak() < 20_000_000
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use up to a 16MB chunk size now

get_file_chunks_iterator(
&mut self.offsets,
&mut self.last_offset,
self.n_chunks,
self.rows_per_batch * bytes_first_row,
&mut self.chunk_size,
Copy link
Collaborator Author

@nameexhaustion nameexhaustion Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the (initial) chunk size to the default value copied from read_impl from the in-memory CSV reader, and ignoring the user-provided batch_size option.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good one. I benchmarked that one. I think that one is a good cache friendly default.

Copy link

codecov bot commented Nov 5, 2024

Codecov Report

Attention: Patch coverage is 92.68293% with 3 lines in your changes missing coverage. Please review.

Project coverage is 79.92%. Comparing base (0f64785) to head (f10394c).
Report is 42 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-io/src/csv/read/read_impl/batched.rs 92.68% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #19508      +/-   ##
==========================================
- Coverage   79.92%   79.92%   -0.01%     
==========================================
  Files        1536     1536              
  Lines      211686   211697      +11     
  Branches     2445     2445              
==========================================
+ Hits       169192   169198       +6     
- Misses      41939    41944       +5     
  Partials      555      555              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@nameexhaustion
Copy link
Collaborator Author

nameexhaustion commented Nov 5, 2024

I think the Codspeed report shows that we are executing more instructions - but from local testing the wall times show a slight improvement -

Before - python .env/data/x.py 2> /dev/null > /dev/null 6.03s user 0.63s system 600% cpu 1.109 total
After - python .env/data/x.py 2> /dev/null > /dev/null 6.01s user 0.59s system 603% cpu 1.092 total

Note that the reason this affects the timings of test_pdsh_* is because the data preparation phase is using scan_csv().sink_parquet()

@ritchie46
Copy link
Member

Nice one @nameexhaustion.

@ritchie46 ritchie46 merged commit dc53691 into pola-rs:main Nov 6, 2024
26 checks passed
tylerriccio33 pushed a commit to tylerriccio33/polars that referenced this pull request Nov 8, 2024
@c-peters c-peters added the accepted Ready for implementation label Nov 11, 2024
@nameexhaustion nameexhaustion deleted the batched-csv branch November 18, 2024 08:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation fix Bug fix python Related to Python Polars rust Related to Rust Polars
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants