fix: Update line-splitting logic in batched CSV reader #19508

nameexhaustion · 2024-10-29T08:57:50Z

Updates the batched CSV reader to use the new `CountLines` instead of the previous `accept_line`-based logic, which handled some edge cases incorrectly.
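For readers unfamiliar with why line splitting in a CSV chunk has edge cases, here is a minimal Python sketch. It is not the Polars `CountLines` implementation; the function name `count_complete_lines` and its parameters are hypothetical. It assumes the chunk starts outside a quoted field, and it shows that a newline inside a quoted field must not be treated as a row boundary.

```python
def count_complete_lines(chunk: bytes, quote_char: bytes = b'"', eol_char: bytes = b"\n"):
    """Return (n_lines, end_offset) for the complete lines in `chunk`.

    `end_offset` is the index just past the last newline that lies outside
    a quoted field, or 0 if there is none. Hypothetical helper for
    illustration only; assumes `chunk` begins outside a quoted field.
    """
    quote, eol = quote_char[0], eol_char[0]
    in_quotes = False
    n_lines = 0
    end_offset = 0
    for i, byte in enumerate(chunk):
        if byte == quote:
            # An escaped quote ("") toggles twice, so the state is unchanged.
            in_quotes = not in_quotes
        elif byte == eol and not in_quotes:
            n_lines += 1
            end_offset = i + 1
    return n_lines, end_offset


# The second row contains a newline inside a quoted field,
# and the third row is cut off mid-field.
chunk = b'a,b\n1,"x\ny"\n2,"par'
naive_count = chunk.count(b"\n")        # counts the embedded newline too
n_lines, end_offset = count_complete_lines(chunk)
complete_part = chunk[:end_offset]      # safe to hand to a parser
leftover = chunk[end_offset:]           # must be carried into the next chunk
```

A naive newline count overstates the number of rows here, while the quote-aware pass stops at the last real row boundary and leaves the partial row for the next chunk.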

codspeed-hq · 2024-10-30T05:55:42Z

CodSpeed Performance Report

Merging #19508 will degrade performances by 61.98%

Comparing `nameexhaustion:batched-csv` (f10394c) with `main` (0f64785)

Summary

❌ 10 regressions
✅ 31 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| | Benchmark | `main` | `nameexhaustion:batched-csv` | Change |
|---|---|---|---|---|
| ❌ | `test_pdsh_q1` | 16.4 ms | 24.8 ms | -33.71% |
| ❌ | `test_pdsh_q10` | 6 ms | 8.4 ms | -28.31% |
| ❌ | `test_pdsh_q14` | 2.1 ms | 3.3 ms | -37.33% |
| ❌ | `test_pdsh_q15` | 2.4 ms | 3.2 ms | -25.64% |
| ❌ | `test_pdsh_q22` | 6.1 ms | 16 ms | -61.98% |
| ❌ | `test_pdsh_q3` | 5.7 ms | 7.8 ms | -27.28% |
| ❌ | `test_pdsh_q4` | 4.3 ms | 6.2 ms | -30.11% |
| ❌ | `test_pdsh_q5` | 4.5 ms | 6.3 ms | -28.23% |
| ❌ | `test_pdsh_q6` | 1.9 ms | 3.6 ms | -48.23% |
| ❌ | `test_pdsh_q8` | 5 ms | 7.2 ms | -31.24% |

nameexhaustion · 2024-10-30T06:05:48Z

crates/polars-lazy/src/tests/streaming.rs

@@ -63,10 +63,7 @@ fn test_streaming_csv() -> PolarsResult<()> {

 #[test]
 fn test_streaming_glob() -> PolarsResult<()> {
-    let q = get_csv_glob();
-    let q = q.sort(["sugars_g"], Default::default());


I'm not sure how this was passing before, but this query had sort instability.
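To illustrate the kind of instability meant here, the following is a small Python sketch, not the Polars test itself. The sample rows are made up. It assumes that the rows can reach the sort in a different order, for example when chunk boundaries change, and that the sort key contains ties.

```python
# Made-up rows of (name, sugars_g); two pairs share the same sort key.
rows = [("a", 5), ("b", 3), ("c", 5), ("d", 3)]

# The same data arriving in two different orders, sorted on sugars_g only.
order_1 = sorted(rows, key=lambda r: r[1])
order_2 = sorted(reversed(rows), key=lambda r: r[1])

# Both results are correctly sorted by the key, yet the tied rows differ
# in order, so an exact row-by-row comparison between them would fail.
keys_1 = [r[1] for r in order_1]
keys_2 = [r[1] for r in order_2]

# Adding a tie-breaker (here the name) makes the output independent of
# the arrival order.
stable_1 = sorted(rows, key=lambda r: (r[1], r[0]))
stable_2 = sorted(reversed(rows), key=lambda r: (r[1], r[0]))
```

A test that compares sorted output row by row is only deterministic if the sort key is unique, or if the comparison ignores the order of tied rows.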

nameexhaustion · 2024-10-30T06:14:37Z

py-polars/tests/unit/io/test_csv.py

@@ -2129,7 +2131,7 @@ def test_read_csv_only_loads_selected_columns(
            break
        result += next_batch
    del result
-    assert 8_000_000 < memory_usage_without_pyarrow.get_peak() < 13_000_000
+    assert 8_000_000 < memory_usage_without_pyarrow.get_peak() < 20_000_000


We now use a chunk size of up to 16 MB.

nameexhaustion · 2024-10-30T09:10:25Z

crates/polars-io/src/csv/read/read_impl/batched.rs

                get_file_chunks_iterator(
                    &mut self.offsets,
                    &mut self.last_offset,
                    self.n_chunks,
-                    self.rows_per_batch * bytes_first_row,
+                    &mut self.chunk_size,


I've changed the (initial) chunk size to the default value copied from `read_impl` in the in-memory CSV reader, and the reader now ignores the user-provided `batch_size` option.

ritchie46 · 2024-11-06

Yes, good one. I benchmarked that one; I think it is a good cache-friendly default.
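As a rough illustration of why the chunk size might be passed as a mutable value with an "initial" setting, here is a Python sketch of line-aligned chunking where the chunk size grows when a chunk contains no complete line. This is an assumption about the general technique, not a description of `get_file_chunks_iterator`. The function name, the doubling strategy, and the parameters are hypothetical, and quoting is ignored for brevity.

```python
def iter_line_aligned_chunks(data: bytes, initial_chunk_size: int, max_chunk_size: int):
    """Yield consecutive chunks of `data` that each end on a newline.

    Hypothetical sketch: starts with `initial_chunk_size` and doubles it
    (up to `max_chunk_size`) whenever no newline falls inside the window.
    """
    chunk_size = initial_chunk_size
    start = 0
    while start < len(data):
        end = min(start + chunk_size, len(data))
        if end == len(data):
            # Final chunk: emit whatever is left, even without a newline.
            yield data[start:end]
            return
        cut = data.rfind(b"\n", start, end)
        if cut == -1:
            if chunk_size >= max_chunk_size:
                raise ValueError("no line boundary within max_chunk_size")
            # No complete line fits: grow the window and retry.
            chunk_size = min(chunk_size * 2, max_chunk_size)
            continue
        yield data[start:cut + 1]
        start = cut + 1


data = b"ab\ncdefghij\nk\n"
chunks = list(iter_line_aligned_chunks(data, initial_chunk_size=4, max_chunk_size=16))
```

With a deliberately tiny initial size, the second row does not fit, so the window grows until a row boundary is found. The chunks always concatenate back to the original bytes.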

codecov · 2024-11-05T05:58:04Z

Codecov Report

Attention: Patch coverage is 92.68293% with 3 lines in your changes missing coverage. Please review.

Project coverage is 79.92%. Comparing base (0f64785) to head (f10394c).
Report is 42 commits behind head on main.

Files with missing lines	Patch %	Lines
crates/polars-io/src/csv/read/read_impl/batched.rs	92.68%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #19508      +/-   ##
==========================================
- Coverage   79.92%   79.92%   -0.01%     
==========================================
  Files        1536     1536              
  Lines      211686   211697      +11     
  Branches     2445     2445              
==========================================
+ Hits       169192   169198       +6     
- Misses      41939    41944       +5     
  Partials      555      555

nameexhaustion · 2024-11-05T06:35:27Z

I think the CodSpeed report shows that we are executing more instructions, but in local testing the wall times show a slight improvement:

Before - python .env/data/x.py 2> /dev/null > /dev/null 6.03s user 0.63s system 600% cpu 1.109 total
After - python .env/data/x.py 2> /dev/null > /dev/null 6.01s user 0.59s system 603% cpu 1.092 total

Note that this affects the timings of `test_pdsh_*` because the data preparation phase uses `scan_csv().sink_parquet()`.
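For reference, the local timings quoted above can be compared with a few lines of arithmetic. This sketch only restates the reported numbers. The dictionary layout and the helper name `cpu_utilisation` are illustrative, and a single run on each side is not a rigorous benchmark.

```python
# Values copied from the shell `time` output quoted above (seconds).
before = {"user": 6.03, "system": 0.63, "total": 1.109}
after = {"user": 6.01, "system": 0.59, "total": 1.092}


def cpu_utilisation(t: dict) -> float:
    """CPU time divided by wall time; about 6 means roughly six busy cores."""
    return (t["user"] + t["system"]) / t["total"]


# Relative reduction in wall time, as a fraction of the "before" run.
wall_improvement = (before["total"] - after["total"]) / before["total"]

# Change in total CPU time (user + system) between the two runs.
cpu_time_delta = (after["user"] + after["system"]) - (before["user"] + before["system"])
```

The wall time drops by roughly 1.5%, and total CPU time is slightly lower as well, which is consistent with the comment that the instruction-count regression reported by CodSpeed did not show up as slower wall time locally.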

ritchie46 · 2024-11-06T14:47:54Z

Nice one @nameexhaustion.

nameexhaustion changed the title ~~fix: fix: Fix line-splitting in batched CSV reader~~ fix: Fix line-splitting in batched CSV reader Oct 29, 2024

github-actions bot added the `title needs formatting`, `fix`, `python`, and `rust` labels Oct 29, 2024

nameexhaustion changed the title ~~fix: Fix line-splitting in batched CSV reader~~ fix: Fix line-splitting in batched CSV reader. Oct 29, 2024

nameexhaustion changed the title ~~fix: Fix line-splitting in batched CSV reader.~~ fix: Fix line-splitting in batched CSV reader Oct 29, 2024

github-actions bot removed the title needs formatting label Oct 29, 2024

nameexhaustion force-pushed the batched-csv branch from 199c382 to a06730e Compare October 30, 2024 05:13

nameexhaustion changed the title ~~fix: Fix line-splitting in batched CSV reader~~ fix: Update line-splitting logic in batched CSV reader Oct 30, 2024

nameexhaustion force-pushed the batched-csv branch from 8ee589f to a06730e Compare October 30, 2024 05:37

nameexhaustion commented Oct 30, 2024

c

f10394c

nameexhaustion force-pushed the batched-csv branch from 87f9aeb to f10394c Compare November 5, 2024 05:37

nameexhaustion marked this pull request as ready for review November 5, 2024 06:35

nameexhaustion requested review from ritchie46, c-peters, alexander-beedie, MarcoGorelli, reswqa and orlp as code owners November 5, 2024 06:35

ritchie46 approved these changes Nov 6, 2024

ritchie46 merged commit dc53691 into pola-rs:main Nov 6, 2024
26 checks passed

tylerriccio33 pushed a commit to tylerriccio33/polars that referenced this pull request Nov 8, 2024

fix: Update line-splitting logic in batched CSV reader (pola-rs#19508)

22f8c0b

c-peters added the `accepted` label Nov 11, 2024

c-peters assigned nameexhaustion Nov 11, 2024

nameexhaustion deleted the batched-csv branch November 18, 2024 08:20
