
Improve performance of the CSV parser #3338

Closed
iajoiner opened this issue Dec 13, 2022 · 8 comments
Labels
arrow (Changes to the arrow crate), enhancement (Any new improvement worthy of an entry in the changelog), good first issue (Good for newcomers)

Comments

@iajoiner
Contributor

iajoiner commented Dec 13, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The CSV parser we have can be significantly improved in terms of speed.

@iajoiner iajoiner added the enhancement Any new improvement worthy of an entry in the changelog label Dec 13, 2022
@iajoiner iajoiner changed the title Improve performance of the CSV reader Improve performance of the CSV parser Dec 13, 2022
@tustvold
Contributor

A couple of areas that might be fruitful:

  • The various arrays are parsed using FromIterator, which is normally suboptimal. It is almost always faster to construct the buffers manually, especially for UTF-8 columns, where a pre-pass can determine the buffer sizes up front and avoid bump-allocating
  • Data is parsed as StringRecord; switching to ByteRecord may be faster, would open the door to supporting binary arrays (csv: Support Binary-typed columns in the CSV writer #3292), and would allow using the optimised UTF-8 validation logic (Faster BinaryArray to StringArray conversion (~67%) #3168)
  • Investigate using lexical-core to decode primitives; this was found to be faster in arrow-cast
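The first point above can be sketched without Arrow itself: given the raw field slices for a batch, a pre-pass sums the byte lengths so the values buffer and the offsets buffer can each be allocated exactly once, instead of growing incrementally as a FromIterator-based builder would. This is an illustrative sketch, not an arrow-rs API; the real reader uses Arrow's own Buffer types.

```rust
/// Build the backing buffers for a UTF-8 (string) column from raw field bytes.
/// A pre-pass computes the total payload size so neither Vec ever reallocates.
/// (Illustrative sketch; arrow-rs uses i32 offsets plus its own Buffer types.)
fn build_utf8_buffers(fields: &[&[u8]]) -> (Vec<u8>, Vec<i32>) {
    // Pre-pass: total bytes across all fields.
    let total: usize = fields.iter().map(|f| f.len()).sum();

    let mut values = Vec::with_capacity(total);
    let mut offsets = Vec::with_capacity(fields.len() + 1);
    offsets.push(0i32);

    for f in fields {
        values.extend_from_slice(f);
        offsets.push(values.len() as i32);
    }
    (values, offsets)
}

fn main() {
    let fields: Vec<&[u8]> = vec![b"hello", b"", b"csv"];
    let (values, offsets) = build_utf8_buffers(&fields);
    assert_eq!(values, b"hellocsv".to_vec());
    assert_eq!(offsets, vec![0, 5, 5, 8]);
    println!("values: {} bytes, {} offsets", values.len(), offsets.len());
}
```

The offsets layout (one more entry than there are fields, with field i spanning `offsets[i]..offsets[i + 1]`) matches Arrow's variable-length string encoding.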

I hope to spend some time on this in the near future, as I think we could make this significantly faster.

@Dandandan
Contributor

I did some optimization work on the CSV parser in the past.

An additional suggestion is to avoid materializing into StringRecord or ByteRecord at all, as both allocate (the current code tries to reuse the records across batches). This should be possible by using the lower-level csv-core crate instead of the higher-level csv crate.
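The idea behind skipping StringRecord/ByteRecord can be illustrated with a toy parser: rather than copying each record into an owned container, fields are reported as byte ranges into the input, and the only allocation is a small range buffer the caller reuses. This is a deliberately simplified sketch — the real csv-core crate additionally handles quoting, escaping, and incremental (streaming) input.

```rust
/// Minimal illustration of parsing one CSV record without materializing a
/// StringRecord/ByteRecord: fields are returned as byte ranges into the input.
/// (Toy version: no quoting or escaping; csv-core handles those and streams.)
fn field_ranges(line: &[u8], out: &mut Vec<(usize, usize)>) {
    out.clear(); // reuse the caller's allocation across records
    let mut start = 0;
    for (i, &b) in line.iter().enumerate() {
        if b == b',' {
            out.push((start, i));
            start = i + 1;
        }
    }
    out.push((start, line.len()));
}

fn main() {
    let mut ranges = Vec::new();
    field_ranges(b"1,hello,3.5", &mut ranges);
    assert_eq!(ranges, vec![(0, 1), (2, 7), (8, 11)]);
    // Decode a field directly from the input slice, with no intermediate copy:
    let (s, e) = ranges[1];
    assert_eq!(&b"1,hello,3.5"[s..e], b"hello");
}
```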

@avantgardnerio
Contributor

Disclaimer: I haven't looked at the code. Do we have any parallelism? Could we do a binary search looking for CR characters until we have N equal-sized chunks, then process each chunk in parallel?

@tustvold
Contributor

tustvold commented Dec 13, 2022

In general arrow-rs delegates responsibility for parallelism to downstream crates such as DataFusion. This keeps things simple and flexible, whilst also not introducing coupling to any particular threading approach. I think one could fairly easily parallelise CsvExec in DataFusion, as we already have the plumbing to split byte streams on newline characters.

Could we do a binary search looking for CR characters until we have N equal sized chunks, then process each chunk in parallel

It is a bit more complex than that, as you need to account for escaping and quotes; see LineDelimiter
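The complication can be shown in a few lines: a naive split on newlines breaks on quoted fields that contain line breaks, so the splitter has to track quote state as it scans. This is a simplified sketch of what a LineDelimiter-style helper does, not the arrow-rs implementation.

```rust
/// Find byte offsets of record boundaries, treating newlines inside quoted
/// fields as data rather than delimiters. Simplified sketch: an escaped quote
/// ("") toggles the state twice, which happens to leave it correct.
fn record_boundaries(data: &[u8]) -> Vec<usize> {
    let mut boundaries = Vec::new();
    let mut in_quotes = false;
    for (i, &b) in data.iter().enumerate() {
        match b {
            b'"' => in_quotes = !in_quotes,
            b'\n' if !in_quotes => boundaries.push(i + 1),
            _ => {}
        }
    }
    boundaries
}

fn main() {
    // The embedded newline inside the quoted field is not a record boundary.
    let data = b"a,\"multi\nline\",c\nx,y,z\n";
    assert_eq!(record_boundaries(data), vec![17, 23]);
}
```

A stateless binary search for CR/LF can't work here, because whether a given newline is a delimiter depends on all the quote characters before it.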

@avantgardnerio
Contributor

Wow, I had no idea that was possible, but: https://en.wikipedia.org/wiki/Comma-separated_values

Fields with embedded line breaks must be quoted

I learned something new today, ty!

@andygrove andygrove added the good first issue Good for newcomers label Dec 13, 2022
@tustvold
Contributor

I plan to take a stab at this this weekend

tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 17, 2022
@tustvold
Contributor

tustvold commented Dec 17, 2022

Interestingly, switching to ByteRecord appears to make performance worse. This is not hugely surprising given that UTF-8 validation is only 4% of the profile, with the majority of time spent in parsing logic or memory allocation, but I might have expected a slight performance benefit

This does suggest that switching to csv-core may yield significant returns, but it would be a much more involved rework 🤔

tustvold added a commit that referenced this issue Dec 17, 2022
* Add CSV reader benchmark (#3338)

* Add floats

* Format
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 18, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 18, 2022
tustvold added a commit that referenced this issue Dec 19, 2022
* Add csv-core based reader (#3338)

* More docs
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 19, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 19, 2022
tustvold added a commit that referenced this issue Dec 19, 2022
* Add CSV build_buffered (#3338)

* Doc tweaks
@tustvold tustvold closed this as completed Jan 3, 2023
@tustvold
Contributor

tustvold commented Jan 3, 2023

Closing as I have done all I plan to on this; there may be further optimisations, but I think we've got most of the low-hanging fruit.

@tustvold tustvold added the arrow Changes to the arrow crate label Jan 4, 2023