
Improve performance of the CSV parser #3338

Closed
iajoiner opened this issue Dec 13, 2022 · 8 comments
Labels
arrow (Changes to the arrow crate), enhancement (Any new improvement worthy of an entry in the changelog), good first issue (Good for newcomers)

Comments

@iajoiner
Contributor

iajoiner commented Dec 13, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The CSV parser we have can be significantly improved in terms of speed.

@iajoiner iajoiner added the enhancement Any new improvement worthy of an entry in the changelog label Dec 13, 2022
@iajoiner iajoiner changed the title Improve performance of the CSV reader Improve performance of the CSV parser Dec 13, 2022
@tustvold
Contributor

A couple of areas that might be fruitful:

  • The various arrays are parsed using FromIterator, which is normally suboptimal. It is almost always faster to construct the buffers manually, especially for UTF-8 columns, where a pre-pass can determine the buffer sizes up front and avoid bump-allocating
  • Data is parsed as StringRecord; switching to ByteRecord may be faster, would open the door to supporting binary arrays (csv: Support Binary-typed columns in the CSV writer #3292), and would allow using the optimised UTF-8 validation logic (Faster BinaryArray to StringArray conversion (~67%) #3168)
  • Investigate using lexical-core to decode primitives; this was found to be faster in arrow-cast
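The first point above can be sketched without Arrow itself: given the raw field slices for a batch, a pre-pass sums the byte lengths so the values buffer and the offsets buffer can each be allocated exactly once, instead of growing incrementally as a FromIterator-based builder would. This is an illustrative sketch, not an arrow-rs API; the real reader uses Arrow's own Buffer types.

```rust
/// Build the backing buffers for a UTF-8 (string) column from raw field bytes.
/// A pre-pass computes the total payload size so neither Vec ever reallocates.
/// (Illustrative sketch; arrow-rs uses i32 offsets plus its own Buffer types.)
fn build_utf8_buffers(fields: &[&[u8]]) -> (Vec<u8>, Vec<i32>) {
    // Pre-pass: total bytes across all fields.
    let total: usize = fields.iter().map(|f| f.len()).sum();

    let mut values = Vec::with_capacity(total);
    let mut offsets = Vec::with_capacity(fields.len() + 1);
    offsets.push(0i32);

    for f in fields {
        values.extend_from_slice(f);
        offsets.push(values.len() as i32);
    }
    (values, offsets)
}

fn main() {
    let fields: Vec<&[u8]> = vec![b"hello", b"", b"csv"];
    let (values, offsets) = build_utf8_buffers(&fields);
    assert_eq!(values, b"hellocsv".to_vec());
    assert_eq!(offsets, vec![0, 5, 5, 8]);
    println!("values: {} bytes, {} offsets", values.len(), offsets.len());
}
```

The offsets layout (one more entry than there are fields, with field i spanning `offsets[i]..offsets[i + 1]`) matches Arrow's variable-length string encoding.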

I hope to spend some time on this in the near future, as I think we could make this significantly faster.

@Dandandan
Contributor

I did some optimization work on the CSV parser in the past.

An additional suggestion is to avoid materializing into StringRecord or ByteRecord at all, as both allocate (the current code tries to reuse the records across batches). This should be possible by using the lower-level csv-core crate instead of the higher-level csv crate.
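The idea behind skipping StringRecord/ByteRecord can be illustrated with a toy parser: rather than copying each record into an owned container, fields are reported as byte ranges into the input, and the only allocation is a small range buffer the caller reuses. This is a deliberately simplified sketch — the real csv-core crate additionally handles quoting, escaping, and incremental (streaming) input.

```rust
/// Minimal illustration of parsing one CSV record without materializing a
/// StringRecord/ByteRecord: fields are returned as byte ranges into the input.
/// (Toy version: no quoting or escaping; csv-core handles those and streams.)
fn field_ranges(line: &[u8], out: &mut Vec<(usize, usize)>) {
    out.clear(); // reuse the caller's allocation across records
    let mut start = 0;
    for (i, &b) in line.iter().enumerate() {
        if b == b',' {
            out.push((start, i));
            start = i + 1;
        }
    }
    out.push((start, line.len()));
}

fn main() {
    let mut ranges = Vec::new();
    field_ranges(b"1,hello,3.5", &mut ranges);
    assert_eq!(ranges, vec![(0, 1), (2, 7), (8, 11)]);
    // Decode a field directly from the input slice, with no intermediate copy:
    let (s, e) = ranges[1];
    assert_eq!(&b"1,hello,3.5"[s..e], b"hello");
}
```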

@avantgardnerio
Contributor

Disclaimer: I haven't looked at the code. Do we have any parallelism? Could we do a binary search looking for CR characters until we have N equal-sized chunks, then process each chunk in parallel?

@tustvold
Contributor

tustvold commented Dec 13, 2022

In general arrow-rs delegates responsibility for parallelism to downstream crates such as DataFusion. This keeps things simple and flexible, whilst also not introducing coupling to any particular threading approach. I think one could fairly easily parallelise CsvExec in DataFusion, as we already have the plumbing to split byte streams on newline characters.

Could we do a binary search looking for CR characters until we have N equal sized chunks, then process each chunk in parallel

It is a bit more complex than that, as you need to account for escaping and quotes; see LineDelimiter
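The complication can be shown in a few lines: a naive split on newlines breaks on quoted fields that contain line breaks, so the splitter has to track quote state as it scans. This is a simplified sketch of what a LineDelimiter-style helper does, not the arrow-rs implementation.

```rust
/// Find byte offsets of record boundaries, treating newlines inside quoted
/// fields as data rather than delimiters. Simplified sketch: an escaped quote
/// ("") toggles the state twice, which happens to leave it correct.
fn record_boundaries(data: &[u8]) -> Vec<usize> {
    let mut boundaries = Vec::new();
    let mut in_quotes = false;
    for (i, &b) in data.iter().enumerate() {
        match b {
            b'"' => in_quotes = !in_quotes,
            b'\n' if !in_quotes => boundaries.push(i + 1),
            _ => {}
        }
    }
    boundaries
}

fn main() {
    // The embedded newline inside the quoted field is not a record boundary.
    let data = b"a,\"multi\nline\",c\nx,y,z\n";
    assert_eq!(record_boundaries(data), vec![17, 23]);
}
```

A stateless binary search for CR/LF can't work here, because whether a given newline is a delimiter depends on all the quote characters before it.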

@avantgardnerio
Contributor

Wow, I had no idea that was possible, but: https://en.wikipedia.org/wiki/Comma-separated_values

Fields with embedded line breaks must be quoted

I learned something new today, ty!

@andygrove andygrove added the good first issue Good for newcomers label Dec 13, 2022
@tustvold
Contributor

I plan to take a stab at this this weekend

tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 17, 2022
@tustvold
Contributor

tustvold commented Dec 17, 2022

Interestingly, switching to ByteRecord appears to make performance worse. This is not hugely surprising given that UTF-8 validation is only 4% of the profile, with the majority of time spent in parsing logic or memory allocation, but I might have expected a slight performance benefit

This does suggest that switching to csv-core may yield significant returns, but it would be a much more involved rework 🤔

tustvold added a commit that referenced this issue Dec 17, 2022
* Add CSV reader benchmark (#3338)

* Add floats

* Format
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 18, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 18, 2022
tustvold added a commit that referenced this issue Dec 19, 2022
* Add csv-core based reader (#3338)

* More docs
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 19, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 19, 2022
tustvold added a commit that referenced this issue Dec 19, 2022
* Add CSV build_buffered (#3338)

* Doc tweaks
@tustvold tustvold closed this as completed Jan 3, 2023
@tustvold
Contributor

tustvold commented Jan 3, 2023

Closing as I have done all I plan to on this; there may be further optimisations, but I think we've got most of the low-hanging fruit.

@tustvold tustvold added the arrow Changes to the arrow crate label Jan 4, 2023