Improve performance of the CSV parser #3338
Comments
Couple of areas that might be fruitful
I hope to spend some time on this in the near future, as I think we could make this significantly faster.
I did some optimization to the CSV parser in the past. An additional suggestion is to avoid materializing to
Disclaimer: I haven't looked at the code. Do we have any parallelism? Could we do a binary search looking for CR characters until we have N equal-sized chunks, then process each chunk in parallel?
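To make the chunking idea above concrete, here is a minimal sketch in Rust. It is not code from arrow-rs or DataFusion: `naive_cut_points` is a hypothetical helper, and it deliberately assumes that every `b'\n'` ends a record, which does not hold for CSV files with quoted fields.

```rust
/// Hypothetical helper (not part of arrow-rs): picks up to `n - 1` cut points,
/// each found by jumping to a multiple of `len / n` and scanning forward to
/// just past the next b'\n'.
///
/// Assumption: every newline terminates a record (no quoted newlines).
fn naive_cut_points(data: &[u8], n: usize) -> Vec<usize> {
    let mut cuts = Vec::new();
    if n < 2 {
        return cuts;
    }
    let step = data.len() / n;
    for k in 1..n {
        let guess = k * step;
        // Scan forward from the guessed offset to the next newline.
        if let Some(pos) = data[guess..].iter().position(|&b| b == b'\n') {
            let cut = guess + pos + 1;
            // Skip duplicate cuts and a cut at the very end of the input.
            if cut < data.len() && cuts.last() != Some(&cut) {
                cuts.push(cut);
            }
        }
    }
    cuts
}
```

Each resulting byte range could then be handed to a separate worker. Note that a cut chosen this way can land inside a quoted field that contains a newline, so this only works for inputs known to be free of them.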
In general arrow-rs delegates responsibility for parallelism to downstream crates such as DataFusion. This keeps things simple and flexible, whilst also not introducing coupling to any particular threading approach. I think one could fairly easily parallelise CsvExec in DataFusion, as we already have the plumbing to split byte streams on newline characters.
It is a bit more complex as you need to account for escaping and quotes, see LineDelimiter.
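As a rough illustration of why quotes matter when splitting, the sketch below tracks whether the scanner is inside a quoted field and only treats newlines outside quotes as record boundaries. This is not the LineDelimiter implementation referenced above; `record_boundaries` and `split_chunks` are hypothetical names, and the code assumes RFC 4180 style quoting (a doubled `""` inside a quoted field, which toggles the state twice) and does not handle backslash escapes.

```rust
/// Hypothetical helper: returns the byte offsets just past each newline that
/// is NOT inside a quoted field.
///
/// Assumption: quotes follow RFC 4180 (`""` escapes a quote); backslash
/// escapes are not handled.
fn record_boundaries(data: &[u8]) -> Vec<usize> {
    let mut in_quotes = false;
    let mut out = Vec::new();
    for (i, &b) in data.iter().enumerate() {
        match b {
            // Each quote toggles the state; a doubled quote toggles twice.
            b'"' => in_quotes = !in_quotes,
            b'\n' if !in_quotes => out.push(i + 1),
            _ => {}
        }
    }
    out
}

/// Hypothetical helper: splits `data` into at most `n` chunks of roughly
/// equal size, cutting only at record boundaries.
fn split_chunks(data: &[u8], n: usize) -> Vec<&[u8]> {
    let boundaries = record_boundaries(data);
    let target = (data.len() / n.max(1)).max(1);
    let mut chunks = Vec::new();
    let mut start = 0;
    for &end in &boundaries {
        // Close the current chunk once it reaches the target size, keeping
        // one slot free for the remainder.
        if end - start >= target && chunks.len() + 1 < n {
            chunks.push(&data[start..end]);
            start = end;
        }
    }
    // Whatever is left (including a final record without a trailing newline)
    // becomes the last chunk.
    if start < data.len() {
        chunks.push(&data[start..]);
    }
    chunks
}
```

The trade-off is that the quote state has to be carried from the start of the input, so finding the boundaries is a sequential pass. Only the parsing of each chunk afterwards can run in parallel.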
Wow, I had no idea that was possible, but: https://en.wikipedia.org/wiki/Comma-separated_values
I learned something new today, ty!
I plan to take a stab at this over the weekend.
Interestingly, switching to ByteRecord appears to make performance worse. This is not hugely surprising when UTF-8 validation is only 4% of the profile, with the majority spent in parsing logic or memory allocation, but I might have expected a slight performance benefit. This does suggest switching to csv-core may yield significant returns, but would be a significantly more involved rework 🤔
Closing as I have done all I plan to on this. There may be further optimisations, but I think we've got most of the low-hanging fruit.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The CSV parser we have can be significantly improved in terms of speed.
Describe the solution you'd like
Describe alternatives you've considered
Additional context