Feature Request: improve parallel processing #66
This looks related to issue #32.
Needs to be addressed in 2.0. Related reading:
https://felipeelias.github.io/ruby/2017/01/02/fast-file-processing-ruby.html
https://dalibornasevic.com/posts/68-processing-large-csv-files-with-ruby
I recently stumbled upon this gem when processing a ~400MB CSV file. Your gem helped me a lot in speeding up the process, thank you @tilo! However, it left me a bit helpless when it came to parallel processing. When studying the linked examples like https://xjlin0.github.io/tech/2015/05/25/faster-parsing-csv-with-parallel-processing/, I noticed that they assume the file is small enough to be loaded completely into memory. That is neither feasible nor practical in my case.

For actual parallel processing of arbitrarily large files, I suggest some kind of Enumerable implementation on a per-entry or per-chunk basis, along the lines of the sketch below. This would, for example, allow usage with the lambda syntax of the parallel gem, or manual distribution over a worker infrastructure.

EDIT: If you have anything planned or sketched out already, I am happy to help.
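A minimal sketch of what such an interface might look like, assuming SmarterCSV's block form yields each chunk as soon as it is filled (`each_csv_chunk` is a hypothetical helper, not part of the gem):

```ruby
require 'smarter_csv'

# Hypothetical helper: wraps SmarterCSV's block form in a lazy Enumerator,
# so callers can pull chunks on demand instead of loading the whole file.
# Enumerator.new runs its block in a fiber, so SmarterCSV.process is
# suspended after each `yielder << chunk` until the next chunk is requested.
def each_csv_chunk(filename, chunk_size: 1_000)
  Enumerator.new do |yielder|
    SmarterCSV.process(filename, chunk_size: chunk_size) do |chunk|
      yielder << chunk
    end
  end
end

# Usage: only one chunk (an array of row hashes) is in memory at a time.
each_csv_chunk('large.csv').each do |chunk|
  # process the chunk here
end
```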
Earmarked for 2.0.
@xjlin0 commented on May 25
smarter_csv is a great gem! It saves me tons of time through parallel processing.

One possible improvement I am hoping for is to let smarter_csv send out chunks before it has finished reading the entire file. smarter_csv uses readline to read CSV files, smartly avoiding loading the entire file into memory. However, it seems it cannot send chunks out before the whole file has been read.
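For reference, one way to approximate this streaming behavior, under the assumption that SmarterCSV's block form yields chunks while parsing is still in progress: a reader thread feeds a bounded queue, and the parallel gem's lambda syntax pulls chunks off it as workers become free. `import_row` is a hypothetical placeholder for the real per-row work:

```ruby
require 'smarter_csv'
require 'parallel'

queue = SizedQueue.new(10)  # bounded, so the reader can't outrun the workers

# Reader thread: pushes each chunk onto the queue as it is parsed.
# Assumes SmarterCSV.process yields chunks while reading, not only at EOF.
reader = Thread.new do
  SmarterCSV.process('large.csv', chunk_size: 1_000) do |chunk|
    queue << chunk
  end
  queue << Parallel::Stop  # tells Parallel there are no more items
end

# Workers: the lambda is called in the parent process to fetch the next
# chunk; Parallel distributes the chunks across 4 worker processes.
Parallel.each(-> { queue.pop }, in_processes: 4) do |chunk|
  chunk.each { |row| import_row(row) }  # `import_row` is hypothetical
end

reader.join
```

The `SizedQueue` keeps memory bounded: if the workers fall behind, the reader thread blocks instead of buffering the whole file.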