
read_csv_batched should return an instance of an iterator #13885

Open
aburnsy opened this issue Jan 21, 2024 · 10 comments
Labels
A-io-csv Area: reading/writing CSV files enhancement New feature or an improvement of an existing feature

Comments

aburnsy commented Jan 21, 2024

Description

read_csv_batched returns an instance of BatchedCsvReader, on which we then need to call the next_batches method. We call it like so:

reader = pl.read_csv_batched("big_file.csv")
while (batches := reader.next_batches(100)) is not None:
    df = pl.concat(batches)

Would it be possible to instead return an iterator instance (with __next__ and __iter__ methods)? Doing so would allow us to use a for loop.

reader = pl.read_csv_batched("big_file.csv", number_of_batches_per_iteration=100) 
for batches in reader: 
    df = pl.concat(batches)

I believe this to be more concise and intuitive for users of both Rust and Python.
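A minimal sketch of what such an iterator could look like today, as a thin wrapper over the existing next_batches API. FakeBatchedReader is a stand-in stub (ints stand in for DataFrames so the example runs without polars), and batches_iter is a hypothetical helper, not a polars API:

```python
from typing import Iterator, List, Optional


class FakeBatchedReader:
    """Stub mimicking BatchedCsvReader.next_batches for illustration."""

    def __init__(self, total_batches: int) -> None:
        self._remaining = total_batches

    def next_batches(self, n: int) -> Optional[List[int]]:
        # Return up to n "batches" (ints stand in for DataFrames),
        # or None once the input is exhausted.
        if self._remaining == 0:
            return None
        take = min(n, self._remaining)
        self._remaining -= take
        return list(range(take))


def batches_iter(reader, n: int) -> Iterator[List[int]]:
    """Hypothetical helper: yield lists of batches until the reader is drained."""
    while (batches := reader.next_batches(n)) is not None:
        yield batches


# 250 batches consumed 100 at a time -> chunks of 100, 100, 50.
chunk_sizes = [len(chunk) for chunk in batches_iter(FakeBatchedReader(250), 100)]
print(chunk_sizes)  # [100, 100, 50]
```

The same wrapper works unchanged as the body of an __iter__ method on the reader itself, which is essentially what this issue proposes.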

@aburnsy aburnsy added the enhancement New feature or an improvement of an existing feature label Jan 21, 2024
aburnsy (Author) commented Jan 21, 2024

I meant to add, in this case we are pushing the number_of_batches_per_iteration param from the next_batches method into the read_csv_batched method.

I appreciate that this would be a breaking change.

mcrumiller (Contributor) commented Jan 21, 2024

I haven't used this function yet. I'm confused that next_batches returns multiple dataframes. Does each batch process a separate file, or is each batch a partition of the CSV by row count? Why would someone want to batch and also do multiple batches at once?

cmdlineluser (Contributor):

@mcrumiller I believe each batch is a partition by row count. (I'm not fully aware of why it works this way.)

There have been a few read_csv_batched questions on SO and there does seem to be some confusion with how to use it correctly.

I had been using the two-argument iter(callable, sentinel) form:

for batches in iter(lambda: reader.next_batches(100), None):
    ...
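For context, the two-argument form iter(callable, sentinel) calls the callable repeatedly and stops as soon as it returns the sentinel — exactly the shape of the next_batches protocol. A self-contained sketch with a plain list standing in for the reader:

```python
# iter(callable, sentinel): call `callable` repeatedly until it
# returns `sentinel`, then stop -- matching the
# reader.next_batches(100) -> list-or-None protocol.
fake_results = iter([[1, 2], [3, 4], None, [5]])
next_batches = fake_results.__next__  # stands in for lambda: reader.next_batches(100)

collected = list(iter(next_batches, None))
print(collected)  # [[1, 2], [3, 4]]
```

Note that iteration stops at the first None, so the trailing [5] is never yielded — which is fine here, since next_batches only returns None once the file is exhausted.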

mcrumiller (Contributor):

So each iteration you are grabbing 100 batches at a time? Why not just reduce the size of each batch by a factor of 100? Or does that mean grab the next 100 rows?

cmdlineluser (Contributor) commented Jan 21, 2024

Yeah, .next_batches(N) returns a list of N batches, each of which has batch_size rows:

batch_size: int = 50000

pl.read_csv_batched("big_file.csv", batch_size=50000)

As you are saying, I'm not sure why the multiple batches interface exists.
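So, roughly, each next_batches(N) call hands back up to N * batch_size rows, assuming every batch is full (as noted later in the thread, the real reader does not strictly guarantee exact batch sizes):

```python
batch_size = 50_000  # rows per batch, per the batch_size default quoted above
n = 10               # batches requested per next_batches(n) call

# Upper bound on rows consumed per iteration, assuming full batches.
rows_per_iteration = batch_size * n
print(rows_per_iteration)  # 500000
```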

mcrumiller (Contributor):

There must be a reason for this. Tracked its origin to #5212. @ritchie46 what's the rationale for retrieving multiple batches at once?

We could implement a Python iterator that uses reader.next_batches(1), which would let someone cleanly iterate through each batch as per the OP.
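A sketch of that idea. All names here are hypothetical, and StubReader mimics the next_batches protocol so the example runs without polars:

```python
class StubReader:
    """Mimics BatchedCsvReader.next_batches(n) -> list | None."""

    def __init__(self, batches):
        self._batches = list(batches)

    def next_batches(self, n):
        if not self._batches:
            return None
        taken, self._batches = self._batches[:n], self._batches[n:]
        return taken


class BatchIterator:
    """Hypothetical wrapper yielding one batch at a time via next_batches(1)."""

    def __init__(self, reader):
        self._reader = reader

    def __iter__(self):
        return self

    def __next__(self):
        batches = self._reader.next_batches(1)
        if batches is None:
            raise StopIteration
        return batches[0]  # exactly one batch per iteration


print(list(BatchIterator(StubReader(["a", "b", "c"]))))  # ['a', 'b', 'c']
```

The trade-off ritchie46 raises below still applies: pulling one batch at a time can be slower than fetching several batches per call.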

@stinodego stinodego added the A-io-csv Area: reading/writing CSV files label Jan 22, 2024
ritchie46 (Member) commented Jan 22, 2024

Because the iterator trait forces a single batch. This API is more generic. You can always reduce to a single batch per iteration if you want, but it will be slower and, for some readers, single-threaded.

Edit: We could implement Iterator for the BatchedReader. Then a single item contains a set of batches.

We could add a number_of_batches_per_iteration=100 kwarg and use that in the next calls. If the user wants to use the old API, they still can.

Anybody want to make a PR for this?

deanm0000 (Collaborator) commented Jan 22, 2024

As an aside, in answering this question I noticed that the batch size isn't strictly the row count of each batch. I'm not sure whether that's intentional, but just flagging it.

Separately, should there be a batched ndjson reader too?

mimakaev:

Fully support returning an iterator. The method as-is is very confusing.

CarMoreno:

Check this issue, @deanm0000. It is related to your comment.
