read_csv_batched should return an instance of an iterator #13885
Comments
I meant to add: in this case we would be moving the number_of_batches_per_iteration param from the next_batches method into the read_csv_batched method. I appreciate that this would be a breaking change. |
I haven't used this function yet. I'm confused that |
@mcrumiller I believe each batch is a partition by row count. (I'm not fully aware of why it works this way.) There have been a few I had been using
for batches in iter(lambda: reader.next_batches(100), None):
    ... |
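For readers unfamiliar with the idiom in the comment above: the two-argument form of the built-in iter(callable, sentinel) calls the callable repeatedly and stops as soon as it returns the sentinel. A minimal sketch follows, assuming only that next_batches(n) returns a list of up to n batches and None once exhausted. FakeReader is a hypothetical stand-in, not part of polars, so the snippet runs without a CSV file.

```python
class FakeReader:
    """Hypothetical stand-in for a batched reader; not part of polars."""

    def __init__(self, batches):
        self._batches = list(batches)

    def next_batches(self, n):
        # Return up to n batches, or None once nothing is left.
        if not self._batches:
            return None
        head, self._batches = self._batches[:n], self._batches[n:]
        return head


reader = FakeReader(["b0", "b1", "b2", "b3", "b4"])

# iter(callable, sentinel) keeps calling the lambda until it returns None,
# so each loop item is a *list* of up to 2 batches, not a single batch.
chunks = [batches for batches in iter(lambda: reader.next_batches(2), None)]
```

Note that each item yielded this way is a list of batches, which is part of why the thread asks for a plain per-batch iterator instead.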
So each iteration you are grabbing 100 batches at a time? Why not just reduce the size of each batch by a factor of 100? Or does that mean grab the next 100 rows? |
Yeah,
pl.read_csv_batched(batch_size = 50000) As you are saying, I'm not sure why the multiple batches interface exists. |
There must be a reason for this. I tracked its origin to #5212. @ritchie46, what's the rationale for retrieving multiple batches at once? We could implement a Python iterator that uses |
Edit: We could implement We make a Does anybody want to make a PR for this? |
As an aside, in answering this question I noticed that the batch size isn't strictly the row count of each batch. I'm not sure if that's intentional or not, but I'm flagging it. Separately, should there be a batched ndjson reader too? |
Fully support returning an iterator. The method as it stands is very confusing. |
Check this issue, @deanm0000; it is related to your comment. |
Description
read_csv_batched returns an instance of BatchedCsvReader, on which we then need to call the next_batches method. We call it like so:
Would it be possible to instead return an iterator instance (with next and iter methods)? Doing so would allow us to use a for loop.
I believe this to be more concise and intuitive for users of both Rust and Python.
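To make the request concrete, here is a minimal sketch of what such an iterator could look like in Python. BatchIterator and its batches_per_fetch parameter are hypothetical names invented for illustration, not polars API. The sketch assumes only that the wrapped reader has a next_batches(n) method returning a list of batches, or None when the input is exhausted.

```python
from typing import Any, List, Optional


class BatchIterator:
    """Hypothetical wrapper (not a polars class): turns any object with a
    next_batches(n) method into a plain iterator over individual batches."""

    def __init__(self, reader: Any, batches_per_fetch: int = 1) -> None:
        self._reader = reader
        self._n = batches_per_fetch
        self._pending: List[Any] = []  # batches fetched but not yet yielded

    def __iter__(self) -> "BatchIterator":
        return self

    def __next__(self) -> Any:
        if not self._pending:
            fetched: Optional[List[Any]] = self._reader.next_batches(self._n)
            if not fetched:  # None or empty list: the reader is exhausted
                raise StopIteration
            self._pending = list(fetched)
        # Hand out batches one at a time, in the order they were read.
        return self._pending.pop(0)
```

With a real reader this might be used as `for df in BatchIterator(pl.read_csv_batched("data.csv", batch_size=50_000)): ...`, where the file name is a placeholder. Keeping batches_per_fetch on the wrapper mirrors the suggestion in the comments to move that parameter out of next_batches.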