-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add to_parquet_row_groups #2979
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files
|
…cikit-hep/awkward into feat-add-to-parquet-row-groups
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might not need to define the batch_iterator
function. (Is it a copy of what was in Awkward 1.x? This is a different context.) You're already iterating over batches in ak_to_parquet.py lines 408‒435. I haven't looked closely enough at it to point out whether something is superfluous, but it looks like it might be. It's worth focusing on that a bit more.
I wish there was a way to write a test that would check to see that all the batches are not loaded into memory at once. Maybe you could try it on a manual test, outside of CI, with a dataset that is too large to fit in memory at once. Putting (temporary!) print statements in some places would also let you check the temporal order of things.
…ne batch now, changed iter parameter to write_iteratively
…cikit-hep/awkward into feat-add-to-parquet-row-groups
…ne batch now, changed iter parameter to write_iteratively
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good in that it works, but the code is a bit repetitive and the control flow is hard to follow. I'll just make a few edits after this review: don't merge until I'm done.
This refactoring reduces three repetitions of The >>> some_list = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"]
>>> next(some_list)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'list' object is not an iterator >>> iterator = iter(some_list)
>>> next(iterator)
'one'
>>> next(iterator)
'two'
>>> next(iterator)
'three'
>>> for item in iterator:
... print(item)
...
four
five
six
seven
eight
nine
ten but >>> for item in some_list:
... print(item)
...
one
two
three
four
five
six
seven
eight
nine
ten I think this PR is ready for merging, after doing an "Update branch." Look it over and decide if you want to make any counter-edits. If so, ping me again for another review. If not, go ahead and merge it. |
No description provided.