Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add to_parquet_row_groups #2979

Merged
merged 23 commits into from
Jan 31, 2024
Merged

feat: add to_parquet_row_groups #2979

merged 23 commits into from
Jan 31, 2024

Conversation

zbilodea
Copy link
Collaborator

No description provided.

Copy link

codecov bot commented Jan 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (f2a2340) 81.87% compared to head (14087c8) 81.89%.

Additional details and impacted files
Files Coverage Δ
src/awkward/operations/__init__.py 100.00% <100.00%> (ø)
src/awkward/operations/ak_to_parquet.py 62.01% <100.00%> (+7.38%) ⬆️
src/awkward/operations/ak_to_parquet_row_groups.py 100.00% <100.00%> (ø)

@zbilodea zbilodea marked this pull request as draft January 25, 2024 10:23
@zbilodea zbilodea marked this pull request as ready for review January 26, 2024 14:17
@zbilodea zbilodea requested a review from jpivarski January 26, 2024 14:17
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might not need to define the batch_iterator function. (Is it a copy of what was in Awkward 1.x? This is a different context.) You're already iterating over batches in ak_to_parquet.py lines 408‒435. I haven't looked closely enough at it to point out whether something is superfluous, but it looks like it might be. It's worth focusing on that a bit more.

I wish there was a way to write a test that would check to see that all the batches are not loaded into memory at once. Maybe you could try it on a manual test, outside of CI, with a dataset that is too large to fit in memory at once. Putting (temporary!) print statements in some places would also let you check the temporal order of things.

src/awkward/operations/ak_to_parquet.py Outdated Show resolved Hide resolved
src/awkward/operations/ak_to_parquet.py Outdated Show resolved Hide resolved
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good in that it works, but the code is a bit repetitive and the control flow is hard to follow. I'll just make a few edits after this review: don't merge until I'm done.

src/awkward/operations/ak_to_parquet.py Outdated Show resolved Hide resolved
@jpivarski jpivarski marked this pull request as draft January 29, 2024 19:47
@jpivarski jpivarski marked this pull request as ready for review January 29, 2024 20:34
@jpivarski
Copy link
Member

This refactoring reduces three repetitions of layout = ... followed by table = ... into one by defining a temporary function, and it simplifies the while ... try ... except StopIteration logic with a for loop. It also adds guards to prevent users from making some likely mistakes.

The for loop takes advantage of the distinction between iterables (such as lists, array, and iterators) and iterators: when the first items in an iterator have been consumed with next, a for loop over the iterator does not include the already-consumed items—Python does not go back to the beginning.

>>> some_list = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"]
>>> next(some_list)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'list' object is not an iterator
>>> iterator = iter(some_list)
>>> next(iterator)
'one'
>>> next(iterator)
'two'
>>> next(iterator)
'three'
>>> for item in iterator:
...     print(item)
... 
four
five
six
seven
eight
nine
ten

but

>>> for item in some_list:
...     print(item)
... 
one
two
three
four
five
six
seven
eight
nine
ten

I think this PR is ready for merging, after doing an "Update branch." Look it over and decide if you want to make any counter-edits. If so, ping me again for another review. If not, go ahead and merge it.

@zbilodea zbilodea merged commit 6bef932 into main Jan 31, 2024
39 checks passed
@zbilodea zbilodea deleted the feat-add-to-parquet-row-groups branch January 31, 2024 09:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants