-
-
Notifications
You must be signed in to change notification settings - Fork 155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ParquetReader.RowGroups property #509
Conversation
Noticed that the macOS tests are failing right now, no Apple M1 binary in a certain .jar. I did a super-hacky fix for it:
i.e. add the missing binary to the existing jar. It now appears to work, except on 3.1 where there's a separate issue possible because that only has X64 which isn't being installed? |
the build should be OK now, just merging here from master... |
The build succeeds now. I have no issues accepting this, however you might want to add tests with use cases to guarantee this interface is not forgotten and removed during next refactoring session? Custom .jar can be removed as well. |
I like the .jar hack though ;) But will build latest from official sources later. |
Cool, I've added a simple test for the property and reverted the .jar change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
pusing to pre-release |
Row group readers are created up-front and held in a list, so the
OpenRowGroupReader
method is just a wrapper aroundlist[n]
.There is a note in the reader's
Dispose
method that says that although dispose is not required, it may be in the future so clients so clients have to take on extra ceremony. But the current design is super fast and efficient even with very large numbers of row groups, so it shouldn't be necessary.By clarifying that
Dispose
isn't required, this enables a simpleParquetReader.RowGroups
property that exposes a read-only list of row group readers. This then enables Linq chained expressions to operate on row groups, useReverse
to read from the end,Where
to filter on column stats, andToAsyncEnumerable().SelectManyAwait
to deserialise into a stream of objects. Can do this with multiple sorted parquet sources and then merge them with SortedMergeBy, before batching them back into row groups and directing them to a destination stream. With the right extension methods a whole operation like this could be written declaratively.Rather than causing a breaking change to existing clients I've added an interface that omits
Dispose
but retained it on the class, so theRowGroups
property (which uses the non-disposable interface) makes it clear that disposal is unnecessary.