Feature/add row group serialization #506
Merged
At my company we need to have more granularity regarding serialization and deserialization of row groups.
We use object pools to avoid instantiations, and we multi-thread both the modification of pooled objects and Parquet row group writing using a DoubleBuffer (which uses the pools mentioned above).
This way we have fast and memory-efficient Parquet jobs.
We were using version 3 of this NuGet package, using reflection to access private methods of the ClrBridge class to get fast and memory-efficient serialization. As the serialization API implementation changed a lot, we cannot achieve the same on version 4.
So here is our contribution.
This adds a method to serialize a collection into a single row group.
This adds methods to deserialize a single row group into an existing collection.
This adds methods to deserialize row group by row group using IAsyncEnumerable.
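
To illustrate the row-group-by-row-group pattern with a reused collection, here is a minimal, self-contained sketch. It does not call the library at all: `RowGroupSketch`, `ReadRowGroupsAsync`, and the in-memory "file" are hypothetical names invented for this example, and the actual method names and signatures added by this PR may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for a Parquet file: each inner array plays the role
// of one row group. None of these names come from the library or this PR.
static class RowGroupSketch
{
    // Yields one "row group" at a time, refilling the caller-supplied list
    // instead of allocating a new collection for every group.
    public static async IAsyncEnumerable<IList<T>> ReadRowGroupsAsync<T>(
        IReadOnlyList<T[]> rowGroups, IList<T> reusable)
    {
        foreach (T[] group in rowGroups)
        {
            await Task.Yield(); // placeholder for real asynchronous I/O
            reusable.Clear();
            foreach (T row in group)
            {
                reusable.Add(row);
            }
            yield return reusable;
        }
    }

    public static async Task Main()
    {
        // Two fake row groups of different sizes.
        var file = new List<int[]> { new[] { 1, 2, 3 }, new[] { 4, 5 } };

        // The same list instance is reused for every row group.
        var buffer = new List<int>();

        await foreach (IList<int> rows in ReadRowGroupsAsync(file, buffer))
        {
            Console.WriteLine($"row group with {rows.Count} rows");
        }
    }
}
```

Because the same list is handed back on every iteration, a caller in this sketch has to finish with (or copy) the rows before asking for the next group; that is the trade-off for avoiding per-group allocations.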