feat: Serialize DataFrame/Series using IPC in serde #20266
struct SerializeWrap {
    name: PlSmallStr,
    /// Unit-length series for dispatching to IPC serialize
    unit_series: Series,
For `ScalarColumn` I had the option of either using the `serde::Serialize` impl from `AnyValue`, or converting it to a unit-length `Series` and dispatching to IPC. I chose the IPC option because the `AnyValue` serde impl was missing quite a lot of dtypes, and using IPC also gives more assurance that the serialize behavior is the same as for the `SeriesColumn`.
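As an illustration of the design choice described above, here is a minimal std-only sketch. It is not the PR's code: `ToySeries`, `ToyScalarColumn`, and the byte layout are hypothetical stand-ins for `Series`, `ScalarColumn`, and IPC, and storing the logical length as a prefix is an assumption. The point it shows is that the scalar path builds a unit-length series and reuses the series encoder, so both column kinds share one serialization path.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for a materialized column (not the Polars `Series`).
#[derive(Debug, Clone, PartialEq)]
struct ToySeries {
    name: String,
    values: Vec<i64>,
}

/// Hypothetical stand-in for a scalar column: one value logically repeated `len` times.
#[derive(Debug, Clone, PartialEq)]
struct ToyScalarColumn {
    name: String,
    value: i64,
    len: usize,
}

/// Toy series encoding (stands in for IPC): name length, name bytes, value count, values.
fn series_to_bytes(s: &ToySeries) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(s.name.len() as u64).to_le_bytes());
    out.extend_from_slice(s.name.as_bytes());
    out.extend_from_slice(&(s.values.len() as u64).to_le_bytes());
    for v in &s.values {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

/// Reads 8 bytes at `pos` and advances it; returns `None` if the input is too short.
fn read_8(bytes: &[u8], pos: &mut usize) -> Option<[u8; 8]> {
    let end = (*pos).checked_add(8)?;
    let chunk = <[u8; 8]>::try_from(bytes.get(*pos..end)?).ok()?;
    *pos = end;
    Some(chunk)
}

/// Inverse of `series_to_bytes`, starting at `pos`.
fn series_from_bytes(bytes: &[u8], pos: &mut usize) -> Option<ToySeries> {
    let name_len = u64::from_le_bytes(read_8(bytes, pos)?) as usize;
    let name_end = (*pos).checked_add(name_len)?;
    let name = String::from_utf8(bytes.get(*pos..name_end)?.to_vec()).ok()?;
    *pos = name_end;
    let count = u64::from_le_bytes(read_8(bytes, pos)?) as usize;
    let mut values = Vec::new();
    for _ in 0..count {
        values.push(i64::from_le_bytes(read_8(bytes, pos)?));
    }
    Some(ToySeries { name, values })
}

/// Scalar path: write the logical length, then dispatch to the series encoder
/// with a unit-length series (assumed layout, for illustration only).
fn scalar_to_bytes(c: &ToyScalarColumn) -> Vec<u8> {
    let unit = ToySeries { name: c.name.clone(), values: vec![c.value] };
    let mut out = (c.len as u64).to_le_bytes().to_vec();
    out.extend(series_to_bytes(&unit));
    out
}

/// Inverse of `scalar_to_bytes`; rejects payloads whose series is not unit-length.
fn scalar_from_bytes(bytes: &[u8]) -> Option<ToyScalarColumn> {
    let mut pos = 0;
    let len = u64::from_le_bytes(read_8(bytes, &mut pos)?) as usize;
    let unit = series_from_bytes(bytes, &mut pos)?;
    if unit.values.len() != 1 {
        return None;
    }
    Some(ToyScalarColumn { name: unit.name, value: unit.values[0], len })
}
```

Because the scalar payload is the series encoding of a one-element series, any dtype the series encoder supports is supported by the scalar path too, which is the assurance the comment is after.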
#[cfg(feature = "serde")]
#[test]
fn test_deserialize_height_validation_8751() {
The test was moved from Python: the existing test compared against an exact string representation of the previous JSON format, which changes with this PR.
Our docs:
We don't guarantee serialized data format. :)
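The body of the moved test is not shown here. Judging only from its name, it guards against deserialized frames whose columns have different heights. A minimal std-only sketch of that kind of check follows; `ToyFrame`, `FrameError`, and `frame_from_columns` are hypothetical names, not Polars APIs.

```rust
/// Hypothetical stand-in for a frame: named columns that must share one height.
#[derive(Debug, PartialEq)]
struct ToyFrame {
    columns: Vec<(String, Vec<i64>)>,
}

/// Error for a column whose height differs from the first column's.
#[derive(Debug, PartialEq)]
enum FrameError {
    HeightMismatch { column: String, expected: usize, found: usize },
}

/// Builds a frame from already-decoded columns, rejecting mismatched heights.
fn frame_from_columns(columns: Vec<(String, Vec<i64>)>) -> Result<ToyFrame, FrameError> {
    if let Some((_, first)) = columns.first() {
        let expected = first.len();
        for (name, values) in &columns {
            if values.len() != expected {
                return Err(FrameError::HeightMismatch {
                    column: name.clone(),
                    expected,
                    found: values.len(),
                });
            }
        }
    }
    Ok(ToyFrame { columns })
}
```

A test written against behavior like this, rather than against an exact serialized string, keeps passing when the serialized format changes.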
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #20266 +/- ##
==========================================
- Coverage 79.60% 79.49% -0.12%
==========================================
Files 1567 1569 +2
Lines 218528 218634 +106
Branches 2462 2462
==========================================
- Hits 173969 173801 -168
- Misses 43992 44265 +273
- Partials 567 568 +1
We currently use custom serialization code. This PR streamlines the implementation to use IPC instead, which supports more types and can also be faster in some cases.
I've also swapped the serialization from `ciborium` to `bincode`, mainly due to an issue deserializing large byte data (enarx/ciborium#96), but `bincode` also appears to do better in benchmarks and has more downloads.
`serialize(binary)` benchmarks:
Note:
There may be some degradations for `serialize(format="json")`. This tradeoff is acceptable since it's deprecated functionality.
Exploratory benchmark testing
`yellow_tripdata_2015-01_head1M.csv` (1M rows)
Photo
Results CSV
bench_serialize_processed_results.csv
Benchmark code
Fixes #17211