-
Notifications
You must be signed in to change notification settings - Fork 930
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Serializable
ABC for Python
#5139
Conversation
Please update the changelog in order to start CI tests. View the gpuCI docs here. |
This provides a generic interface for handling serialization in Python. Handles serialization with Dask through the `"cuda"` and `"dask"` serializers. Also implements pickle serialization using `__reduce_ex__`. For Python versions with support for pickle's protocol 5, the class also supports out-of-band buffers for more efficient serialization. Subclasses are responsible for implementing `serialize` and `deserialize` to/from a Python `dict` with the `header` and a collection of `frames`. The abstract base class handles all other serialization using these two methods.
01dff21
to
e37b4a1
Compare
|
||
@classmethod | ||
def device_deserialize(cls, header, frames): | ||
for f in frames: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't going to be true with pack/unpack serialization which packs into a single host buffer that stores metadata, and a single device buffer that stores data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that makes sense. Happy to change this as we see fit.
The bigger idea here is that when we make these changes we can now do them in one place. So hopefully that makes pack/unpack and other changes in the future easier :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I think then we can leave this as is for now. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added PR ( #5309 ), which should handle a mixture of host and device frames.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1000 this is great! Edit: hit "Approve" instead of "Comment".
cc @Salonijain27 @hcho3 (who have also been thinking about serialization in the cuML and XGBoost contexts recently 🙂) |
Since `memoryview`s are not generally pickleable, coerce them to NumPy arrays, which are, when handling pickling with older pickle protocols. For pickle protocol 5+, stick with coercing them to `PickleBuffer`s, which pickle knows how to serialize out-of-band.
Just as a note, we should be able to drop |
As `Index` already subclasses `Serializable`, there is no need for `MultiIndex` to subclass `Serializable` as well. Plus it seems to be confusing Python. So drop it.
Serializable
ABC for PythonSerializable
ABC for Python
|
||
# all (de-)serializations are attached to cudf Objects: | ||
# Series/DataFrame/Index/Column/Buffer/etc | ||
serializable_classes = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 one from me as well! I especially like moving away from this list to anything which implements Serializable
cc @pentschev so he is aware as well |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoiding code duplication is always nice, thanks for doing that @jakirkham !
Thanks all! 😄 |
Handling in PR ( rapidsai/cuml#2281 ). |
Opening this as WIP, RFC PR to facilitate discussion. There are a few different serialization efforts. So hopefully we can figure out how this would fit in.
Adds an abstract base class
Serializable
, which pulls together some serialization logic in the cuDF Python codebase in one spot. Subclasses are responsible for implementing aserialize
method, which breaks an object down into aheader
(Pythondict
) andframes
(Buffer
s), and adeserialize
method, which takes theheader
andframes
to produce a new object. In return theSerializable
class provides a method for host serialization, device serialization, and pickling the object (using out-of-band buffers with pickle's protocol 5 when available). Also theSerializable
class is registered with Dask for serialization. So any class that subclassesSerializable
is automatically registered with Dask serialization as well.cc @kkraus14 @shwina @quasiben