Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Serializable ABC for Python #5139

Merged
merged 21 commits into from
May 20, 2020

Conversation

jakirkham
Copy link
Member

Opening this as WIP, RFC PR to facilitate discussion. There are a few different serialization efforts. So hopefully we can figure out how this would fit in.

Adds an abstract base class Serializable, which pulls together some serialization logic in the cuDF Python codebase in one spot. Subclasses are responsible for implementing a serialize method, which breaks an object down into a header (Python dict) and frames (Buffers), and a deserialize method, which takes the header and frames to produce a new object. In return the Serializable class provides a method for host serialization, device serialization, and pickling the object (using out-of-band buffers with pickle's protocol 5 when available). Also the Serializable class is registered with Dask for serialization. So any class that subclasses Serializable is automatically registered with Dask serialization as well.

cc @kkraus14 @shwina @quasiben

@jakirkham jakirkham requested a review from a team as a code owner May 8, 2020 06:21
@jakirkham jakirkham marked this pull request as draft May 8, 2020 06:21
@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

jakirkham added 4 commits May 7, 2020 23:49
This provides a generic interface for handling serialization in Python.
Handles serialization with Dask through the `"cuda"` and `"dask"`
serializers. Also implements pickle serialization using `__reduce_ex__`.
For Python versions with support for pickle's protocol 5, the class also
supports out-of-band buffers for more efficient serialization.
Subclasses are responsible for implementing `serialize` and
`deserialize` to/from a Python `dict` with the `header` and a collection
of `frames`. The abstract base class handles all other serialization
using these two methods.
@jakirkham jakirkham force-pushed the add_serializable branch from 01dff21 to e37b4a1 Compare May 8, 2020 06:51

@classmethod
def device_deserialize(cls, header, frames):
for f in frames:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't going to be true with pack/unpack serialization which packs into a single host buffer that stores metadata, and a single device buffer that stores data.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that makes sense. Happy to change this as we see fit.

The bigger idea here is that when we make these changes we can now do them in one place. So hopefully that makes pack/unpack and other changes in the future easier :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I think then we can leave this as is for now. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added PR ( #5309 ), which should handle a mixture of host and device frames.

Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1000 this is great! Edit: hit "Approve" instead of "Comment".

@shwina shwina self-requested a review May 8, 2020 10:18
@jakirkham
Copy link
Member Author

cc @Salonijain27 @hcho3 (who have also been thinking about serialization in the cuML and XGBoost contexts recently 🙂)

jakirkham added 2 commits May 8, 2020 13:45
Since `memoryview`s are not generally pickleable, coerce them to NumPy
arrays, which are, when handling pickling with older pickle protocols.
For pickle protocol 5+, stick with coercing them to `PickleBuffer`s,
which pickle knows how to serialize out-of-band.
@jakirkham
Copy link
Member Author

Just as a note, we should be able to drop __reduce__ from CumlArray with this change as pickling would already be handled here.

cc @dantegd @cjnolet (for awareness)

@jakirkham jakirkham changed the title WIP, RFC: Add Serializable Add Serializable ABC for Python May 19, 2020
@jakirkham jakirkham marked this pull request as ready for review May 19, 2020 00:53
@kkraus14 kkraus14 added 2 - In Progress Currently a work in progress Python Affects Python cuDF API. labels May 19, 2020
@jakirkham jakirkham added 4 - Needs cuDF (Python) Reviewer and removed 2 - In Progress Currently a work in progress labels May 19, 2020
@jakirkham jakirkham changed the title [WIP] Add Serializable ABC for Python Add Serializable ABC for Python May 19, 2020
@shwina shwina self-requested a review May 19, 2020 19:29

# all (de-)serializations are attached to cudf Objects:
# Series/DataFrame/Index/Column/Buffer/etc
serializable_classes = (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 one from me as well! I especially like moving away from this list to anything which implements Serializable

@quasiben
Copy link
Member

cc @pentschev so he is aware as well

Copy link
Member

@pentschev pentschev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoiding code duplication is always nice, thanks for doing that @jakirkham !

@jakirkham jakirkham merged commit d9629b1 into rapidsai:branch-0.14 May 20, 2020
@jakirkham jakirkham deleted the add_serializable branch May 20, 2020 01:43
@jakirkham
Copy link
Member Author

Thanks all! 😄

@jakirkham
Copy link
Member Author

Just as a note, we should be able to drop __reduce__ from CumlArray with this change as pickling would already be handled here.

cc @dantegd @cjnolet (for awareness)

Handling in PR ( rapidsai/cuml#2281 ).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 - Needs Review Waiting for reviewer to review or respond Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants