Improve Dask serializers, registration #1948
Comments
From that cuDF PR at the end, I know there's a lot going on there, but I think only 2 pieces are of particular note (the rest is cleanup thanks to those two pieces). The first is an ABC called …
@jakirkham thanks for helping to break it down and explain! -- the role of the different pieces was not very transparent to me :-)
Looking at:

```python
def __reduce__(self):
    # Unpickling calls NDArrayITKBase(<plain ndarray>) with the array's data.
    return NDArrayITKBase, (np.asarray(self),)
```

Though I am not sure how to capture …
The pickling side of the story seems to be largely addressed through PR (#2829) if I'm not mistaken. Dask serialization builds on that a bit, but just supporting pickle should be pretty good there (unless you see a need for additional performance gains). The remaining question here would be handling GPU data without device-to-host/host-to-device copies (as those can be quite costly). Looking at how Dask serializes/deserializes CuPy arrays may be a good starting point.
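For reference, a minimal sketch of what pickle protocol 5 buys on the NumPy-backed side (assuming Python 3.8+ and ITK's NumPy bridge helpers `itk.image_from_array` / `itk.array_view_from_image`; the image size and dtype are just for illustration):

```python
import pickle

import numpy as np
import itk

# Round-trip a NumPy view of an ITK image with pickle protocol 5.
# Out-of-band buffers let pickle (and Dask's pickle path) move the pixel
# data as separate frames instead of copying it into the pickle stream.
image = itk.image_from_array(np.random.rand(256, 256).astype(np.float32))
view = itk.array_view_from_image(image)  # NumPy view on the ITK pixel buffer

buffers = []
payload = pickle.dumps(np.asarray(view), protocol=5,
                       buffer_callback=buffers.append)
restored = pickle.loads(payload, buffers=buffers)

assert restored.shape == view.shape
assert buffers, "expected the pixel buffer to be exported out-of-band"
```

Dask's pickle protocol 5 support ( dask/distributed#3784 ) can then pass those buffers along as frames, which is where the parity with custom serializers is expected to come from.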
As suggested by @jakirkham in #1942, we can improve the registration of Dask serialization and deserialization functions, including making them available in `distributed`. This could be done with `NDArrayITKBase`, but we can also have optimized serializers for common `itk.DataObject`'s.

As @jakirkham noted:

1. If one communicates objects that Dask already knows how to work with (like NumPy arrays), then one will get optimized serialization for free. This may involve a little (or maybe no) code in ITK to make sure NumPy arrays are handled reasonably well and handed to Dask when functions return. The benefit here seems useful even outside the Dask context.
2. If one adds pickle protocol 5 support to objects (possibly extending to older Pythons with `pickle5`, or leveraging NumPy to do this indirectly, see numpy/numpy#12091), then one can leverage Dask's own support for pickle protocol 5 ( dask/distributed#3784 , dask/distributed#3849 ), which should yield a performance boost similar to Dask's own custom serialization (as the underlying principles are the same). This would require merely updating how pickling works in ITK. It will also work with anything else that supports pickle protocol 5.
3. Alternatively, one can use Dask's custom serialization by following this example on how to extend to custom types. The usual strategy here is to put these registration bits in their own module. This will soften the Distributed dependency and provide a single point for registering these functions. The latter is useful as one needs to `import` this module in `distributed` for it to work. This is doable with a bit of work in ITK and Distributed, but not too difficult.

Any of these options seems fine. Possibly it's worth doing multiple or even all of them (this is what we did in RAPIDS, rapidsai/cudf#5139 ). Just trying to show what's possible and what it would entail to do 🙂
Originally posted by @jakirkham in #1942 (comment)
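To make option 1 above concrete, here is a minimal sketch of keeping the task boundary in NumPy terms so that Dask's built-in NumPy serialization applies; `itk.median_image_filter` and its `radius` keyword are assumed from ITK 5's snake_case functional API, and the image size is arbitrary. (Option 2 is essentially what the protocol 5 snippet in the comment above demonstrates.)

```python
import numpy as np
import itk
from dask.distributed import Client


def smooth(arr: np.ndarray) -> np.ndarray:
    """Run an ITK filter, but return a plain NumPy array so Dask can use its
    optimized NumPy serialization for the result."""
    image = itk.image_from_array(arr)
    smoothed = itk.median_image_filter(image, radius=2)
    # Copy back to a plain ndarray before the result crosses the worker boundary.
    return itk.array_from_image(smoothed)


if __name__ == "__main__":
    client = Client(processes=False)  # small local cluster, just for illustration
    data = np.random.rand(128, 128).astype(np.float32)
    result = client.submit(smooth, data).result()
    print(result.shape, result.dtype)
```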
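For option 3, a sketch of what the registration module could look like, following the `dask_serialize` / `dask_deserialize` pattern from Distributed's custom-serialization docs. The module name, the import location of `NDArrayITKBase`, and the header layout are assumptions; a real serializer for `itk.Image` would also need to carry spacing/origin/direction metadata:

```python
# itk_dask_serialize.py -- hypothetical registration module.  Importing it on
# the client and on every worker registers these functions with Distributed.
import numpy as np

from itk.support.extras import NDArrayITKBase  # import location assumed
from distributed.protocol import dask_serialize, dask_deserialize


@dask_serialize.register(NDArrayITKBase)
def serialize_ndarray_itk_base(x):
    # Describe the array in a small header and ship the raw bytes as a frame,
    # so Distributed can move the pixel data without an extra copy.
    header = {"shape": x.shape, "dtype": x.dtype.str}
    frames = [np.ascontiguousarray(x).data]
    return header, frames


@dask_deserialize.register(NDArrayITKBase)
def deserialize_ndarray_itk_base(header, frames):
    data = np.frombuffer(frames[0], dtype=header["dtype"])
    return NDArrayITKBase(data.reshape(header["shape"]))
```

As far as I understand, a matching `cuda_serialize` / `cuda_deserialize` registration is how the RAPIDS libraries keep device data on the GPU, which would be the starting point for the device-to-device question above.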