Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DaskR prototype with rpy2 and reticulate #2254

Open
mrocklin opened this issue Sep 17, 2018 · 4 comments
Open

DaskR prototype with rpy2 and reticulate #2254

mrocklin opened this issue Sep 17, 2018 · 4 comments
Labels
discussion Discussing a topic with no specific actions yet enhancement Improve existing functionality or make things work better

Comments

@mrocklin
Copy link
Member

mrocklin commented Sep 17, 2018

@quasiben has been playing with Dask and R with reticulate and rpy2 with the objective of providing Dask's concurrent.futures API to R (from which they could presumably build other systems). This is somewhat tricky because you have both a Python and R session living side-by-side and need to move things between them from time to time (hopefully infrequently) using reticulate or rpy2. There are a variety of ways to do this, I thought I'd lay out the way that makes the most sense to me.

  • From R we need ways to get proxy objects in Python that point to our local objects in R without doing any clever/lossy conversion (like R dataframes to Pandas dataframes). Below I use rpy2 as this proxy object, but anything would do.

  • From within Python we need ways to serialize and deserialize these proxy objects, presumably this involves calling R's serialize function on the R side and then bringing those bytes back to Python, and then doing the reverse with deserialize.

    On the Dask side we need to implement the following interface for rpy2 objects: https://distributed.readthedocs.io/en/latest/serialization.html#dask-serialization-family

    @dask_serialize.register(rpy.whatever_an_object_type_is)
    def serialize_rpy2(obj):
        r_bytes = rpy2.call("serialize", obj)  # call R `serialize` function on pointed-to-object
        py_bytes = rpy2.get_from_R(r_bytes)  # bring bytes back home to python
        header = {}
        frames = [py_bytes]
        return header, frames
    
    @dask_deserialize.register(rpy2.whatever_an_object_type_is)
    def deserialize_rpy2(header, frames):
        [py_bytes] = frames
        r_bytes = rpy2.send_to_R(py_bytes)
        obj = rpy2.call("deserialize", r_bytes)
        return obj
  • Also on the Python side we need to implement the sizeof protocol` to quickly estimate how large an object is in bytes

    from dask.sizeof import sizeof
    @sizeof.register(rpy2.whatever_type_objects_are)
    def sizeof(x) -> int:
        return number_of_bytes
  • We'll need to develop a basic API in R. If we're starting with the concurrent.futures interface (which seems simplest) then this probably means submit, gather, as_completed, wait, and map (is anything else critical?). We should check in with people familiar with R though to learn if there is a more native API that R users would find more familiar.

@mrocklin
Copy link
Member Author

@dhirschfeld
Copy link
Contributor

Maybe arrow/feather will (one day) provide a better way to interop with R.

It seems a fully-functional arrow implementation is still a ways off though:
apache/arrow#2489

@mrocklin
Copy link
Member Author

Arrow would be important here if we wanted to run pandas code on R dataframes or R code on pandas dataframes. That isn't the case here. This issue is restricted to just calling R code on R objects (dataframes and otherwise). I don't think that Arrow is useful in this particular setting.

@mrocklin
Copy link
Member Author

@GenevieveBuckley GenevieveBuckley added enhancement Improve existing functionality or make things work better discussion Discussing a topic with no specific actions yet labels Oct 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion Discussing a topic with no specific actions yet enhancement Improve existing functionality or make things work better
Projects
None yet
Development

No branches or pull requests

3 participants