-
-
Notifications
You must be signed in to change notification settings - Fork 719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cross Language Client/Workers #586
Comments
Doing all these in Julia should be fairly straightforward. I also see an MsgPack.jl package https://github.com/kmsquire/MsgPack.jl
Are there code samples showing how to do 1, 2 and 3? |
Some relevant documentation:
I was looking at Julia tasks yesterday and indeed they seem quite nice. I'm a bit jealous :) You could also have workers with a single thread for computation. That's a common mode of operation in Python as well when dealing with GIL-holding Pure Python code. However it is important that the worker process can handle incoming messages while computing though. Presumably Julia's asynchronous task API supports this? It seems like there is some non-trivial interest in this. I'll spend some time scoping out a minimal API for clients and workers. |
With Julia in particular I'm curious on how to have this satisfy other parallel programming protocols in the language. The main thing that Dask brings here is intelligent dynamic task scheduling. We've thought fairly hard about how to do this well. The APIs that Python Dask uses though might not be the right APIs for Julia-Dask. Python-dask generally uses delayed function calls and futures/promises. |
Some general notes that came up in private conversation:
|
Yes, Julia processes can handle asyncrhonous communication while computing, so long as they are executing Julia code. This is done through Julia Tasks. If a lot of time is spent in a C/Fortran library call, say BLAS, then that will starve the julia scheduler. The
|
Here's a simple example using Julia's download() function to download some URLs, which calls
|
To me the larger question will be how Julia users can use these capabilities. That will eventually need a parallel array/dataframe implementation. The good news is that most Julia array code targets AbstractArrays and increasingly, AbstractDataframes are also falling into place. Thus DaskArrays and DaskDataFrames can transparently provide alternate implementations and be leveraged by users - everything will work out under the hood with multiple dispatch. |
I agree that a parallel array/dataframe implementation composes well with this, however I would disagree with the term "need". A surprisingly large amount of Dask's use today has little to do with the dask.array and dask.dataframe implementations and a lot to do with arbitrary task scheduling. Big business definitely likes "Big Data Frames" as a database surrogate on Hadoop, but outside of that there is a lot of use of Dask for arbitrary dynamic task scheduling. This was somewhat unexpected and a large pivot of the project from Dask's original implementation as a parallel array. Here is a slide deck from a recent PyData conference stressing this topic if it's of interest: http://matthewrocklin.com/slides/pydata-dc-2016#/ |
That's interesting. Thanks for the link - just went through it. So what do users typically use Dask parallelism for today? This would be quite interesting personally for me - but not sure if this is the right place to ask that question. Any pointers would help. |
These links have some additional information: http://dask.pydata.org/en/latest/use-cases.html |
This project seems to be communicating to the dask-scheduler from Julia: https://github.com/invenia/DaskDistributedDispatcher.jl |
@mrocklin Yup that's our project :) |
@mrocklin You should come to JuliaCon next week in Berkeley if you are in the region! |
I would definitely swing by if I was in the area. I've moved away from the
bay area though. Next time there is an event on the US East Coast I'll try
to swing by.
…On Sun, Jun 18, 2017 at 4:01 AM, Viral B. Shah ***@***.***> wrote:
@mrocklin <https://github.com/mrocklin> You should come to JuliaCon next
week in Berkeley if you are in the region!
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#586 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AASszDAppN9Jmh1kWfq8OZdfHdtnjgnqks5sFNlIgaJpZM4KXtOj>
.
|
Didn't know scheduler is language agnostic. Will check whether it is possible to create small proof of concept for R. |
I'm very glad to hear it @dselivanov . Please let us know if there is anything that causes frustration on the Dask side. |
Hi @dselivanov - did you get anywhere with this? I had a conversation with someone who said that RStudio put the work in to make R play with Spark. Apparently, they ended up using PySpark to glue R to Spark. I wonder if they'd have the appetite to do the same for Dask - we might find that some of the work has already been done. |
@niallrobinson this is not true - sparkR and sparklyr work without python. Regarding dask for R - I haven't had time to create proof of concept. But I'm pretty sure it is possible. The main challenge is how to communicate over a network in a non-blocking way. |
My hope is that R has some non-blocking concurrent framework. If not then we would have to communicate in separate threads and use queues. |
It looks like https://github.com/HenrikBengtsson/future can be useful. @HenrikBengtsson what is your opinion - you've done quite a lot in this field? |
@mrocklin would be super useful if you can point to the document/scheme on what is needed to implement very minimal "big data 'hello world' " - word count. Something like
Something like above but with technical details will be super useful for the start. |
These links may help people to get started on this:
http://distributed.readthedocs.io/en/latest/journey.html
http://distributed.readthedocs.io/en/latest/protocol.html
https://github.com/invenia/Dispatcher.jl
…On Mon, Apr 16, 2018 at 2:19 AM, Dmitriy Selivanov ***@***.*** > wrote:
@mrocklin <https://github.com/mrocklin> would be super useful if you can
point to the document/scheme on what is needed to implement very minimal
"big data 'hello world' " - word count. Something like
1. start with 2 workers w1, w2 and client c
2. Client c send message to scheduler to read data
3. scheduler sends message to workers w1, w2 to read data and assign
results to keys k1_1, k2_1
4. scheduler send message to workers to make local reduce - local word
counts. Assign results to k1_2, k2_2.
5. scheduler send message to partition results by word (say 2 hash
partitions for each worker) - {k1_3, k1_4}, {k2_3, k2_4}
6. scheduler send message to workers to send results to peers:
- w1 sends k1_3 to w2
- w2 sends k2_4 to w1
7. scheduler asks for local reduce
Something like above but with technical details will be super useful for
the start.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#586 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AASszL8sqK3IUhYu1AgIx6buJFpJ8d4cks5tpDf-gaJpZM4KXtOj>
.
|
The Dask.distributed dynamic task scheduler could be replicated across different languages with low-to-moderate effort. This would require someone to build Client and Worker objects in the other language that communicate to the same Scheduler, which contains most of the logic but is fortunately language agnostic. More specifically, there are three players in a dask.distributed cluster, only two of which would need to be rewritten:
About 90% of the complexity of dask.distributed is in the scheduler. Fortunately the scheduler is also language agnostic, and communicates only using msgpack and long bytestrings. It should be doable to re-implement the Client and Workers in another language like R or Julia if anyone has interest. This would require the following understanding in the other language:
The text was updated successfully, but these errors were encountered: