Workflow ideas #725
@guillaumeeb we're trying to add some real-world examples to our benchmark suite (as opposed to the more toy examples there today) that are reflective of common Dask workloads. We're looking for examples roughly like the following; ideally we'd find examples that are between 20 and 200 lines of code in terms of complexity. Looking at the list above, can you think of good examples that you've run across while engaging with users?
I suspect that @ncclementi already has a notebook for RenRe. Naty, can you point to that if you still have it? Did we get clearance from them to use it in a public setting?
@mrocklin JD (RenRe) and I collaborated on creating a synthetic dataset that represents the original one. The synthetic data is not public, but it's on the oss-s3; I can make it public if we want. The repo showing how to create the data, along with a replication of their workflow (an imbalanced join), is at https://github.com/coiled/imbalanced-join. I'm happy to chat with whoever takes the lead on this to bring them up to speed and facilitate whatever they need.
My recollection is that the original RenRe workflow was more than just this one join; it was lots of things. Do we still have that? Is it possible to make that public?
@mrocklin we do have those in a private repo that has multiple things (happy to walk you through what's in there; I'm available this afternoon). When I talked to JD, they mentioned their main issue was the joins shown in the notebook.
Can you point to the repository?
For the PyTorch + Optuna + GPUs item, doing a web search yields not-terrible results. Here is an example (but I'm confident that there are better ones). @jacobtomlinson @mmccarty @quasiben I don't suppose you all have any interest in finding something here? My guess is that this is much easier for you all (or someone around you) than it is for me personally.
Moving the HPO conversation into a standalone issue: #759
Hi there! On image processing, there is a complex use case that has not gotten an answer yet: https://dask.discourse.group/t/parallelize-or-map-chunks-of-arrays-with-different-sizes-shapes-and-number-of-blocks/1663. Another small example on this topic: https://dask.discourse.group/t/upscaling-an-image-with-dask-image-leads-to-blurry-result/1631/3. Dask for reading and processing videos: https://dask.discourse.group/t/performing-hog-matrices-on-pims-chunks-through-imageio/570. I was hoping to find some nice DataFrame + ML workflows, but in the end these kinds of topics yield only very basic "toy" examples, so after browsing Discourse and Stack Overflow for 20 minutes, I've given up.
Note: I'm moving over the list of proposed workflows from the roadmap to this repo. I'll continue to iterate a bit on this issue
Data loading and cleaning
Dask is often used to schlep data from one format to another, cleaning or manipulating it along the way. This occurs in both dataframe and array use cases. There are lots of possible configurations here, but we’ll focus on just a few to start.
Exploratory Analysis
This is where most of our demos live today: load a dataset, fool around, make some pretty charts.
RenRe
Punting on this during our first pass over workflows.
Embarrassingly parallel ✅
The matplotlib-arXiv notebook is a good example we have today of an embarrassingly parallel workflow. This is “Dask as a big for loop”. It also shows cloud data access and processes 3 TB of real data.
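The “big for loop” pattern can be sketched with `dask.delayed`, assuming `dask` is installed. The `load` and `summarize` functions here are hypothetical stand-ins for the per-file work a real workload (like the arXiv notebook) would do:

```python
import dask

@dask.delayed
def load(i):
    # Stand-in for reading one file or URL; here it just fabricates a record.
    return list(range(i * 10, i * 10 + 10))

@dask.delayed
def summarize(record):
    # Stand-in for per-record processing.
    return sum(record)

# "Dask as a big for loop": build one lazy task per input, then run them all
# in parallel with a single compute call.
lazy = [summarize(load(i)) for i in range(5)]
results = dask.compute(*lazy)
```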
Imaging
There is a surprisingly large community of people using Dask for biomedical imaging. This includes applications like fMRI brain scans and very high-resolution microscopy (3D movies of cells at micron resolution). These folks often want to load in data, apply image-processing filters across that data using map_overlap, and then visually explore the result. They want this processing done with human-in-the-loop systems.
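A minimal sketch of the `map_overlap` pattern, assuming `dask` and `numpy` are available. The 3-point moving average is a toy stand-in for a real image filter, and the tiny synthetic array stands in for microscopy data:

```python
import numpy as np
import dask.array as da

def smooth(block):
    # Naive 3-point moving average along the last axis (numpy only).
    # np.roll wraps within the block, but wrapped values land only in the
    # halo region, which map_overlap trims away afterwards.
    return (np.roll(block, 1, axis=-1) + block + np.roll(block, -1, axis=-1)) / 3

x = da.from_array(np.arange(100, dtype=float).reshape(10, 10), chunks=(5, 5))
# depth=1 shares a one-pixel halo between neighboring chunks so the filter
# sees real neighbor values at chunk edges; boundary="reflect" handles the
# edges of the full array.
y = x.map_overlap(smooth, depth=1, boundary="reflect")
result = y.compute()
```

Real imaging pipelines swap the toy filter for something like a `scipy.ndimage` or `dask-image` operation and keep the data lazy until visualization.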
XGBoost
Probably our most common application in ML: folks want to load data into a Dask dataframe and then hand off to XGBoost’s Dask integration, possibly with GPUs. They also want to do this with hyperparameter optimization.
We already have Guido’s work at https://github.com/coiled/dask-xgboost-nyctaxi. Maybe we want to extend it with GPUs or a cost analysis.
PyTorch + HyperParameter Optimization
We have Optuna. We use it above for XGBoost, but we should also show how to use it in more vanilla settings with a model that can be trained on a single machine, presumably on a GPU. Let’s use PyTorch for this.
Train some PyTorch GPU model that fits on a single GPU with Optuna for HPO on a cluster