Workflow ideas #725

jrbourbeau · 2023-03-21T20:26:05Z

Note: I'm moving over the list of proposed workflows from the roadmap to this repo. I'll continue to iterate a bit on this issue

Data loading and cleaning

Dask is often used to schlep data from one format to another, cleaning or manipulating it along the way. This occurs in both dataframe and array use cases. There are lots of possible configurations here, but we’ll focus on just a few to start.

Data loading and cleaning #726

Exploratory Analysis

This is where most of our demos live today. Load a dataset, fool around, make some pretty charts

Uber/Lyft, perform various simple dataframe computations and find novel results
Pangeo / earth science workflows #770
~~RenRe~~ punting during our first pass over workflows

Embarrassing parallel ✅

The matplotlib-arXiv notebook is a good example we have today of embarrassingly parallel workflows. This is “Dask a as a big for loop”. It also shows cloud data access and processes 3TB of real data.

Add matplotlib arxiv workflow #724

Imaging

There is a surprisingly large community of people using Dask for bio-medical imaging. This includes applications like fMRI brain scans, and very high resolution microscopy (3d movies at micro resolution of cells). These folks often want to load in data, apply image processing filters across that data using map_overlap, and then visually explore the result. They want this processing done with human-in-the-loop systems.

Image analysis #751

XGBoost

Probably our most common application in ML, folks want to load data into a dask dataframe and then hand off to XGBoost’s Dask integration, possibly with GPUs. They also want to do this with Hyper-Parameter-Optimization.

We already have Guido’s work here at https://github.com/coiled/dask-xgboost-nyctaxi . Maybe we want to extend it with GPUs or cost analysis.

Train on a large dataset
Train on a large dataset with HPO with Optuna
Add GPUs

PyTorch + HyperParameter Optimization

We have Optuna. We use it above for XGBoost but we should also show how to use it in more vanilla settings with a model that can be trained on a single machine, presumably a GPU. Let’s use PyTorch for this.

Train some PyTorch GPU model that fits on a single GPU with Optuna for HPO on a cluster

PyTorch hyperparameter optimization #759

mrocklin · 2023-03-30T22:06:38Z

@guillaumeeb we're trying to add some real-world examples to our benchmark suite (as opposed to the more toy-examples there today) that are reflective of common Dask workloads. We're looking for examples roughly like the following. Ideally we'd find examples that are between 20-200 lines of code in terms of complexity. Looking at the list above, can you think of good examples that you've run across while engaging with users?

mrocklin · 2023-03-31T16:31:09Z

I suspect that @ncclementi has a notebook already for RenRe. Naty can you point to that if you still have it? Did we get clearance to use it in a public setting from them?

ncclementi · 2023-03-31T17:11:39Z

@mrocklin JD (RenRe) and I collaborated on creating a synthetic dataset that represented the original one. The synthetic data is not public, but it's on the oss-s3. I can make it public if we want.

The repo on how to create the data, and a replication of their workflow (imbalance join) are in this repo
https://github.com/coiled/imbalanced-join

I'm happy to chat with whoever will be taking the lead on this to bring them up to speed and facilitate whatever they need.

mrocklin · 2023-03-31T17:30:07Z

My recollection is that the original RenRe workflow was more than just this one join. It was lots of things. Do we still have that? Is it possible to make that public?

…

On Fri, Mar 31, 2023 at 12:11 PM Naty Clementi ***@***.***> wrote: @mrocklin <https://github.com/mrocklin> JD (RenRe) and I collaborated on creating a synthetic dataset that represented the original one. The synthetic data is not public, but it's on the oss-s3. I can make it public if we want. The repo on how to create the data, and a replication of their workflow (imbalance join) are in this repo https://github.com/coiled/imbalanced-join I'm happy to chat with whoever will be taking the lead on this to bring them up to speed and facilitate whatever they need. — Reply to this email directly, view it on GitHub <#725 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACKZTE3YTYTGORDJHQMZ63W64F5PANCNFSM6AAAAAAWC6JO6M> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- <https://coiled.io> Matthew Rocklin CEO

ncclementi · 2023-03-31T17:38:51Z

@mrocklin we do have those on a private repo, that has multiple things (happy to walk you through what's in there, I'm available this afternoon). When I talked to JD, they mentioned their main issue was the joins shown on the notebook.

mrocklin · 2023-03-31T17:40:55Z

Can you point to the repository?

…

On Fri, Mar 31, 2023 at 12:39 PM Naty Clementi ***@***.***> wrote: @mrocklin <https://github.com/mrocklin> we do have those on a private repo, that has multiple things (happy to walk you through what's in there, I'm available this afternoon). When I talked to JD, they mentioned their main issue was the joins shown on the notebook. — Reply to this email directly, view it on GitHub <#725 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACKZTECQGAKLCGPTRNLMGDW64JDNANCNFSM6AAAAAAWC6JO6M> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- <https://coiled.io> Matthew Rocklin CEO

mrocklin · 2023-03-31T17:45:21Z

For the PyTorch + Optuna + GPUs doing a web search here yields not-terrible results. Here is an example (but I'm confident that there are better ones).

@jacobtomlinson @mmccarty @quasiben I don't suppose you all have any interest in finding something here. My guess is that this is much easier for you all (or someone around you) than it is for me personally.

jrbourbeau · 2023-03-31T19:57:26Z

Moving the HPO conversation into a standalone issue #759

guillaumeeb · 2023-04-01T08:02:53Z

Hi there,

On image processing, there is some complex use case that did not get an answer yet: https://dask.discourse.group/t/parallelize-or-map-chunks-of-arrays-with-different-sizes-shapes-and-number-of-blocks/1663.

Another small example on this topic: https://dask.discourse.group/t/upscaling-an-image-with-dask-image-leads-to-blurry-result/1631/3.

Dask for reading and processing videos: https://dask.discourse.group/t/performing-hog-matrices-on-pims-chunks-through-imageio/570.

I was hoping to find some nice Dataframe + ML workflows, but in the end these kind of topics give only very basic "toy" examples. So after browsing Discourse and Stackoverflow for 20 minutes, I've given up.

jrbourbeau mentioned this issue Mar 21, 2023

Data loading and cleaning #726

Open

3 tasks

This was referenced Mar 30, 2023

Image analysis #751

Open

[WIP] Add rechunking workflow #752

Open

jrbourbeau mentioned this issue Mar 31, 2023

PyTorch hyperparameter optimization #759

Closed

1 task

jrbourbeau added the workflows Related to representative Dask user workflows label Mar 31, 2023

j-bennet mentioned this issue Apr 3, 2023

Exploratory DataFrame analysis #764

Merged

This was referenced Apr 5, 2023

Pangeo / earth science workflows #770

Open

Add XGBoost + Optuna / Dask workflow #785

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Workflow ideas #725

Workflow ideas #725

jrbourbeau commented Mar 21, 2023 •

edited

Loading

mrocklin commented Mar 30, 2023

mrocklin commented Mar 31, 2023

ncclementi commented Mar 31, 2023

mrocklin commented Mar 31, 2023 via email

ncclementi commented Mar 31, 2023

mrocklin commented Mar 31, 2023 via email

mrocklin commented Mar 31, 2023

jrbourbeau commented Mar 31, 2023

guillaumeeb commented Apr 1, 2023

Workflow ideas #725

Workflow ideas #725

Comments

jrbourbeau commented Mar 21, 2023 • edited Loading

Data loading and cleaning

Exploratory Analysis

Embarrassing parallel ✅

Imaging

XGBoost

PyTorch + HyperParameter Optimization

mrocklin commented Mar 30, 2023

mrocklin commented Mar 31, 2023

ncclementi commented Mar 31, 2023

mrocklin commented Mar 31, 2023 via email

ncclementi commented Mar 31, 2023

mrocklin commented Mar 31, 2023 via email

mrocklin commented Mar 31, 2023

jrbourbeau commented Mar 31, 2023

guillaumeeb commented Apr 1, 2023

jrbourbeau commented Mar 21, 2023 •

edited

Loading