Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workflow ideas #725

Open
2 of 10 tasks
jrbourbeau opened this issue Mar 21, 2023 · 9 comments
Open
2 of 10 tasks

Workflow ideas #725

jrbourbeau opened this issue Mar 21, 2023 · 9 comments
Labels
workflows Related to representative Dask user workflows

Comments

@jrbourbeau
Copy link
Member

jrbourbeau commented Mar 21, 2023

Note: I'm moving over the list of proposed workflows from the roadmap to this repo. I'll continue to iterate a bit on this issue

Data loading and cleaning

Dask is often used to schlep data from one format to another, cleaning or manipulating it along the way. This occurs in both dataframe and array use cases. There are lots of possible configurations here, but we’ll focus on just a few to start.

Exploratory Analysis

This is where most of our demos live today. Load a dataset, fool around, make some pretty charts

Embarrassing parallel ✅

The matplotlib-arXiv notebook is a good example we have today of embarrassingly parallel workflows. This is “Dask a as a big for loop”. It also shows cloud data access and processes 3TB of real data.

Imaging

There is a surprisingly large community of people using Dask for bio-medical imaging. This includes applications like fMRI brain scans, and very high resolution microscopy (3d movies at micro resolution of cells). These folks often want to load in data, apply image processing filters across that data using map_overlap, and then visually explore the result. They want this processing done with human-in-the-loop systems.

XGBoost

Probably our most common application in ML, folks want to load data into a dask dataframe and then hand off to XGBoost’s Dask integration, possibly with GPUs. They also want to do this with Hyper-Parameter-Optimization.

We already have Guido’s work here at https://github.com/coiled/dask-xgboost-nyctaxi . Maybe we want to extend it with GPUs or cost analysis.

  • Train on a large dataset
  • Train on a large dataset with HPO with Optuna
  • Add GPUs

PyTorch + HyperParameter Optimization

We have Optuna. We use it above for XGBoost but we should also show how to use it in more vanilla settings with a model that can be trained on a single machine, presumably a GPU. Let’s use PyTorch for this.

Train some PyTorch GPU model that fits on a single GPU with Optuna for HPO on a cluster

@mrocklin
Copy link
Member

@guillaumeeb we're trying to add some real-world examples to our benchmark suite (as opposed to the more toy-examples there today) that are reflective of common Dask workloads. We're looking for examples roughly like the following. Ideally we'd find examples that are between 20-200 lines of code in terms of complexity. Looking at the list above, can you think of good examples that you've run across while engaging with users?

@mrocklin
Copy link
Member

I suspect that @ncclementi has a notebook already for RenRe. Naty can you point to that if you still have it? Did we get clearance to use it in a public setting from them?

@ncclementi
Copy link
Contributor

@mrocklin JD (RenRe) and I collaborated on creating a synthetic dataset that represented the original one. The synthetic data is not public, but it's on the oss-s3. I can make it public if we want.

The repo on how to create the data, and a replication of their workflow (imbalance join) are in this repo
https://github.com/coiled/imbalanced-join

I'm happy to chat with whoever will be taking the lead on this to bring them up to speed and facilitate whatever they need.

@mrocklin
Copy link
Member

mrocklin commented Mar 31, 2023 via email

@ncclementi
Copy link
Contributor

@mrocklin we do have those on a private repo, that has multiple things (happy to walk you through what's in there, I'm available this afternoon). When I talked to JD, they mentioned their main issue was the joins shown on the notebook.

@mrocklin
Copy link
Member

mrocklin commented Mar 31, 2023 via email

@mrocklin
Copy link
Member

For the PyTorch + Optuna + GPUs doing a web search here yields not-terrible results. Here is an example (but I'm confident that there are better ones).

@jacobtomlinson @mmccarty @quasiben I don't suppose you all have any interest in finding something here. My guess is that this is much easier for you all (or someone around you) than it is for me personally.

@jrbourbeau
Copy link
Member Author

Moving the HPO conversation into a standalone issue #759

@jrbourbeau jrbourbeau added the workflows Related to representative Dask user workflows label Mar 31, 2023
@guillaumeeb
Copy link

Hi there,

On image processing, there is some complex use case that did not get an answer yet: https://dask.discourse.group/t/parallelize-or-map-chunks-of-arrays-with-different-sizes-shapes-and-number-of-blocks/1663.

Another small example on this topic: https://dask.discourse.group/t/upscaling-an-image-with-dask-image-leads-to-blurry-result/1631/3.

Dask for reading and processing videos: https://dask.discourse.group/t/performing-hog-matrices-on-pims-chunks-through-imageio/570.

I was hoping to find some nice Dataframe + ML workflows, but in the end these kind of topics give only very basic "toy" examples. So after browsing Discourse and Stackoverflow for 20 minutes, I've given up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
workflows Related to representative Dask user workflows
Projects
None yet
Development

No branches or pull requests

4 participants