Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Image analysis #751

Open
jrbourbeau opened this issue Mar 30, 2023 · 17 comments
Open

Image analysis #751

jrbourbeau opened this issue Mar 30, 2023 · 17 comments
Labels
workflows Related to representative Dask user workflows

Comments

@jrbourbeau
Copy link
Member

There is a surprisingly large community of people using Dask for bio-medical imaging. This includes applications like fMRI brain scans, and very high resolution microscopy (3d movies at micro resolution of cells). These folks often want to load in data, apply image processing filters across that data using map_overlap, and then visually explore the result. They want this processing done with human-in-the-loop systems.

A representative example would be to:

  1. Load data with from_zarr
  2. Perform map_overlap with a trivial function
  3. Then select a sequence of regions one after the other
@jrbourbeau jrbourbeau added the workflows Related to representative Dask user workflows label Mar 30, 2023
@jrbourbeau
Copy link
Member Author

@GenevieveBuckley this is part of our broader effort to include large-scale, representative user workflows in examples and decisions that inform future development of Dask (xref #725). I'd love to get your thoughts on what you think makes a good representative image analysis workflow. Thoughts on a nice public dataset? Maybe you already have a notebook that shows this type of workflow off?

@jrbourbeau jrbourbeau mentioned this issue Mar 30, 2023
10 tasks
@GenevieveBuckley
Copy link

GenevieveBuckley commented Mar 31, 2023

I'd say typical image analysis workflows often have this structure:

  1. Loading data
  2. Preprocessing images (an interactive preview here would help choosing the right parameters)
  3. Identifying objects (also a good spot for interactive use, you almost always need to manually correct segmentations in some way)
  4. Measuring objects

Loading data

  • You can probably skip most of the annoying parts of this, which include moving your data, and probably converting it into a different format (like zarr) before you load and process it.

Preprocessing images could include things like:

Identifying objects usually involves

  • Segmentation of objects. (Here's a very basic scikit-image label image example).
  • Cellpose is a really nice tool: https://www.cellpose.org/
  • Cleaning up the segmentation - often there is manual correction that needs to be done.

Measuring objects

  • What people want to measure depends on their use case, and the question they're asking
  • Things like size, shape, distances from other types of objects. The scikit-image regionprops function gives you some idea of things people can measure (not many are implemented in dask-image though)
  • Statistical analysis, graphs, etc.

@mrocklin
Copy link
Member

Thanks @GenevieveBuckley . I don't suppose you or someone around you has a representative workflow lying around? My guess is that if we have a non-imaging person try to construct a normal workflow it'll be both time consuming and miss important aspects of the work. I'm hopeful that you know of some dataset and canonical notebook somewhere that we could clean up and use instead of trying to invent something from scratch.

@GenevieveBuckley
Copy link

Maybe you already have a notebook that shows this type of workflow off?

I gave a talk at PyConAU a few years ago, and it does include a basic image segmentation workflow. This isn't the most exciting example, because it's a stack of separate 2d images and embarrassingly parallel, but it was good for showing all the dask-image features we had.

Nick Sofreniew also has a nice demo from a few years ago showing interactive cell segmentation. The context here was showing off what you could do with napari, and how interactivity makes things a lot easier. Here's the jupyter notebook (there are two other jupyter notebooks that come before that one, so if things don't make sense you can skip back a bit if you need) and corresponding YouTube video.

@GenevieveBuckley
Copy link

GenevieveBuckley commented Mar 31, 2023

Thoughts on a nice public dataset?

Lots of thoughts! You might also ask @joshmoore, I think there are some good datasets already in .zarr format on the Image Data Resource (IDR) (plus you should be able to preview them remotely with napari).

MRI dataset

Timeseries light microscopy

A timeseries microscopy dataset, like these developing insect embyros, would be stunning. It would fit very easily into a workflow to load the data, filter the images, threshold bright objects to segment the nuclei, then count & measure them.

http://celltrackingchallenge.net/3d-datasets/

Developing Tribolium Castaneum embryo (red flour beetle)

Or if the one above is too big, you could try this smaller embryo

C.elegans developing embryo (worm embryo)

Lattice lightsheet microscopy

I also think a lattice lightsheet dataset could make a good example. Talley has a nice example (and has turned it into a plugin) deskewing a lattice lightsheet dataset, and you can download the example data.

The napari-lattice wiki pages have some more detail about the desired workflow (load, deskew, crop, deconvolve, etc.), which might be helpful for background context.

Electron imaging dataset

Janelia makes volume electron microscopy datasets publically available on AWS: https://openorganelle.janelia.org/

Possibly also useful are the Caltech Electron Tomography Database and/or EMPIAR (the Electron Microscopy Public Image Archive). They might be a bit harder to search through for something suitable though.

Histology whole slide image(s)

I know of the CAMELYON challenge has publicly available histology. But it can be tricky to download the images (I think they're hosted on google drive, so it works ok manually if you do it once or twice - but repeated downloads will trigger rate limiting or blocks), and also it needs to be converted to zarr (this WSI reader might be helpful?).
dask/dask-image#125

More datasets

We've been collecting lists of datasets in these issues:

@joshmoore
Copy link

joshmoore commented Apr 5, 2023

There is a list of OME-Zarr related datasets under https://ngff.readthedocs.io/en/rtd/data/index.html (though that URL will be updated to https://ngff.openmicroscopy.org/data once I'm back from travels).

There is a list of OME-Zarr related datasets under https://ngff.openmicroscopy.org/data

If other links end up here, it'd be great to also point to them from that resource.

One that needs adding, for example, is http://zebrahub.org/

@mrocklin
Copy link
Member

mrocklin commented Apr 5, 2023

What I'm seeing here is a bunch of "here is a bunch of raw material that you could use" which is great. Thank you all.

However, I suspect that @jrbourbeau is now in a position where he's being asked to judge a bunch of stuff that he doesn't understand at all. If anyone has the time to say "This workload (or two) is computationally representative of the image processing community. It is what we would want included in a regularly run Dask benchmark suite." that would be welcome.

Otherwise, my guess is that James will choose one or two of these at random and it won't be very good. (no offense James)

@GenevieveBuckley
Copy link

That's fair. You don't want lots of ideas, we need just one or two good ones.

Let's do these two things:

  1. Image analysis example 1: You can use the example I wrote for this talk using the BBBC039 dataset, since it can be used pretty much verbatim.
  2. Image analysis example 2: A nuclei segmentation example using the red flour beetle ("Developing Tribolium Castaneum embryo") from the Cell Tracking Challenge. If you wait a little bit, I'll upload a notebook or script to get you started.

@GenevieveBuckley
Copy link

It might also be helpful if James or someone could explain what the purpose of these benchmarks are.

  • Is it intended to mimic interactive use from users while they develop their Dask workflow with Coiled?
    • If so, I guess you'll add in a bunch of .compute() calls on different sections of the data,
    • and/or change a variable value and call compute on different sections of the data to compare that (eg: change the kernel/footprint size for gausiian/median filtereing, change the threshold value for images)
  • Is it a problem or a good thing to have a workflow that depends on dask-image, or do you need things to be done only with Dask? Dask-only is less realistic, but I don't know what you want these benchmarks for.
  • You'll likely have to host the image data used for these benchmarks somewhere yourself, and maybe convert them to zarr first, is that requirement ok? I think that would work a bit better than if you planned to re-download the datasets each time.

@mrocklin
Copy link
Member

That's fair. You don't want lots of ideas, we need just one or two good ones.

Good summary!

Dask-only is less realistic

Realistic is good.

You'll likely have to host the image data used for these benchmarks somewhere yourself, and maybe convert them to zarr first, is that requirement ok?

It's less fun, but also totally ok.

From my perspective there are two overlapping objectives:

  1. Give OSS Dask engineers information about common use case performance as they develop on Dask, and make sure that important workloads continue to improve and are the focus of attention.
  2. Give new users examples that they can see, get excited by, and then copy-paste-edit

For something like a Napari-like interactive session probably objective two doesn't make sense in this format (benchmarks, notebooks, etc..) and we're mostly looking for objective one, making sure that the dask engineers are sensitive to the computational needs of this group.

@GenevieveBuckley
Copy link

Image analysis example 1

Embarrassingly parallel computation, on 2D fluorescence microscopy images of cell nuclei.

Dataset: BBBC039 datset
Code: From this talk

import numpy as np
from dask_image.imread import imread
from dask_image import ndfilters, ndmorph, ndmeasure

images = imread('data/BBBC039/images/*.tif')
smoothed = ndfilters.gaussian_filter(images, sigma=[0, 1, 1])
thresh = ndfilters.threshold_local(smoothed, blocksize=images.chunksize)
threshold_images = smoothed > thresh
structuring_element = np.array([[[0, 0, 0], [0, 0, 0], [0, 0, 0]], [[0, 1, 0], [1, 1, 1], [0, 1, 0]], [[0, 0, 0], [0, 0, 0], [0, 0, 0]]])
binary_images = ndmorph.binary_closing(threshold_image, structure=structuring_element)
label_images, num_features = ndmeasure.label(binary_image)
index = np.arange(num_features)
area = ndmeasure.area(images, label_images, index)
mean_intensity = ndmeasure.mean(images, label_images, index)
# ... compute things! (index, area, mean_intensity) and save as output csv, probably
# ideally don't unnecessarily re-compute parts of the dask task graph multiple times (not that you would do this... but I sometimes do this without thinking about it, oops)
# ... maybe make some graphs of that output. Eg: histogram or violin plot of area, mean_intensity; maybe a 2d plot of area vs mean_intensity...

What you compute depends on why you want to make these benchmarks.

As I said above, if you are mimicking interactive user behaviour, you'll want to get a random few image regions of the raw data, and the same after most of the different steps. Users will often change variables a few times (like the sigma value for the gaussian smoothing, or different threshold values for the mask) and compare the output to pick the one they like best.

After this interactive period, setting it up and choosing all the parameter values, people generally would compute the whole dataset. Saving the output measurements to csv and generating some summary statistics/graphs would be pretty standard.

In the talk I made the plotted nuclei area vs mean intensity, but only for the first three images so it would run quickly. Mean intensity is not a very meaningful thing to measure, but it is easy to do and I couldn't think of anything better to replace it with. I think that's fine for a benchmark or demo.

@mrocklin
Copy link
Member

OK, so I'm hearing that we can have a large graph like this, and then we can do two things:

  1. Full compute/persist
  2. [x[..., ...].compute() for ... in range(10)] or something like that

That would let us cover two different important cases with the same code

Hopefully this also generates some nice images that draw users in?

@mrocklin
Copy link
Member

If what I'm hearing is correct then this sounds easy and doable (although I'll let @jrbourbeau determine that when he gets up).

@GenevieveBuckley
Copy link

Image analysis example 2: A nuclei segmentation example using the red flour beetle ("Developing Tribolium Castaneum embryo") from the Cell Tracking Challenge

Here's some half-finished work. Gist: https://gist.github.com/GenevieveBuckley/41f49c56640c155f68c346b82c04e803

I'm having a problem at the last step, going from a simple threshold mask of all the nuclei, to individual labels for each separate nucleus. There's nothing stopping you from using this and leaving the last part out, and maybe I'll figure it out later. I've got to go home now though, so I'm sharing what I've got so far.

In the gist, we have:

  • red-flour-beetle.ipynb - The notebook with the 3D image analysis demo I was going to suggest James use as a benchmark. It's rough, but the basic outline is there & we can keep working on it.
  • worm-embryo.ipynb - The same deal, but with a smaller dataset because I thought that might be easier for me to work with. Not as good for a Dask benchmark though, you want a big dataset for that.
  • skimage-worm-embryo.ipynb - This is me trying to figure out if I can segment the nuclei mask into individual nuclei, on a smaller subsection just using skimage and no dask. This isn't going very well, I might have to have a chat with Juan about why the scikit-image peak_local_max function is not behaving the way I'd like.

@GenevieveBuckley
Copy link

Here's a possible approach to fix the final labelling step for the red flour beetle.

Here we go, perhaps this will be the fix for the dask problem

https://forum.image.sc/t/doing-watershed-with-large-image/77283/2

use dask.array.map_overlap 2 with scikit-image watershed using watershed_lines=True. This makes sure that each segment is separated from its neighbors by black/background pixels. Where the overlaps end, your segments will be cut by a vertical line, but not by the background.
then, use dask_image.ndmeasure.label 1 to label the connected components. This should unify the segments at the window boundary, but not the ones that are separated by the watershed line.

I don't like having to use watershed_lines=True on the red flour beetle data (the nuclei are packed pretty tightly and they're not that big). But if it's the easiest way to get a result, we might just go with it.

@GenevieveBuckley
Copy link

This does fall into the "here is a bunch of raw material that you could use" category, so disregard it if you like.

But I found a Kaggle "3D Image Analysis using Dask" notebook, which is a nice in-the-wild example.

It looks like it was made as an exercise for this course. I don't know the author, but I think this is the right K Mader.

I suspect there'd be some work involved if you wanted to turn the notebook into a benchmark. There are several questions/exercises at the end of the notebook, and it might be possible that there is another version with example solutions to those parts floating around as well. Could be something to keep in mind for the future.

@GenevieveBuckley
Copy link

(This note is more for me, rather than James or anyone else at Coiled. I'm not expecting anyone else to look through Robert's code, but I want to link it in case there is anything that might also be useful here)

Potentially useful resource for the red-flour-beetle 3D example:

Robert Haase has written some clesperanto demo scripts segmenting cells in 3D from a tribolium embryo (red flour beetle). It's not exactly the same dataset as the cell tracking challenge one I've linked above, but looks quite similar. The one he uses has been downsampled to 1x1x1mm voxels, which is roughly similar to the lowest resolution level of the zarr file I was playing with here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
workflows Related to representative Dask user workflows
Projects
None yet
Development

No branches or pull requests

4 participants