Image analysis #751
@GenevieveBuckley this is part of our broader effort to include large-scale, representative user workflows in examples and decisions that inform future development of Dask (xref #725). I'd love to get your thoughts on what you think makes a good representative image analysis workflow. Thoughts on a nice public dataset? Maybe you already have a notebook that shows this type of workflow off?
I'd say typical image analysis workflows often have this structure (sketched in code below):
- Loading data
- Preprocessing the images
- Identifying objects (segmentation)
- Measuring objects
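By way of illustration only, here is a rough dask-image sketch of those four stages. The file path, the sigma, and the threshold value 100 are placeholders rather than values from any specific dataset, and a fuller worked example appears later in this thread:

```python
import numpy as np

from dask_image.imread import imread
from dask_image import ndfilters, ndmeasure

images = imread("data/*.tif")                                  # loading data
smoothed = ndfilters.gaussian_filter(images, sigma=[0, 1, 1])  # preprocessing (smooth within each 2D frame)
mask = smoothed > 100                                          # identifying objects: threshold...
labels, count = ndmeasure.label(mask)                          # ...then label connected components
index = np.arange(1, int(count) + 1)                           # int() computes the object count
areas = ndmeasure.area(images, labels, index)                  # measuring objects
```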
Thanks @GenevieveBuckley. I don't suppose you or someone around you has a representative workflow lying around? My guess is that if we have a non-imaging person try to construct a normal workflow it'll be both time-consuming and miss important aspects of the work. I'm hopeful that you know of some dataset and canonical notebook somewhere that we could clean up and use instead of trying to invent something from scratch.
I gave a talk at PyConAU a few years ago, and it does include a basic image segmentation workflow. This isn't the most exciting example, because it's a stack of separate 2D images and embarrassingly parallel, but it was good for showing all the dask-image features we had. Nicholas Sofroniew also has a nice demo from a few years ago showing interactive cell segmentation. The context here was showing off what you could do with napari, and how interactivity makes things a lot easier. Here's the jupyter notebook (there are two other jupyter notebooks that come before that one, so if things don't make sense you can skip back a bit if you need) and the corresponding YouTube video.
Lots of thoughts! You might also ask @joshmoore, I think there are some good datasets already in .zarr format on the Image Data Resource (IDR) (plus you should be able to preview them remotely with napari).

**MRI dataset**
**Timeseries light microscopy**

A timeseries microscopy dataset, like these developing insect embryos, would be stunning. It would fit very easily into a workflow to load the data, filter the images, threshold bright objects to segment the nuclei, then count & measure them. http://celltrackingchallenge.net/3d-datasets/ Developing Tribolium Castaneum embryo (red flour beetle)
Or if the one above is too big, you could try this smaller one: C. elegans developing embryo (worm embryo)
**Lattice lightsheet microscopy**

I also think a lattice lightsheet dataset could make a good example. Talley has a nice example (and has turned it into a plugin) deskewing a lattice lightsheet dataset, and you can download the example data. The napari-lattice wiki pages have some more detail about the desired workflow (load, deskew, crop, deconvolve, etc.), which might be helpful for background context.

**Electron imaging dataset**

Janelia makes volume electron microscopy datasets publicly available on AWS: https://openorganelle.janelia.org/ Possibly also useful are the Caltech Electron Tomography Database and/or EMPIAR (the Electron Microscopy Public Image Archive). They might be a bit harder to search through for something suitable, though.

**Histology whole slide image(s)**

I know the CAMELYON challenge has publicly available histology images. But it can be tricky to download them (I think they're hosted on Google Drive, so it works ok manually if you do it once or twice - but repeated downloads will trigger rate limiting or blocks), and they also need to be converted to zarr (this WSI reader might be helpful?).

**More datasets**

We've been collecting lists of datasets in these issues:
There is a list of OME-Zarr related datasets under https://ngff.openmicroscopy.org/data. If other links end up here, it'd be great to also point to them from that resource. One that needs adding, for example, is http://zebrahub.org/
What I'm seeing here is a bunch of "here is a bunch of raw material that you could use" which is great. Thank you all. However, I suspect that @jrbourbeau is now in a position where he's being asked to judge a bunch of stuff that he doesn't understand at all. If anyone has the time to say "This workload (or two) is computationally representative of the image processing community. It is what we would want included in a regularly run Dask benchmark suite." that would be welcome. Otherwise, my guess is that James will choose one or two of these at random and it won't be very good. (no offense James)
That's fair. You don't want lots of ideas, you need just one or two good ones. Let's do these two things:
It might also be helpful if James or someone could explain what the purpose of these benchmarks is.
Good summary!
Realistic is good.
It's less fun, but also totally ok. From my perspective there are two overlapping objectives:
1. making sure that Dask engineers are exposed to, and understand, the computational patterns of this community
2. having representative workloads that we can run regularly as benchmarks
For something like a napari-like interactive session, objective two probably doesn't make sense in this format (benchmarks, notebooks, etc.) and we're mostly looking for objective one, making sure that the Dask engineers are sensitive to the computational needs of this group.
**Image analysis example 1**

Embarrassingly parallel computation, on 2D fluorescence microscopy images of cell nuclei. Dataset: BBBC039 dataset

```python
import numpy as np

from dask_image.imread import imread
from dask_image import ndfilters, ndmorph, ndmeasure
images = imread('data/BBBC039/images/*.tif')
smoothed = ndfilters.gaussian_filter(images, sigma=[0, 1, 1])
thresh = ndfilters.threshold_local(smoothed, block_size=images.chunksize)
threshold_images = smoothed > thresh
structuring_element = np.array([
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 1, 0], [1, 1, 1], [0, 1, 0]],
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
])  # cross-shaped connectivity within each 2D image; no connectivity between images
binary_images = ndmorph.binary_closing(threshold_images, structure=structuring_element)
label_images, num_features = ndmeasure.label(binary_images)
index = np.arange(1, int(num_features) + 1)  # label IDs start at 1 (0 is background); int() computes the count
area = ndmeasure.area(images, label_images, index)
mean_intensity = ndmeasure.mean(images, label_images, index)
# ... compute things! (index, area, mean_intensity) and save as output csv, probably
# ideally don't unnecessarily re-compute parts of the dask task graph multiple times (not that you would do this... but I sometimes do this without thinking about it, oops)
# ... maybe make some graphs of that output. E.g. histogram or violin plot of
# area and mean_intensity; maybe a 2D plot of area vs mean_intensity
```

What you compute depends on why you want to make these benchmarks. As I said above, if you are mimicking interactive user behaviour, you'll want to get a few random image regions of the raw data, and the same after most of the different steps. Users will often change variables a few times (like the sigma value for the gaussian smoothing, or different threshold values for the mask) and compare the output to pick the one they like best. After this interactive period of setting everything up and choosing all the parameter values, people generally compute the whole dataset. Saving the output measurements to csv and generating some summary statistics/graphs would be pretty standard. In the talk I plotted nuclei area vs mean intensity, but only for the first three images so it would run quickly. Mean intensity is not a very meaningful thing to measure, but it is easy to do and I couldn't think of anything better to replace it with. I think that's fine for a benchmark or demo.
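As a minimal sketch of that final batch step, reusing the variables from the example above: `dask.compute` evaluates all three measurements in one pass over the shared task graph, which avoids the accidental re-computation mentioned in the code comments. Using pandas for the csv output is an assumption here; any csv writer would do.

```python
import dask
import pandas as pd

# Evaluate all three outputs in a single pass over the shared task graph
index_, area_, mean_intensity_ = dask.compute(index, area, mean_intensity)

# Save the per-object measurements as csv
pd.DataFrame(
    {"label": index_, "area": area_, "mean_intensity": mean_intensity_}
).to_csv("measurements.csv", index=False)
```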
OK, so I'm hearing that we can have a large graph like this, and then we can do two things:
1. compute a few small pieces of it interactively, the way a user tuning parameters would
2. compute the whole dataset as a batch job
That would let us cover two different important cases with the same code. Hopefully this also generates some nice images that draw users in?
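A rough sketch of those two cases, again reusing names from Genevieve's example above (the sigma values and the image index are arbitrary placeholders):

```python
import dask

# Case 1: interactive -- compute one image at two candidate parameter
# settings and compare, as a user tuning the pipeline would
preview_a = ndfilters.gaussian_filter(images, sigma=[0, 1, 1])[0].compute()
preview_b = ndfilters.gaussian_filter(images, sigma=[0, 2, 2])[0].compute()

# Case 2: batch -- once parameters are fixed, compute the final
# measurements over the whole dataset in one call
index_, area_, mean_intensity_ = dask.compute(index, area, mean_intensity)
```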
If what I'm hearing is correct then this sounds easy and doable (although I'll let @jrbourbeau determine that when he gets up).
Here's some half-finished work. Gist: https://gist.github.com/GenevieveBuckley/41f49c56640c155f68c346b82c04e803 I'm having a problem at the last step, going from a simple threshold mask of all the nuclei, to individual labels for each separate nucleus. There's nothing stopping you from using this and leaving the last part out, and maybe I'll figure it out later. I've got to go home now though, so I'm sharing what I've got so far. In the gist, we have:
Here's a possible approach to fix the final labelling step for the red flour beetle.
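For context only, and not necessarily the approach that was actually posted: the basic building block for that step would be connected-component labelling on the binary mask with `dask_image.ndmeasure.label`. In this sketch, `binary_volume` is a random stand-in for the thresholded 3D array from the gist, purely so the snippet runs.

```python
import dask.array as da
from dask_image import ndmeasure

# Stand-in for the thresholded 3D mask from the gist (random data here)
binary_volume = da.random.random((64, 256, 256), chunks=(16, 256, 256)) > 0.99

# Label connected components, merging labels across chunk boundaries
label_volume, num_nuclei = ndmeasure.label(binary_volume)
print(int(num_nuclei))  # forces computation of the object count
```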
This does fall into the "here is a bunch of raw material that you could use" category, so disregard it if you like. But I found a Kaggle "3D Image Analysis using Dask" notebook, which is a nice in-the-wild example. It looks like it was made as an exercise for this course. I don't know the author, but I think this is the right K Mader. I suspect there'd be some work involved if you wanted to turn the notebook into a benchmark. There are several questions/exercises at the end of the notebook, and there may be another version with example solutions to those parts floating around as well. Could be something to keep in mind for the future.
(This note is more for me than for James or anyone else at Coiled. I'm not expecting anyone else to look through Robert's code, but I want to link it in case there is anything that might also be useful here.)

Potentially useful resource for the red-flour-beetle 3D example:
Robert Haase has written some clesperanto demo scripts segmenting cells in 3D from a tribolium embryo (red flour beetle). It's not exactly the same dataset as the cell tracking challenge one I've linked above, but looks quite similar. The one he uses has been downsampled to 1x1x1mm voxels, which is roughly similar to the lowest resolution level of the zarr file I was playing with here.
There is a surprisingly large community of people using Dask for bio-medical imaging. This includes applications like fMRI brain scans and very high-resolution microscopy (3D movies of cells at micron resolution). These folks often want to load in data, apply image processing filters across that data using map_overlap, and then visually explore the result. They want this processing done with human-in-the-loop systems.
A representative example (sketched below) would be to:
- load data with `from_zarr`
- apply `map_overlap` with a trivial function
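A minimal sketch of that pattern. The zarr path, the choice of `gaussian_filter` as the "trivial function", and the `depth` value are placeholder assumptions, not from the thread:

```python
import dask.array as da
from scipy import ndimage

# Load a chunked volume from a (hypothetical) zarr store
images = da.from_zarr("data/volume.zarr")

# Apply a simple filter across chunk boundaries; depth controls the overlap
# shared between neighbouring chunks
smoothed = images.map_overlap(ndimage.gaussian_filter, depth=4, sigma=1)

# A human-in-the-loop user would then pull out small pieces to look at,
# e.g. one plane of the result
plane = smoothed[0].compute()
```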