Replies: 3 comments 14 replies
-
Hi, I started with nnU-Net for my segmentation task and then tried to replicate it with other libraries such as MONAI and PyTorch Lightning. Of course, I extracted all the useful information from nnU-Net, such as the patch size planned for my dataset and the fixed parameters that are independent of any dataset. I also tried to replicate the nnU-Net dataloader: the patch sampling, the handling of class imbalance, the data augmentation, and so on (a minimal sketch of this is at the end of this comment). So, regarding your interesting questions:
In general, I agree that nnU-Net sets up a baseline for any arbitrary dataset, so it is a good starting point for researchers to compare their performance against. In my case it is still a very good baseline, and one that is hard to beat in terms of performance.
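For anyone attempting the same replication: below is a minimal sketch (not the actual nnU-Net dataloader) of how its foreground-oversampled patch sampling can be approximated with MONAI transforms. The patch size, pos/neg ratio and file paths are placeholder assumptions you would replace with the values from your own nnU-Net plans.

```python
# Minimal sketch: approximating nnU-Net-style foreground oversampling in MONAI.
# Patch size, pos/neg ratio and file paths are placeholder assumptions.
from monai.data import DataLoader, Dataset
from monai.transforms import (
    Compose, EnsureChannelFirstd, LoadImaged, NormalizeIntensityd,
    RandCropByPosNegLabeld, RandFlipd,
)

transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
    NormalizeIntensityd(keys=["image"], nonzero=True),
    # Foreground oversampling: with pos=1, neg=2 roughly one patch in three
    # is forced to contain foreground, similar in spirit to nnU-Net's 33%.
    RandCropByPosNegLabeld(
        keys=["image", "label"],
        label_key="label",
        spatial_size=(128, 128, 128),  # use the patch size nnU-Net planned
        pos=1, neg=2,
        num_samples=2,
    ),
    RandFlipd(keys=["image", "label"], prob=0.5, spatial_axis=0),
])

data = [{"image": "case000_img.nii.gz", "label": "case000_seg.nii.gz"}]
loader = DataLoader(Dataset(data, transform=transforms), batch_size=2)
```

This only covers the sampling and class-imbalance part; nnU-Net's full augmentation pipeline (rotations, scaling, elastic deformation, etc.) would still need to be added on top.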
-
Hey @SimJeg and @Joeycho,

So why is that the case? Well, your guess is pretty much as good as mine. All I can provide as an answer is a set of hypotheses that I have so far not been able to prove or disprove. Most of them are just gut feelings of mine, so make of them what you will.

Hypothesis 1: Things other than the network architecture matter more in medical imaging
There are probably a bunch more that I forgot by now. The essence of this is: there is a lot more engineering involved in doing medical image segmentation 'right', and all of these engineering steps come with their own set of design choices and possibilities to screw things up. This is why you see so many bad U-Net baselines in the literature. Remember: if it ain't nnU-Net, do not trust the 'U-Net' baseline!

Hypothesis 2: Segmentation problems in the medical domain are different

At this point I should note that there are of course some datasets that are not well solved, Task 10 of the Medical Segmentation Decathlon (colon cancer) for example. And there are some datasets with closely related labels (KiTS21: tumors vs cysts) that are difficult to get right just from the images, also for humans! Which brings me to my next hypothesis.

Hypothesis 3: Small dataset sizes are disadvantageous for complex architectures

So overall, taking these things together, I think that the benefits a good architecture can provide (and I am certain that better architectures DO produce better results) are drowned in a sea of other factors like noise, small dataset sizes and so on, making it really hard to measure them. Saturated as medical image segmentation is, you really gotta try hard to create an architecture more powerful than the U-Net and to prove its value (at least if you want to convince me).

So is it impossible to beat the U-Net? Absolutely not! If all the stars are aligned and you have a large, high-quality dataset with sufficiently difficult target structures, you can make it work. See for example our AMOS2022 winning contribution, where we could clearly see an improvement when switching to a residual encoder :-) This is just a situation that is not given for most datasets.
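To illustrate what switching to a residual encoder means in practice, here is a minimal sketch of a 3D residual block. This is not the exact AMOS2022 network; the instance norm and LeakyReLU choices are assumptions for the sake of the example.

```python
# Minimal sketch of a 3D residual block for a segmentation encoder.
# Normalization and activation choices here are illustrative assumptions.
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv3d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.norm1 = nn.InstanceNorm3d(out_ch, affine=True)
        self.conv2 = nn.Conv3d(out_ch, out_ch, 3, padding=1, bias=False)
        self.norm2 = nn.InstanceNorm3d(out_ch, affine=True)
        self.act = nn.LeakyReLU(inplace=True)
        # 1x1x1 projection so the skip connection matches the main path
        # whenever the resolution or channel count changes
        self.skip = (
            nn.Identity() if stride == 1 and in_ch == out_ch
            else nn.Conv3d(in_ch, out_ch, 1, stride=stride, bias=False)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.act(self.norm1(self.conv1(x)))
        out = self.norm2(self.conv2(out))
        return self.act(out + self.skip(x))

# usage: a downsampling stage of an encoder
y = ResidualBlock3D(32, 64, stride=2)(torch.randn(1, 32, 64, 64, 64))
```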
Regarding some of the other discussion points (these are not hypotheses):

Why do we still not have general purpose pretrained encoders?
I think unsupervised/self-supervised pretraining is highly promising and will give fantastic results in the years to come.
Is nnU-Net hampering progress?

Not really, I hope. nnU-Net has the ability to catalyze progress because you can use it to verify new methodologies: you can drop in your architecture and test it in an environment where everything else is taken care of (a minimal sketch of this is at the end of this reply). You can do really comprehensive analyses on multiple datasets with minimal effort, for example for evaluating things like new loss functions. Quite neat really, and lots of people use it for that! There are also a lot of nnU-Net-independent works in segmentation (like in MONAI). Especially MONAI has a lot of traction due to the professional developers working on it (as opposed to random dudes like me doing nnU-Net). But as long as nnU-Net dominates the competitions, people will keep flocking to us ;-)

Phew. Enough rambling for today. I hope these somewhat incoherent thoughts contain the answers you were looking for!

Best,
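PS: to make the 'drop in your architecture' point concrete, here is a minimal sketch assuming the nnU-Net v1 trainer API, where initialize_network is the hook a subclass overrides. MyFancyNet is a hypothetical placeholder for your own model, not anything that ships with nnU-Net.

```python
# Minimal sketch: swapping only the network inside nnU-Net (v1-style trainer),
# keeping preprocessing, patch sampling, augmentation and schedule unchanged.
import torch
import torch.nn as nn
from nnunet.training.network_training.nnUNetTrainerV2 import nnUNetTrainerV2

class MyFancyNet(nn.Module):
    """Hypothetical placeholder for your own architecture."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.net = nn.Conv3d(in_channels, num_classes, kernel_size=1)

    def forward(self, x):
        return self.net(x)

class nnUNetTrainerMyFancyNet(nnUNetTrainerV2):
    def initialize_network(self):
        # Only the architecture changes; everything nnU-Net configured stays.
        # Note: nnUNetTrainerV2 trains with deep supervision, so a real
        # drop-in network should return one output per supervision scale.
        self.network = MyFancyNet(self.num_input_channels, self.num_classes)
        if torch.cuda.is_available():
            self.network.cuda()
```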
-
Hi @FabianIsensee, hi @Joeycho, I really appreciate the time you both took to write detailed answers!

Hypothesis 2: maybe several segmentation tasks are indeed to some extent "solved" (e.g. lung segmentation), and progress should be measured on harder tasks. Such tasks could include lesion detection, which is one of the main use cases for deep learning in medical imaging. As in my first post, it is very strange to me that the go-to method in medical imaging is still a two-step pipeline: segment, then reduce the false positives. On natural images, both steps have been done simultaneously since Mask R-CNN, at least 5 years ago (see this leaderboard). RetinaU-Net (and now nnDetection) are paving the way!

Hypothesis 1: probably in the end the bitter lesson will apply. More compute (e.g. a 500GB GPU ⚡🏭) will make engineering much easier (no patches anymore). I also think that self-supervised learning (and, if possible, vision-language models like CLIP) could be a game changer. It has been shown to work in 2D for X-ray and computational pathology (e.g. Table 3 in this paper I worked on), but I have not seen anything convincing so far in 3D (have you?). I don't agree that 3D medical datasets are too diverse and that we should have one model per modality/resolution: I would bet that the distribution of all the MRI/CT/PET scans out there, at any resolution, is much less diverse than that of natural images (e.g. YouTube or LAION-5B). CLIP ViT-H achieved remarkable success across downstream tasks using a single training set; similar results could apply in medical imaging. The hard part is finding an organization with both the time for research and the money for compute, data and talent to train such models 😅
Yes it did! I will leave the discussion open, however.
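On the self-supervised point above, here is a minimal sketch of the kind of contrastive pretraining being discussed (a SimCLR-style NT-Xent loss on two augmented views of the same volumes). The encoder, temperature and batch handling are assumptions, not taken from a specific paper.

```python
# Minimal sketch: SimCLR-style contrastive loss for self-supervised pretraining.
# z1 and z2 would come from a 3D encoder applied to two augmented views
# (e.g. two random crops) of the same scans; the encoder is assumed, not shown.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (N, D) embeddings of two views of the same N volumes."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D), unit norm
    sim = z @ z.t() / temperature                       # pairwise cosine similarity
    sim.fill_diagonal_(float("-inf"))                   # exclude self-pairs
    # the positive for view i is the other view of the same volume: i <-> i+N
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(sim.device)
    return F.cross_entropy(sim, targets)

# usage (hypothetical encoder producing (N, D) embeddings):
# loss = nt_xent_loss(encoder(view1), encoder(view2))
```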
-
Hello,
The U-Net paper came out in 2015, quickly followed in 2016 by V-Net to handle 3D inputs. Seven years later, these models have culminated in this amazing repository, which systematically sets the state of the art in medical image segmentation challenges.
In parallel, the progress of segmentation on "natural" 2D images never really stopped; see for instance the paperswithcode benchmarks ADE20K or Cityscapes.
Improvements came from both:

Why such a discrepancy?
Thanks for your inputs,
Simon