
Zero-mAP bug: P, R and mAP are all 0.0 #9059

Closed
1 task done
athrunsunny opened this issue Aug 21, 2022 · 27 comments · Fixed by #9068
Labels
bug Something isn't working help wanted Extra attention is needed question Further information is requested

Comments

@athrunsunny

Search before asking

Question

I trained my custom one-class dataset (about 40,000 images) using the default yolov5s config with the latest version (6.2), and P, R and mAP are all 0 (still 0 after about 100 epochs). However, it is normal with the previous version (6.1), and it is also normal when I train on coco128. I think this may be a bug, but I haven't debugged it.

Additional

No response

@athrunsunny athrunsunny added the question Further information is requested label Aug 21, 2022
@github-actions
Contributor

github-actions bot commented Aug 21, 2022

👋 Hello @athrunsunny, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email [email protected].

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.

@giuseppebrb

It's a bug that already happened the other day; @glenn-jocher fixed it, but today it reappeared.

@pourmand1376
Contributor

Have you updated your repo? @glenn-jocher created a PR for this yesterday.

@giuseppebrb

In my case, everything was working again yesterday, but today the bug returned. I clone the repository fresh on every run in my Google Colab notebook.

@glenn-jocher glenn-jocher added bug Something isn't working help wanted Extra attention is needed labels Aug 21, 2022
@glenn-jocher glenn-jocher changed the title When training custom one class dataset, P, R and map are all 0 Zero-mAP bug: P, R and mAP are all 0.0 Aug 21, 2022
@glenn-jocher
Member

glenn-jocher commented Aug 21, 2022

@athrunsunny @giuseppebrb @pourmand1376 I'm still seeing this myself, but it's not consistently reproducible. I trained 10 models in a row in Colab just now and the zero-mAP bug randomly appears in 2 out of 10. Importantly, all val losses are identical and all single-GPU trainings are now 100% reproducible, so the bug is not in training; it is apparently in the metrics.

I need help debugging this, I don't know what's causing the problem.

# Train 10x YOLOv5s on COCO128 for 3 epochs
for _ in range(10):
  !python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --cache

[Screenshot 2022-08-21 at 14:13:14]

@AyushExel FYI this is happening, don't have a solution yet.

@pourmand1376
Contributor

@glenn-jocher
Okay, if you don't know where to start, I suggest using git bisect. It will find the exact commit that introduced the issue.

I will try to solve the problem. However, I don't know of any commit that doesn't have this issue. Do you know of one?

@glenn-jocher
Member

@pourmand1376 yes, I've been manually bisecting commits, but since the bug is not 100% reproducible I was only testing one training per commit and reverting to the most recent commit that passed that single training. This only appeared in the last few days. Some other observations:

  • --device cpu trainings always appear to work correctly
  • --noplots trainings always appear to work correctly, even on GPU
  • another user mentioned NaN batchnorm values, but this doesn't make sense to me, as all val losses are identical for both zero-mAP and normal runs

@AyushExel
Contributor

This seems like a P0 problem. Let's prioritize it.
I've seen similar things happen when some hyperparameter is set to a bad value. Were there any recent commits related to hyps?

@glenn-jocher
Member

@pourmand1376 so it seems like the git bisect approach is correct, but the test needs high statistics, i.e. if we run 30 trainings and get 30 good results we can be about 97% sure there's no bug in that commit.
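
As a rough editorial sketch of that statistical argument (the true per-run bug rate is unknown; the rates below are assumptions taken from the counts reported in this thread):

# Probability that a genuinely buggy commit would have shown at least one
# zero-mAP run among N clean runs, for a few assumed per-run bug rates
def confidence_commit_is_clean(per_run_bug_rate: float, clean_runs: int) -> float:
    return 1 - (1 - per_run_bug_rate) ** clean_runs

for p in (0.10, 0.20, 1 / 3):
    print(f"assumed bug rate {p:.2f}: {confidence_commit_is_clean(p, 30):.1%} after 30 clean runs")
# roughly 95.8% at a 10% rate, about 99.9% at 20%, and >99.99% at 1/3 of runs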

@glenn-jocher
Member

glenn-jocher commented Aug 21, 2022

@AyushExel no, the last hyp changes were months ago. I think this might be related to EMA, to running val.py directly from train.py, or maybe to AMP/FP16 ops between train and val. But the strangest part is that val losses are always identical across all models, regardless of whether mAP is 0 or correct.

Another observation is that calling val.py directly always works. The bug is in val during training only. In other words, there is absolutely nothing wrong with training or with the models being trained; what's wrong is that val.py during training erroneously reports zero mAP.

@glenn-jocher
Member

Another thing I suspected was the introduction of identical filenames in /classify, which has never happened before in the repo. If that were the cause, though, the bug would appear directly in the v6.2 tag.

@pourmand1376
Contributor

pourmand1376 commented Aug 21, 2022

Hmm, I am using

for _ in range(30):
  !python train.py --img 640 --batch 16 --epochs 2 --data coco128.yaml --weights yolov5x.pt --cache

So far:

!git bisect start
!git bisect bad
!git bisect good 20f1b7ea086fb317da93d2e603c4def2ebfb8187
!git bisect good fe809b8dad5236d86d5acbe047b5e0e6895b2b8a
!git bisect bad 4bc5520e9424cdb0bd73bffb091b85934d5096c8
!git bisect bad 27fb6fd8fc21c20290041f38046d7a60ae8c6e3a
!git bisect bad eb359c3a226f55c9b51efcfeae2e31c820e6e08a
!git bisect good 5c854fab5e43df82ebfd51197c2dc58e5212c5a6
!git bisect good d40cd0d454dcc34312cb5c40f45f64b76665c40c
!git bisect bad de6e6c0110adbb41f829c1288d5cdab7105892ae
!git bisect bad 61adf017f231f470afca2636f1f13e4cce13914b

Once we know which commits are bad, it will be a lot easier to pin down the problem.

@glenn-jocher
Member

@pourmand1376 oh, YOLOv5x will be slow. You can use the default Colab training command; the bug will appear there:

for _ in range(10):
  !python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --cache

@glenn-jocher
Member

I trained the v6.2 tag 15x and all trainings were correct. So the bug is apparently not in the v6.2 release, but was introduced in the 38 commits since then.

@pourmand1376
Contributor

pourmand1376 commented Aug 21, 2022

I am updating the git bisect commits. The good ones are not certain, since I am running 4 concurrent Colab sessions, but the bad ones are certain.

@robotaiguy

robotaiguy commented Aug 21, 2022

I'm still on commit d1dfcab and have no issues with loss, P, R, mAP, etc.
I did notice that whether the values appeared as NaN or 0 depended on where they were displayed... and training did not halt... in fact, it felt normal, other than not seeing the metrics.

And I've been training 2 models (one locally and one on Colab) all night and morning.

@pourmand1376
Contributor

pourmand1376 commented Aug 21, 2022

This is what bisect reports ...

61adf01 is the first bad commit

commit 61adf017f231f470afca2636f1f13e4cce13914b
Author: Glenn Jocher <[email protected]>
Date:   Thu Aug 18 20:12:33 2022 +0200

    `torch.empty()` for speed improvements (#9025)
    
    `torch.empty()` for speed improvement
    
    Signed-off-by: Glenn Jocher <[email protected]>

:040000 040000 c7c88a2192877d503c0ef118407ca9ba9062e3f3 03d4349010436426c5870f8cab61b453890f5aeb M	models
:040000 040000 9c18cbfcbcf0c2c58e9d83d91cfdc31bd652d6b0 38a9925a4fe83db85246725490edbd5455cedb1e M	utils

@glenn-jocher
Member

@pourmand1376 ohhhhhhh this makes sense, since torch.empty is not reproducible! Awesome @pourmand1376, this is a huge help. I'll revert this commit immediately; the speed gains from the torch.zeros to torch.empty switch are tiny in any case.
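
A minimal illustration of that non-reproducibility point (editorial sketch, not from the repo): torch.zeros always returns the same values, while torch.empty returns whatever happens to be in the allocated memory.

import torch

print(torch.zeros(3))  # tensor([0., 0., 0.]) on every run
print(torch.empty(3))  # uninitialized memory: arbitrary values that may differ run to run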

@AyushExel
Contributor

@glenn-jocher @pourmand1376 quick question - do we know WHY this caused issues? From the docs I understand that it creates an uninitialized tensor with arbitrary values (which explains the non-reproducibility), but I don't get why the metrics would be stuck at 0?

@glenn-jocher
Member

@pourmand1376 thanks for the help!! I've pushed #9068 with a fix by @0zppd just now

@pourmand1376
Contributor

pourmand1376 commented Aug 21, 2022

@AyushExel
I don't know. Maybe this should be reported as a bug upstream in PyTorch; we could raise it with the main PyTorch contributors ...

At the very least, they should mention this behaviour in their docs.

@glenn-jocher
Member

glenn-jocher commented Aug 21, 2022

@giuseppebrb @athrunsunny @pourmand1376 @robotwhispering @AyushExel good news 😃! Your original issue may now be fixed ✅ in PR #9068 by @0zppd. @pourmand1376 tracked down the problem using git bisect, running each training 10x since the bug was not reproducible on every training but showed up in maybe 1/3 of them.

The original issue was that I had replaced torch.zeros() with torch.empty() in some ops like warmup and profiling to try to get slight speed improvements, and one op in particular ran a torch.empty() tensor through the model while it was in .train() mode, leading the batchnorm layers to fold those values into their tracked statistics. Since torch.empty() is uninitialized, it can contain extremely high or low values, leading some batchnorm layers to randomly output NaN.

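A minimal sketch of that failure mode (not the actual YOLOv5 code; the injected inf below stands in for the arbitrary values that uninitialized memory can contain, which is also why the bug only appeared on some runs):

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
bn.train()

x = torch.empty(1, 3, 64, 64)   # uninitialized memory; contents are arbitrary
x[0, 0, 0, 0] = float("inf")    # simulate an extreme garbage value
_ = bn(x)                       # batch statistics now contain inf/NaN ...
print(bn.running_mean, bn.running_var)  # ... and so do the tracked running stats

bn.eval()
out = bn(torch.randn(1, 3, 64, 64))  # later validation passes inherit the damage
print(out.isnan().any())             # tensor(True) -> predictions, and hence mAP, collapse
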
The PR has been extensively tested on 10x Colab trainings and all 10 came back good now:
[Screenshot 2022-08-21 at 15:40:00]

To receive this update:

  • Git – git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
  • PyTorch Hub – force-reload with model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • Notebooks – view the updated notebooks in Colab or Kaggle
  • Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

@athrunsunny
Author

@glenn-jocher @pourmand1376 @0zppd Thank you for your great work

@shubhambagwari

I forked the latest version of YOLOv5, but I am still facing this error.
Earlier I trained the model with 2,000 images and faced the same issue.

Roboflow was used to label the custom dataset.
I have read the earlier discussion, but the suggestions are not working for me.


@glenn-jocher
Member

@shubhambagwari try 300 epochs.

@Davegdd

Davegdd commented Dec 21, 2022

@shubhambagwari try 300 epochs.

Thank you for your answer. I was having the same issue; it resolved after increasing the number of epochs beyond 3.

@umarbutler

@shubhambagwari try 300 epochs.

Thank you for your answer. I was having the same issue; it resolved after increasing the number of epochs beyond 3.

Thanks for this, I quit my training a number of times trying to fix the 0 P, R and mAP values in the first two epochs, not realising that they would only start to increase by epoch 3 😆
