
Zero-mAP bug: P, R and mAP are all 0.0 #9059

Closed
1 task done
athrunsunny opened this issue Aug 21, 2022 · 27 comments · Fixed by #9068
Labels
bug Something isn't working help wanted Extra attention is needed question Further information is requested

Comments

@athrunsunny

Search before asking

Question

I trained my custom one-class dataset (about 40,000 images) using the default yolov5s config with the latest version (6.2), and P, R and mAP are all 0 (still 0 after about 100 epochs). However, it is normal with the previous version (6.1), and it is also normal when I train on coco128. I think this may be a bug, but I haven't debugged it.

Additional

No response

@athrunsunny athrunsunny added the question Further information is requested label Aug 21, 2022
@github-actions
Contributor

github-actions bot commented Aug 21, 2022

👋 Hello @athrunsunny, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email [email protected].

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.

@giuseppebrb

It's a bug that already happened the other day; @glenn-jocher fixed it, but today it reappeared.

@pourmand1376
Contributor

Have you updated your repo? @glenn-jocher created a PR for this yesterday.

@giuseppebrb

In my case, everything was working again yesterday, but today the bug returned. I clone the repository fresh on every run in my Google Colab notebook.

@glenn-jocher glenn-jocher added bug Something isn't working help wanted Extra attention is needed labels Aug 21, 2022
@glenn-jocher glenn-jocher changed the title When training custom one class dataset, P, R and map are all 0 Zero-mAP bug: P, R and mAP are all 0.0 Aug 21, 2022
@glenn-jocher
Member

glenn-jocher commented Aug 21, 2022

@athrunsunny @giuseppebrb @pourmand1376 I'm still seeing this myself, but it's not consistently reproducible. I trained 10 models in a row in Colab just now and the zero-mAP bug randomly appears in 2 out of 10. Importantly, all val losses are identical and all single-GPU trainings are now 100% reproducible, so the bug is not in training; it is apparently in the metrics.

I need help debugging this, I don't know what's causing the problem.

# Train 10x YOLOv5s on COCO128 for 3 epochs
for _ in range(10):
  !python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --cache

[Screenshot 2022-08-21 at 14:13:14]

@AyushExel FYI this is happening, don't have a solution yet.

@pourmand1376
Contributor

@glenn-jocher
Okay, if you don't know where to start, I suggest using git bisect. It will find the exact commit that introduced the issue.

I will try to solve the problem. However, I don't know of any commit that doesn't have this issue. Do you know of one?

@glenn-jocher
Member

@pourmand1376 yes, I've been manually bisecting commits, but since the bug is not 100% reproducible I was only testing one training per commit and reverting to the most recent commit that passed that single training. This only appeared in the last few days. Some other observations:

  • --device cpu trainings always appear to work correctly
  • --noplots trainings always appear to work correctly, even on GPU
  • another user mentioned NaN batchnorm values, but this doesn't make sense to me, as all val losses are identical for both zero-mAP and normal runs

@AyushExel
Contributor

This seems like a P0 problem. Let's prioritize it.
I've seen similar things happen when some hyperparameter is set to a bad value. Were there any recent commits related to hyps?

@glenn-jocher
Member

@pourmand1376 so it seems like the git bisect approach is correct, but the test needs high statistics, i.e. if we run 30 trainings and get 30 good results we can be about 97% sure there's no bug in that commit.
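
As a rough editorial sketch of that statistical argument (the true per-run bug rate is unknown; the rates below are assumptions taken from the counts reported in this thread):

# Probability that a genuinely buggy commit would have shown at least one
# zero-mAP run among N clean runs, for a few assumed per-run bug rates
def confidence_commit_is_clean(per_run_bug_rate: float, clean_runs: int) -> float:
    return 1 - (1 - per_run_bug_rate) ** clean_runs

for p in (0.10, 0.20, 1 / 3):
    print(f"assumed bug rate {p:.2f}: {confidence_commit_is_clean(p, 30):.1%} after 30 clean runs")
# roughly 95.8% at a 10% rate, about 99.9% at 20%, and >99.99% at 1/3 of runs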

@glenn-jocher
Member

glenn-jocher commented Aug 21, 2022

@AyushExel no, the last hyp changes were months ago. I think this might be related to EMA, to running val.py directly from train.py, or maybe to AMP/FP16 ops between train and val. But the strangest part is that val losses are always identical across all models, regardless of whether mAP is 0 or correct.

Another observation is that calling val.py directly always works. The bug is in val during training only. In other words, there is absolutely nothing wrong with training or with the models being trained; what's wrong is that val.py during training erroneously reports zero mAP.

@glenn-jocher
Member

Another thing I suspected was the introduction of identical filenames in /classify, which has never happened before in the repo. If that were the cause, though, the bug would appear directly in the v6.2 tag.

@pourmand1376
Contributor

pourmand1376 commented Aug 21, 2022

Hmm, I am using

for _ in range(30):
  !python train.py --img 640 --batch 16 --epochs 2 --data coco128.yaml --weights yolov5x.pt --cache

So far:

!git bisect start
!git bisect bad
!git bisect good 20f1b7ea086fb317da93d2e603c4def2ebfb8187
!git bisect good fe809b8dad5236d86d5acbe047b5e0e6895b2b8a
!git bisect bad 4bc5520e9424cdb0bd73bffb091b85934d5096c8
!git bisect bad 27fb6fd8fc21c20290041f38046d7a60ae8c6e3a
!git bisect bad eb359c3a226f55c9b51efcfeae2e31c820e6e08a
!git bisect good 5c854fab5e43df82ebfd51197c2dc58e5212c5a6
!git bisect good d40cd0d454dcc34312cb5c40f45f64b76665c40c
!git bisect bad de6e6c0110adbb41f829c1288d5cdab7105892ae
!git bisect bad 61adf017f231f470afca2636f1f13e4cce13914b

Once we know which commits are bad, it will be a lot easier to pin down the problem.

@glenn-jocher
Member

@pourmand1376 oh, YOLOv5x will be slow. You can use the default Colab training command; the bug will appear there:

for _ in range(10):
  !python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --cache

@glenn-jocher
Member

I trained the v6.2 tag 15x and all trainings were correct. So the bug is apparently not in the v6.2 release, but was introduced in the 38 commits since then.

@pourmand1376
Contributor

pourmand1376 commented Aug 21, 2022

I am updating the git bisect commits. The good ones are not certain, since I am running 4 concurrent Colab sessions, but the bad ones are certain.

@robotaiguy

robotaiguy commented Aug 21, 2022

I'm still on commit d1dfcab and have no issues with loss, P, R, mAP, etc.
I did notice that whether the values appeared as NaN or 0 depended on where they were displayed... and training did not halt... in fact, it felt normal, other than not seeing the metrics.

And I've been training 2 models (one locally and one on Colab) all night and morning.

@pourmand1376
Contributor

pourmand1376 commented Aug 21, 2022

This is what bisect reports ...

61adf01 is the first bad commit

commit 61adf017f231f470afca2636f1f13e4cce13914b
Author: Glenn Jocher <[email protected]>
Date:   Thu Aug 18 20:12:33 2022 +0200

    `torch.empty()` for speed improvements (#9025)
    
    `torch.empty()` for speed improvement
    
    Signed-off-by: Glenn Jocher <[email protected]>

:040000 040000 c7c88a2192877d503c0ef118407ca9ba9062e3f3 03d4349010436426c5870f8cab61b453890f5aeb M	models
:040000 040000 9c18cbfcbcf0c2c58e9d83d91cfdc31bd652d6b0 38a9925a4fe83db85246725490edbd5455cedb1e M	utils

@glenn-jocher
Member

@pourmand1376 ohhhhhhh this makes sense, since torch.empty is not reproducible! Awesome @pourmand1376, this is a huge help. I'll revert this commit immediately; the speed gains from the torch.zeros to torch.empty switch are tiny in any case.
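
A minimal illustration of that non-reproducibility point (editorial sketch, not from the repo): torch.zeros always returns the same values, while torch.empty returns whatever happens to be in the allocated memory.

import torch

print(torch.zeros(3))  # tensor([0., 0., 0.]) on every run
print(torch.empty(3))  # uninitialized memory: arbitrary values that may differ run to run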

@AyushExel
Contributor

@glenn-jocher @pourmand1376 quick question - do we know WHY this caused issues? From the docs I understand that it creates an uninitialized tensor with arbitrary values (which explains the non-reproducibility), but I don't get why the metrics would be stuck at 0?

@glenn-jocher
Member

@pourmand1376 thanks for the help!! I've pushed #9068 with a fix by @0zppd just now

@pourmand1376
Contributor

pourmand1376 commented Aug 21, 2022

@AyushExel
I don't know. Maybe this should be reported as a bug upstream in PyTorch; we could raise it with the main PyTorch contributors ...

At the very least, they should mention this behaviour in their docs.

@glenn-jocher
Member

glenn-jocher commented Aug 21, 2022

@giuseppebrb @athrunsunny @pourmand1376 @robotwhispering @AyushExel good news 😃! Your original issue may now be fixed ✅ in PR #9068 by @0zppd. @pourmand1376 tracked down the problem using git bisect, running each training 10x since the bug was not reproducible on every training but showed up in maybe 1/3 of them.

The original issue was that I had replaced torch.zeros() with torch.empty() in some ops like warmup and profiling to try to get slight speed improvements, and one op in particular ran a torch.empty() tensor through the model while it was in .train() mode, leading the batchnorm layers to fold those values into their tracked statistics. Since torch.empty() is uninitialized, it can contain extremely high or low values, leading some batchnorm layers to randomly output NaN.

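A minimal sketch of that failure mode (not the actual YOLOv5 code; the injected inf below stands in for the arbitrary values that uninitialized memory can contain, which is also why the bug only appeared on some runs):

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
bn.train()

x = torch.empty(1, 3, 64, 64)   # uninitialized memory; contents are arbitrary
x[0, 0, 0, 0] = float("inf")    # simulate an extreme garbage value
_ = bn(x)                       # batch statistics now contain inf/NaN ...
print(bn.running_mean, bn.running_var)  # ... and so do the tracked running stats

bn.eval()
out = bn(torch.randn(1, 3, 64, 64))  # later validation passes inherit the damage
print(out.isnan().any())             # tensor(True) -> predictions, and hence mAP, collapse
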
The PR has been extensively tested on 10x Colab trainings and all 10 came back good now:
[Screenshot 2022-08-21 at 15:40:00]

To receive this update:

  • Git – git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
  • PyTorch Hub – force-reload with model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • Notebooks – view the updated notebooks in Colab or Kaggle
  • Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

@athrunsunny
Author

@glenn-jocher @pourmand1376 @0zppd Thank you for your great work

@shubhambagwari

I forked the latest version of YOLOv5, but I am still facing this error.
Earlier I trained the model with 2,000 images and faced the same issue.

Roboflow was used to label the custom dataset.
I have read the earlier discussion, but the suggestions are not working for me.


@glenn-jocher
Member

@shubhambagwari try 300 epochs.

@Davegdd

Davegdd commented Dec 21, 2022

@shubhambagwari try 300 epochs.

Thank you for your answer. I was having the same issue; it resolved after increasing the number of epochs beyond 3.

@umarbutler

@shubhambagwari try 300 epochs.

Thank you for your answer. I was having the same issue; it resolved after increasing the number of epochs beyond 3.

Thanks for this, I quit my training a number of times trying to fix the 0 P, R and mAP values in the first two epochs, not realising that they would only start to increase by epoch 3 😆
