Yolov5s model converging differently between latest yolov5 and dated yolov5 (5 months) #7027
@saumitrabg 👋 hi, thanks for letting us know about this possible problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.
How to create a Minimal, Reproducible Example
When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:
- ✅ Minimal – Use as little code as possible to produce the problem
- ✅ Complete – Provide all parts someone else needs to reproduce the problem
- ✅ Reproducible – Test the code you're about to provide to make sure it reproduces the problem
For Ultralytics to provide assistance your code should also be:
- ✅ Current – Verify that your code is up-to-date with GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been solved in master.
- ✅ Unmodified – Your problem must be reproducible using official YOLOv5 code without changes. Ultralytics does not provide support for custom code.
If you believe your problem meets all the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template with a minimum reproducible example to help us better understand and diagnose your problem. Thank you! 😃 |
FWIW, I see a difference in hyperparameters between my old version and new version of YOLOv5. Old: New: |
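A quick way to surface exactly which defaults changed is to diff the hyperparameter files of the two checkouts. A minimal sketch assuming two side-by-side clones; the file paths below are assumptions, since the hyp files were moved and renamed across releases (locate yours with find . -name 'hyp*.yaml'):

# compare default training hyperparameters between an old and a new clone
diff yolov5-old/data/hyp.scratch.yaml yolov5-new/data/hyps/hyp.scratch-low.yaml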
@saumitrabg judging by your results it seems pretty apparent that, all else being equal, the better-performing model simply started from pretrained weights and the lower one didn't. |
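For reference, the two starting points are chosen on the train.py command line; a minimal sketch where DATA.yaml is a placeholder:

# start from COCO-pretrained weights
python train.py --data DATA.yaml --weights yolov5s.pt
# start from scratch (random initialization) using a model config instead of weights
python train.py --data DATA.yaml --weights '' --cfg yolov5s.yaml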
@glenn-jocher yes, the same datasets. The yolov5m model was trained with the same real datasets, starting from a base model that was trained on a bunch of synthetic data. This is the same mAP curve as the yolov5s model with the new code as well: exact same datasets. However, with the older YOLO code we do the same thing (train with real datasets from a base model trained on synthetic data) and get a much higher mAP score after 300 epochs. We would ideally like to keep adding new incremental data to the previous model, but as the YOLO code changes, that is not possible. So we are keeping versions of datasets where the first model gets trained on the default YOLO model that comes in the repo. What else can we explore? If you look at the charts, the new models off the new YOLO code don't go beyond 0.2 mAP even after training on the same datasets for more epochs. That didn't happen with the older code. The other thing is that the x/lr0 curve is very different with the new YOLO code: it is always a straight line now. |
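A side note on that straight-line x/lr0 curve: it is consistent with a linear learning-rate schedule, and newer YOLOv5 checkouts changed the default scheduler from one-cycle cosine to linear while exposing a flag to opt back into cosine. Treat the flag as an assumption to verify against your checkout (python train.py --help):

# restore the older cosine LR schedule on checkouts that support the flag
python train.py --data DATA.yaml --weights yolov5s.pt --cos-lr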
@saumitrabg I would put any differences down to your implementation or user error. All YOLOv5 models are trained on COCO from scratch on each release and results improve slightly in most cases. |
@glenn-jocher Thanks. To make sure there is no user error, we went back, re-downloaded the yolov5 repo, and retrained, but it still shows the same behavior. We have been training yolov5 models for 1.5 years now (they are really great) and it is quite simple actually: just change the coco128.yaml file with the corresponding train/val datasets, pick the right rect size (640), and things have worked great. Also, our coco128.yaml file was old and had a different format for the train/val datasets, and we fixed that to make sure we start with a clean slate. Not much progress. We will keep inspecting for user error, though there is not much setup needed to train. |
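For completeness, the workflow described above reduces to a single command once the data yaml is edited; a minimal sketch where the yaml name is a placeholder:

# 640px rectangular training on a custom dataset, from pretrained weights
python train.py --img 640 --rect --data custom_data.yaml --weights yolov5s.pt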
@saumitrabg got it. We are training multiple models (i.e. 8+ models in parallel right now) across COCO and VOC both from scratch (COCO) and from pretrained (VOC) as part of our normal R&D, and both are operating and training correctly, so I don't see any sign of training issues today. |
@glenn-jocher any clue how we can make progress? See how we get a higher mAP score with the older yolov5 default weights. A few things:
|
@saumitrabg the only thing I can think of is an AutoAnchor bug which was resolved last week. See #7067 and #7060. If you could provide a fully reproducible example of what you are seeing then we could start debugging it, but lacking that there is nothing for us to do. A reproducible example would be one data.yaml with autodownload capability and two branches that you say perform very differently.
git clone https://github.com/ultralytics/yolov5 yolov5-1 -b BRANCH1
cd yolov5-1
python train.py --data DATA.yaml
cd ..
git clone https://github.com/ultralytics/yolov5 yolov5-2 -b BRANCH2
cd yolov5-2
python train.py --data DATA.yaml |
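For reference, "autodownload capability" here means a data yaml whose download: key fetches the dataset automatically when it is missing locally. A minimal sketch of such a file, written from the shell; every path, class name, and URL below is a placeholder modeled on coco128.yaml:

cat > DATA.yaml <<'EOF'
path: ../datasets/custom        # dataset root directory
train: images/train             # train images, relative to path
val: images/val                 # val images, relative to path
nc: 2                           # number of classes
names: ['class0', 'class1']     # class names
download: https://example.com/custom_dataset.zip  # placeholder URL, fetched if dataset is missing
EOF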
@glenn-jocher sure, we will provide the debug information (2 yolo snapshots). Do you recommend a particular branch from December 2021 that we can use? |
@saumitrabg well if you're saying v5.0 and master are producing different results then:
git clone https://github.com/ultralytics/yolov5 yolov5-1 -b v5.0
cd yolov5-1
python train.py --data DATA.yaml
cd ..
git clone https://github.com/ultralytics/yolov5 yolov5-2 -b master
cd yolov5-2
python train.py --data DATA.yaml |
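One way to get a dated snapshot (per the December 2021 question above) is to pin a clone to the last commit before a given day; a minimal sketch where the cutoff date is just an example:

git clone https://github.com/ultralytics/yolov5 yolov5-dec2021
cd yolov5-dec2021
# check out the last commit on master made before the chosen date
git checkout $(git rev-list -n 1 --before="2021-12-01" master)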
@glenn-jocher We confirmed that your v6.0 branch works well, while master and even a 2-month-old branch (tests/aws) don't work with default settings. We will stay on v6.0 for now and will move from the small to the medium AI weights; however, we would like to understand what you need from us to help debug this. All models are trained on medium weights, and with our data on 4x T4 GPUs it takes 50-60 hrs to train. The red line (the 1st v6.0 model) had mAP go to 0 after the 48th epoch, so we restarted the 2nd v6.0 model using the 48th-epoch best.pt as a baseline and assume they are the same continuation. A few other things that I saw:
|
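For anyone reproducing this workaround, pinning to the v6.0 tag and restarting from a saved checkpoint looks roughly like the following; the checkpoint path is an assumption, since YOLOv5 numbers its runs/train/exp folders per run:

git clone https://github.com/ultralytics/yolov5 -b v6.0
cd yolov5
# continue training from the epoch-48 checkpoint of the earlier run
python train.py --data DATA.yaml --weights ../old-run/weights/best.pt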
@saumitrabg is this just on your dataset? If you train coco128.yaml to 300 epochs do you see the same performance on both branches? |
@glenn-jocher we have only tried on our datasets since we are building custom AI models. Regardless of the number of epochs, the master branch performs worse from the get-go. |
@saumitrabg we need to be able to reproduce this ourselves, otherwise there is nothing for us to investigate. For example, the official v6.0 and v6.1 model records are here, and you can see near-identical performance across all 10 YOLOv5 models on the COCO dataset between the two versions: |
@glenn-jocher if that was your conclusion, we should not have been told to reproduce between v6.0 and latest master :-). |
@saumitrabg yes it's good you've confirmed a difference, but for us to investigate we need to be able to reproduce the difference ourselves, i.e. we would need your dataset and your data.yaml so we can run your same command and then try to figure out where the differences are originating from. It seems the differences appear in less than 10 epochs, so it shouldn't take long, we just need your dataset, or any other dataset that you see is also producing the same behavior. |
👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed! Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐! |
This also happened in my case. @saumitrabg, is there any solution you used to tackle the problem? Thank you. |
@TimbusCalin 👋 hi, thanks for letting us know about this possible problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to start investigating a possible problem.
How to create a Minimal, Reproducible Example
When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:
- ✅ Minimal – Use as little code as possible to produce the problem
- ✅ Complete – Provide all parts someone else needs to reproduce the problem
- ✅ Reproducible – Test the code you're about to provide to make sure it reproduces the problem
For Ultralytics to provide assistance your code should also be:
- ✅ Current – Verify that your code is up-to-date with GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been solved in master.
- ✅ Unmodified – Your problem must be reproducible using official YOLOv5 code without changes. Ultralytics does not provide support for custom code.
If you believe your problem meets all the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template with a minimum reproducible example to help us better understand and diagnose your problem. Thank you! 😃 |
We did 2 things:
1. Went back to v6.0 and kept our YOLOv5 code locked at that version.
2. I believe we changed lr0 in the hyperparameters. Make it lower to see if that helps.
|
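Lowering lr0 does not require editing the repo defaults; you can copy a hyperparameter file and pass it with --hyp. A minimal sketch, assuming a v6.0-era checkout where the default hyps live at data/hyps/hyp.scratch.yaml (the file name varies across releases, and the halved value is just an example):

# copy the default hyperparameters and halve the initial learning rate
cp data/hyps/hyp.scratch.yaml hyp.custom.yaml
sed -i 's/^lr0: 0.01/lr0: 0.005/' hyp.custom.yaml
python train.py --data DATA.yaml --weights yolov5s.pt --hyp hyp.custom.yaml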
Question
I have tried the same dataset with both yolov5s and yolov5m. My mAP scores are not converging as well as they used to with the new code. Did I miss any tuned parameters?
