YOLO loss #2376
Conversation
…me scale for coordinates as the official implementation
Honestly, after debugging the code and running many experiments, I can't see what I am doing wrong. The training loss goes down quite fast at the beginning, but then it stays constant and pushes the detection confidences to 0.

The most significant change that I made is having a sigmoid layer before the detector output:

y = σ(x) = 1 / (1 + e^(-x))  →  x = log(y / (1 - y))

Then, when I apply the exponential to get the final width and height, it simplifies to just y / (1 - y), where y is the output after the sigmoid (0 < y < 1, so the training is stable). This allows me to apply the sigmoid everywhere using the built-in cuDNN function, which happens in place, while still being mathematically equivalent. I even made that change to Darknet and reproduced the results on VOC2012, so the problem is not there. I am a bit surprised that nobody is doing it this way... So, any help or insights would be appreciated :) |
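For what it's worth, here is a tiny numerical check of the equivalence described above (just an illustration added here, not code from this PR):

```cpp
#include <cassert>
#include <cmath>

// If y = sigmoid(x), then exp(x) == y / (1 - y), so the exponential applied
// to the raw output for width/height can be replaced by y / (1 - y) computed
// on the sigmoid output, which stays in (0, 1) and keeps training stable.
int main()
{
    const double x = 1.7;                         // some raw network output
    const double y = 1.0 / (1.0 + std::exp(-x));  // sigmoid(x)
    const double via_exp     = std::exp(x);       // what the paper applies
    const double via_sigmoid = y / (1.0 - y);     // equivalent on the sigmoid output
    assert(std::abs(via_exp - via_sigmoid) < 1e-9);
    return 0;
}
```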
Thank you for all this hard work. Unfortunately I don't have time to spend on this at the moment. Maybe you could try a very simple backbone with a single yolo layer and therefore a single yolo loss function. Maybe that will help to debug. Maybe the yolo layer could be at stride 16 and you could train on VOC which is easier than COCO. |
Maybe a VGG backbone. You could even train it on darknet first. Then port the weights using your visitor. Then check the forward pass works correctly in dlib. Then go from there trying to restart the training in dlib. Maybe you could re-randomise the last couple conv layers before the yolo layer and train those. |
This is why autograd is so awesome. There are no headaches when writing custom modules and custom loss functions. |
Thank you for those suggestions :) I am still motivated to make this work, I hope not to lose the momentum... |
Well, after a week of trying stuff, I could not make it work... At first, I thought it was a problem of weighting the different parts of the loss. Then I managed to get the objectness not to go to zero, but the network predicts the same bounding boxes for all images... My main concern is that I am not understanding how dlib works well enough to use multiple tensors in a single loss layer, and that I am making a conceptual mistake... Thank you in advance, and please, don't feel pressured or anything :) |
Yeah I’ll look in a bit when I get a chance :) |
Thank you :D there's no hurry |
@pfeatherstone since you seemed a bit skeptical, here are some results on this implementation trained from scratch on the COCO dataset (I didn't even initialize the …). The bounding box regression is not perfect, but it's been training for only about 12h. I think it's showing promising results, even if they are from the training set. I tried it on my webcam, and it also works. So, @davisking, no need to spend time on this now. I will run some extra tests and prepare this PR for review when I have time (maybe, for the example program, a simpler backbone would be better) :) |
That looks awesome. Great work! Do you have some idea of training benchmarks vs darknet or pytorch (ultralytics repos)? It would be interesting to know the memory usage when training at, say, 416×416 and the number of epochs required before plateauing. |
I'm quite excited by this as this might draw more dlib users and maybe more PRs |
Thank you! Frankly, I've never used YOLOv5... I've only used Darknet and recently YOLOR (which is based on Scaled-YOLOv4, which in turn is based on YOLOv5, I think). I've never liked the experience of those toolkits/repos. Everything is overly complicated to bypass the shortcomings of Python (performance and multi-threading). That is the main reason why I was motivated to add this to dlib, as it has the best user experience, for me... and I won't have to use those anymore. Regarding the plateauing, it's hard to compare, since each toolkit might normalize the gradient differently (e.g. using the batch size, the layer dimensions, etc.), so the learning rate has different meanings in each toolkit/repo. Anyway, using the default scheduler in dlib makes me have one less thing to worry about. |
You might get away without burn-in, but typically with yolo models, gradients are unstable at the beginning. That's why burn-in is typically used. However, in my experience, when using CIOU loss, the gradients are way more stable and I can use a default scheduler. I wouldn't implement CIOU now though. Maybe a future PR. |
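As a point of reference, the burn-in mentioned above is usually just a polynomial ramp of the learning rate over the first iterations. A minimal sketch, assuming the common Darknet-style form lr·(iter/burn_in)^power, where the exponent is a config choice (4 is a typical value):

```cpp
#include <cmath>
#include <cstddef>

// Sketch of a Darknet-style burn-in (learning-rate warm-up): ramp the
// learning rate up polynomially for the first burn_in iterations, then let
// the regular scheduler take over. The default values here are assumptions.
double burn_in_learning_rate(std::size_t iter,
                             double base_lr = 0.001,
                             std::size_t burn_in = 1000,
                             double power = 4.0)
{
    if (iter < burn_in)
        return base_lr * std::pow(static_cast<double>(iter) / burn_in, power);
    return base_lr;  // hand over to the normal schedule afterwards
}
```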
After visualizing some data, it does not make sense to use this feature for YOLO detectors.
Yes, I told you there were bugs, that's why I didn't want to share it yet. |
Thank you everyone for the hard work! Not easy doing all of this in our free time, even harder to convince employer to do this at work. I'm excited for all of this. I do believe it will attract more users to dlib and therefore more PRs and more goodies. |
Just a quick update: I attached a part of my training loop as an example.
I run the code on Windows, just FYI. What's interesting is that when training YOLO models you usually get to around 80% of the accuracy quite fast, and then it takes a lot of time to gain the last % of mAP (which makes total sense). But already at this stage the mAP is much higher for the same dataset than what we have with the dlib implementation. Another finding (as you can also see in the loss) is that it performs worse on the train data without augmentation in testing than on the actual validation data (I attached some images), which is a bit weird, and I have to check whether the data augmentation is buggy. I get that the loss value is better on the test data (no augmentation vs. with), but when both are loaded without augmentation, the train data should perform way better. It might also be an idea to start with a very basic v3 model to mimic the results and the training and see how it compares. |
@VisionEp1 thank you for checking on this :) Some notes: you can use the … Using the … |
Just a thought, it might be a good idea to benchmark against other 'vanilla' repositories that haven't optimised the training loops and the image augmentation, such as frgfm/Holocron. If you get similar performance then it's likely there is nothing wrong with the loss function but more an issue with optimising training loops, augmentation, hyper-parameter tuning and things like that. |
@pfeatherstone that's exactly what I wanted to do when starting with a smaller set. I don't know Holocron, etc. I also turned off mixup now, since that's not supported for detection in the default YOLO repository, to see if that makes a difference. However, I still don't understand the "wrong" class errors I get on the train data without augmentation during testing, which I don't have on the test data. It's always weird when the train data performs worse than the test data. I was pretty sure I had some wrongly annotated data when I checked with the visualizer once, but after debugging I could not reproduce it, so it might be lack of coffee for the dev or some random condition which only happens very rarely. I am doing some in-between COCO eval metrics now to see if that gives me any more ideas. |
I might also run some trainings with very low batch sizes on the AlexeyAB repo, just to make sure that's not the issue. Another thing I need to try is changing stuff like multi-GPU vs. 1 GPU, maybe different CUDA versions, etc. I also considered a more brute-force method, like the ones used in the AutoML context (but not on the architecture), where I create a tiny YOLO net, load in pretrained weights, and run loads of different trainings over the most common settings. Obviously this might also lead to nothing, but do you think the following could give some insights: I define the most common settings, mainly the yolo options, the data augmentation options, and the scheduler options. Do you have a ready dlib config which matches any known tiny YOLO network exactly? |
@VisionEp1 So @arrufat created a whole bunch of network definitions. https://github.com/dlibml/dnn/blob/master/src/detection/yolov5.h contains yolov5_n which is small. |
Theoretically, if you use a small network, which can do a few epochs in not too much time, you could define hyper-parameter ranges and use http://dlib.net/optimization.html#find_min_global to find optimal values. I highly doubt this would work in practice, but you could try anyway. |
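A minimal sketch of that idea, assuming a hypothetical train_and_evaluate function that runs a short training with the given hyper-parameters and returns the value to minimize; the parameter choices and bounds below are placeholders, not recommendations:

```cpp
#include <dlib/global_optimization.h>
#include <dlib/matrix.h>
#include <iostream>

// Hypothetical objective: run a short training with the given hyper-parameters
// and return something to minimize (e.g. validation loss or negative mAP).
double train_and_evaluate(double learning_rate, double iou_anchor_threshold)
{
    // ... train a small model, evaluate it, return the score ...
    return 0.0;  // stubbed out for the sketch
}

int main()
{
    // Every call to the objective is a full (short) training run, so keep the
    // function-call budget small.
    const auto result = dlib::find_min_global(
        [](double lr, double iou_anchor) { return train_and_evaluate(lr, iou_anchor); },
        {1e-4, 0.1},                    // lower bounds: learning rate, iou-anchor
        {1e-2, 0.5},                    // upper bounds
        dlib::max_function_calls(20));

    std::cout << "best value:  " << result.y << "\n"
              << "best params: " << dlib::trans(result.x) << std::endl;
}
```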
People do this already, but using genetic algorithms to find optimal parameters. I don't know how …
@VisionEp1 I am pretty confident about those YOLOv5 definitions; I see no need to try the YOLOv3 or YOLOv4 ones. From my experience, on an internal dataset, the training code I shared achieves an mAP of almost 90%, while on the test set it reaches only about 66%. |
Thanks for the answers. For me it's not about finding perfect values; it's more about gathering information on where to look for the performance drop. You stated what you got for the yolov5l, which is way, way lower than, for example, the ultralytics one on the same network definition (0.50:0.95: 49.0), and even an old v4 which I trained from scratch to compare scores: … Thanks again for all the help so far. I will test and see if I can find some more ideas. |
Also, I would look at the optimizer. @arrufat correct me if I'm wrong, but the optimizer is just vanilla SGD. The PyTorch repos group weights into 3 different groups depending on what requires a weight decay, and biases take a higher lr, I think. That could make a difference. |
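To make that concrete, this is roughly what "vanilla SGD" looks like on the dlib side: one sgd solver with a single weight decay and momentum shared by every parameter, rather than per-group settings. The tiny net_type below is just a placeholder, not the detector discussed here:

```cpp
#include <dlib/dnn.h>

// Placeholder network: any dlib network type would do; the point is only how
// the solver is configured.
using net_type = dlib::loss_multiclass_log<
                     dlib::fc<10,
                     dlib::input<dlib::matrix<float>>>>;

int main()
{
    net_type net;
    // sgd(weight_decay, momentum): the same two values apply to all layers,
    // unlike the PyTorch repos that use separate parameter groups.
    dlib::dnn_trainer<net_type> trainer(net, dlib::sgd(0.0005f, 0.9f));
    trainer.set_learning_rate(0.001);
    // ... trainer.train(...) with the actual data would go here ...
}
```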
Yes, what I am trying to say is that I don't think the performance drop comes from the network architecture. But you don't have to believe me: in the official dlib example, that network is exactly the same as the one presented in the YOLOv3 paper. I even tried at some point to fine-tune from the official Darknet53 weights. I would say that the main differences are:
So, you could use that network in the training code I shared with you, however, I don't think that'll make a difference. |
Oh, I agree the network definitions are correct. I would use any architecture, even something that isn't an official YOLO model, maybe with only 1 head, and experiment with different training setups. |
I know the YOLOv5 code uses different momentum and bias lr settings for the warm-up and the main training stages, but I haven't looked into that deeply. |
I completely agree. I was not trying to find bugs where there are none, and I am very sorry if that came across wrong. Thanks for the answers, I will go ahead and try different training setups. |
@VisionEp1 no, you are right to question everything, I don't even trust the 1-year-ago version of myself :P |
A genetic algorithm will be much worse. |
@VisionEp1 @carkaci I fixed the YOLOv7 version, I've started a training a couple of hours ago, and it seems to work... Feel free to try it out: https://github.com/dlibml/dnn/blob/master/src/detection/yolov7.h |
Thank you, I will test it.
|
I have recently started a training run. Everything seems OK.
|
@arrufat Do you have an update on training? I'm very interested. Do you also have the same curves for yolov5, v4 and v3? It's a lot to ask, I know :) ! If they're all similar, then it will probably indicate that the augmentation, training recipes and things like that are what contribute the most to performance. |
@pfeatherstone I do not have curves for all those models. Currently, I only have them for YOLOv5m and YOLOv7 under the same training settings, except for the batch size (48×2 vs 14×2), but using the same learning rate scheduler, so YOLOv7 has way more steps per epoch. Check out the training curves:

YOLOv5m (training curves)

YOLOv7 (training curves)

Training config

I have mild data augmentation, as you can see from here.

./build/Release/train \
--name yolov7-coco2017 \
--size 512 \
--batch-gpu 14 \ # this is set to 48 for YOLOv5m
--gpus 2 \
--warmup 1 \
--epochs 100 --cosine \
--learning-rate 0.001 \
--min-learning-rate 0.0001 \
--momentum 0.9 \
--weight-decay 0.0005 \
--iou-anchor 0.2 \
--iou-ignore 0.7 \
--lambda-box 1 \
--lambda-obj 1 \
--lambda-cls 1 \
--gamma-obj 0 \
--gamma-cls 0 \
--mosaic 0 \
--angle 3 \
--min-coverage 0.5 \
--hsi 0.0 0.0 0.0 \
--scale 0.5 \
--shift 0.2 \
--perspective 0.00 \
--mixup 0.00 \
data/coco2017 | tee -a training.log |
Interesting that v7 learns quicker. Thanks a lot! |
Maybe we can continue the discussion here: dlibml/yolo-object-detector#2 |
Like Arrufat, I also have the feeling it learns faster than YOLOv5. I set "--patience=200000", that is why training takes 5-6 days. I will let you know the result when finished.
|
Oh, |
Hi, I've been spending the last few days trying to make a loss_yolo layer for dlib, in particular the loss presented in the YOLOv3 paper. I think I came up with a pretty straightforward implementation but, as of now, it still does not work.

I wondered if you could have a look. I am quite confident the loss implementation is correct; however, I think I might be making some assumptions about the dlib API when the loss layer takes several inputs from the network.

I tried to make the loss similar to the loss_mmod in the way you set the options of the layer, etc. So, my question is: does this way of coding the loss in dlib make sense for multiple outputs? Or is dlib doing something I don't expect?

There's also a simple example program that takes a path containing a training.xml file (like the one from the face or vehicle detection examples).

Thanks in advance :)