YOLO loss #2376

Merged · 65 commits · Jul 30, 2021

Conversation

arrufat
Contributor

@arrufat arrufat commented Jun 6, 2021

Hi, I've been spending the last few days trying to make a loss_yolo layer for dlib, in particular the loss presented in the YOLOv3 paper.

I think I came up with a pretty straightforward implementation but, as of now, it still does not work.

I wondered if you could have a look. I am quite confident the loss implementation is correct; however, I think I might be making some wrong assumptions about how the dlib API behaves when the loss layer takes several inputs from the network.

I tried to make the loss similar to loss_mmod in the way the layer options are set, etc.

So, my question is, does this way of coding the loss in dlib make sense for multiple outputs? Or is dlib doing something I don't expect?

There's also a simple example program that takes a path containing a training.xml file (like the one from the face or vehicle detection examples).

Thanks in advance :)

@arrufat arrufat marked this pull request as draft June 6, 2021 16:12
@arrufat
Contributor Author

arrufat commented Jun 13, 2021

Honestly, after debugging the code and running many experiments, I can't see what I am doing wrong.
I have even loaded the pretrained backbone, initialized the anchor weights to 0.5 as done here, and used the same scales for the loss and the gradient updates as in the official YOLO code, but still couldn't make it work... Maybe @pfeatherstone is right and it's not as straightforward as I thought. I couldn't see any weird tricks in the training process in the official repository, though...

The training loss at the beginning goes down quite fast, but then it stays constant and pushes the detection confidences to 0.
I will keep checking, but I am open to suggestions. I think we're close to having this loss in dlib (unless I am making some terrible assumption about how dlib works, or a big conceptual mistake).

The most significant change that I made is having a sigmoid layer before the detector output.
As you may know, the box coordinates are computed like this:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)

where (c_x, c_y) is the grid cell offset and (p_w, p_h) are the anchor box dimensions.
And the objectness and classifier use a (multi) binary log loss, which means that a sigmoid is applied everywhere, except for the width and height outputs.
So, I decided to apply it everywhere, and then recover the original width and height outputs by inverting the sigmoid:

y = σ(x) = 1 / (1 + e^(-x)) → x = log(y / (1 - y))

Then, when I apply the exponential to get the final width and height, it simplifies to just:

y / (1 - y)

where y is the output after the sigmoid (0 < y < 1, so the training is stable). This allows me to apply the sigmoid everywhere using the built-in cuDNN function, which happens in place, while still being mathematically equivalent. I even made that change to Darknet and reproduced the results on VOC2012, so the problem is not there. I am a bit surprised that nobody is doing it this way...
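
As a sanity check of the equivalence above, here is a small, dlib-free numerical sketch (my own illustration, not code from this PR):

```cpp
#include <cmath>
#include <cstdio>

// If y = sigmoid(x) = 1 / (1 + e^(-x)), then y / (1 - y) = e^x, so squashing the raw
// width/height outputs with a sigmoid and later computing y / (1 - y) is equivalent
// to applying exp() to the raw outputs, as described above.
int main()
{
    for (double x = -4.0; x <= 4.0; x += 1.0)
    {
        const double y = 1.0 / (1.0 + std::exp(-x)); // sigmoid, applied "in place"
        const double recovered = y / (1.0 - y);      // should equal exp(x)
        std::printf("x = %+.1f  exp(x) = %9.6f  y/(1-y) = %9.6f\n",
                    x, std::exp(x), recovered);
    }
    return 0;
}
```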

So, any help or insights would be appreciated :)

@pfeatherstone
Contributor

Thank you for all this hard work. Unfortunately I don't have time to spend on this at the moment. Maybe you could try a very simple backbone with a single yolo layer and therefore a single yolo loss function. Maybe that will help to debug. Maybe the yolo layer could be at stride 16 and you could train on VOC which is easier than COCO.

@pfeatherstone
Contributor

Maybe a VGG backbone. You could even train it on darknet first. Then port the weights using your visitor. Then check the forward pass works correctly in dlib. Then go from there trying to restart the training in dlib. Maybe you could re-randomise the last couple conv layers before the yolo layer and train those.

@pfeatherstone
Contributor

This is why autograd is so awesome. There are no headaches when writing custom modules and custom loss functions.

@arrufat
Contributor Author

arrufat commented Jun 13, 2021

Thank you for those suggestions :)
I didn't try a simpler backbone, but I did try a single YOLO output to see if there was a problem with multiple outputs; that didn't work either. I might try that again, since I have addressed several problems in the loss function since then.
Frankly, I am not too worried about the gradients: it's all MSE and binary log loss, so quite straightforward. What bothers me is that the objectness confidence, at least, should be easy to learn, but it gets pushed to 0 (I tried weighting the terms differently, even though in the official implementation all the weights (obj, bbr and class) are set to 1).

I am still motivated to make this work, I hope not to lose the momentum...

@arrufat
Contributor Author

arrufat commented Jun 19, 2021

Well, after a week of trying stuff, I could not make it work...

At first, I thought it was a problem of weighting the different parts of the loss; then I managed to get the objectness not to go to zero, but the network predicts the same bounding boxes for all images...
I will keep working on it, but @davisking, do you think you can have a look at some point?
I know it's a lot to ask, and I understand it's not a priority.

My main concern is that I don't understand how dlib works well enough to use multiple tensors in a single loss layer, and that I am making a conceptual mistake...

Thank you in advance, and please, don't feel pressured or anything :)

@davisking
Owner

Yeah I’ll look in a bit when I get a chance :)

@arrufat
Contributor Author

arrufat commented Jun 19, 2021

Thank you :D there's no hurry

@arrufat
Contributor Author

arrufat commented Jun 23, 2021

@pfeatherstone since you seemed a bit skeptical, here are some results on this implementation trained from scratch on the COCO dataset (I didn't even initialize the input_rgb_image with (0, 0, 0), just the default values from dlib.)
(example detection results on images from the COCO training set)

The bounding box regression is not perfect, but it has only been training for about 12 hours. I think it's showing promising results, even if they are from the training set. I tried it on my webcam, and it also works.

So, @davisking, no need to spend time on this now; I will run some extra tests and prepare this PR for review when I have time (maybe, for the example program, a simpler backbone would be better) :)
I am thrilled to have YOLO detectors working in dlib! As always, thank you for creating this outstanding library :)

@pfeatherstone
Contributor

That looks awesome. Great work! Do you have any idea of training benchmarks vs Darknet or PyTorch (the Ultralytics repos)? It would be interesting to know the memory usage when training at, say, 416×416, and the number of epochs required before plateauing.

@pfeatherstone
Contributor

I'm quite excited by this, as it might draw more dlib users and maybe more PRs.

@arrufat
Contributor Author

arrufat commented Jun 23, 2021

That looks awesome. Great work! Do you have any idea of training benchmarks vs Darknet or PyTorch (the Ultralytics repos)? It would be interesting to know the memory usage when training at, say, 416×416, and the number of epochs required before plateauing.

Thank you!

Frankly, I've never used YOLOv5... I've only used Darknet and recently YOLOR (which is based on Scaled-YOLOv4, which in turn is based on YOLOv5, I think).

I've never liked the experience of those toolkits/repos. Everything is overly complicated to bypass the shortcomings of Python (performance and multi-threading). That is the main reason why I was motivated to add this to dlib, as it has the best user experience, for me... and I won't have to use those anymore.
In terms of memory, I think dlib uses slightly more memory than Darknet, but I can't tell for the others, since I didn't use them with the same backbones...

Regarding the plateauing, it's hard to compare, since each toolkit might normalize the gradient differently (e.g. using the batch size, the layer dimensions, etc.), so the learning rate has different meanings in each toolkit/repo.

Anyway, using the default scheduler in dlib gives me one less thing to worry about.
I never liked training with a fixed number of steps/epochs and decreasing the learning rate at arbitrary positions, which have to be decided beforehand.
That's why, in this PR, I use burn-in at the beginning (not sure if it's needed, though) and then fall back to the default learning rate scheduler.
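
For reference, this is roughly what a Darknet-style burn-in schedule looks like; a standalone sketch where the quartic ramp and the step counts are my assumptions, not values from this PR:

```cpp
#include <cmath>
#include <cstdio>
#include <initializer_list>

// Darknet-style burn-in: ramp the learning rate from ~0 up to the base rate over the
// first `burn_in` steps, then hand control back to the regular scheduler.
double burn_in_learning_rate(double base_lr, long step, long burn_in, double power = 4.0)
{
    if (step >= burn_in)
        return base_lr;
    return base_lr * std::pow(static_cast<double>(step) / burn_in, power);
}

int main()
{
    for (long step : {0L, 250L, 500L, 750L, 1000L, 2000L})
        std::printf("step %5ld -> lr %.6f\n", step, burn_in_learning_rate(0.001, step, 1000));
    return 0;
}
```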

@pfeatherstone
Contributor

You might get away without burn-in, but typically with YOLO models the gradients are unstable at the beginning; that's why burn-in is typically used. However, in my experience, when using CIoU loss the gradients are way more stable and I can use a default scheduler. I wouldn't implement CIoU now, though. Maybe a future PR.

@arrufat
Contributor Author

arrufat commented Jul 11, 2022

Hi Arrufat. I have used your candidate yolov7.h. However, even for "--batch-size=1" (RTX 3060, 6 GB VRAM), I am getting an insufficient memory error.

Yes, I told you there were bugs; that's why I didn't want to share it yet.

@pfeatherstone
Contributor

Thank you everyone for the hard work! It's not easy doing all of this in our free time, and even harder to convince an employer to let us do it at work. I'm excited for all of this. I do believe it will attract more users to dlib and therefore more PRs and more goodies.

@VisionEp1

Just a quick update: I attached part of my training log as an example.
training.log

.\train.exe --name yolov5l-coco2017 --size 512 --batch-gpu 6 --learning-rate 0.01 --min-learning-rate 0.0001 --gpus 4 --epochs 300 --cosine --iou-anchor 0.2 --iou-ignore 0.7 --gamma-obj 0.5 --gamma-cls 1.0 --mosaic 0.5 --angle 3 --min-coverage 0.5 --hsi 0.3 0.2 0.1 --scale 0.5 --shift 0.2 --perspective 0.01 --test-period 30 --mixup 0.01 X:\Datasets\coco2017 | tee -Append -FilePath "training.log"

(additional log attachment: [4.log](https://github.com/davisking/dlib/files/9121092/4.log))

I run the code on Windows, just FYI.
The mAP is still increasing.

What's interesting is that when training YOLO models you usually reach around 80% of the final accuracy quite fast, and then it takes a lot of time to gain the last few percent of mAP (which makes total sense). But that region is already at a much higher mAP for the same dataset than what we get with the dlib implementation.

Another finding (as you can also see in the loss):
images.zip

It performs worse on the training data without augmentation at test time than on the actual validation data (I attached some images), which is a bit weird, and I have to check whether the data augmentation is buggy.

I get that the loss value is better on the test data (no augmentation vs. with), but when both are loaded without augmentation, the training data should perform way better.

It might also be an idea to start with a very basic v3 model to mimic the results and the training and see how it compares.
No solutions yet, just sharing my findings in between.

@arrufat
Contributor Author

arrufat commented Jul 15, 2022

@VisionEp1 thank you for checking on this :)

Some notes: you can use the detect program from that repository to generate the detection images with nice-looking bounding boxes, without having to capture the window. I also noticed that there might be something wrong with the data augmentation. For instance, even without using mosaic, I get similar performance.

Using the train program, you can add the --visualize option to see what the data augmentation looks like. I couldn't see anything that might indicate I am doing something wrong, but maybe you have better eyes.

@pfeatherstone
Contributor

Just a thought: it might be a good idea to benchmark against other 'vanilla' repositories that haven't optimised the training loops and the image augmentation, such as frgfm/Holocron. If you get similar performance, then it's likely there is nothing wrong with the loss function, and it's more an issue of optimising training loops, augmentation, hyper-parameter tuning and things like that.

@VisionEp1

@pfeatherstone, that's exactly what I wanted to do when starting with a smaller set. I don't know Holocron, etc.
@arrufat, I think mosaic on/off might not be the best metric. It's a "hot" topic in the other GitHub repo: sometimes it helps, sometimes it doesn't. It usually helps when the dataset is big and the network is as well
(but it should help a bit on COCO + v5, I agree).

I also turned off mixup now, since that's not supported for detection in the default YOLO repository, to see if that makes a difference.

However, I still don't understand the "wrong" class errors I get on the training data without augmentation during testing, which I don't get on the test data. It's always weird when the training data performs worse than the test data.

I was pretty sure I had some wrongly annotated data when I checked with the visualizer once, but after debugging I could not reproduce it. So it might be lack of coffee on my part or some random condition which only happens very rarely.

I am computing some intermediate COCO eval metrics now to see if that gives me any more ideas.
Do you think it's possible for you to create an as-basic-as-possible vanilla v3 example (if you have time to do it)?

@VisionEp1

I might also run some training with very low batch sizes on the AlexeyAB repo, just to make sure that's not the issue.

Another thing I need to try is changing things like multi-GPU vs. single GPU, maybe different CUDA versions, etc.
I have had hardware-related bugs in the past on many different ML frameworks, and sometimes they are hard to detect.
However, that doesn't help if neither of us can reproduce mAP values which are close to the other repos. Did any of you try a small YOLO network with a single GPU only, to test this as a possible cause?

I also considered a more brute-force method, like the ones used in the AutoML context (but not on the architecture), where I create a tiny YOLO net, load pretrained weights, and run lots of different trainings over the most common settings.

Obviously this might also lead to nothing, but do you think the following could give some insights?

I define the most common settings (mainly the YOLO options, the data augmentation options and the scheduler options)
with a set of plausible values. Then I start training with random data points out of the possible configurations and use the dlib tools to get some insight into what the best combinations might be.

Do you have a ready dlib config which matches any known tiny YOLO network exactly?
For example https://raw.githubusercontent.com/AlexeyAB/darknet/master/cfg/yolov4-tiny.cfg
That way, following @pfeatherstone's approach, I can stay as close as possible to one of those networks.
I think the tiny ones make more sense (even though currently I train the normal one), since we can get some insights faster.

@pfeatherstone
Contributor

@VisionEp1 So @arrufat created a whole bunch of network definitions. https://github.com/dlibml/dnn/blob/master/src/detection/yolov5.h contains yolov5_n which is small.

@pfeatherstone
Contributor

Theoretically, if you use a small network, one that can do a few epochs in not too much time, you could define hyper-parameter ranges and use http://dlib.net/optimization.html#find_min_global to find optimal values. I highly doubt this would work in practice, but you could try anyway.
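
To make that concrete, here is a rough sketch of what such a search could look like with dlib::find_min_global. The objective below is a placeholder quadratic standing in for "run a short training with these hyper-parameters and return a score to minimize"; the parameter names and ranges are made up for illustration:

```cpp
#include <dlib/global_optimization.h>
#include <cmath>
#include <iostream>

int main()
{
    // Placeholder objective: in a real search this would launch a short training run
    // with the given hyper-parameters and return something like 1 - validation mAP.
    const auto train_and_eval = [](double learning_rate, double iou_anchor, double gamma_obj)
    {
        return std::pow(std::log10(learning_rate) + 3, 2)  // pretend 1e-3 is the best lr
             + std::pow(iou_anchor - 0.2, 2)
             + std::pow(gamma_obj - 0.5, 2);
    };

    const auto result = dlib::find_min_global(
        train_and_eval,
        {1e-4, 0.1, 0.0},               // lower bounds: learning rate, iou anchor, gamma obj
        {1e-2, 0.5, 2.0},               // upper bounds
        dlib::max_function_calls(30));  // each call would be one (short) training run

    std::cout << "best parameters: " << dlib::trans(result.x);
    std::cout << "best score: " << result.y << std::endl;
    return 0;
}
```

With a real objective, every function call is a full (short) training run, so the call budget has to stay small, which is why it may well not work in practice, as noted above.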

@pfeatherstone
Contributor

pfeatherstone commented Jul 19, 2022

People do this already, but using genetic algorithms to find optimal parameters. I don't know how dlib::find_min_global() fares against genetic algorithms, but they have the same goal.

@arrufat
Contributor Author

arrufat commented Jul 19, 2022

@VisionEp1 I am pretty confident about those YOLOv5 definitions; I see no need to try the YOLOv3 or YOLOv4 ones.

From my experience, on an internal dataset, the training code I shared achieves an mAP of almost 90%, while on the test set it reaches only about 66%.
If I train the same YOLOv5l model in PyTorch on that dataset, I get about 74% mAP on the test set. So, for me, it was a problem of the network not generalizing well enough. Sadly, these days I won't have much time to play with it.

@VisionEp1

Thanks for the answers.
I will go ahead and try the yolov5n. I assume it's the same one as in the Ultralytics repo.

For me it's not about finding perfect values; it's more about gathering information on where to look for the performance drop.

You stated that for YOLOv5l you got:

Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.330
Average Precision (AP) @[ IoU=0.50      | area= all | maxDets=100 ] = 0.543

which is way, way lower than, for example, the Ultralytics one on the same network definition:

0.50:0.95: 49.0
0.50: 67.3

Even an old v4 which I trained from scratch, to compare scores:

Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.496
Average Precision (AP) @[ IoU=0.50      | area= all | maxDets=100 ] = 0.682

Thanks again for all the help so far. I will test and see if I can find some more ideas.

@pfeatherstone
Contributor

Also, I would look at the optimizer. @arrufat, correct me if I'm wrong, but the optimizer is just vanilla SGD. The PyTorch repos group the weights into 3 different groups depending on what requires a weight decay, and biases take a higher learning rate, I think. That could make a difference.
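
For context, here is a toy sketch of how dlib's stock sgd solver is handed to the trainer; the network below is a dummy, not the YOLO model, and the hyper-parameters are just the common defaults:

```cpp
#include <dlib/dnn.h>
#include <iostream>

using namespace dlib;

// A throwaway network, only here to show how the solver is configured. dlib's sgd is
// constructed with one weight decay and one momentum that apply to all parameters
// (unless a layer sets its own multipliers), so there is no per-group weight decay or
// separate bias learning rate like the PyTorch YOLO repos use.
using toy_net = loss_multiclass_log<fc<10, relu<fc<32, input<matrix<float>>>>>>;

int main()
{
    toy_net net;
    dnn_trainer<toy_net> trainer(net, sgd(0.0005f, 0.9f)); // weight decay, momentum
    trainer.set_learning_rate(0.001);
    std::cout << trainer << std::endl;
    return 0;
}
```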

@arrufat
Contributor Author

arrufat commented Jul 19, 2022

Yes, what I am trying to say is that I don't think the performance drop comes from the network architecture. But you don't have to believe me: in the official dlib example, that network is exactly the same as the one presented in the YOLOv3 paper. I even tried at some point to fine-tune from the official Darknet53 weights.

I would say that the main differences are:

  • upscaling: dlib uses bilinear interpolation, while the YOLO implementations use nearest neighbor
  • input image: YOLOv3 uses a 0-mean initialized input layer; in dlib you can achieve this by instantiating the network as darknet::yolov3_train_type net(options, input_rgb_image(0, 0, 0));

So, you could use that network in the training code I shared with you; however, I don't think that will make a difference.
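
Regarding the second point, a minimal standalone sketch of the input-layer difference (only the input layer is shown; the rest of the YOLO network is left out):

```cpp
#include <dlib/dnn.h>
#include <iostream>

// dlib's input_rgb_image subtracts per-channel average values from the input image.
// Constructing it with (0, 0, 0) disables that subtraction, which matches the zero-mean
// input YOLOv3 expects; in the training code it is passed to the network constructor,
// e.g. darknet::yolov3_train_type net(options, input_rgb_image(0, 0, 0)).
int main()
{
    dlib::input_rgb_image default_input;             // dlib's default channel means
    dlib::input_rgb_image zero_mean_input(0, 0, 0);  // no mean subtraction (YOLO-style)

    std::cout << "default means:   " << default_input.get_avg_red() << " "
              << default_input.get_avg_green() << " " << default_input.get_avg_blue() << "\n";
    std::cout << "zero-mean means: " << zero_mean_input.get_avg_red() << " "
              << zero_mean_input.get_avg_green() << " " << zero_mean_input.get_avg_blue() << "\n";
    return 0;
}
```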

@pfeatherstone
Contributor

Oh, I agree the network definitions are correct. I would use any architecture, even something that isn't an official YOLO model, maybe with only one head, and experiment with different training setups.

@arrufat
Contributor Author

arrufat commented Jul 19, 2022

Also, I would look at the optimizer. @arrufat, correct me if I'm wrong, but the optimizer is just vanilla SGD. The PyTorch repos group the weights into 3 different groups depending on what requires a weight decay, and biases take a higher learning rate, I think. That could make a difference.

I know the YOLOv5 code uses different momentum and bias learning rate settings for the warm-up and the main training stages, but I haven't looked into that deeply.

@VisionEp1

I completely agree; I was not trying to find bugs where there are none. I am very sorry if that came across wrong.

Thanks for the answers. I will go ahead and try different training setups.

@arrufat
Contributor Author

arrufat commented Jul 19, 2022

@VisionEp1 no, you are right to question everything, I don't even trust the 1-year-ago version of myself :P

@davisking
Owner

People do this already, but using genetic algorithms to find optimal parameters. I don't know how dlib::find_min_global() fares against genetic algorithms, but they have the same goal.

A genetic algorithm will be much worse.

@arrufat
Contributor Author

arrufat commented Aug 15, 2022

@VisionEp1 @carkaci I fixed the YOLOv7 version. I started a training run a couple of hours ago, and it seems to work... Feel free to try it out: https://github.com/dlibml/dnn/blob/master/src/detection/yolov7.h

@carkaci

carkaci commented Aug 15, 2022 via email

@carkaci

carkaci commented Aug 16, 2022 via email

@arrufat
Contributor Author

arrufat commented Aug 16, 2022

I have recently started a training. Everything seems Ok.

Yes, me too. I started yesterday, and after 22 epochs I got an mAP of 36.91%... Let's see how it goes, but I have the feeling it learns faster than the equivalent YOLOv5 models.
(training curve)

@pfeatherstone
Contributor

@arrufat Do you have an update on the training? I'm very interested. Do you also have the same curves for YOLOv5, v4 and v3? It's a lot to ask, I know :) If they're all similar, then it will probably indicate that the augmentation, training recipes and things like that are what contribute the most to performance.

@arrufat
Contributor Author

arrufat commented Aug 17, 2022

@pfeatherstone I do not have curves for all those models. Currently, I only have them for YOLOv5m and YOLOv7 under the same training settings, except for the batch size (48×2 vs 14×2), but using the same learning rate scheduler, so YOLOv7 has way more steps per epoch.

Check out the training curves:

YOLOv5m

(YOLOv5m training curves)

YOLOv7

(YOLOv7 training curves)

Training config

I have mild data augmentation, as you can see from here (--batch-gpu is set to 48 for YOLOv5m instead of the 14 shown below).

./build/Release/train \
    --name yolov7-coco2017 \
    --size 512 \
    --batch-gpu 14 \
    --gpus 2 \
    --warmup 1 \
    --epochs 100 --cosine \
    --learning-rate 0.001 \
    --min-learning-rate 0.0001 \
    --momentum 0.9 \
    --weight-decay 0.0005 \
    --iou-anchor 0.2 \
    --iou-ignore 0.7 \
    --lambda-box 1 \
    --lambda-obj 1 \
    --lambda-cls 1 \
    --gamma-obj 0 \
    --gamma-cls 0 \
    --mosaic 0 \
    --angle 3 \
    --min-coverage 0.5 \
    --hsi 0.0 0.0 0.0 \
    --scale 0.5 \
    --shift 0.2 \
    --perspective 0.00 \
    --mixup 0.00 \
    data/coco2017 | tee -a training.log

@pfeatherstone
Contributor

Interesting that v7 learns quicker. Thanks a lot!

@arrufat
Contributor Author

arrufat commented Aug 17, 2022

Maybe we can continue the discussion here: dlibml/yolo-object-detector#2

@carkaci

carkaci commented Oct 11, 2022 via email

@arrufat
Contributor Author

arrufat commented Oct 11, 2022

Like Arrufat, I also have the feeling it learns faster than YOLOv5. I set
"--patience=200000", that is why training takes 5-6 days. I will let you
know the result when finished.

Oh, --patience is in number of epochs, not in training steps like in the YOLO repository. You can set it to a floating-point number, though, but 200000 is definitely way too high.

@VisionEp1

@carkaci @arrufat
If possible, can we move this to your discussion post, so we have all the "insights" over there as well?
