different training results #22
After training for 41 epochs, I still get a low mAP.
I have the same result with the latest commits: very low recall, but precision is fine. After 200 epochs on a small test dataset with 3 classes I got 0.07 recall and nearly 0.97 precision. PS: Thank you for your work 👍
@xiao1228 @JegernOUTT sorry guys, I'm still trying to figure out the exact loss terms to use. Small changes have huge effects. Some of the options that need testing are:
The main region of the affected code is small. My main strategy is to resume training from the official yolov3.pt weights and look for the loss terms that produce the best mAP after 1 epoch. In the latest commit b7d0397 the mAP after 1 epoch of resumed training is about 0.50, down from 0.57 with the official weights, so something is probably still not right. (See lines 160 to 182 in b7d0397.)
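As context for the discussion below, here is a rough, hypothetical sketch of the kind of loss-term structure being tuned in that region of the code; the tensor names and the multiplier k are placeholders for illustration, not the repo's actual implementation:

```python
import torch.nn as nn

def yolo_loss_sketch(pred_xy, pred_wh, pred_conf, pred_cls,
                     t_xy, t_wh, t_conf, t_cls, k=1.0):
    """Hypothetical combination of YOLOv3 loss terms; the tensor names and the
    scalar multiplier k are placeholders, not the repo's compute-loss code."""
    mse = nn.MSELoss()
    bce = nn.BCEWithLogitsLoss()
    ce = nn.CrossEntropyLoss()

    loss_xy = mse(pred_xy, t_xy)          # box centre terms
    loss_wh = mse(pred_wh, t_wh)          # box width/height terms
    loss_conf = bce(pred_conf, t_conf)    # objectness term
    loss_cls = ce(pred_cls, t_cls)        # classification term (t_cls: class indices)

    # k is the kind of overall multiplier/normalizer debated later in this thread
    # (e.g. dividing by the number of matched targets nM or by the batch size)
    return k * (loss_xy + loss_wh + loss_conf + loss_cls)
```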
@glenn-jocher Thank you Glenn, yeah, I resumed from yolov3.pt and after 3 epochs the precision and recall become 0.381 and 0.458.
OK, I'm going to document my test results here. These are the mAPs after 1 epoch of resumed yolov3.pt training. All mAPs are as produced by ...
... further tests?
@glenn-jocher thanks for the update! That means CE classification loss gives a slightly better mAP compared to the rest of the changes, then?
CE means cross-entropy is used for the classification loss (see lines 173 to 174 in b7d0397).
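For anyone unfamiliar with the shorthand, a minimal illustration of the two classification-loss options being compared (tensor shapes and names here are just placeholders):

```python
import torch
import torch.nn as nn

pred_cls = torch.randn(8, 80)              # logits for 8 matched anchors, 80 COCO classes
target_idx = torch.randint(0, 80, (8,))    # integer class index per anchor

# CE: one multi-class cross-entropy over the 80 classes
ce_loss = nn.CrossEntropyLoss()(pred_cls, target_idx)

# BCE alternative: 80 independent one-vs-all sigmoid losses against a one-hot target
one_hot = torch.zeros_like(pred_cls).scatter_(1, target_idx.unsqueeze(1), 1.0)
bce_loss = nn.BCEWithLogitsLoss()(pred_cls, one_hot)
```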
The problem, though, is that none of the changes I tried retains the 0.57 mAP from the start of the resumed epoch. I'm not sure what to do; any ideas are welcome.
Sorry, I'm testing on ...
Here is the TensorBoard information I recorded.
@glenn-jocher No luck here either... So far I think I can get the model to overfit the training set without augmentations. As soon as I use augmentations, the loss kind of gets stuck at some point. By the way, I also found the augmentations you use seem to be different from the C version, so it's very possible the model might converge differently if trained from the original YOLOv3 weights.
@ydixon that's interesting, I'll try disabling the HSV augmentation and then disabling the spatial augmentation as two additional tests. You are right, this could have a significant impact. I'm also going to try a larger batch size (16 vs 12). Darknet may be training with a much larger batch size (64?). @ECer23 thanks a lot for the plots! I'm going to plot the same on my end for one resumed epoch. I wish your results were correct, but I think you might accidentally be testing the same yolov3.pt to get that 56.67 mAP. After you resume for one epoch the new weights are saved in latest.pt (not yolov3.pt). The code I've been using to do these tests is here:
sudo rm -rf yolov3 && git clone https://github.com/ultralytics/yolov3
cd yolov3/checkpoints
wget https://storage.googleapis.com/ultralytics/yolov3.pt
cp yolov3.pt latest.pt
cd ..
python3 train.py -img_size 416 -batch_size 12 -epochs 1 -resume 1
python3 test.py -img_size 416 -weights_path checkpoints/latest.pt -conf_thres 0.5
@glenn-jocher From this cfg, batch=64 and subdivisions=16. Therefore, the real batch size = 64 / 16 = 4.
Hi @glenn-jocher, I have tried using batch size 16 and trained for 40+ epochs... the mAP is still 0.1083.
@ydixon so I'm assuming they accumulate the gradient for the 64 images and then update the optimizer only once at the end of the 64 images (with the subdivisions only serving to reduce the memory requirements)? @xiao1228 the shape of those plots looks good, but the rate of change of P and R is painfully slow. If we didn't divide ... (line 163 in b7d0397)
I resumed b7d0397 again for 1 epoch and got these loss plots over the epoch. If the model were perfectly aligned with darknet, I think we'd expect the losses to stay pretty consistent, but instead they drop over the epoch, especially the width and height terms. The mAP at the end is the same 0.5015 I saw before.
@glenn-jocher the loss becomes NaN in the first epoch when using k = nM
@glenn-jocher That's what I thought too; it seems to make the most sense. However, after asking the authors, they insist that it's updated per minibatch.
@ydixon that's strange. Wouldn't that be equivalent to batch size 4 with no minibatches? I'll test it out both ways. I tested batch size 16 and the resumed mAP increases from 0.50 (bs 12) to 0.5126 (bs 16). It's possible an effective batch size of 64 would make a big difference. @xiao1228 ok yes, I was afraid that might happen. The parameter updates become too large and the training becomes unstable. I've been updating #22 (comment) with my test results. Positive results are that switching ...
@glenn-jocher after 70+ epochs I can see the precision and recall are becoming flat... but the mAP is still only 0.1961...
@xiao1228 thanks for the update! Yes, your plots make sense: the LR scheduler multiplies the initial 1e-3 LR by 0.1 at epochs 54 and 61 (to match the yolov3.cfg settings). This assumes a total training time of 68 epochs. From your plots, though, it seems far too soon to drop the LR at epoch 54, as P and R are still increasing linearly ... (lines 104 to 112 in d748bed)
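For reference, the epoch-based LR drop described above corresponds to something like a PyTorch MultiStepLR schedule (a minimal sketch; the stand-in model and SGD settings are assumptions, only the milestones and gamma mirror the numbers quoted here):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # stand-in model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# multiply the LR by 0.1 at epochs 54 and 61, matching the schedule described above
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[54, 61], gamma=0.1)

for epoch in range(68):
    # train_one_epoch(model, optimizer)  # training loop omitted in this sketch
    scheduler.step()
```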
You could also try varying ... My tests are not revealing any breakthroughs, unfortunately. All I have so far is what I mentioned before: increasing the batch size to 16 and using CE for the classification loss. I also have one big change to test, which is ignoring non-best anchors with >0.5 IoU. This is a little tricky to implement, but I should have it soon. It is explicitly stated in the paper, so it could have an impact.
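A rough sketch of that "ignore non-best anchors above 0.5 IoU" rule from the paper (a hypothetical implementation for illustration; the function name, shapes, and threshold handling are assumptions, not the eventual commit):

```python
import torch

def build_ignore_mask(anchor_ious, best_anchor_idx, ignore_thres=0.5):
    """anchor_ious: (num_anchors, num_targets) IoU matrix for one image.
    best_anchor_idx: index of the best-matching anchor per target.
    Returns a boolean mask of anchors to exclude from the no-object loss."""
    ignore = (anchor_ious > ignore_thres).any(dim=1)  # overlaps some target above the threshold
    ignore[best_anchor_idx] = False                   # best matches keep their (positive) loss
    return ignore
```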
@xiao1228 @ydixon I just realized something important. When I used this repo for the xview challenge (https://github.com/ultralytics/xview-yolov3) I saw a vast improvement in performance when using weighted CE for the classification loss (see lines 33 to 41 in f79e7ff).
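For readers wanting to try this, a minimal sketch of class-frequency-weighted CE (the inverse-frequency weighting below is an assumption for illustration, not necessarily the scheme used in xview-yolov3):

```python
import torch
import torch.nn as nn

# hypothetical per-class label counts; in practice these would come from the dataset labels
class_counts = torch.tensor([12000., 800., 150.])
weights = 1.0 / class_counts
weights = weights / weights.mean()          # normalize so the average weight is ~1

weighted_ce = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, 3)                  # 4 matched anchors, 3 classes
targets = torch.tensor([0, 2, 1, 2])        # integer class indices
loss = weighted_ce(logits, targets)
```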
@glenn-jocher thank you very much for this update. Have you tried training with the new weighted CE? Also, I will keep the LR at 1e-3 for the entire training then...
@xiao1228 yes, most of ... I resume-trained one epoch with the weighted CE, and the mAP came out lower :( Training from scratch with weighted CE may be better though, as I see that the two lowest-count categories have 0.0 mAP each (the latest test.py additionally produces a mAP for each class).
@glenn-jocher @ydixon @xiao1228 Training depends on batch size and subdivision as follows:
@okanlv I see your link. OK, I'll try to follow your logic here. If I use the following values (lines 6 to 7 in d336e00) with the darknet equation, then this would be ...
@glenn-jocher I tried to train from scratch again with the new weighted CE, but after 10 epochs it seems the trend is very similar to the previous one :(
@xiao1228 I just made a new commit 24a4197 which switches ... BUT I noticed that dropping ... If you have time, I would explore ...
@okanlv @glenn-jocher For example:
Case 1: net.seen = 124
Case 2: net.seen = 128
Also refer to AlexeyAB/darknet#1736 and pjreddie/darknet#224
@ydixon You are right, I had totally forgotten the modification in the parser. So, to correct my previous answer: @glenn-jocher (see line 118 in d336e00)
You have to make one more modification: divide the loss by 4 before calling loss.backward() to average the loss.
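In PyTorch terms, that accumulate-and-average pattern might look like the following (a self-contained sketch with stand-in model and data; accumulate=4 and mini-batches of 16 mirror the numbers in this discussion):

```python
import torch
import torch.nn as nn

# stand-ins so the sketch runs; in the repo these would be the Darknet model,
# the COCO dataloader, and the SGD optimizer from train.py
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
data = [(torch.randn(16, 8), torch.randn(16, 1)) for _ in range(8)]  # 8 mini-batches of 16
accumulate = 4  # 4 mini-batches of 16 images ~= darknet's effective batch of 64

optimizer.zero_grad()
for i, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accumulate).backward()      # divide so the summed gradients average over 64 samples
    if (i + 1) % accumulate == 0:
        optimizer.step()                # one weight update per 64 images
        optimizer.zero_grad()
```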
@ydixon @okanlv the latest commit from a few days ago actually already includes code for accumulating gradients. I have not tested it from scratch, but resuming training with batch size 64 vs 16 did not show a big effect after one epoch :( This might not be the most elegant implementation (any better ideas are welcome), but if you uncomment this (lines 132 to 135 in 05f28ab) ...
@okanlv are you saying that the latest darknet code does not load ...
@glenn-jocher I tested a batch size of 64 on my own too. Not much improvement either :( The loss is still stuck at some local minimum, I think. As for accumulated gradients, if you don't divide by 4, it would be like summing 4 average loss values of batch size 16 instead of getting the average loss over a batch of 64. Either way, I don't see any improvement yet.
All, I trained to 60 epochs using the current setup. I used batch size 16 for the first 24 hours, then accidentally reverted to batch size 12 for the rest (hence the nonlinearity at epoch 10). A strange hiccup happened at epoch 40, then the learning rate dropped from 1e-3 to 1e-4 at epoch 51 as part of the LR scheduler. This seemed to produce much-accelerated improvements in recall during the last ten epochs. The test mAP at epoch 55 was 0.40 with ... The strange thing is that I had to lower ...
Hello, I have the same problem as you: when training, TP and FN are always 0, or maybe 1 sometimes. How did you solve it? Thank you so much!
Hi @glenn-jocher
@xiao1228 you should set ... Are your plots for training COCO or a custom dataset? They look pretty good!
@fourth-archive thank you for the information, I tried both settings for ... The graph is plotted based on COCO.
@xiao1228 if you don't see any error, then it worked. Do you see a new ... If ...
@fourth-archive Thank you, but I mean the ...
Hi,
I started to train YOLOv3 using 1 GPU without changing your code, and I got the graphs below, which are all slightly different from your results. The shapes are roughly the same but the values are all in a different range, as shown below. I am a bit confused... It would be great if you could point me in the right direction, thank you!