When training yolov4-csp, iou_loss is a negative number. #7573

Open
kendyChina opened this issue Apr 1, 2021 · 1 comment
Labels
Training issue Training issue - no-detections / Nan avg-loss / low accuracy:

Comments


kendyChina commented Apr 1, 2021

@AlexeyAB Hi, and thank you for your excellent work.
I ran into a problem when training with yolov4-csp.cfg.
Darknet's logs show a negative iou_loss, which puzzles me. When I train yolov4.cfg on the same dataset, iou_loss is never negative.

The command is:

cd /home/ma-user/work/xulexuan/smokefire_det_yolov4-csp_2 && \
 /home/ma-user/work/xulexuan/darknet_yolov4_smokefire/darknet detector train \
/home/ma-user/work/xulexuan/smokefire_det_yolov4-csp_2/smokefire.data \
/home/ma-user/work/xulexuan/smokefire_det_yolov4-csp_2/yolov4-csp.cfg \
/home/ma-user/work/xulexuan/smokefire_det_yolov4-csp_2/weights/yolov4-csp.conv.142 \
-gpus 2,3 -dont_show -map | tee \
/home/ma-user/work/xulexuan/smokefire_det_yolov4-csp_2/yolov4-csp.log

Below is part of the yolov4-csp.cfg file, which I modified based on #7087 (comment): I set custom anchors and replaced every mish activation with relu.

[net]
# Testing
#batch=1
#subdivisions=1

# Training
batch=64
subdivisions=8

width=768
height=416
channels=3
momentum=0.949
decay=0.0005
angle=180

saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.0001
burn_in=2000
max_batches = 20000
policy=steps
steps=16000,18000
scales=.1,.1

mosaic=1

letter_box=1

ema_alpha=0.9998

#optimized_memory=1

#23:104x104 54:52x52 85:26x26 104:13x13 for 416

--- some network structures ---

[convolutional]
size=1
stride=1
pad=1
filters=21
# activation=logistic
activation=linear


[yolo]
mask = 0,1,2
anchors = 20,17, 46,32, 74,62, 110,118, 187,183, 204,79, 336,260, 407,160, 622,322
classes=2
num=9
jitter=.1
# scale_x_y = 2.0
scale_x_y = 1.05
# objectness_smooth=1
objectness_smooth=0
ignore_thresh = .7
truth_thresh = 1
#random=1
resize=1.5
iou_thresh=0.2
# iou_normalizer=0.05
cls_normalizer=0.5
obj_normalizer=4.0
# iou_loss=ciou
nms_kind=diounms
beta_nms=0.6
# new_coords=1
max_delta=5
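
One quick check on the edited head above: in darknet cfg files, the convolutional layer that feeds a [yolo] layer conventionally needs filters = (classes + 5) * (number of mask entries), and the number of anchor pairs should equal num. A minimal Python sketch of that arithmetic with the values from this cfg (the helper name `expected_filters` is made up for illustration):

```python
def expected_filters(classes, mask):
    """Channels for one [yolo] head: (x, y, w, h, objectness + classes) per masked anchor."""
    return (classes + 5) * len(mask)


# Values copied from the cfg excerpt above.
classes = 2
mask = [0, 1, 2]
num = 9
anchors = [20, 17, 46, 32, 74, 62, 110, 118, 187, 183,
           204, 79, 336, 260, 407, 160, 622, 322]

print(expected_filters(classes, mask))  # compare with filters= in the cfg
print(len(anchors) // 2 == num)         # anchor pairs should match num
```

With these values the result is 21, which matches filters=21, so the head size itself looks consistent and is unlikely to be the cause of the negative iou_loss.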

Here are some of my training logs (please forgive the lack of screenshots; I can't take them at work):

 (next mAP calculation at 7939 iterations) 
 Last accuracy mAP@0.5 = 26.06 %, best = 37.29 % 
 7936: 10.981703, 9.571694 avg loss, 0.000200 rate, 2.520074 seconds, 1015808 images, 4.344318 hours left
v3 (mse loss, Normalizer: (iou: 0.75, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.667758), count: 18, class_loss = 24.657625, iou_loss = -17.400597, total_loss = 7.257029 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 0.50) Region 159 Avg (IOU: 0.772175), count: 18, class_loss = 0.682268, iou_loss = 1.041399, total_loss = 1.723668 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.654958), count: 66, class_loss = 295.130829, iou_loss = -214.998322, total_loss = 80.132507 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 0.40, cls: 0.50) Region 174 Avg (IOU: 0.870868), count: 17, class_loss = 0.020216, iou_loss = 0.422479, total_loss = 0.442695 
 total_bbox = 2854589, rewritten_bbox = 0.199678 % 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 0.50) Region 159 Avg (IOU: 0.765109), count: 59, class_loss = 3.666386, iou_loss = 3.651301, total_loss = 7.317687 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 0.40, cls: 0.50) Region 174 Avg (IOU: 0.774608), count: 26, class_loss = 0.318644, iou_loss = 1.661815, total_loss = 1.980459 
 total_bbox = 2855317, rewritten_bbox = 0.202850 % 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.705656), count: 35, class_loss = 99.842155, iou_loss = -72.648842, total_loss = 27.193316 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 0.50) Region 159 Avg (IOU: 0.785459), count: 48, class_loss = 2.310018, iou_loss = 2.719765, total_loss = 5.029783 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 0.40, cls: 0.50) Region 174 Avg (IOU: 0.768607), count: 20, class_loss = 0.047516, iou_loss = 0.920244, total_loss = 0.967760 
 total_bbox = 2854692, rewritten_bbox = 0.199671 % 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.680916), count: 33, class_loss = 86.114616, iou_loss = -62.412079, total_loss = 23.702538 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 0.50) Region 159 Avg (IOU: 0.757862), count: 52, class_loss = 2.811994, iou_loss = 2.700784, total_loss = 5.512778 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 0.40, cls: 0.50) Region 174 Avg (IOU: 0.754424), count: 32, class_loss = 0.117560, iou_loss = 1.315984, total_loss = 1.433544 
 total_bbox = 2855434, rewritten_bbox = 0.202841 % 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.675176), count: 19, class_loss = 104.066246, iou_loss = -76.616226, total_loss = 27.450022 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 0.50) Region 159 Avg (IOU: 0.720800), count: 51, class_loss = 4.315526, iou_loss = 2.963520, total_loss = 7.279046 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 0.40, cls: 0.50) Region 174 Avg (IOU: 0.754142), count: 39, class_loss = 0.299071, iou_loss = 2.201756, total_loss = 2.500827 
 total_bbox = 2854801, rewritten_bbox = 0.199664 % 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.676617), count: 63, class_loss = 256.629120, iou_loss = -187.507233, total_loss = 69.121887 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 0.50) Region 159 Avg (IOU: 0.727937), count: 74, class_loss = 4.728501, iou_loss = 5.578818, total_loss = 10.307320 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 0.40, cls: 0.50) Region 174 Avg (IOU: 0.787509), count: 33, class_loss = 0.101363, iou_loss = 1.512439, total_loss = 1.613802 
 total_bbox = 2855604, rewritten_bbox = 0.202829 % 
libpng warning: Incorrect sBIT chunk length
v3 (mse loss, Normalizer: (iou: 0.75, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.650997), count: 36, class_loss = 57.469650, iou_loss = -40.197891, total_loss = 17.271757 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 0.50) Region 159 Avg (IOU: 0.784982), count: 35, class_loss = 1.067892, iou_loss = 1.251908, total_loss = 2.319800 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 0.40, cls: 0.50) Region 174 Avg (IOU: 0.808537), count: 13, class_loss = 0.023424, iou_loss = 0.402296, total_loss = 0.425721 
 total_bbox = 2854885, rewritten_bbox = 0.199658 % 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.662302), count: 33, class_loss = 144.609161, iou_loss = -105.346619, total_loss = 39.262539 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 0.50) Region 159 Avg (IOU: 0.750408), count: 77, class_loss = 3.802018, iou_loss = 4.551557, total_loss = 8.353575 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 0.40, cls: 0.50) Region 174 Avg (IOU: 0.757675), count: 46, class_loss = 0.173961, iou_loss = 2.452817, total_loss = 2.626778 
 total_bbox = 2855760, rewritten_bbox = 0.202818 % 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.641636), count: 35, class_loss = 167.206879, iou_loss = -122.611183, total_loss = 44.595695 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 0.50) Region 159 Avg (IOU: 0.702767), count: 40, class_loss = 2.472324, iou_loss = 3.432393, total_loss = 5.904717 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 0.40, cls: 0.50) Region 174 Avg (IOU: 0.812279), count: 22, class_loss = 0.050357, iou_loss = 0.674862, total_loss = 0.725219 
 total_bbox = 2854982, rewritten_bbox = 0.199651 % 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.668511), count: 47, class_loss = 191.111221, iou_loss = -139.335358, total_loss = 51.775871 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 0.50) Region 159 Avg (IOU: 0.720861), count: 64, class_loss = 4.735015, iou_loss = 4.423585, total_loss = 9.158601 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 0.40, cls: 0.50) Region 174 Avg (IOU: 0.751544), count: 35, class_loss = 0.344764, iou_loss = 2.394387, total_loss = 2.739151 
 total_bbox = 2855906, rewritten_bbox = 0.202808 % 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.747988), count: 43, class_loss = 138.384521, iou_loss = -101.551109, total_loss = 36.833412 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 0.50) Region 159 Avg (IOU: 0.771579), count: 77, class_loss = 4.758459, iou_loss = 3.766207, total_loss = 8.524666 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 0.40, cls: 0.50) Region 174 Avg (IOU: 0.811942), count: 36, class_loss = 0.089222, iou_loss = 1.079173, total_loss = 1.168395 
 total_bbox = 2855138, rewritten_bbox = 0.199640 % 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.831341), count: 5, class_loss = 15.909742, iou_loss = -11.639129, total_loss = 4.270614 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 0.50) Region 159 Avg (IOU: 0.685597), count: 39, class_loss = 1.998399, iou_loss = 2.989974, total_loss = 4.988373 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 0.40, cls: 0.50) Region 174 Avg (IOU: 0.761367), count: 29, class_loss = 0.102724, iou_loss = 1.149873, total_loss = 1.252597 
 total_bbox = 2855979, rewritten_bbox = 0.202873 % 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.710227), count: 28, class_loss = 80.933899, iou_loss = -58.424400, total_loss = 22.509497 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 0.50) Region 159 Avg (IOU: 0.751094), count: 37, class_loss = 1.295604, iou_loss = 1.900529, total_loss = 3.196134 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 0.40, cls: 0.50) Region 174 Avg (IOU: 0.748046), count: 11, class_loss = 0.025220, iou_loss = 0.633326, total_loss = 0.658546 
 total_bbox = 2855214, rewritten_bbox = 0.199635 % 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.666958), count: 10, class_loss = 44.806202, iou_loss = -32.348846, total_loss = 12.457357 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 0.50) Region 159 Avg (IOU: 0.747149), count: 30, class_loss = 1.239678, iou_loss = 1.776720, total_loss = 3.016399 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 0.40, cls: 0.50) Region 174 Avg (IOU: 0.761410), count: 16, class_loss = 0.038273, iou_loss = 0.816880, total_loss = 0.855153 
 total_bbox = 2856035, rewritten_bbox = 0.202869 % 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.763314), count: 9, class_loss = 26.460726, iou_loss = -19.388584, total_loss = 7.072142 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 0.50) Region 159 Avg (IOU: 0.768976), count: 27, class_loss = 1.160739, iou_loss = 1.306634, total_loss = 2.467374 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 0.40, cls: 0.50) Region 174 Avg (IOU: 0.768428), count: 21, class_loss = 0.221103, iou_loss = 1.318804, total_loss = 1.539907 
 total_bbox = 2855271, rewritten_bbox = 0.199631 % 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.636865), count: 4, class_loss = 0.456027, iou_loss = -0.082396, total_loss = 0.373631 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 0.50) Region 159 Avg (IOU: 0.724057), count: 23, class_loss = 0.622950, iou_loss = 1.407006, total_loss = 2.029956 
v3 (mse loss, Normalizer: (iou: 0.75, obj: 0.40, cls: 0.50) Region 174 Avg (IOU: 0.761153), count: 16, class_loss = 0.038379, iou_loss = 0.783839, total_loss = 0.822218 
 total_bbox = 2856078, rewritten_bbox = 0.202866 % 
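
In every line posted here, total_loss equals class_loss + iou_loss to within about 1e-5, and the negative values occur only at Region 144 (the head logged with obj: 4.00). That suggests, although the log alone does not prove it, that the printed iou_loss is a remainder (total minus the class term) rather than an independently computed box loss. A minimal Python sketch for checking this on a full log; the regex and the `summarize` helper are illustrative names, not part of darknet:

```python
import re
from collections import defaultdict

# Matches the per-region loss lines in the log format shown above.
LINE_RE = re.compile(
    r"Region (?P<region>\d+) .*?"
    r"class_loss = (?P<cls>-?\d+\.\d+), "
    r"iou_loss = (?P<iou>-?\d+\.\d+), "
    r"total_loss = (?P<total>-?\d+\.\d+)"
)


def summarize(log_text):
    """Return {region: [(class_loss, iou_loss, total_loss), ...]} for parsed lines."""
    by_region = defaultdict(list)
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            by_region[int(m["region"])].append(
                (float(m["cls"]), float(m["iou"]), float(m["total"]))
            )
    return dict(by_region)


# Three lines copied from the log above; replace with the full log file contents.
sample = """\
v3 (mse loss, Normalizer: (iou: 0.75, obj: 4.00, cls: 0.50) Region 144 Avg (IOU: 0.667758), count: 18, class_loss = 24.657625, iou_loss = -17.400597, total_loss = 7.257029
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 0.50) Region 159 Avg (IOU: 0.772175), count: 18, class_loss = 0.682268, iou_loss = 1.041399, total_loss = 1.723668
v3 (mse loss, Normalizer: (iou: 0.75, obj: 0.40, cls: 0.50) Region 174 Avg (IOU: 0.870868), count: 17, class_loss = 0.020216, iou_loss = 0.422479, total_loss = 0.442695
"""

for region, rows in sorted(summarize(sample).items()):
    negatives = sum(1 for _, iou, _ in rows if iou < 0)
    # Largest gap between the printed total and class_loss + iou_loss.
    max_gap = max(abs(cls + iou - total) for cls, iou, total in rows)
    print(f"Region {region}: {negatives}/{len(rows)} negative iou_loss, max gap {max_gap:.6f}")
```

If the gap stays near zero across the whole log, the negative numbers would mean that class_loss at Region 144 is larger than the total for that head, which points at how the two printed terms are computed rather than at the boxes themselves.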

Finally, here's my system information:

 CUDA-version: 9000 (10020), cuDNN: 7.4.1, CUDNN_HALF=1, GPU count: 4  
 OpenCV version: 4.9.1
 CUDNN_HALF=1 
 2 : compute_capability = 700, cudnn_half = 1, GPU: Tesla V100-SXM2-32GB
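
As a cross-check that the cfg, the command, and the log belong together: the iteration line in the log (7936: ..., 0.000200 rate, ..., 1015808 images) is consistent with batch=64, learning_rate=0.0001, and the two GPUs selected by -gpus 2,3, if one assumes that both the images per iteration and the base learning rate are multiplied by the GPU count. That scaling is an assumption inferred from the numbers here, not something stated in the post. A short Python sketch of the arithmetic:

```python
# Values copied from the post: cfg, command line, and the iteration-7936 log line.
batch = 64
base_learning_rate = 0.0001
burn_in = 2000
steps = (16000, 18000)
gpus = [2, 3]  # from "-gpus 2,3"
iteration = 7936

# Assumption: images per iteration and the base rate both scale with GPU count.
images_seen = iteration * batch * len(gpus)

# Iteration 7936 is past burn_in and before the first step, so no schedule
# scaling from steps/scales applies yet.
assert burn_in < iteration < steps[0]
effective_rate = base_learning_rate * len(gpus)

print(images_seen)              # compare with "1015808 images" in the log
print(f"{effective_rate:.6f}")  # compare with "0.000200 rate" in the log
```

Both values agree with the log line, so the run appears to be using the cfg shown above on two GPUs as intended.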

Please feel free to contact me if any information is lacking.
Thanks!

@kendyChina kendyChina added the Training issue Training issue - no-detections / Nan avg-loss / low accuracy: label Apr 1, 2021

zyrant commented Apr 22, 2021

Hi,
I also have this problem. Have you fixed it?
