After interrupting training, load weights/last.pt to continue training #1368
Comments
Hello @winnerCR7, thank you for your interest in our work! Ultralytics has open-sourced YOLOv5 at https://github.com/ultralytics/yolov5, featuring faster, lighter and more accurate object detection. YOLOv5 is recommended for all new projects. To continue with this repo, please visit our Custom Training Tutorial to get started, and see our Google Colab Notebook, Docker Image, and GCP Quickstart Guide for example environments. If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you. If this is a custom model or data training question, please note that Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients.
For more information please visit https://www.ultralytics.com.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@winnerCR7 it looks like you're encountering a CUDA out-of-memory error as you try to resume training after stopping at epoch 95. This issue can often be resolved by reducing the batch size or image size when continuing training. It's great to hear that you were able to continue training in VScode using the same arguments. As for restoring the TensorBoard training record after resuming, you can try relaunching TensorBoard on the same log directory (your run prints "tensorboard --logdir=runs") so it picks up the new event files written after the resume. Feel free to adjust the batch size or image size as needed to prevent the CUDA out-of-memory issue, and best of luck with your continued training. The YOLO community and the Ultralytics team are here to support you throughout the process.
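For illustration only, here is a sketch reusing the poster's own paths and the flags already shown in the log below; the halved batch size is just an example value, not a specific recommendation. Resuming with a smaller memory footprint and relaunching TensorBoard on the existing log directory could look like:

python train.py --batch-size 8 --img-size 416 --weights weights/last.pt --data data/bdd100k/bdd100k.data --cfg cfg/yolov3-spp-bdd100k.cfg
tensorboard --logdir=runs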
I stopped after training 95 epochs. I entered
python train.py --batch-size 16 --img-size 416 --weights weights/last.pt --data data/bdd100k/bdd100k.data --cfg cfg/yolov3-spp-bdd100k.cfg
in the terminal to continue training. After running one epoch on the training set it raised the error below, but I can continue training normally in VScode using the same args. Why is this? BTW, after resuming training, how should the TensorBoard training record be restored? I opened the URL and found it still showed only the record from before training was interrupted; it does not seem to be updated. (Full launch log and traceback below; a GPU memory-check sketch follows them.)
Namespace(adam=False, batch_size=16, bucket='', cache_images=False, cfg='cfg/yolov3-spp-bdd100k.cfg', data='data/bdd100k/bdd100k.data', device='', epochs=300, evolve=False, freeze_layers=False, img_size=[416], multi_scale=False, name='', nosave=False, notest=False, rect=False, resume=False, single_cls=False, weights='weights/last.pt')
Using CUDA Apex device0 _CudaDeviceProperties(name='GeForce RTX 2070', total_memory=7979MB)
device1 _CudaDeviceProperties(name='GeForce GTX 1060 6GB', total_memory=6078MB)
Start Tensorboard with "tensorboard --logdir=runs", view at http://localhost:6006/
Model Summary: 225 layers, 6.26218e+07 parameters, 6.26218e+07 gradients
Optimizer groups: 76 .bias, 76 Conv2d.weight, 73 other
Caching labels data/bdd100k/labels/train.npy (69863 found, 0 missing, 0 empty, 1 duplicate, for 69863 images): 100%|██████████| 69863/69863 [00:02<00:00, 26837.25it/s]
Caching labels data/bdd100k/labels/val.npy (10000 found, 0 missing, 0 empty, 0 duplicate, for 10000 images): 100%|██████████| 10000/10000 [00:00<00:00, 26744.37it/s]
Image sizes 416 - 416 train, 416 test
Using 8 dataloader workers
Starting training for 300 epochs...
File "train.py", line 497, in
train(hyp) # train normally
File "train.py", line 322, in train
scaled_loss.backward()
File "/home/cr7/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/cr7/anaconda3/lib/python3.7/site-packages/torch/autograd/init.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 1.56 GiB (GPU 0; 7.79 GiB total capacity; 2.27 GiB already allocated; 1.57 GiB free; 4.26 GiB reserved in total by PyTorch) (malloc at /opt/conda/conda-bld/pytorch_1591914880026/work/c10/cuda/CUDACachingAllocator.cpp:289)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7f1a3129ab5e in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x1f39d (0x7f1a314e639d in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_alloc(unsigned long) + 0x5b (0x7f1a314e098b in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0xd767c6 (0x7f1a3246a7c6 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd7af6d (0x7f1a3246ef6d in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0xd6dc9a (0x7f1a32461c9a in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xd6f07f (0x7f1a3246307f in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xd72cd0 (0x7f1a32466cd0 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #8: at::native::cudnn_convolution_backward_weight(c10::ArrayRef, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool) + 0x49 (0x7f1a32466f29 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #9: + 0xdd9880 (0x7f1a324cd880 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #10: + 0xe1daf8 (0x7f1a32511af8 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #11: at::native::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, std::array<bool, 2ul>) + 0x2fc (0x7f1a32467bdc in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #12: + 0xdd958b (0x7f1a324cd58b in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #13: + 0xe1db54 (0x7f1a32511b54 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #14: + 0x29dee26 (0x7f1a5b288e26 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: + 0x2a2e634 (0x7f1a5b2d8634 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::generated::CudnnConvolutionBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x378 (0x7f1a5aea0ff8 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #17: + 0x2ae7df5 (0x7f1a5b391df5 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7f1a5b38f0f3 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #19: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7f1a5b38fed2 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #20: torch::autograd::Engine::thread_init(int) + 0x39 (0x7f1a5b388549 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #21: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7f1a5e8d8638 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #22: + 0xc819d (0x7f1a613f919d in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #23: + 0x76db (0x7f1a7a1d86db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #24: clone + 0x3f (0x7f1a79f0188f in /lib/x86_64-linux-gnu/libc.so.6)
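Not part of the original report, but since the OOM appears only when launching from the terminal, a quick check is to see how much memory each GPU already holds before resuming. The snippet below is a rough sketch (it assumes a PyTorch version where torch.cuda.memory_reserved is available); these counters only cover the current process, so also compare against nvidia-smi in case another process, e.g. a session still open in VScode, is occupying GPU 0.

import torch

# Sketch only: report per-GPU memory as seen by this PyTorch process.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    allocated = torch.cuda.memory_allocated(i) / 1e9   # GB currently allocated by tensors
    reserved = torch.cuda.memory_reserved(i) / 1e9     # GB reserved by the caching allocator
    print(f"GPU {i} ({props.name}): {allocated:.2f} GB allocated, "
          f"{reserved:.2f} GB reserved, total {props.total_memory / 1e9:.2f} GB")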