
When I training I have a problem. #4

Closed
wntg opened this issue Nov 23, 2022 · 5 comments
Labels
good first issue Good for newcomers

Comments

@wntg

wntg commented Nov 23, 2022

-- Process 5 terminated with the following error:
Traceback (most recent call last):
File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/root/workspace/UniFormerV2-main/slowfast/utils/multiprocessing.py", line 60, in run
ret = func(cfg)
File "/root/workspace/UniFormerV2-main/tools/train_net.py", line 489, in train
train_loader, model, optimizer, loss_scaler, train_meter, cur_epoch, cfg, writer
File "/root/workspace/UniFormerV2-main/tools/train_net.py", line 105, in train_epoch
loss_scaler(loss, optimizer, clip_grad=cfg.SOLVER.CLIP_GRADIENT, parameters=model.parameters(), create_graph=is_second_order)
File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/timm/utils/cuda.py", line 43, in __call__
self._scaler.scale(loss).backward(create_graph=create_graph)
File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/root/miniconda3/envs/xclip/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [4, 768, 8, 14, 14]], which is output 0 of AsStridedBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
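For context, this error class can be reproduced with a minimal, hypothetical PyTorch snippet (not from UniFormerV2): an operation saves its output for the backward pass, and that saved tensor is then edited in place before backward() runs.

import torch

x = torch.randn(4, requires_grad=True)
y = torch.exp(x)    # autograd saves y, since dy/dx = y
y.add_(1)           # in-place update bumps y's version counter from 0 to 1
y.sum().backward()  # RuntimeError: ... modified by an inplace operation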

@Andy1621
Collaborator

Andy1621 commented Nov 23, 2022

It might be caused by a different version of PyTorch or CUDA. You can try adding clone() here:

tmp_feats = self.dpe[j](tmp_feats).view(N, C, T_down, L - 1).permute(3, 0, 2, 1).contiguous()

Line 264 then becomes:

tmp_feats = self.dpe[j](tmp_feats.clone()).view(N, C, T_down, L - 1).permute(3, 0, 2, 1).contiguous()
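For intuition, here is a toy version of the same pattern (an assumed illustration, not the repo's code) showing why the clone() avoids the error: the in-place work happens on a fresh copy, so the tensor autograd saved earlier is never modified.

import torch

x = torch.randn(4, requires_grad=True)
y = torch.exp(x)
z = y.clone()       # z owns its own storage; mutating it leaves the saved y intact
z.add_(1)
z.sum().backward()  # runs fine: exp's saved output was never touched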

@Andy1621 Andy1621 pinned this issue Nov 23, 2022
@Andy1621 Andy1621 unpinned this issue Nov 23, 2022
@Andy1621 Andy1621 added the good first issue Good for newcomers label Nov 23, 2022
@Andy1621
Collaborator

@wntg Hi! Have you solved the problem?

@Andy1621
Collaborator

As there has been no further activity, I am closing the issue; don't hesitate to reopen it if necessary.

@xiezexun

May I ask, is there a link for vit_b16.pth? The pre-trained weights I downloaded do not match the model's parameters.

@Andy1621
Collaborator
