How many GPUs are needed to run the inference code? #24

Closed
Deep-imagelab opened this issue Dec 10, 2024 · 11 comments
@Deep-imagelab

I get different errors when running inference on multiple GPUs versus a single GPU. Following the inference code's default settings:
llava_device = 'cuda:1'
t5llm_device = 'cuda:2'
Does this mean at least three GPUs are required? How many GPUs, and of what type, do you use to run the inference code?

@shallowdream204
Owner

shallowdream204 commented Dec 10, 2024

Yes, I use 3 A100s for inference, but three GPUs with >=32G memory (e.g., V100) should also work.
What error are you getting?
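For reference, a minimal sketch of the three-device split those defaults imply (the `dit_device` name and the `cuda:0` index are assumptions for illustration; only the LLaVA and T5 devices are quoted above):

```python
# Hypothetical sketch of the three-GPU layout the default settings imply;
# adjust the indices to whichever cards are actually free on your machine.
import torch

assert torch.cuda.device_count() >= 3, "the default config expects three visible GPUs"

dit_device   = torch.device('cuda:0')  # assumed: DreamClear DiT + VAE on the remaining card
llava_device = torch.device('cuda:1')  # LLaVA captioner (default from the inference script)
t5llm_device = torch.device('cuda:2')  # T5 text encoder (default from the inference script)
```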

@Deep-imagelab
Author

I tried single-GPU and dual-GPU, and today I tried again on four GPUs. The error message is as follows:
/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
/opt/conda/lib/python3.10/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2024-12-11 06:43:35,592 - PixArt - INFO - Initializing: DDP for inference
2024-12-11 06:43:40,183 - PixArt - WARNING - lewei scale: (2.0,), base size: 64
2024-12-11 06:43:46,822 - PixArt - INFO - Using fp16 inference for DiT.
2024-12-11 06:43:46,827 - PixArt - INFO - ControlPixArtMSHalfSR2Branch Model Parameters: 2,211,876,212
2024-12-11 06:43:46,827 - PixArt - INFO - T5 max token length: 120
An error occurred while trying to fetch /home/notebook/data/sharedgroup/RG_YLab/aigc_share_group_data/wuchaoxiong/DiffusionModels/PixArt-alpha/sd-vae-ft-ema: Error no file named diffusion_pytorch_model.safetensors found in directory /home/notebook/data/sharedgroup/RG_YLab/aigc_share_group_data/wuchaoxiong/DiffusionModels/PixArt-alpha/sd-vae-ft-ema.
Defaulting to unsafe serialization. Pass allow_pickle=False to raise an error instead.
2024-12-11 06:44:57,769 - PixArt - INFO - Load checkpoint from ckpt/DreamClear-1024.pth. Load ema: False.
2024-12-11 06:44:57,772 - PixArt - WARNING - Missing keys: ['base_model.pos_embed']
2024-12-11 06:44:57,772 - PixArt - WARNING - Unexpected keys: []
/home/notebook/data/sharedgroup/RG_YLab/aigc_share_group_data/wuchaoxiong/DiffusionModels/PixArt-alpha/t5-v1_1-xxl/
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in huggingface/transformers#24565
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:35<00:00, 17.55s/it]
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:26<00:00, 4.50s/it]
You are resizing the embedding layer without providing a pad_to_multiple_of parameter. This means that the new embedding dimension will be 32000. This might induce some performance reduction as Tensor Cores will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
Traceback (most recent call last):
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/test_wcx.py", line 341, in
log_validation(model,accelerator,model.device)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/test_wcx.py", line 77, in log_validation
img_pre = swinir(img_lq)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/diffusion/model/nets/swinir.py", line 883, in forward
x = self.conv_after_body(self.forward_features(x)) + x
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/diffusion/model/nets/swinir.py", line 855, in forward_features
x = layer(x, x_size)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/diffusion/model/nets/swinir.py", line 488, in forward
return self.patch_embed(self.conv(self.patch_unembed(self.residual_group(x, x_size), x_size))) + x
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/diffusion/model/nets/swinir.py", line 408, in forward
x = blk(x, x_size)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/diffusion/model/nets/swinir.py", line 261, in forward
x_windows = window_partition(shifted_x, self.window_size)  # nW*B, window_size, window_size, C
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/diffusion/model/nets/swinir.py", line 49, in window_partition
x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
RuntimeError: shape '[1, 51, 8, 38, 8, 180]' is invalid for input of size 22730400
[2024-12-11 06:46:11,725] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 5003) of binary: /opt/conda/bin/python3
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in
main()
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

test_wcx.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-12-11_06:46:11
host : task-20241211142401-45448
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 5003)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Could you please help take a look?

@Deep-imagelab
Author

If I run on a single GPU and change the launch command to a plain `python test_wcx.py`, the error message is as follows:
/usr/bin/env /opt/conda/bin/python /home/oppoer/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher 57073 -- /home/notebook/code/personal/80402852/ProjectSDDreamClear1209/test_wcx.py
/opt/conda/lib/python3.10/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2024-12-11 06:51:50,078 - PixArt - INFO - Initializing: DDP for inference
2024-12-11 06:51:54,613 - PixArt - WARNING - lewei scale: (2.0,), base size: 64
2024-12-11 06:52:01,220 - PixArt - INFO - Using fp16 inference for DiT.
2024-12-11 06:52:01,225 - PixArt - INFO - ControlPixArtMSHalfSR2Branch Model Parameters: 2,211,876,212
2024-12-11 06:52:01,225 - PixArt - INFO - T5 max token length: 120
An error occurred while trying to fetch /home/notebook/data/sharedgroup/RG_YLab/aigc_share_group_data/wuchaoxiong/DiffusionModels/PixArt-alpha/sd-vae-ft-ema: Error no file named diffusion_pytorch_model.safetensors found in directory /home/notebook/data/sharedgroup/RG_YLab/aigc_share_group_data/wuchaoxiong/DiffusionModels/PixArt-alpha/sd-vae-ft-ema.
Defaulting to unsafe serialization. Pass allow_pickle=False to raise an error instead.
2024-12-11 06:53:16,758 - PixArt - INFO - Load checkpoint from ckpt/DreamClear-1024.pth. Load ema: False.
2024-12-11 06:53:16,761 - PixArt - WARNING - Missing keys: ['base_model.pos_embed']
2024-12-11 06:53:16,761 - PixArt - WARNING - Unexpected keys: []
/home/notebook/data/sharedgroup/RG_YLab/aigc_share_group_data/wuchaoxiong/DiffusionModels/PixArt-alpha/t5-v1_1-xxl/
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in huggingface/transformers#24565
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:37<00:00, 18.93s/it]
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:22<00:00, 3.68s/it]
You are resizing the embedding layer without providing a pad_to_multiple_of parameter. This means that the new embedding dimension will be 32000. This might induce some performance reduction as Tensor Cores will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
Traceback (most recent call last):
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/test_wcx.py", line 341, in
log_validation(model,accelerator,model.device)
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/diffusion/model/nets/pixart_controlnet.py", line 528, in getattr
return getattr(self.base_model, name)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in getattr
raise AttributeError(f"'{type(self).name}' object has no attribute '{name}'")
AttributeError: 'PixArtMS' object has no attribute 'device'

Could you also take a look at this error message when you get a chance?
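For reference, a plain `nn.Module` has no `.device` attribute, which is why the `model.device` lookup falls through the ControlNet wrapper's `__getattr__` here when the model is not wrapped by the distributed launcher. A hedged workaround sketch (an assumption, not the repo's official fix) is to read the device off the parameters instead:

```python
# Hedged workaround sketch: nn.Module does not define .device, so infer the
# device from the module's parameters before calling log_validation.
import torch
import torch.nn as nn

def module_device(m: nn.Module) -> torch.device:
    return next(m.parameters()).device

# hypothetical change in test_wcx.py:
# log_validation(model, accelerator, module_device(model))
```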

@shallowdream204
Owner

When you ran with 4 GPUs, the error was in the SwinIR part. Could you provide the image you tested with?

@Deep-imagelab
Author

Deep-imagelab commented Dec 11, 2024

When you ran with 4 GPUs, the error was in the SwinIR part. Could you provide the image you tested with?

Is there a requirement on the test image size? I just grabbed a random image from the web; it is a 615x820x3 RGB image.

@shallowdream204
Owner

shallowdream204 commented Dec 11, 2024

I just updated util_image.py to fix this bug. Thanks a lot for the issue~
Please try running again with the updated util_image.py.
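For context, the `shape '[1, 51, 8, 38, 8, 180]' is invalid` failure above is what `window_partition` raises when the feature map's height and width are not multiples of the SwinIR window size. A minimal padding sketch along those lines (only an illustration, not the actual util_image.py change):

```python
# Minimal sketch: reflect-pad the low-quality input so H and W are multiples of
# the SwinIR window size; crop the output back afterward (scale h and w by the
# upscale factor if the SwinIR model upsamples).
import torch.nn.functional as F

def pad_to_window_multiple(x, window_size=8):
    """x: (B, C, H, W). Returns the padded tensor and the original (H, W)."""
    _, _, h, w = x.shape
    pad_h = (window_size - h % window_size) % window_size
    pad_w = (window_size - w % window_size) % window_size
    x = F.pad(x, (0, pad_w, 0, pad_h), mode='reflect')
    return x, (h, w)

# usage sketch around the failing call in log_validation:
# img_lq, (h, w) = pad_to_window_multiple(img_lq, window_size=8)
# img_pre = swinir(img_lq)
```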

@Deep-imagelab
Author

I just updated util_image.py to fix this bug. Thanks a lot for the issue~ Please try running again with the updated util_image.py.

I tried with the latest version and it still errors out. The error message is as follows:
/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
/opt/conda/lib/python3.10/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2024-12-12 01:43:49,619 - PixArt - INFO - Initializing: DDP for inference
2024-12-12 01:43:54,108 - PixArt - WARNING - lewei scale: (2.0,), base size: 64
2024-12-12 01:44:01,586 - PixArt - INFO - Using fp16 inference for DiT.
2024-12-12 01:44:01,591 - PixArt - INFO - ControlPixArtMSHalfSR2Branch Model Parameters: 2,211,876,212
2024-12-12 01:44:01,591 - PixArt - INFO - T5 max token length: 120
An error occurred while trying to fetch /home/notebook/data/sharedgroup/RG_YLab/aigc_share_group_data/wuchaoxiong/DiffusionModels/PixArt-alpha/sd-vae-ft-ema: Error no file named diffusion_pytorch_model.safetensors found in directory /home/notebook/data/sharedgroup/RG_YLab/aigc_share_group_data/wuchaoxiong/DiffusionModels/PixArt-alpha/sd-vae-ft-ema.
Defaulting to unsafe serialization. Pass allow_pickle=False to raise an error instead.
2024-12-12 01:45:27,224 - PixArt - INFO - Load checkpoint from ckpt/DreamClear-1024.pth. Load ema: False.
2024-12-12 01:45:27,231 - PixArt - WARNING - Missing keys: ['base_model.pos_embed']
2024-12-12 01:45:27,231 - PixArt - WARNING - Unexpected keys: []
/home/notebook/data/sharedgroup/RG_YLab/aigc_share_group_data/wuchaoxiong/DiffusionModels/PixArt-alpha/t5-v1_1-xxl/
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in huggingface/transformers#24565
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:39<00:00, 19.56s/it]
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:23<00:00, 3.89s/it]
You are resizing the embedding layer without providing a pad_to_multiple_of parameter. This means that the new embedding dimension will be 32000. This might induce some performance reduction as Tensor Cores will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
Traceback (most recent call last):
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/test_wcx.py", line 341, in
log_validation(model,accelerator,model.device)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/test_wcx.py", line 82, in log_validation
caption = llava_model.get_caption([img_pre_pil])
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/llava/llava_caption.py", line 88, in get_caption
output_ids = self.model.generate(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/llava/model/language_model/llava_llama.py", line 139, in generate
return super().generate(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1648, in generate
return self.sample(
File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2730, in sample
outputs = self(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/llava/model/language_model/llava_llama.py", line 92, in forward
return super().forward(
TypeError: LlamaForCausalLM.forward() got an unexpected keyword argument 'cache_position'
[2024-12-12 01:46:40,879] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 22140) of binary: /opt/conda/bin/python3
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in
main()
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

test_wcx.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-12-12_01:46:40
host : task-20241211142401-45448
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 22140)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

@shallowdream204
Owner

It's a transformers version issue. Try:
pip install transformers==4.44.2
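For reference, a hedged sketch of a runtime guard along these lines (only an illustration; the script does not necessarily include such a check):

```python
# Hedged sketch: fail fast if the installed transformers does not match the
# version suggested above, instead of hitting the cache_position TypeError
# deep inside generate().
import transformers

EXPECTED = "4.44.2"
if transformers.__version__ != EXPECTED:
    raise RuntimeError(
        f"transformers {transformers.__version__} found; the repo is reported to work "
        f"with {EXPECTED} (pip install transformers=={EXPECTED})."
    )
```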

@Deep-imagelab
Author

pip install transformers==4.44.2

It still errors out:
Traceback (most recent call last):
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/test_wcx.py", line 341, in
log_validation(model,accelerator,model.device)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/test_wcx.py", line 87, in log_validation
txt_fea = caption_emb[None].to(device)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[2024-12-12 08:03:31,035] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 24603) of binary: /opt/conda/bin/python3
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in
main()
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

test_wcx.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-12-12_08:03:31
host : task-20241211142401-45448
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 24603)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

@shallowdream204
Owner

I can't tell what the problem is from this error. If it's convenient, could we connect on WeChat to discuss?
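As the CUDA error text above suggests, one way to localize asynchronous failures like this is to rerun with CUDA_LAUNCH_BLOCKING=1, so the traceback stops at the op that actually faulted. A minimal sketch, assuming it is set before any CUDA work happens (e.g., at the very top of test_wcx.py):

```python
# Hedged debugging sketch: force synchronous CUDA kernel launches so the Python
# traceback points at the operation that triggered the illegal memory access.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after setting the env var so the setting takes effect
```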

@Deep-imagelab
Author

I can't tell what the problem is from this error. If it's convenient, could we connect on WeChat to discuss?

15529609856 is my phone number; you can find my WeChat by searching it. Thank you very much~
