How many GPUs are needed to run the inference code? #24

Closed
Deep-imagelab opened this issue Dec 10, 2024 · 11 comments
@Deep-imagelab

I get different errors when running inference on multiple GPUs versus a single GPU. Following the inference code's default settings:
llava_device = 'cuda:1'
t5llm_device = 'cuda:2'
Does this mean at least three GPUs are required? How many GPUs, and of what type, do you use to run the inference code?

@shallowdream204
Owner

shallowdream204 commented Dec 10, 2024

Yes, I use 3 A100s for inference, but three GPUs with >=32G memory (e.g., V100) should also work.
What error are you getting?
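For reference, a minimal sketch of the three-device split those defaults imply (the `dit_device` name and the `cuda:0` index are assumptions for illustration; only the LLaVA and T5 devices are quoted above):

```python
# Hypothetical sketch of the three-GPU layout the default settings imply;
# adjust the indices to whichever cards are actually free on your machine.
import torch

assert torch.cuda.device_count() >= 3, "the default config expects three visible GPUs"

dit_device   = torch.device('cuda:0')  # assumed: DreamClear DiT + VAE on the remaining card
llava_device = torch.device('cuda:1')  # LLaVA captioner (default from the inference script)
t5llm_device = torch.device('cuda:2')  # T5 text encoder (default from the inference script)
```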

@Deep-imagelab
Author

I tried single-GPU and dual-GPU, and today I tried again on four GPUs. The error message is as follows:
/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
/opt/conda/lib/python3.10/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2024-12-11 06:43:35,592 - PixArt - INFO - Initializing: DDP for inference
2024-12-11 06:43:40,183 - PixArt - WARNING - lewei scale: (2.0,), base size: 64
2024-12-11 06:43:46,822 - PixArt - INFO - Using fp16 inference for DiT.
2024-12-11 06:43:46,827 - PixArt - INFO - ControlPixArtMSHalfSR2Branch Model Parameters: 2,211,876,212
2024-12-11 06:43:46,827 - PixArt - INFO - T5 max token length: 120
An error occurred while trying to fetch /home/notebook/data/sharedgroup/RG_YLab/aigc_share_group_data/wuchaoxiong/DiffusionModels/PixArt-alpha/sd-vae-ft-ema: Error no file named diffusion_pytorch_model.safetensors found in directory /home/notebook/data/sharedgroup/RG_YLab/aigc_share_group_data/wuchaoxiong/DiffusionModels/PixArt-alpha/sd-vae-ft-ema.
Defaulting to unsafe serialization. Pass allow_pickle=False to raise an error instead.
2024-12-11 06:44:57,769 - PixArt - INFO - Load checkpoint from ckpt/DreamClear-1024.pth. Load ema: False.
2024-12-11 06:44:57,772 - PixArt - WARNING - Missing keys: ['base_model.pos_embed']
2024-12-11 06:44:57,772 - PixArt - WARNING - Unexpected keys: []
/home/notebook/data/sharedgroup/RG_YLab/aigc_share_group_data/wuchaoxiong/DiffusionModels/PixArt-alpha/t5-v1_1-xxl/
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in huggingface/transformers#24565
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:35<00:00, 17.55s/it]
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:26<00:00, 4.50s/it]
You are resizing the embedding layer without providing a pad_to_multiple_of parameter. This means that the new embedding dimension will be 32000. This might induce some performance reduction as Tensor Cores will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
Traceback (most recent call last):
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/test_wcx.py", line 341, in
log_validation(model,accelerator,model.device)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/test_wcx.py", line 77, in log_validation
img_pre = swinir(img_lq)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/diffusion/model/nets/swinir.py", line 883, in forward
x = self.conv_after_body(self.forward_features(x)) + x
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/diffusion/model/nets/swinir.py", line 855, in forward_features
x = layer(x, x_size)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/diffusion/model/nets/swinir.py", line 488, in forward
return self.patch_embed(self.conv(self.patch_unembed(self.residual_group(x, x_size), x_size))) + x
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/diffusion/model/nets/swinir.py", line 408, in forward
x = blk(x, x_size)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/diffusion/model/nets/swinir.py", line 261, in forward
x_windows = window_partition(shifted_x, self.window_size)  # nW*B, window_size, window_size, C
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/diffusion/model/nets/swinir.py", line 49, in window_partition
x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
RuntimeError: shape '[1, 51, 8, 38, 8, 180]' is invalid for input of size 22730400
[2024-12-11 06:46:11,725] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 5003) of binary: /opt/conda/bin/python3
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in
main()
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

test_wcx.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-12-11_06:46:11
host : task-20241211142401-45448
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 5003)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Could you please help take a look?

@Deep-imagelab
Author

If I run on a single GPU and change the launch command to a plain `python test_wcx.py`, the error message is as follows:
/usr/bin/env /opt/conda/bin/python /home/oppoer/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher 57073 -- /home/notebook/code/personal/80402852/ProjectSDDreamClear1209/test_wcx.py
/opt/conda/lib/python3.10/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2024-12-11 06:51:50,078 - PixArt - INFO - Initializing: DDP for inference
2024-12-11 06:51:54,613 - PixArt - WARNING - lewei scale: (2.0,), base size: 64
2024-12-11 06:52:01,220 - PixArt - INFO - Using fp16 inference for DiT.
2024-12-11 06:52:01,225 - PixArt - INFO - ControlPixArtMSHalfSR2Branch Model Parameters: 2,211,876,212
2024-12-11 06:52:01,225 - PixArt - INFO - T5 max token length: 120
An error occurred while trying to fetch /home/notebook/data/sharedgroup/RG_YLab/aigc_share_group_data/wuchaoxiong/DiffusionModels/PixArt-alpha/sd-vae-ft-ema: Error no file named diffusion_pytorch_model.safetensors found in directory /home/notebook/data/sharedgroup/RG_YLab/aigc_share_group_data/wuchaoxiong/DiffusionModels/PixArt-alpha/sd-vae-ft-ema.
Defaulting to unsafe serialization. Pass allow_pickle=False to raise an error instead.
2024-12-11 06:53:16,758 - PixArt - INFO - Load checkpoint from ckpt/DreamClear-1024.pth. Load ema: False.
2024-12-11 06:53:16,761 - PixArt - WARNING - Missing keys: ['base_model.pos_embed']
2024-12-11 06:53:16,761 - PixArt - WARNING - Unexpected keys: []
/home/notebook/data/sharedgroup/RG_YLab/aigc_share_group_data/wuchaoxiong/DiffusionModels/PixArt-alpha/t5-v1_1-xxl/
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in huggingface/transformers#24565
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:37<00:00, 18.93s/it]
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:22<00:00, 3.68s/it]
You are resizing the embedding layer without providing a pad_to_multiple_of parameter. This means that the new embedding dimension will be 32000. This might induce some performance reduction as Tensor Cores will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
Traceback (most recent call last):
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/test_wcx.py", line 341, in
log_validation(model,accelerator,model.device)
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/diffusion/model/nets/pixart_controlnet.py", line 528, in getattr
return getattr(self.base_model, name)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in getattr
raise AttributeError(f"'{type(self).name}' object has no attribute '{name}'")
AttributeError: 'PixArtMS' object has no attribute 'device'

Could you also take a look at this error message when you get a chance?
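For reference, a plain `nn.Module` has no `.device` attribute, which is why the `model.device` lookup falls through the ControlNet wrapper's `__getattr__` here when the model is not wrapped by the distributed launcher. A hedged workaround sketch (an assumption, not the repo's official fix) is to read the device off the parameters instead:

```python
# Hedged workaround sketch: nn.Module does not define .device, so infer the
# device from the module's parameters before calling log_validation.
import torch
import torch.nn as nn

def module_device(m: nn.Module) -> torch.device:
    return next(m.parameters()).device

# hypothetical change in test_wcx.py:
# log_validation(model, accelerator, module_device(model))
```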

@shallowdream204
Owner

When you ran with 4 GPUs, the error was in the SwinIR part. Could you provide the image you tested with?

@Deep-imagelab
Author

Deep-imagelab commented Dec 11, 2024

When you ran with 4 GPUs, the error was in the SwinIR part. Could you provide the image you tested with?

Is there a requirement on the test image size? I just grabbed a random image from the web; it is a 615x820x3 RGB image.

@shallowdream204
Owner

shallowdream204 commented Dec 11, 2024

I just updated util_image.py to fix this bug. Thanks a lot for the issue~
Please try running again with the updated util_image.py.
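For context, the `shape '[1, 51, 8, 38, 8, 180]' is invalid` failure above is what `window_partition` raises when the feature map's height and width are not multiples of the SwinIR window size. A minimal padding sketch along those lines (only an illustration, not the actual util_image.py change):

```python
# Minimal sketch: reflect-pad the low-quality input so H and W are multiples of
# the SwinIR window size; crop the output back afterward (scale h and w by the
# upscale factor if the SwinIR model upsamples).
import torch.nn.functional as F

def pad_to_window_multiple(x, window_size=8):
    """x: (B, C, H, W). Returns the padded tensor and the original (H, W)."""
    _, _, h, w = x.shape
    pad_h = (window_size - h % window_size) % window_size
    pad_w = (window_size - w % window_size) % window_size
    x = F.pad(x, (0, pad_w, 0, pad_h), mode='reflect')
    return x, (h, w)

# usage sketch around the failing call in log_validation:
# img_lq, (h, w) = pad_to_window_multiple(img_lq, window_size=8)
# img_pre = swinir(img_lq)
```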

@Deep-imagelab
Author

I just updated util_image.py to fix this bug. Thanks a lot for the issue~ Please try running again with the updated util_image.py.

I tried with the latest version and it still errors out. The error message is as follows:
/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
/opt/conda/lib/python3.10/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2024-12-12 01:43:49,619 - PixArt - INFO - Initializing: DDP for inference
2024-12-12 01:43:54,108 - PixArt - WARNING - lewei scale: (2.0,), base size: 64
2024-12-12 01:44:01,586 - PixArt - INFO - Using fp16 inference for DiT.
2024-12-12 01:44:01,591 - PixArt - INFO - ControlPixArtMSHalfSR2Branch Model Parameters: 2,211,876,212
2024-12-12 01:44:01,591 - PixArt - INFO - T5 max token length: 120
An error occurred while trying to fetch /home/notebook/data/sharedgroup/RG_YLab/aigc_share_group_data/wuchaoxiong/DiffusionModels/PixArt-alpha/sd-vae-ft-ema: Error no file named diffusion_pytorch_model.safetensors found in directory /home/notebook/data/sharedgroup/RG_YLab/aigc_share_group_data/wuchaoxiong/DiffusionModels/PixArt-alpha/sd-vae-ft-ema.
Defaulting to unsafe serialization. Pass allow_pickle=False to raise an error instead.
2024-12-12 01:45:27,224 - PixArt - INFO - Load checkpoint from ckpt/DreamClear-1024.pth. Load ema: False.
2024-12-12 01:45:27,231 - PixArt - WARNING - Missing keys: ['base_model.pos_embed']
2024-12-12 01:45:27,231 - PixArt - WARNING - Unexpected keys: []
/home/notebook/data/sharedgroup/RG_YLab/aigc_share_group_data/wuchaoxiong/DiffusionModels/PixArt-alpha/t5-v1_1-xxl/
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in huggingface/transformers#24565
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:39<00:00, 19.56s/it]
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:23<00:00, 3.89s/it]
You are resizing the embedding layer without providing a pad_to_multiple_of parameter. This means that the new embedding dimension will be 32000. This might induce some performance reduction as Tensor Cores will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
Traceback (most recent call last):
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/test_wcx.py", line 341, in
log_validation(model,accelerator,model.device)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/test_wcx.py", line 82, in log_validation
caption = llava_model.get_caption([img_pre_pil])
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/llava/llava_caption.py", line 88, in get_caption
output_ids = self.model.generate(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/llava/model/language_model/llava_llama.py", line 139, in generate
return super().generate(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1648, in generate
return self.sample(
File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2730, in sample
outputs = self(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/llava/model/language_model/llava_llama.py", line 92, in forward
return super().forward(
TypeError: LlamaForCausalLM.forward() got an unexpected keyword argument 'cache_position'
[2024-12-12 01:46:40,879] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 22140) of binary: /opt/conda/bin/python3
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in
main()
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

test_wcx.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-12-12_01:46:40
host : task-20241211142401-45448
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 22140)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

@shallowdream204
Owner

It's a transformers version issue. Try:
pip install transformers==4.44.2
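For reference, a hedged sketch of a runtime guard along these lines (only an illustration; the script does not necessarily include such a check):

```python
# Hedged sketch: fail fast if the installed transformers does not match the
# version suggested above, instead of hitting the cache_position TypeError
# deep inside generate().
import transformers

EXPECTED = "4.44.2"
if transformers.__version__ != EXPECTED:
    raise RuntimeError(
        f"transformers {transformers.__version__} found; the repo is reported to work "
        f"with {EXPECTED} (pip install transformers=={EXPECTED})."
    )
```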

@Deep-imagelab
Author

pip install transformers==4.44.2

It still errors out:
Traceback (most recent call last):
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/test_wcx.py", line 341, in
log_validation(model,accelerator,model.device)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/notebook/code/personal/80402852/ProjectSDDreamClear1209/test_wcx.py", line 87, in log_validation
txt_fea = caption_emb[None].to(device)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[2024-12-12 08:03:31,035] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 24603) of binary: /opt/conda/bin/python3
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in
main()
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

test_wcx.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-12-12_08:03:31
host : task-20241211142401-45448
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 24603)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

@shallowdream204
Owner

I can't tell what the problem is from this error. If it's convenient, could we connect on WeChat to discuss?
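As the CUDA error text above suggests, one way to localize asynchronous failures like this is to rerun with CUDA_LAUNCH_BLOCKING=1, so the traceback stops at the op that actually faulted. A minimal sketch, assuming it is set before any CUDA work happens (e.g., at the very top of test_wcx.py):

```python
# Hedged debugging sketch: force synchronous CUDA kernel launches so the Python
# traceback points at the operation that triggered the illegal memory access.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after setting the env var so the setting takes effect
```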

@Deep-imagelab
Author

I can't tell what the problem is from this error. If it's convenient, could we connect on WeChat to discuss?

15529609856 is my phone number; you can find my WeChat by searching it. Thank you very much~
