
[Windows 11] Is GPU-based instruction fine-tuning currently impossible on this platform? #565

Closed · 7 tasks done
AceyKubbo opened this issue Jun 11, 2023 · 3 comments
Comments

AceyKubbo commented Jun 11, 2023

Detailed problem description

Since the deepspeed and nccl libraries cannot be used on Windows, does that mean training is currently not feasible on the Windows platform?
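
As a quick sanity check (a minimal sketch; these availability helpers are part of torch.distributed), the following reports which backends the local PyTorch build ships with. On Windows wheels, NCCL is expected to be absent while Gloo is normally present:

# Minimal sketch: report which torch.distributed backends this build includes.
import torch
import torch.distributed as dist

print("CUDA available:     ", torch.cuda.is_available())
print("distributed support:", dist.is_available())
print("NCCL built in:      ", dist.is_nccl_available())
print("Gloo built in:      ", dist.is_gloo_available())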

Reference information

Dependencies (must be provided for code-related issues)

Package       Version
transformers  4.29.1
torch         2.0.1+cu118
peft          0.3.0.dev0

Runtime log or screenshot

Traceback (most recent call last):
  File "E:\pyCode\Chinese-LLaMA-Alpaca\scripts\training\run_clm_sft_with_peft.py", line 468, in <module>
    main()
  File "E:\pyCode\Chinese-LLaMA-Alpaca\scripts\training\run_clm_sft_with_peft.py", line 205, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "D:\Python310\lib\site-packages\transformers\hf_argparser.py", line 346, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 118, in __init__
  File "D:\Python310\lib\site-packages\transformers\training_args.py", line 1333, in __post_init__
    and (self.device.type != "cuda")
  File "D:\Python310\lib\site-packages\transformers\training_args.py", line 1697, in device
    return self._setup_devices
  File "D:\Python310\lib\site-packages\transformers\utils\generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "D:\Python310\lib\site-packages\transformers\training_args.py", line 1631, in _setup_devices
    self.distributed_state = PartialState(backend=self.ddp_backend)
  File "D:\Python310\lib\site-packages\accelerate\state.py", line 143, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "D:\Python310\lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
  File "D:\Python310\lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
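
One possible workaround for this specific error (an untested sketch; ddp_backend is a standard transformers TrainingArguments field, but whether the rest of the training script then runs on Windows is not verified here) is to select the Gloo backend, which PyTorch does build on Windows, instead of the NCCL default:

# Untested sketch: pick Gloo instead of NCCL so torch.distributed can
# initialize on Windows. output_dir is a hypothetical placeholder.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",
    ddp_backend="gloo",  # avoids "Distributed package doesn't have NCCL built in"
)

Since run_clm_sft_with_peft.py builds TrainingArguments from the command line via HfArgumentParser, the equivalent when launching the script would be adding --ddp_backend gloo.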

Required checklist

  • Base model: Alpaca-Plus-7B
  • Operating system: Windows
  • Issue category: model training and fine-tuning
  • Model correctness check: be sure to verify the model against SHA256.md; with an incorrect model, proper results and normal operation cannot be guaranteed.
  • (Required) Since the related dependencies are updated frequently, make sure you have followed the relevant steps in the Wiki
  • (Required) I have read the FAQ section and searched the existing issues, and found no similar problem or solution
  • (Required) Third-party plugin issues: e.g. llama.cpp, text-generation-webui, LlamaChat, etc.; it is also recommended to look for a solution in the corresponding project
ymcui (Owner) commented Jun 12, 2023

I suggest you look into how to install these two libraries on Windows. We have not trained these models on Windows, so we are unable to help.

AceyKubbo (Author) commented Jun 12, 2023

> I suggest you look into how to install these two libraries on Windows. We have not trained these models on Windows, so we are unable to help.

I found it in the deepspeed issues; amusingly, it doesn't even support Microsoft's own OS.

Dropping the deepspeed option and trying nccl gives the same result; their issues mention that 2.0 is under development, so I'll check back then to see whether it supports Windows.

For now it looks like the only option is to set up a Linux virtual machine for training.

AceyKubbo (Author) commented

As a solution, consider WSL, the Linux subsystem that ships with Windows; that way you can still run training under Windows. Although it is a virtual machine, it feels much more convenient than a conventional VM.
