Assertion `flagCXEnsureCommReady() == flagcxSuccess' failed #30
Comments
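Hello, the error above is most likely caused by a failure while initializing the comm. Have you run plugin/torch/example.py? Also, please post your Python-side changes and the log file. |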
plugin/torch/example.py runs successfully. I made my change directly in the _initialize_distributed function in megatron/training/initialize.py
|
When testing on a single NV machine, the following output appears and then the process exits immediately
|
The flagcx comm is initialized the first time a communication op is called. You can also add gloo: backend='cpu:gloo,cuda:flagcx'. |
Once gloo is added, everything goes through gloo and flagcx does not take effect. |
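Tensors on the GPU go through flagcx. The absence of output means execution never reached the flagcx initialization stage. Which step had the Python code reached when the error "python3: flagcx/flagcx.cc:93: bool is_homo_comm(): Assertion `flagCXEnsureCommReady() == flagcxSuccess' failed." occurred? |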
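For readers following along, here is a minimal sketch of the per-device backend setup discussed in this thread. It is not taken from the reporter's Megatron-LM changes. `build_backend_string` and `init_and_trigger` are hypothetical helper names, and the sketch assumes the FlagCX torch plugin has already been registered with `torch.distributed` (for example the way plugin/torch/example.py does it) and that the script is launched with `torchrun` so the usual rank environment variables are set.

```python
import os


def build_backend_string(device_backends):
    """Join a {device: backend} mapping into the 'dev:backend,...' form
    accepted by torch.distributed.init_process_group."""
    return ",".join(f"{dev}:{name}" for dev, name in device_backends.items())


def init_and_trigger(backend):
    """Hypothetical helper: create the process group, then issue one
    collective per device type so each backend is actually exercised."""
    # Imported lazily so the string helper above works without PyTorch.
    import torch
    import torch.distributed as dist

    # Assumption: RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT come from torchrun,
    # and the "flagcx" backend name is already registered by the plugin.
    dist.init_process_group(backend=backend)
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))

    # A CUDA tensor is dispatched to the backend mapped to "cuda". According
    # to the thread, the flagcx comm is only initialized at this first op.
    gpu_tensor = torch.ones(1, device="cuda")
    dist.all_reduce(gpu_tensor)

    # A CPU tensor is dispatched to the backend mapped to "cpu" (gloo here).
    cpu_tensor = torch.ones(1)
    dist.all_reduce(cpu_tensor)

    dist.destroy_process_group()


if __name__ == "__main__":
    init_and_trigger(build_backend_string({"cpu": "gloo", "cuda": "flagcx"}))
```

If FLAGCX_DEBUG=INFO still prints nothing after the CUDA `all_reduce`, that would suggest, per the comment above, that the flagcx initialization path was never reached.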
Please try the latest commit and check whether this issue has already been fixed. |
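I tried setting the process group backend to cuda:flagcx in Megatron-LM 0.8, but it fails at runtime with the following error:
python3: flagcx/flagcx.cc:93: bool is_homo_comm(): Assertion `flagCXEnsureCommReady() == flagcxSuccess' failed.
Setting export FLAGCX_DEBUG=INFO does not produce any debug output either. What usually causes this?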