
Assertion `flagCXEnsureCommReady() == flagcxSuccess' failed #30

Open
Nemo-HelloWorld opened this issue Feb 7, 2025 · 7 comments

@Nemo-HelloWorld

I tried setting the process group backend to cuda:flagcx in Megatron-LM 0.8, but execution fails with the following error:
python3: flagcx/flagcx.cc:93: bool is_homo_comm(): Assertion `flagCXEnsureCommReady() == flagcxSuccess' failed.
Also, setting export FLAGCX_DEBUG=INFO produces no debug output. What usually causes this?

@MC952-arch
Collaborator

Hi,

The error above is most likely caused by a failure while initializing the comm. Have you been able to run plugin/torch/example.py? Please also share your Python-side changes and the log files.

@Nemo-HelloWorld
Author

plugin/torch/example.py runs fine. I directly modified the _initialize_distributed function in megatron/training/initialize.py:

torch.distributed.init_process_group(
    backend='cuda:flagcx',    # args.distributed_backend,
    world_size=args.world_size,
    rank=args.rank,
    timeout=timedelta(minutes=args.distributed_timeout_minutes),
)

@Nemo-HelloWorld
Author

When testing on a single NVIDIA machine, I get the following output and then the run exits immediately:

> initializing torch distributed ...
> initialized tensor model parallel with size 1
> initialized pipeline model parallel with size 8
> setting random seeds to 1234 ...
> compiling dataset index builder ...
make: Entering directory '/data2/nfs/liyucong/Megatron-LM-gloo/megatron/core/datasets'
W0208 02:44:32.934000 140291211355968 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 320534 closing signal SIGTERM
W0208 02:44:32.941000 140291211355968 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 320535 closing signal SIGTERM
W0208 02:44:32.970000 140291211355968 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 320536 closing signal SIGTERM
W0208 02:44:32.974000 140291211355968 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 320537 closing signal SIGTERM
W0208 02:44:32.974000 140291211355968 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 320539 closing signal SIGTERM
W0208 02:44:32.979000 140291211355968 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 320540 closing signal SIGTERM
W0208 02:44:32.980000 140291211355968 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 320541 closing signal SIGTERM
E0208 02:44:39.662000 140291211355968 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -11) local_rank: 4 (pid: 320538) of binary: /usr/bin/python

@MC952-arch
Collaborator

plugin/torch/example.py runs fine. I directly modified the _initialize_distributed function in megatron/training/initialize.py:

torch.distributed.init_process_group(
    backend='cuda:flagcx',    # args.distributed_backend,
    world_size=args.world_size,
    rank=args.rank,
    timeout=timedelta(minutes=args.distributed_timeout_minutes),
)

The flagcx comm is initialized on the first communication op call. You can also add gloo: backend='cpu:gloo,cuda:flagcx'.
Is there still no log after setting the environment variables below?
export FLAGCX_DEBUG=INFO
export FLAGCX_DEBUG_SUBSYS=ALL
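
For reference, a minimal sketch of that setup (my own illustration, not from this issue; smoke_test_flagcx is a hypothetical helper, and the torchrun-style RANK/WORLD_SIZE/LOCAL_RANK handling and the 10-minute timeout are assumptions). Because the flagcx comm is created lazily on the first communication op, issuing an all_reduce on a CUDA tensor right after init_process_group should force the flagcx initialization and make the FLAGCX_DEBUG output appear:

# Minimal sketch (assumption): composite backend plus an explicit first CUDA
# collective to force flagcx comm initialization early. Assumes the FlagCX
# torch plugin is installed/registered (as in plugin/torch/example.py) so the
# 'flagcx' backend name is available, and that the script is launched with
# torchrun, which sets RANK/WORLD_SIZE/LOCAL_RANK.
import os
from datetime import timedelta

import torch
import torch.distributed as dist

def smoke_test_flagcx():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # CPU tensors/ops route to gloo, CUDA tensors/ops route to flagcx.
    dist.init_process_group(
        backend="cpu:gloo,cuda:flagcx",
        world_size=world_size,
        rank=rank,
        timeout=timedelta(minutes=10),
    )

    # First CUDA collective: the flagcx comm is created here, so any
    # initialization failure and the FLAGCX_DEBUG logs show up at this point.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)
    print(f"rank {rank}: all_reduce ok, value = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    smoke_test_flagcx()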

@Nemo-HelloWorld
Author

With gloo added, everything goes through gloo and flagcx doesn't take effect.
Setting the environment variables still produces no output.

@MC952-arch
Collaborator

With gloo added, everything goes through gloo and flagcx doesn't take effect. Setting the environment variables still produces no output.

Tensors on the GPU will go through flagcx. No output means execution never reached the flagcx initialization stage.

At what step was the Python code when the error "python3: flagcx/flagcx.cc:93: bool is_homo_comm(): Assertion `flagCXEnsureCommReady() == flagcxSuccess' failed." occurred?
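
One generic way to see where it happens (my own debugging sketch, not something from FlagCX or Megatron): enable Python's faulthandler, so that when the failed C assert raises SIGABRT (or the single-machine run dies with SIGSEGV, exitcode -11 above), Python prints the stack of every thread and shows which torch.distributed call was in flight:

# Generic debugging sketch (assumption, not FlagCX-specific): dump the Python
# stack when the process dies on SIGABRT (failed assert) or SIGSEGV, so the
# offending torch.distributed call is visible. Add near the top of the
# training entry point, before _initialize_distributed() is called.
import faulthandler
faulthandler.enable(all_threads=True)

# The same effect without editing code: set PYTHONFAULTHANDLER=1 in the
# environment of every rank. Setting TORCH_DISTRIBUTED_DEBUG=DETAIL in
# addition makes PyTorch log every collective call, which helps correlate
# the crash with a specific op.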

@MC952-arch
Collaborator

I tried setting the process group backend to cuda:flagcx in Megatron-LM 0.8, but execution fails with the following error: python3: flagcx/flagcx.cc:93: bool is_homo_comm(): Assertion `flagCXEnsureCommReady() == flagcxSuccess' failed. Also, setting export FLAGCX_DEBUG=INFO produces no debug output. What usually causes this?

Please try the latest commit and check whether this issue has already been fixed.
