[cambricon] Fix MLU PID retrieval issue in nnodes FlagScale training #737

Merged
merged 1 commit into from
Sep 10, 2024

@cifar10 cifar10 commented Sep 10, 2024

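1. Original code
In multi-node training, the master node SSHes into every node (including itself) and starts a flagscale process there, capturing that process's PID at launch. The PID file saved for the previous node is deleted and then rewritten with the current node's PID, and this repeats until the last node is reached.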

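2. Problem
(1) When running four-node (multi-node) training in the Cambricon image, the master node starts the flagscale process on each of the other nodes in turn, capturing and overwriting the PID each time. On those nodes, however, the launching (initialization) process exits as soon as it has spawned the child processes that do the actual work, so the captured PID is no longer valid.
(2) The wait_for_finish function checks whether the PID passed to it is still running in order to decide whether to stop the program. Because the captured PID is invalid, the program terminates prematurely on every node right away.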
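
The failure mode in (2) can be reproduced locally with a minimal sketch. It is not FlagScale's `wait_for_finish`. The helper name `pid_is_alive` is hypothetical, and a short-lived Python subprocess stands in for the launcher that exits after starting its workers. The check assumes a POSIX system.

```python
import os
import subprocess
import sys


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only performs error checking; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


# Hypothetical stand-in for the launcher: a process that exits right away.
launcher = subprocess.Popen([sys.executable, "-c", "pass"])
launcher.wait()  # the launcher has exited and been reaped
stale_pid = launcher.pid

# A liveness check on the launcher's PID now reports "not running",
# even though real workers it spawned could still be training.
print("launcher PID alive:", pid_is_alive(stale_pid))
print("current PID alive:", pid_is_alive(os.getpid()))
```

Any loop that waits on `stale_pid` would conclude the job has finished immediately, which matches the early termination described above.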

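3. Changes
After the program has been started on each node, the PID of the process that is actually running is obtained with ps -aux | grep "MLU_VISIBLE_DEVICES" and saved to the file.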
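
The lookup can be sketched in Python as follows. This is not the merged change itself, which uses the shell pipeline above on each node. The function names `find_pids` and `save_pids`, the PID file path, and the sample `ps` output are all made up for illustration, and the sketch assumes the marker string appears in the worker's command line.

```python
from typing import List


def find_pids(ps_output: str, marker: str) -> List[int]:
    """Return PIDs from `ps aux`-style output whose command contains marker."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)  # the 11th field is the full command line
        if len(fields) < 11:
            continue
        command = fields[10]
        # Skip the grep process itself when the output came from a shell pipeline.
        if marker in command and not command.startswith("grep"):
            pids.append(int(fields[1]))
    return pids


def save_pids(pids: List[int], path: str) -> None:
    """Overwrite the PID file with one PID per line (the path is hypothetical)."""
    with open(path, "w") as f:
        f.write("\n".join(str(p) for p in pids))


# Made-up sample output, for illustration only.
SAMPLE = """\
USER  PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root  101  0.0  0.0 1000 100 ?  Ss 10:00 0:00 bash -c MLU_VISIBLE_DEVICES=0,1 torchrun train.py
root  202  0.0  0.0 1000 100 ?  S  10:00 0:00 grep MLU_VISIBLE_DEVICES
root  303  0.0  0.0 1000 100 ?  S  10:00 0:00 sshd: root@pts/0
"""

print(find_pids(SAMPLE, "MLU_VISIBLE_DEVICES"))  # → [101]
```

On a real node, the text passed to `find_pids` could come from `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout`.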

@KerwinKai KerwinKai merged commit 42a92ff into FlagOpen:main Sep 10, 2024
1 check passed