
[Issue]: nccl nn.parallel error, needs a more experienced engineer to look into it #1442

Open
jdgh000 opened this issue Dec 1, 2024 · 2 comments

Comments

@jdgh000

jdgh000 commented Dec 1, 2024

Problem Description

I filed #1421, but I am being led on a wild goose chase there and need a more seasoned, experienced engineer to look into it. After it was filed:

  1. It was confirmed that the issue could be reproduced.
  2. Another commenter, hackrill, claimed the problem is specific to IG (integrated graphics), which is wrong because I had already said the GPU is an MI250. He drew that conclusion only because I was initially lax about providing the CPU info (which is irrelevant), and I corrected it later.
  3. He then changed the story to say it is reproducible on IG only, but provided no log, let alone a log of the supposedly successful run on discrete hardware (i.e. MI250).
  4. He keeps asking for information I have already provided.

    The debugging there is in total shambles; I cannot follow how the issue is being investigated because it is being done in such a blindly random way, and it needs a more seasoned engineer to look at it seriously. The failing case is the basic, common nn.DataParallel model from the PyTorch tutorial, which does not run on MI250 (see the sketch below):
    https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
    On NVIDIA, it runs fine.
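
For reference, here is a minimal sketch of the nn.DataParallel pattern from that tutorial (layer sizes and batch size are illustrative placeholders, not the exact tutorial values); this is the kind of script that reportedly fails on MI250 but runs fine on NVIDIA:

```python
# Minimal nn.DataParallel sketch modeled on the linked PyTorch tutorial
# (shapes and sizes here are illustrative placeholders).
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self, in_features=5, out_features=2):
        super().__init__()
        self.fc = nn.Linear(in_features, out_features)

    def forward(self, x):
        return self.fc(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ToyModel()

# Replicate the model across all visible GPUs; this is the step where a
# multi-GPU failure on MI250 would surface.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model.to(device)

inputs = torch.randn(30, 5).to(device)
outputs = model(inputs)
print("output size:", outputs.size())
```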

Operating System

rhel9

CPU

epyc

GPU

mi250

ROCm Version

ROCm 6.2.0

ROCm Component

rccl

Steps to Reproduce

see #1421 for details.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@nileshnegi
Collaborator

nileshnegi commented Dec 1, 2024

Able to run on 1 and 8 gfx90a GPUs. See attached logs:
stdout_ngpu1.log
stdout_ngpu8.log
(Edit: added GPU info for verbosity)

How are you building ROCm PyTorch?
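
For reference, a minimal sketch (assuming a standard ROCm build of PyTorch) of how the install and GPU visibility could be reported when answering this question:

```python
# Quick environment report for a ROCm PyTorch install (a sketch, not part of
# the original thread). Only standard torch attributes are used.
import torch

print("torch version:    ", torch.__version__)
print("HIP/ROCm version: ", torch.version.hip)        # None on CUDA/CPU-only builds
print("GPU available:    ", torch.cuda.is_available())
print("GPU count:        ", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"device {i}:", torch.cuda.get_device_name(i))
```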

@nileshnegi
Collaborator

Any updates, @jdgh000?

zichguan-amd marked this as a duplicate of #1421 on Jan 14, 2025