
[Issue]: nccl nn.parallel error, needs a more experienced engineer to look into it #1442

Open
jdgh000 opened this issue Dec 1, 2024 · 2 comments

Comments

@jdgh000

jdgh000 commented Dec 1, 2024

Problem Description

I filed #1421, but I am being led on a wild goose chase there and need a more seasoned, experienced engineer to look into it. After it was filed:

  1. It was confirmed that the issue could be reproduced.
  2. Another commenter, hackrill, claimed the problem is specific to IG (integrated graphics), which is wrong because I had already said the GPU is an MI250. He drew that conclusion only because I was initially lax about providing the CPU info (which is irrelevant), and I corrected it later.
  3. He then changed the story to say it is reproducible on IG only, but provided no log, let alone a log of the supposedly successful run on discrete hardware (i.e. MI250).
  4. He keeps asking for information I have already provided.

    The debugging there is in total shambles; I cannot follow how the issue is being investigated because it is being done in such a blindly random way, and it needs a more seasoned engineer to look at it seriously. The failing case is the basic, common nn.DataParallel model from the PyTorch tutorial, which does not run on MI250 (see the sketch below):
    https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
    On NVIDIA, it runs fine.
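
For reference, here is a minimal sketch of the nn.DataParallel pattern from that tutorial (layer sizes and batch size are illustrative placeholders, not the exact tutorial values); this is the kind of script that reportedly fails on MI250 but runs fine on NVIDIA:

```python
# Minimal nn.DataParallel sketch modeled on the linked PyTorch tutorial
# (shapes and sizes here are illustrative placeholders).
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self, in_features=5, out_features=2):
        super().__init__()
        self.fc = nn.Linear(in_features, out_features)

    def forward(self, x):
        return self.fc(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ToyModel()

# Replicate the model across all visible GPUs; this is the step where a
# multi-GPU failure on MI250 would surface.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model.to(device)

inputs = torch.randn(30, 5).to(device)
outputs = model(inputs)
print("output size:", outputs.size())
```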

Operating System

rhel9

CPU

epyc

GPU

mi250

ROCm Version

ROCm 6.2.0

ROCm Component

rccl

Steps to Reproduce

see #1421 for details.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@nileshnegi
Collaborator

nileshnegi commented Dec 1, 2024

Able to run on 1 and 8 gfx90a GPUs. See attached logs:
stdout_ngpu1.log
stdout_ngpu8.log
(Edit: added GPU info for verbosity)

How are you building ROCm PyTorch?
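
For reference, a minimal sketch (assuming a standard ROCm build of PyTorch) of how the install and GPU visibility could be reported when answering this question:

```python
# Quick environment report for a ROCm PyTorch install (a sketch, not part of
# the original thread). Only standard torch attributes are used.
import torch

print("torch version:    ", torch.__version__)
print("HIP/ROCm version: ", torch.version.hip)        # None on CUDA/CPU-only builds
print("GPU available:    ", torch.cuda.is_available())
print("GPU count:        ", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"device {i}:", torch.cuda.get_device_name(i))
```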

@nileshnegi
Collaborator

Any updates, @jdgh000?

zichguan-amd marked this as a duplicate of #1421 on Jan 14, 2025