[Issue]: duplicate symbol Definition #1498

Open
Qizhi697 opened this issue Jan 22, 2025 · 2 comments

Comments

Qizhi697 commented Jan 22, 2025

Problem Description

I am using the image rocm/dev-ubuntu-20.04:6.3.1-complete. Running ./install.sh -t --prefix=${RCCL_INSTALL_PREFIX} fails at link time with the duplicate symbol errors below.
The code is checked out at the tag rocm-6.3.1.

ld.lld: error: duplicate symbol: ncclCommRegister
>>> defined at api_trace.cc
>>>            /tmp/api_trace-8e48f3.o:(ncclCommRegister)
>>> defined at nccl.cu
>>>            nccl.cu.o:(.text+0x6D80) in archive libmscclpp_nccl.a

ld.lld: error: duplicate symbol: ncclCommDeregister
>>> defined at api_trace.cc
>>>            /tmp/api_trace-8e48f3.o:(ncclCommDeregister)
>>> defined at nccl.cu
>>>            nccl.cu.o:(.text+0x6D90) in archive libmscclpp_nccl.a

ld.lld: error: duplicate symbol: ncclMemAlloc
>>> defined at api_trace.cc
>>>            /tmp/api_trace-8e48f3.o:(ncclMemAlloc)
>>> defined at nccl.cu
>>>            nccl.cu.o:(.text+0x6DA0) in archive libmscclpp_nccl.a

ld.lld: error: duplicate symbol: ncclMemFree
>>> defined at api_trace.cc
>>>            /tmp/api_trace-8e48f3.o:(ncclMemFree)
>>> defined at nccl.cu
>>>            nccl.cu.o:(.text+0x7390) in archive libmscclpp_nccl.a
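
All four errors name the same pair of definitions: one in api_trace.cc and one in nccl.cu inside libmscclpp_nccl.a. For longer build logs, a rough sketch like the one below lists each duplicated symbol once. It assumes the linker output is piped in on stdin; the function name list_duplicate_symbols is made up for illustration and is not part of RCCL, and the sample input is just the header lines of the errors above.

```shell
#!/bin/sh
# Hypothetical helper: read linker output on stdin and print each
# duplicated symbol name once, sorted.
list_duplicate_symbols() {
  # Keep only the "duplicate symbol" header lines and strip the prefix.
  sed -n 's/^ld\.lld: error: duplicate symbol: //p' | LC_ALL=C sort -u
}

# Sample input: header lines copied from the errors in this issue.
list_duplicate_symbols <<'EOF'
ld.lld: error: duplicate symbol: ncclCommRegister
>>> defined at api_trace.cc
ld.lld: error: duplicate symbol: ncclCommDeregister
>>> defined at api_trace.cc
ld.lld: error: duplicate symbol: ncclMemAlloc
>>> defined at api_trace.cc
ld.lld: error: duplicate symbol: ncclMemFree
>>> defined at api_trace.cc
EOF
# Prints:
# ncclCommDeregister
# ncclCommRegister
# ncclMemAlloc
# ncclMemFree
```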

Operating System

Ubuntu 20.04

CPU

Intel(R) Xeon(R) Platinum 8352Y CPU @ 2.20GHz

GPU

MI210

ROCm Version

ROCm 6.3.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@nileshnegi
Collaborator

This may be related to some recent commits in RCCL. Are you using the latest commit on the RCCL develop branch? Did you clone a fresh copy at that commit, or update an existing checkout with git pull?

If you used git pull, I would suggest also running git submodule update --init --recursive.

You can also try adding -l --disable-mscclpp to your install.sh command. This builds only for the local GPU target and disables MSCCLPP, which is not supported on MI210.
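
Put together, the two suggestions in this comment amount to something like the following. This is only a sketch: the run helper, the DRY_RUN variable, and the rebuild_rccl function are hypothetical conveniences, while the git and install.sh arguments are the ones quoted in this thread. It assumes it is invoked from the root of the RCCL checkout.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper around the steps suggested in this thread.
# $1: install prefix (the issue uses ${RCCL_INSTALL_PREFIX}).
rebuild_rccl() {
  prefix=$1
  # Bring submodules in line with the checked-out commit.
  run git submodule update --init --recursive || return 1
  # -t and --prefix come from the original command; -l and
  # --disable-mscclpp are the additions suggested in this comment.
  run ./install.sh -t -l --disable-mscclpp --prefix="$prefix"
}
```

Calling `DRY_RUN=1 rebuild_rccl "$RCCL_INSTALL_PREFIX"` prints the two commands without running them; leaving DRY_RUN unset executes them.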

@Qizhi697
Author

@nileshnegi

  • I am not using the latest commit from the develop branch, but rather the tag: rocm-6.3.1.
  • After adding -l and disabling MSCCLPP, the errors are gone. Thank you very much!
  • You mentioned that MSCCLPP is not supported on MI210. Could you please let me know which AMD GPUs are currently supported by MSCCLPP? Is there a maintained list of GPUs supported by MSCCLPP?
