Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can we remove CUDA dependencies from nccl_net.h? #310

Open
capcah opened this issue Mar 23, 2020 · 5 comments
Open

Can we remove CUDA dependencies from nccl_net.h? #310

capcah opened this issue Mar 23, 2020 · 5 comments

Comments

@capcah
Copy link

capcah commented Mar 23, 2020

Currently nccl_net.h pulls nccl.h only for using ncclResult_t, as far I understand. However, nccl.h pulls in <cuda_runtime.h> and <cuda_fp16.h>, which means that plugins can only be built in machines with the CUDA headers available.

We could remove this dependency by splitting nccl.h.in into two: types.h.in and nccl.h. The first would contain the general type definitions (everything above the collective comms ops), and the second would include the first and contain the rest of the file. With this split, only nccl.h needs the CUDA headers.

This way we can make the .in file smaller, separate API from types, and allow nccl_net.h to include only types.h without pulling cuda headers in. What do y'all think?

@sjeaugey
Copy link
Member

Indeed we could probably split them into two parts.

Alternatively, since the API is versioned, you can copy the definitions from nccl_net.h and nccl.h in your project, and remove what you don't need. The definitions won't change in the version you choose to implement. When you want to support a new version, you can update your includes to the new definitions (or even add a new version of the struct along the other if you want to support multiple versions).

capcah added a commit to capcah/nccl that referenced this issue Mar 25, 2020
This allows us to remove CUDA dependencies from nccl_net as described in
NVIDIA#310.
@capcah
Copy link
Author

capcah commented Mar 25, 2020

I sent a pull request implementing a version of this, let me know if this works for you. Copying the file into our projects works, but at Google we prefer to track upstream repositories as much as possible, so if you think that this PR (or something similar, open to suggestions!) works, we'd prefer to get that upstreamed.

@sjeaugey
Copy link
Member

Thanks. The patch looks good to me. We're pretty busy on other things right now so it might take us a week or two before we can look into this; let me know if you need this merged quicker than that.

Also don't hesitate to ping us again if nothing happened in two weeks from now.

@capcah
Copy link
Author

capcah commented Mar 26, 2020

That's alright, I can keep the work on my side. Should I also send PRs for the 2.6 branch?

@sjeaugey
Copy link
Member

It would be indeed better to base your patches on 2.6. But hopefully, given those parts didn't change much in 2.6, the rebase should be easy.

capcah added a commit to capcah/nccl that referenced this issue May 15, 2020
This allows us to remove CUDA dependencies from nccl_net as described in
NVIDIA#310.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants