Can we remove CUDA dependencies from nccl_net.h? #310

capcah · 2020-03-23T19:18:08Z

Currently nccl_net.h pulls in nccl.h only to use ncclResult_t, as far as I understand. However, nccl.h pulls in <cuda_runtime.h> and <cuda_fp16.h>, which means that plugins can only be built on machines with the CUDA headers available.

We could remove this dependency by splitting nccl.h.in into two: types.h.in and nccl.h. The first would contain the general type definitions (everything above the collective comms ops), and the second would include the first and contain the rest of the file. With this split, only nccl.h needs the CUDA headers.

This way we can make the .in file smaller, separate the API from the types, and allow nccl_net.h to include only types.h without pulling the CUDA headers in. What do y'all think?
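A minimal sketch of what the proposed split could look like; it is not the actual patch. The file name `types.h` comes from the proposal above. The function `examplePluginDevices` is hypothetical, and the enumerator values are copied from the `ncclResult_t` definition in nccl.h around the 2.5/2.6 releases, so check them against the version you target.

```c
/* types.h (sketch): type definitions only, no CUDA includes. */
#ifndef NCCL_TYPES_SKETCH_H_
#define NCCL_TYPES_SKETCH_H_

#include <stddef.h>

/* Values as in nccl.h around 2.5/2.6 (assumption: verify for your version). */
typedef enum {
  ncclSuccess = 0,
  ncclUnhandledCudaError = 1,
  ncclSystemError = 2,
  ncclInternalError = 3,
  ncclInvalidArgument = 4,
  ncclInvalidUsage = 5,
  ncclNumResults = 6
} ncclResult_t;

#endif /* NCCL_TYPES_SKETCH_H_ */

/* Plugin side (sketch): needs ncclResult_t but nothing from
 * <cuda_runtime.h> or <cuda_fp16.h>.
 * examplePluginDevices is a hypothetical name used for illustration. */
ncclResult_t examplePluginDevices(int* ndev) {
  if (ndev == NULL) return ncclInvalidArgument;
  *ndev = 0; /* this toy plugin reports no network devices */
  return ncclSuccess;
}
```

With this layout, nccl.h would include the types header and add the CUDA includes itself, so existing users of nccl.h would see no difference.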


sjeaugey · 2020-03-24T00:53:38Z

Indeed we could probably split them into two parts.

Alternatively, since the API is versioned, you can copy the definitions from nccl_net.h and nccl.h into your project and remove what you don't need. The definitions won't change in the version you choose to implement. When you want to support a new version, you can update your includes to the new definitions (or even add a new version of the struct alongside the other if you want to support multiple versions).
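A rough sketch of the copy-and-trim approach, with two versions of a plugin struct kept side by side. All `example*` names, the struct layouts, and the `maxComms` entry are hypothetical and are not the real nccl_net.h definitions. Only the `ncclResult_t` enumerator values are taken from nccl.h (2.5/2.6 era), trimmed to the ones this toy plugin uses.

```c
#include <stddef.h>

/* Local copy of ncclResult_t, trimmed to what this plugin needs.
 * Values match nccl.h around 2.5/2.6 (assumption: verify for your version). */
typedef enum {
  ncclSuccess = 0,
  ncclInternalError = 3,
  ncclInvalidArgument = 4
} ncclResult_t;

/* Hypothetical version 1 of a plugin interface. */
typedef struct {
  const char* name;
  ncclResult_t (*devices)(int* ndev);
} exampleNet_v1_t;

/* Hypothetical version 2: same entries plus one new call. */
typedef struct {
  const char* name;
  ncclResult_t (*devices)(int* ndev);
  ncclResult_t (*maxComms)(int dev, int* max);
} exampleNet_v2_t;

static ncclResult_t exDevices(int* ndev) {
  if (ndev == NULL) return ncclInvalidArgument;
  *ndev = 1; /* pretend one device is available */
  return ncclSuccess;
}

static ncclResult_t exMaxComms(int dev, int* max) {
  if (max == NULL || dev != 0) return ncclInvalidArgument;
  *max = 8; /* arbitrary illustrative limit */
  return ncclSuccess;
}

/* One exported symbol per supported version; both share the implementation. */
const exampleNet_v1_t exampleNetPlugin_v1 = { "example", exDevices };
const exampleNet_v2_t exampleNetPlugin_v2 = { "example", exDevices, exMaxComms };
```

Because each versioned struct is frozen once published, the v1 definition can stay untouched while v2 is added next to it.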


capcah · 2020-03-25T17:42:28Z

I sent a pull request implementing a version of this; let me know if it works for you. Copying the file into our projects works, but at Google we prefer to track upstream repositories as much as possible, so if you think that this PR (or something similar, open to suggestions!) works, we'd prefer to get that upstreamed.

sjeaugey · 2020-03-25T17:55:09Z

Thanks. The patch looks good to me. We're pretty busy on other things right now so it might take us a week or two before we can look into this; let me know if you need this merged quicker than that.

Also, don't hesitate to ping us again if nothing has happened two weeks from now.

capcah · 2020-03-26T16:14:20Z

That's alright, I can keep the work on my side. Should I also send PRs for the 2.6 branch?

sjeaugey · 2020-03-26T18:19:56Z

It would indeed be better to base your patches on 2.6. But hopefully, given that those parts didn't change much in 2.6, the rebase should be easy.


capcah added a commit to capcah/nccl that referenced this issue Mar 25, 2020

Move the NCCL types to its own file.

08a5321

This allows us to remove CUDA dependencies from nccl_net as described in NVIDIA#310.

capcah mentioned this issue Mar 25, 2020

Move the NCCL types to its own file. #312

Open

capcah added a commit to capcah/nccl that referenced this issue May 15, 2020

Move the NCCL types to its own file.

c8c53fd

This allows us to remove CUDA dependencies from nccl_net as described in NVIDIA#310.
