NVlink support #174

Open
ritazh opened this issue Sep 26, 2024 · 3 comments
Comments

ritazh commented Sep 26, 2024

Are there any plans for adding support for NVLink? e.g. GB200 NVL72.
If so, can you share a rough example of what a typical DeviceClass and ResourceClaimTemplate might look like? Thanks!
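
For illustration only, here is a rough sketch of what such objects might look like with the `resource.k8s.io/v1beta1` DRA API. The `nvlinkDomain` attribute is hypothetical; the driver would need to publish something like it (which is what this issue is asking about) before this could actually work:

```yaml
# Sketch only: gpu.nvidia.com/nvlinkDomain is a hypothetical attribute,
# not something the driver publishes today.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu.nvidia.com
spec:
  selectors:
  - cel:
      expression: device.driver == 'gpu.nvidia.com'
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: nvlinked-gpus
spec:
  spec:
    devices:
      requests:
      - name: gpus
        deviceClassName: gpu.nvidia.com
        count: 4
      constraints:
      # Ask for all four GPUs to share the same (hypothetical) NVLink domain.
      - requests: ["gpus"]
        matchAttribute: gpu.nvidia.com/nvlinkDomain
```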

ritazh commented Oct 15, 2024

@klueska do you have any thoughts on this?

e.g. running `nvidia-smi topo -m` or `nvidia-smi nvlink --status` could expose the NVLink connections and topology information, so the scheduler could pick a node connected via NVLink over one that is not.
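
Schematically, the driver could surface what those commands report as device attributes in the ResourceSlice it publishes, so the scheduler has something to select on. A minimal sketch, assuming a v1beta1 ResourceSlice and invented attribute names (`nvlinkDomain`, `nvlinkPeerCount`) that exist only for illustration:

```yaml
# Hypothetical ResourceSlice fragment: nvlinkDomain/nvlinkPeerCount are made up
# here to show where NVLink topology data could live.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  name: node-a-gpu.nvidia.com
spec:
  driver: gpu.nvidia.com
  nodeName: node-a
  pool:
    name: node-a
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    basic:
      attributes:
        productName:
          string: NVIDIA GB200
        nvlinkDomain:            # hypothetical: shared NVLink fabric/domain ID
          string: nvl72-rack-17
        nvlinkPeerCount:         # hypothetical: number of NVLink-connected peers
          int: 71
```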

@ace-cohere

I'm also curious about whether/how DRA would support this scenario or similar use cases, e.g.:

  • require multiple nodes interconnected by NVLink
  • prefer multiple nodes interconnected by NVLink
  • prefer nodes based on IB cluster network topology
    • e.g. prefer all nodes from the same IB leafgroup/RoCE network block, otherwise best effort

Partitioning an IB network seems in some ways similar to TPU slices, which were discussed briefly at KubeCon along with partitionable devices (kubernetes/enhancements#4874). I'm curious whether that kind of approach might make sense for GPUs, but as a soft rather than a hard constraint.

It's possible to do this today with soft affinities, but there's a fragmentation issue it would be nice for the scheduler to try to solve: best-effort allocations don't necessarily align with the network topology 100%. I'm not sure DRA can do much better, but I figured I'd throw it out.
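
For reference, the soft-affinity workaround might look roughly like the sketch below. The topology label (`network.example.com/ib-leafgroup`) is hypothetical and would have to be set by whatever process labels nodes with their IB leaf group or NVLink domain:

```yaml
# Sketch of the "soft affinity" approach: prefer nodes from one (hypothetical)
# IB leaf group label, but fall back to any node if none are available.
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: network.example.com/ib-leafgroup   # hypothetical node label
            operator: In
            values: ["leafgroup-a"]
  containers:
  - name: trainer
    image: my-training-image:latest   # placeholder image
```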

ritazh commented Nov 26, 2024

This might be helpful to you: #97 (comment)
