NVlink support #174

Open
ritazh opened this issue Sep 26, 2024 · 3 comments
Comments

ritazh commented Sep 26, 2024

Are there any plans for adding support for NVLink? e.g. GB200 NVL72.
If so, can you share a rough example of what a typical DeviceClass and ResourceClaimTemplate might look like? Thanks!
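
For illustration only, here is a rough sketch of what such objects might look like with the `resource.k8s.io/v1beta1` DRA API. The `nvlinkDomain` attribute is hypothetical; the driver would need to publish something like it (which is what this issue is asking about) before this could actually work:

```yaml
# Sketch only: gpu.nvidia.com/nvlinkDomain is a hypothetical attribute,
# not something the driver publishes today.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu.nvidia.com
spec:
  selectors:
  - cel:
      expression: device.driver == 'gpu.nvidia.com'
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: nvlinked-gpus
spec:
  spec:
    devices:
      requests:
      - name: gpus
        deviceClassName: gpu.nvidia.com
        count: 4
      constraints:
      # Ask for all four GPUs to share the same (hypothetical) NVLink domain.
      - requests: ["gpus"]
        matchAttribute: gpu.nvidia.com/nvlinkDomain
```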

ritazh commented Oct 15, 2024

@klueska do you have any thoughts on this?

e.g. running `nvidia-smi topo -m` or `nvidia-smi nvlink --status` could expose the NVLink connections and topology information, so the scheduler could pick a node connected via NVLink over one that is not.
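
Schematically, the driver could surface what those commands report as device attributes in the ResourceSlice it publishes, so the scheduler has something to select on. A minimal sketch, assuming a v1beta1 ResourceSlice and invented attribute names (`nvlinkDomain`, `nvlinkPeerCount`) that exist only for illustration:

```yaml
# Hypothetical ResourceSlice fragment: nvlinkDomain/nvlinkPeerCount are made up
# here to show where NVLink topology data could live.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  name: node-a-gpu.nvidia.com
spec:
  driver: gpu.nvidia.com
  nodeName: node-a
  pool:
    name: node-a
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    basic:
      attributes:
        productName:
          string: NVIDIA GB200
        nvlinkDomain:            # hypothetical: shared NVLink fabric/domain ID
          string: nvl72-rack-17
        nvlinkPeerCount:         # hypothetical: number of NVLink-connected peers
          int: 71
```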

@ace-cohere

I'm also curious about whether/how DRA would support this scenario or similar use cases, e.g.:

  • require multiple nodes interconnected by NVLink
  • prefer multiple nodes interconnected by NVLink
  • prefer nodes based on IB cluster network topology
    • e.g. prefer all nodes from the same IB leafgroup/RoCE network block, otherwise best effort

Partitioning an IB network seems in some ways similar to TPU slices, which were discussed briefly at KubeCon along with partitionable devices (kubernetes/enhancements#4874). I'm curious whether that kind of approach might make sense for GPUs, but as a soft rather than a hard constraint.

It's possible to do this today with soft affinities, but there's a fragmentation issue it would be nice for the scheduler to try to solve: best-effort allocations don't necessarily align with the network topology 100%. I'm not sure DRA can do much better, but I figured I'd throw it out.
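
For reference, the soft-affinity workaround might look roughly like the sketch below. The topology label (`network.example.com/ib-leafgroup`) is hypothetical and would have to be set by whatever process labels nodes with their IB leaf group or NVLink domain:

```yaml
# Sketch of the "soft affinity" approach: prefer nodes from one (hypothetical)
# IB leaf group label, but fall back to any node if none are available.
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: network.example.com/ib-leafgroup   # hypothetical node label
            operator: In
            values: ["leafgroup-a"]
  containers:
  - name: trainer
    image: my-training-image:latest   # placeholder image
```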

ritazh commented Nov 26, 2024

This might be helpful to you: #97 (comment)
