nccl connection abort(between kubernetes pods): WARN NET/IB: read failed in ncclIbRoceGetVersionNum: Invalid argument #1573
Comments
NCCL gets the GID with ibv_query_gid and checks whether it is valid (not all zero and not a link-local GID) before calling ncclIbRoceGetVersionNum. If the GID is invalid, ncclIbRoceGetVersionNum is not executed, which is expected. But ibv_query_gid returns the real GID, which is valid. This GID should not be visible in the pod, because it does not belong to the pod's namespace.
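The validity test described above (not all zero, not link-local) can be sketched as follows. This is an illustration in Python, not NCCL's implementation; the function name `is_usable_gid` is hypothetical, and link-local is assumed to mean the standard IPv6 fe80::/10 prefix.

```python
import ipaddress


def is_usable_gid(gid_text: str) -> bool:
    """Return True when the GID is neither all-zero nor link-local (fe80::/10).

    Hypothetical helper for illustration; it only mirrors the two checks
    described in the comment above.
    """
    raw = ipaddress.IPv6Address(gid_text).packed  # 16 raw bytes of the GID
    if raw == bytes(16):
        return False  # unpopulated GID table entry
    # fe80::/10: first byte is 0xfe, top two bits of the second byte are 10
    is_link_local = raw[0] == 0xFE and (raw[1] & 0xC0) == 0x80
    return not is_link_local


if __name__ == "__main__":
    for gid in (
        "0000:0000:0000:0000:0000:0000:0000:0000",
        "fe80:0000:0000:0000:90ea:b5ff:fec5:3f24",
        "0000:0000:0000:0000:0000:ffff:0a98:080c",
    ):
        print(gid, is_usable_gid(gid))
```

The check only looks at the GID value. It cannot tell whether a non-zero GID belongs to the pod's namespace, which is the case this issue is about.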
The second solution is easier and we have tested |
Looks like the same issue as #1538 (comment). We might indeed need to avoid using |
Yes, or use __ibv_query_gid_ex. |
Thank you @limu713. The following patch adds GID index initialization to your fix and (if working) will be included in the 2.26 release. Could you verify this still works? |
Hi, I just tested this patch with the master branch and it works well. The result is below. |
Great! Thank you @limu713 |
I'm not a huge fan of returning success from a function that returns EINVAL. While this may work in the context of the detection loop, it doesn't make sense to call this function on a random GID and have it return success and a RoCE version of 0. Instead, it makes more sense to call validGid() first before calling RoCE version functions if using NCCLCHECK() |
@jlamanna |
We use NCCL in Kubernetes with rdma-device-plugin.
Pods communicate over macvlan sub-interfaces of the RoCE HCA, and each pod has a different GID index. When we run mpirun between two pods, the connection aborts. We traced the NCCL code and found that NCCL tries to read the file /sys/class/infiniband/$device/ports/$port_num/gid_attrs/types/$index, which cannot be read. The corresponding GID is 0000:0000:0000:0000:0000:0000:0000:0000 (cat /sys/class/infiniband/$device/ports/$port_num/gids/$index).
Here are the show_gids results from one pod. Each device has populated GIDs at indexes 4, 5, 6, and 7.
root@test-macvlan-pod-2:/# show_gids
DEV PORT INDEX GID IPv4 VER DEV
mlx5_0 1 4 fe80:0000:0000:0000:a4ed:ccff:fe8b:1994 v1 net1
mlx5_0 1 5 fe80:0000:0000:0000:a4ed:ccff:fe8b:1994 v2 net1
mlx5_0 1 6 0000:0000:0000:0000:0000:ffff:0a98:000c 10.152.0.12 v1 net1
mlx5_0 1 7 0000:0000:0000:0000:0000:ffff:0a98:000c 10.152.0.12 v2 net1
mlx5_1 1 4 fe80:0000:0000:0000:9459:9bff:fe54:7704 v1 net2
mlx5_1 1 5 fe80:0000:0000:0000:9459:9bff:fe54:7704 v2 net2
mlx5_1 1 6 0000:0000:0000:0000:0000:ffff:0a98:040c 10.152.4.12 v1 net2
mlx5_1 1 7 0000:0000:0000:0000:0000:ffff:0a98:040c 10.152.4.12 v2 net2
mlx5_2 1 4 fe80:0000:0000:0000:90ea:b5ff:fec5:3f24 v1 net3
mlx5_2 1 5 fe80:0000:0000:0000:90ea:b5ff:fec5:3f24 v2 net3
mlx5_2 1 6 0000:0000:0000:0000:0000:ffff:0a98:080c 10.152.8.12 v1 net3
mlx5_2 1 7 0000:0000:0000:0000:0000:ffff:0a98:080c 10.152.8.12 v2 net3
mlx5_3 1 4 fe80:0000:0000:0000:44aa:80ff:fea7:0c99 v1 net4
mlx5_3 1 5 fe80:0000:0000:0000:44aa:80ff:fea7:0c99 v2 net4
mlx5_3 1 6 0000:0000:0000:0000:0000:ffff:0a98:0c0c 10.152.12.12 v1 net4
mlx5_3 1 7 0000:0000:0000:0000:0000:ffff:0a98:0c0c 10.152.12.12 v2 net4
mlx5_4 1 4 fe80:0000:0000:0000:68be:c8ff:feaa:39b3 v1 net5
mlx5_4 1 5 fe80:0000:0000:0000:68be:c8ff:feaa:39b3 v2 net5
mlx5_4 1 6 0000:0000:0000:0000:0000:ffff:0a98:100c 10.152.16.12 v1 net5
mlx5_4 1 7 0000:0000:0000:0000:0000:ffff:0a98:100c 10.152.16.12 v2 net5
mlx5_5 1 4 fe80:0000:0000:0000:b82d:d1ff:fef4:35fe v1 net6
mlx5_5 1 5 fe80:0000:0000:0000:b82d:d1ff:fef4:35fe v2 net6
mlx5_5 1 6 0000:0000:0000:0000:0000:ffff:0a98:140c 10.152.20.12 v1 net6
mlx5_5 1 7 0000:0000:0000:0000:0000:ffff:0a98:140c 10.152.20.12 v2 net6
mlx5_6 1 4 fe80:0000:0000:0000:4802:daff:fedf:6783 v1 net7
mlx5_6 1 5 fe80:0000:0000:0000:4802:daff:fedf:6783 v2 net7
mlx5_6 1 6 0000:0000:0000:0000:0000:ffff:0a98:180c 10.152.24.12 v1 net7
mlx5_6 1 7 0000:0000:0000:0000:0000:ffff:0a98:180c 10.152.24.12 v2 net7
mlx5_7 1 4 fe80:0000:0000:0000:5034:7aff:fea5:3dea v1 net8
mlx5_7 1 5 fe80:0000:0000:0000:5034:7aff:fea5:3dea v2 net8
mlx5_7 1 6 0000:0000:0000:0000:0000:ffff:0a98:1c0c 10.152.28.12 v1 net8
mlx5_7 1 7 0000:0000:0000:0000:0000:ffff:0a98:1c0c 10.152.28.12 v2 net8
n_gids_found=32
The GID at every other index is 0000:0000:0000:0000:0000:0000:0000:0000. For example, on device mlx5_2:
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gids/0
0000:0000:0000:0000:0000:0000:0000:0000
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gids/1
0000:0000:0000:0000:0000:0000:0000:0000
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gids/2
0000:0000:0000:0000:0000:0000:0000:0000
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gids/3
0000:0000:0000:0000:0000:0000:0000:0000
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gids/4
fe80:0000:0000:0000:90ea:b5ff:fec5:3f24
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gids/5
fe80:0000:0000:0000:90ea:b5ff:fec5:3f24
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gids/6
0000:0000:0000:0000:0000:ffff:0a98:080c
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gids/7
0000:0000:0000:0000:0000:ffff:0a98:080c
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gids/8
0000:0000:0000:0000:0000:0000:0000:0000
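The manual `cat` walk over the gids directory can be automated. The sketch below is illustrative only: `populated_gid_indices` is a hypothetical name, and it assumes a directory laid out like /sys/class/infiniband/$device/ports/$port_num/gids, with one file per index.

```python
from pathlib import Path

ZERO_GID = "0000:0000:0000:0000:0000:0000:0000:0000"


def populated_gid_indices(gids_dir):
    """Return {index: gid} for entries whose GID is not all zeros.

    Hypothetical helper; gids_dir is assumed to hold one file per GID index.
    """
    found = {}
    for entry in Path(gids_dir).iterdir():
        if not entry.name.isdigit():
            continue  # ignore anything that is not a numeric index
        try:
            gid = entry.read_text().strip()
        except OSError:
            continue  # unreadable entry: skip it rather than abort the scan
        if gid and gid != ZERO_GID:
            found[int(entry.name)] = gid
    return dict(sorted(found.items()))


if __name__ == "__main__":
    # Path taken from the example above; adjust device and port as needed.
    print(populated_gid_indices("/sys/class/infiniband/mlx5_2/ports/1/gids"))
```

On the pod shown above, such a scan would be expected to report only indexes 4 through 7 for mlx5_2, matching the show_gids listing.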
If the GID is 0000:0000:0000:0000:0000:0000:0000:0000, its gid_attrs file cannot be read, and the read returns 'Invalid argument'.
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/0
cat: /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/0: Invalid argument
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/1
cat: /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/1: Invalid argument
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/2
cat: /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/2: Invalid argument
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/3
cat: /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/3: Invalid argument
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/4
IB/RoCE v1
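One way to tolerate the failing reads shown above is to treat 'Invalid argument' as "no RoCE version for this index" instead of a fatal error. The sketch below is an assumption-laden illustration, not the NCCL patch: `roce_version` is a hypothetical name, the directory is assumed to look like gid_attrs/types, and the type strings are assumed to be `IB/RoCE v1` and `RoCE v2`.

```python
import errno
from pathlib import Path

# Assumed type strings; "IB/RoCE v1" is the value shown in the output above.
ROCE_VERSIONS = {"IB/RoCE v1": 1, "RoCE v2": 2}


def roce_version(types_dir, index):
    """Return 1 or 2 for a readable RoCE type entry, else None.

    Hypothetical helper; types_dir is assumed to be a directory such as
    /sys/class/infiniband/$device/ports/$port_num/gid_attrs/types.
    """
    try:
        text = (Path(types_dir) / str(index)).read_text().strip()
    except OSError as exc:
        # EINVAL is what the empty GID entries above return; ENOENT covers
        # an index with no file at all.
        if exc.errno in (errno.EINVAL, errno.ENOENT):
            return None
        raise
    return ROCE_VERSIONS.get(text)


if __name__ == "__main__":
    types_dir = "/sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types"
    for idx in range(8):
        print(idx, roce_version(types_dir, idx))
```

Returning `None` keeps "no version" distinct from a real version number, so a caller scanning all indexes can skip the entry explicitly. Any other I/O error is still raised.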
mpirun logs mpirun_logs.txt