Component (openib) errors mlx4_0 device #5810

Closed
abeltre1 opened this issue Sep 29, 2018 · 22 comments

@abeltre1

abeltre1 commented Sep 29, 2018

Background information

In this experiment, I am using Mellanox devices and drivers to run an MPI application (the OSU benchmarks). I keep getting the errors below when I run with openib, since the driver does not load for a specific host. When I remove the host that mpirun reports from the hostfile, the error moves to another host, then another, and so on.

What version of Open MPI are you using?

  • mpirun (Open MPI) 3.1.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

  • from Mellanox OFED software system distribution package

Please describe the system on which you are running

  • Operating system/version: CentOS 7.5
  • Computer hardware: Intel Xeon

Things are normal for TCP

mpirun --mca mtl_mxm_np 0 --mca btl tcp,self --mca plm_rsh_no_tree_spawn 1 --map-by node -hostfile ~/hostfile -np 72 ./osu_alltoallv 
# OSU MPI All-to-Allv Personalized Exchange Latency Test v5.3.2
# Size       Avg Latency(us)
1                      43.25
2                      44.03
4                      44.47
8                      45.48
16                     47.70
32                     53.13
64                     64.61
128                   104.95
256                   178.48
512                   317.46
1024                  607.22
2048                  334.65
4096                  641.85
8192                 1624.47
16384                3257.20
32768                6246.35
65536               12574.22
131072              23627.27
262144              38249.50
524288              72622.47
1048576            139280.63

Things are not normal for openib

mpirun --mca mtl_mxm_np 0 --mca btl self,openib --mca plm_rsh_no_tree_spawn 1 --map-by node -hostfile ~/hostfile -np 24 ./osu_alltoallv 


ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-4][[6526,1],3][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-5][[6526,1],22][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-5][[6526,1],4][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   kube-infi-4
  Local device: mlx4_0
--------------------------------------------------------------------------
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-5][[6526,1],16][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-2][[6526,1],7][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-5][[6526,1],10][btl_openib_component.c:1670:init_one_device] ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-4][[6526,1],15][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-2][[6526,1],1][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-2][[6526,1],13][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-6][[6526,1],17][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
[kube-infi-2][[6526,1],19][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-1][[6526,1],6][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-1][[6526,1],12][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-1][[6526,1],0][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-6][[6526,1],5][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-4][[6526,1],9][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-1][[6526,1],18][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-6][[6526,1],11][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-4][[6526,1],21][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-6][[6526,1],23][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-3][[6526,1],8][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-3][[6526,1],14][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-3][[6526,1],20][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[kube-infi-3][[6526,1],2][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument

# OSU MPI All-to-Allv Personalized Exchange Latency Test v5.3.2
# Size       Avg Latency(us)
1                      11.66
2                      11.40
4                      11.45
8                      11.43
16                     11.43
32                     11.61
64                     12.23
128                    13.16
256                    12.77
512                    14.50
1024                   21.37
2048                   38.99
4096                   69.41
8192                  158.51
16384                 296.76
32768                 566.62
65536                1176.81
131072               2343.93
262144               5253.02
524288              10401.06


@jladd-mlnx
Member

What version of MOFED are you running? Our recommendation is to use the UCX PML (-mca pml ucx). If you have a newer version of MOFED, then UCX will be the default PML in the MOFED packaged OMPI.

@abeltre1
Author

abeltre1 commented Oct 2, 2018

MLNX_OFED
4.4-2.0.7.0

@mvpcaozixiang

I have the same problem, but I don't know how to check the MLNX_OFED version number.
Could you tell me?
Have you solved this problem?
Thanks.

@abeltre1
Author

abeltre1 commented Oct 5, 2018

Hello @mvpcaozixiang ,

You can use ofed_info to collect all the packages installed in relation to MLNX_OFED.
Alternatively, you can simply grep: ofed_info | grep MLNX_OFED.

@abeltre1
Author

abeltre1 commented Oct 5, 2018

MLNX_OFED_LINUX-4.4-2.0.7.0 (OFED-4.4-2.0.7)

@mvpcaozixiang

mvpcaozixiang commented Oct 7, 2018

Hello @jladd-mlnx,
I have the same problem.
I ran one of the Open MPI examples (using "mpirun -np 2 ./hello_c") and got the following errors.

ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[cu01][[15992,1],0][btl_openib_component.c:1648:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xffffffffffffffff valid_mask = 0x1)
[cu01][[15992,1],1][btl_openib_component.c:1648:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   cu01
  Local device: mlx4_0
--------------------------------------------------------------------------
Hello, world, I am 1 of 2, (Open MPI v2.1.5, package: Open MPI root@cu01 Distribution, ident: 2.1.5, repo rev: v2.1.4-8-g697c1e9, Aug 15, 2018, 115)
Hello, world, I am 0 of 2, (Open MPI v2.1.5, package: Open MPI root@cu01 Distribution, ident: 2.1.5, repo rev: v2.1.4-8-g697c1e9, Aug 15, 2018, 115)
[cu01:03285] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init
[cu01:03285] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Also, when I run "ofed_info | grep MLNX_OFED", the result is "MLNX_OFED_LINUX-4.4-1.0.0.0 (OFED-4.4-1.0.0)".

Thanks.

@abeltre1
Author

abeltre1 commented Oct 9, 2018

@jsquyres Do you know if anyone has looked into this particular issue?

@jsquyres
Member

jsquyres commented Oct 9, 2018

@jladd-mlnx Can you please follow up?

@kaonga

kaonga commented Oct 11, 2018

As of 2018-10-11, we also see these same errors on:
OS: CentOS 7.5
Mellanox OFED: MLNX_OFED_LINUX-4.4-2.0.7.0, MLNX_OFED_LINUX-4.4-1.0.0.0, MLNX_OFED_LINUX-4.3-3.0.2.1
OpenMPI: 3.1.2
Intel IMB Benchmarks: 2018 Update 1

Command/options:
mpirun --mca btl openib,self,vader --mca pml ucx

@kaonga

kaonga commented Oct 12, 2018

Hello all,

My message escaped before I completed my thoughts. I wanted to add the log output to make it easier for those who may wish to see the error. The run starts with this complaint:

"ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0x27800000002 valid_mask = 0x1)"

and then runs until it gets to 8 processes where things just break down after 1024 byte packets have been sent and it hangs. The first error there is:

"mlx5: sm-node-02: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000016 00000000 00000000 00000000
00000000 00008a12 0a000248 355849d2
[sm-node-01:5439 :0:5439] rc_verbs_iface.c:63 FATAL: send completion with error: remote invalid request error
[sm-node-02:4036 :0:4036] rc_verbs_iface.c:63 FATAL: send completion with error: remote invalid request error"
openmpi-mpi-part-b-benchmarks-16-50-55_on_11-10-2018.log

@abeltre1
Author

My OSU execution does not hang anywhere. It completes but has all the mentioned complaints/errors/device problems.

@kaonga

kaonga commented Oct 18, 2018

Hello all,

Majd Dibbiny from Mellanox suggested that we do the following, which worked for us:

Add the following line:
memset(&device->ib_exp_dev_attr, 0, sizeof(device->ib_exp_dev_attr));

at around line 1667 in the file
/opt/ext_sw/openmpi-3.1.2/opal/mca/btl/openib/btl_openib_component.c

then rebuild Open MPI.

@abeltre1
Author

@kaonga that worked for me too! I tried it on 3.1.1

@jsquyres
Member

@hppritcha This seems like a simple solution. Can you guys make a PR?

@hppritcha
Member

related to #5914

@angainor

@jsquyres Just to jump in: I ran into the same issue. Any Open MPI version 3.0.0 and above fails to initialize openib with this message. The last one that works is 2.1.5.

@jladd-mlnx I am running MOFED 4.4.2, and I can use the UCX PML. But if I understand correctly, the problem is that openib is used for the rdma osc module, so without it rdma will not work. And openshmem doesn't work without openib either, because osc ucx doesn't work on ConnectX-3.

@yosefe
Contributor

yosefe commented Oct 25, 2018

@angainor latest UCX master can work with osc/ucx on ConnectX-3

@hppritcha hppritcha self-assigned this Oct 31, 2018
hppritcha added a commit to hppritcha/ompi that referenced this issue Oct 31, 2018
Under certain circumstances, ibv_exp_query_device was
returning an error due to uninitialized fields in the
extended attributes struct.

Fixes: open-mpi#5810
Fixes: open-mpi#5914

Signed-off-by: Howard Pritchard <[email protected]>
hppritcha added a commit to hppritcha/ompi that referenced this issue Nov 2, 2018
(cherry picked from commit 8126779)
@hppritcha hppritcha reopened this Nov 2, 2018
@hppritcha
Member

Reopening to better track this going into the other release branches.

hppritcha added a commit to hppritcha/ompi that referenced this issue Nov 6, 2018
(cherry picked from commit 8126779)
hppritcha added a commit to hppritcha/ompi that referenced this issue Nov 6, 2018
(cherry picked from commit 8126779)
hppritcha added a commit to hppritcha/ompi that referenced this issue Nov 6, 2018
(cherry picked from commit 8126779)
bosilca pushed a commit to bosilca/ompi that referenced this issue Dec 3, 2018
@hppritcha
Member

Done merging to release branches

@justbennet

Maybe the distribution tarball at

https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.3.tar.gz

did not get refreshed after this fix was implemented? I downloaded that today, 22 Dec, and compiled and I get the warnings.

ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xd8000000002 valid_mask = 0x1)
[bn01][[37143,17005],0][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0xd8100000002 valid_mask = 0x1)
[bn01][[37143,17005],1][btl_openib_component.c:1670:init_one_device] error obtaining device attributes for mlx4_0 errno says Invalid argument
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   bn01
  Local device: mlx4_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   bn01
  Local device: mlx4_0
--------------------------------------------------------------------------

It looks like Howard merged the fix on Dec 4, but the date listed for the 3.1.3 tarball on the open-mpi.org site is in Oct.

@blairjj

blairjj commented Jan 31, 2019

As of 1/31/19, the tarball linked above for 3.1.3 still doesn't seem to contain the merged fix (the line added around line 1667 of btl_openib_component.c).

@jsquyres
Member

It was merged in to the v3.1.x branch after 3.1.3 was released -- it's slated for v3.1.4.

You can get a v3.1.4 pre-release snapshot here: https://www.open-mpi.org/nightly/v3.1.x/

kraushm added a commit to kraushm/production that referenced this issue Jul 18, 2019
Patch to correct openib error of OpenMPI (see open-mpi/ompi#5810)