Component (openib) errors mlx4_0 device #5810
Comments
What version of MOFED are you running? Our recommendation is to use the UCX PML ( |
MLNX_OFED |
I have the same problem, but I don't know how to find the version number of MLNX_OFED. |
Hello @mvpcaozixiang , You can use |
|
Hello @jladd-mlnx
I ran "ofed_info | grep MLNX_OFED", and the result is "MLNX_OFED_LINUX-4.4-1.0.0.0 (OFED-4.4-1.0.0)". Thanks. |
@jsquyres Do you know if anyone has looked into this particular issue? |
@jladd-mlnx Can you please followup? |
As of 2018-10-11, we also see these same errors on: Command/options: |
Hello all, My message escaped before I completed my thoughts. I wanted to add the log output to make it easier for those who may wish to see the error. The run starts with this complaint: "ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0x27800000002 valid_mas..." It then runs until it reaches 8 processes, where things break down after 1024-byte packets have been sent, and it hangs. The first error there is: "mlx5: sm-node-02: got completion with error: |
My OSU execution does not hang anywhere. It completes but has all the mentioned complaints/errors/device problems. |
Hello all, Majd Dibbiny from Mellanox suggested that we do the following, which worked for us: Add the following line: at around line 1667 in the file then rebuild Open MPI. |
@kaonga that worked for me too! I tried it on 3.1.1 |
@hppritcha This seems like a simple solution. Can you guys make a PR? |
related to #5914 |
@jsquyres Just to jump in. I ran into the same issue. Any Open MPI version 3.0.0 and above fails to initialize openib with this message. The last one working is 2.1.5. @jladd-mlnx I am running MOFED 4.4.2, and I can use the UCX pml. But if I understand correctly, the problem is that openib is used for the rdma osc module, so without it rdma will not work. And openshmem doesn't work without openib either, because osc ucx doesn't work on ConnectX-3. |
@angainor latest UCX master can work with osc/ucx on ConnectX-3 |
Under certain circumstances, ibv_exp_query_device was returning an error due to uninitialized fields in the extended attributes struct. Fixes: open-mpi#5810 Fixes: open-mpi#5914 Signed-off-by: Howard Pritchard <[email protected]>
Under certain circumstances, ibv_exp_query_device was returning an error due to uninitialized fields in the extended attributes struct. Fixes: open-mpi#5810 Fixes: open-mpi#5914 Signed-off-by: Howard Pritchard <[email protected]> (cherry picked from commit 8126779)
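The failure mode named in the commit message (a query call rejecting a struct whose fields were never initialized) can be illustrated with a small stand-alone sketch. Everything here is hypothetical: `demo_device_attr`, `demo_query_device`, and `DEMO_VALID_MASK` are mock stand-ins invented for illustration, not the MOFED `ibv_exp_query_device` API and not the actual line added to Open MPI. The sketch only assumes that the query validates a caller-supplied mask and fails when unexpected bits are set.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins, NOT the real libibverbs/MOFED definitions. */
#define DEMO_VALID_MASK 0x3u        /* bits the mock query understands */

struct demo_device_attr {
    uint32_t comp_mask;             /* caller marks which fields it wants */
    uint32_t max_qp;                /* example output field */
};

/* Mock query: fails if comp_mask carries any bit outside DEMO_VALID_MASK,
 * mimicking an "invalid comp_mask" style error. */
int demo_query_device(struct demo_device_attr *attr)
{
    assert(attr != NULL);
    if (attr->comp_mask & ~DEMO_VALID_MASK) {
        return -1;
    }
    attr->max_qp = 1024;
    return 0;
}

/* The struct holds stale bits (simulated with 0xFF bytes) and the caller
 * only ORs in the bit it cares about, so the stale bits reach the query. */
int demo_query_with_stale_mask(void)
{
    struct demo_device_attr attr;
    memset(&attr, 0xFF, sizeof(attr));  /* stand-in for leftover garbage */
    attr.comp_mask |= 0x1u;
    return demo_query_device(&attr);
}

/* Zeroing the whole struct first guarantees that only the requested
 * bit is set when the query inspects comp_mask. */
int demo_query_zeroed(void)
{
    struct demo_device_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.comp_mask = 0x1u;
    return demo_query_device(&attr);
}
```

In this mock, `demo_query_with_stale_mask()` fails while `demo_query_zeroed()` succeeds. The only difference between the two is the `memset` to zero before the fields are filled in.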
reopen to better track going in to other release branches |
Done merging to release branches |
Maybe the distribution tar ball at https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.3.tar.gz did not get refreshed after this fix was implemented? I downloaded that today, 22 Dec, and compiled and I get the warnings.
It looks like Howard merged the fix on Dec 4, but the date listed for the 3.1.3 tarball on the open-mpi.org site is in Oct. |
As of 1/31/19, the tarball linked above for 3.1.3 still doesn't seem to contain the merged fix (the line added at around line 1667 in the btl_openib_component.c file). |
It was merged in to the v3.1.x branch after 3.1.3 was released -- it's slated for v3.1.4. You can get a v3.1.4 pre-release snapshot here: https://www.open-mpi.org/nightly/v3.1.x/ |
Patch to correct openib error of OpenMPI (see open-mpi/ompi#5810)
Background information
In this experiment, I am using Mellanox devices and drivers to run an MPI application (osu). I keep getting the errors below when I run with openib, since the driver does not load for a specific host. When I remove from the hostfile the particular host that mpirun reports as failing, it proceeds to report another host, then another, and so on.
What version of Open MPI are you using?
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Please describe the system on which you are running
Things are normal for TCP
Things are not normal for openib