IMB testsuite fails when using vader #4260

Closed
nmorey opened this issue Sep 25, 2017 · 15 comments

Comments

@nmorey
Contributor

nmorey commented Sep 25, 2017

Using openmpi 2.1.2 and the Intel MPI Benchmark suite (https://software.intel.com/sites/default/files/managed/76/6c/IMB_2017_Update2.tgz) on x86 systems (multiple SUSE versions), I get this error:

mpirun -np 2  --mca btl vader,self /usr/lib/mpi/gcc/openmpi2/tests/IMB/IMB-MPI1
[snip...]
#-----------------------------------------------------------------------------
# Benchmarking Sendrecv 
# #processes = 2 
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000         0.60         0.60         0.60         0.00
            1         1000         0.59         0.59         0.59         3.38
            2         1000         0.55         0.55         0.55         7.31
            4         1000         0.62         0.62         0.62        12.81
            8         1000         0.64         0.64         0.64        24.94
           16         1000         0.76         0.76         0.76        41.94
           32         1000         0.56         0.56         0.56       114.88
           64         1000         0.64         0.64         0.64       200.60
          128         1000         0.65         0.65         0.65       396.01
          256         1000         1.10         1.10         1.10       463.57
          512         1000         1.49         1.50         1.49       684.71
         1024         1000         1.82         1.82         1.82      1122.39
         2048         1000         2.07         2.07         2.07      1979.64
         4096         1000         2.63         2.63         2.63      3113.39
         8192         1000         2.74         2.74         2.74      5986.21
        16384         1000         4.42         4.42         4.42      7410.65
[portia:25305] *** Process received signal ***
[portia:25305] Signal: Segmentation fault (11)
[portia:25305] Signal code: Address not mapped (1)
[portia:25305] Failing at address: 0x56dc0730
[portia:25305] [ 0] linux-gate.so.1(__kernel_rt_sigreturn+0x0)[0xf77bdf70]
[portia:25305] [ 1] /usr/lib/mpi/gcc/openmpi2/lib/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x133)[0xf5882e33]
[portia:25305] [ 2] /usr/lib/mpi/gcc/openmpi2/lib/openmpi/mca_btl_vader.so(+0x4251)[0xf5883251]
[portia:25305] [ 3] /usr/lib/mpi/gcc/openmpi2/lib/libopen-pal.so.20(opal_progress+0x70)[0xf7377720]
[portia:25305] [ 4] /usr/lib/mpi/gcc/openmpi2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x925)[0xf54571b5]
[portia:25305] [ 5] /usr/lib/mpi/gcc/openmpi2/lib/libmpi.so.20(MPI_Sendrecv+0x299)[0xf77279e9]
[portia:25305] [ 6] /usr/lib/mpi/gcc/openmpi2/tests/IMB/IMB-MPI1(+0xbee9)[0x5663bee9]
[portia:25305] [ 7] /usr/lib/mpi/gcc/openmpi2/tests/IMB/IMB-MPI1(+0x65c8)[0x566365c8]
[portia:25305] [ 8] /usr/lib/mpi/gcc/openmpi2/tests/IMB/IMB-MPI1(+0x1f02)[0x56631f02]
[portia:25305] [ 9] /lib/libc.so.6(__libc_start_main+0xf3)[0xf7511743]
[portia:25305] [10] /usr/lib/mpi/gcc/openmpi2/tests/IMB/IMB-MPI1(+0x1971)[0x56631971]
[portia:25305] *** End of error message ***

With the sm BTL, the same run works fine: mpirun -np 2 --mca btl sm,self /usr/lib/mpi/gcc/openmpi2/tests/IMB/IMB-MPI1

I tried to debug the segfault with gdb, but without success so far.

@bosilca
Member

bosilca commented Sep 25, 2017

The issue seems to be fixed in 2.1.3a1 and in master.

@nmorey
Contributor Author

nmorey commented Sep 26, 2017

@bosilca Do you have a SHA-1? I need to get 2.1.2 working for our next release (2.1.3 will be too late).

@nmorey
Contributor Author

nmorey commented Sep 26, 2017

I see the same issue with the tip of the v2.x branch

@nmorey
Contributor Author

nmorey commented Sep 26, 2017

Note: This does not happen every time. Sometimes it stalls, sometimes it works.

@bosilca
Member

bosilca commented Sep 26, 2017

I used the current head at cb36cf9. I ran the test you mentioned 100 times and couldn't get any segfault (on an x86).
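Since the failure is intermittent (a segfault, a stall, or a clean run), repeating the command and classifying each outcome makes it easier to compare builds. The following is a minimal POSIX sketch, not part of Open MPI or IMB: `run_stats_t` and `run_repeatedly` are hypothetical names, and the timeout relies on a pending `alarm()` surviving `exec`, which only detects a stall if the launched program leaves SIGALRM at its default action.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical tally of outcomes over repeated runs. */
typedef struct {
    int ok;       /* exited with status 0 */
    int failed;   /* non-zero exit status, or the run could not be started */
    int crashed;  /* killed by a signal other than SIGALRM */
    int stalled;  /* killed by SIGALRM, i.e. exceeded the timeout */
} run_stats_t;

/* Run argv[0] with arguments argv `runs` times.
 * timeout_s == 0 disables the timeout (alarm(0) sets no alarm). */
static run_stats_t run_repeatedly(char *const argv[], int runs, unsigned timeout_s)
{
    run_stats_t st = {0, 0, 0, 0};
    assert(argv != NULL && argv[0] != NULL);

    for (int i = 0; i < runs; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            st.failed++;
            continue;
        }
        if (pid == 0) {
            /* Child: restore the default SIGALRM action, arm the timer,
             * then replace the process image. The pending alarm is kept
             * across exec. */
            signal(SIGALRM, SIG_DFL);
            alarm(timeout_s);
            execvp(argv[0], argv);
            _exit(127); /* only reached if exec failed */
        }

        int status = 0;
        if (waitpid(pid, &status, 0) < 0) {
            st.failed++;
        } else if (WIFSIGNALED(status)) {
            if (WTERMSIG(status) == SIGALRM)
                st.stalled++;
            else
                st.crashed++;
        } else if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            st.ok++;
        } else {
            st.failed++;
        }
    }
    return st;
}
```

To repeat the run from this report, pass an argument vector such as `{"mpirun", "-np", "2", "--mca", "btl", "vader,self", "/usr/lib/mpi/gcc/openmpi2/tests/IMB/IMB-MPI1", NULL}` with `runs` set to 100. A launcher like mpirun may report a crashed rank through a non-zero exit status rather than dying from the signal itself, in which case the segfault would be counted under `failed` rather than `crashed`.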

@nmorey
Contributor Author

nmorey commented Sep 26, 2017

I can still see the bug on this SHA. I see it on v3.0.0 too.

@nmorey
Contributor Author

nmorey commented Sep 26, 2017

@bosilca Are you on x86 or x86_64? The 64-bit build works fine; it's the 32-bit version that breaks.

@bosilca
Member

bosilca commented Sep 26, 2017

I haven't built the 32-bit version in ages ...

@nmorey
Contributor Author

nmorey commented Sep 26, 2017

I've never seen the issue on x86_64. The i586 build has probably been broken for a while, but the testsuite I run failed silently due to a glitch in some script.

@nmorey
Contributor Author

nmorey commented Nov 3, 2017

Bump. This is still happening for both v2.x and v3 on x86 (32-bit).

@hjelmn
Member

hjelmn commented Dec 5, 2017

Can confirm this is a bug when running an i386 build. Taking a look now.

@hjelmn
Member

hjelmn commented Dec 5, 2017

This looks like it is due to a missing memory barrier. Testing the fix now.
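For readers unfamiliar with this class of bug, here is a minimal C11 sketch of where a barrier matters in a shared-memory mailbox. It is not vader's fast-box code: `mailbox_t`, `mailbox_send`, and `mailbox_poll` are hypothetical names, and the single-slot layout is an assumption made for illustration. The sender must publish the size field only after the payload is written (release), and the receiver must read the size before touching the payload (acquire). Without that ordering, a poller can act on a header whose payload is not yet visible.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical single-slot mailbox shared by one sender and one receiver. */
typedef struct {
    unsigned char payload[64];
    _Atomic uint32_t size; /* 0 = empty, non-zero = payload is ready */
} mailbox_t;

/* Returns 0 on success, -1 if the slot is busy or len is invalid. */
static int mailbox_send(mailbox_t *box, const void *data, uint32_t len)
{
    assert(box != NULL && data != NULL);
    if (len == 0 || len > sizeof(box->payload))
        return -1;
    /* Acquire pairs with the receiver's release store that frees the slot. */
    if (atomic_load_explicit(&box->size, memory_order_acquire) != 0)
        return -1;

    memcpy(box->payload, data, len);
    /* Release: the payload writes above are ordered before this store, so
     * a receiver that observes the non-zero size also sees the payload. */
    atomic_store_explicit(&box->size, len, memory_order_release);
    return 0;
}

/* Returns the number of bytes copied into out, or 0 if nothing was taken
 * (slot empty, or message larger than cap, in which case it stays queued). */
static uint32_t mailbox_poll(mailbox_t *box, void *out, uint32_t cap)
{
    assert(box != NULL && out != NULL);
    /* Acquire: pairs with the sender's release store of size. */
    uint32_t len = atomic_load_explicit(&box->size, memory_order_acquire);
    if (len == 0 || len > cap)
        return 0;

    memcpy(out, box->payload, len);
    /* Release: the payload reads finish before the slot is marked empty. */
    atomic_store_explicit(&box->size, 0, memory_order_release);
    return len;
}
```

In a real shared-memory transport the two functions would run in different processes mapped onto the same segment; replacing the release/acquire operations with relaxed ones would reintroduce the kind of reordering hazard suspected here.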

hjelmn added a commit to hjelmn/ompi that referenced this issue Dec 5, 2017
There were multiple paths that could lead to a fast box
allocation. One of them made little sense (in-place send) so it has
been removed to allow a rework of the fast-box send function. This
should fix a number of issues with hanging/crashing when using the
vader btl.

References open-mpi#4260, open-mpi#4553.

Signed-off-by: Nathan Hjelm <[email protected]>
@nmorey
Contributor Author

nmorey commented Dec 8, 2017

PR #4569 seems to fix the issue on both openmpi 2 and 3

@nmorey
Contributor Author

nmorey commented Dec 8, 2017

@hjelmn Thanks for that fix

hjelmn added a commit to hjelmn/ompi that referenced this issue Dec 11, 2017
There were multiple paths that could lead to a fast box
allocation. One of them made little sense (in-place send) so it has
been removed to allow a rework of the fast-box send function. This
should fix a number of issues with hanging/crashing when using the
vader btl.

References open-mpi#4260

Signed-off-by: Nathan Hjelm <[email protected]>
hjelmn added a commit that referenced this issue Dec 11, 2017
There were multiple paths that could lead to a fast box
allocation. One of them made little sense (in-place send) so it has
been removed to allow a rework of the fast-box send function. This
should fix a number of issues with hanging/crashing when using the
vader btl.

References #4260

Signed-off-by: Nathan Hjelm <[email protected]>
hjelmn added a commit to hjelmn/ompi that referenced this issue Dec 11, 2017
There were multiple paths that could lead to a fast box
allocation. One of them made little sense (in-place send) so it has
been removed to allow a rework of the fast-box send function. This
should fix a number of issues with hanging/crashing when using the
vader btl.

References open-mpi#4260

Signed-off-by: Nathan Hjelm <[email protected]>
(cherry picked from commit a82f761)
Signed-off-by: Nathan Hjelm <[email protected]>
@jsquyres
Member

Now merged on all branches. Closing.

davideberius pushed a commit to davideberius/ompi that referenced this issue Dec 14, 2017
There were multiple paths that could lead to a fast box
allocation. One of them made little sense (in-place send) so it has
been removed to allow a rework of the fast-box send function. This
should fix a number of issues with hanging/crashing when using the
vader btl.

References open-mpi#4260

Signed-off-by: Nathan Hjelm <[email protected]>