
abyss-pe 2.1.0 segfault with Open MPI 3.1.0 #236

Closed

jdmontenegro opened this issue Jun 18, 2018 · 18 comments

@jdmontenegro

System

Hi all, I am using ABySS 2.1.0 compiled against openmpi/3.1.0, boost/1.66, and sparsehash/2.0.3 on a CentOS 7 cluster with 1.5 TB of RAM and 128 threads available.

Assembly error

My abyss-pe command line is the following:
abyss-pe name=NewAssembly G=3000000000 s=500 v=-v np=64 k=97 in="reads1.fastq reads2.fastq"
After nine and a half hours of running, I get this error:

[balder-wn05:31600] *** Process received signal ***
[balder-wn05:31600] Signal: Segmentation fault (11)
[balder-wn05:31600] Signal code: Invalid permissions (2)
[balder-wn05:31600] Failing at address: 0x7f618bee27d8
[balder-wn05:31600] [ 0] /usr/lib64/libc.so.6(+0x35270)[0x7f618c2ee270]
[balder-wn05:31600] [ 1] /usr/local/appl/software/openmpi/3.1.0/lib/openmpi/mca_btl_vader.so(+0x429c)[0x7f6180b9829c]
[balder-wn05:31600] [ 2] /usr/local/appl/software/openmpi/3.1.0/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f618bb1324c]
[balder-wn05:31600] [ 3] /usr/local/appl/software/openmpi/3.1.0/lib/libmpi.so.40(PMPI_Request_get_status+0x74)[0x7f618cf8e154]
[balder-wn05:31600] [ 4] ABYSS-P[0x40dcec]
[balder-wn05:31600] [ 5] ABYSS-P[0x40df34]
[balder-wn05:31600] [ 6] ABYSS-P[0x40f414]
[balder-wn05:31600] [ 7] ABYSS-P[0x4148c8]
[balder-wn05:31600] [ 8] ABYSS-P[0x4169d2]
[balder-wn05:31600] [ 9] ABYSS-P[0x40600a]
[balder-wn05:31600] [10] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f618c2dac05]
[balder-wn05:31600] [11] ABYSS-P[0x40766f]
[balder-wn05:31600] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 46 with PID 31600 on node balder-wn05 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
make: *** [/data/Bioinfo/bioinfo-proj-jmontenegro/DENOVO/Dunnart/Results/Assembly/Abyss/dunnart_abyss-1.fa] Error 139

The total number of bases sequenced was 160 Gbp for a 3 Gbp diploid genome (~50X sequencing depth).

I am using the SLURM scheduler and asking for 1 TB of memory and 64 CPUs (64 tasks and 1 CPU per task) for this assembly. I can see that each process is using around 8.5 GB, so 64 * 8.5 = 544 GB. That is roughly half the memory allocated for this job. The system administrator is looking into the details of the failure, but so far I cannot find a way around this. I have tried reducing the number of processes to 32 and to 16, and the error is the same.
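For reference, my job request looks roughly like this (a sketch; the sbatch values mirror the description above, and the partition and walltime directives are omitted):

#!/bin/bash
#SBATCH --ntasks=64        # one MPI rank per task, matching np=64
#SBATCH --cpus-per-task=1
#SBATCH --mem=1000G        # ~1 TB of memory for the job
abyss-pe name=NewAssembly G=3000000000 s=500 v=-v np=64 k=97 in="reads1.fastq reads2.fastq"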

Any help would be much appreciated.

Kind regards,

@sjackman (Collaborator)

Hi, @jdmontenegro. Thanks for the detailed bug report. You have good timing: I am seeing a nearly identical error message right now and am also troubleshooting it. The only difference is that I'm not seeing Signal code: Invalid permissions (2).

…
0: Hash load: 75929216 / 268435456 = 0.283 using 30.7 GB
24: Hash load: 75952014 / 268435456 = 0.283 using 30.7 GB
36: Hash load: 75667921 / 268435456 = 0.282 using 30.6 GB
8: Hash load: 75928865 / 268435456 = 0.283 using 30.7 GB
[hpce705:162958] *** Process received signal ***
[hpce705:162958] Signal: Segmentation fault (11)
[hpce705:162958] Signal code:  (128)
[hpce705:162958] Failing at address: (nil)
[hpce705:162958] [ 0] /gsc/btl/linuxbrew/lib/libc.so.6(+0x33070)[0x7f7b4c627070]
[hpce705:162958] [ 1] /gsc/btl/linuxbrew/Cellar/open-mpi/3.1.0/lib/openmpi/mca_btl_vader.so(+0x4bde)[0x7f7b40b8fbde]
[hpce705:162958] [ 2] /gsc/btl/linuxbrew/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f7b4be5f60c]
[hpce705:162958] [ 3] /gsc/btl/linuxbrew/lib/libmpi.so.40(PMPI_Request_get_status+0x74)[0x7f7b4d2abb54]
[hpce705:162958] [ 4] ABYSS-P[0x40dcec]
[hpce705:162958] [ 5] ABYSS-P[0x40df24]
[hpce705:162958] [ 6] ABYSS-P[0x40f384]
[hpce705:162958] [ 7] ABYSS-P[0x4148a7]
[hpce705:162958] [ 8] ABYSS-P[0x416c12]
[hpce705:162958] [ 9] ABYSS-P[0x4066aa]
[hpce705:162958] [10] /gsc/btl/linuxbrew/lib/libc.so.6(__libc_start_main+0xf5)[0x7f7b4c614825]
[hpce705:162958] [11] ABYSS-P[0x407d39]
[hpce705:162958] *** End of error message ***

ABySS 2.1.0
Open MPI 3.1.0
libevent 2.1.8
48 processes all running on a single machine.

I'm wondering whether it may be an issue with Open MPI 3.1.0. I'm going to try compiling ABySS against open-mpi 2.1.3. Are you able to try an older version of Open MPI?

@jdmontenegro (Author) commented Jun 19, 2018 via email

@sjackman (Collaborator)

We've had success with Open MPI 1.6.3. I've seen 3.1.0 fail. The vader BTL transport was made the default in 1.8.4, and the previous default, sm, was removed in 3.1.0. You could try 2.1.3, which has both transports, with mpirun --mca btl vader and also with mpirun --mca btl sm, and see whether one, the other, or both work.
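For example, these flags can be passed through abyss-pe's mpirun parameter (a sketch reusing the command from the report above; self is listed alongside the chosen transport, as Open MPI requires when naming BTLs explicitly):

abyss-pe name=NewAssembly G=3000000000 s=500 v=-v np=64 k=97 \
    mpirun='mpirun --mca btl self,sm' \
    in="reads1.fastq reads2.fastq"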

@jdmontenegro (Author) commented Jun 20, 2018 via email

@sjackman (Collaborator) commented Jun 20, 2018

I tried ABySS 2.1.0 with Open MPI 3.1.0 and mpirun --mca btl_vader_eager_limit 8192, with no luck.
See https://github.com/bcgsc/abyss/wiki/ABySS-Users-FAQ#2-my-abyss-assembly-jobs-hang-when-i-run-them-with-high-k-values-eg-k250

@sjackman (Collaborator)

You can also use the ABySS Bloom filter assembler, which does not require Open MPI and reduces the memory requirement by about tenfold.
See https://github.com/bcgsc/abyss#assembling-using-a-bloom-filter-de-bruijn-graph
I'd still very much appreciate your help in troubleshooting Open MPI, though.
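A minimal sketch of a Bloom filter run (the B, H, and kc values here are illustrative assumptions, not tuned recommendations: B is the Bloom filter memory budget, H the number of hash functions, and kc the minimum k-mer multiplicity):

abyss-pe name=NewAssembly v=-v k=97 B=100G H=3 kc=2 in="reads1.fastq reads2.fastq"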

@sjackman (Collaborator) commented Jun 21, 2018

@jdmontenegro Which distribution of Linux, Linux kernel version, and compiler are you using? Did you use a package manager like Conda or Linuxbrew to install ABySS, Open MPI, or any of their dependencies, or did you install manually from source?

❯❯❯ cat /etc/redhat-release 
CentOS Linux release 7.1.1503 (Core) 
❯❯❯ uname -a
Linux hpce705 3.10.0-229.14.1.el7.x86_64 #1 SMP Tue Sep 15 15:05:51 UTC 2015 x86_64 GNU/Linux
❯❯❯ gcc --version
gcc (Homebrew gcc 5.5.0_4) 5.5.0

@jdmontenegro (Author) commented Jun 22, 2018 via email

@sjackman (Collaborator)

I'm running a job with Open MPI 2.1.3 using the sm BTL, and so far it's working. I hope I haven't jinxed it by saying that too early.

abyss-pe mpirun='mpirun --mca btl self,sm --mca btl_sm_eager_limit 8192'

The --mca btl_sm_eager_limit 8192 may not be necessary. I'll try removing it later if this run works.

@jdmontenegro (Author)

Hi Shaun,
ABySS finished the assembly correctly after compiling with openmpi/2.1.3. I did not have to modify the assembly command:
abyss-pe name=NewAssembly G=3000000000 s=500 v=-v np=64 k=97 in="reads1.fastq reads2.fastq"
The assembly looks a bit fragmented, though, so I guess some parameters need to be optimized. I am now trying to scaffold using 10x Chromium linked reads; any suggestions?

Cheers,

@benvvalk (Contributor) commented Jun 26, 2018

@jdmontenegro To do misassembly correction and scaffolding with Chromium reads, you will need to install Tigmint and ARCS, and add the lr parameter to your abyss-pe command.

See this section of the ABySS README.md for further info: https://github.com/bcgsc/abyss#scaffolding-with-linked-reads
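Roughly, the command might look like this (a sketch; barcoded.fq.gz is a placeholder for your barcode-annotated Chromium reads, and the README section above is the authoritative reference):

abyss-pe name=NewAssembly k=97 in="reads1.fastq reads2.fastq" lr="barcoded.fq.gz"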

@sjackman (Collaborator)

That's great news, @jdmontenegro! Open MPI 2.1.3 also worked for me, but I had to use the sm BTL rather than the default vader BTL, which reproducibly segfaulted for me. I'm glad to hear that you got it working!

I'd recommend setting abyss-pe N somewhere in the range 5-20 and trying different values of k to find the value that optimizes NG50, or NGA50 if you have a reference genome.
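A sketch of such a k sweep, comparing contiguity with abyss-fac (the range and step are arbitrary; adjust the input paths to your data):

for k in $(seq 81 8 121); do
    mkdir -p k$k
    # -C runs the assembly inside k$k, so the input paths are relative to that directory
    abyss-pe -C k$k name=NewAssembly np=64 k=$k in="../reads1.fastq ../reads2.fastq"
done
abyss-fac k*/NewAssembly-contigs.fa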

@sjackman (Collaborator)

@benvvalk (Contributor)

Thank you, Shaun!

@sjackman (Collaborator)

I'll report this issue upstream to the Open MPI developers next week when I get a chance.

@sjackman sjackman changed the title abyss-pe 2.1.0 segfault abyss-pe 2.1.0 segfault with Open MPI 3.1.0 Jun 26, 2018
@sjackman sjackman reopened this Jul 4, 2018
@jdwinkler-lanzatech

I'm getting a very similar error message:

 [e763b6973123:00038] Signal: Segmentation fault (11)
 [e763b6973123:00038] Signal code: Address not mapped (1)
 [e763b6973123:00038] Failing at address: 0x7f05b16b5008
 [e763b6973123:00038] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x37840)[0x7f05b0fd6840]
 [e763b6973123:00038] [ 1] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x2936)[0x7f05af8b9936]
 [e763b6973123:00038] [ 2] /usr/lib/x86_64-linux-gnu/libmca_common_dstore.so.1(pmix_common_dstor_init+0x9d3)[0x7f05af8a2733]
 [e763b6973123:00038] [ 3] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x25b4)[0x7f05af8b95b4]
 [e763b6973123:00038] [ 4] /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_gds_base_select+0x12e)[0x7f05af9f346e]
 [e763b6973123:00038] [ 5] /usr/lib/x86_64-linux-gnu/libpmix.so.2(pmix_rte_init+0x8cd)[0x7f05af9ab88d]
 [e763b6973123:00038] [ 6] /usr/lib/x86_64-linux-gnu/libpmix.so.2(PMIx_Init+0xdc)[0x7f05af967d7c]
 [e763b6973123:00038] [ 7] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so(ext2x_client_init+0xc4)[0x7f05afa4afe4]
 [e763b6973123:00038] [ 8] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so(+0x2656)[0x7f05b0274656]
 [e763b6973123:00038] [ 9] /usr/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_init+0x29a)[0x7f05b0ee011a]
 [e763b6973123:00038] [10] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x252)[0x7f05b14cee62]
 [e763b6973123:00038] [11] /usr/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Init+0x6e)[0x7f05b14fd17e]
 [e763b6973123:00038] [12] ABYSS-P(+0x8b73)[0x55e38ea68b73]
 [e763b6973123:00038] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7f05b0fc309b]
 [e763b6973123:00038] [14] ABYSS-P(+0xa6fa)[0x55e38ea6a6fa]
 [e763b6973123:00038] *** End of error message ***

OS: Debian Buster (Docker)
Open MPI: 3.1.3
ABySS version: ffd5e37 (commit hash)

I'm compiling from source within the container.

@sjackman (Collaborator)

I'd suggest using the ABySS Bloom filter assembler, which does not require Open MPI and reduces the memory requirement by about tenfold. See https://github.com/bcgsc/abyss#assembling-using-a-bloom-filter-de-bruijn-graph

stale bot commented Dec 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Dec 25, 2020