out of memory and not producing output #204

Open
dcopetti opened this issue Jan 7, 2022 · 3 comments

Comments

@dcopetti

dcopetti commented Jan 7, 2022

Hello,
I would like to use racon to error-correct chloroplast PacBio CLR reads (a subset of all reads, obtained by aligning all reads to the assembly containing the cp contigs), but I keep running out of memory.
There are about 221,000 reads, totalling 3.7 Gb. I aligned them to themselves
minimap2 -x ava-pb -t 124 -H -X ../subreads.fa ../subreads.fa >SGP5p_cp_ava.paf
obtaining a paf file of 779 GB.
I run racon as follows:
srun -c 100 racon -f -t 100 ../subreads.fa SGP5p_cp_ava.pa ../subreads.fa
on a cluster whose nodes have 1 TB of memory and 256 CPUs (the result is the same with or without Slurm), and the job gets killed after several hours without writing anything.
Thinking there are too many reads and alignments, I made subsets of 100,000 and 50,000 subreads. The former produced a paf file of 186 GB, the latter of only 53 GB.
The job with 100,000 reads got a bit further than the first but still died with this message:

[racon::Polisher::initialize] loaded target sequences 8.052987 s
[racon::Polisher::initialize] loaded sequences 8.733178 s
[racon::Polisher::initialize] loaded overlaps 3075.373771 s
[racon::Polisher::initialize] aligning overlaps [===================>] 72227.005454 s
srun: error: EagI: task 0: Killed
srun: Force Terminated job step 16095.0

I am now running the job with 50,000 sequences. I wonder why it keeps dying and how 1 TB is not sufficient for loading that data.
How can I error correct my reads?
Would it make sense to split the input into, e.g., 5 smaller files and align each to the whole set of reads independently? This may create smaller paf files. Also, I don't think shorter headers would help when loading a 180 GB file still uses 1 TB of memory.
Looking forward to hearing your opinion,
Dario

@rvaser
Collaborator

rvaser commented Jan 11, 2022

Hi Dario,
In error-correction mode, Racon uses all of the overlaps/alignments it finds. I suspect that, because of repetitive regions, memory usage explodes as many reads have too many overlaps. I am not sure what you can do about it other than correcting small batches against all reads (the first 1/n of the reads against all n parts, the second 1/n against all n parts, etc.).

Best regards,
Robert
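
The batching idea above could be prepared with a small script. The following is a minimal sketch, not something posted in the thread: the function names `read_fasta` and `split_round_robin` are hypothetical, and it deals reads out round-robin rather than taking the first 1/n, second 1/n, and so on, which should give batches of similar size without counting the reads first.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_round_robin(records, n_batches):
    """Deal records into n_batches lists, one record at a time.

    Assumption: round-robin is used instead of contiguous 1/n slices.
    """
    batches = [[] for _ in range(n_batches)]
    for i, record in enumerate(records):
        batches[i % n_batches].append(record)
    return batches


def write_fasta(records, path):
    """Write (header, sequence) pairs to path, one sequence per line."""
    with open(path, "w") as handle:
        for header, sequence in records:
            handle.write(f">{header}\n{sequence}\n")
```

Each batch file would then presumably take the place of the target (the last argument) in the racon command above, with the full read set kept as the sequences to align. That is one reading of the suggestion rather than a confirmed recipe.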

@dcopetti
Author

Thanks Robert,
That makes sense. Do you think reducing the number of alignments in the paf file would work? How about removing short alignments (e.g. those covering less than a third of the query length), which should be mostly nonspecific?
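
A length filter like the one proposed here could be sketched as follows. This is an illustration only: the function names and the `min_fraction` parameter are hypothetical, and it assumes the standard PAF column order (column 2 = query length, columns 3 and 4 = query start and end), measuring the alignment by its span on the query.

```python
def keep_overlap(paf_line, min_fraction=1 / 3):
    """Return True if the aligned query span covers at least min_fraction of the query."""
    fields = paf_line.rstrip("\n").split("\t")
    query_len = int(fields[1])    # PAF column 2: query sequence length
    query_start = int(fields[2])  # PAF column 3: query start (0-based)
    query_end = int(fields[3])    # PAF column 4: query end
    return (query_end - query_start) >= min_fraction * query_len


def filter_paf(lines, min_fraction=1 / 3):
    """Lazily yield only the PAF lines that pass keep_overlap."""
    for line in lines:
        if line.strip() and keep_overlap(line, min_fraction):
            yield line
```

Because `filter_paf` is a generator, it can be fed an open file handle and written straight to a new file, so even a paf file of several hundred GB never has to be held in memory.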

@rvaser
Collaborator

rvaser commented Jan 12, 2022

I guess it should help. You could also filter the paf file by keeping only the N longest overlaps per read (e.g. 25?). Unfortunately, for now those filters have to be applied outside Racon.
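
One way to keep only the N longest overlaps per read outside Racon is a streaming pass with a small heap per read. The sketch below is hypothetical rather than part of Racon: `top_n_overlaps` is an invented name, "per read" is taken to mean per query name (PAF column 1), and overlap length is taken from the alignment block length (PAF column 11). Both are assumptions that may need adjusting.

```python
import heapq
from collections import defaultdict


def top_n_overlaps(lines, n=25):
    """Return the n longest overlaps per query name, in input order."""
    # query name -> min-heap of (length, line index, line)
    heaps = defaultdict(list)
    for index, line in enumerate(lines):
        if not line.strip():
            continue
        fields = line.split("\t")
        length = int(fields[10])  # PAF column 11: alignment block length
        heap = heaps[fields[0]]   # PAF column 1: query name
        if len(heap) < n:
            heapq.heappush(heap, (length, index, line))
        elif length > heap[0][0]:
            # Strictly longer than the shortest kept overlap: swap it in.
            heapq.heapreplace(heap, (length, index, line))
    kept = sorted(
        (item for heap in heaps.values() for item in heap),
        key=lambda item: item[1],
    )
    return [line for _, _, line in kept]
```

At most n lines are held per read, so with about 221,000 reads and n = 25 the script would keep a few million lines in memory at most, regardless of the size of the input paf file.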
