Option to use Spades and multithreading for Bowtie2 and Spades #210
Two main changes are proposed here:
How and why is described in more detail in the comments in the code. In short: assume that we have started
`ariba run` with 16 threads, and we have 20 clusters. Toward the end of the Pool.starmap call, a few single-threaded
calls would still be running while the other workers in the pool sit idle because there is nothing left to do. The same
would happen if there were only, say, two clusters to begin with. The proposed change tracks the total number of
remaining clusters through a shared counter, and adaptively increases the number of threads for Bowtie2 and Spades
calls. At any moment, the sum of threads in use is guaranteed never to exceed the total allocated thread count (16 in
our example). It should never result in a longer wall clock time than the original single-threaded implementation.
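The allocation idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `threads_for_call` and `RemainingClusters` are hypothetical, and it uses a `threading.Lock` for brevity, whereas a counter shared between `Pool` worker processes would need a `multiprocessing` primitive instead.

```python
import threading


def threads_for_call(total_threads: int, remaining_clusters: int) -> int:
    """Threads to hand to one Bowtie2/Spades call (hypothetical helper)."""
    if total_threads < 1 or remaining_clusters < 1:
        raise ValueError("both arguments must be >= 1")
    # With more clusters than threads, every call stays single-threaded.
    # Otherwise the budget is split evenly among the unfinished clusters.
    return max(1, total_threads // remaining_clusters)


class RemainingClusters:
    """Lock-protected counter of clusters that have not finished yet."""

    def __init__(self, count: int):
        self._count = count
        self._lock = threading.Lock()

    def get(self) -> int:
        with self._lock:
            return self._count

    def cluster_done(self) -> int:
        # Called once by a worker when its cluster is completely processed.
        with self._lock:
            self._count -= 1
            return self._count


counter = RemainingClusters(20)
print(threads_for_call(16, counter.get()))  # 16 // 20 is 0, clamped to 1
for _ in range(18):
    counter.cluster_done()
print(threads_for_call(16, counter.get()))  # 2 clusters left: 8 threads each
```

Under the assumption that the pool has no more workers than `total_threads`, this keeps the sum within budget. Every running call was sized when at least as many clusters remained as remain now, so each holds at most `total_threads // remaining` threads, and at most `remaining` calls run at once.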
Why Spades is sometimes useful
I have been using Ariba a bit off-label, for extracting consensus sequences of target genes from WGS datasets in
microbial surveillance studies. In my case there is not much interest in the variants reported by Ariba itself, because
we instead look in a separate step at the differences between the consensus sequences across hundreds of isolates,
essentially in an MSA. I like your approach of recruiting reads to multiple alternative references and doing local
de novo assembly. We were able to quickly extract various exotic truncated versions of the target genes that were
otherwise difficult to handle with a pure mapping-based approach. The default fermilite assembler worked fine with WGS
data across many studies and genes. We deployed the tool in our internal Galaxy instance.
Recently, I tried to push this line of Ariba use further and assemble an RSV amplicon. The data came from PCR
amplification of a contiguous chunk that spanned the C-terminal end of the G gene and all of the F gene, followed by
Nextera library construction and MiSeq 300x2 sequencing. RSV comes in two major subtypes, which are further classified
into genotypes, with some genotypes having insertions of about 60 nt in the G gene. The ability to supply alternative
references is quite useful in this case: it allows us to split those samples where co-infection with A and B has
occurred, and immediately gives us the subtype assignment. The reads in that dataset had extremely skewed coverage depth
(often 30,000x at the F end, down to 200x at the G end). That was probably due in part to occasional incorrect primer
binding, but large coverage variations are generally typical for viral amplicon sequencing.
In this challenging dataset, fermilite just could not cope: it would often generate fragmented assemblies, even after
I performed digital normalization to even out the coverage depth of the input reads.
Spades, on the other hand, was able to assemble full-length amplicons (and separate amplicons in A and B mixtures)
directly from the input reads, without digital normalization, when I used the Single Cell mode (`spades.py --sc`).
So, I have re-integrated Spades into Ariba as an optional alternative to fermilite. I am quite sure that there are
going to be other challenging use cases where using a full-blown assembler like Spades will make a decisive difference
in the output quality, at the expense of longer runtimes.
I have deviated in a few places from your original Spades-related code:
- I have added an option to `ariba run` called `--spades_mode` that allows selecting specialized variants of Spades
  such as `--rna` or `--sc`. My code then picks reasonable values for the other Spades options based on the
  `--spades_mode` choice.
- I have renamed your `--spades_other_options` to `--spades_options` in order to reflect the fact that, if this
  argument is provided by the Ariba user, it completely replaces the default Spades options generated based on the
  `--spades_mode` choice.
- ... spacers between the contigs, and my impression was that they would get into the final Ariba output and get
  treated like real sequence. I might be mistaken on this point, though.
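To make the relationship between `--spades_mode` and `--spades_options` concrete, here is a rough sketch of how the Spades command line could be assembled. It is not the code in this PR: the function name, the `wgs` mode label and the per-mode default options are placeholders chosen for illustration. Only the `spades.py` flags themselves (`--sc`, `--rna`, `--only-assembler`, `-t`, `-1`, `-2`, `-o`) are real.

```python
import shlex

# Flag that selects the Spades variant; "wgs" is a hypothetical name
# for plain Spades with no special mode flag.
MODE_FLAGS = {"wgs": [], "sc": ["--sc"], "rna": ["--rna"]}

# Placeholder defaults per mode (assumed, not taken from the PR).
DEFAULT_OPTIONS = {
    "wgs": ["--only-assembler"],
    "sc": ["--only-assembler"],
    "rna": [],
}


def build_spades_command(reads_1, reads_2, outdir,
                         spades_mode="wgs", spades_options=None, threads=1):
    """Return the spades.py argument list for one local assembly."""
    if spades_mode not in MODE_FLAGS:
        raise ValueError(f"unknown spades_mode: {spades_mode!r}")
    cmd = ["spades.py"] + MODE_FLAGS[spades_mode]
    if spades_options is not None:
        # User-supplied options completely replace the mode defaults.
        cmd += shlex.split(spades_options)
    else:
        cmd += DEFAULT_OPTIONS[spades_mode]
    cmd += ["-t", str(threads), "-1", reads_1, "-2", reads_2, "-o", outdir]
    return cmd


print(build_spades_command("r_1.fq", "r_2.fq", "asm", spades_mode="sc", threads=4))
print(build_spades_command("r_1.fq", "r_2.fq", "asm", spades_mode="sc",
                           spades_options="-k 21,33"))
```

The second call shows the replacement semantics: the mode flag `--sc` is still passed, but the default options are dropped in favour of whatever the user supplied.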