FeatureRequest: Support for Smoove/Lumpy population SV calling #2652

Closed
WimSpee opened this issue Jan 28, 2019 · 2 comments

@WimSpee
Contributor

WimSpee commented Jan 28, 2019

Hi,

Would it be possible to add Smoove/Lumpy population SV calling to bcbio?
bcbio currently supports Smoove/Lumpy SV calling for small sets of samples (n < ~40):
https://github.com/bcbio/bcbio-nextgen/blob/2e4c888b4c092572961d30d5f2f5068f7387e043/bcbio/structural/lumpy.py#27

For more than 40 samples, the Smoove GitHub documentation recommends running Smoove in a two-level map-reduce fashion:
https://github.com/brentp/smoove (section population calling )

  1. Single sample SV calling
  2. Concat, sort and merge the single sample SV calling results
  3. Single sample genotyping of all the merged SVs
  4. Paste the single sample results to a square multi-sample table
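
The four steps above can be sketched as a small Python helper that only builds the command lines, without running anything. This is a rough sketch, not how bcbio would implement it: the subcommands and flags are written as I recall them from the population calling section of the Smoove README and should be checked against the installed version, and the function name, directory layout and output file names are my own assumptions.

```python
from pathlib import Path


def smoove_population_plan(bams, fasta, workdir="smoove-pop", cohort="cohort"):
    """Build (but do not run) the commands for the four population-calling steps.

    bams maps sample name -> BAM path. Returns a dict of step -> list of argv lists.
    Flags follow the Smoove README as recalled; verify with `smoove <cmd> -h`.
    """
    work = Path(workdir)
    call_dir = work / "called"        # hypothetical layout
    geno_dir = work / "genotyped"     # hypothetical layout
    samples = sorted(bams)

    # 1. Single sample SV calling: one independent job per sample (the "map").
    call = [
        ["smoove", "call", "--outdir", str(call_dir), "--name", s,
         "--fasta", fasta, "-p", "1", "--genotype", bams[s]]
        for s in samples
    ]
    # Assumed output naming: <name>-smoove.genotyped.vcf.gz
    called = [str(call_dir / f"{s}-smoove.genotyped.vcf.gz") for s in samples]

    # 2. Concat, sort and merge the single sample calls into one site list.
    merge = [["smoove", "merge", "--name", "merged", "-f", fasta,
              "--outdir", str(work), *called]]
    sites = str(work / "merged.sites.vcf.gz")  # assumed output name

    # 3. Genotype every sample at all merged sites (second "map").
    genotype = [
        ["smoove", "genotype", "-x", "-p", "1", "--name", f"{s}-joint",
         "--outdir", str(geno_dir), "--fasta", fasta, "--vcf", sites, bams[s]]
        for s in samples
    ]
    joint = [str(geno_dir / f"{s}-joint-smoove.genotyped.vcf.gz") for s in samples]

    # 4. Paste the per-sample results into one square multi-sample VCF.
    paste = [["smoove", "paste", "--name", cohort, *joint]]

    return {"call": call, "merge": merge, "genotype": genotype, "paste": paste}
```

Steps 1 and 3 are independent per sample, so each argv list could be handed to a cluster scheduler or to `subprocess.run(cmd, check=True)`; steps 2 and 4 are the single reduce jobs.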

From an issue on the Smoove GitHub page, I gather that this should scale up to thousands of WGrS samples. I hope the sensitivity and specificity are still good compared to joint SV calling and genotyping.

Thank you.

@chapmanb
Member

Wim;
Thanks much for the suggestion. This is definitely something we'd like to work on for bigger sample runs, but we haven't had a chance to implement it as it will take some restructuring. One issue is that I'm not sure how best to validate and determine the utility of joint versus single sample (or small related batch) calling. Thinking practically, an alternative we can do right now is to group only related samples during SV calling. Do you have any datasets where we could determine how considering a larger population helps with sensitivity? Thanks again for the discussion.

@WimSpee
Contributor Author

WimSpee commented Jan 30, 2019

One major upside of regularly redoing the Smoove/Lumpy population SV calling (starting from the single-sample Lumpy VCFs) is that it yields an up-to-date square SV table for the expanding sets of samples that we work with.

Simply merging the per-batch Smoove/Lumpy SV VCF files would result in a non-square (i.e. Swiss cheese) SV table, as described here for small variants under batch analysis:
https://gatkforums.broadinstitute.org/gatk/discussion/4150/should-i-analyze-my-samples-alone-or-together
As far as I know, there is no way to get a square population SV table from multiple batch SV tables.
So one major benefit of population SV calling is that it makes a square SV table possible for several hundred to several thousand samples. We just tried this on a first set of a few hundred samples, and within a few hours we had a square table of SVs for all of them. Some SVs of interest are present and genotyped across all samples. This used a few hundred CPUs, 1 CPU per sample, for both the SV calling and the SV genotyping steps.
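
To make the square versus Swiss cheese point concrete, here is a toy Python sketch. The site IDs, sample names and genotypes are made up, and it assumes the same SV gets an identical ID in every batch, which real breakpoints usually will not; it only shows where the holes come from when batch tables are merged.

```python
def merge_batches(batches):
    """Naively merge per-batch genotype tables.

    Each batch maps site -> {sample: genotype}. A sample is only genotyped at
    sites discovered in its own batch; every other cell stays missing ("./.").
    """
    sites = sorted({site for batch in batches for site in batch})
    samples = sorted({s for batch in batches for gts in batch.values() for s in gts})
    table = {site: {s: "./." for s in samples} for site in sites}
    for batch in batches:
        for site, gts in batch.items():
            table[site].update(gts)
    return table


def missing_cells(table):
    """Count the holes (missing genotypes) in a site x sample table."""
    return sum(gt == "./." for row in table.values() for gt in row.values())


# Two hypothetical batches that share one site and each have a private site.
batch1 = {
    "chr1:1000:DEL": {"A": "0/1", "B": "0/0"},
    "chr2:500:DUP": {"A": "1/1", "B": "0/1"},
}
batch2 = {
    "chr1:1000:DEL": {"C": "0/1", "D": "1/1"},
    "chr3:42:INV": {"C": "0/1", "D": "0/0"},
}

table = merge_batches([batch1, batch2])
print(missing_cells(table), "of", len(table) * 4, "cells are missing")
```

In this toy case the merged table has 3 sites and 4 samples, and the two batch-private sites leave 4 of the 12 cells empty. Re-genotyping every sample at the merged site list is what fills those cells and makes the table square.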

We don't have much public 'truth' data that we can use for testing. I am curious about the sensitivity and specificity of population calling versus the all-together-at-once mode. It also makes more sense to me to do this validation in human, since there is probably more 'truth' data there and the validation would have more value for a larger share of bcbio users.

For now we have managed to run Smoove/Lumpy outside of bcbio.
From an efficiency point of view, it makes sense to re-analyze all samples at the same time via bcbio for both small variants and SVs (small variants: GATK4 starting from GVCFs; SVs: Smoove/Lumpy starting from existing single-sample Lumpy VCFs).
