RFE: Use germline resources in MuTect2 to reduce artifacts in tumor-only mode #2873

Open
lbeltrame opened this issue Jul 4, 2019 · 16 comments

Comments

@lbeltrame
Contributor

I thought this would be interesting to have, as described in:

https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_mutect_Mutect2.php

However, it needs yet another potentially duplicate file from the GATK resource bundle (and we already have gnomAD). Still, I think the option could be very useful for tumor-only analyses: sites present in the germline resource would be kept from being called (they are prefiltered), which would also reduce runtime.

Thoughts?

@lbeltrame
Contributor Author

The file in question (af-only-gnomad.hg38.vcf.gz) and its b37 counterpart are ~3G in size. I can't tell whether it's worth downloading them or trying to somehow replicate them from the gnomAD data already available.

@roryk
Collaborator

roryk commented Jul 16, 2019

Thanks! I think that would be a good improvement. Can you point me to where you downloaded that file?

@lbeltrame
Contributor Author

lbeltrame commented Jul 17, 2019

It's available at ftp://[email protected]/bundle/Mutect2/ (that is, a specific sublocation of the GATK resource bundle). I'd give the exact link but I can't access the URL right now.

The two files are:

  • af-only-gnomad.hg38.vcf.gz
  • af-only-gnomad.raw.sites.b37.vcf.gz

@lbeltrame
Contributor Author

I wonder if it'd be feasible to actually process our gnomAD files to produce something that is palatable for the GATK. That would avoid maintaining yet another resource.
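As a rough illustration of what such processing could involve, here is a minimal sketch that reduces the INFO column of a VCF data line to AC and AF only. It assumes the input records already carry those two keys; the function name `keep_af_only` is hypothetical and not part of bcbio, and a real implementation would more likely run a tool such as bcftools over the compressed files rather than pure Python.

```python
def keep_af_only(vcf_line, keep=("AC", "AF")):
    """Reduce the INFO column of one VCF data line to the given keys.

    Hypothetical helper for illustration only; header lines are passed
    through unchanged, so unused INFO definitions stay in the header.
    """
    if vcf_line.startswith("#"):
        return vcf_line
    fields = vcf_line.rstrip("\n").split("\t")
    # INFO is the 8th column; entries are separated by ';'
    info = fields[7].split(";")
    kept = [entry for entry in info if entry.split("=", 1)[0] in keep]
    fields[7] = ";".join(kept) if kept else "."
    # Keep only the eight fixed columns (drops FORMAT/sample columns if any)
    return "\t".join(fields[:8]) + "\n"
```

Applied line by line, this would turn a fully annotated record into one shaped like the AF-only files, without touching CHROM, POS, REF, ALT, QUAL or FILTER.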

@roryk
Collaborator

roryk commented Sep 6, 2019

Thanks-- if the Broad's files are kind of small, it might be easier to just use their preprocessed files. Our gnomAD prep takes hours-- @naumenko-sa do you see that as well? I think bcftools annotate in the recipe preparation is probably the culprit.

@lbeltrame
Contributor Author

lbeltrame commented Sep 6, 2019

I haven't yet had the opportunity to test them, but I think they're fairly minimal. A quick bcftools view showed this:

chr1    10067   .       T       TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC      30.35   PASS    AC=3;AF=7.384e-05
chr1    10108   .       CAACCCT C       46514.3 PASS    AC=6;AF=0.0001525
chr1    10109   .       AACCCT  A       89837.3 PASS    AC=48;AF=0.001223
chr1    10114   .       TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTA  CAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTA,T      36729   PASS    AC=9,1;AF=0.0002246,2.496e-05
chr1    10119   .       CT      C       251.23  PASS    AC=5;AF=0.0001249
chr1    10120   .       T       C       14928.7 PASS    AC=10;AF=0.00025
chr1    10128   .       ACCCTAACCCTAACCCTAAC    A       285.71  PASS    AC=3;AF=7.58e-05
chr1    10131   .       CT      C       378.93  PASS    AC=7;AF=0.0001765
chr1    10132   .       TAACCC  T       18025.1 PASS    AC=2;AF=5.049e-05

Hence I guess it uses just AC and AF for each locus.
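To make the structure of those records concrete, here is a minimal sketch that pairs each ALT allele with its AF value, which matters for multi-allelic rows such as the one at position 10114 (two ALT alleles, two AF values). The function name `allele_frequencies` is made up for this example; it assumes tab-separated columns and an AF entry with one value per ALT allele.

```python
def allele_frequencies(vcf_line):
    """Map each ALT allele of a VCF data line to its AF value.

    Illustrative helper, not part of GATK or bcbio.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    alts = fields[4].split(",")
    # Build a key -> value dict from INFO, skipping flag-style entries
    info = dict(
        entry.split("=", 1) for entry in fields[7].split(";") if "=" in entry
    )
    afs = [float(value) for value in info["AF"].split(",")]
    if len(afs) != len(alts):
        raise ValueError("AF must have one value per ALT allele")
    return dict(zip(alts, afs))
```

For the single-allele record at chr1:10120 this yields one entry for `C`; for a two-allele record it yields one entry per allele.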

@roryk
Collaborator

roryk commented Sep 6, 2019

I wonder if we need to do anything at all-- we have those fields in the gnomAD files so we can probably just feed those in directly, if GATK is parsing the file at all reasonably.

@lbeltrame
Contributor Author

Indeed. I'll try to feed one of the gnomAD files I have and see if it's processed correctly.

@roryk
Collaborator

roryk commented Sep 6, 2019

Thanks Luca!

@lbeltrame
Contributor Author

Well, it didn't blow up at least. ;) It might be worth including it. @roryk @chapmanb As I believe gnomAD is optional, this should be done only if it's there. What's the cleanest way to check whether it is present?

@roryk
Collaborator

roryk commented Sep 10, 2019

https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/variation/vcfanno.py#L154 has a function that finds either ExAC or gnomAD, so I'd make a version of that which only checks for gnomad_exome.
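A minimal sketch of that idea, under stated assumptions: the resources are represented here as a plain dict keyed by name, which is a simplification and not bcbio's actual data structure, and both function names are hypothetical. The only external name used is `--germline-resource`, the Mutect2 option discussed in this issue.

```python
import os


def find_gnomad_exome(variation_resources):
    """Return the gnomad_exome path if configured and on disk, else None.

    `variation_resources` is an assumed dict of resource name -> path.
    """
    path = variation_resources.get("gnomad_exome")
    if path and os.path.exists(path):
        return path
    return None


def add_germline_resource(cmd, variation_resources):
    """Return a copy of a Mutect2 command list, adding --germline-resource
    only when the optional gnomAD file is actually available."""
    path = find_gnomad_exome(variation_resources)
    if path:
        return cmd + ["--germline-resource", path]
    return list(cmd)
```

Because gnomAD is optional, the command is returned unchanged when the file is missing, so installations without the resource keep working as before.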

@hliu2016

hliu2016 commented Sep 26, 2019

@roryk
This looks very useful to filter germline variants. Has the --germline-resource option been integrated into bcbio?

@roryk
Collaborator

roryk commented Sep 26, 2019

Hello, I think we are leaning toward not implementing this. I talked with Brad about it and he had a good practical take. The reasoning is that if we filter variants while calling them, they disappear completely and cannot be recovered. If we instead annotate them as being in gnomAD, people can decide later what they want to do with them-- whenever we add filtering, eventually folks will complain they are missing variants they are expecting to see. With annotation, the variants still exist in the output, so if we do filter them later, people can at least see why. The downside is that this will slow down the mutect2 calls, since it will be calling in places it would have skipped.
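The annotate-rather-than-filter approach can be sketched as follows. This is only an illustration: variants are modelled as plain dicts, the gnomAD lookup as a dict keyed by (chrom, pos, ref, alt), and the `max_af` threshold of 0.001 is an arbitrary assumption rather than a value taken from bcbio or GATK.

```python
def annotate_gnomad(variants, gnomad_af, max_af=0.001):
    """Flag variants seen in gnomAD instead of dropping them.

    Every input variant is returned, with two extra keys:
    `gnomad_af` (float or None) and `likely_germline` (bool).
    """
    annotated = []
    for variant in variants:
        key = (variant["chrom"], variant["pos"], variant["ref"], variant["alt"])
        af = gnomad_af.get(key)
        annotated.append(
            dict(
                variant,
                gnomad_af=af,
                likely_germline=af is not None and af > max_af,
            )
        )
    return annotated
```

Nothing is removed at this stage, so a downstream user can still apply a hard filter on `likely_germline`, or choose a different frequency cutoff, and see exactly which calls were affected.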

@naumenko-sa
Contributor

better late than never :) we need this resource for tumor-only mutect2 and purecn

naumenko-sa added a commit that referenced this issue Aug 13, 2020
@waemm

waemm commented Sep 16, 2020

@naumenko-sa, just looking at this, are these commits just adding the af_gnomad file or are you considering implementing the germline resources option for Mutect?

@naumenko-sa
Contributor

yes, I'm pushing germline resources in mutect as well.

@naumenko-sa naumenko-sa reopened this Oct 12, 2020