Running analysis from GVCF level with bcbio to get updated results for expanding population of samples #2336
Maybe bcbio could still start from the BAM level but detect whether GVCF files are already in place in some directory, and then skip the computation needed to create them?
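As a rough illustration of the reuse idea, a check like the Python sketch below could decide whether the gVCF step can be skipped. The helper names and the file-naming convention are hypothetical, not bcbio's actual layout:

```python
import os

def existing_gvcf(sample_name, gvcf_dir):
    """Return the path to a pre-existing per-sample gVCF, or None.

    Checks for bgzipped and plain gVCFs using a simple naming
    convention (an assumption; real pipelines may differ).
    """
    for ext in (".g.vcf.gz", ".g.vcf"):
        candidate = os.path.join(gvcf_dir, sample_name + ext)
        if os.path.exists(candidate):
            return candidate
    return None

def gvcf_or_compute(sample_name, gvcf_dir, compute_fn):
    """Reuse an existing gVCF when present, otherwise compute one."""
    found = existing_gvcf(sample_name, gvcf_dir)
    return found if found else compute_fn(sample_name)
```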
Thanks much for the suggestion and for starting this discussion. We've gotten feedback that this would be useful (#1513, #2068) but haven't yet implemented it, so we definitely know it would be great to have. Our target would be a little less aggressive than what you suggest: take in a set of gVCFs and joint call them with the newly calculated samples as inputs. We wouldn't plan to recalculate callable regions. If you had the ability and time to tackle this, the starting point would be to supply the gVCFs as individual samples via the appropriate input option. I'm happy to provide more specific pointers and help to drive this along if it sounds worthwhile. Thanks again.
Hi @chapmanb,

Thank you very much for your answer. Specifying the existing GVCF files works, but soft filtering is not run. This makes the genotype concordance look low compared to the same samples variant called from the FASTQ or BAM level. Could soft filtering also be run by default (or optionally) for variant calling analyses that start exclusively from GVCF files?

I did start the analysis with the existing BAM files as well, to get the mapping and variant calling MultiQC report at the end, and to get the regions to parallelize the GVCF merge on. I did not try to run a mixed analysis with some samples starting from FASTQ and some from GVCF. Below is exactly what I did.

Sample CSV table:
Template YAML file; note that I set
I manually added the required entries myself. I hope that also works, so that I can generate the full YAML below from the combination of the sample CSV file and the template YAML.
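For reference, a per-sample entry in a full YAML of this kind might look roughly like the sketch below. This is an illustrative guess at the layout: the key used to point at the pre-computed gVCF (shown here as `vrn_file`) and the other values are assumptions, not confirmed bcbio configuration.

```yaml
details:
  - description: SAMPLE1
    files: [SAMPLE1.bam]
    analysis: variant2
    genome_build: GRCh37
    algorithm:
      variantcaller: gatk-haplotype
      jointcaller: gatk-haplotype-joint
      # Assumed key pointing at the existing per-sample gVCF:
      vrn_file: gvcfs/SAMPLE1.g.vcf.gz
```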
This analysis indeed skipped the variant calling step and just squared off the existing GVCF files, with the callable regions recalculated from the existing BAM files. I also got the mapping and variant calling (BAM- and VCF-based) MultiQC report. This is what I wanted to have. So the only thing missing is that soft filtering did not run. This is noticeable from the genotype concordance results created with Picard. I compared against the same samples run from the FASTQ level with bcbio. As you can see, the sensitivity of the samples processed from GVCF is 1, but the PPV is only 0.92. This is because soft filtering did not run.
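To make the numbers concrete, sensitivity and PPV fall directly out of the true positive, false positive, and false negative counts. The counts below are illustrative only, not the actual Picard output from this run:

```python
def concordance_metrics(tp, fp, fn):
    """Sensitivity and PPV as genotype concordance tools define them."""
    sensitivity = tp / (tp + fn)  # fraction of truth calls recovered
    ppv = tp / (tp + fp)          # fraction of made calls that are correct
    return sensitivity, ppv

# Illustrative counts: every truth variant is recovered (fn=0, so
# sensitivity is 1), but unfiltered low-quality calls add false
# positives, pulling PPV down.
sens, ppv = concordance_metrics(tp=46000, fp=4000, fn=0)
print(round(sens, 2), round(ppv, 2))  # prints: 1.0 0.92
```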
See also the detailed genotype concordance stats for the first sample. This shows that there are a lot of
Thank you.
Never mind my earlier request here to update the script to create the full YAML; we can create the full YAML ourselves for starting the analysis from GVCF. The remaining piece is just enabling soft filtering. Thank you.
Thanks much for testing this out and for the feedback. This made better progress than I'd expected, and I'm happy that it's almost doing what you need. I'll put adding soft filtering, and improving template support for these (with less urgency), on the to-do list. I appreciate the feedback and suggestions and will update when we have a chance to work on these.
Hi Brad,

Thank you for putting soft filtering for analyses starting from GVCF on the to-do list. I look forward to that functionality being in bcbio. I am already very happy that analysis from GVCF mostly works via bcbio. :) I can wait for the soft filtering, or if needed just run it as an extra manual step after bcbio for now. Thank you again, also for the bcbio software in general.
Hi @chapmanb.

In the past we have typically run analyses with bcbio from the FASTQ or BAM input level. We are happy with the results and with how scalable and reliable bcbio is.
With the availability of GATK4, to get a population-level analysis result it is now much more efficient to reprocess all existing (historical + new) samples from the GVCF level than from the BAM level.
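Outside of bcbio, the underlying GATK4 steps for squaring off an expanded cohort boil down to combining the per-sample gVCFs and joint genotyping them. The sketch below only builds the command lines (the paths are placeholders, not real files); for large cohorts, GenomicsDBImport is GATK's more scalable alternative to CombineGVCFs:

```python
def joint_genotyping_cmds(reference, gvcfs, out_prefix):
    """Build GATK4 command lines to square off per-sample gVCFs.

    Only the historical gVCFs plus the new samples' gVCFs go in;
    no BAM-level work is repeated for the historical samples.
    """
    combined = out_prefix + ".combined.g.vcf.gz"
    combine = ["gatk", "CombineGVCFs", "-R", reference, "-O", combined]
    for gvcf in gvcfs:
        combine += ["-V", gvcf]
    genotype = ["gatk", "GenotypeGVCFs", "-R", reference,
                "-V", combined, "-O", out_prefix + ".vcf.gz"]
    return combine, genotype

combine, genotype = joint_genotyping_cmds(
    "GRCh37.fa", ["old1.g.vcf.gz", "old2.g.vcf.gz", "new1.g.vcf.gz"], "cohort")
print(" ".join(combine))
print(" ".join(genotype))
```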
Starting an analysis from the GATK4 GVCF level is currently not possible in bcbio, as far as I know.
It is also not obvious how the BAM-level QC files would in this case be aggregated so that they are included in the MultiQC report.
Is reprocessing from the GVCF level for an expanding population of samples functionality that you think fits bcbio and would like to cover?
Otherwise I'll have to (re)create this functionality myself outside of bcbio (which I'd rather not do), so that we get the same results for our expanding population as for our new batches:
Most of this looks straightforward enough to me, if I do need to do it.
I was just wondering whether any values are extracted from the squared-off VCF to fill in variables in the soft-filter functions for SNPs and INDELs. This does not look to be the case, but I remember that in the past the full squared-off VCF was scanned for depth values to fill in some variables in the soft-filter functions.
bcbio-nextgen/bcbio/variation/vfilter.py, line 215 (commit 584bdbb)
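As a hypothetical sketch of the kind of depth scan described above (this is not bcbio's actual formula in vfilter.py, and the fraction used is an arbitrary illustration), deriving a low-depth cutoff from the DP values observed in a squared-off VCF might look like:

```python
import statistics

def depth_filter_expression(dp_values, low_frac=0.25):
    """Derive a low-depth soft-filter expression from observed DP values.

    Hypothetical logic: take a fraction of the median depth, with a
    floor of 4, and emit a filter expression string.
    """
    median_dp = statistics.median(dp_values)
    cutoff = max(4, int(median_dp * low_frac))
    return "DP < %d" % cutoff

print(depth_filter_expression([28, 30, 32, 35, 40]))  # prints: DP < 8
```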
Please let me know your thoughts on this.
Thank you.