Running analysis from GVCF level with bcbio to get updated results for expanding population of samples #2336
Maybe bcbio could still start from the BAM level but detect whether GVCF files are already in place in some directory, and then skip the computation needed to create them?
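As a rough illustration of the reuse idea, a check like the Python sketch below could decide whether the gVCF step can be skipped. The helper names and the file-naming convention are hypothetical, not bcbio's actual layout:

```python
import os

def existing_gvcf(sample_name, gvcf_dir):
    """Return the path to a pre-existing per-sample gVCF, or None.

    Checks for bgzipped and plain gVCFs using a simple naming
    convention (an assumption; real pipelines may differ).
    """
    for ext in (".g.vcf.gz", ".g.vcf"):
        candidate = os.path.join(gvcf_dir, sample_name + ext)
        if os.path.exists(candidate):
            return candidate
    return None

def gvcf_or_compute(sample_name, gvcf_dir, compute_fn):
    """Reuse an existing gVCF when present, otherwise compute one."""
    found = existing_gvcf(sample_name, gvcf_dir)
    return found if found else compute_fn(sample_name)
```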
Thanks much for the suggestion and for starting this discussion. We've gotten feedback that this would be useful (#1513, #2068) but haven't yet implemented it, so we definitely know it would be great to have. Our target would be a little less aggressive than what you suggest: take in a set of gVCFs and joint call them with the newly calculated samples as inputs. We wouldn't plan to recalculate callable regions. If you had the ability and time to tackle this, the starting point would be to supply the gVCFs as individual samples via the appropriate input option. I'm happy to provide more specific pointers and help to drive this along if it sounds worthwhile. Thanks again.
Hi @chapmanb,

Thank you very much for your answer. Specifying the existing GVCF files works, but soft filtering is not run. This makes the genotype concordance look low compared to the same samples variant called from the FASTQ or BAM level. Could soft filtering also be run by default (or optionally) for variant calling analyses that start exclusively from GVCF files?

I did start the analysis with the existing BAM files as well, to get the mapping and variant calling MultiQC report at the end, and to get the regions to parallelize the GVCF merge on. I did not try to run a mixed analysis with some samples starting from FASTQ and some from GVCF. Below is exactly what I did.

Sample CSV table:
Template YAML file; note that I set
I manually added the required entries myself. I hope that also works, so that I can generate the full YAML below from the combination of the sample CSV file and the template YAML.
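For reference, a per-sample entry in a full YAML of this kind might look roughly like the sketch below. This is an illustrative guess at the layout: the key used to point at the pre-computed gVCF (shown here as `vrn_file`) and the other values are assumptions, not confirmed bcbio configuration.

```yaml
details:
  - description: SAMPLE1
    files: [SAMPLE1.bam]
    analysis: variant2
    genome_build: GRCh37
    algorithm:
      variantcaller: gatk-haplotype
      jointcaller: gatk-haplotype-joint
      # Assumed key pointing at the existing per-sample gVCF:
      vrn_file: gvcfs/SAMPLE1.g.vcf.gz
```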
This analysis indeed skipped the variant calling step and just squared off the existing GVCF files, with the callable regions recalculated from the existing BAM files. I also got the mapping and variant calling (BAM- and VCF-based) MultiQC report. This is what I wanted to have. So the only thing missing is that soft filtering did not run. This is noticeable from the genotype concordance results created with Picard. I compared against the same samples run from the FASTQ level with bcbio. As you can see, the sensitivity of the samples processed from GVCF is 1, but the PPV is only 0.92. This is because soft filtering did not run.
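To make the numbers concrete, sensitivity and PPV fall directly out of the true positive, false positive, and false negative counts. The counts below are illustrative only, not the actual Picard output from this run:

```python
def concordance_metrics(tp, fp, fn):
    """Sensitivity and PPV as genotype concordance tools define them."""
    sensitivity = tp / (tp + fn)  # fraction of truth calls recovered
    ppv = tp / (tp + fp)          # fraction of made calls that are correct
    return sensitivity, ppv

# Illustrative counts: every truth variant is recovered (fn=0, so
# sensitivity is 1), but unfiltered low-quality calls add false
# positives, pulling PPV down.
sens, ppv = concordance_metrics(tp=46000, fp=4000, fn=0)
print(round(sens, 2), round(ppv, 2))  # prints: 1.0 0.92
```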
See also the detailed genotype concordance stats for the first sample. This shows that there are a lot of
Thank you.
Never mind my earlier request here to update the script to create the full YAML; we can create the full YAML ourselves for starting the analysis from GVCF. The remaining piece is just enabling soft filtering. Thank you.
Thanks much for testing this out and for the feedback. This made better progress than I'd expected, and I'm happy that it's almost doing what you need. I'll put adding soft filtering, and improving template support for these (with less urgency), on the to-do list. I appreciate the feedback and suggestions and will update when we have a chance to work on these.
Hi Brad,

Thank you for putting soft filtering for analyses starting from GVCF on the to-do list. I look forward to that functionality being in bcbio. I am already very happy that analysis from GVCF mostly works via bcbio. :) I can wait for the soft filtering, or if needed just run it as an extra manual step after bcbio for now. Thank you again, also for the bcbio software in general.
Hi @chapmanb.

In the past we have typically run analyses with bcbio from the FASTQ or BAM input level. We are happy with the results and with how scalable and reliable bcbio is.
With the availability of GATK4, to get a population-level analysis result it is now much more efficient to reprocess all existing (historical + new) samples from the GVCF level than from the BAM level.
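Outside of bcbio, the underlying GATK4 steps for squaring off an expanded cohort boil down to combining the per-sample gVCFs and joint genotyping them. The sketch below only builds the command lines (the paths are placeholders, not real files); for large cohorts, GenomicsDBImport is GATK's more scalable alternative to CombineGVCFs:

```python
def joint_genotyping_cmds(reference, gvcfs, out_prefix):
    """Build GATK4 command lines to square off per-sample gVCFs.

    Only the historical gVCFs plus the new samples' gVCFs go in;
    no BAM-level work is repeated for the historical samples.
    """
    combined = out_prefix + ".combined.g.vcf.gz"
    combine = ["gatk", "CombineGVCFs", "-R", reference, "-O", combined]
    for gvcf in gvcfs:
        combine += ["-V", gvcf]
    genotype = ["gatk", "GenotypeGVCFs", "-R", reference,
                "-V", combined, "-O", out_prefix + ".vcf.gz"]
    return combine, genotype

combine, genotype = joint_genotyping_cmds(
    "GRCh37.fa", ["old1.g.vcf.gz", "old2.g.vcf.gz", "new1.g.vcf.gz"], "cohort")
print(" ".join(combine))
print(" ".join(genotype))
```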
Starting an analysis from the GATK4 GVCF level is currently not possible in bcbio, as far as I know.
It is also not obvious how the BAM-level QC files would in this case be aggregated so that they are included in the MultiQC report.
Is reprocessing from the GVCF level for an expanding population of samples functionality that you think fits bcbio and would like to cover?
Otherwise I'll have to (re)create this functionality myself outside of bcbio (which I'd rather not do), so that we get the same results for our expanding population as for our new batches:
Most of this looks straightforward enough to me, if I do need to do it.
I was just wondering whether any values are extracted from the squared-off VCF to fill in variables in the soft-filter functions for SNPs and INDELs. This does not look to be the case, but I remember that in the past the full squared-off VCF was scanned for depth values to fill in some variables in the soft-filter functions.
bcbio-nextgen/bcbio/variation/vfilter.py, line 215 (commit 584bdbb)
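As a hypothetical sketch of the kind of depth scan described above (this is not bcbio's actual formula in vfilter.py, and the fraction used is an arbitrary illustration), deriving a low-depth cutoff from the DP values observed in a squared-off VCF might look like:

```python
import statistics

def depth_filter_expression(dp_values, low_frac=0.25):
    """Derive a low-depth soft-filter expression from observed DP values.

    Hypothetical logic: take a fraction of the median depth, with a
    floor of 4, and emit a filter expression string.
    """
    median_dp = statistics.median(dp_values)
    cutoff = max(4, int(median_dp * low_frac))
    return "DP < %d" % cutoff

print(depth_filter_expression([28, 30, 32, 35, 40]))  # prints: DP < 8
```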
Please let me know your thoughts on this.
Thank you.