Enabling GVCF file square off when starting from existing RNAseq BAM and GVCF files #3355

WimSpee · 2020-10-21T09:30:15Z

Version info

bcbio version (bcbio_nextgen.py --version): 1.1.5

To Reproduce
Exact bcbio command you have used:

bcbio_nextgen.py {CONFIG_YAML} -t ipython  -n 41 -s sge  -q main.regular -r vf=5G,h_rss=50G'

Your sample configuration file:
Repeated for many RNAseq samples.

 'rna-seq variant calling': {
        'algorithm': {
            'aligner': False,
            'strandedness': 'unstranded',
            'transcript_assembler': 'stringtie',
            'variantcaller': False,
            'jointcaller': 'gatk-haplotype-joint',
            'mark_duplicates': False,
            'nomap_split_targets': 1000,
            'ploidy': {'default': 2},
            'realign': False,
            'recalibrate': False,
            'tools_off': ['gemini'],
        },
        'analysis': 'RNA-seq',
        'description': '  SAMPLE_NAME',
        'files': ['input_bam/SAMPLE_NAME-ready.bam'],
        'genome_build': 'REFERENCE_NAME',
        'lane': 'SAMPLE_NAME',
        'metadata': {'batch': 'BATCH_NAME'},
        'vrn_file': 'input_gvcf/SAMPLE_NAME.vcf.gz',
    },

Observed behavior
Per sample:

the provided BAM and GVCF are found, STAR and GATK haplotype caller are thus skipped as intended.
the raw (.counts) and normalized counts (.sf) files are created. Did not yet try to also provide these files already as input. And thus skip salmon and the program that created the raw counts.
the GTF files are made. Did not yet try to also provide these files already as input. And thus skip Stringtie.

Over all samples:

combined raw counts file (combined.counts) is made.
combined normalized counts files are made (combined.fpkm,combined.isoform.fpkm )
No squared of multi-sample VCF is made from the provided GVCF files

Expected behavior
The same as the observed behavior. With the important additions of:

the squared off multi-sample VCF file being made
RNAseq specific soft-filtering of the multi-sample VCF.

The squared off multi-sample VCF file is made when starting with existing BAM and GVCF files for whole genome sequencing data. #2336

It would be nice if bcbio could do the same for existing RNAseq GVCF files.

Could you please have a look. Or point me to the direction were I would need to edit the bcbio RNAseq code to run the square off of the GVCF files. And indicate if there are any likely complications that would come up when enabling this.

Thank you.

Wim

The text was updated successfully, but these errors were encountered:

WimSpee · 2020-10-26T16:50:36Z

@roryk I had a look myself and I am making some progress.

I currently modified the following function as shown below and this results in the input GVCF files being squared off.
https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/pipeline/rnaseq.py#L99

Original

variantcaller = dd.get_variantcaller(to_single_data(samples[0]))
if variantcaller and ("gatk-haplotype" in variantcaller):

New

jointcaller = dd.get_jointcaller(to_single_data(samples[0]))
if jointcaller and 'gatk-haplotype-joint' in jointcaller:

Complete modified function:

def rnaseq_variant_calling(samples, run_parallel):
    """
    run RNA-seq variant calling using GATK
    """
    samples = run_parallel("run_rnaseq_variant_calling", samples)
    #variantcaller = dd.get_variantcaller(to_single_data(samples[0]))
    #if variantcaller and ("gatk-haplotype" in variantcaller):
    print("test1")
    print(to_single_data(samples[0]))
    jointcaller = dd.get_jointcaller(to_single_data(samples[0]))
    print("test2")
    print(jointcaller)
    if jointcaller and 'gatk-haplotype-joint' in jointcaller:
        out = []
        for d in joint.square_off(samples, run_parallel):
            out.extend([[to_single_data(xs)] for xs in multi.split_variants_by_sample(to_single_data(d))])
        samples = out
    #if variantcaller:
    if jointcaller and 'gatk-haplotype-joint' in jointcaller:
        samples = run_parallel("run_rnaseq_ann_filter", samples)
    #if variantcaller and ("gatk-haplotype" in variantcaller):
    if jointcaller and 'gatk-haplotype-joint' in jointcaller:
        out = []
        for data in (to_single_data(xs) for xs in samples):
            if "variants" not in data:
                data["variants"] = []
            data["variants"].append({"variantcaller": "gatk-haplotype", "vcf": data["vrn_file_orig"],
                                     "population": {"vcf": data["vrn_file"]}})
            data["vrn_file"] = data.pop("vrn_file_orig")
            out.append([data])
        samples = out
    return samples

Later the joint RNAseq analsys from GVCF crashes though.
Seems that RNAseq filtering requires a bed file with splice junctions, and this causes a downstream lookup to fail.

[2020-10-26T16:36Z] execution_node1: ipython: run_rnaseq_ann_filter
[2020-10-26T16:36Z] execution_node1: Splice junction BED file not found, skipping filtering of variants closed to splice junctions.
[2020-10-26T16:36Z] execution_node1: Splice junction BED file not found, skipping filtering of variants closed to splice junctions.
[2020-10-26T16:36Z] execution_node1: Splice junction BED file not found, skipping filtering of variants closed to splice junctions.
Sending a shutdown signal to the controller and engines.
2020-10-26 17:36:46.951 [IPClusterStop] Using existing profile dir: 'execution_node1 /joint_analysis/processing/B1F1DD474F23B966E05326C4410A4213/work/log/ipython'
2020-10-26 17:36:46.956 [IPClusterStop] Stopping cluster [pid=58733] with [signal=<Signals.SIGINT: 2>]
2020-10-26 17:36:48.492 [IPClusterStop] Using existing profile dir: 'execution_node1 /joint_analysis/processing/B1F1DD474F23B966E05326C4410A4213/work/log/ipython'
2020-10-26 17:36:48.497 [IPClusterStop] Stopping cluster [pid=58733] with [signal=<Signals.SIGINT: 2>]
Traceback (most recent call last):
  File "/Tools/bcbio/1.1.5/anaconda/bin/bcbio_nextgen.py", line 238, in <module>
    main(**kwargs)
  File "/Tools/bcbio/1.1.5/anaconda/bin/bcbio_nextgen.py", line 46, in main
    run_main(**kwargs)
  File "/Tools/bcbio/1.1.5/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 53, in run_main
    fc_dir, run_info_yaml)
  File "/Tools/bcbio/1.1.5/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 89, in _run_toplevel
    for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
  File "/Tools/bcbio/1.1.5/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 264, in rnaseqpipeline
    samples = rnaseq.rnaseq_variant_calling(samples, run_parallel)
  File "/Tools/bcbio/1.1.5/anaconda/lib/python3.6/site-packages/bcbio/pipeline/rnaseq.py", line 118, in rnaseq_variant_calling
    data["variants"].append({"variantcaller": "gatk-haplotype", "vcf": data["vrn_file_orig"],
KeyError: 'vrn_file_orig'

I'll also have a look at fixing this or working around it. Let me know if you have any tips.

After checking that joint analysis from GVCF works, and normal analysis from FASTQ still works, I am happy to submit a pull request with the changes.

Thank you.

Wim

WimSpee · 2020-10-27T15:22:24Z

Edits on 2 more places were needed to allow for setting data["vrn_file_orig"] and for running soft filtering.

bcbio-nextgen/bcbio/variation/joint.py

Line 201 in dd79b71

data["vrn_file_orig"] = data["vrn_file"]

bcbio-nextgen/bcbio/pipeline/rnaseq.py

Line 152 in a2f8fa0

if variantcaller and ("gatk-haplotype" in variantcaller):

The RNAseq analysis from GVCF now works and finishes.

I did not yet try to pass in a splice site file computed from the reference genome gene model, but I think this should work and be reasonable comparable to using the splice sites identified by STAR.

Will send you a pull request later this week with the few lines that I needed to change.

WimSpee · 2020-10-27T15:52:38Z

@roryk See this pull request #3360

See bcbio#3355

WimSpee · 2020-10-28T09:42:37Z

Created a simplified pull request in #3361

See #3355

WimSpee mentioned this issue Oct 27, 2020

3355: Allow for RNAseq analysis form GVCF level #3360

Closed

WimSpee added a commit to WimSpee/bcbio-nextgen that referenced this issue Oct 28, 2020

bcbio#3355 Allow for RNAseq analysis from GVCF level

65bd303

See bcbio#3355

WimSpee added a commit to WimSpee/bcbio-nextgen that referenced this issue Oct 28, 2020

bcbio#3355 Allow for RNAseq analysis from GVCF

7b20b9a

See bcbio#3355

WimSpee mentioned this issue Oct 28, 2020

Allow for RNAseq analysis starting from GVCF level #3361

Closed

roryk pushed a commit that referenced this issue Jan 2, 2021

#3355 Allow for RNAseq analysis from GVCF level

ebf15ab

See #3355

roryk pushed a commit that referenced this issue Jan 2, 2021

#3355 Allow for RNAseq analysis from GVCF

2ad845b

See #3355

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enabling GVCF file square off when starting from existing RNAseq BAM and GVCF files #3355

Enabling GVCF file square off when starting from existing RNAseq BAM and GVCF files #3355

WimSpee commented Oct 21, 2020

WimSpee commented Oct 26, 2020

WimSpee commented Oct 27, 2020

WimSpee commented Oct 27, 2020

WimSpee commented Oct 28, 2020

Enabling GVCF file square off when starting from existing RNAseq BAM and GVCF files #3355

Enabling GVCF file square off when starting from existing RNAseq BAM and GVCF files #3355

Comments

WimSpee commented Oct 21, 2020

WimSpee commented Oct 26, 2020

WimSpee commented Oct 27, 2020

WimSpee commented Oct 27, 2020

WimSpee commented Oct 28, 2020