Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enabling GVCF file square off when starting from existing RNAseq BAM and GVCF files #3355

Open
WimSpee opened this issue Oct 21, 2020 · 4 comments

Comments

@WimSpee
Copy link
Contributor

WimSpee commented Oct 21, 2020

Version info

  • bcbio version (bcbio_nextgen.py --version): 1.1.5

To Reproduce
Exact bcbio command you have used:

bcbio_nextgen.py {CONFIG_YAML} -t ipython  -n 41 -s sge  -q main.regular -r vf=5G,h_rss=50G'

Your sample configuration file:
Repeated for many RNAseq samples.

 'rna-seq variant calling': {
        'algorithm': {
            'aligner': False,
            'strandedness': 'unstranded',
            'transcript_assembler': 'stringtie',
            'variantcaller': False,
            'jointcaller': 'gatk-haplotype-joint',
            'mark_duplicates': False,
            'nomap_split_targets': 1000,
            'ploidy': {'default': 2},
            'realign': False,
            'recalibrate': False,
            'tools_off': ['gemini'],
        },
        'analysis': 'RNA-seq',
        'description': '  SAMPLE_NAME',
        'files': ['input_bam/SAMPLE_NAME-ready.bam'],
        'genome_build': 'REFERENCE_NAME',
        'lane': 'SAMPLE_NAME',
        'metadata': {'batch': 'BATCH_NAME'},
        'vrn_file': 'input_gvcf/SAMPLE_NAME.vcf.gz',
    },

Observed behavior
Per sample:

  • the provided BAM and GVCF are found, STAR and GATK haplotype caller are thus skipped as intended.
  • the raw (.counts) and normalized counts (.sf) files are created. Did not yet try to also provide these files already as input. And thus skip salmon and the program that created the raw counts.
  • the GTF files are made. Did not yet try to also provide these files already as input. And thus skip Stringtie.

Over all samples:

  • combined raw counts file (combined.counts) is made.
  • combined normalized counts files are made (combined.fpkm,combined.isoform.fpkm )
  • No squared of multi-sample VCF is made from the provided GVCF files

Expected behavior
The same as the observed behavior. With the important additions of:

  1. the squared off multi-sample VCF file being made
  2. RNAseq specific soft-filtering of the multi-sample VCF.

The squared off multi-sample VCF file is made when starting with existing BAM and GVCF files for whole genome sequencing data. #2336

It would be nice if bcbio could do the same for existing RNAseq GVCF files.

Could you please have a look. Or point me to the direction were I would need to edit the bcbio RNAseq code to run the square off of the GVCF files. And indicate if there are any likely complications that would come up when enabling this.

Thank you.

Wim

@WimSpee
Copy link
Contributor Author

WimSpee commented Oct 26, 2020

@roryk I had a look myself and I am making some progress.

I currently modified the following function as shown below and this results in the input GVCF files being squared off.
https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/pipeline/rnaseq.py#L99

Original

variantcaller = dd.get_variantcaller(to_single_data(samples[0]))
if variantcaller and ("gatk-haplotype" in variantcaller):

New

jointcaller = dd.get_jointcaller(to_single_data(samples[0]))
if jointcaller and 'gatk-haplotype-joint' in jointcaller:

Complete modified function:

def rnaseq_variant_calling(samples, run_parallel):
    """
    run RNA-seq variant calling using GATK
    """
    samples = run_parallel("run_rnaseq_variant_calling", samples)
    #variantcaller = dd.get_variantcaller(to_single_data(samples[0]))
    #if variantcaller and ("gatk-haplotype" in variantcaller):
    print("test1")
    print(to_single_data(samples[0]))
    jointcaller = dd.get_jointcaller(to_single_data(samples[0]))
    print("test2")
    print(jointcaller)
    if jointcaller and 'gatk-haplotype-joint' in jointcaller:
        out = []
        for d in joint.square_off(samples, run_parallel):
            out.extend([[to_single_data(xs)] for xs in multi.split_variants_by_sample(to_single_data(d))])
        samples = out
    #if variantcaller:
    if jointcaller and 'gatk-haplotype-joint' in jointcaller:
        samples = run_parallel("run_rnaseq_ann_filter", samples)
    #if variantcaller and ("gatk-haplotype" in variantcaller):
    if jointcaller and 'gatk-haplotype-joint' in jointcaller:
        out = []
        for data in (to_single_data(xs) for xs in samples):
            if "variants" not in data:
                data["variants"] = []
            data["variants"].append({"variantcaller": "gatk-haplotype", "vcf": data["vrn_file_orig"],
                                     "population": {"vcf": data["vrn_file"]}})
            data["vrn_file"] = data.pop("vrn_file_orig")
            out.append([data])
        samples = out
    return samples

Later the joint RNAseq analsys from GVCF crashes though.
Seems that RNAseq filtering requires a bed file with splice junctions, and this causes a downstream lookup to fail.

[2020-10-26T16:36Z] execution_node1: ipython: run_rnaseq_ann_filter
[2020-10-26T16:36Z] execution_node1: Splice junction BED file not found, skipping filtering of variants closed to splice junctions.
[2020-10-26T16:36Z] execution_node1: Splice junction BED file not found, skipping filtering of variants closed to splice junctions.
[2020-10-26T16:36Z] execution_node1: Splice junction BED file not found, skipping filtering of variants closed to splice junctions.
Sending a shutdown signal to the controller and engines.
2020-10-26 17:36:46.951 [IPClusterStop] Using existing profile dir: 'execution_node1 /joint_analysis/processing/B1F1DD474F23B966E05326C4410A4213/work/log/ipython'
2020-10-26 17:36:46.956 [IPClusterStop] Stopping cluster [pid=58733] with [signal=<Signals.SIGINT: 2>]
2020-10-26 17:36:48.492 [IPClusterStop] Using existing profile dir: 'execution_node1 /joint_analysis/processing/B1F1DD474F23B966E05326C4410A4213/work/log/ipython'
2020-10-26 17:36:48.497 [IPClusterStop] Stopping cluster [pid=58733] with [signal=<Signals.SIGINT: 2>]
Traceback (most recent call last):
  File "/Tools/bcbio/1.1.5/anaconda/bin/bcbio_nextgen.py", line 238, in <module>
    main(**kwargs)
  File "/Tools/bcbio/1.1.5/anaconda/bin/bcbio_nextgen.py", line 46, in main
    run_main(**kwargs)
  File "/Tools/bcbio/1.1.5/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 53, in run_main
    fc_dir, run_info_yaml)
  File "/Tools/bcbio/1.1.5/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 89, in _run_toplevel
    for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
  File "/Tools/bcbio/1.1.5/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 264, in rnaseqpipeline
    samples = rnaseq.rnaseq_variant_calling(samples, run_parallel)
  File "/Tools/bcbio/1.1.5/anaconda/lib/python3.6/site-packages/bcbio/pipeline/rnaseq.py", line 118, in rnaseq_variant_calling
    data["variants"].append({"variantcaller": "gatk-haplotype", "vcf": data["vrn_file_orig"],
KeyError: 'vrn_file_orig'

I'll also have a look at fixing this or working around it. Let me know if you have any tips.

After checking that joint analysis from GVCF works, and normal analysis from FASTQ still works, I am happy to submit a pull request with the changes.

Thank you.

Wim

@WimSpee
Copy link
Contributor Author

WimSpee commented Oct 27, 2020

Edits on 2 more places were needed to allow for setting data["vrn_file_orig"] and for running soft filtering.

data["vrn_file_orig"] = data["vrn_file"]

if variantcaller and ("gatk-haplotype" in variantcaller):

The RNAseq analysis from GVCF now works and finishes.

I did not yet try to pass in a splice site file computed from the reference genome gene model, but I think this should work and be reasonable comparable to using the splice sites identified by STAR.

Will send you a pull request later this week with the few lines that I needed to change.

@WimSpee
Copy link
Contributor Author

WimSpee commented Oct 27, 2020

@roryk See this pull request #3360

WimSpee added a commit to WimSpee/bcbio-nextgen that referenced this issue Oct 28, 2020
WimSpee added a commit to WimSpee/bcbio-nextgen that referenced this issue Oct 28, 2020
@WimSpee
Copy link
Contributor Author

WimSpee commented Oct 28, 2020

Created a simplified pull request in #3361

roryk pushed a commit that referenced this issue Jan 2, 2021
roryk pushed a commit that referenced this issue Jan 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant