Skip to content

Latest commit

 

History

History

oncoliner_assessment

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 

ONCOLINER: Assessment module

ONCOLINER logo

The assessment module compares the discovered variants (SNVs, indels and SVs) from the results of the input pipelines against the validated variants. It is provided as a standalone command line tool to allow for the comparison of a series of (VCF/BCF/VCF.GZ) files generated by any variant callers against a series of (VCF/BCF/VCF.GZ) truth files. If provided in the truth files the module will also provide information about the genes affected by the variants. Check the Functional analysis section for more information.

It uses VariantExtractor under-the-hood for extracting SNVs, indels and structural variants (SVs) from VCF files in a deterministic and standard way. Different variant callers may provice slightly different formatted VCF files, that is why VariantExtractor adds a preprocessing layer to homogenize the variants extracted from the file. For more information about the preprocessing process check VariantExtractor's repository.

It is written in Python 3 (requires Python version 3.6 or higher).

Table of contents

Installation

We recommend using the ONCOLINER container or the provided Dockerfile/Singularity recipe for building the whole ONCOLINER suite to avoid dependency issues.

Functional analysis

The module will try to obtain the genes affected by the variants from the INFO field in the truth files. WARNING: ONCOLINER does not compute genes linked to false positives. ONCOLINER's assessment module is compatible with the following functional analysis tools annotations:

  • ONCOLINER.
  • VEP: Variant Effect Predictor from Ensembl.

Usage

The main executable code is in the src/ folder. There are two executable files: assessment_main.py and assessment_bulk.py. The first one is the main executable file and the second one is a wrapper for the first one that allows to execute the assessment in multiple samples at the same time taking advantage of multiple processors.

There is an example of usage in the example/ folder for each executable file: example/example_main.sh and example/example_bulk.sh.

Note: It is recommended to normalize indels and SNVs before executing the assessment. For this purpose, we recommend using pre.py from Illumina's Haplotype Comparison Tools (hap.py). We provide an standalone and containerized EUCANCan's pre.py wrapper for this purpose.

assessment_main.py

Main executable file. It allows to compare a series of (VCF/BCF/VCF.GZ) files generated by any variant callers against a series of (VCF/BCF/VCF.GZ) truth files for only one sample. It is provided as a standalone command line tool. Example of usage:

python3 -O src/assessment_main.py -t truth.vcf -v test.vcf -o output_

Check the example of usage in example/example_main.sh for more information.

Interface

usage: assessment_main.py [-h] -t TRUTHS [TRUTHS ...] -v TESTS [TESTS ...] -o OUTPUT_PREFIX -f FASTA_REF [-it INDEL_THRESHOLD] [-wr WINDOW_RADIUS] [--sv-size-bins SV_SIZE_BINS [SV_SIZE_BINS ...]]
                         [--contigs CONTIGS [CONTIGS ...]] [--keep-intermediates] [--no-gzip]

ONCOLINER Assessment

options:
  -h, --help            show this help message and exit
  -t TRUTHS [TRUTHS ...], --truths TRUTHS [TRUTHS ...]
                        Path to the VCF truth files
  -v TESTS [TESTS ...], --tests TESTS [TESTS ...]
                        Path to the VCF test files
  -o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
                        Prefix path for the output_prefix VCF files
  -f FASTA_REF, --fasta-ref FASTA_REF
                        Path to reference FASTA file
  -it INDEL_THRESHOLD, --indel-threshold INDEL_THRESHOLD
                        Indel threshold, inclusive (default=100)
  -wr WINDOW_RADIUS, --window-radius WINDOW_RADIUS
                        Window ratio (default=100)
  --sv-size-bins SV_SIZE_BINS [SV_SIZE_BINS ...]
                        SV size bins for the output_prefix metrics (default=[500])
  --contigs CONTIGS [CONTIGS ...]
                        Contigs to process (default=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y'])
  --keep-intermediates  Keep intermediate CSV/VCF files from input VCF files
  --no-gzip             Do not gzip output_prefix VCF files

Output

assessment_main.py outputs a series of files:

  • {OUTPUT_PREFIX}tp.[snv|indel|sv].vcf.gz: VCF files with the true positives (TP) variants. One file per variant type (SNV, indel and SV).
  • {OUTPUT_PREFIX}fp.[snv|indel|sv].vcf.gz: VCF files with the false positives (FP) variants. One file per variant type (SNV, indel and SV).
  • {OUTPUT_PREFIX}fn.[snv|indel|sv].vcf.gz: VCF files with the false negatives (FN) variants. One file per variant type (SNV, indel and SV).
  • {OUTPUT_PREFIX}metrics.csv: CSV file containing the metrics for the comparison of the test and truth VCF files. It contains the following columns:
    • variant_type: variant type, as outputted by VariantExtractor.
    • variant_size: range of variant sizes analyzed for that particular row.
    • window_radius: window radius used for the assessment.
    • recall: Recall. TP / (TP + FN).
    • precision: Precision. TP / (TP + FP).
    • f1_score: F1 score. 2 * (precision * recall) / (precision + recall).
    • tp: Number of true positives.
    • fp: Number of false positives.
    • fn: Number of false negatives.
    • protein_affected_genes_count: Number of genes affected by the variants.
    • protein_affected_driver_genes_count: Number of cancer driver genes affected by the variants.
    • protein_affected_genes: List of genes affected by the variants (separated by ;).
    • protein_affected_driver_genes: List of cancer driver genes affected by the variants (separated by ;).

assessment_bulk.py

Wrapper for assessment_main.py. It allows to compare a series of (VCF/BCF/VCF.GZ) files generated by any variant callers against a series of (VCF/BCF/VCF.GZ) truth files for multiple samples. It takes advantage of multiple processors and is also able to recover from a previous execution (if the execution was interrupted). It is provided as a standalone command line tool. Example of usage:

python3 -O src/assessment_main.py -c config.tsv -o output_

Check the example of usage in example/example_bulk.sh for more information.

Configuration file

The configuration file is a TSV file with the following columns:

  • sample_name: sample name.
  • sample_types: sample types (recall or precision), separated by ,.
  • reference_fasta_path: path to the reference FASTA file.
  • truth_vcf_paths: path(s) to the truth VCF files, separated by ,. They can also be wildcard paths (e.g. truths/*.vcf.gz).
  • example_vcf_paths: path(s) to the test VCF files, separated by ,. They can also be wildcard paths (e.g. tests/*.vcf.gz).
  • bed_mask_paths (optional): path(s) to BED files, separated by ,, describing regions where no False Positive will be computed (they will be skipped). They can also be wildcard paths (e.g. truths/*.bed).

Interface

usage: assessment_bulk.py [-h] -c CONFIG_FILE -o OUTPUT_FOLDER [-it INDEL_THRESHOLD] [-wr WINDOW_RADIUS] [--sv-size-bins SV_SIZE_BINS [SV_SIZE_BINS ...]] [--contigs CONTIGS [CONTIGS ...]] [--keep-intermediates]
                         [--no-gzip] [-p MAX_PROCESSES]

ONCOLINER Assessment Bulk

options:
  -h, --help            show this help message and exit
  -c CONFIG_FILE, --config-file CONFIG_FILE
                        Path to the config TSV file
  -o OUTPUT_FOLDER, --output-folder OUTPUT_FOLDER
                        Path to the output folder
  -it INDEL_THRESHOLD, --indel-threshold INDEL_THRESHOLD
                        Indel threshold, inclusive (default=100)
  -wr WINDOW_RADIUS, --window-radius WINDOW_RADIUS
                        Window radius (default=100)
  --sv-size-bins SV_SIZE_BINS [SV_SIZE_BINS ...]
                        SV size bins for the output_prefix metrics (default=[500])
  --contigs CONTIGS [CONTIGS ...]
                        Contigs to process (default=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y'])
  --keep-intermediates  Keep intermediate CSV/VCF files from input VCF files
  --no-gzip             Do not gzip output_prefix VCF files
  -p MAX_PROCESSES, --max-processes MAX_PROCESSES
                        Maximum number of processes to use (defaults to 1)

Output

assessment_bulk.py outputs the same files as assessment_main.py for each sample. The output files for each sample are stored in the OUTPUT_FOLDER/samples folder in a subfolder named after the sample name.

assessment_bulk.py also outputs a aggregated_metrics.csv file, which aggregates the metrics for all the samples. It contains the same columns as assessment_main.py's metrics.csv file. Recall related metrics are calculated using the recall samples and precision related metrics are calculated using the precision samples (as described in the configuration file).