Skip to content

Commit

Permalink
feat: Include all information in VariantRecord
Browse files Browse the repository at this point in the history
  • Loading branch information
Rapsssito committed Apr 20, 2022
1 parent b2cab22 commit 894763e
Show file tree
Hide file tree
Showing 7 changed files with 163 additions and 34 deletions.
36 changes: 19 additions & 17 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# VCF variant extractor<!-- omit in toc -->
**Deterministic and standard extractor of indels, SNVs and structural variants (SVs)** from VCF files built under the frame of [EUCANCan](https://eucancan.com/)'s second work package. `variant_extractor` is a Python package (**requires Python version 3.6 or higher**) and provides a set of data structures and functions to extract variants from VCF files in a **deterministic and standard** way while [adding information](#variantrecord) to facilitate afterwards processing. It homogenizes [multiallelic variants](#multiallelic-variants), [MNPs](#snvs), [compound indels](#compound-indels) and [SVs](#svs) (including [imprecise paired breakends](#imprecise-paired-breakends) and [single breakends](#single-breakends)). The package is designed to be used in a pipeline, where the variants are ingested from VCF files and then used in downstream analysis. Check the [available documentation](https://eucancan.github.io/variant-extractor/) for more information.
**Deterministic and standard extractor of indels, SNVs and structural variants (SVs)** from VCF files built under the frame of [EUCANCan](https://eucancan.com/)'s second work package. `variant_extractor` is a Python package (**requires Python version 3.6 or higher**) and provides a set of data structures and functions to extract variants from VCF files in a **deterministic and standard** way while [adding information](#variantrecord) to facilitate afterwards processing. It homogenizes [multiallelic variants](#multiallelic-variants), [MNPs](#snvs), [compound indels](#compound-indels) and [SVs](#structural-variants) (including [imprecise paired breakends](#imprecise-paired-breakends) and [single breakends](#single-breakends)). The package is designed to be used in a pipeline, where the variants are ingested from VCF files and then used in downstream analysis. Check the [available documentation](https://eucancan.github.io/variant-extractor/) for more information.

While there is somewhat of an agreement on how to label the SNVs and indels variants, this is not the case for the structural variants. In the current scenario, different labeling between variant callers makes comparisons between structural variants difficult. This package provides an unified interface to extract variants (included structural variants) from VCFs generated by different variant callers. Apart from reading the VCF file, the `variant_extractor` **adds a preprocessing layer to homogenize the variants** extracted from the file. This way, the variants can be used in downstream analysis in a consistent way. For more information about the homogenization process, check the [homogenization rules](#homogenization-rules) section.

Expand Down Expand Up @@ -36,26 +36,28 @@ for variant_record in variants:
print(f'Found variant of type {variant_record.variant_type.name}: {variant_record.contig}:{variant_record.pos}')
```

For a more complete list of examples, check the [examples](./examples/) directory.
For a more complete list of examples, check the [examples](./examples/) directory. This folder also includes an example of a [script for normalizing VCF files](examples/normalize_vcf.py) following the [homogenization rules](#homogenization-rules).

## VariantRecord
The `VariantExtractor.read_vcf()` method returns a list of `VariantRecord`. The `VariantRecord` class is a container for the information contained in a VCF record.

| Property | Type | Description |
| ------------------ | ----------------------------- | -------------------------------------------------------------------------------------------- |
| `contig` | `str` | Contig name |
| `pos` | `int` | Position on the contig |
| `end` | `int` | End position of the variant in the contig (same as `pos` for TRN and SNV) |
| `length` | `int` | Length of the variant |
| `id` | `str` | Record identifier |
| `ref` | `str` | Reference sequence |
| `alt` | `str` | Alternative sequence |
| `qual` | `Optional[float]` | Quality score for the assertion made in ALT |
| `filter` | `pysam.VariantRecordFilter` | Filter |
| `info` | `pysam.VariantRecordInfo` | Dictionary of information fields |
| `variant_type` | [`VariantType`](#varianttype) | Variant type inferred |
| `alt_sv_bracket` | `Optional[BracketSVRecord]` | Bracketed SV info, present only for SVs with bracket notation. For example, `G]17:198982]` |
| `alt_sv_shorthand` | `Optional[ShorthandSVRecord]` | Shorthand SV info, present only for SVs with shorthand notation. For example, `<DUP:TANDEM>` |
| Property | Type | Description |
| ------------------ | ------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------- |
| `contig` | `str` | Contig name |
| `pos` | `int` | Position on the contig |
| `end` | `int` | End position of the variant in the contig (same as `pos` for TRN and SNV) |
| `length` | `int` | Length of the variant |
| `id` | `Optional[str]` | Record identifier |
| `ref` | `str` | Reference sequence |
| `alt` | `str` | Alternative sequence |
| `qual` | `Optional[float]` | Quality score for the assertion made in ALT |
| `filter` | `List[str]` | Filter status. `PASS` if this position has passed all filters. Otherwise, it contains the filters that failed |
| `info` | `Dict[str, Any]` | Additional information |
| `format` | `List[str]` | Specifies data types and order of the genotype information |
| `samples` | ` Dict[str, Dict[str, Any]]` | Genotype information for each sample |
| `variant_type` | [`VariantType`](#varianttype) | Variant type inferred |
| `alt_sv_bracket` | `Optional[`[`BracketSVRecord`](#bracketsvrecord)`]` | Bracketed SV info, present only for SVs with bracket notation. For example, `G]17:198982]` |
| `alt_sv_shorthand` | `Optional[`[`ShorthandSVRecord`](#shorthandsvrecord)`]` | Shorthand SV info, present only for SVs with shorthand notation. For example, `<DUP:TANDEM>` |

### VariantType
The `VariantType` enum is a container for the information about the type of the variant. For structural variants is inferred **only** from the bracket notation, it does not take into account any `INFO` (fields `SVTYPE` or `EVENTYPE`).
Expand Down
2 changes: 1 addition & 1 deletion examples/input_test_data/test_paired.vcf
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@
##INFO=<ID=CLUSTER,Number=.,Type=String,Description="IDs of other clustered breakpoints">
##INFO=<ID=EXTRA,Number=.,Type=String,Description="Extra info for testing">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT DESC
#CHROM POS ID REF ALT QUAL FILTER INFO
1 100 truthset_snv_1 G A . PASS EXTRA=DUMMY_SNV
1 1000 truthset_multiallelic_1 G C,T . PASS EXTRA=SNV_MULTIALLELIC
1 1500 truthset_mnp_1 GAA ACA . PASS EXTRA=DUMMY_MULTI_SNV
Expand Down
88 changes: 88 additions & 0 deletions examples/normalize_vcf.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,88 @@
# Copyright 2022 - Barcelona Supercomputing Center
# Author: Rodrigo Martín Posada
# BSC AS IS License
'''
Generates a normalized VCF file from a VCF
Expected usage:
$ python normalize.py <vcf_file> <output_vcf_file>
Use --help for more information.
'''
from argparse import ArgumentParser


def _extract_header(vcf_file):
header = ''
with open(vcf_file, 'r') as f:
for line in f:
if line.startswith('#'):
header += line
else:
break
return header


def _convert_info_key_value(key, value):
if value is None:
return key
elif isinstance(value, str):
return f'{key}={value}'
elif hasattr(value, '__len__'):
return key+'=' + ','.join([str(v) for v in value])
else:
return key+'='+str(value)


def _convert_sample_value(key, value):
if key == 'GT':
return '/'.join([str(v) if v is not None else '.' for v in value])
elif value is None:
return '.'
elif isinstance(value, str):
return value
elif hasattr(value, '__len__'):
return ','.join([str(v) for v in value])
else:
return str(value)


def _convert_to_vcf(variant_record):
contig = variant_record.contig
pos = variant_record.pos
id_ = variant_record.id if variant_record.id else '.'
ref = variant_record.ref
alt = variant_record.alt
qual = variant_record.qual if variant_record.qual else '.'
filter_ = ";".join(variant_record.filter) if variant_record.filter else '.'
info = ";".join([_convert_info_key_value(k, v) for k, v in variant_record.info.items()])
format_ = ":".join(variant_record.format)
samples_list = [":".join([_convert_sample_value(k, v) for k, v in variant_record.samples[sample_name].items()])
for sample_name in variant_record.samples]
samples = "\t".join(samples_list)
return f'{contig}\t{pos}\t{id_}\t{ref}\t{alt}\t{qual}\t{filter_}\t{info}\t{format_}\t{samples}\n'


if __name__ == '__main__':
import os
import sys
sys.path.insert(0, os.path.abspath(os.path.dirname(__file__)) + '/../src/')
from variant_extractor import VariantExtractor
from variant_extractor.variants import VariantType

# Parse arguments
parser = ArgumentParser(description='Generate normalized VCF file from a VCF file')
parser.add_argument('vcf_file', help='VCF file')
parser.add_argument('output_vcf_file', help='Output VCF file')
args = parser.parse_args()

if not args.vcf_file.endswith('.vcf'):
raise ValueError('Input file must be a VCF file')

# Open output file, write as stream
with open(args.output_vcf_file, 'w') as output_vcf:
# Write header
output_vcf.write(_extract_header(args.vcf_file))
print(f'Reading {args.vcf_file}...')
# Open input file, read with variant_extractor
extractor = VariantExtractor()
for variant_record in extractor.read_vcf(args.vcf_file):
output_vcf.write(_convert_to_vcf(variant_record))
2 changes: 1 addition & 1 deletion examples/vcf_to_csv.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
# Author: Rodrigo Martín Posada
# BSC AS IS License
'''
Generates a .CSV input from a VCF file
Generates a CSV file from an input VCF file
Expected usage:
$ python vcf_to_csv.py <vcf_file> <output_file>
Use --help for more information.
Expand Down
13 changes: 10 additions & 3 deletions src/variant_extractor/VariantExtractor.py
Original file line number Diff line number Diff line change
Expand Up @@ -138,9 +138,15 @@ def __handle_bracket_sv(self, vcf_record):
return
# Mate breakend found, handle it
self.__pairs_found += 1
if compare_contigs(previous_record.contig, vcf_record.contig) == -1:
contig_comparison = compare_contigs(previous_record.contig, vcf_record.contig)
if contig_comparison == 0:
if previous_record.pos < vcf_record.pos:
self.__handle_bracket_individual_sv(previous_record)
else:
self.__handle_bracket_individual_sv(vcf_record)
elif contig_comparison == -1:
self.__handle_bracket_individual_sv(previous_record)
else:
elif contig_comparison == 1:
self.__handle_bracket_individual_sv(vcf_record)

def __handle_bracket_individual_sv(self, vcf_record):
Expand Down Expand Up @@ -191,7 +197,8 @@ def __handle_imprecise_sv(self):
self.__pending_breakends.remove(vcf_record)

def __handle_multiallelic_record(self, rec):
for alt in rec.alts:
alts = rec.alts
for alt in alts:
# WARNING: This overrides the record
rec.alts = [alt]
self.__handle_record(rec)
39 changes: 34 additions & 5 deletions src/variant_extractor/private/_parser.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,31 @@
STANDARD_RECORD_REGEX = re.compile(r'([.A-Za-z]+)')


def _build_filter(rec):
return [f for f in rec.filter]


def _build_info(rec):
info = dict()
for key, value in rec.info.items():
info[key] = value
return info


def _build_format(rec):
return [f for f in rec.format]


def _build_samples(rec):
samples = dict()
for sample_name in rec.samples:
sample_dict = dict()
for key, value in rec.samples[sample_name].items():
sample_dict[key] = value
samples[sample_name] = sample_dict
return samples


def parse_bracket_sv(rec):
sv_match_bracket = BRACKET_SV_REGEX.fullmatch(rec.alts[0])
if not sv_match_bracket:
Expand Down Expand Up @@ -52,7 +77,8 @@ def parse_bracket_sv(rec):

# Create new record
vcf_record = VariantRecord(rec.contig, rec.pos, end_pos, length, rec.id, rec.ref, rec.alts[0],
rec.qual, rec.filter, rec.info, variant_type, alt_sv_bracket, None)
rec.qual, _build_filter(rec), _build_info(rec), _build_format(rec),
_build_samples(rec), variant_type, alt_sv_bracket, None)
return vcf_record


Expand Down Expand Up @@ -86,11 +112,12 @@ def parse_shorthand_sv(rec):
elif alt_type == 'CNV':
variant_type = VariantType.CNV
else:
raise ValueError(f'Unknown variant type: {alt_type}. Skipping:\n{rec}')
raise ValueError(f'Unknown variant type: {alt_type}. Skipping:\n{rec}')

# Create new record
vcf_record = VariantRecord(rec.contig, rec.pos, rec.stop, length, rec.id, rec.ref, rec.alts[0],
rec.qual, rec.filter, rec.info, variant_type, None, alt_sv_shorthand)
rec.qual, _build_filter(rec), _build_info(rec), _build_format(rec),
_build_samples(rec), variant_type, None, alt_sv_shorthand)
return vcf_record


Expand All @@ -102,7 +129,8 @@ def parse_sgl_sv(rec):
length = 0
# Create new record
vcf_record = VariantRecord(rec.contig, rec.pos, rec.stop, length, rec.id, rec.ref, rec.alts[0],
rec.qual, rec.filter, rec.info, variant_type, None, None)
rec.qual, _build_filter(rec), _build_info(rec), _build_format(rec),
_build_samples(rec), variant_type, None, None)
return vcf_record


Expand All @@ -121,5 +149,6 @@ def parse_standard_record(rec):
variant_type = VariantType.DEL
# Create new record
vcf_record = VariantRecord(rec.contig, rec.pos, rec.stop, length, rec.id, rec.ref, rec.alts[0],
rec.qual, rec.filter, rec.info, variant_type, None, None)
rec.qual, _build_filter(rec), _build_info(rec), _build_format(rec),
_build_samples(rec), variant_type, None, None)
return vcf_record
17 changes: 10 additions & 7 deletions src/variant_extractor/variants.py
Original file line number Diff line number Diff line change
@@ -1,9 +1,8 @@
# Copyright 2022 - Barcelona Supercomputing Center
# Author: Rodrigo Martín Posada
# BSC AS IS License
from typing import NamedTuple, Optional, List
from typing import NamedTuple, Optional, List, Dict, Any
from enum import Enum, auto
import pysam


class VariantType(Enum):
Expand Down Expand Up @@ -54,18 +53,22 @@ class VariantRecord(NamedTuple):
"""End position of the variant in the contig (same as `pos` for TRN and SNV)"""
length: int
"""Length of the variant"""
id: str
id: Optional[str]
"""Record identifier"""
ref: str
"""Reference sequence"""
alt: str
"""Alternative sequence"""
qual: Optional[float]
"""Quality score for the assertion made in ALT"""
filter: pysam.VariantRecordFilter
"""Record filter"""
info: pysam.VariantRecordInfo
"""Dictionary of information fields"""
filter: List[str]
"""Filter status. PASS if this position has passed all filters. Otherwise, it contains the filters that failed"""
info: Dict[str, Any]
"""Additional information"""
format: List[str]
"""Specifies data types and order of the genotype information"""
samples: Dict[str, Dict[str, Any]]
"""Genotype information for each sample"""
variant_type: VariantType
"""Variant type"""
alt_sv_bracket: Optional[BracketSVRecord]
Expand Down

0 comments on commit 894763e

Please sign in to comment.