feat: Include all information in VariantRecord

EUCANCan · Apr 20, 2022 · 894763e · 894763e
1 parent b2cab22
commit 894763e
Show file tree

Hide file tree

Showing 7 changed files with 163 additions and 34 deletions.
diff --git a/README.md b/README.md
@@ -1,5 +1,5 @@
 # VCF variant extractor<!-- omit in toc -->
-**Deterministic and standard extractor of indels, SNVs and structural variants (SVs)** from VCF files built under the frame of [EUCANCan](https://eucancan.com/)'s second work package. `variant_extractor` is a Python package (**requires Python version 3.6 or higher**) and provides a set of data structures and functions to extract variants from VCF files in a **deterministic and standard** way while [adding information](#variantrecord) to facilitate afterwards processing. It homogenizes [multiallelic variants](#multiallelic-variants), [MNPs](#snvs), [compound indels](#compound-indels) and [SVs](#svs) (including [imprecise paired breakends](#imprecise-paired-breakends) and [single breakends](#single-breakends)). The package is designed to be used in a pipeline, where the variants are ingested from VCF files and then used in downstream analysis. Check the [available documentation](https://eucancan.github.io/variant-extractor/) for more information.
+**Deterministic and standard extractor of indels, SNVs and structural variants (SVs)** from VCF files built under the frame of [EUCANCan](https://eucancan.com/)'s second work package. `variant_extractor` is a Python package (**requires Python version 3.6 or higher**) and provides a set of data structures and functions to extract variants from VCF files in a **deterministic and standard** way while [adding information](#variantrecord) to facilitate afterwards processing. It homogenizes [multiallelic variants](#multiallelic-variants), [MNPs](#snvs), [compound indels](#compound-indels) and [SVs](#structural-variants) (including [imprecise paired breakends](#imprecise-paired-breakends) and [single breakends](#single-breakends)). The package is designed to be used in a pipeline, where the variants are ingested from VCF files and then used in downstream analysis. Check the [available documentation](https://eucancan.github.io/variant-extractor/) for more information.
 
 While there is somewhat of an agreement on how to label the SNVs and indels variants, this is not the case for the structural variants. In the current scenario, different labeling between variant callers makes comparisons between structural variants difficult. This package provides an unified interface to extract variants (included structural variants) from VCFs generated by different variant callers. Apart from reading the VCF file, the `variant_extractor` **adds a preprocessing layer to homogenize the variants** extracted from the file. This way, the variants can be used in downstream analysis in a consistent way. For more information about the homogenization process, check the [homogenization rules](#homogenization-rules) section.
 
@@ -36,26 +36,28 @@ for variant_record in variants:
     print(f'Found variant of type {variant_record.variant_type.name}: {variant_record.contig}:{variant_record.pos}')
 ```
 
-For a more complete list of examples, check the [examples](./examples/) directory.
+For a more complete list of examples, check the [examples](./examples/) directory. This folder also includes an example of a [script for normalizing VCF files](examples/normalize_vcf.py) following the [homogenization rules](#homogenization-rules).
 
 ## VariantRecord
 The `VariantExtractor.read_vcf()` method returns a list of `VariantRecord`. The `VariantRecord` class is a container for the information contained in a VCF record.
 
-| Property           | Type                          | Description                                                                                  |
-| ------------------ | ----------------------------- | -------------------------------------------------------------------------------------------- |
-| `contig`           | `str`                         | Contig name                                                                                  |
-| `pos`              | `int`                         | Position on the contig                                                                       |
-| `end`              | `int`                         | End position of the variant in the contig (same as `pos` for TRN and SNV)                    |
-| `length`           | `int`                         | Length of the variant                                                                        |
-| `id`               | `str`                         | Record identifier                                                                            |
-| `ref`              | `str`                         | Reference sequence                                                                           |
-| `alt`              | `str`                         | Alternative sequence                                                                         |
-| `qual`             | `Optional[float]`             | Quality score for the assertion made in ALT                                                  |
-| `filter`           | `pysam.VariantRecordFilter`   | Filter                                                                                       |
-| `info`             | `pysam.VariantRecordInfo`     | Dictionary of information fields                                                             |
-| `variant_type`     | [`VariantType`](#varianttype) | Variant type inferred                                                                        |
-| `alt_sv_bracket`   | `Optional[BracketSVRecord]`   | Bracketed SV info, present only for SVs with bracket notation. For example, `G]17:198982]`   |
-| `alt_sv_shorthand` | `Optional[ShorthandSVRecord]` | Shorthand SV info, present only for SVs with shorthand notation. For example, `<DUP:TANDEM>` |
+| Property           | Type                                                    | Description                                                                                                   |
+| ------------------ | ------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------- |
+| `contig`           | `str`                                                   | Contig name                                                                                                   |
+| `pos`              | `int`                                                   | Position on the contig                                                                                        |
+| `end`              | `int`                                                   | End position of the variant in the contig (same as `pos` for TRN and SNV)                                     |
+| `length`           | `int`                                                   | Length of the variant                                                                                         |
+| `id`               | `Optional[str]`                                         | Record identifier                                                                                             |
+| `ref`              | `str`                                                   | Reference sequence                                                                                            |
+| `alt`              | `str`                                                   | Alternative sequence                                                                                          |
+| `qual`             | `Optional[float]`                                       | Quality score for the assertion made in ALT                                                                   |
+| `filter`           | `List[str]`                                             | Filter status. `PASS` if this position has passed all filters. Otherwise, it contains the filters that failed |
+| `info`             | `Dict[str, Any]`                                        | Additional information                                                                                        |
+| `format`           | `List[str]`                                             | Specifies data types and order of the genotype information                                                    |
+| `samples`          | ` Dict[str, Dict[str, Any]]`                            | Genotype information for each sample                                                                          |
+| `variant_type`     | [`VariantType`](#varianttype)                           | Variant type inferred                                                                                         |
+| `alt_sv_bracket`   | `Optional[`[`BracketSVRecord`](#bracketsvrecord)`]`     | Bracketed SV info, present only for SVs with bracket notation. For example, `G]17:198982]`                    |
+| `alt_sv_shorthand` | `Optional[`[`ShorthandSVRecord`](#shorthandsvrecord)`]` | Shorthand SV info, present only for SVs with shorthand notation. For example, `<DUP:TANDEM>`                  |
 
 ### VariantType
 The `VariantType` enum is a container for the information about the type of the variant. For structural variants is inferred **only** from the bracket notation, it does not take into account any `INFO` (fields `SVTYPE` or `EVENTYPE`).

diff --git a/examples/input_test_data/test_paired.vcf b/examples/input_test_data/test_paired.vcf
@@ -32,7 +32,7 @@
 ##INFO=<ID=CLUSTER,Number=.,Type=String,Description="IDs of other clustered breakpoints">
 ##INFO=<ID=EXTRA,Number=.,Type=String,Description="Extra info for testing">
 ##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
-#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	DESC
+#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
 1	100	truthset_snv_1	G	A	.	PASS	EXTRA=DUMMY_SNV
 1	1000	truthset_multiallelic_1	G	C,T	.	PASS	EXTRA=SNV_MULTIALLELIC
 1	1500	truthset_mnp_1	GAA	ACA	.	PASS	EXTRA=DUMMY_MULTI_SNV

diff --git a/examples/normalize_vcf.py b/examples/normalize_vcf.py
@@ -0,0 +1,88 @@
+# Copyright 2022 - Barcelona Supercomputing Center
+# Author: Rodrigo Martín Posada
+# BSC AS IS License
+'''
+Generates a normalized VCF file from a VCF
+Expected usage:
+    $ python normalize.py <vcf_file> <output_vcf_file>
+Use --help for more information.
+'''
+from argparse import ArgumentParser
+
+
+def _extract_header(vcf_file):
+    header = ''
+    with open(vcf_file, 'r') as f:
+        for line in f:
+            if line.startswith('#'):
+                header += line
+            else:
+                break
+    return header
+
+
+def _convert_info_key_value(key, value):
+    if value is None:
+        return key
+    elif isinstance(value, str):
+        return f'{key}={value}'
+    elif hasattr(value, '__len__'):
+        return key+'=' + ','.join([str(v) for v in value])
+    else:
+        return key+'='+str(value)
+
+
+def _convert_sample_value(key, value):
+    if key == 'GT':
+        return '/'.join([str(v) if v is not None else '.' for v in value])
+    elif value is None:
+        return '.'
+    elif isinstance(value, str):
+        return value
+    elif hasattr(value, '__len__'):
+        return ','.join([str(v) for v in value])
+    else:
+        return str(value)
+
+
+def _convert_to_vcf(variant_record):
+    contig = variant_record.contig
+    pos = variant_record.pos
+    id_ = variant_record.id if variant_record.id else '.'
+    ref = variant_record.ref
+    alt = variant_record.alt
+    qual = variant_record.qual if variant_record.qual else '.'
+    filter_ = ";".join(variant_record.filter) if variant_record.filter else '.'
+    info = ";".join([_convert_info_key_value(k, v) for k, v in variant_record.info.items()])
+    format_ = ":".join(variant_record.format)
+    samples_list = [":".join([_convert_sample_value(k, v) for k, v in variant_record.samples[sample_name].items()])
+                    for sample_name in variant_record.samples]
+    samples = "\t".join(samples_list)
+    return f'{contig}\t{pos}\t{id_}\t{ref}\t{alt}\t{qual}\t{filter_}\t{info}\t{format_}\t{samples}\n'
+
+
+if __name__ == '__main__':
+    import os
+    import sys
+    sys.path.insert(0, os.path.abspath(os.path.dirname(__file__)) + '/../src/')
+    from variant_extractor import VariantExtractor
+    from variant_extractor.variants import VariantType
+
+    # Parse arguments
+    parser = ArgumentParser(description='Generate normalized VCF file from a VCF file')
+    parser.add_argument('vcf_file', help='VCF file')
+    parser.add_argument('output_vcf_file', help='Output VCF file')
+    args = parser.parse_args()
+
+    if not args.vcf_file.endswith('.vcf'):
+        raise ValueError('Input file must be a VCF file')
+
+    # Open output file, write as stream
+    with open(args.output_vcf_file, 'w') as output_vcf:
+        # Write header
+        output_vcf.write(_extract_header(args.vcf_file))
+        print(f'Reading {args.vcf_file}...')
+        # Open input file, read with variant_extractor
+        extractor = VariantExtractor()
+        for variant_record in extractor.read_vcf(args.vcf_file):
+            output_vcf.write(_convert_to_vcf(variant_record))
diff --git a/examples/vcf_to_csv.py b/examples/vcf_to_csv.py
@@ -2,7 +2,7 @@
 # Author: Rodrigo Martín Posada
 # BSC AS IS License
 '''
-Generates a .CSV input from a VCF file
+Generates a CSV file from an input VCF file
 Expected usage:
     $ python vcf_to_csv.py <vcf_file> <output_file>
 Use --help for more information.

diff --git a/src/variant_extractor/VariantExtractor.py b/src/variant_extractor/VariantExtractor.py
@@ -138,9 +138,15 @@ def __handle_bracket_sv(self, vcf_record):
             return
         # Mate breakend found, handle it
         self.__pairs_found += 1
-        if compare_contigs(previous_record.contig, vcf_record.contig) == -1:
+        contig_comparison = compare_contigs(previous_record.contig, vcf_record.contig)
+        if contig_comparison == 0:
+            if previous_record.pos < vcf_record.pos:
+                self.__handle_bracket_individual_sv(previous_record)
+            else:
+                self.__handle_bracket_individual_sv(vcf_record)
+        elif contig_comparison == -1:
             self.__handle_bracket_individual_sv(previous_record)
-        else:
+        elif contig_comparison == 1:
             self.__handle_bracket_individual_sv(vcf_record)
 
     def __handle_bracket_individual_sv(self, vcf_record):
@@ -191,7 +197,8 @@ def __handle_imprecise_sv(self):
             self.__pending_breakends.remove(vcf_record)
 
     def __handle_multiallelic_record(self, rec):
-        for alt in rec.alts:
+        alts = rec.alts
+        for alt in alts:
             # WARNING: This overrides the record
             rec.alts = [alt]
             self.__handle_record(rec)
diff --git a/src/variant_extractor/private/_parser.py b/src/variant_extractor/private/_parser.py
@@ -13,6 +13,31 @@
 STANDARD_RECORD_REGEX = re.compile(r'([.A-Za-z]+)')
 
 
+def _build_filter(rec):
+    return [f for f in rec.filter]
+
+
+def _build_info(rec):
+    info = dict()
+    for key, value in rec.info.items():
+        info[key] = value
+    return info
+
+
+def _build_format(rec):
+    return [f for f in rec.format]
+
+
+def _build_samples(rec):
+    samples = dict()
+    for sample_name in rec.samples:
+        sample_dict = dict()
+        for key, value in rec.samples[sample_name].items():
+            sample_dict[key] = value
+        samples[sample_name] = sample_dict
+    return samples
+
+
 def parse_bracket_sv(rec):
     sv_match_bracket = BRACKET_SV_REGEX.fullmatch(rec.alts[0])
     if not sv_match_bracket:
@@ -52,7 +77,8 @@ def parse_bracket_sv(rec):
 
     # Create new record
     vcf_record = VariantRecord(rec.contig, rec.pos, end_pos, length, rec.id, rec.ref, rec.alts[0],
-                               rec.qual, rec.filter, rec.info, variant_type, alt_sv_bracket, None)
+                               rec.qual, _build_filter(rec), _build_info(rec), _build_format(rec),
+                               _build_samples(rec), variant_type, alt_sv_bracket, None)
     return vcf_record
 
 
@@ -86,11 +112,12 @@ def parse_shorthand_sv(rec):
     elif alt_type == 'CNV':
         variant_type = VariantType.CNV
     else:
-         raise ValueError(f'Unknown variant type: {alt_type}. Skipping:\n{rec}')
+        raise ValueError(f'Unknown variant type: {alt_type}. Skipping:\n{rec}')
 
     # Create new record
     vcf_record = VariantRecord(rec.contig, rec.pos, rec.stop, length, rec.id, rec.ref, rec.alts[0],
-                               rec.qual, rec.filter, rec.info, variant_type, None, alt_sv_shorthand)
+                               rec.qual, _build_filter(rec), _build_info(rec), _build_format(rec),
+                               _build_samples(rec), variant_type, None, alt_sv_shorthand)
     return vcf_record
 
 
@@ -102,7 +129,8 @@ def parse_sgl_sv(rec):
     length = 0
     # Create new record
     vcf_record = VariantRecord(rec.contig, rec.pos, rec.stop, length, rec.id, rec.ref, rec.alts[0],
-                               rec.qual, rec.filter, rec.info, variant_type, None, None)
+                               rec.qual, _build_filter(rec), _build_info(rec), _build_format(rec),
+                               _build_samples(rec), variant_type, None, None)
     return vcf_record
 
 
@@ -121,5 +149,6 @@ def parse_standard_record(rec):
         variant_type = VariantType.DEL
     # Create new record
     vcf_record = VariantRecord(rec.contig, rec.pos, rec.stop, length, rec.id, rec.ref, rec.alts[0],
-                               rec.qual, rec.filter, rec.info, variant_type, None, None)
+                               rec.qual, _build_filter(rec), _build_info(rec), _build_format(rec),
+                               _build_samples(rec), variant_type, None, None)
     return vcf_record
diff --git a/src/variant_extractor/variants.py b/src/variant_extractor/variants.py
@@ -1,9 +1,8 @@
 # Copyright 2022 - Barcelona Supercomputing Center
 # Author: Rodrigo Martín Posada
 # BSC AS IS License
-from typing import NamedTuple, Optional, List
+from typing import NamedTuple, Optional, List, Dict, Any
 from enum import Enum, auto
-import pysam
 
 
 class VariantType(Enum):
@@ -54,18 +53,22 @@ class VariantRecord(NamedTuple):
     """End position of the variant in the contig (same as `pos` for TRN and SNV)"""
     length: int
     """Length of the variant"""
-    id: str
+    id: Optional[str]
     """Record identifier"""
     ref: str
     """Reference sequence"""
     alt: str
     """Alternative sequence"""
     qual: Optional[float]
     """Quality score for the assertion made in ALT"""
-    filter: pysam.VariantRecordFilter
-    """Record filter"""
-    info: pysam.VariantRecordInfo
-    """Dictionary of information fields"""
+    filter: List[str]
+    """Filter status. PASS if this position has passed all filters. Otherwise, it contains the filters that failed"""
+    info: Dict[str, Any]
+    """Additional information"""
+    format: List[str]
+    """Specifies data types and order of the genotype information"""
+    samples: Dict[str, Dict[str, Any]]
+    """Genotype information for each sample"""
     variant_type: VariantType
     """Variant type"""
     alt_sv_bracket: Optional[BracketSVRecord]