-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add VEP annotation to all variants #31
Open
NagaComBio
wants to merge
3
commits into
ReleaseBranch_1.2.166
Choose a base branch
from
1.2.166-5-vep_ann
base: ReleaseBranch_1.2.166
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
#!/bin/bash | ||
|
||
## Annotate the variants with VEP | ||
|
||
## To run | ||
## LOCAL: sh annotate_variant_with_VEP.sh snvs_{pid}_somatic_snvs_conf_8_to_10.vcf snvs_{pid}_somatic_snvs_conf_8_to_10.VEP.vcf | ||
|
||
vep_species="homo_sapiens" | ||
vep_assembly="GRCh37" | ||
vep_out_format="vcf" | ||
|
||
input_vcf=${1} | ||
output_vcf=${2} | ||
threads=${VEP_FORKS} | ||
|
||
## Annotate the high confidence variants | ||
## Parse for the functional consequences | ||
${PERL_BINARY} ${VEP_SW_PATH} \ | ||
--input_file $input_vcf \ | ||
--species $vep_species \ | ||
--assembly $vep_assembly \ | ||
--output_file STDOUT \ | ||
--format $vep_out_format \ | ||
--fork $threads \ | ||
--fasta ${VEP_FA_INDEX} \ | ||
--everything \ | ||
--vcf \ | ||
--cache \ | ||
--offline \ | ||
--force_overwrite \ | ||
--no_stats \ | ||
--dir_cache ${VEP_CACHE_BASE} \ | ||
| ${PYTHON_BINARY} ${TOOL_PARSE_VEP} | ${BGZIP_BINARY} -f > $output_vcf |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
147 changes: 147 additions & 0 deletions
147
resources/analysisTools/snvPipeline/parse_VEP_annotations.py
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,147 @@ | ||
""" | ||
parse_VEP_annotations.py | ||
|
||
This script parses VEP annotations from the input and writes the parsed output to STDOUT. | ||
It contains functions to parse VEP format, gene consequences and HGVSc annotations, and format transcript information. | ||
|
||
Usage: | ||
cat VEP_annotated.vcf | python parse_VEP_annotations.py > VEP_annotated_parsed.vcf | ||
|
||
""" | ||
|
||
import sys | ||
|
||
def parse_vep_annotations(): | ||
""" | ||
Parses VEP annotations from the input and writes the parsed output to STDOUT. | ||
|
||
This function reads input from `sys.stdin` line by line and processes each line. | ||
If a line starts with "#" and contains "##INFO=<ID=CSQ", it extracts the VEP format. | ||
If a line starts with "#CHROM", it appends the header for the VEP gene consequence and HGVSc columns. | ||
For all other lines, it splits the line by tab and extracts the info field. | ||
The info field is then split by semicolon and parsed to extract gene consequence and HGVSc information. | ||
Finally, the transcript information is formatted and written to STDOUT. | ||
|
||
Args: | ||
None | ||
|
||
Returns: | ||
None | ||
""" | ||
vep_format = {} | ||
|
||
for line in sys.stdin: | ||
line = line.strip() | ||
gene_consequence_hgvsc_ds = {} | ||
|
||
if line.startswith("#"): | ||
if line.startswith("##INFO=<ID=CSQ"): | ||
vep_format = parse_vep_format(line) | ||
if line.startswith("#CHROM"): | ||
line = line + "\tVEP_TRANSCRIPTS" | ||
# write to STDOUT | ||
sys.stdout.write("{0}\n".format(line)) | ||
else: | ||
variant_infos = line.split("\t") | ||
info = variant_infos[7] | ||
info = info.split(";") | ||
gene_consequence_hgvsc_ds = parse_gene_consequence_hgvsc(info, vep_format, gene_consequence_hgvsc_ds) | ||
line = format_transcript_info(line, gene_consequence_hgvsc_ds) | ||
|
||
# write to STDOUT | ||
sys.stdout.write(line) | ||
|
||
def parse_vep_format(line): | ||
""" | ||
Parses the VEP format line and returns a dictionary mapping each field to its index. | ||
|
||
Args: | ||
line (str): The VEP format line. | ||
|
||
Returns: | ||
dict: A dictionary mapping each field to its index. | ||
""" | ||
vep_format = {} | ||
vep_format_line = line.split("Format: ")[1] | ||
vep_format_line = vep_format_line.split("|") | ||
for i, field in enumerate(vep_format_line): | ||
vep_format[field] = i | ||
return vep_format | ||
|
||
def parse_gene_consequence_hgvsc(info, vep_format, gene_consequence_hgvsc_ds): | ||
""" | ||
Parses gene consequences and HGVSc annotations from VEP annotations. | ||
|
||
Args: | ||
info (list): List of VEP annotations. | ||
vep_format (dict): Dictionary mapping VEP annotation fields to their indices. | ||
gene_consequence_hgvsc_ds (dict): Dictionary to store gene consequences and HGVSc annotations. | ||
|
||
Returns: | ||
dict: Updated gene_consequence_hgvsc_ds dictionary with gene consequences and HGVSc annotations. | ||
|
||
""" | ||
for i in info: | ||
if i.startswith("CSQ="): | ||
i = i.replace("CSQ=", "") | ||
i = i.split(",") | ||
for j in i: | ||
j = j.split("|") | ||
gene_name = j[vep_format["SYMBOL"]] | ||
consequence = j[vep_format["Consequence"]] | ||
transcript = j[vep_format["Feature"]] | ||
hgvsc = j[vep_format["HGVSc"]].split(":")[1] if j[vep_format["HGVSc"]] != "" else "" | ||
hgvsp = j[vep_format["HGVSp"]].split(":")[1] if j[vep_format["HGVSp"]] != "" else "" | ||
exon = j[vep_format["EXON"]].split("/")[0] if j[vep_format["EXON"]] != "" else "" | ||
intron = j[vep_format["INTRON"]].split("/")[0] if j[vep_format["INTRON"]] != "" else "" | ||
|
||
if exon != "" and intron == "": | ||
gene_part = "exon" + exon | ||
elif intron != "" and exon == "": | ||
gene_part = "intron" + intron | ||
else: | ||
gene_part = "exon" + exon + "-intron" + intron | ||
|
||
biotype = j[vep_format["BIOTYPE"]] | ||
impact = j[vep_format["IMPACT"]] | ||
|
||
# Format transcript information with exon number, HGVSc, and HGVSp | ||
tran_exon_hgvsc_p = "{0}:{1}:{2}:{3}".format(transcript, gene_part, hgvsc, hgvsp) | ||
|
||
if biotype == "protein_coding" and impact in ["HIGH", "MODERATE"]: | ||
if gene_name in gene_consequence_hgvsc_ds: | ||
if consequence in gene_consequence_hgvsc_ds[gene_name]: | ||
gene_consequence_hgvsc_ds[gene_name][consequence].append(tran_exon_hgvsc_p) | ||
else: | ||
gene_consequence_hgvsc_ds[gene_name][consequence] = [tran_exon_hgvsc_p] | ||
else: | ||
gene_consequence_hgvsc_ds[gene_name] = {consequence: [tran_exon_hgvsc_p]} | ||
return gene_consequence_hgvsc_ds | ||
|
||
def format_transcript_info(line, gene_consequence_hgvsc_ds): | ||
""" | ||
Formats the transcript information based on the gene, consequence, and HGVSc data. | ||
|
||
Args: | ||
line (str): The input line to be formatted. | ||
gene_consequence_hgvsc_ds (dict): A dictionary containing gene, consequence, and HGVSc data. | ||
|
||
Returns: | ||
str: The formatted line with transcript information. | ||
|
||
""" | ||
if len(gene_consequence_hgvsc_ds) > 0: | ||
transcript_info = "" | ||
for gene in gene_consequence_hgvsc_ds: | ||
for consequence in gene_consequence_hgvsc_ds[gene]: | ||
transcript_info += "{0}|{1}({2});".format(gene, consequence, ','.join(gene_consequence_hgvsc_ds[gene][consequence])) | ||
line = line + "\t" + transcript_info + "\n" | ||
else: | ||
line = line + "\t.\n" | ||
return line | ||
|
||
|
||
if __name__ == "__main__": | ||
|
||
# Parse VEP annotations and write to STDOUT | ||
parse_vep_annotations() |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's quite an increase -- up to 9 hours longer runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The VEP annotation is added in the FILTER step so that the VEP column is kept at the end. However, if this causes issues for OTP, I can also move it into the annotation step and change the column order here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh no. For OTP the runtime really does not matter. But for the users who are waiting for the data.
Your comment about the VEP column and the column order I don't understand.