Faulty right-shifting of deletions for protein annotations #498

roland-ewald · 2020-08-16T20:33:26Z

Summary

It seems that AminoAcidChangeNormalizer#normalizeDeletion currently does not correctly right-shift protein sequences, introducing erroneous protein change HGVS annotations. Other kinds of variants (besides deletions) may be affected as well.

Problem

The current code addresses the problems with deletions introducing frameshifts by avoiding right-shifting in those cases altogether, by checking change.getAlt().length() != 0.
This may be wrong if the frame-shifted sequence introduces the same amino acid(s) that would have followed in the wild type as well. It also does not address the following corner case:

consider a deletion coinciding with the start of a codon, so that changeBeginPos.getFrameshift() == 0
in the calling code (DeletionAnnotationBuilder), which means that insAA will be empty and thus normalization will be attempted by AminoAcidChangeNormalizer#normalizeDeletion
assume the length of the deletion is not divisible by three, so it will itself introduce a frameshift, so that varAASeq (the AA sequence containing the variant) will differ significantly from from wtAASeq (the wildtype AA sequence) at the position of the deletion (and downstream)
since AminoAcidChangeNormalizer#normalizeDeletion only operates on the wildtype AA sequence (wtAASeq), this goes unnoticed, and in case the sequence of deleted amino acids starts with the same amino acid(s) as those after the deletion in the wild type, the variant will be erroneously right-shifted on that wild-type sequence (the right-shifting currently does not consider varAASeq at all)

Example (for corner case, see above)

Consider the following deletion (from a unit test in the PR attempting to fix this issue).

Wildtype AA sequence:      H | T | G | E | K | P |fs F| E | C |
                     GGAAACAT|ACT|GGG|GAG|AAA|CCC| TTT|GAG|TGT|CCCAAATGTGGGAAGTGTTACTTTCGGAAGGAGA..
                     GGAAACAT|ACT|GGG|---|---|---|-TTG|AGT|GTC|CCAAATGT..
                                                 |  L | S | V |

Before AminoAcidChangeNormalizer#normalizeDeletion gets called, AminoAcidChange will be 'EKPF'>''. However, due to the above problem this will be right-shifted to KPFE>'' (as the first E in the wildtype sequence gets compared to the second E in the wildtype in the above example; afterwards K will be compared to C, which is wrong, so that shift = 1).

This approach leads to a wrong position and a wrong AA displayed in the variant's protein change HGVS annotation. In this example, the protein change should be p.(Glu316Leufs*25) (in one-letter codes, p.(E316Lfs*25)) but is erroneously changed to p.(Lys317Serfs*24) (p.(K317Sfs*24)).

Solution: more general approach for right-shifting

In contrast to the current right-shifting code for nucleotides, it seems a better approach here would be to simply compare both wild-type and variant AA sequence, as they already have been computed by the AminoAcidChangeNormalizer anyhow. This approach has been implemented and tested in the accompanying PR (#499).

Now, right-shifting also happens in case a deletion does not coincide with a codon start (i.e. insAA is non-empty, as checked by the previous code to avoid normalization). This is necessary because the resulting sequence in the variant AA sequence (varAASeq) could still continue with the same amino acids as the wild type sequence would have, even for deletions that do not coincide with the beginning of a codon. For example:

Wildtype AA sequence:      H | T | G | E | K | P | F | E | C |
                     GGAAACAT|ACT|GGG|GAG|AAA|CCC|TTT|GAG|TGT|CCCAAATGTGGGAAGTGTTACTTTCGGAAGGAGA..
                     GGAAACAT|ACT|GG-|--G|AAA|CCC|TTT|AGT|GTC|CCAAATGT..
                                     | G | K | P |

This deletion would not have been normalized before, although the deletion results in the varant AA sequence continuing with the same AA (G) and thus the variant needs to be right-shifted by one.

Related issues

it seems HGVS protein annotation error for deletions (may impact other mutation types e.g. ins) #481 may be related to this issue
if the above reasoning is correct, the code on which InsertionAnnotationBuilder relies for protein sequence normalization may need to be changed correspondingly (happy to help here).

The text was updated successfully, but these errors were encountered:

Problem: The previous code addressed the problems with deletions introducing frameshifts by avoiding right-shifting in those cases altogether, by checking change.getAlt().length() != 0 in AminoAcidChangeNormalizer#normalizeDeletion. This may be wrong if the frame-shifted sequence introduces the same amino acid(s) that would have followed in the wild type as well. It also does not address the following corner case: - consider a deletion coinciding with the start of a codon, so that changeBeginPos.getFrameshift() == 0 in the calling code (DeletionAnnotationBuilder), which means that insAA will be empty and thus normalization will be attempted by AminoAcidChangeNormalizer#normalizeDeletion - assume the length of the deletion is not divisible by three, so it will itself introduce a frameshift, so that varAASeq (the AA sequence containing the variant) will differ significantly from from wtAASeq (the wildtype AA sequence) at the position of the deletion (and downstream) - since AminoAcidChangeNormalizer#normalizeDeletion only operates on the wildtype AA sequence (wtAASeq), this goes unnoticed, and in case the sequence of deleted amino acids starts with the same amino acid(s) as those after the deletion in the wild type, the variant will be erroneously right-shifted on that wild-type sequence (the right-shifting currently did not consider varAASeq at all) Example: H | T | G | E | K | P |fs F| E | C | GGAAACAT|ACT|GGG|GAG|AAA|CCC| TTT|GAG|TGT|CCCAAATGTGGGAAGTGTTACTTTCGG... GGAAACAT|ACT|GGG|---|---|---|-TTG|AGT|GTC|CCAAATGT.. | L | S | V | Before AminoAcidChangeNormalizer#normalizeDeletion gets called, AminoAcidChange will be 'EKPF'>''. However, due to the above problem this will be right-shifted to 'KPFE'>'', which leads to a wrong position and a wrong AA displayed in the variant's protein change HGVS annotation. In this example, the protein change should be p.(Glu316Leufs*25) (in one-letter codes, p.(E316Lfs*25)) but is erroneously changed to p.(Lys317Serfs*24) (p.(K317Sfs*24)). Solution: In contrast to the right-shifting code for nucleotides, it seems a better approach here would be to simply compare both wild-type and variant AA sequence, as they already have been computed by the AminoAcidChangeNormalizer anyhow. So, varAASeq is now passed into AminoAcidChangeNormalizer#normalizeDeletion and compared to the wildtype AA sequence. This also allows us to apply right-shifting to deletions that do not coincide with the beginning of a codon. Tests: This adds both an 'integration' test with the above example to DeletionAnnotationBuilderTest and some unit tests to AminoAcidChangeTest. The latter has also been adjusted to make it clear that the tested operations should work on AA sequences, not nucleotide sequences (globally replaced 'A' by 'L').

This introduces the changes from 2190296 to BlockSubstitutionAnnotationBuilder and InsertionAnnotationBuilder as well. Additionally, this removes now-obsolete code and moves the call to 'trim' the AA change before shifting to AminoAcidChangeNormalizer itself, and also adds and fixes tests.

** This is a combination of 4 commits. **This is the 1st commit message:** Problem: The previous code addressed the problems with deletions introducing frameshifts by avoiding right-shifting in those cases altogether, by checking change.getAlt().length() != 0 in AminoAcidChangeNormalizer#normalizeDeletion. This may be wrong if the frame-shifted sequence introduces the same amino acid(s) that would have followed in the wild type as well. It also does not address the following corner case: - consider a deletion coinciding with the start of a codon, so that changeBeginPos.getFrameshift() == 0 in the calling code (DeletionAnnotationBuilder), which means that insAA will be empty and thus normalization will be attempted by AminoAcidChangeNormalizer#normalizeDeletion - assume the length of the deletion is not divisible by three, so it will itself introduce a frameshift, so that varAASeq (the AA sequence containing the variant) will differ significantly from from wtAASeq (the wildtype AA sequence) at the position of the deletion (and downstream) - since AminoAcidChangeNormalizer#normalizeDeletion only operates on the wildtype AA sequence (wtAASeq), this goes unnoticed, and in case the sequence of deleted amino acids starts with the same amino acid(s) as those after the deletion in the wild type, the variant will be erroneously right-shifted on that wild-type sequence (the right-shifting currently did not consider varAASeq at all) Example: H | T | G | E | K | P |fs F| E | C | GGAAACAT|ACT|GGG|GAG|AAA|CCC| TTT|GAG|TGT|CCCAAATGTGGGAAGTGTTACTTTCGG... GGAAACAT|ACT|GGG|---|---|---|-TTG|AGT|GTC|CCAAATGT.. | L | S | V | Before AminoAcidChangeNormalizer#normalizeDeletion gets called, AminoAcidChange will be 'EKPF'>''. However, due to the above problem this will be right-shifted to 'KPFE'>'', which leads to a wrong position and a wrong AA displayed in the variant's protein change HGVS annotation. In this example, the protein change should be p.(Glu316Leufs*25) (in one-letter codes, p.(E316Lfs*25)) but is erroneously changed to p.(Lys317Serfs*24) (p.(K317Sfs*24)). Solution: In contrast to the right-shifting code for nucleotides, it seems a better approach here would be to simply compare both wild-type and variant AA sequence, as they already have been computed by the AminoAcidChangeNormalizer anyhow. So, varAASeq is now passed into AminoAcidChangeNormalizer#normalizeDeletion and compared to the wildtype AA sequence. This also allows us to apply right-shifting to deletions that do not coincide with the beginning of a codon. Tests: This adds both an 'integration' test with the above example to DeletionAnnotationBuilderTest and some unit tests to AminoAcidChangeTest. The latter has also been adjusted to make it clear that the tested operations should work on AA sequences, not nucleotide sequences (globally replaced 'A' by 'L'). **This is the commit message #2:** **#498: fix normalizeDeletion(...) javadoc** **This is the commit message #3:** **498: refactor+fix shift for synonymous AA changes** This introduces the changes from 2190296 to BlockSubstitutionAnnotationBuilder and InsertionAnnotationBuilder as well. Additionally, this removes now-obsolete code and moves the call to 'trim' the AA change before shifting to AminoAcidChangeNormalizer itself, and also adds and fixes tests. Co-authored-by: Roland Ewald <[email protected]>

**This is a combination of 4 commits.** **This is the 1st commit message:** Problem: The previous code addressed the problems with deletions introducing frameshifts by avoiding right-shifting in those cases altogether, by checking change.getAlt().length() != 0 in AminoAcidChangeNormalizer#normalizeDeletion. This may be wrong if the frame-shifted sequence introduces the same amino acid(s) that would have followed in the wild type as well. It also does not address the following corner case: - consider a deletion coinciding with the start of a codon, so that changeBeginPos.getFrameshift() == 0 in the calling code (DeletionAnnotationBuilder), which means that insAA will be empty and thus normalization will be attempted by AminoAcidChangeNormalizer#normalizeDeletion - assume the length of the deletion is not divisible by three, so it will itself introduce a frameshift, so that varAASeq (the AA sequence containing the variant) will differ significantly from from wtAASeq (the wildtype AA sequence) at the position of the deletion (and downstream) - since AminoAcidChangeNormalizer#normalizeDeletion only operates on the wildtype AA sequence (wtAASeq), this goes unnoticed, and in case the sequence of deleted amino acids starts with the same amino acid(s) as those after the deletion in the wild type, the variant will be erroneously right-shifted on that wild-type sequence (the right-shifting currently did not consider varAASeq at all) Example: H | T | G | E | K | P |fs F| E | C | GGAAACAT|ACT|GGG|GAG|AAA|CCC| TTT|GAG|TGT|CCCAAATGTGGGAAGTGTTACTTTCGG... GGAAACAT|ACT|GGG|---|---|---|-TTG|AGT|GTC|CCAAATGT.. | L | S | V | Before AminoAcidChangeNormalizer#normalizeDeletion gets called, AminoAcidChange will be 'EKPF'>''. However, due to the above problem this will be right-shifted to 'KPFE'>'', which leads to a wrong position and a wrong AA displayed in the variant's protein change HGVS annotation. In this example, the protein change should be p.(Glu316Leufs*25) (in one-letter codes, p.(E316Lfs*25)) but is erroneously changed to p.(Lys317Serfs*24) (p.(K317Sfs*24)). Solution: In contrast to the right-shifting code for nucleotides, it seems a better approach here would be to simply compare both wild-type and variant AA sequence, as they already have been computed by the AminoAcidChangeNormalizer anyhow. So, varAASeq is now passed into AminoAcidChangeNormalizer#normalizeDeletion and compared to the wildtype AA sequence. This also allows us to apply right-shifting to deletions that do not coincide with the beginning of a codon. Tests: This adds both an 'integration' test with the above example to DeletionAnnotationBuilderTest and some unit tests to AminoAcidChangeTest. The latter has also been adjusted to make it clear that the tested operations should work on AA sequences, not nucleotide sequences (globally replaced 'A' by 'L'). **This is the commit message #2:** **#498: fix normalizeDeletion(...) javadoc** **This is the commit message #3:** **498: refactor+fix shift for synonymous AA changes** This introduces the changes from 2190296 to BlockSubstitutionAnnotationBuilder and InsertionAnnotationBuilder as well. Additionally, this removes now-obsolete code and moves the call to 'trim' the AA change before shifting to AminoAcidChangeNormalizer itself, and also adds and fixes tests. Co-authored-by: Roland Ewald <[email protected]>

roland-ewald mentioned this issue Aug 16, 2020

Fix right-shifting deletions on AA sequence (fixes #498) #499

Closed

3 tasks

roland-ewald added a commit to limbus-medtec/jannovar that referenced this issue Aug 17, 2020

charite#498: fix normalizeDeletion(...) javadoc

b7be688

holtgrewe mentioned this issue Jun 7, 2021

Fix for right-shifting deletions on AA sequences. #512

Merged

holtgrewe closed this as completed in #512 Jun 7, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faulty right-shifting of deletions for protein annotations #498

Faulty right-shifting of deletions for protein annotations #498

roland-ewald commented Aug 16, 2020 •

edited

Loading

Faulty right-shifting of deletions for protein annotations #498

Faulty right-shifting of deletions for protein annotations #498

Comments

roland-ewald commented Aug 16, 2020 • edited Loading

Summary

Problem

Example (for corner case, see above)

Solution: more general approach for right-shifting

Related issues

roland-ewald commented Aug 16, 2020 •

edited

Loading