Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faulty right-shifting of deletions for protein annotations #498

Closed
roland-ewald opened this issue Aug 16, 2020 · 0 comments · Fixed by #512
Closed

Faulty right-shifting of deletions for protein annotations #498

roland-ewald opened this issue Aug 16, 2020 · 0 comments · Fixed by #512

Comments

@roland-ewald
Copy link
Contributor

roland-ewald commented Aug 16, 2020

Summary

It seems that AminoAcidChangeNormalizer#normalizeDeletion currently does not correctly right-shift protein sequences, introducing erroneous protein change HGVS annotations. Other kinds of variants (besides deletions) may be affected as well.

Problem

The current code addresses the problems with deletions introducing frameshifts by avoiding right-shifting in those cases altogether, by checking change.getAlt().length() != 0.
This may be wrong if the frame-shifted sequence introduces the same amino acid(s) that would have followed in the wild type as well. It also does not address the following corner case:

  • consider a deletion coinciding with the start of a codon, so that changeBeginPos.getFrameshift() == 0
    in the calling code (DeletionAnnotationBuilder), which means that insAA will be empty and thus normalization will be attempted by AminoAcidChangeNormalizer#normalizeDeletion
  • assume the length of the deletion is not divisible by three, so it will itself introduce a frameshift, so that varAASeq (the AA sequence containing the variant) will differ significantly from from wtAASeq (the wildtype AA sequence) at the position of the deletion (and downstream)
  • since AminoAcidChangeNormalizer#normalizeDeletion only operates on the wildtype AA sequence (wtAASeq), this goes unnoticed, and in case the sequence of deleted amino acids starts with the same amino acid(s) as those after the deletion in the wild type, the variant will be erroneously right-shifted on that wild-type sequence (the right-shifting currently does not consider varAASeq at all)

Example (for corner case, see above)

Consider the following deletion (from a unit test in the PR attempting to fix this issue).

Wildtype AA sequence:      H | T | G | E | K | P |fs F| E | C |
                     GGAAACAT|ACT|GGG|GAG|AAA|CCC| TTT|GAG|TGT|CCCAAATGTGGGAAGTGTTACTTTCGGAAGGAGA..
                     GGAAACAT|ACT|GGG|---|---|---|-TTG|AGT|GTC|CCAAATGT..
                                                 |  L | S | V |

Before AminoAcidChangeNormalizer#normalizeDeletion gets called, AminoAcidChange will be 'EKPF'>''. However, due to the above problem this will be right-shifted to KPFE>'' (as the first E in the wildtype sequence gets compared to the second E in the wildtype in the above example; afterwards K will be compared to C, which is wrong, so that shift = 1).

This approach leads to a wrong position and a wrong AA displayed in the variant's protein change HGVS annotation. In this example, the protein change should be p.(Glu316Leufs*25) (in one-letter codes, p.(E316Lfs*25)) but is erroneously changed to p.(Lys317Serfs*24) (p.(K317Sfs*24)).

Solution: more general approach for right-shifting

In contrast to the current right-shifting code for nucleotides, it seems a better approach here would be to simply compare both wild-type and variant AA sequence, as they already have been computed by the AminoAcidChangeNormalizer anyhow. This approach has been implemented and tested in the accompanying PR (#499).

Now, right-shifting also happens in case a deletion does not coincide with a codon start (i.e. insAA is non-empty, as checked by the previous code to avoid normalization). This is necessary because the resulting sequence in the variant AA sequence (varAASeq) could still continue with the same amino acids as the wild type sequence would have, even for deletions that do not coincide with the beginning of a codon. For example:

Wildtype AA sequence:      H | T | G | E | K | P | F | E | C |
                     GGAAACAT|ACT|GGG|GAG|AAA|CCC|TTT|GAG|TGT|CCCAAATGTGGGAAGTGTTACTTTCGGAAGGAGA..
                     GGAAACAT|ACT|GG-|--G|AAA|CCC|TTT|AGT|GTC|CCAAATGT..
                                     | G | K | P |

This deletion would not have been normalized before, although the deletion results in the varant AA sequence continuing with the same AA (G) and thus the variant needs to be right-shifted by one.

Related issues

roland-ewald added a commit to limbus-medtec/jannovar that referenced this issue Aug 16, 2020
Problem:

The previous code addressed the problems with deletions introducing
frameshifts by avoiding right-shifting in those cases altogether,
by checking change.getAlt().length() != 0 in
AminoAcidChangeNormalizer#normalizeDeletion.
This may be wrong if the frame-shifted sequence introduces the same
amino acid(s) that would have followed in the wild type as well.

It also does not address the following corner case:

- consider a deletion coinciding with the start of a codon, so that
changeBeginPos.getFrameshift() == 0 in the calling code
(DeletionAnnotationBuilder), which means that insAA will be empty and
thus normalization will be attempted by
AminoAcidChangeNormalizer#normalizeDeletion

- assume the length of the deletion is not divisible by three, so it
will itself introduce a frameshift, so that varAASeq (the AA sequence
containing the variant) will differ significantly from from wtAASeq
(the wildtype AA sequence) at the position of the deletion
(and downstream)

- since AminoAcidChangeNormalizer#normalizeDeletion only operates on the
wildtype AA sequence (wtAASeq), this goes unnoticed, and in case the
sequence of deleted amino acids starts with the same amino acid(s) as
those after the deletion in the wild type, the variant will be
erroneously right-shifted on that wild-type sequence
(the right-shifting currently did not consider varAASeq at all)

Example:

      H | T | G | E | K | P |fs F| E | C |
GGAAACAT|ACT|GGG|GAG|AAA|CCC| TTT|GAG|TGT|CCCAAATGTGGGAAGTGTTACTTTCGG...
GGAAACAT|ACT|GGG|---|---|---|-TTG|AGT|GTC|CCAAATGT..
                            |  L | S | V |

Before AminoAcidChangeNormalizer#normalizeDeletion gets called,
AminoAcidChange will be 'EKPF'>''. However, due to the above problem
this will be right-shifted to 'KPFE'>'', which leads to a wrong position
and a wrong AA displayed in the variant's protein change HGVS
annotation.

In this example, the protein change should be p.(Glu316Leufs*25)
(in one-letter codes, p.(E316Lfs*25)) but is erroneously changed to
p.(Lys317Serfs*24) (p.(K317Sfs*24)).

Solution:

In contrast to the right-shifting code for nucleotides, it seems a
better approach here would be to simply compare both wild-type and
variant AA sequence, as they already have been computed by the
AminoAcidChangeNormalizer anyhow. So, varAASeq is now passed into
AminoAcidChangeNormalizer#normalizeDeletion and compared to the wildtype
AA sequence. This also allows us to apply right-shifting to deletions
that do not coincide with the beginning of a codon.

Tests:

This adds both an 'integration' test with the above example to
DeletionAnnotationBuilderTest and some unit tests to
AminoAcidChangeTest. The latter has also been adjusted to make it clear
that the tested operations should work on AA sequences, not nucleotide
sequences (globally replaced 'A' by 'L').
roland-ewald added a commit to limbus-medtec/jannovar that referenced this issue Aug 17, 2020
roland-ewald added a commit to limbus-medtec/jannovar that referenced this issue Jun 6, 2021
This introduces the changes from 2190296 to
BlockSubstitutionAnnotationBuilder and InsertionAnnotationBuilder
as well.

Additionally, this removes now-obsolete code and moves the call to
'trim' the AA change before shifting to AminoAcidChangeNormalizer
itself, and also adds and fixes tests.
holtgrewe pushed a commit that referenced this issue Jun 7, 2021
** This is a combination of 4 commits.

**This is the 1st commit message:**

Problem:

The previous code addressed the problems with deletions introducing
frameshifts by avoiding right-shifting in those cases altogether,
by checking change.getAlt().length() != 0 in
AminoAcidChangeNormalizer#normalizeDeletion.
This may be wrong if the frame-shifted sequence introduces the same
amino acid(s) that would have followed in the wild type as well.

It also does not address the following corner case:

- consider a deletion coinciding with the start of a codon, so that
changeBeginPos.getFrameshift() == 0 in the calling code
(DeletionAnnotationBuilder), which means that insAA will be empty and
thus normalization will be attempted by
AminoAcidChangeNormalizer#normalizeDeletion

- assume the length of the deletion is not divisible by three, so it
will itself introduce a frameshift, so that varAASeq (the AA sequence
containing the variant) will differ significantly from from wtAASeq
(the wildtype AA sequence) at the position of the deletion
(and downstream)

- since AminoAcidChangeNormalizer#normalizeDeletion only operates on the
wildtype AA sequence (wtAASeq), this goes unnoticed, and in case the
sequence of deleted amino acids starts with the same amino acid(s) as
those after the deletion in the wild type, the variant will be
erroneously right-shifted on that wild-type sequence
(the right-shifting currently did not consider varAASeq at all)

Example:

      H | T | G | E | K | P |fs F| E | C |
GGAAACAT|ACT|GGG|GAG|AAA|CCC| TTT|GAG|TGT|CCCAAATGTGGGAAGTGTTACTTTCGG...
GGAAACAT|ACT|GGG|---|---|---|-TTG|AGT|GTC|CCAAATGT..
                            |  L | S | V |

Before AminoAcidChangeNormalizer#normalizeDeletion gets called,
AminoAcidChange will be 'EKPF'>''. However, due to the above problem
this will be right-shifted to 'KPFE'>'', which leads to a wrong position
and a wrong AA displayed in the variant's protein change HGVS
annotation.

In this example, the protein change should be p.(Glu316Leufs*25)
(in one-letter codes, p.(E316Lfs*25)) but is erroneously changed to
p.(Lys317Serfs*24) (p.(K317Sfs*24)).

Solution:

In contrast to the right-shifting code for nucleotides, it seems a
better approach here would be to simply compare both wild-type and
variant AA sequence, as they already have been computed by the
AminoAcidChangeNormalizer anyhow. So, varAASeq is now passed into
AminoAcidChangeNormalizer#normalizeDeletion and compared to the wildtype
AA sequence. This also allows us to apply right-shifting to deletions
that do not coincide with the beginning of a codon.

Tests:

This adds both an 'integration' test with the above example to
DeletionAnnotationBuilderTest and some unit tests to
AminoAcidChangeTest. The latter has also been adjusted to make it clear
that the tested operations should work on AA sequences, not nucleotide
sequences (globally replaced 'A' by 'L').

**This is the commit message #2:**

**#498: fix normalizeDeletion(...) javadoc**

**This is the commit message #3:**

**498: refactor+fix shift for synonymous AA changes**

This introduces the changes from 2190296 to
BlockSubstitutionAnnotationBuilder and InsertionAnnotationBuilder
as well.

Additionally, this removes now-obsolete code and moves the call to
'trim' the AA change before shifting to AminoAcidChangeNormalizer
itself, and also adds and fixes tests.

Co-authored-by: Roland Ewald <[email protected]>
holtgrewe pushed a commit that referenced this issue Jun 7, 2021
**This is a combination of 4 commits.**

**This is the 1st commit message:**

Problem:

The previous code addressed the problems with deletions introducing
frameshifts by avoiding right-shifting in those cases altogether,
by checking change.getAlt().length() != 0 in
AminoAcidChangeNormalizer#normalizeDeletion.
This may be wrong if the frame-shifted sequence introduces the same
amino acid(s) that would have followed in the wild type as well.

It also does not address the following corner case:

- consider a deletion coinciding with the start of a codon, so that
changeBeginPos.getFrameshift() == 0 in the calling code
(DeletionAnnotationBuilder), which means that insAA will be empty and
thus normalization will be attempted by
AminoAcidChangeNormalizer#normalizeDeletion

- assume the length of the deletion is not divisible by three, so it
will itself introduce a frameshift, so that varAASeq (the AA sequence
containing the variant) will differ significantly from from wtAASeq
(the wildtype AA sequence) at the position of the deletion
(and downstream)

- since AminoAcidChangeNormalizer#normalizeDeletion only operates on the
wildtype AA sequence (wtAASeq), this goes unnoticed, and in case the
sequence of deleted amino acids starts with the same amino acid(s) as
those after the deletion in the wild type, the variant will be
erroneously right-shifted on that wild-type sequence
(the right-shifting currently did not consider varAASeq at all)

Example:

      H | T | G | E | K | P |fs F| E | C |
GGAAACAT|ACT|GGG|GAG|AAA|CCC| TTT|GAG|TGT|CCCAAATGTGGGAAGTGTTACTTTCGG...
GGAAACAT|ACT|GGG|---|---|---|-TTG|AGT|GTC|CCAAATGT..
                            |  L | S | V |

Before AminoAcidChangeNormalizer#normalizeDeletion gets called,
AminoAcidChange will be 'EKPF'>''. However, due to the above problem
this will be right-shifted to 'KPFE'>'', which leads to a wrong position
and a wrong AA displayed in the variant's protein change HGVS
annotation.

In this example, the protein change should be p.(Glu316Leufs*25)
(in one-letter codes, p.(E316Lfs*25)) but is erroneously changed to
p.(Lys317Serfs*24) (p.(K317Sfs*24)).

Solution:

In contrast to the right-shifting code for nucleotides, it seems a
better approach here would be to simply compare both wild-type and
variant AA sequence, as they already have been computed by the
AminoAcidChangeNormalizer anyhow. So, varAASeq is now passed into
AminoAcidChangeNormalizer#normalizeDeletion and compared to the wildtype
AA sequence. This also allows us to apply right-shifting to deletions
that do not coincide with the beginning of a codon.

Tests:

This adds both an 'integration' test with the above example to
DeletionAnnotationBuilderTest and some unit tests to
AminoAcidChangeTest. The latter has also been adjusted to make it clear
that the tested operations should work on AA sequences, not nucleotide
sequences (globally replaced 'A' by 'L').

**This is the commit message #2:**

**#498: fix normalizeDeletion(...) javadoc**

**This is the commit message #3:**

**498: refactor+fix shift for synonymous AA changes**

This introduces the changes from 2190296 to
BlockSubstitutionAnnotationBuilder and InsertionAnnotationBuilder
as well.

Additionally, this removes now-obsolete code and moves the call to
'trim' the AA change before shifting to AminoAcidChangeNormalizer
itself, and also adds and fixes tests.

Co-authored-by: Roland Ewald <[email protected]>
holtgrewe pushed a commit that referenced this issue Jun 7, 2021
**This is a combination of 4 commits.**

**This is the 1st commit message:**

Problem:

The previous code addressed the problems with deletions introducing
frameshifts by avoiding right-shifting in those cases altogether,
by checking change.getAlt().length() != 0 in
AminoAcidChangeNormalizer#normalizeDeletion.
This may be wrong if the frame-shifted sequence introduces the same
amino acid(s) that would have followed in the wild type as well.

It also does not address the following corner case:

- consider a deletion coinciding with the start of a codon, so that
changeBeginPos.getFrameshift() == 0 in the calling code
(DeletionAnnotationBuilder), which means that insAA will be empty and
thus normalization will be attempted by
AminoAcidChangeNormalizer#normalizeDeletion

- assume the length of the deletion is not divisible by three, so it
will itself introduce a frameshift, so that varAASeq (the AA sequence
containing the variant) will differ significantly from from wtAASeq
(the wildtype AA sequence) at the position of the deletion
(and downstream)

- since AminoAcidChangeNormalizer#normalizeDeletion only operates on the
wildtype AA sequence (wtAASeq), this goes unnoticed, and in case the
sequence of deleted amino acids starts with the same amino acid(s) as
those after the deletion in the wild type, the variant will be
erroneously right-shifted on that wild-type sequence
(the right-shifting currently did not consider varAASeq at all)

Example:

      H | T | G | E | K | P |fs F| E | C |
GGAAACAT|ACT|GGG|GAG|AAA|CCC| TTT|GAG|TGT|CCCAAATGTGGGAAGTGTTACTTTCGG...
GGAAACAT|ACT|GGG|---|---|---|-TTG|AGT|GTC|CCAAATGT..
                            |  L | S | V |

Before AminoAcidChangeNormalizer#normalizeDeletion gets called,
AminoAcidChange will be 'EKPF'>''. However, due to the above problem
this will be right-shifted to 'KPFE'>'', which leads to a wrong position
and a wrong AA displayed in the variant's protein change HGVS
annotation.

In this example, the protein change should be p.(Glu316Leufs*25)
(in one-letter codes, p.(E316Lfs*25)) but is erroneously changed to
p.(Lys317Serfs*24) (p.(K317Sfs*24)).

Solution:

In contrast to the right-shifting code for nucleotides, it seems a
better approach here would be to simply compare both wild-type and
variant AA sequence, as they already have been computed by the
AminoAcidChangeNormalizer anyhow. So, varAASeq is now passed into
AminoAcidChangeNormalizer#normalizeDeletion and compared to the wildtype
AA sequence. This also allows us to apply right-shifting to deletions
that do not coincide with the beginning of a codon.

Tests:

This adds both an 'integration' test with the above example to
DeletionAnnotationBuilderTest and some unit tests to
AminoAcidChangeTest. The latter has also been adjusted to make it clear
that the tested operations should work on AA sequences, not nucleotide
sequences (globally replaced 'A' by 'L').

**This is the commit message #2:**

**#498: fix normalizeDeletion(...) javadoc**

**This is the commit message #3:**

**498: refactor+fix shift for synonymous AA changes**

This introduces the changes from 2190296 to
BlockSubstitutionAnnotationBuilder and InsertionAnnotationBuilder
as well.

Additionally, this removes now-obsolete code and moves the call to
'trim' the AA change before shifting to AminoAcidChangeNormalizer
itself, and also adds and fixes tests.

Co-authored-by: Roland Ewald <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant