-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faulty right-shifting of deletions for protein annotations #498
Comments
roland-ewald
added a commit
to limbus-medtec/jannovar
that referenced
this issue
Aug 16, 2020
Problem: The previous code addressed the problems with deletions introducing frameshifts by avoiding right-shifting in those cases altogether, by checking change.getAlt().length() != 0 in AminoAcidChangeNormalizer#normalizeDeletion. This may be wrong if the frame-shifted sequence introduces the same amino acid(s) that would have followed in the wild type as well. It also does not address the following corner case: - consider a deletion coinciding with the start of a codon, so that changeBeginPos.getFrameshift() == 0 in the calling code (DeletionAnnotationBuilder), which means that insAA will be empty and thus normalization will be attempted by AminoAcidChangeNormalizer#normalizeDeletion - assume the length of the deletion is not divisible by three, so it will itself introduce a frameshift, so that varAASeq (the AA sequence containing the variant) will differ significantly from from wtAASeq (the wildtype AA sequence) at the position of the deletion (and downstream) - since AminoAcidChangeNormalizer#normalizeDeletion only operates on the wildtype AA sequence (wtAASeq), this goes unnoticed, and in case the sequence of deleted amino acids starts with the same amino acid(s) as those after the deletion in the wild type, the variant will be erroneously right-shifted on that wild-type sequence (the right-shifting currently did not consider varAASeq at all) Example: H | T | G | E | K | P |fs F| E | C | GGAAACAT|ACT|GGG|GAG|AAA|CCC| TTT|GAG|TGT|CCCAAATGTGGGAAGTGTTACTTTCGG... GGAAACAT|ACT|GGG|---|---|---|-TTG|AGT|GTC|CCAAATGT.. | L | S | V | Before AminoAcidChangeNormalizer#normalizeDeletion gets called, AminoAcidChange will be 'EKPF'>''. However, due to the above problem this will be right-shifted to 'KPFE'>'', which leads to a wrong position and a wrong AA displayed in the variant's protein change HGVS annotation. In this example, the protein change should be p.(Glu316Leufs*25) (in one-letter codes, p.(E316Lfs*25)) but is erroneously changed to p.(Lys317Serfs*24) (p.(K317Sfs*24)). Solution: In contrast to the right-shifting code for nucleotides, it seems a better approach here would be to simply compare both wild-type and variant AA sequence, as they already have been computed by the AminoAcidChangeNormalizer anyhow. So, varAASeq is now passed into AminoAcidChangeNormalizer#normalizeDeletion and compared to the wildtype AA sequence. This also allows us to apply right-shifting to deletions that do not coincide with the beginning of a codon. Tests: This adds both an 'integration' test with the above example to DeletionAnnotationBuilderTest and some unit tests to AminoAcidChangeTest. The latter has also been adjusted to make it clear that the tested operations should work on AA sequences, not nucleotide sequences (globally replaced 'A' by 'L').
3 tasks
roland-ewald
added a commit
to limbus-medtec/jannovar
that referenced
this issue
Aug 17, 2020
roland-ewald
added a commit
to limbus-medtec/jannovar
that referenced
this issue
Jun 6, 2021
This introduces the changes from 2190296 to BlockSubstitutionAnnotationBuilder and InsertionAnnotationBuilder as well. Additionally, this removes now-obsolete code and moves the call to 'trim' the AA change before shifting to AminoAcidChangeNormalizer itself, and also adds and fixes tests.
holtgrewe
pushed a commit
that referenced
this issue
Jun 7, 2021
** This is a combination of 4 commits. **This is the 1st commit message:** Problem: The previous code addressed the problems with deletions introducing frameshifts by avoiding right-shifting in those cases altogether, by checking change.getAlt().length() != 0 in AminoAcidChangeNormalizer#normalizeDeletion. This may be wrong if the frame-shifted sequence introduces the same amino acid(s) that would have followed in the wild type as well. It also does not address the following corner case: - consider a deletion coinciding with the start of a codon, so that changeBeginPos.getFrameshift() == 0 in the calling code (DeletionAnnotationBuilder), which means that insAA will be empty and thus normalization will be attempted by AminoAcidChangeNormalizer#normalizeDeletion - assume the length of the deletion is not divisible by three, so it will itself introduce a frameshift, so that varAASeq (the AA sequence containing the variant) will differ significantly from from wtAASeq (the wildtype AA sequence) at the position of the deletion (and downstream) - since AminoAcidChangeNormalizer#normalizeDeletion only operates on the wildtype AA sequence (wtAASeq), this goes unnoticed, and in case the sequence of deleted amino acids starts with the same amino acid(s) as those after the deletion in the wild type, the variant will be erroneously right-shifted on that wild-type sequence (the right-shifting currently did not consider varAASeq at all) Example: H | T | G | E | K | P |fs F| E | C | GGAAACAT|ACT|GGG|GAG|AAA|CCC| TTT|GAG|TGT|CCCAAATGTGGGAAGTGTTACTTTCGG... GGAAACAT|ACT|GGG|---|---|---|-TTG|AGT|GTC|CCAAATGT.. | L | S | V | Before AminoAcidChangeNormalizer#normalizeDeletion gets called, AminoAcidChange will be 'EKPF'>''. However, due to the above problem this will be right-shifted to 'KPFE'>'', which leads to a wrong position and a wrong AA displayed in the variant's protein change HGVS annotation. In this example, the protein change should be p.(Glu316Leufs*25) (in one-letter codes, p.(E316Lfs*25)) but is erroneously changed to p.(Lys317Serfs*24) (p.(K317Sfs*24)). Solution: In contrast to the right-shifting code for nucleotides, it seems a better approach here would be to simply compare both wild-type and variant AA sequence, as they already have been computed by the AminoAcidChangeNormalizer anyhow. So, varAASeq is now passed into AminoAcidChangeNormalizer#normalizeDeletion and compared to the wildtype AA sequence. This also allows us to apply right-shifting to deletions that do not coincide with the beginning of a codon. Tests: This adds both an 'integration' test with the above example to DeletionAnnotationBuilderTest and some unit tests to AminoAcidChangeTest. The latter has also been adjusted to make it clear that the tested operations should work on AA sequences, not nucleotide sequences (globally replaced 'A' by 'L'). **This is the commit message #2:** **#498: fix normalizeDeletion(...) javadoc** **This is the commit message #3:** **498: refactor+fix shift for synonymous AA changes** This introduces the changes from 2190296 to BlockSubstitutionAnnotationBuilder and InsertionAnnotationBuilder as well. Additionally, this removes now-obsolete code and moves the call to 'trim' the AA change before shifting to AminoAcidChangeNormalizer itself, and also adds and fixes tests. Co-authored-by: Roland Ewald <[email protected]>
holtgrewe
pushed a commit
that referenced
this issue
Jun 7, 2021
**This is a combination of 4 commits.** **This is the 1st commit message:** Problem: The previous code addressed the problems with deletions introducing frameshifts by avoiding right-shifting in those cases altogether, by checking change.getAlt().length() != 0 in AminoAcidChangeNormalizer#normalizeDeletion. This may be wrong if the frame-shifted sequence introduces the same amino acid(s) that would have followed in the wild type as well. It also does not address the following corner case: - consider a deletion coinciding with the start of a codon, so that changeBeginPos.getFrameshift() == 0 in the calling code (DeletionAnnotationBuilder), which means that insAA will be empty and thus normalization will be attempted by AminoAcidChangeNormalizer#normalizeDeletion - assume the length of the deletion is not divisible by three, so it will itself introduce a frameshift, so that varAASeq (the AA sequence containing the variant) will differ significantly from from wtAASeq (the wildtype AA sequence) at the position of the deletion (and downstream) - since AminoAcidChangeNormalizer#normalizeDeletion only operates on the wildtype AA sequence (wtAASeq), this goes unnoticed, and in case the sequence of deleted amino acids starts with the same amino acid(s) as those after the deletion in the wild type, the variant will be erroneously right-shifted on that wild-type sequence (the right-shifting currently did not consider varAASeq at all) Example: H | T | G | E | K | P |fs F| E | C | GGAAACAT|ACT|GGG|GAG|AAA|CCC| TTT|GAG|TGT|CCCAAATGTGGGAAGTGTTACTTTCGG... GGAAACAT|ACT|GGG|---|---|---|-TTG|AGT|GTC|CCAAATGT.. | L | S | V | Before AminoAcidChangeNormalizer#normalizeDeletion gets called, AminoAcidChange will be 'EKPF'>''. However, due to the above problem this will be right-shifted to 'KPFE'>'', which leads to a wrong position and a wrong AA displayed in the variant's protein change HGVS annotation. In this example, the protein change should be p.(Glu316Leufs*25) (in one-letter codes, p.(E316Lfs*25)) but is erroneously changed to p.(Lys317Serfs*24) (p.(K317Sfs*24)). Solution: In contrast to the right-shifting code for nucleotides, it seems a better approach here would be to simply compare both wild-type and variant AA sequence, as they already have been computed by the AminoAcidChangeNormalizer anyhow. So, varAASeq is now passed into AminoAcidChangeNormalizer#normalizeDeletion and compared to the wildtype AA sequence. This also allows us to apply right-shifting to deletions that do not coincide with the beginning of a codon. Tests: This adds both an 'integration' test with the above example to DeletionAnnotationBuilderTest and some unit tests to AminoAcidChangeTest. The latter has also been adjusted to make it clear that the tested operations should work on AA sequences, not nucleotide sequences (globally replaced 'A' by 'L'). **This is the commit message #2:** **#498: fix normalizeDeletion(...) javadoc** **This is the commit message #3:** **498: refactor+fix shift for synonymous AA changes** This introduces the changes from 2190296 to BlockSubstitutionAnnotationBuilder and InsertionAnnotationBuilder as well. Additionally, this removes now-obsolete code and moves the call to 'trim' the AA change before shifting to AminoAcidChangeNormalizer itself, and also adds and fixes tests. Co-authored-by: Roland Ewald <[email protected]>
holtgrewe
pushed a commit
that referenced
this issue
Jun 7, 2021
**This is a combination of 4 commits.** **This is the 1st commit message:** Problem: The previous code addressed the problems with deletions introducing frameshifts by avoiding right-shifting in those cases altogether, by checking change.getAlt().length() != 0 in AminoAcidChangeNormalizer#normalizeDeletion. This may be wrong if the frame-shifted sequence introduces the same amino acid(s) that would have followed in the wild type as well. It also does not address the following corner case: - consider a deletion coinciding with the start of a codon, so that changeBeginPos.getFrameshift() == 0 in the calling code (DeletionAnnotationBuilder), which means that insAA will be empty and thus normalization will be attempted by AminoAcidChangeNormalizer#normalizeDeletion - assume the length of the deletion is not divisible by three, so it will itself introduce a frameshift, so that varAASeq (the AA sequence containing the variant) will differ significantly from from wtAASeq (the wildtype AA sequence) at the position of the deletion (and downstream) - since AminoAcidChangeNormalizer#normalizeDeletion only operates on the wildtype AA sequence (wtAASeq), this goes unnoticed, and in case the sequence of deleted amino acids starts with the same amino acid(s) as those after the deletion in the wild type, the variant will be erroneously right-shifted on that wild-type sequence (the right-shifting currently did not consider varAASeq at all) Example: H | T | G | E | K | P |fs F| E | C | GGAAACAT|ACT|GGG|GAG|AAA|CCC| TTT|GAG|TGT|CCCAAATGTGGGAAGTGTTACTTTCGG... GGAAACAT|ACT|GGG|---|---|---|-TTG|AGT|GTC|CCAAATGT.. | L | S | V | Before AminoAcidChangeNormalizer#normalizeDeletion gets called, AminoAcidChange will be 'EKPF'>''. However, due to the above problem this will be right-shifted to 'KPFE'>'', which leads to a wrong position and a wrong AA displayed in the variant's protein change HGVS annotation. In this example, the protein change should be p.(Glu316Leufs*25) (in one-letter codes, p.(E316Lfs*25)) but is erroneously changed to p.(Lys317Serfs*24) (p.(K317Sfs*24)). Solution: In contrast to the right-shifting code for nucleotides, it seems a better approach here would be to simply compare both wild-type and variant AA sequence, as they already have been computed by the AminoAcidChangeNormalizer anyhow. So, varAASeq is now passed into AminoAcidChangeNormalizer#normalizeDeletion and compared to the wildtype AA sequence. This also allows us to apply right-shifting to deletions that do not coincide with the beginning of a codon. Tests: This adds both an 'integration' test with the above example to DeletionAnnotationBuilderTest and some unit tests to AminoAcidChangeTest. The latter has also been adjusted to make it clear that the tested operations should work on AA sequences, not nucleotide sequences (globally replaced 'A' by 'L'). **This is the commit message #2:** **#498: fix normalizeDeletion(...) javadoc** **This is the commit message #3:** **498: refactor+fix shift for synonymous AA changes** This introduces the changes from 2190296 to BlockSubstitutionAnnotationBuilder and InsertionAnnotationBuilder as well. Additionally, this removes now-obsolete code and moves the call to 'trim' the AA change before shifting to AminoAcidChangeNormalizer itself, and also adds and fixes tests. Co-authored-by: Roland Ewald <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Summary
It seems that
AminoAcidChangeNormalizer#normalizeDeletion
currently does not correctly right-shift protein sequences, introducing erroneous protein change HGVS annotations. Other kinds of variants (besides deletions) may be affected as well.Problem
The current code addresses the problems with deletions introducing frameshifts by avoiding right-shifting in those cases altogether, by checking
change.getAlt().length() != 0
.This may be wrong if the frame-shifted sequence introduces the same amino acid(s) that would have followed in the wild type as well. It also does not address the following corner case:
changeBeginPos.getFrameshift() == 0
in the calling code (
DeletionAnnotationBuilder
), which means thatinsAA
will be empty and thus normalization will be attempted byAminoAcidChangeNormalizer#normalizeDeletion
varAASeq
(the AA sequence containing the variant) will differ significantly from fromwtAASeq
(the wildtype AA sequence) at the position of the deletion (and downstream)AminoAcidChangeNormalizer#normalizeDeletion
only operates on the wildtype AA sequence (wtAASeq
), this goes unnoticed, and in case the sequence of deleted amino acids starts with the same amino acid(s) as those after the deletion in the wild type, the variant will be erroneously right-shifted on that wild-type sequence (the right-shifting currently does not considervarAASeq
at all)Example (for corner case, see above)
Consider the following deletion (from a unit test in the PR attempting to fix this issue).
Before
AminoAcidChangeNormalizer#normalizeDeletion
gets called,AminoAcidChange
will be'EKPF'>''
. However, due to the above problem this will be right-shifted toKPFE>''
(as the firstE
in the wildtype sequence gets compared to the secondE
in the wildtype in the above example; afterwardsK
will be compared toC
, which is wrong, so thatshift = 1
).This approach leads to a wrong position and a wrong AA displayed in the variant's protein change HGVS annotation. In this example, the protein change should be
p.(Glu316Leufs*25)
(in one-letter codes,p.(E316Lfs*25)
) but is erroneously changed top.(Lys317Serfs*24)
(p.(K317Sfs*24)
).Solution: more general approach for right-shifting
In contrast to the current right-shifting code for nucleotides, it seems a better approach here would be to simply compare both wild-type and variant AA sequence, as they already have been computed by the
AminoAcidChangeNormalizer
anyhow. This approach has been implemented and tested in the accompanying PR (#499).Now, right-shifting also happens in case a deletion does not coincide with a codon start (i.e.
insAA
is non-empty, as checked by the previous code to avoid normalization). This is necessary because the resulting sequence in the variant AA sequence (varAASeq
) could still continue with the same amino acids as the wild type sequence would have, even for deletions that do not coincide with the beginning of a codon. For example:This deletion would not have been normalized before, although the deletion results in the varant AA sequence continuing with the same AA (
G
) and thus the variant needs to be right-shifted by one.Related issues
InsertionAnnotationBuilder
relies for protein sequence normalization may need to be changed correspondingly (happy to help here).The text was updated successfully, but these errors were encountered: