Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coordinate is out of bounds error when roundtripping GRCh38 variants via g_to_c + c_to_g and relevant_transcripts #717

Closed
mihaitodor opened this issue Jan 5, 2024 · 5 comments
Labels
question Further information is requested won't fix This will not be worked on

Comments

@mihaitodor
Copy link

mihaitodor commented Jan 5, 2024

Describe the bug

I'm parsing a g. variant, then I'm fetching some relevant transcripts for it via the GRCh38 AssemblyMapper, then I'm projecting it to any of the NM relevant transcript and then I'm projecting that back to a g. variant. Naively, what I think should happen is that it should be able to round-trip and print the input variant, but right now it's just throwing this error. Note that it does work as expected if you use, for example variant = 'NC_000011.10:g.8263343T>C'. Also, it works if I use GRCh37.

To Reproduce

Steps to reproduce the behavior:

import hgvs.assemblymapper
import hgvs.dataproviders.uta
import hgvs.parser

database_schema = 'uta_20210129b'
data_provider = hgvs.dataproviders.uta.connect(
    db_url=f"postgresql://anonymous:[email protected]/uta/{database_schema}")
b38_assembly_mapper = hgvs.assemblymapper.AssemblyMapper(
    data_provider, assembly_name='GRCh38', alt_aln_method='splign')
parser = hgvs.parser.Parser()

variant = 'NC_000001.10:g.145592073A>T'

parsed_variant = parser.parse_hgvs_variant(variant)

transcripts = b38_assembly_mapper.relevant_transcripts(parsed_variant)
print(transcripts)

relevant_transcript = 'NM_006468.7'
# relevant_transcript = 'NM_001303456.1'

var_c = b38_assembly_mapper.g_to_c(parsed_variant, relevant_transcript)

b38_projected_variant = b38_assembly_mapper.c_to_g(var_c)

print(b38_projected_variant)

Running the above code fails with the following error:

Traceback (most recent call last):
  File "/Users/mihaitodor/Projects/genomex/spdi/./test.py", line 25, in <module>
    b38_projected_variant = b38_assembly_mapper.c_to_g(var_c)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/hgvs/assemblymapper.py", line 115, in c_to_g
    var_out = super(AssemblyMapper, self).c_to_g(var_c, alt_ac, alt_aln_method=self.alt_aln_method)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/hgvs/variantmapper.py", line 283, in c_to_g
    pos_g = mapper.c_to_g(var_c.posedit.pos)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/hgvs/alignmentmapper.py", line 277, in c_to_g
    return self.n_to_g(self.c_to_n(c_interval), strict_bounds=strict_bounds)
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/hgvs/alignmentmapper.py", line 263, in c_to_n
    n_start = pos_c_to_n(c_interval.start)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/hgvs/alignmentmapper.py", line 260, in pos_c_to_n
    raise HGVSInvalidIntervalError(f"c.{pos} coordinate is out of bounds")
hgvs.exceptions.HGVSInvalidIntervalError: c.*578 coordinate is out of bounds

Expected behavior

The code should print NC_000001.10:g.145592073A>T at the end.

Additional context

@reece mentioned on the biocommons Slack that this looks like a bug.

@andreasprlic
Copy link
Member

andreasprlic commented Jan 6, 2024

Looking into this, it turns out that the hgvs_c string here is NM_006468.7:c.*578=, so the UTR region of this transcript (in which that variant lies) does have a different sequence than the reference genome. We call that a "transcript-ref-disagree" position.

Somehow in assembly 38 this position seems to be outside of transcript coding space, therefore there is a coordinate is out of bounds error message. To work around here, you could set hgvs.global_config.mapping.strict_bounds = False. Doing so results in NC_000001.11:g.145842814_145842815=

Looking into the underlying alignment in UTA, The last exon (that is fully coding for the UTR) has alignment issues in both assemblies. in assembly 37 this transcript has one exon more but the alignment is 183=1X377=3I1027=. Assembly 38 is one exon shorter and the second to last has an alignment of 476=1588I. I believe that's why that specific position comes out as "out of bounds".

I would conclude from that that neither assembly is a good match for the terminal UTR region here, and we should prob not assume that we can map variants in that region across assemblies.

I don't think there's a hgvs bug here, this is just the nature of the reference assemblies...

mihaitodor added a commit to FHIR/genomics-operations that referenced this issue Jan 6, 2024
This is a workaround for some liftover failures: biocommons/hgvs#717
@mihaitodor
Copy link
Author

Thank you for looking into this @andreasprlic and for providing such a detailed explanation!

Given this issue, do you think it would make sense to bubble up strict_bounds as an optional parameter for c_to_g in AssemblyMapper and VariantMapper? That way, users could disable this validation selectively. In my case, I'm OK to disable it globally and that seems to work fine.

@gostachowiak
Copy link

This issue was auto-closed due to inactivity, but I think it's related
#608

@andreasprlic
Copy link
Member

andreasprlic commented Jan 8, 2024

@mihaitodor if your goal is identifying if the location of a variant in another genome assembly, I am concerned about trying to do that in regions where the assemblies have changed a lot. Just because you can compute some coordinates does not mean these are biologically "the same". I really think in the example you provided above, the lifted over coordinates should not be trusted. As such I would not encourage you to map across assemblies when strict_bounds=False are necessary. I also would recommend to think about a QC approach to make sure that any lifted over variant can be considered to be equivalent in the context of both assemblies, otherwise treat them as distinct variants.

@mihaitodor
Copy link
Author

Thank you @andreasprlic! Indeed, in such cases it's probably better to error out. Initially, we had some code based on pyliftover which I guess would have similar issues? I opened #711 after @reece mentioned that it would be handy to have some direct liftover support in hgvs.

The code I'm trying to add some features to is just a reference implementation for now, but indeed, we'll have to be stricter in production-ready implementations.

@jsstevenson jsstevenson added the bug Something isn't working label Jan 10, 2024
@andreasprlic andreasprlic added question Further information is requested won't fix This will not be worked on and removed bug Something isn't working labels Jan 19, 2024
mihaitodor added a commit to FHIR/genomics-operations that referenced this issue Apr 4, 2024
This is a workaround for some liftover failures: biocommons/hgvs#717
mihaitodor added a commit to FHIR/genomics-operations that referenced this issue Apr 4, 2024
This is a workaround for some liftover failures: biocommons/hgvs#717
mihaitodor added a commit to FHIR/genomics-operations that referenced this issue May 16, 2024
This is a workaround for some liftover failures: biocommons/hgvs#717
mihaitodor added a commit to FHIR/genomics-operations that referenced this issue Jan 21, 2025
This is a workaround for some liftover failures: biocommons/hgvs#717
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested won't fix This will not be worked on
Projects
None yet
Development

No branches or pull requests

4 participants