Application of Mondo mapping file is turning MONDO IDs into OMIM instead of the reverse #721

Open
kevinschaper opened this issue Sep 21, 2022 · 9 comments

@kevinschaper
Member

This is likely a fix that will need to happen to cat_merge, which applies mapping, or to the gene_mapping repo to change the order there, but it seems reasonable to start with an issue in this repo as the umbrella.

We're generating gene_mappings with the subject as the ID that we want to convert from, and the object as the ID that we want to convert to. So we do:

NCBI:123 skos:exactMatch HGNC:456

And the cat_merge rewiring code converts any subject or object with an NCBI:123 ID to HGNC:456.
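
The rewiring described above can be sketched as follows. This is a minimal illustration only, assuming mappings are (subject, predicate, object) triples and edges are plain dicts; `build_replacements` and `rewire` are hypothetical names rather than the cat_merge API, and the edge predicate and `NCBI:999` are made-up example values.

```python
def build_replacements(mappings):
    """Map each mapping subject (the convert-from ID) to its object (the convert-to ID)."""
    return {
        subject: obj
        for subject, predicate, obj in mappings
        if predicate == "skos:exactMatch"
    }


def rewire(edge, replacements):
    """Return a copy of the edge with mapped subject/object IDs replaced.

    The pre-mapping ID is kept in original_subject / original_object.
    """
    rewired = dict(edge)
    for column in ("subject", "object"):
        original = edge[column]
        if original in replacements:
            rewired[f"original_{column}"] = original
            rewired[column] = replacements[original]
    return rewired


# Mapping row from the example above; the edge itself is hypothetical.
mappings = [("NCBI:123", "skos:exactMatch", "HGNC:456")]
edge = {
    "subject": "NCBI:123",
    "predicate": "biolink:interacts_with",  # hypothetical predicate
    "object": "NCBI:999",                   # hypothetical, unmapped ID
}
rewired = rewire(edge, build_replacements(mappings))
```

Only the subject is rewritten here, because `NCBI:999` has no mapping row.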

The Mondo sssom file goes in the other direction, the subject is always a mondo ID, and the object is the ID that we're converting from.

I'm guessing the most sensible thing is to convert our gene mapping files to match the mondo mapping file, and then update cat_merge so that it will replace the mapping file object with the mapping file subject.

@matentzn What do you think? Swapping the order feels convenient, but my gut tells me that the subject/object order in the SSSOM file shouldn't matter and I should instead tell my mapping apply-er what prefixes are allowed.

@matentzn
Member

I think this merits some more careful thinking! One idea is to define a flip method which respects the semantics of the mapping relations and knows about the rest of the sssom metadata as well.

I think for now (so you don't have to wait for tooling), it is probably a good idea to be fault-tolerant and build some simple aspects of flipping into your own code base (scan object and subject properties for prefixes, and then flip skos:exactMatch). Do you make use of skos:broadMatch mappings during merging?
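
One way to read the suggestion above is sketched below: look at the prefixes on both sides of a mapping and orient it so the object is the ID to convert to, flipping only skos:exactMatch. This is an assumption-laden sketch, not existing SSSOM tooling; `orient`, `curie_prefix`, and `FLIPPABLE` are hypothetical names, and other predicates such as skos:broadMatch are deliberately left unflipped because reversing them would change their meaning.

```python
# Only exactMatch is treated as safe to flip in this sketch.
FLIPPABLE = {"skos:exactMatch"}


def curie_prefix(curie):
    """Return the prefix part of a CURIE, e.g. 'MONDO' for 'MONDO:0012050'."""
    return curie.split(":", 1)[0]


def orient(mapping, target_prefixes):
    """Return (from_id, predicate, to_id) with to_id carrying a target prefix.

    Returns None when the mapping cannot be oriented safely: both or
    neither side is a target, or a flip is needed for a predicate that
    is not in FLIPPABLE.
    """
    subject, predicate, obj = mapping
    subject_is_target = curie_prefix(subject) in target_prefixes
    object_is_target = curie_prefix(obj) in target_prefixes
    if subject_is_target == object_is_target:
        return None
    if object_is_target:
        return (subject, predicate, obj)
    if predicate in FLIPPABLE:
        return (obj, predicate, subject)
    return None


targets = {"MONDO", "HGNC"}
gene_row = orient(("NCBI:123", "skos:exactMatch", "HGNC:456"), targets)
mondo_row = orient(("MONDO:0012050", "skos:exactMatch", "OMIM:608520"), targets)
```

With this orientation step in front of the rewiring code, the two mapping files could keep their current subject/object order.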

@kevinschaper
Member Author

I closed other issues as duplicates of this one, and I'll continue from here. I thought it might be useful to bring in this table from monarch-initiative/monarch-ingest#360

| provided_by | category | subject_category | subject_namespace | subject_label | subject | original_subject | predicate | original_object | object | object_label | object_namespace | object_category |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| omim_gene_to_disease_edges | biolink:GeneToDiseaseAssociation | biolink:Disease | MONDO | major depressive disorder 1 | MONDO:0012050 | OMIM:608520 | biolink:gene_associated_with_condition | OMIM:608516 | MONDO:0002009 | major depressive disorder | MONDO | biolink:Disease |
| omim_gene_to_disease_edges | biolink:GeneToDiseaseAssociation | biolink:Disease | MONDO | major depressive disorder 2 | MONDO:0012100 | OMIM:608691 | biolink:gene_associated_with_condition | OMIM:608516 | MONDO:0002009 | major depressive disorder | MONDO | biolink:Disease |
| omim_gene_to_disease_edges | biolink:GeneToDiseaseAssociation | biolink:Disease | MONDO | schizophrenia 1 | MONDO:0008414 | OMIM:181510 | biolink:risk_affected_by | OMIM:181500 | MONDO:0005090 | schizophrenia | MONDO | biolink:Disease |

@kevinschaper
Member Author

kevinschaper commented Jan 26, 2023

Also interesting: the reverse is happening too. There are G2D associations that are MONDO to MONDO:

mdb "select distinct subject_namespace, category, object_namespace from denormalized_edges where category like '%Disease%'"
subject_namespace category object_namespace
HGNC biolink:GeneToDiseaseAssociation HGNC
MONDO biolink:DiseaseToPhenotypicFeatureAssociation HP
HGNC biolink:DiseaseOrPhenotypicFeatureToGeneticInheritanceAssociation HP
MONDO biolink:DiseaseOrPhenotypicFeatureToGeneticInheritanceAssociation HP
HGNC biolink:DiseaseToPhenotypicFeatureAssociation HP
HGNC biolink:GeneToDiseaseAssociation MONDO
MONDO biolink:GeneToDiseaseAssociation MONDO

@matentzn
Member

matentzn commented Jan 27, 2023

Where are these g2ds coming from? The OMIM ids at least at a quick glance are diseases.

Check this:

https://omim.org/geneMap/12/657?start=-3&limit=10&highlight=657


It seems the gene column even links to a disease identifier. @sabrinatoro help :P

@kevinschaper
Member Author

kevinschaper commented Jan 27, 2023

🕵️

morbidmap has:

Major depressive disorder 1, 608516 (2)	MDD1	608520	12q22-q23.2

Which monarch-ingest's omim_gene_to_disease turns into:

uuid:af11eb98-9a48-11ed-bf1e-791522c88a3d	OMIM:608520	biolink:gene_associated_with_condition	OMIM:608516	biolink:GeneToDiseaseAssociation	infores:monarchinitiative	ECO:0000177	infores:omim

608520 isn't in the hgnc file (which has an OMIM column), and it's not in the monarch-gene-mapping SSSOM file.

It is in mondo.sssom.tsv, but then, it's a disease, it should be:

MONDO:0012050	major depressive disorder 1	skos:exactMatch	OMIM:608520	semapv:UnspecifiedMatching

My assumption has been that the bug I'm solving is that I shouldn't be using mondo's SSSOM to replace IDs in the subject column of a G2D association...but maybe the problem is that the OMIM ingest shouldn't even be making this association?

If the mondo mapping weren't applied here, it would stay an OMIM ID and then get dropped as a dangling edge, which I think would be ok - though - handling that more explicitly upstream in the Koza transform is probably a cleaner solution.

@matentzn
Member

I don't know. My feeling is that this is a high-priority issue that should be discussed in the data call next week. Can you add it to the agenda? It seems so weird that I can't just universally apply an SSSOM file to a KG rewiring process. I am sure @cmungall would also be against contextualising the applicability of SSSOM files on certain ingestibles.

@kevinschaper
Member Author

Detective work on one of the HGNC to HGNC g2d edges.

original_subject subject predicate object original_object
NCBIGene:605 HGNC:1004 biolink:gene_associated_with_condition HGNC:1004 OMIM:601406

starts in morbidmap as:

B-cell non-Hodgkin lymphoma, high-grade (3)	BCL7A, BCL7	601406	12q24.31

This is one of the few cases in monarch ingest where we're still using a Koza map, for mim2gene:

    elif no_disease_id_match is not None:
        # this is a case where the disorder is
        # a blended gene/phenotype
        # we look up the NCBIGene feature and make the association
        disorder_label, association_key = no_disease_id_match.groups()
        # make what's in the gene column the disease
        disorder_id = 'OMIM:' + gene_num
        ncbi_id = ''
        if gene_num in omim_to_gene:
            ncbi_id = omim_to_gene[gene_num]['Entrez Gene ID (NCBI)']
        if ncbi_id == '':
            koza_app.next_row()
        gene_id = 'NCBIGene:' + ncbi_id

The row in mim2gene is:

601406	gene	605	BCL7A	ENSG00000110987

monarch-gene-mapping uses the hgnc file to produce:

HGNC:1004	skos:exactMatch	NCBIGene:605	semapv:UnspecifiedMatching
HGNC:1004	skos:exactMatch	OMIM:601406	semapv:UnspecifiedMatching
HGNC:1004	skos:closeMatch	UniProtKB:Q4VC05	semapv:UnspecifiedMatching
HGNC:1004	skos:exactMatch	ENSEMBL:ENSG00000110987	semapv:UnspecifiedMatching

Which ends up picking up both sides of the association, and OMIM:601406 doesn't show up in mondo.sssom.tsv
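
The collapse described above can be reproduced with the rows quoted in this comment. The sketch below is illustrative only: it assumes the rewiring replaces a mapping object with the mapping subject (the orientation of the monarch-gene-mapping rows shown), and the dict-based edge and variable names are hypothetical, not the cat_merge implementation.

```python
# Rows copied from the monarch-gene-mapping output quoted above.
gene_mappings = [
    ("HGNC:1004", "skos:exactMatch", "NCBIGene:605"),
    ("HGNC:1004", "skos:exactMatch", "OMIM:601406"),
]

# Replace the mapping object with the mapping subject.
replacements = {
    obj: subject
    for subject, predicate, obj in gene_mappings
    if predicate == "skos:exactMatch"
}

# The edge as produced by the OMIM ingest, before mapping.
edge = {
    "subject": "NCBIGene:605",
    "predicate": "biolink:gene_associated_with_condition",
    "object": "OMIM:601406",
}

rewired = {
    **edge,
    "subject": replacements.get(edge["subject"], edge["subject"]),
    "object": replacements.get(edge["object"], edge["object"]),
}

# A gene-to-disease edge whose two ends are the same node is a sign
# that both IDs were treated as the same gene.
collapsed = rewired["subject"] == rewired["object"]
```

A check like `collapsed` could be used to flag or drop such edges during merging, though that would treat the symptom rather than the upstream typing question.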

So, again in this case, is the problem that this isn't actually a g2d association in morbidmap?

@kevinschaper
Member Author

kevinschaper commented Jan 30, 2023

Dipper's parsing of mimTitles to get type information seems incredibly important, but is missing from the monarch-ingest OMIM ingest.

https://github.com/monarch-initiative/dipper/blob/bf0a86c4472a2406919f96eee10afe156fe62951/dipper/sources/OMIMSource.py#L126

        populate omim_type map from an omim number to an ontology term
        the ontology terms' labels are:

        -  'gene'
            Asterisk (*)  Gene

        -   'has_affected_feature'
            Plus (+)  Gene and phenotype, combined

        -   'Phenotype'
            Number Sign (#)  Phenotype, molecular basis known

        -   'heritable_phenotypic_marker'
            Percent (%)  Phenotype or locus, molecular basis unknown

        -   'obsolete'
            Caret (^)  Entry has been removed from the database
            or moved to another entry.
                further processed into Removed or Moved & Split
                (`omim_replaced`  populated where Moved or Split )

        -   'Suspected'
            NULL (<null>)  Other, mainly phenotypes with suspected mendelian basis

        Populates dict of omim_number to ontology_curie
        Populates dict of omim_number to list of replacements

        note:
            If an omim id is neither in omim_replaced nor omim_types
            Then it was removed.
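
A rough sketch of what such a type map could look like in the OMIM ingest is given below. It is not Dipper's code: `parse_omim_types` and `PREFIX_TO_TYPE` are hypothetical names, the type labels are taken from the docstring above, and the file layout (tab-separated, prefix spelled out in the first column, MIM number in the second, `#` comment lines) is an assumption about mimTitles. The sample rows are invented placeholders, not real OMIM entries.

```python
import csv
import io

# Labels taken from the Dipper docstring above; the spelled-out prefix
# names in the first column are an assumption about the file format.
PREFIX_TO_TYPE = {
    "Asterisk": "gene",
    "Plus": "has_affected_feature",
    "Number Sign": "Phenotype",
    "Percent": "heritable_phenotypic_marker",
    "Caret": "obsolete",
    "NULL": "Suspected",
}


def parse_omim_types(handle):
    """Build a dict of MIM number -> type label from a mimTitles-like file."""
    types = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, comment lines, and rows too short to hold an ID.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        prefix_name, mim_number = row[0], row[1]
        if prefix_name in PREFIX_TO_TYPE:
            types[mim_number] = PREFIX_TO_TYPE[prefix_name]
    return types


# Invented placeholder rows, not real OMIM entries.
sample = io.StringIO(
    "# hypothetical header comment\n"
    "Asterisk\t000001\tHYPOTHETICAL GENE; HYP1\n"
    "Number Sign\t000002\tHYPOTHETICAL DISORDER; HYPD\n"
)
omim_types = parse_omim_types(sample)
```

With a map like this, the gene-to-disease transform could check that the ID in the gene column is actually typed as a gene before emitting an association.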

@monicacecilia
Contributor

Next up:

  • QC: to answer whether any IDs map to both an HGNC and a Mondo ID. Downstream of that, review whether we change how we apply mappings in Monarch.
  • Good issue for @AO33 to tackle.
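
The QC question in the first bullet could be approached along these lines. This is a sketch under stated assumptions: both mapping files are taken to have the target ID (HGNC or MONDO) as the subject, as in the rows quoted earlier in this thread, and `ambiguous_ids` is a hypothetical helper. The `OMIM:000000`, `HGNC:0`, and `MONDO:0000000` rows are invented purely to show a positive hit.

```python
from collections import defaultdict


def ambiguous_ids(mappings, prefixes=("HGNC", "MONDO")):
    """Return object IDs that exactMatch subjects from every given prefix."""
    subject_prefixes = defaultdict(set)
    for subject, predicate, obj in mappings:
        if predicate == "skos:exactMatch":
            subject_prefixes[obj].add(subject.split(":", 1)[0])
    required = set(prefixes)
    return sorted(
        obj for obj, seen in subject_prefixes.items() if required <= seen
    )


mappings = [
    # Rows quoted earlier in this thread.
    ("HGNC:1004", "skos:exactMatch", "OMIM:601406"),
    ("MONDO:0012050", "skos:exactMatch", "OMIM:608520"),
    # Invented rows, included only to demonstrate a hit.
    ("HGNC:0", "skos:exactMatch", "OMIM:000000"),
    ("MONDO:0000000", "skos:exactMatch", "OMIM:000000"),
]
hits = ambiguous_ids(mappings)
```

Running this over the concatenated gene mapping and Mondo SSSOM rows would list every ID that the current rewiring could send to either a gene or a disease.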

@monicacecilia monicacecilia transferred this issue from monarch-initiative/monarch-ingest May 29, 2024
@monicacecilia monicacecilia assigned AO33 and unassigned kevinschaper May 29, 2024