Read SSSOM #111

cbizon · 2022-01-19T15:04:40Z

Just as we want to have Babel write SSSOM, NN will need to read it.

cbizon · 2022-06-02T17:23:00Z

https://github.com/mapping-commons/sssom

We need to figure out a few things if we go to sssom:

how to preserve the ordering of the identifiers list
how to include type and information content

cbizon · 2022-06-02T17:24:22Z

@matentzn tells me that these problems can be handled with off-the-shelf SSSOM.

matentzn · 2022-06-02T17:47:10Z

{"type": "biolink:Disease", "ic": "100", "identifiers": [{"i": "MONDO:0018670", "l": "symptomatic form of fragile X syndrome in female carrier"}, {"i": "ORPHANET:449291", "l": "Symptomatic form of fragile X syndrome in female carrier"}, {"i": "UMLS:CN237736"}]}

Assuming MONDO:0018670 is the clique leader (sssom 0.9.0, not sssom 1.0), a sssom file would look something like this:

subject_id	subject_label	subject_category	predicate_id	object_id	object_label	object_category	match_type	other
MONDO:0018670	symptomatic form of fragile X syndrome in female carrier	biolink:Disease	skos:exactMatch	ORPHANET:449291	Symptomatic form of fragile X syndrome in female carrier	biolink:Disease	HumanCurated	{ subject_information_content: 100 }
MONDO:0018670	symptomatic form of fragile X syndrome in female carrier	biolink:Disease	skos:exactMatch	UMLS:CN237736		biolink:Disease	HumanCurated	{ subject_information_content: 100 }

There are some features for natively supporting semantic similarity measures, see https://mapping-commons.github.io/sssom/Mapping/, but I don't think subject_information_content would qualify for that.
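For reference, the conversion from the compendium JSON line above to rows like these could be sketched in Python roughly as follows. This is only a sketch under some assumptions: the first entry in "identifiers" is taken to be the clique leader, the function names clique_to_rows and rows_to_tsv are made up for illustration, and the "other" value simply copies the ad hoc string used in the example rows rather than any official SSSOM encoding.

```python
import csv
import io

# Column order copied from the example header row above.
COLUMNS = [
    "subject_id", "subject_label", "subject_category", "predicate_id",
    "object_id", "object_label", "object_category", "match_type", "other",
]


def clique_to_rows(clique):
    """Build one mapping row from the clique leader to each other member.

    Assumption: the first identifier in the list is the clique leader.
    """
    leader, *others = clique["identifiers"]
    # Same ad hoc string as in the example rows; not an official SSSOM encoding.
    other = f"{{ subject_information_content: {clique['ic']} }}"
    return [
        {
            "subject_id": leader["i"],
            "subject_label": leader.get("l", ""),
            "subject_category": clique["type"],
            "predicate_id": "skos:exactMatch",
            "object_id": member["i"],
            "object_label": member.get("l", ""),  # labels can be missing
            "object_category": clique["type"],
            "match_type": "HumanCurated",
            "other": other,
        }
        for member in others
    ]


def rows_to_tsv(rows):
    """Serialize rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


example = {
    "type": "biolink:Disease",
    "ic": "100",
    "identifiers": [
        {"i": "MONDO:0018670", "l": "symptomatic form of fragile X syndrome in female carrier"},
        {"i": "ORPHANET:449291", "l": "Symptomatic form of fragile X syndrome in female carrier"},
        {"i": "UMLS:CN237736"},
    ],
}
tsv_text = rows_to_tsv(clique_to_rows(example))
```

Note that a clique with a single identifier produces no rows at all with this approach, so singletons would need separate handling.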

cbizon · 2022-06-02T17:54:57Z

Thanks! Is it required to repeat the subject_labels or categories etc when they are repeated?

If we are using the ordering of the rows as information, are we abusing the format?

matentzn · 2022-06-02T18:38:29Z

I would keep the information redundant with the labels, but nothing in sssom requires you to. I like that in general so that I can more easily combine different mapping sets, merge them, etc.

I think expecting the row order to mean something is not very reliable.

If you wanted to be 100% reliable you could of course export all cliques as separate sssom files. This is what I think Chris does. But it would result in 5000 files. It's an interesting use case. Maybe if you could create an identifier for each clique, you could put it into the "other" column. Sorry maybe sssom is not ideal here, but we could consider extensions to the format to cover this use case (named groups for mappings).

cbizon · 2022-06-02T19:13:02Z

I suppose we could put a clique id of some sort in the other column. And perhaps an index to define the order if we don't want to rely on the row order...
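On the reading side, a clique id plus an explicit index would let a loader rebuild the ordered identifier list without trusting row order. A minimal Python sketch, assuming (hypothetically) that the "other" column holds a small JSON object with clique_id and clique_index keys; neither key is part of SSSOM, and read_cliques is a made-up name:

```python
import csv
import io
import json
from collections import defaultdict


def read_cliques(tsv_text):
    """Group mapping rows by clique_id and order each group by clique_index."""
    # QUOTE_NONE keeps the double quotes inside the JSON cell literal.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    grouped = defaultdict(list)
    for row in reader:
        extra = json.loads(row["other"])  # hypothetical encoding of "other"
        grouped[extra["clique_id"]].append((extra["clique_index"], row))

    cliques = {}
    for clique_id, members in grouped.items():
        members.sort(key=lambda pair: pair[0])  # explicit index, not row order
        leader = members[0][1]["subject_id"]
        cliques[clique_id] = [leader] + [row["object_id"] for _, row in members]
    return cliques


# Rows deliberately listed out of order to show that the index wins.
SAMPLE = "\n".join([
    "subject_id\tpredicate_id\tobject_id\tother",
    'MONDO:0018670\tskos:exactMatch\tUMLS:CN237736\t{"clique_id": "c1", "clique_index": 2}',
    'MONDO:0018670\tskos:exactMatch\tORPHANET:449291\t{"clique_id": "c1", "clique_index": 1}',
])

ordered = read_cliques(SAMPLE)
```

Here the leader is taken from subject_id and the remaining members are ordered by clique_index, so shuffling or merging rows does not change the reconstructed clique.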

cmungall · 2022-06-03T16:05:22Z

The goal here is to have a format for storage not for sending back to clients?

In that case, is the ordering a property of the mappings themselves, or a function that NodeNormalizer applies after the fact (ie a priority list of prefixes from biolink)? If it's a property of the mappings themselves maybe there is a more direct way to express this?

Same with IC value?

gaurav · 2022-06-10T19:30:44Z

I've written a program to convert some of the files in the Babel compendia into SSSOM so we can look at them in my Dropbox. These files appear to pass validation on sssom-py apart from missing CURIE maps. If everybody's happy with these files, I can run my program on all the Babel compendia (which will probably take 0.5-1 days to run).

Some thoughts and questions:

I don't think we have a definitive list of CURIE prefixes anywhere in Babel (the closest thing I could find was the prefixes file at https://github.com/TranslatorSRI/Babel/blob/master/src/prefixes.py). To generate CURIE maps for each of the compendia files, I'm planning to (1) hard-code the list of prefixes that are currently used in those files, and (2) make the code generate a warning if it sees a CURIE that doesn't use one of the predefined prefixes.
We have a large number of cliques that only consist of a single individual (e.g. CHEBI:61535 "poly(1,4-phenylene oxide) polymer" from ChemicalMixture.sssom.tsv), which we would still like to load into NodeNorm so that it can be returned as the preferred identifier. I'm currently modeling these by saying this identifier is an exactMatch to itself. Is there a more elegant way of modeling this?
I'm not sure if we need a separate clique ID -- wouldn't the clique leader's ID be unique within a particular compendia file? In this run, I made up a clique ID in the format ${compendium_filename}#${line_number_starting_from_zero}.
Is there any benefit to putting the synonym information into the SSSOM files as well? I don't think so, and only used the information from the compendium files for these files.
I used match_type because the master branch of sssom-py requires that, but once that's updated to the latest SSSOM version, I'll change that to a mapping_justification of semapv:MappingChaining ("A matching process based on the traversing of multiple mappings.") since I think that best captures how Babel is built.
I didn't fill in any of the optional metadata fields (e.g. mapping_set_id, mapping_set_description, mapping_set_version; see foodie-inc-2022-05-01.sssom.tsv as an example), but I can add those easily if needed.
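To make the singleton and clique ID points concrete, here is a rough Python sketch of the modeling described above. It is not the actual conversion program. It assumes each compendium line is a JSON object with an "identifiers" list whose first entry is the clique leader, and the function name and the input filename in the usage lines are hypothetical.

```python
import json


def compendium_to_mappings(filename, lines):
    """Emit leader -> member mappings, with a self-match for singleton cliques."""
    mappings = []
    for line_number, line in enumerate(lines):  # line numbers start from zero
        clique = json.loads(line)
        ids = [entry["i"] for entry in clique["identifiers"]]
        leader = ids[0]  # assumption: first identifier is the clique leader
        # A clique of one is modeled as an exactMatch from the ID to itself.
        objects = ids[1:] or [leader]
        clique_id = f"{filename}#{line_number}"
        for object_id in objects:
            mappings.append({
                "clique_id": clique_id,
                "subject_id": leader,
                "predicate_id": "skos:exactMatch",
                "object_id": object_id,
            })
    return mappings


sample_lines = [
    '{"identifiers": [{"i": "CHEBI:61535", "l": "poly(1,4-phenylene oxide) polymer"}]}',
    '{"identifiers": [{"i": "MONDO:0018670"}, {"i": "ORPHANET:449291"}]}',
]
# "ChemicalMixture.txt" is a placeholder name, not a real Babel path.
mappings = compendium_to_mappings("ChemicalMixture.txt", sample_lines)
```

Since the leader's ID is already the subject_id of every row in its clique, a separate clique_id only adds information if leaders are not unique within a file.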

cbizon · 2022-06-13T18:10:19Z

> I don't think we have a definitive list of CURIE prefixes anywhere in Babel (the closest thing I could find was the prefixes file at https://github.com/TranslatorSRI/Babel/blob/master/src/prefixes.py). To generate CURIE maps for each of the compendia file, I'm planning to (1) hard-code the list of prefixes that are currently used in those files, and (2) make the code generate a warning if it sees a CURIE that doesn't use one of the predefined prefixes.

There's a check in babel against the biolink prefixes for each type. So it will potentially write out anything in the biolink yaml for each type, and should not write out anything that isn't in that prefix list.
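A per-type prefix check of this kind could also drive the warning on the SSSOM side. The Python sketch below is illustrative only: the ALLOWED_PREFIXES table is a made-up stand-in for whatever the Biolink YAML actually lists for each type, and check_prefixes is a hypothetical helper, not Babel's code.

```python
import warnings

# Illustrative stand-in only; the real lists would come from the Biolink model.
ALLOWED_PREFIXES = {
    "biolink:Disease": {"MONDO", "ORPHANET", "UMLS"},
}


def check_prefixes(category, curies, allowed=ALLOWED_PREFIXES):
    """Warn about, and return, CURIEs whose prefix is not allowed for the type."""
    permitted = allowed.get(category, set())
    unexpected = []
    for curie in curies:
        prefix = curie.split(":", 1)[0]
        if prefix not in permitted:
            warnings.warn(f"Unexpected prefix {prefix!r} for {category}: {curie}")
            unexpected.append(curie)
    return unexpected


# Usage; EXAMPLE:123 is a made-up CURIE with a prefix missing from the table.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = check_prefixes(
        "biolink:Disease",
        ["MONDO:0018670", "UMLS:CN237736", "EXAMPLE:123"],
    )
```

The set of prefixes that pass this check for a file is also exactly what its CURIE map would need to cover.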

cbizon assigned gaurav Jan 19, 2022

cbizon assigned jdr0887 and unassigned gaurav Jun 2, 2022

gaurav mentioned this issue Nov 27, 2022

SSSOM output TranslatorSRI/Babel#33

Open

gaurav mentioned this issue Dec 8, 2022

Add an adapter for Translator SRI NodeNormalizer endpoint INCATools/ontology-access-kit#397

Closed

gaurav added this to the NodeNorm December 2023 milestone Jun 8, 2023

gaurav modified the milestones: NodeNorm January 2024, NodeNorm November 2023 Sep 27, 2023

gaurav modified the milestones: NodeNorm November 2023, NodeNorm January 2024 Dec 3, 2023

gaurav modified the milestones: NodeNorm January 2024, NodeNorm July 2024 May 17, 2024

gaurav mentioned this issue Aug 4, 2024

Adding SSSOM export TranslatorSRI/Babel#328

Draft

gaurav modified the milestones: NodeNorm November 2024, NodeNorm January 2025 Oct 21, 2024
