Read SSSOM #111

Open
cbizon opened this issue Jan 19, 2022 · 9 comments
@cbizon
Contributor

cbizon commented Jan 19, 2022

Just as we want to have Babel write SSSOM, NN will need to read it.

@cbizon cbizon assigned jdr0887 and unassigned gaurav Jun 2, 2022
@cbizon
Contributor Author

cbizon commented Jun 2, 2022

https://github.com/mapping-commons/sssom

We need to figure out a few things if we go to sssom:

  1. how to preserve the ordering of the identifiers list
  2. how to include type and information content

@cbizon
Contributor Author

cbizon commented Jun 2, 2022

@matentzn tells me that these problems can be handled with off the shelf sssom

@matentzn

matentzn commented Jun 2, 2022

{"type": "biolink:Disease", "ic": "100", "identifiers": [{"i": "MONDO:0018670", "l": "symptomatic form of fragile X syndrome in female carrier"}, {"i": "ORPHANET:449291", "l": "Symptomatic form of fragile X syndrome in female carrier"}, {"i": "UMLS:CN237736"}]}

Assuming MONDO:0018670 is the clique leader (and using sssom 0.9.0, not sssom 1.0), an SSSOM file would look something like this:

| subject_id | subject_label | subject_category | predicate_id | object_id | object_label | object_category | match_type | other |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MONDO:0018670 | symptomatic form of fragile X syndrome in female carrier | biolink:Disease | skos:exactMatch | ORPHANET:449291 | Symptomatic form of fragile X syndrome in female carrier | biolink:Disease | HumanCurated | { subject_information_content: 100 } |
| MONDO:0018670 | symptomatic form of fragile X syndrome in female carrier | biolink:Disease | skos:exactMatch | UMLS:CN237736 | | biolink:Disease | HumanCurated | { subject_information_content: 100 } |

There are some features for natively supporting semantic similarity measures, see https://mapping-commons.github.io/sssom/Mapping/, but I don't think subject_information_content would qualify for that.
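A minimal sketch of the conversion described above, for illustration only. It assumes the first entry in `identifiers` is the clique leader and that every mapping is written as `skos:exactMatch` with a `HumanCurated` match type, as in the example rows. The helper names `clique_to_rows` and `rows_to_tsv` are hypothetical and are not part of Babel or sssom-py.

```python
import json

# Column order taken from the example table in this comment (SSSOM 0.9.0 style).
COLUMNS = [
    "subject_id", "subject_label", "subject_category", "predicate_id",
    "object_id", "object_label", "object_category", "match_type", "other",
]

# The compendium line quoted at the top of this comment.
EXAMPLE_LINE = (
    '{"type": "biolink:Disease", "ic": "100", "identifiers": ['
    '{"i": "MONDO:0018670", "l": "symptomatic form of fragile X syndrome in female carrier"}, '
    '{"i": "ORPHANET:449291", "l": "Symptomatic form of fragile X syndrome in female carrier"}, '
    '{"i": "UMLS:CN237736"}]}'
)


def clique_to_rows(line):
    """Turn one compendium JSON line into SSSOM-style rows (hypothetical helper)."""
    clique = json.loads(line)
    # Assumption: the first identifier in the list is the clique leader.
    leader, *others = clique["identifiers"]
    rows = []
    for ident in others:
        rows.append({
            "subject_id": leader["i"],
            "subject_label": leader.get("l", ""),
            "subject_category": clique["type"],
            "predicate_id": "skos:exactMatch",
            "object_id": ident["i"],
            "object_label": ident.get("l", ""),  # labels are optional in the input
            "object_category": clique["type"],
            "match_type": "HumanCurated",
            "other": "{ subject_information_content: %s }" % clique["ic"],
        })
    return rows


def rows_to_tsv(rows):
    """Serialize rows as tab-separated text with a header line."""
    lines = ["\t".join(COLUMNS)]
    for row in rows:
        lines.append("\t".join(row[column] for column in COLUMNS))
    return "\n".join(lines)


if __name__ == "__main__":
    print(rows_to_tsv(clique_to_rows(EXAMPLE_LINE)))
```

Note that a clique with a single identifier produces no rows under this scheme, so singletons would need separate handling.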

@cbizon
Contributor Author

cbizon commented Jun 2, 2022

Thanks! Is it required to repeat the subject labels, categories, etc. on every row when they are the same?

If we are using the ordering of the rows as information, are we abusing the format?

@matentzn

matentzn commented Jun 2, 2022

I would keep the information redundant with the labels, but nothing in sssom requires you to. I like that in general because it makes it easier to combine different mapping sets, merge them, etc.

I think expecting the row order to mean something is not very reliable.

If you wanted to be 100% reliable you could of course export all cliques as separate sssom files. This is what I think Chris does. But it would result in 5000 files. It's an interesting use case. Maybe if you could create an identifier for each clique, you could put it into the "other" column. Sorry maybe sssom is not ideal here, but we could consider extensions to the format to cover this use case (named groups for mappings).

@cbizon
Contributor Author

cbizon commented Jun 2, 2022

I suppose we could put a clique id of some sort in the other column. And perhaps an index to define the order if we don't want to rely on the row order...
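A rough sketch of that idea: store a clique id and a position index in each row's `other` field, then rebuild the ordering without relying on row order. The `key=value|key=value` encoding of `other` and the helper names are assumptions made up for this illustration, not something SSSOM defines.

```python
def annotate(rows, clique_id):
    """Attach a clique id and a position index to each row's `other` field."""
    return [
        dict(row, other="clique_id=%s|clique_index=%d" % (clique_id, index))
        for index, row in enumerate(rows)
    ]


def parse_other(other):
    """Parse the assumed key=value|key=value encoding into a dict.

    Values containing "|" would break this simple encoding.
    """
    return dict(part.split("=", 1) for part in other.split("|"))


def restore_order(rows):
    """Group rows by clique id and sort each group by its stored index."""
    cliques = {}
    for row in rows:
        extra = parse_other(row["other"])
        member = (int(extra["clique_index"]), row["object_id"])
        cliques.setdefault(extra["clique_id"], []).append(member)
    return {
        clique_id: [object_id for _, object_id in sorted(members)]
        for clique_id, members in cliques.items()
    }


if __name__ == "__main__":
    rows = [
        {"object_id": "ORPHANET:449291", "other": ""},
        {"object_id": "UMLS:CN237736", "other": ""},
    ]
    annotated = annotate(rows, "clique-1")
    # Reverse the rows to show that the order survives reshuffling.
    print(restore_order(list(reversed(annotated))))
```

With this, the file can be sorted, merged, or filtered by other tools and the original identifier order is still recoverable per clique.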

@cmungall

cmungall commented Jun 3, 2022

The goal here is to have a format for storage not for sending back to clients?

In that case, is the ordering a property of the mappings themselves, or a function that NodeNormalizer applies after the fact (i.e. a priority list of prefixes from biolink)? If it's a property of the mappings themselves, maybe there is a more direct way to express this?

Same with IC value?
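If the ordering turns out to be a function applied after the fact, it could be sketched roughly as below. The `PREFIX_PRIORITY` list here is a made-up placeholder; the real priorities would come from the Biolink Model for the relevant type, and the helper names are hypothetical.

```python
# Hypothetical priority list for one type; not copied from the Biolink Model.
PREFIX_PRIORITY = ["MONDO", "ORPHANET", "UMLS"]


def prefix_of(curie):
    """Return the part of a CURIE before the first colon."""
    return curie.split(":", 1)[0]


def order_identifiers(curies, priority=PREFIX_PRIORITY):
    """Sort CURIEs by prefix priority.

    Unknown prefixes sort after all known ones. sorted() is stable, so
    identifiers with the same prefix keep their input order.
    """
    rank = {prefix: index for index, prefix in enumerate(priority)}
    return sorted(curies, key=lambda curie: rank.get(prefix_of(curie), len(priority)))


if __name__ == "__main__":
    print(order_identifiers(["UMLS:CN237736", "MONDO:0018670", "ORPHANET:449291"]))
```

In that case the stored mappings would not need to carry any ordering information at all.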

@gaurav
Contributor

gaurav commented Jun 10, 2022

I've written a program to convert some of the files in the Babel compendia into SSSOM so we can look at them in my Dropbox. These files appear to pass validation on sssom-py apart from missing CURIE maps. If everybody's happy with these files, I can run my program on all the Babel compendia (which will probably take 0.5-1 days to run).

Some thoughts and questions:

  1. I don't think we have a definitive list of CURIE prefixes anywhere in Babel (the closest thing I could find was the prefixes file at https://github.com/TranslatorSRI/Babel/blob/master/src/prefixes.py). To generate CURIE maps for each of the compendium files, I'm planning to (1) hard-code the list of prefixes that are currently used in those files, and (2) make the code generate a warning if it sees a CURIE that doesn't use one of the predefined prefixes.
  2. We have a large number of cliques that only consist of a single individual (e.g. CHEBI:61535 "poly(1,4-phenylene oxide) polymer" from ChemicalMixture.sssom.tsv), which we would still like to load into NodeNorm so that it can be returned as the preferred identifier. I'm currently modeling these by saying this identifier is an exactMatch to itself. Is there a more elegant way of modeling this?
  3. I'm not sure if we need a separate clique ID -- wouldn't the clique leader's ID be unique within a particular compendium file? In this run, I made up a clique ID in the format ${compendium_filename}#${line_number_starting_from_zero}.
  4. Is there any benefit to putting the synonym information into the SSSOM files as well? I don't think so, and only used the information from the compendium files for these files.
  5. I used match_type because the master branch of sssom-py requires that, but once that's updated to the latest SSSOM version, I'll change that to a mapping_justification of semapv:MappingChaining ("A matching process based on the traversing of multiple mappings.") since I think that best captures how Babel is built.
  6. I didn't fill in any of the optional metadata fields (e.g. mapping_set_id, mapping_set_description, mapping_set_version; see foodie-inc-2022-05-01.sssom.tsv as an example), but I can add those easily if needed.
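Points 1 to 3 above could be sketched roughly as follows. This is an illustration under assumptions, not the actual conversion program: `KNOWN_PREFIXES` is a placeholder set, the compendium filename in the demo is hypothetical, and all helper names are invented here.

```python
import warnings

# Placeholder set; the real list would be hard-coded from the compendium files.
KNOWN_PREFIXES = {"CHEBI", "MONDO", "ORPHANET", "UMLS"}


def check_prefixes(curies, known=KNOWN_PREFIXES):
    """Return the set of prefixes seen, warning about any not in `known`."""
    seen = set()
    for curie in curies:
        prefix = curie.split(":", 1)[0]
        seen.add(prefix)
        if prefix not in known:
            warnings.warn("Unexpected CURIE prefix %r in %s" % (prefix, curie))
    return seen


def make_clique_id(compendium_filename, line_number):
    """Build a clique ID as ${compendium_filename}#${line_number_starting_from_zero}."""
    return "%s#%d" % (compendium_filename, line_number)


def singleton_row(curie, label=""):
    """Model a single-member clique as an exactMatch from the identifier to itself."""
    return {
        "subject_id": curie,
        "subject_label": label,
        "predicate_id": "skos:exactMatch",
        "object_id": curie,
        "object_label": label,
    }


if __name__ == "__main__":
    check_prefixes(["CHEBI:61535", "FOO:1"])  # emits a warning for FOO
    print(make_clique_id("ChemicalMixture.txt", 0))  # hypothetical filename
    print(singleton_row("CHEBI:61535", "poly(1,4-phenylene oxide) polymer"))
```

The set returned by `check_prefixes` could then be used to decide which entries to write into each file's CURIE map.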

@cbizon
Contributor Author

cbizon commented Jun 13, 2022

> 1. I don't think we have a definitive list of CURIE prefixes anywhere in Babel (the closest thing I could find was the prefixes file at https://github.com/TranslatorSRI/Babel/blob/master/src/prefixes.py). To generate CURIE maps for each of the compendium files, I'm planning to (1) hard-code the list of prefixes that are currently used in those files, and (2) make the code generate a warning if it sees a CURIE that doesn't use one of the predefined prefixes.

There's a check in babel against the biolink prefixes for each type. So it will potentially write out anything in the biolink yaml for each type, and should not write out anything that isn't in that prefix list.
