Water results in "WATER O 15" (PUBCHEM.COMPOUND:10129877) in NameRes because of a conflation issue #264

Closed
gaurav opened this issue Apr 15, 2024 · 2 comments · Fixed by #266

gaurav commented Apr 15, 2024

This is because NameRes entries are all based on DrugConflated results, and the conflation for water is:

["PUBCHEM.COMPOUND:10129877", "CHEBI:15377", "CHEBI:33813", "RXCUI:150985", "RXCUI:204918", "RXCUI:340584", "RXCUI:379002", "RXCUI:1043588", "RXCUI:1045437", "RXCUI:1045439", "RXCUI:1053147", "RXCUI:1053148", "RXCUI:1053172", "RXCUI:1053173", "RXCUI:1053428", "RXCUI:1053429", "RXCUI:1151100", "RXCUI:1151101", "RXCUI:1161792", "RXCUI:1161794", "RXCUI:1161795", "RXCUI:1180556", "RXCUI:1235498", "RXCUI:1235499", "RXCUI:1235500", "RXCUI:1235501", "RXCUI:1235502", "RXCUI:1235503", "RXCUI:1235504", "RXCUI:1310241", "RXCUI:1314884", "RXCUI:1423320", "RXCUI:1423321", "RXCUI:1424601", "RXCUI:1424602", "RXCUI:1424603", "RXCUI:1424604", "RXCUI:1424605", "RXCUI:1425974", "RXCUI:1425975", "RXCUI:1425976", "RXCUI:1425977", "RXCUI:1425978", "RXCUI:1489375", "RXCUI:1489376", "RXCUI:1489377", "RXCUI:1489378", "RXCUI:1539535", "RXCUI:1549855", "RXCUI:2108561", "RXCUI:2282752", "RXCUI:2282753", "RXCUI:2360606", "RXCUI:2360607", "RXCUI:2360608", "RXCUI:2360609", "RXCUI:2360610", "RXCUI:2601721", "RXCUI:2601722", "UMLS:C0359299", "UMLS:C3857954", "UMLS:C1883551"]

So why is PUBCHEM.COMPOUND:10129877 ("WATER O 15") ranked above CHEBI:15377 ("water")? After we generate the initial conflation, the leading ID is RXCUI:1425974 ("Opticlear"), which is a biolink:Drug, so the whole conflation is typed as a biolink:Drug. In the biolink:Drug prefix order, PUBCHEM.COMPOUND is preferred over CHEBI:

INFO:src.createcompendia.drugchemical:Leading ID RXCUI:1425974 normalized to RXCUI:1425974 (type biolink:Drug) with prefixes: ['ncats.drug', 'RXCUI', 'NDC', 'UMLS', 'PUBCHEM.COMPOUND', 'CHEMBL.COMPOUND', 'UNII', 'CHEBI', 'MESH', 'CAS', 'GTOPDB', 'HMDB', 'KEGG', 'KEGG.COMPOUND', 'ChemBank', 'PUBCHEM.SUBSTANCE', 'INCHI', 'INCHIKEY', 'KEGG.GLYCAN', 'KEGG.ENVIRON', 'SIDER.DRUG', 'BIGG.METABOLITE', 'foodb.compound']
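
To make the effect of that prefix list concrete, here is a minimal sketch that compares only the two identifiers in question. The prefix list is copied from the log line above (truncated after CHEBI); the `prefix_rank` helper is hypothetical and does not reproduce how the pipeline orders the full conflation.

```python
# Prefix order for biolink:Drug, copied from the log line above and truncated after CHEBI.
DRUG_PREFIX_ORDER = [
    "ncats.drug", "RXCUI", "NDC", "UMLS",
    "PUBCHEM.COMPOUND", "CHEMBL.COMPOUND", "UNII", "CHEBI",
]


def prefix_rank(curie, prefix_order):
    """Return the position of a CURIE's prefix in prefix_order (lower is more preferred).

    Hypothetical helper for illustration; not the function used in the pipeline.
    """
    prefix = curie.split(":", 1)[0]
    # Prefixes missing from the list sort after every known prefix.
    return prefix_order.index(prefix) if prefix in prefix_order else len(prefix_order)


ids = ["CHEBI:15377", "PUBCHEM.COMPOUND:10129877"]
ranked = sorted(ids, key=lambda curie: prefix_rank(curie, DRUG_PREFIX_ORDER))
print(ranked)  # ['PUBCHEM.COMPOUND:10129877', 'CHEBI:15377']
```

Under a type whose prefix order puts CHEBI ahead of PUBCHEM.COMPOUND, the same sort would put "water" first.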

So, options for fixing this:

  1. Conflate fewer things together, so things like Opticlear won't get conflated with water. But this will be much trickier to implement.
  2. We could try determining the type not by a random ID but by some sort of consensus calculation, but given all the RXCUIs I suspect everything will end up as a biolink:Drug.
  3. I don't think we ever want RXCUIs to affect the type calculation. So we could filter them all out (along with all the biolink:ChemicalEntity CURIEs), then base the type on a consensus of the other IDs.
  4. ???
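
Option 3 could look roughly like the sketch below. Everything here is an assumption for illustration: the `consensus_type` function, the fallback type, and every type assignment in the example except RXCUI:1425974 being a biolink:Drug.

```python
from collections import Counter


def consensus_type(types_by_curie,
                   ignored_prefixes=("RXCUI",),
                   ignored_types=("biolink:ChemicalEntity",)):
    """Pick the most common Biolink type after dropping ignored prefixes and types.

    Hypothetical sketch of option 3; not existing pipeline code.
    """
    votes = Counter(
        biolink_type
        for curie, biolink_type in types_by_curie.items()
        if curie.split(":", 1)[0] not in ignored_prefixes
        and biolink_type not in ignored_types
    )
    if not votes:
        # Assumed fallback when every identifier was filtered out.
        return "biolink:ChemicalEntity"
    return votes.most_common(1)[0][0]


# Type assignments are made up for illustration, apart from RXCUI:1425974.
example = {
    "RXCUI:1425974": "biolink:Drug",
    "RXCUI:150985": "biolink:Drug",
    "CHEBI:15377": "biolink:SmallMolecule",
    "PUBCHEM.COMPOUND:10129877": "biolink:SmallMolecule",
    "UMLS:C0359299": "biolink:ChemicalEntity",
}
print(consensus_type(example))  # biolink:SmallMolecule
```

With the RXCUIs left in, a plain majority vote over a conflation like the one for water would be dominated by biolink:Drug, which is the concern raised in option 2.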

cbizon commented Apr 15, 2024

I think 3 is fine. But I wonder if we can handle this at a per-clique level. We're merging a series of cliques, and each clique has a type. Can we have a preferred series of types and then just choose our favorite type from across the cliques? Drug at bottom, small molecule at top?
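
A minimal sketch of that per-clique idea, assuming each merged clique already carries a type. The `TYPE_PREFERENCE` list and `preferred_type` function are hypothetical; only the two ends of the list (small molecule at top, Drug at bottom) come from the comment above, and the type given for CHEBI:15377 is an assumption.

```python
# Assumed preference order, most preferred first. Only the endpoints come from
# the discussion; the middle entry is a placeholder.
TYPE_PREFERENCE = [
    "biolink:SmallMolecule",
    "biolink:MolecularMixture",
    "biolink:Drug",
]


def preferred_type(clique_types, preference=TYPE_PREFERENCE):
    """Return the most preferred type present among the merged cliques."""
    for biolink_type in preference:
        if biolink_type in clique_types:
            return biolink_type
    raise ValueError(f"No known type among {sorted(clique_types)}")


# Clique leader -> clique type. The CHEBI type is assumed for illustration.
clique_type_by_leader = {
    "RXCUI:1425974": "biolink:Drug",
    "CHEBI:15377": "biolink:SmallMolecule",
}
chosen = preferred_type(set(clique_type_by_leader.values()))
print(chosen)  # biolink:SmallMolecule
```

Because the choice depends only on which types are present, the number of RXCUI cliques in the conflation no longer matters.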


gaurav commented Apr 15, 2024

Discussion result:

  1. Determine conflation type based on a preferred list of types: SmallMolecule is most preferred, ChemicalEntity is least preferred, Drug is somewhere near the bottom.
  2. Another potential approach would be to pick the clique leader using the number of identifiers in the clique, BUT this could push us towards lots of Drugs (if a single drug has a ton of formulations, say), so we should document this as a potential solution but try the conflation type approach first.
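
To document the second approach and its risk, here is a hedged sketch that picks the clique leader by identifier count. The `leader_by_size` function and all `EXAMPLE:` identifiers are hypothetical placeholders, not real CURIEs or pipeline code.

```python
def leader_by_size(cliques):
    """Return the leader of the clique with the most identifiers.

    `cliques` maps a clique leader to the list of identifiers in that clique.
    """
    return max(cliques, key=lambda leader: len(cliques[leader]))


# Placeholder data: one small-molecule clique with two identifiers, and one
# drug clique padded with many formulation identifiers.
cliques = {
    "EXAMPLE:molecule": ["EXAMPLE:molecule", "EXAMPLE:molecule-xref"],
    "EXAMPLE:drug": ["EXAMPLE:drug"] + [f"EXAMPLE:formulation-{i}" for i in range(5)],
}
print(leader_by_size(cliques))  # EXAMPLE:drug
```

The drug clique wins purely because of its formulations, which is why the preferred-type approach is tried first.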

gaurav self-assigned this Apr 16, 2024
gaurav added a commit that referenced this issue Apr 22, 2024
We previously used a randomly chosen identifier from each DrugChemical conflation to choose the Biolink type for the entire conflation, which also determined the order of prefixes within the conflation. This led to issues where we used an RXCUI to determine that a conflation should be considered a biolink:Drug, when really biolink:SmallMolecule would be a better type. This PR replaces that approach with a preferred-type approach.

Also replaces COMPLEX_CHEMICAL_MIXTURE with COMPLEX_MOLECULAR_MIXTURE, which is what Biolink calls it now.

Closes #264.