-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reproducible error when using drug_chemical_conflate #221
Comments
The problem appears to be that PUBCHEM.COMPOUND:440055 is conflated with RXCUI:2642190, which is missing from NodeNorm for some reason: https://nodenormalization-sri.renci.org/1.4/get_normalized_nodes?curie=RXCUI:2642190&conflate=true&drug_chemical_conflate=false&description=false This apparently causes an exception to be thrown when trying to figure out the types of this identifier:
|
It looks like there are 41 cliques on https://nodenormalization-sri.renci.org/ (current NodeNorm Dev) that currently cause this error:
|
more of them: |
Note the oddity of PUBCHEM.COMPOUND:124220636, which should only be conflated with PUBCHEM.COMPOUND:165411920 and UMLS:C5402366, all three of which have type information. Update: this appears to be caused by the middle ID, but then how come it works when conflation is turned off without problems?
Update: PUBCHEM.COMPOUND:165411920 is part of the PUBCHEM.COMPOUND:124220636 clique. Perhaps that is confusing the conflation algorithm somehow? |
Right, so PUBCHEM.COMPOUND:165411920 really doesn't have a type. The type database only uses the preferred identifier to retrieve the type. The problem is that the conflation contains something other than a preferred identifier. In conflation, it is assuming that those are all preferred, and will therefore all have entries in the right places. It works when you just go with *20 because it first hits the main index, finds the preferred ID and then uses that to go look up the type. The upshot of which: Your more careful response to the error is all that can be done in nodenorm, and the real problem is in Babel (oops). |
Here are the cliques for those three identifiers from our previous run. As you said, it's only two cliques, since PUBCHEM.COMPOUND:124220636 is the clique leader for PUBCHEM.COMPOUND:165411920. {"type": "biolink:MolecularMixture", "identifiers": [{"i": "PUBCHEM.COMPOUND:124220636", "l": "Copper dotatate Cu-64", "d": []}, {"i": "PUBCHEM.COMPOUND:165411920", "l": "Detectnet", "d": []}, {"i": "CHEMBL.COMPOUND:CHEMBL4297339", "l": "COPPER OXODOTREOTIDE CU-64", "d": []}, {"i": "UNII:N3858377KC", "l": "COPPER OXODOTREOTIDE CU-64", "d": []}, {"i": "DRUGBANK:DB15873", "d": []}, {"i": "MESH:C000718307", "l": "copper dotatate CU-64", "d": []}, {"i": "MESH:C575629", "l": "64Cu-DOTATATE", "d": []}, {"i": "DrugCentral:5411", "l": "copper dotatate Cu-64", "d": []}, {"i": "HMDB:HMDB0304900", "l": "Copper Cu 64 Dotatate", "d": []}, {"i": "INCHIKEY:IJRLLVFQGCCPPI-NVGRTJHCSA-L", "d": []}, {"i": "UMLS:C3502191", "l": "copper oxodotreotide CU-64", "d": []}, {"i": "RXCUI:2396442", "d": []}]}
{"type": "biolink:ChemicalEntity", "identifiers": [{"i": "UMLS:C5402366", "l": "dodatate", "d": []}, {"i": "RXCUI:2396443", "d": []}]} The Babel regeneration stalled sometime this morning, but I've restarted it now. Once that's done I'll confirm that it also has the incorrect conflation, but I think I can assume that that will be the case and look for issues in the DrugChemical conflation code in the meantime. |
I've modified the code in PR TranslatorSRI/Babel#191 to skip identifiers where the object isn't present in either drug_rxcui_to_clique or chemical_rxcui_to_clique, which are mappings from RXCUIs to clique leaders. This seems to have eliminated the PUBCHEM.COMPOUND identifiers that we don't actually want conflated, but I'm pretty sure that it's skipping too many non-RXCUI mappings that we actually do want in these conflated cliques. Here is a list of the 11,737 subject-object pairs being skipped by this change: in 4,508 cases this is a mapping from an identifier to itself (e.g. PUBCHEM.COMPOUND:6914273 to PUBCHEM.COMPOUND:6914273), while in the other 7,229 cases it's between two different identifiers. The dumb next move would to remove the RXCUI filters entirely from load_cliques(), so then we can figure out the clique leader for every identifier we want to conflate (at the cost of loading all the chemical identifiers back into memory, but whatever). @cbizon Do you think that makes sense, or can you come up with a smarter way to make sure the DrugChemical file only contains clique leaders? |
If we were concerned about keeping memory low, the only other thing I can think of would be to generate the DrugChemical conflation as before, but then do a post-process on it where we load that whole thing into memory and then cycle over the chemical cliques. Each time we read a new clique leader we check the memory DC conflation, and if we find it in there, then we move it over into a new, cleaned conflation, which we write out at the end. But honestly, it seems like overkill. Probably best just to load all the clique leader IDs and be done with it. I think you're right that the code in the PR is removing too much - the object for some of these is still a clique leader, even if it's not in those maps. |
I've been thinking about this issue some, and I wonder if there's a simpler, more comprehensive fix that we can make in NodeNorm itself. The error is the following:
Which appears to be triggered by these lines in NodeNorm: The issue is in the following chunk of code: NodeNormalization/node_normalizer/normalizer.py Lines 485 to 486 in 68096b2
So it looks like the one of the values in What would happen if we changed any Nones in that list to @cbizon Do you think that would be a better solution than my overkill PR (PR TranslatorSRI/Babel#217)? |
I am ok with this solution in that it would work as a short-term, and it's preferable to the Babel solution that removes too much. But I think that the right answer is to fix Babel so that it doesn't make bad conflations (but in a more careful way than 217). |
This PR provides a workaround for #221 by: - Logging an error when an ID doesn't have type information for some reason. - Reporting such an identifier back to the user as not found, without causing the entire batch to fail (closes #222) - Assuming that any identifier missing type information can be assumed to be a `biolink:NamedThing` (closes #221). Closes #221.
As reported by @EvanDietzMorris, querying PUBCHEM.COMPOUND:440055 without drug_chemical_conflate works fine: https://nodenormalization-sri.renci.org/1.4/get_normalized_nodes?curie=PUBCHEM.COMPOUND%3A440055&conflate=true&drug_chemical_conflate=false&description=false
But querying it with drug_chemical_conflate turned on consistently causes an error during processing: https://nodenormalization-sri.renci.org/1.4/get_normalized_nodes?curie=PUBCHEM.COMPOUND%3A440055&conflate=true&drug_chemical_conflate=true&description=false
The text was updated successfully, but these errors were encountered: