Reproducible error when using drug_chemical_conflate #221

gaurav · 2023-10-09T18:24:20Z

As reported by @EvanDietzMorris, querying PUBCHEM.COMPOUND:440055 without drug_chemical_conflate works fine: https://nodenormalization-sri.renci.org/1.4/get_normalized_nodes?curie=PUBCHEM.COMPOUND%3A440055&conflate=true&drug_chemical_conflate=false&description=false

But querying it with drug_chemical_conflate turned on consistently causes an error during processing: https://nodenormalization-sri.renci.org/1.4/get_normalized_nodes?curie=PUBCHEM.COMPOUND%3A440055&conflate=true&drug_chemical_conflate=true&description=false
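For anyone who wants to reproduce this from a script, here is a minimal sketch that rebuilds the two URLs above. The helper name `build_query_url` is made up for illustration; the endpoint and parameters are copied from the links in this comment. It only constructs the URLs and does not send any request.

```python
from urllib.parse import urlencode

# Endpoint taken from the links above.
BASE_URL = "https://nodenormalization-sri.renci.org/1.4/get_normalized_nodes"


def build_query_url(curie: str, drug_chemical_conflate: bool) -> str:
    """Build a get_normalized_nodes URL with the same parameters as the links above."""
    params = {
        "curie": curie,
        "conflate": "true",
        "drug_chemical_conflate": "true" if drug_chemical_conflate else "false",
        "description": "false",
    }
    return f"{BASE_URL}?{urlencode(params)}"


# The query that works, and the one that triggers the error.
working_url = build_query_url("PUBCHEM.COMPOUND:440055", drug_chemical_conflate=False)
failing_url = build_query_url("PUBCHEM.COMPOUND:440055", drug_chemical_conflate=True)
```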

gaurav · 2023-10-09T21:02:19Z

The problem appears to be that PUBCHEM.COMPOUND:440055 is conflated with RXCUI:2642190, which is missing from NodeNorm for some reason: https://nodenormalization-sri.renci.org/1.4/get_normalized_nodes?curie=RXCUI:2642190&conflate=true&drug_chemical_conflate=false&description=false

This apparently causes an exception to be thrown when trying to figure out the types of this identifier:

[2023-10-09 20:48:09 +0000] [13] [INFO] 172.25.13.150:55558 - "GET /docs HTTP/1.1" 200
2023-10-09 20:48:10,600 | ERROR | normalizer:get_normalized_nodes | Exception: Traceback (most recent call last):
  File "/code/node_normalizer/normalizer.py", line 559, in get_normalized_nodes
    eqids2, types2 = await get_eqids_and_types(app, all_other_ids)
  File "/code/node_normalizer/normalizer.py", line 486, in get_eqids_and_types
    types = [get_ancestors(app, t) for t in types]
  File "/code/node_normalizer/normalizer.py", line 486, in <listcomp>
    types = [get_ancestors(app, t) for t in types]
  File "/code/node_normalizer/normalizer.py", line 24, in get_ancestors
    a = app.state.toolkit.get_ancestors(input_type)
  File "/usr/local/lib/python3.9/site-packages/bmt/util.py", line 120, in wrapper
    case = guess_casing(s)
  File "/usr/local/lib/python3.9/site-packages/bmt/util.py", line 65, in guess_casing
    if "_" in s:
TypeError: argument of type 'NoneType' is not iterable
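
The last frame of the traceback can be reproduced in isolation. This is only a sketch: `contains_underscore` is a hypothetical stand-in that mirrors the single `if "_" in s:` check shown above, not the actual bmt code.

```python
def contains_underscore(s):
    # Mirrors only the membership test from the traceback: `if "_" in s:`
    return "_" in s


def membership_check_fails(value) -> bool:
    """Return True if the membership test raises TypeError for this value."""
    try:
        contains_underscore(value)
    except TypeError:
        return True
    return False


# A missing type (None) fails the check; a real type string does not.
none_fails = membership_check_fails(None)
string_fails = membership_check_fails("biolink:NamedThing")
```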

gaurav · 2023-10-10T04:17:01Z

It looks like there are 41 cliques on https://nodenormalization-sri.renci.org/ (current NodeNorm Dev) that currently cause this error:

PUBCHEM.COMPOUND:14969
PUBCHEM.COMPOUND:62816
PUBCHEM.COMPOUND:2657
PUBCHEM.COMPOUND:3420
PUBCHEM.COMPOUND:184933
PUBCHEM.COMPOUND:16132446
PUBCHEM.COMPOUND:16158474
PUBCHEM.COMPOUND:6452712
PUBCHEM.COMPOUND:121492004
PUBCHEM.COMPOUND:124220636
PUBCHEM.COMPOUND:3325
PUBCHEM.COMPOUND:426756
PUBCHEM.COMPOUND:64738
PUBCHEM.COMPOUND:2274
PUBCHEM.COMPOUND:657250
PUBCHEM.COMPOUND:20056431
PUBCHEM.COMPOUND:76943386
PUBCHEM.COMPOUND:163285897
PUBCHEM.COMPOUND:24883445
PUBCHEM.COMPOUND:23666342
PUBCHEM.COMPOUND:60714
PUBCHEM.COMPOUND:60754
PUBCHEM.COMPOUND:72057
PUBCHEM.COMPOUND:11158972
PUBCHEM.COMPOUND:22002932 <-- continuous series of 500 errors; not sure why that happened but it might be a clue!
PUBCHEM.COMPOUND:74763937
PUBCHEM.COMPOUND:60196433
PUBCHEM.COMPOUND:118115473
PUBCHEM.COMPOUND:145722621
PUBCHEM.COMPOUND:98941
PUBCHEM.COMPOUND:13730
PUBCHEM.COMPOUND:440055
PUBCHEM.COMPOUND:2734019
PUBCHEM.COMPOUND:2724369
PUBCHEM.COMPOUND:71587456
PUBCHEM.COMPOUND:24868287
PUBCHEM.COMPOUND:21121987
PUBCHEM.COMPOUND:107488
PUBCHEM.COMPOUND:135398592
PUBCHEM.COMPOUND:16218792
PUBCHEM.COMPOUND:2736435

EvanDietzMorris · 2023-10-10T18:40:11Z

More of them:
PUBCHEM.COMPOUND:5352133
PUBCHEM.COMPOUND:5353622
PUBCHEM.COMPOUND:5742832
PUBCHEM.COMPOUND:5702160
PUBCHEM.COMPOUND:162533872
PUBCHEM.COMPOUND:5284447
PUBCHEM.COMPOUND:5479530

gaurav · 2023-10-12T19:42:48Z

Note the oddity of PUBCHEM.COMPOUND:124220636, which should only be conflated with PUBCHEM.COMPOUND:165411920 and UMLS:C5402366, all three of which have type information.

Update: this appears to be caused by the middle ID (PUBCHEM.COMPOUND:165411920), but then why does it work without problems when conflation is turned off?

No type information found for 'PUBCHEM.COMPOUND:165411920' with eqids: None.

Update: PUBCHEM.COMPOUND:165411920 is part of the PUBCHEM.COMPOUND:124220636 clique. Perhaps that is confusing the conflation algorithm somehow?

cbizon · 2023-10-12T20:38:56Z

Right, so PUBCHEM.COMPOUND:165411920 really doesn't have a type. The type database only uses the preferred identifier to retrieve the type. The problem is that the conflation contains something other than a preferred identifier. The conflation code assumes that its members are all preferred identifiers, and will therefore all have entries in the right places.

It works when you query just PUBCHEM.COMPOUND:165411920 because that lookup first hits the main index, finds the preferred ID, and then uses that to look up the type.

The upshot of which: Your more careful response to the error is all that can be done in nodenorm, and the real problem is in Babel (oops).
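
A rough sketch of the two lookup paths described above, using plain dictionaries as hypothetical stand-ins for the main index and the type database (the real NodeNorm stores and function names are not shown in this thread). The identifiers and types are taken from the clique data quoted in this issue.

```python
# Hypothetical stand-in for the main index: any identifier -> preferred identifier.
ID_TO_PREFERRED = {
    "PUBCHEM.COMPOUND:124220636": "PUBCHEM.COMPOUND:124220636",
    "PUBCHEM.COMPOUND:165411920": "PUBCHEM.COMPOUND:124220636",
    "UMLS:C5402366": "UMLS:C5402366",
}

# Hypothetical stand-in for the type database: keyed by preferred identifier only.
PREFERRED_TO_TYPE = {
    "PUBCHEM.COMPOUND:124220636": "biolink:MolecularMixture",
    "UMLS:C5402366": "biolink:ChemicalEntity",
}


def type_via_main_index(curie):
    """Normal path: resolve the preferred identifier first, then look up its type."""
    preferred = ID_TO_PREFERRED.get(curie)
    return PREFERRED_TO_TYPE.get(preferred)


def type_assuming_preferred(curie):
    """Conflation path: treat the identifier as already preferred."""
    return PREFERRED_TO_TYPE.get(curie)


normal_result = type_via_main_index("PUBCHEM.COMPOUND:165411920")
conflation_result = type_assuming_preferred("PUBCHEM.COMPOUND:165411920")
```

The normal path finds a type for the non-preferred identifier, while the conflation path returns None for it, which is the value that later reaches get_ancestors.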

gaurav · 2023-10-13T18:22:56Z

Here are the cliques for those three identifiers from our previous run. As you said, it's only two cliques, since PUBCHEM.COMPOUND:124220636 is the clique leader for PUBCHEM.COMPOUND:165411920.

{"type": "biolink:MolecularMixture", "identifiers": [{"i": "PUBCHEM.COMPOUND:124220636", "l": "Copper dotatate Cu-64", "d": []}, {"i": "PUBCHEM.COMPOUND:165411920", "l": "Detectnet", "d": []}, {"i": "CHEMBL.COMPOUND:CHEMBL4297339", "l": "COPPER OXODOTREOTIDE CU-64", "d": []}, {"i": "UNII:N3858377KC", "l": "COPPER OXODOTREOTIDE CU-64", "d": []}, {"i": "DRUGBANK:DB15873", "d": []}, {"i": "MESH:C000718307", "l": "copper dotatate CU-64", "d": []}, {"i": "MESH:C575629", "l": "64Cu-DOTATATE", "d": []}, {"i": "DrugCentral:5411", "l": "copper dotatate Cu-64", "d": []}, {"i": "HMDB:HMDB0304900", "l": "Copper Cu 64 Dotatate", "d": []}, {"i": "INCHIKEY:IJRLLVFQGCCPPI-NVGRTJHCSA-L", "d": []}, {"i": "UMLS:C3502191", "l": "copper oxodotreotide CU-64", "d": []}, {"i": "RXCUI:2396442", "d": []}]}
{"type": "biolink:ChemicalEntity", "identifiers": [{"i": "UMLS:C5402366", "l": "dodatate", "d": []}, {"i": "RXCUI:2396443", "d": []}]}

The Babel regeneration stalled sometime this morning, but I've restarted it now. Once that's done I'll confirm that it also has the incorrect conflation, but I think I can assume that that will be the case and look for issues in the DrugChemical conflation code in the meantime.

gaurav · 2023-10-14T20:47:34Z

I've modified the code in PR TranslatorSRI/Babel#191 to skip identifiers where the object isn't present in either drug_rxcui_to_clique or chemical_rxcui_to_clique, which are mappings from RXCUIs to clique leaders. This seems to have eliminated the PUBCHEM.COMPOUND identifiers that we don't actually want conflated, but I'm pretty sure that it's skipping too many non-RXCUI mappings that we actually do want in these conflated cliques. Here is a list of the 11,737 subject-object pairs being skipped by this change: in 4,508 cases this is a mapping from an identifier to itself (e.g. PUBCHEM.COMPOUND:6914273 to PUBCHEM.COMPOUND:6914273), while in the other 7,229 cases it's between two different identifiers.

The dumb next move would be to remove the RXCUI filters entirely from load_cliques(), so that we can figure out the clique leader for every identifier we want to conflate (at the cost of loading all the chemical identifiers back into memory, but whatever). @cbizon Do you think that makes sense, or can you come up with a smarter way to make sure the DrugChemical file only contains clique leaders?
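
As a sketch of what "figure out the clique leader for every identifier" could look like: the functions below are hypothetical and are not the Babel code. They assume the clique file format shown earlier in this thread, and that the first entry in "identifiers" is the clique leader.

```python
import json


def leaders_by_identifier(lines):
    """Map every identifier in every clique to that clique's leader.

    Assumption: the first entry in "identifiers" is the clique leader.
    """
    leader_of = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        clique = json.loads(line)
        ids = [entry["i"] for entry in clique["identifiers"]]
        if not ids:
            continue
        for identifier in ids:
            leader_of[identifier] = ids[0]
    return leader_of


def to_leaders(conflation, leader_of):
    """Replace each identifier with its clique leader, keeping order and dropping repeats.

    Identifiers with no known clique are dropped in this sketch.
    """
    leaders = []
    for identifier in conflation:
        leader = leader_of.get(identifier)
        if leader is not None and leader not in leaders:
            leaders.append(leader)
    return leaders


# Abbreviated versions of the two cliques quoted earlier in this thread.
clique_lines = [
    '{"type": "biolink:MolecularMixture", "identifiers": [{"i": "PUBCHEM.COMPOUND:124220636"}, {"i": "PUBCHEM.COMPOUND:165411920"}, {"i": "RXCUI:2396442"}]}',
    '{"type": "biolink:ChemicalEntity", "identifiers": [{"i": "UMLS:C5402366"}, {"i": "RXCUI:2396443"}]}',
]
leader_of = leaders_by_identifier(clique_lines)
fixed = to_leaders(
    ["PUBCHEM.COMPOUND:124220636", "PUBCHEM.COMPOUND:165411920", "UMLS:C5402366"],
    leader_of,
)
```

With the full mapping in memory, the three-member conflation collapses to its two clique leaders.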

cbizon · 2023-10-16T13:23:55Z

If we were concerned about keeping memory low, the only other thing I can think of would be to generate the DrugChemical conflation as before, but then do a post-process on it where we load that whole thing into memory and then cycle over the chemical cliques. Each time we read a new clique leader we check the memory DC conflation, and if we find it in there, then we move it over into a new, cleaned conflation, which we write out at the end. But honestly, it seems like overkill. Probably best just to load all the clique leader IDs and be done with it.
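
A minimal sketch of that post-processing idea, assuming the conflation is a list of identifier lists and the clique leaders can be streamed one at a time. The function name and the choice to drop conflations left with fewer than two members are assumptions for illustration, not existing Babel behavior.

```python
def clean_conflations(conflations, clique_leaders):
    """Keep only conflation members that are confirmed clique leaders.

    conflations: list of lists of identifiers (held in memory).
    clique_leaders: iterable of leader identifiers (can be streamed).
    """
    # Every identifier that appears in some conflation.
    wanted = {identifier for conflation in conflations for identifier in conflation}

    # Stream over the clique leaders and note which conflated IDs really are leaders.
    confirmed = set()
    for leader in clique_leaders:
        if leader in wanted:
            confirmed.add(leader)

    # Build the cleaned conflation, preserving the original member order.
    cleaned = []
    for conflation in conflations:
        kept = [identifier for identifier in conflation if identifier in confirmed]
        if len(kept) > 1:  # assumption: a conflation needs at least two members
            cleaned.append(kept)
    return cleaned


conflations = [
    ["PUBCHEM.COMPOUND:124220636", "PUBCHEM.COMPOUND:165411920", "UMLS:C5402366"],
]
leaders = iter(["PUBCHEM.COMPOUND:124220636", "UMLS:C5402366"])
cleaned = clean_conflations(conflations, leaders)
```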

I think you're right that the code in the PR is removing too much - the object for some of these is still a clique leader, even if it's not in those maps.

gaurav · 2023-12-07T03:42:19Z

I've been thinking about this issue some, and I wonder if there's a simpler, more comprehensive fix that we can make in NodeNorm itself. The error is the following:

[2023-10-09 20:48:09 +0000] [13] [INFO] 172.25.13.150:55558 - "GET /docs HTTP/1.1" 200
2023-10-09 20:48:10,600 | ERROR | normalizer:get_normalized_nodes | Exception: Traceback (most recent call last):
  File "/code/node_normalizer/normalizer.py", line 559, in get_normalized_nodes
    eqids2, types2 = await get_eqids_and_types(app, all_other_ids)
  File "/code/node_normalizer/normalizer.py", line 486, in get_eqids_and_types
    types = [get_ancestors(app, t) for t in types]
  File "/code/node_normalizer/normalizer.py", line 486, in <listcomp>
    types = [get_ancestors(app, t) for t in types]
  File "/code/node_normalizer/normalizer.py", line 24, in get_ancestors
    a = app.state.toolkit.get_ancestors(input_type)
  File "/usr/local/lib/python3.9/site-packages/bmt/util.py", line 120, in wrapper
    case = guess_casing(s)
  File "/usr/local/lib/python3.9/site-packages/bmt/util.py", line 65, in guess_casing
    if "_" in s:
TypeError: argument of type 'NoneType' is not iterable

These appear to be the lines in NodeNorm that trigger it:

NodeNormalization/node_normalizer/normalizer.py, lines 485 to 486 in 68096b2:

    types = await app.state.redis_connection2.mget(*canonical_nonan, encoding='utf-8')
    types = [get_ancestors(app, t) for t in types]

So it looks like one of the values in types is coming out as None.

What would happen if we changed any Nones in that list to biolink:NamedThing? Since the conflation code combines types from every identifier in the conflation, that should skip over any IDs that aren't clique leaders, and as long as at least one clique leader has a type, we should end up using that for the entire conflation. And if there are any conflations without a single clique leader, we would report that as having a type of biolink:NamedThing without causing a 500 error, which would still flag to us that something is wrong.
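
Roughly, the idea is something like the sketch below. The helper `fill_missing_types` and the constant are local names for illustration, not existing NodeNorm functions; the list of types is assumed to line up one-to-one with the list of identifiers that was looked up.

```python
import logging

logger = logging.getLogger(__name__)

# Local constant for this sketch.
BIOLINK_NAMED_THING = "biolink:NamedThing"


def fill_missing_types(identifiers, types):
    """Replace any None in types with biolink:NamedThing, logging each case."""
    filled = []
    for identifier, type_ in zip(identifiers, types):
        if type_ is None:
            logger.error(
                "No type information found for %r; assuming %s",
                identifier,
                BIOLINK_NAMED_THING,
            )
            type_ = BIOLINK_NAMED_THING
        filled.append(type_)
    return filled


ids = ["PUBCHEM.COMPOUND:124220636", "PUBCHEM.COMPOUND:165411920", "UMLS:C5402366"]
raw_types = ["biolink:MolecularMixture", None, "biolink:ChemicalEntity"]
safe_types = fill_missing_types(ids, raw_types)
```

The filled list never contains None, so the ancestor lookup would get a string for every entry, and the log line still tells us which identifier was missing its type.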

@cbizon Do you think that would be a better solution than my overkill PR (PR TranslatorSRI/Babel#217)?

cbizon · 2023-12-07T14:15:47Z

I am ok with this solution in that it would work as a short-term, and it's preferable to the Babel solution that removes too much. But I think that the right answer is to fix Babel so that it doesn't make bad conflations (but in a more careful way than 217).

This PR provides a workaround for #221 by:
- Logging an error when an ID doesn't have type information for some reason.
- Reporting such an identifier back to the user as not found, without causing the entire batch to fail (closes #222).
- Assuming that any identifier missing type information is a `biolink:NamedThing` (closes #221).

gaurav self-assigned this Oct 9, 2023

gaurav added this to the NodeNorm October 2023 milestone Oct 9, 2023

gaurav mentioned this issue Oct 12, 2023

Workaround for missing type information #223

Merged

gaurav mentioned this issue Oct 13, 2023

Possible fixes for DrugChemical conflation bug TranslatorSRI/Babel#191

Closed

gaurav mentioned this issue Nov 2, 2023

Test issue 221 helxplatform/translator-devops#782

Closed

gaurav modified the milestones: NodeNorm October 2023, NodeNorm November 2023 Dec 3, 2023

gaurav mentioned this issue Dec 7, 2023

Fix conflation bug TranslatorSRI/Babel#217

Closed

gaurav added a commit to TranslatorSRI/Babel that referenced this issue Dec 10, 2023

Objects not mapped to a RxCUI now warnings instead of skip.

6b0286c

See decision at TranslatorSRI/NodeNormalization#221 (comment)

gaurav added a commit that referenced this issue Dec 10, 2023

Identifiers missing type information now assumed to be BIOLINK_NAMED_…

4b826ba

…THING. As proposed in #221 (comment)

gaurav mentioned this issue Dec 15, 2023

Non-leader identifiers are being incorporated into drug-chemical conflation TranslatorSRI/Babel#220

Closed

gaurav closed this as completed in #223 Dec 15, 2023
