Babel v1.3 ongoing fixes #201

Merged: 45 commits, Jan 24, 2024

Commits
55a8a79
Add object normalization for PubChem IDs in drugchemical conflation.
gaurav Oct 13, 2023
269546e
Replaced exceptions with warnings.
gaurav Oct 14, 2023
115187d
Reverted one of the warnings back to an exception.
gaurav Oct 14, 2023
43e076f
Prevent and log KeyError in anatomy.
gaurav Nov 3, 2023
618452d
Reverted anatomy hiding code.
gaurav Nov 3, 2023
8b8ecca
Moved UniProtKB downloads into Snakefile as wget commands.
gaurav Nov 3, 2023
2c67b41
Improved progress output so it can be mixed with other logs better.
gaurav Nov 3, 2023
9d357d0
Deleted redundant output filename from pull_panther_pathways().
gaurav Nov 5, 2023
6286d0e
Took out --progress=dot as this somehow makes it even harder to read.
gaurav Nov 5, 2023
c9bb80e
Added `-k` to gunzip so we keep the gzip file in case it's needed.
gaurav Nov 5, 2023
99c28af
Upgraded UMLS and RxNorm versions.
gaurav Nov 20, 2023
bb5ca26
Switched HGNC to HTTP from FTP.
gaurav Nov 26, 2023
c5fc007
Fixed FTP -> HTTP change for HGNC.
gaurav Nov 26, 2023
a38ef74
Added code to skip hgfemale_gene_ensembl.
gaurav Nov 30, 2023
c7e8e30
Improved debugging.
gaurav Dec 1, 2023
7c449b3
Removed unnecessary import.
gaurav Dec 2, 2023
4c0312b
Upgraded to the latest version of Biolink Model.
gaurav Dec 3, 2023
71a3944
Updated version in the downloaded file.
gaurav Oct 19, 2023
7a6063d
First stab at a KGX exporter.
gaurav Dec 2, 2023
fa3b465
Added basic KGX export.
gaurav Dec 2, 2023
34d0e2d
Improved logic.
gaurav Dec 2, 2023
2fbf0f4
Improved documentation.
gaurav Dec 2, 2023
4ff3bc0
Added the export-everything-to-KGX step to the overall `all` rule.
gaurav Dec 2, 2023
4e45f01
Implemented get_all_compendia(), moved some code around as a result.
gaurav Dec 2, 2023
39ce128
Increased Babel output PVC to accommodate KGX export.
gaurav Dec 2, 2023
753de90
Merge branch 'improve-uniprotkb-downloads' into babel-1.3
gaurav Dec 10, 2023
6b0286c
Objects not mapped to an RxCUI now produce warnings instead of being skipped.
gaurav Dec 10, 2023
b0b0534
Merge branch 'fix-infores-curies' into babel-1.3
gaurav Dec 10, 2023
6a131bd
First stab at choosing a better preferred name for NameRes cliques.
gaurav Dec 4, 2023
6ace4cc
First stab at a more sophisticated preferred_name chooser.
gaurav Dec 10, 2023
aaea16c
Upgraded RxNorm to 12042023.
gaurav Dec 10, 2023
c648d87
Merge branch 'pick-better-preferred-names' into babel-1.3
gaurav Dec 10, 2023
91f6ddf
Reduced log of empty synonym list to debug().
gaurav Dec 10, 2023
87df665
Changed warning for no preferred name to debug.
gaurav Dec 10, 2023
7863ee8
Fixed typo.
gaurav Dec 10, 2023
46679de
Removed unnecessary argument from get_panther_pathways.
gaurav Dec 10, 2023
d7944f3
Fixed bug in boost prefix sort, improved exception trace.
gaurav Dec 10, 2023
16d7d95
Added support for gzipped KGX files on the fly.
gaurav Dec 11, 2023
30f8d6a
Increased batch size, suppressed per-batch log.
gaurav Dec 11, 2023
1a36e6f
Moved PUBCHEM.COMPOUND to the back of the list.
gaurav Dec 13, 2023
204711b
Removed extraneous blank lines.
gaurav Dec 14, 2023
12b74fb
Upgraded Biolink Model to latest version.
gaurav Jan 2, 2024
f3f270d
Merge branch 'master' into babel-1.3
gaurav Jan 2, 2024
754dfc0
Uh oh, we're supposed to be on Biolink v3.6.0. Also updated RxNorm.
gaurav Jan 5, 2024
a8c4bbe
Merge branch 'master' into babel-1.3
gaurav Jan 23, 2024
3 changes: 3 additions & 0 deletions Snakefile
@@ -13,6 +13,7 @@ include: "src/snakefiles/taxon.snakefile"
include: "src/snakefiles/genefamily.snakefile"
include: "src/snakefiles/leftover_umls.snakefile"
include: "src/snakefiles/macromolecular_complex.snakefile"
include: "src/snakefiles/exports.snakefile"

rule all:
input:
@@ -28,6 +29,8 @@ rule all:
config['output_directory'] + '/reports/umls_done',
config['output_directory'] + '/reports/macromolecular_complex_done',
config['output_directory'] + '/reports/drugchemical_done',
# Check if we have exported the compendia as KGX.
config['output_directory'] + '/kgx/done',
output:
x = config['output_directory'] + '/reports/all_done'
shell:
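
For context, the new `/kgx/done` target hooks the KGX export (added in src/snakefiles/exports.snakefile, not shown in this excerpt) into the `all` rule. As a rough sketch only, KGX output in JSON-lines form generally looks like the snippet below; the field names follow the public KGX convention, while the filenames, identifiers, and records are invented and are not taken from Babel's actual exporter.

```python
import json

# Hypothetical KGX records: nodes carry id / category / name,
# edges carry subject / predicate / object.
nodes = [
    {"id": "CHEBI:46195", "category": ["biolink:SmallMolecule"], "name": "paracetamol"},
    {"id": "RXCUI:161", "category": ["biolink:Drug"], "name": "acetaminophen"},
]
edges = [
    {"subject": "RXCUI:161", "predicate": "biolink:related_to", "object": "CHEBI:46195"},
]

# KGX consumers usually expect a nodes file and an edges file side by side.
with open("Chemical_nodes.jsonl", "w") as nf:
    for node in nodes:
        nf.write(json.dumps(node) + "\n")
with open("Chemical_edges.jsonl", "w") as ef:
    for edge in edges:
        ef.write(json.dumps(edge) + "\n")
```
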
21 changes: 18 additions & 3 deletions config.json
@@ -4,9 +4,9 @@
"intermediate_directory": "babel_outputs/intermediate",
"output_directory": "babel_outputs",

- "biolink_version": "3.5.4",
- "umls_version": "2023AA",
- "rxnorm_version": "08072023",
+ "biolink_version": "3.6.0",
+ "umls_version": "2023AB",
+ "rxnorm_version": "01022024",

"ncbi_files": ["gene2ensembl.gz", "gene_info.gz", "gene_orthologs.gz", "gene_refseq_uniprotkb_collab.gz", "mim2gene_medgen"],
"ubergraph_ontologies": ["UBERON", "CL", "GO", "NCIT", "ECO", "ECTO", "ENVO", "HP", "UPHENO","BFO","BSPO","CARO","CHEBI","CP","GOREL","IAO","MAXO","MONDO","PATO","PR","RO","UBPROP"],
@@ -57,10 +57,25 @@
"genefamily_ids": ["PANTHER.FAMILY","HGNC.FAMILY"],
"genefamily_outputs": ["GeneFamily.txt"],

"umls_outputs": ["umls.txt"],
"macromolecularcomplex_outputs": ["MacromolecularComplex.txt"],
"ubergraph_iri_stem_to_prefix_map": {
"https://identifiers.org/ncbigene/": "NCBIGene",
"http://www.ncbi.nlm.nih.gov/gene/": "NCBIGene",
"http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=": "HGNC",
"http://www.informatics.jax.org/marker/MGI:": "MGI"
},

"preferred_name_boost_prefixes": {
"biolink:ChemicalEntity": [
"DRUGBANK",
"GTOPDB",
"DrugCentral",
"CHEMBL.COMPOUND",
"RXCUI",
"CHEBI",
"HMDB",
"PUBCHEM.COMPOUND"
]
}
}
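
For illustration, the new preferred_name_boost_prefixes block is keyed by Biolink class: when a clique's type matches one of the keys, the listed prefixes are boosted, in that order, when choosing the clique's preferred name (see the src/babel_utils.py changes below). A minimal sketch of reading that block, assuming a config.json laid out as above:

```python
import json

with open("config.json") as f:
    config = json.load(f)

# Ordered prefixes to boost for chemical cliques: DRUGBANK first,
# PUBCHEM.COMPOUND last, exactly as listed in the config above.
boost = config["preferred_name_boost_prefixes"]["biolink:ChemicalEntity"]

# So a label attached to a DRUGBANK identifier wins over one attached to a
# PUBCHEM.COMPOUND identifier, even if PUBCHEM.COMPOUND appears first in the
# clique's default (Biolink prefix) ordering.
print(boost[0], boost[-1])  # DRUGBANK PUBCHEM.COMPOUND
```
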
2 changes: 1 addition & 1 deletion kubernetes/babel-outputs.k8s.yaml
@@ -15,5 +15,5 @@ spec:
- ReadWriteOnce
resources:
requests:
- storage: 400Gi
+ storage: 500Gi
storageClassName: basic
76 changes: 70 additions & 6 deletions src/babel_utils.py
@@ -1,5 +1,6 @@
import logging
import subprocess
import traceback
from ftplib import FTP
from io import BytesIO
import gzip
@@ -273,6 +274,27 @@ def pull_via_wget(
raise RuntimeError(f'Expected uncompressed file {uncompressed_filename} does not exist.')


def sort_identifiers_with_boosted_prefixes(identifiers, prefixes):
"""
Given a list of identifiers (with `identifier` and `label` keys), sort them using
the following rules:
- Any identifier that has a prefix in prefixes is sorted based on its order in prefixes.
- Any identifier that does not have a prefix in prefixes is left in place.

:param identifiers: A list of identifiers to sort. This is a list of dictionaries
containing `identifier` and `label` keys, and possible others that we ignore.
:param prefixes: A list of prefixes, in the order in which they should be boosted.
We assume that CURIEs match these prefixes if they are in the form `{prefix}:...`.
:return: The list of identifiers sorted as described above.
"""

# Thanks to JetBrains AI.
return sorted(
identifiers,
key=lambda identifier: prefixes.index(identifier['identifier'].split(':', 1)[0]) if identifier['identifier'].split(':', 1)[0] in prefixes else len(prefixes)
)


def write_compendium(synonym_list,ofname,node_type,labels={},extra_prefixes=[],icrdf_filename=None):
"""
:param synonym_list:
@@ -294,6 +316,10 @@ def write_compendium(synonym_list,ofname,node_type,labels={},extra_prefixes=[],i
node_factory = NodeFactory(make_local_name(''),biolink_version)
synonym_factory = SynonymFactory(make_local_name(''))

# Load the preferred_name_boost_prefixes -- this tells us which prefixes to boost when
# coming up with a preferred label for a particular Biolink class.
preferred_name_boost_prefixes = config['preferred_name_boost_prefixes']

# Create an InformationContentFactory based on the specified icRDF.tsv file. Default to the one in the download
# directory.
if not icrdf_filename:
@@ -334,14 +360,51 @@ def write_compendium(synonym_list,ofname,node_type,labels={},extra_prefixes=[],i
# Why are we running the synonym list through set() again? Because get_synonyms returns unique pairs of (relation, synonym).
# So multiple identical synonyms may be returned as long they have a different relation. But since we don't care about the
# relation, we should get rid of any duplicated synonyms here.
- synonyms_list = sorted(set(synonyms), key=lambda x:len(x))
+ synonyms_list = sorted(set(synonyms), key=lambda x: len(x))
try:
+ types = node_factory.get_ancestors(node["type"])
document = {"curie": curie,
"names": synonyms_list,
- "types": [ t[8:] for t in node_factory.get_ancestors(node["type"])]} #remove biolink:
- if "label" in node["identifiers"][0]:
- document["preferred_name"] = node["identifiers"][0]["label"]
+ "types": [t[8:] for t in types]} # remove biolink:

# To pick a preferred label for this clique, we need to do three things:
# 1. We sort all labels in the preferred-name order. By default, this should be
# the preferred CURIE order, but if this clique is in one of the Biolink classes in
# preferred_name_boost_prefixes, we boost those prefixes in that order to the top of the list.
# 2. We filter out any suspicious labels.
# (If this simple filter doesn't work, and if prefixes are inconsistent, we can build upon the
# algorithm proposed by Jeff at
# https://github.com/NCATSTranslator/Feedback/issues/259#issuecomment-1605140850)
# 3. We choose the first label that isn't blank. If no labels remain, we generate a warning.

# Step 1.1. Sort labels in boosted prefix order if possible.
possible_labels = []
for typ in types:
if typ in preferred_name_boost_prefixes:
# This is the most specific matching type, so we use this.
possible_labels = map(lambda identifier: identifier.get('label', ''),
sort_identifiers_with_boosted_prefixes(
node["identifiers"],
preferred_name_boost_prefixes[typ]
))
break

# Step 1.2. If we didn't have a preferred_name_boost_prefixes, just use the identifiers in their
# Biolink prefix order.
if not possible_labels:
possible_labels = map(lambda identifier: identifier.get('label', ''), node["identifiers"])

# Step 2. Filter out any suspicious labels.
filtered_possible_labels = [l for l in possible_labels if
l and # Ignore blank or empty names.
not l.startswith('CHEMBL') # Some CHEMBL names are just the identifier again.
]

# Step 3. Pick the first label that isn't blank.
if filtered_possible_labels:
document["preferred_name"] = filtered_possible_labels[0]
else:
logging.debug(f"No preferred name for {node}")

# We previously used the shortest length of a name as a proxy for how good a match it is, i.e. given
# two concepts that both have the word "acetaminophen" in them, we assume that the shorter one is the
Expand All @@ -351,7 +414,7 @@ def write_compendium(synonym_list,ofname,node_type,labels={},extra_prefixes=[],i

# Since synonyms_list is sorted,
if len(synonyms_list) == 0:
- logging.warning(f"Synonym list for {node} is empty: no valid name. Skipping.")
+ logging.debug(f"Synonym list for {node} is empty: no valid name. Skipping.")
continue
else:
document["shortest_name_length"] = len(synonyms_list[0])
@@ -371,6 +434,7 @@ def write_compendium(synonym_list,ofname,node_type,labels={},extra_prefixes=[],i
print(f"Exception thrown while write_compendium() was generating {ofname}: {ex}")
print(node["type"])
print(node_factory.get_ancestors(node["type"]))
traceback.print_exc()
exit()

def glom(conc_set, newgroups, unique_prefixes=['INCHIKEY'],pref='HP',close={}):
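
To make the new preferred-name logic concrete, the snippet below walks the three steps described in the comments above: sort the clique's identifiers with the boosted prefixes, filter out blank or CHEMBL-style labels, then take the first survivor. The sort helper is restated so the sketch runs on its own, and the example clique is invented for illustration.

```python
def sort_identifiers_with_boosted_prefixes(identifiers, prefixes):
    # Identifiers whose prefix appears in `prefixes` sort to the front in that
    # order; the rest keep their relative order at the back (sorted() is stable).
    return sorted(
        identifiers,
        key=lambda ident: prefixes.index(ident['identifier'].split(':', 1)[0])
        if ident['identifier'].split(':', 1)[0] in prefixes else len(prefixes)
    )

# An invented biolink:ChemicalEntity clique whose first identifier is PUBCHEM.COMPOUND.
identifiers = [
    {"identifier": "PUBCHEM.COMPOUND:1983", "label": "1983"},
    {"identifier": "CHEMBL.COMPOUND:CHEMBL112", "label": "CHEMBL112"},
    {"identifier": "CHEBI:46195", "label": "paracetamol"},
]
boost = ["DRUGBANK", "GTOPDB", "DrugCentral", "CHEMBL.COMPOUND",
         "RXCUI", "CHEBI", "HMDB", "PUBCHEM.COMPOUND"]

# Step 1: boosted prefix order (CHEMBL.COMPOUND, CHEBI, PUBCHEM.COMPOUND).
possible_labels = [i.get("label", "")
                   for i in sort_identifiers_with_boosted_prefixes(identifiers, boost)]

# Step 2: drop blank labels and CHEMBL labels that just repeat the identifier.
filtered_possible_labels = [l for l in possible_labels
                            if l and not l.startswith("CHEMBL")]

# Step 3: the first surviving label becomes the preferred name.
print(filtered_possible_labels[0] if filtered_possible_labels else None)  # -> paracetamol
```
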
15 changes: 14 additions & 1 deletion src/createcompendia/drugchemical.py
@@ -273,11 +273,24 @@ def build_conflation(rxn_concord,umls_concord,pubchem_rxn_concord,drug_compendiu
x = line.strip().split('\t')
subject = x[0]
object = x[2]
#object is a PUBCHEM. It's by definition a clique_leader.

if subject in drug_rxcui_to_clique:
subject = drug_rxcui_to_clique[subject]
elif subject in chemical_rxcui_to_clique:
subject = chemical_rxcui_to_clique[subject]
else:
raise RuntimeError(f"Unknown identifier in drugchemical conflation as subject: {subject}")

if object in drug_rxcui_to_clique:
object = drug_rxcui_to_clique[object]
elif object in chemical_rxcui_to_clique:
object = chemical_rxcui_to_clique[object]
else:
logging.warning(
f"Object in subject-object pair ({subject}, {object}) isn't mapped to a RxCUI"
)
# raise RuntimeError(f"Unknown identifier in drugchemical conflation as object: {object}")

pairs.append((subject, object))
print("glom")
gloms = {}
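
To spell out the behavioral change in build_conflation shown above: a subject that cannot be mapped to a clique still raises, but an unmapped object is now only logged as a warning and the pair is kept. A small hedged sketch of that mapping step, with made-up dictionaries standing in for the real concord data:

```python
import logging

# Invented mappings standing in for the drug/chemical RxCUI-to-clique tables.
drug_rxcui_to_clique = {"RXCUI:161": "CHEBI:46195"}
chemical_rxcui_to_clique = {}

def map_pair(subject, obj):
    # Subjects must resolve to a clique leader, or we abort (unchanged behavior).
    if subject in drug_rxcui_to_clique:
        subject = drug_rxcui_to_clique[subject]
    elif subject in chemical_rxcui_to_clique:
        subject = chemical_rxcui_to_clique[subject]
    else:
        raise RuntimeError(f"Unknown identifier in drugchemical conflation as subject: {subject}")

    # Objects that don't resolve are logged and passed through unchanged
    # instead of raising -- the change made in this PR.
    if obj in drug_rxcui_to_clique:
        obj = drug_rxcui_to_clique[obj]
    elif obj in chemical_rxcui_to_clique:
        obj = chemical_rxcui_to_clique[obj]
    else:
        logging.warning(f"Object in subject-object pair ({subject}, {obj}) isn't mapped to a RxCUI")

    return subject, obj

print(map_pair("RXCUI:161", "PUBCHEM.COMPOUND:1983"))
```
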
1 change: 1 addition & 0 deletions src/createcompendia/protein.py
@@ -65,6 +65,7 @@ def write_ensembl_ids(ensembl_dir, outfile):
dlpath = os.path.join(ensembl_dir, dl)
if os.path.isdir(dlpath):
infname = os.path.join(dlpath, 'BioMart.tsv')
print(f'write_ensembl_ids for input filename {infname}')
if os.path.exists(infname):
# open each ensembl file, find the id column, and put it in the output
with open(infname, 'r') as inf:
6 changes: 6 additions & 0 deletions src/datahandlers/ensembl.py
@@ -12,11 +12,17 @@
# just what we need.
def pull_ensembl(complete_file):
f = find_datasets()

skip_dataset_ids = {'hgfemale_gene_ensembl'}

cols = {"ensembl_gene_id", "ensembl_peptide_id", "description", "external_gene_name", "external_gene_source",
"external_synonym", "chromosome_name", "source", "gene_biotype", "entrezgene_id", "zfin_id_id", 'mgi_id',
'rgd_id', 'flybase_gene_id', 'sgd_gene', 'wormbase_gene'}
for ds in f['Dataset_ID']:
print(ds)
if ds in skip_dataset_ids:
print(f'Skipping {ds} as it is included in skip_dataset_ids: {skip_dataset_ids}')
continue
outfile = make_local_name('BioMart.tsv', subpath=f'ENSEMBL/{ds}')
# Really, we should let snakemake handle this, but then we would need to put a list of all the 200+ sets in our
# config, and keep it up to date. Maybe you could have a job that gets the datasets and writes a dataset file,
12 changes: 9 additions & 3 deletions src/datahandlers/hgnc.py
@@ -1,9 +1,15 @@
- from src.babel_utils import make_local_name, pull_via_ftp
+ from src.babel_utils import make_local_name, pull_via_urllib
import json

def pull_hgnc():
outfile='HGNC/hgnc_complete_set.json'
- pull_via_ftp('ftp.ebi.ac.uk', '/pub/databases/genenames/new/json', 'hgnc_complete_set.json',outfilename=outfile)
+ # On 2023nov26, I would get an error trying to download this file using FTP on Python (although
+ # weirdly enough, I could download the file without any problem using macOS Finder). So I changed
+ # it to use HTTP instead.
+ pull_via_urllib(
+ 'https://ftp.ebi.ac.uk/pub/databases/genenames/new/json/',
+ 'hgnc_complete_set.json',
+ decompress=False,
+ subpath="HGNC")

def pull_hgnc_labels_and_synonyms(infile):
with open(infile,'r') as data:
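
The pull_hgnc change above swaps pull_via_ftp for pull_via_urllib because the EBI FTP endpoint was failing from Python. As a rough standalone equivalent (not the repo's pull_via_urllib helper), the same HTTPS download with only the standard library might look like this; the local output path is made up for the example:

```python
import os
import urllib.request

url = "https://ftp.ebi.ac.uk/pub/databases/genenames/new/json/hgnc_complete_set.json"
outfile = os.path.join("babel_downloads", "HGNC", "hgnc_complete_set.json")

os.makedirs(os.path.dirname(outfile), exist_ok=True)
# Fetch over HTTPS, sidestepping the flaky FTP transfer.
urllib.request.urlretrieve(url, outfile)
print(f"Downloaded {os.path.getsize(outfile)} bytes to {outfile}")
```
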