BioConceptVec:
creating and evaluating literature-based biomedical concept embeddings on a large scale
- Text corpora
- Named Entity Recognition (NER) tools
- BioConceptVec: embeddings and concept files
- Tutorial
- Datasets
- References
- Acknowledgments
We created BioConceptVec using the entire PubMed corpus. The text was split and tokenized with NLTK, and all words were lowercased.
We used PubTator to annotate biomedical concepts in PubMed. The annotations cover genes, mutations, chemicals, diseases, and cell lines. The trained embeddings contain over 400,000 concepts.
We release four versions of BioConceptVec (cbow, skip-gram, glove and fastText). For each version, we provide both the full embedding (concepts and other words) in binary format and a concept-only file in JSON format.
- BioConceptVec cbow: embedding (2.4GB) and concept-only (798MB).
- BioConceptVec skip-gram: embedding (2.4GB) and concept-only (812MB).
- BioConceptVec glove: embedding (2.4GB) and concept-only (835MB).
- BioConceptVec fastText: embedding (2.4GB) and concept-only (813MB).
For a quick start, see the tutorial on how to use BioConceptVec with both the embedding and the concept-only files.
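As a rough illustration of working with a concept-only file, the sketch below loads a JSON file and ranks concepts by cosine similarity. It assumes the file is a single JSON object mapping each concept ID to a list of floats, which is not specified here. The concept IDs (`Gene_1`, `Gene_2`, `Disease_1`), the toy file, and the helper names are hypothetical stand-ins so that the example runs without downloading anything.

```python
import json
import math
import os
import tempfile


def load_concept_vectors(path):
    """Load a concept-only file, assumed to be {concept_id: [float, ...]}."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(vectors, query_id, top_n=3):
    """Return the top_n concepts closest to query_id, best first."""
    query = vectors[query_id]
    scored = [
        (concept_id, cosine(query, vector))
        for concept_id, vector in vectors.items()
        if concept_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy data standing in for a real concept-only file.
toy = {
    "Gene_1": [1.0, 0.0, 0.0],
    "Gene_2": [0.8, 0.6, 0.0],
    "Disease_1": [0.0, 0.0, 1.0],
}

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_concepts.json")
    with open(toy_path, "w", encoding="utf-8") as handle:
        json.dump(toy, handle)
    vectors = load_concept_vectors(toy_path)

neighbours = most_similar(vectors, "Gene_1", top_n=2)
print(neighbours)
```

With a real concept-only file, only the path passed to `load_concept_vectors` would change, provided the layout assumption holds. Loading a file of several hundred megabytes this way keeps every vector in memory as Python lists, so converting to arrays may be worthwhile for heavier use.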
We also make all 9 evaluation datasets publicly available. They cover 4 applications:
- Drug-Gene interactions. Each instance contains (1) ID: the instance ID, (2) num_of_genes: the number of genes for this instance, (3) pos_rel_genes: the IDs of related genes, and (4) neg_rel_genes: the IDs of unrelated genes.
- Gene-Gene interactions. The 5 gene-gene interaction datasets use the same format as above.
- Protein-Protein interactions. There are two datasets: (1) combined: protein-protein interactions created based on STRING combined scores and (2) exp700: protein-protein interactions created based on STRING experimental scores over 700. Both datasets contain train, valid and test files. Each file contains (1) query: query protein ID, (2) subject: subject protein ID, (3) score: STRING score and (4) label: whether the pair is a protein-protein interaction.
- Drug-Drug interactions. This dataset comes from the SemEval-2013 drug-drug interaction task. Please see the details there.
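To make the drug-gene and gene-gene instance format more concrete, here is a minimal sketch of one way such an instance could be scored against a set of concept vectors: it counts how often a related gene is more similar to the query than an unrelated gene. This is an illustrative metric, not necessarily the evaluation used in the paper. It assumes that an instance is held as a Python dict with the four fields listed above and that the instance ID is also the query concept's key in the vectors, neither of which is stated here. All IDs and vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pairwise_accuracy(instance, vectors):
    """Fraction of (related, unrelated) gene pairs ranked correctly.

    Assumes instance["ID"] is a key in `vectors` (an assumption).
    Genes without a vector are skipped; ties count as half a win.
    Returns None when no comparable pair exists.
    """
    query = vectors[instance["ID"]]
    pos = [cosine(query, vectors[g]) for g in instance["pos_rel_genes"] if g in vectors]
    neg = [cosine(query, vectors[g]) for g in instance["neg_rel_genes"] if g in vectors]
    pairs = len(pos) * len(neg)
    if pairs == 0:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / pairs


# Hypothetical toy vectors and instance, for illustration only.
vectors = {
    "Chemical_X": [1.0, 0.0],
    "Gene_A": [0.9, 0.1],
    "Gene_B": [0.0, 1.0],
    "Gene_C": [-1.0, 0.2],
}
instance = {
    "ID": "Chemical_X",
    "num_of_genes": 3,
    "pos_rel_genes": ["Gene_A"],
    "neg_rel_genes": ["Gene_B", "Gene_C"],
}

score = pairwise_accuracy(instance, vectors)
print(score)
```

In this toy case the single related gene is closer to the query than both unrelated genes, so every pair is ranked correctly. Averaging the per-instance score over a dataset would give one summary number per embedding version.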
When using our resources, please cite the following paper:
Chen, Q., Lee, K., Yan, S., Kim, S., Wei, C. H., & Lu, Z. (2019). BioConceptVec: creating and evaluating literature-based biomedical concept embeddings on a large scale. To appear in PLOS Computational Biology.
This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine.