CogKGE: A Knowledge Graph Embedding Toolkit and Benchmark for Representing Multi-source and Heterogeneous Knowledge
The demo system and more information are available at http://cognlp.com/cogkge
- Mar 2022: We have added support for a node classification framework.
- Jan 2022: We have released CogKGE v1.0.
CogKGE is a knowledge graph embedding toolkit that aims to represent multi-source and heterogeneous knowledge. CogKGE currently supports 17 models, 11 datasets including two multi-source heterogeneous KGs, five evaluation metrics, four knowledge adapters, four loss functions, three samplers and three built-in data containers.
This easy-to-use Python package has the following advantages:
- Multi-source and heterogeneous knowledge representation. CogKGE explores the unified representation of knowledge from diverse sources. Moreover, our toolkit not only contains triple fact-based embedding models, but also supports the fusion representation of additional information, including text descriptions, node types and temporal information.
- Comprehensive models and benchmark datasets. CogKGE implements many classic KGE models across four categories: translation distance models, semantic matching models, graph neural network-based models and transformer-based models. Besides out-of-the-box models, we release two large benchmark datasets, EventKG240K and CogNet360K, for further evaluating KGE methods.
- Extensible and modularized framework. CogKGE provides a programming framework for KGE tasks. Based on the extensible architecture, CogKGE can meet the requirements of module extension and secondary development, and pre-trained knowledge embeddings can be directly applied to downstream tasks.
- Open source and visualization demo. Besides the toolkit, we also release an online system to discover knowledge visually. Source code, datasets and pre-trained embeddings are publicly available.
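# clone CogKGE
git clone https://github.com/jinzhuoran/CogKGE.git
# install CogKGE from source (the repository is cloned into the CogKGE directory)
cd CogKGE
pip install -e .
pip install -r requirements.txt
# alternatively, install the released package with pip
pip install cogkge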
from cogkge import *
# load the data and lookup tables (LUTs)
device = init_cogkge(device_id="0", seed=1)
loader = EVENTKG2MLoader(dataset_path="../dataset", download=True)
train_data, valid_data, test_data = loader.load_all_data()
node_lut, relation_lut, time_lut = loader.load_all_lut()
processor = EVENTKG2MProcessor(node_lut, relation_lut, time_lut,
                               reprocess=True,
                               type=False, time=False, description=False, path=False,
                               time_unit="year",
                               pretrain_model_name="roberta-base", token_len=10,
                               path_len=10)
node_lut, relation_lut, time_lut = processor.process_lut()
# load the model
model = BoxE(entity_dict_len=len(node_lut),
             relation_dict_len=len(relation_lut),
             embedding_dim=50)
# load the predictor
predictor = Predictor(model_name="BoxE",
                      data_name="EVENTKG2M",
                      model=model,
                      device=device,
                      node_lut=node_lut,
                      relation_lut=relation_lut,
                      pretrained_model_path="data/BoxE_Model.pkl",
                      processed_data_path="data",
                      reprocess=False,
                      fuzzy_query_top_k=10,
                      predict_top_k=10)
# fuzzy query node
result_node = predictor.fuzzy_query_node_keyword('champion')
print(result_node)
# fuzzy query relation
result_relation = predictor.fuzzy_query_relation_keyword("instance")
print(result_relation)
# query similar nodes
similar_node_list = predictor.predict_similar_node(node_id=0)
print(similar_node_list)
# given head and relation, query tail
tail_list = predictor.predict_tail(head_id=0, relation_id=0)
print(tail_list)
# given tail and relation, query head
head_list = predictor.predict_head(tail_id=0, relation_id=0)
print(head_list)
# given head and tail, query relation
relation_list = predictor.predict_relation(head_id=0, tail_id=0)
print(relation_list)
# dimensionality reduction and visualization of nodes
visual_list = predictor.show_img(node_id=100, visual_num=1000)
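How `predict_similar_node` ranks candidates internally is not described here. A common approach, shown below as a minimal sketch and not as CogKGE's implementation, is to rank all other nodes by the cosine similarity of their embeddings to the query node. The function names `cosine_similarity` and `top_k_similar` and the toy embedding values are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def top_k_similar(embeddings, node_id, k=10):
    """Return the ids of the k nodes most similar to node_id (excluding itself)."""
    query = embeddings[node_id]
    scored = [
        (cosine_similarity(query, vec), other_id)
        for other_id, vec in enumerate(embeddings)
        if other_id != node_id
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other_id for _, other_id in scored[:k]]


# Toy 2-dimensional embeddings (hypothetical values, not from a trained model).
toy_embeddings = [
    [1.0, 0.0],   # node 0
    [0.9, 0.1],   # node 1: nearly parallel to node 0
    [0.0, 1.0],   # node 2: orthogonal to node 0
    [-1.0, 0.0],  # node 3: opposite to node 0
]
print(top_k_similar(toy_embeddings, node_id=0, k=2))  # [1, 2]
```

With real embeddings the same ranking is usually computed as a single matrix product on the GPU; the loop form above is only meant to make the logic explicit.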
import torch
from torch.utils.data import RandomSampler
from cogkge import *
# load the data and lookup tables (LUTs)
device = init_cogkge(device_id="0", seed=1)
loader = EVENTKG2MLoader(dataset_path="../dataset", download=True)
train_data, valid_data, test_data = loader.load_all_data()
node_lut, relation_lut, time_lut = loader.load_all_lut()
# process the raw data into datasets (node types are enabled here)
processor = EVENTKG2MProcessor(node_lut, relation_lut, time_lut,
                               reprocess=True,
                               type=True, time=False, description=False, path=False,
                               time_unit="year",
                               pretrain_model_name="roberta-base", token_len=10,
                               path_len=10)
train_dataset = processor.process(train_data)
valid_dataset = processor.process(valid_data)
test_dataset = processor.process(test_data)
node_lut, relation_lut, time_lut = processor.process_lut()
# samplers for the three splits
train_sampler = RandomSampler(train_dataset)
valid_sampler = RandomSampler(valid_dataset)
test_sampler = RandomSampler(test_dataset)
# model, loss, optimizer and metric
model = TransE(entity_dict_len=len(node_lut),
               relation_dict_len=len(relation_lut),
               embedding_dim=50)
loss = MarginLoss(margin=1.0, C=0)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0)
metric = Link_Prediction(link_prediction_raw=True,
                         link_prediction_filt=False,
                         batch_size=5000000,
                         reverse=False)
# halve the learning rate when the monitored value stops improving
lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', patience=3, threshold_mode='abs', threshold=5,
    factor=0.5, min_lr=1e-9, verbose=True
)
# negative sampler that corrupts training triples
negative_sampler = UnifNegativeSampler(triples=train_dataset,
                                       entity_dict_len=len(node_lut),
                                       relation_dict_len=len(relation_lut))
# train the model
trainer = Trainer(
    train_dataset=train_dataset,
    valid_dataset=valid_dataset,
    train_sampler=train_sampler,
    valid_sampler=valid_sampler,
    model=model,
    loss=loss,
    optimizer=optimizer,
    negative_sampler=negative_sampler,
    device=device,
    output_path="../dataset",
    lookuptable_E=node_lut,
    lookuptable_R=relation_lut,
    metric=metric,
    lr_scheduler=lr_scheduler,
    log=True,
    trainer_batch_size=100000,
    epoch=3000,
    visualization=1,
    apex=True,
    dataloaderX=True,
    num_workers=4,
    pin_memory=True,
    metric_step=200,
    save_step=200,
    metric_final_model=True,
    save_final_model=True,
    load_checkpoint=None
)
trainer.train()
# evaluate the trained model on the test split
evaluator = Evaluator(
    test_dataset=test_dataset,
    test_sampler=test_sampler,
    model=model,
    device=device,
    metric=metric,
    output_path="../dataset",
    train_dataset=train_dataset,
    valid_dataset=valid_dataset,
    lookuptable_E=node_lut,
    lookuptable_R=relation_lut,
    log=True,
    evaluator_batch_size=50000,
    dataloaderX=True,
    num_workers=4,
    pin_memory=True,
    trained_model_path=None
)
evaluator.evaluate()
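The `Link_Prediction` metric above takes `link_prediction_raw` and `link_prediction_filt` flags. In the KGE literature these names usually refer to raw and filtered ranking; assuming CogKGE follows that convention, the difference can be sketched as below. In the raw setting the true tail is ranked against every candidate, while in the filtered setting other tails already known to be correct are skipped. The helper names and the toy scores are hypothetical, and ties are resolved optimistically.

```python
def rank_of_true_tail(scores, true_tail, known_tails=None):
    """1-based rank of true_tail among candidates sorted by descending score.

    scores: dict mapping candidate id -> score (higher is better).
    known_tails: other tails known to be correct for the same (head, relation);
    when given, they are ignored during ranking (the "filtered" setting).
    """
    excluded = set(known_tails or ()) - {true_tail}
    target = scores[true_tail]
    better = sum(
        1
        for candidate, score in scores.items()
        if candidate not in excluded and candidate != true_tail and score > target
    )
    return better + 1


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all test queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of test queries whose true answer is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Toy scores for one (head, relation) query; the true tail is candidate 2,
# and candidate 0 is another correct tail seen in the training data.
toy_scores = {0: 0.9, 1: 0.8, 2: 0.7, 3: 0.1}
raw_rank = rank_of_true_tail(toy_scores, true_tail=2)
filtered_rank = rank_of_true_tail(toy_scores, true_tail=2, known_tails={0})
print(raw_rank, filtered_rank)  # 3 2
print(hits_at_k([raw_rank, filtered_rank], k=2))  # 0.5
```

Filtered ranks are never worse than raw ranks, which is why filtered results are typically the ones reported for comparison across papers.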
| Category | Model | Conference | Paper |
|---|---|---|---|
| Translation Distance Models | TransE | NIPS 2013 | Translating embeddings for modeling multi-relational data |
| | TransH | AAAI 2014 | Knowledge Graph Embedding by Translating on Hyperplanes |
| | TransR | AAAI 2015 | Learning Entity and Relation Embeddings for Knowledge Graph Completion |
| | TransD | ACL 2015 | Knowledge Graph Embedding via Dynamic Mapping Matrix |
| | TransA | AAAI 2015 | TransA: An Adaptive Approach for Knowledge Graph Embedding |
| | BoxE | NIPS 2020 | BoxE: A Box Embedding Model for Knowledge Base Completion |
| | PairRE | ACL 2021 | PairRE: Knowledge Graph Embeddings via Paired Relation Vectors |
| Semantic Matching Models | RESCAL | ICML 2011 | A Three-Way Model for Collective Learning on Multi-Relational Data |
| | DistMult | ICLR 2015 | Embedding Entities and Relations for Learning and Inference in Knowledge Bases |
| | SimplE | NIPS 2018 | SimplE Embedding for Link Prediction in Knowledge Graphs |
| | TuckER | EMNLP 2019 | TuckER: Tensor Factorization for Knowledge Graph Completion |
| | RotatE | ICLR 2019 | RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space |
| Graph Neural Network-based Models | R-GCN | ESWC 2018 | Modeling Relational Data with Graph Convolutional Networks |
| | CompGCN | ICLR 2020 | Composition-based Multi-Relational Graph Convolutional Networks |
| Transformer-based Models | HittER | EMNLP 2021 | HittER: Hierarchical Transformers for Knowledge Graph Embeddings |
| | KEPLER | TACL 2021 | KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation |
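To make the first two categories concrete, the sketch below contrasts the scoring functions of one model from each, as defined in their papers: TransE scores a triple by the distance between `h + r` and `t`, and DistMult by a trilinear product. This is a plain-Python illustration with hypothetical toy vectors, not the code CogKGE uses, and the L2 norm is one of the two norms TransE allows.

```python
import math


def transe_score(h, r, t):
    """Translation distance: negative L2 norm of h + r - t (higher is better)."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def distmult_score(h, r, t):
    """Semantic matching: trilinear product sum_i h_i * r_i * t_i."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


# Toy 2-dimensional embeddings (hypothetical values).
head = [1.0, 2.0]
relation = [0.5, -1.0]
tail_good = [1.5, 1.0]  # exactly head + relation
tail_bad = [0.0, 0.0]

# TransE prefers the tail that sits at head + relation.
print(transe_score(head, relation, tail_good) > transe_score(head, relation, tail_bad))  # True

# DistMult is symmetric in head and tail, so it cannot distinguish
# (h, r, t) from (t, r, h).
print(distmult_score(head, relation, tail_good) == distmult_score(tail_good, relation, head))  # True
```

The symmetry shown in the last line is a known limitation of DistMult for asymmetric relations, and is one motivation for later models in the table such as SimplE and RotatE.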
EventKG is an event-centric temporal knowledge graph, which incorporates over 690 thousand contemporary and historical events and over 2.3 million temporal relations. To the best of our knowledge, EventKG240K is the first event-centric KGE dataset. We use EventKG V3.0 data to construct the dataset. First, we filter entities and events based on their degrees. Then, we select the triple facts in which both nodes' degrees are greater than 10. Finally, we add text descriptions and node types for nodes and convert triples to quadruples using temporal information. The whole dataset contains 238,911 nodes, 822 relations and 2,333,986 triples.
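The triple selection and the triple-to-quadruple step can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used to build EventKG240K: degree is taken to be the number of triples a node appears in as head or tail, the filter is applied in a single pass, and the function names, the `time_of` mapping and the toy data are hypothetical.

```python
from collections import Counter


def filter_by_degree(triples, min_degree=10):
    """Keep triples whose head and tail both have degree greater than min_degree."""
    degree = Counter()
    for head, _, tail in triples:
        degree[head] += 1
        degree[tail] += 1
    return [
        (h, r, t)
        for h, r, t in triples
        if degree[h] > min_degree and degree[t] > min_degree
    ]


def to_quadruples(triples, time_of):
    """Attach a timestamp to each triple; None when no temporal information exists."""
    return [(h, r, t, time_of.get((h, r, t))) for h, r, t in triples]


# Toy graph: degrees are a=2, b=2, c=3, d=1.
toy_triples = [("a", "r", "b"), ("a", "r", "c"), ("b", "r", "c"), ("c", "r", "d")]
# A threshold of 1 is used so that the toy data shows the effect;
# the dataset description uses 10.
kept = filter_by_degree(toy_triples, min_degree=1)
print(kept)  # [('a', 'r', 'b'), ('a', 'r', 'c'), ('b', 'r', 'c')]

toy_times = {("a", "r", "b"): 1969}  # hypothetical year for one fact
print(to_quadruples(kept, toy_times))
# [('a', 'r', 'b', 1969), ('a', 'r', 'c', None), ('b', 'r', 'c', None)]
```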
CogNet is a multi-source heterogeneous KG dedicated to integrating linguistic, world and commonsense knowledge. To build the dataset as a subset of CogNet, we first count the number of occurrences of each node. Then, we sort frame instances by the minimum number of occurrences among their connected nodes. From the sorted list, we filter the triple facts according to the preset frame categories. Finally, we find the nodes that participate in these triple facts and complete their information. The final dataset contains 360,637 nodes and 1,470,488 triples.
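The subset construction described above can be sketched roughly as follows. Several details are assumptions made for illustration: each frame instance is represented as a dict with `id`, `category` and a non-empty `nodes` list, the sort is descending so that instances with well-connected nodes come first, and the category names are made up. The actual CogNet360K pipeline may differ.

```python
from collections import Counter


def select_frame_instances(frame_instances, allowed_categories):
    """Rank frame instances by their rarest node, then keep the allowed categories.

    Returns the kept instances (in ranked order) and the sorted list of nodes
    that participate in them.
    """
    # Step 1: count how often each node occurs across all frame instances.
    occurrences = Counter(
        node for instance in frame_instances for node in instance["nodes"]
    )
    # Step 2: sort by the minimum occurrence count of the connected nodes
    # (descending order is an assumption; sorted() is stable for ties).
    ranked = sorted(
        frame_instances,
        key=lambda instance: min(occurrences[n] for n in instance["nodes"]),
        reverse=True,
    )
    # Step 3: keep only instances whose frame category is in the preset list.
    kept = [inst for inst in ranked if inst["category"] in allowed_categories]
    # Step 4: collect the nodes that participate in the kept instances.
    nodes = sorted({n for inst in kept for n in inst["nodes"]})
    return kept, nodes


# Toy data: node occurrences are x=3, y=2, z=2, w=1.
toy_instances = [
    {"id": "f4", "category": "Birth", "nodes": ["w", "z"]},
    {"id": "f1", "category": "Birth", "nodes": ["x", "y"]},
    {"id": "f3", "category": "Other", "nodes": ["x", "y"]},
    {"id": "f2", "category": "Birth", "nodes": ["x", "z"]},
]
kept_instances, kept_nodes = select_frame_instances(toy_instances, {"Birth"})
print([inst["id"] for inst in kept_instances])  # ['f1', 'f2', 'f4']
print(kept_nodes)  # ['w', 'x', 'y', 'z']
```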