CogKGE: A Knowledge Graph Embedding Toolkit and Benchmark for Representing Multi-source and Heterogeneous Knowledge. ACL 2022

A demo system and more information are available at http://cognlp.com/cogkge

What's New?

  • Mar 2022: We now support a node classification framework.
  • Jan 2022: We have released CogKGE v1.0.

Description

CogKGE is a knowledge graph embedding toolkit that aims to represent multi-source and heterogeneous knowledge. CogKGE currently supports 17 models, 11 datasets including two multi-source heterogeneous KGs, five evaluation metrics, four knowledge adapters, four loss functions, three samplers and three built-in data containers.

This easy-to-use Python package has the following advantages:

  • Multi-source and heterogeneous knowledge representation. CogKGE explores the unified representation of knowledge from diverse sources. Moreover, our toolkit not only contains the triple fact-based embedding models, but also supports the fusion representation of additional information, including text descriptions, node types and temporal information.

  • Comprehensive models and benchmark datasets. CogKGE implements many classic KGE models in four categories: translation distance models, semantic matching models, graph neural network-based models and transformer-based models. Besides these out-of-the-box models, we release two large benchmark datasets for further evaluating KGE methods: EventKG240K and CogNet360K.

  • Extensible and modularized framework. CogKGE provides a programming framework for KGE tasks. Based on the extensible architecture, CogKGE can meet the requirements of module extension and secondary development, and pre-trained knowledge embeddings can be directly applied to downstream tasks.

  • Open source and visualization demo. Besides the toolkit, we also release an online system to discover knowledge visually. Source code, datasets and pre-trained embeddings are publicly available.

Install

Install from git

# clone CogKGE
git clone https://github.com/jinzhuoran/CogKGE.git

# install CogKGE (the clone creates a directory named CogKGE)
cd CogKGE
pip install -e .
pip install -r requirements.txt

Install from pip

pip install cogkge

Quick Start

Pre-trained Embedder for Knowledge Discovery

from cogkge import *

# load data and lookup tables (LUTs)
device = init_cogkge(device_id="0", seed=1)
loader = EVENTKG2MLoader(dataset_path="../dataset", download=True)
train_data, valid_data, test_data = loader.load_all_data()
node_lut, relation_lut, time_lut = loader.load_all_lut()
processor = EVENTKG2MProcessor(node_lut, relation_lut, time_lut,
                               reprocess=True,
                               type=False, time=False, description=False, path=False,
                               time_unit="year",
                               pretrain_model_name="roberta-base", token_len=10,
                               path_len=10)
node_lut, relation_lut, time_lut = processor.process_lut()

# load model
model = BoxE(entity_dict_len=len(node_lut),
             relation_dict_len=len(relation_lut),
             embedding_dim=50)

# load predictor
predictor = Predictor(model_name="BoxE",
                      data_name="EVENTKG2M",
                      model=model,
                      device=device,
                      node_lut=node_lut,
                      relation_lut=relation_lut,
                      pretrained_model_path="data/BoxE_Model.pkl",
                      processed_data_path="data",
                      reprocess=False,
                      fuzzy_query_top_k=10,
                      predict_top_k=10)

# fuzzy query node
result_node = predictor.fuzzy_query_node_keyword('champion')
print(result_node)

# fuzzy query relation
result_relation = predictor.fuzzy_query_relation_keyword("instance")
print(result_relation)

# query similar nodes
similar_node_list = predictor.predict_similar_node(node_id=0)
print(similar_node_list)

# given head and relation, query tail
tail_list = predictor.predcit_tail(head_id=0, relation_id=0)
print(tail_list)

# given tail and relation, query head
head_list = predictor.predict_head(tail_id=0, relation_id=0)
print(head_list)

# given head and tail, query relation
relation_list = predictor.predict_relation(head_id=0, tail_id=0)
print(relation_list)

# dimensionality reduction and visualization of nodes
visual_list = predictor.show_img(node_id=100, visual_num=1000)
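
Conceptually, a similar-node query ranks nodes by how close their embeddings are to the embedding of the query node. The sketch below illustrates the idea with cosine similarity over a toy embedding table. The helper name `top_k_similar`, the toy vectors and the choice of cosine similarity are assumptions made for illustration; this is not CogKGE's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_similar(embeddings, node_id, k):
    """Return the k node ids most similar to node_id (hypothetical helper)."""
    query = embeddings[node_id]
    scored = [(other, cosine(query, vec))
              for other, vec in embeddings.items() if other != node_id]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]

# Toy 2-d embeddings keyed by node id (made-up values).
toy_embeddings = {
    0: [1.0, 0.0],
    1: [0.9, 0.1],
    2: [0.0, 1.0],
    3: [-1.0, 0.0],
}

similar = top_k_similar(toy_embeddings, node_id=0, k=2)
print(similar)
```

With real pre-trained embeddings, the same ranking would run over every node in the lookup table, so a vectorized or approximate nearest-neighbor search would likely be preferable.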

Programming Framework for Training Models

import torch
from torch.utils.data import RandomSampler
from cogkge import *

device = init_cogkge(device_id="0", seed=1)

loader = EVENTKG2MLoader(dataset_path="../dataset", download=True)
train_data, valid_data, test_data = loader.load_all_data()
node_lut, relation_lut, time_lut = loader.load_all_lut()

processor = EVENTKG2MProcessor(node_lut, relation_lut, time_lut,
                               reprocess=True,
                               type=True, time=False, description=False, path=False,
                               time_unit="year",
                               pretrain_model_name="roberta-base", token_len=10,
                               path_len=10)
train_dataset = processor.process(train_data)
valid_dataset = processor.process(valid_data)
test_dataset = processor.process(test_data)
node_lut, relation_lut, time_lut = processor.process_lut()

train_sampler = RandomSampler(train_dataset)
valid_sampler = RandomSampler(valid_dataset)
test_sampler = RandomSampler(test_dataset)

model = TransE(entity_dict_len=len(node_lut),
               relation_dict_len=len(relation_lut),
               embedding_dim=50)

loss = MarginLoss(margin=1.0, C=0)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0)

metric = Link_Prediction(link_prediction_raw=True,
                         link_prediction_filt=False,
                         batch_size=5000000,
                         reverse=False)

lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', patience=3, threshold_mode='abs', threshold=5,
    factor=0.5, min_lr=1e-9, verbose=True
)

negative_sampler = UnifNegativeSampler(triples=train_dataset,
                                       entity_dict_len=len(node_lut),
                                       relation_dict_len=len(relation_lut))

trainer = Trainer(
    train_dataset=train_dataset,
    valid_dataset=valid_dataset,
    train_sampler=train_sampler,
    valid_sampler=valid_sampler,
    model=model,
    loss=loss,
    optimizer=optimizer,
    negative_sampler=negative_sampler,
    device=device,
    output_path="../dataset",
    lookuptable_E=node_lut,
    lookuptable_R=relation_lut,
    metric=metric,
    lr_scheduler=lr_scheduler,
    log=True,
    trainer_batch_size=100000,
    epoch=3000,
    visualization=1,
    apex=True,
    dataloaderX=True,
    num_workers=4,
    pin_memory=True,
    metric_step=200,
    save_step=200,
    metric_final_model=True,
    save_final_model=True,
    load_checkpoint=None
)
trainer.train()

evaluator = Evaluator(
    test_dataset=test_dataset,
    test_sampler=test_sampler,
    model=model,
    device=device,
    metric=metric,
    output_path="../dataset",
    train_dataset=train_dataset,
    valid_dataset=valid_dataset,
    lookuptable_E=node_lut,
    lookuptable_R=relation_lut,
    log=True,
    evaluator_batch_size=50000,
    dataloaderX=True,
    num_workers=4,
    pin_memory=True,
    trained_model_path=None
)
evaluator.evaluate()
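
The training script above pairs a margin loss with a uniform negative sampler. As a rough illustration of what those two pieces generally do, the sketch below corrupts a triple by swapping its head or tail for a uniformly drawn entity id and then applies a hinge loss to a positive and a negative distance. The function names `corrupt_uniform` and `margin_loss` and all numeric values are hypothetical, and the code does not reproduce the internals of `UnifNegativeSampler` or `MarginLoss`.

```python
import random

def corrupt_uniform(triple, num_entities, rng):
    """Replace the head or the tail with a uniformly sampled entity id."""
    head, relation, tail = triple
    replacement = rng.randrange(num_entities)
    if rng.random() < 0.5:
        return (replacement, relation, tail)
    return (head, relation, replacement)

def margin_loss(pos_distance, neg_distance, margin=1.0):
    """Hinge loss: zero once the negative is at least `margin` farther away."""
    return max(0.0, pos_distance - neg_distance + margin)

rng = random.Random(1)  # fixed seed so the sketch is reproducible
negative = corrupt_uniform((0, 2, 5), num_entities=10, rng=rng)
print(negative)

# The negative is closer than the positive, so the margin is violated.
loss_violated = margin_loss(pos_distance=0.8, neg_distance=0.5)
# The negative is already far enough away, so the loss is clipped to zero.
loss_satisfied = margin_loss(pos_distance=0.2, neg_distance=2.0)
print(loss_violated, loss_satisfied)
```

Note that uniform corruption can occasionally reproduce the original triple or another true triple; a production sampler may filter such cases.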

Model

| Category | Model | Conference | Paper |
|---|---|---|---|
| Translation Distance Models | TransE | NIPS 2013 | Translating Embeddings for Modeling Multi-relational Data |
| Translation Distance Models | TransH | AAAI 2014 | Knowledge Graph Embedding by Translating on Hyperplanes |
| Translation Distance Models | TransR | AAAI 2015 | Learning Entity and Relation Embeddings for Knowledge Graph Completion |
| Translation Distance Models | TransD | ACL 2015 | Knowledge Graph Embedding via Dynamic Mapping Matrix |
| Translation Distance Models | TransA | AAAI 2015 | TransA: An Adaptive Approach for Knowledge Graph Embedding |
| Translation Distance Models | BoxE | NIPS 2020 | BoxE: A Box Embedding Model for Knowledge Base Completion |
| Translation Distance Models | PairRE | ACL 2021 | PairRE: Knowledge Graph Embeddings via Paired Relation Vectors |
| Semantic Matching Models | RESCAL | ICML 2011 | A Three-Way Model for Collective Learning on Multi-Relational Data |
| Semantic Matching Models | DistMult | ICLR 2015 | Embedding Entities and Relations for Learning and Inference in Knowledge Bases |
| Semantic Matching Models | SimplE | NIPS 2018 | SimplE Embedding for Link Prediction in Knowledge Graphs |
| Semantic Matching Models | TuckER | EMNLP 2019 | TuckER: Tensor Factorization for Knowledge Graph Completion |
| Semantic Matching Models | RotatE | ICLR 2019 | RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space |
| Graph Neural Network-based Models | R-GCN | ESWC 2018 | Modeling Relational Data with Graph Convolutional Networks |
| Graph Neural Network-based Models | CompGCN | ICLR 2020 | Composition-based Multi-Relational Graph Convolutional Networks |
| Transformer-based Models | HittER | EMNLP 2021 | HittER: Hierarchical Transformers for Knowledge Graph Embeddings |
| Transformer-based Models | KEPLER | TACL 2021 | KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation |
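
To make the first two categories more concrete, the sketch below computes a TransE-style score (translation distance) and a DistMult-style score (semantic matching) for toy vectors. It follows the textbook definitions of the two scoring functions, with the L2 norm chosen for TransE as an assumption; the function names and numbers are illustrative and are not taken from CogKGE's model classes.

```python
import math

def transe_score(h, r, t):
    """Negative L2 distance between h + r and t; higher means more plausible."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def distmult_score(h, r, t):
    """Trilinear product sum_i h_i * r_i * t_i; higher means more plausible."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

head = [1.0, 2.0]
relation = [0.5, -1.0]
good_tail = [1.5, 1.0]   # exactly head + relation, so the TransE distance is 0
bad_tail = [4.5, 5.0]    # far from head + relation

print(transe_score(head, relation, good_tail))
print(transe_score(head, relation, bad_tail))
print(distmult_score(head, relation, good_tail))
```

A translation distance model treats the relation as an offset between head and tail, whereas a semantic matching model scores how well the three vectors agree component by component.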

Dataset

EventKG is an event-centric temporal knowledge graph, which incorporates over 690 thousand contemporary and historical events and over 2.3 million temporal relations. To the best of our knowledge, EventKG240K is the first event-centric KGE dataset. We use EventKG V3.0 data to construct the dataset. First, we filter entities and events based on their degrees. Then, we select the triple facts in which both nodes' degrees are greater than 10. Finally, we add text descriptions and node types for nodes and convert triples to quadruples using temporal information. The whole dataset contains 238,911 nodes, 822 relations and 2,333,986 triples.
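
The degree filter and the triple-to-quadruple step described above can be sketched as follows. This is a minimal illustration under stated assumptions: degrees are counted once from the triples themselves, a missing time value is stored as `None`, and the helper names, toy triples and toy threshold of 1 (instead of 10) are hypothetical. It is not the script used to build EventKG240K.

```python
from collections import Counter

def filter_by_degree(triples, min_degree):
    """Keep triples whose head and tail both have degree greater than min_degree."""
    degree = Counter()
    for head, _, tail in triples:
        degree[head] += 1
        degree[tail] += 1
    return [(h, r, t) for h, r, t in triples
            if degree[h] > min_degree and degree[t] > min_degree]

def to_quadruples(triples, timestamps):
    """Attach a time value to each triple; None when no time is known."""
    return [(h, r, t, timestamps.get((h, r, t))) for h, r, t in triples]

toy_triples = [
    ("a", "r1", "b"),
    ("a", "r2", "c"),
    ("b", "r1", "c"),
    ("c", "r3", "d"),   # "d" has degree 1, so this triple is dropped
]
toy_times = {("a", "r1", "b"): 1969}

kept = filter_by_degree(toy_triples, min_degree=1)
quads = to_quadruples(kept, toy_times)
print(quads)
```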

CogNet is a multi-source heterogeneous KG dedicated to integrating linguistic, world and commonsense knowledge. To build a subset as the dataset, we count the number of occurrences for each node. Then, we sort frame instances by the minimum occurrences of their connected nodes. After we have the sorted list, we filter the triple facts according to the preset frame categories. Finally, we find the nodes that participate in these triple facts and complete their information. The final dataset contains 360,637 nodes and 1,470,488 triples.

Other KGE open-source projects
