Additional Material for the publication:
An Evaluation of State-of-the-Art Approaches to Relation Extraction for Usage on Domain-Specific Corpora
Christoph Brandl, Jens Albrecht and Renato Budinich
This publication was created within the research group Future Engineering.
The folder 'fe-training-data' contains all available examples from our manually labelled Future Engineering data, split into training, test, and evaluation data files. The data set is based on articles extracted from electrive.com, a news provider targeting decision-makers, manufacturers, and service providers in the e-mobility sector.
In addition, the folder 'fewrel-training-data' contains the training and evaluation data used from the FewRel data set, as described in the conference papers.
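For orientation, the following minimal Python sketch shows how a FewRel-style JSON file can be inspected; the file name is an assumption about the folder layout described above.

```python
import json

# Assumed file name; adjust to the actual file in 'fewrel-training-data'.
with open("fewrel-training-data/train.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# FewRel maps each relation label to a list of instances; every instance
# holds the tokenized sentence plus head ('h') and tail ('t') entity spans.
for relation, instances in data.items():
    example = instances[0]
    head_name, _, head_positions = example["h"]
    tail_name, _, tail_positions = example["t"]
    print(relation, "-", " ".join(example["tokens"]))
    print("  head:", head_name, head_positions, "| tail:", tail_name, tail_positions)
    break
```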
This repository contains different approaches for the task of relation extraction from text. At the moment it contains working implementations of the following approaches:
- Entity-aware BLSTM based on this GitHub repository
- ERNIE based on this GitHub repository
- R-BERT based on this GitHub repository
- Matching the Blanks BERT based on this GitHub repository
- BERT Pair based on this GitHub repository
In addition, the repository contains a converter that parses TSV files from the INCEpTION annotation tool and transfers them into a data format similar to that of the FewRel data.
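As a rough illustration of the converter's input side, the sketch below collects the token sequences from a WebAnno/INCEpTION TSV export; the actual converter additionally maps the annotated entities and relations into the FewRel-like structure, which is omitted here.

```python
def read_inception_tokens(tsv_path):
    """Minimal sketch: collect the token sequence of each sentence from a
    WebAnno/INCEpTION TSV export. Entity and relation columns depend on the
    configured annotation layers and are not parsed here."""
    sentences, tokens = [], []
    with open(tsv_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("#"):        # header and sentence-text comments
                continue
            if not line:                    # a blank line ends a sentence
                if tokens:
                    sentences.append(tokens)
                    tokens = []
                continue
            fields = line.split("\t")
            tokens.append(fields[2])        # the third column holds the token text
    if tokens:
        sentences.append(tokens)
    return sentences
```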
The following libraries are required:

- python == 3.6
- torch >= 1.5.0
- transformers == 3.0.0
- nltk >= 3.2.5
- rdflib >= 5.0.0
- tagme >= 0.1.3
- flair >= 0.6.0
- wptools >= 0.4.17
- pydotplus >= 2.0.2
- graphviz >= 0.10.1
- lime >= 0.2.0.1
There is a requirements.txt file included in the repository for installing all needed libraries in the correct versions.
However, note that some libraries cannot be installed via a requirements file and have to be installed separately, in particular PyTorch, Flair, and PyCurl.
In order to use the approaches in this repository, some additional files, such as pre-training checkpoints or additional data sources, have to be downloaded.
The Matching the Blanks GitHub repository provides a data file for the pre-training process of the BERT model:
The authors of the ERNIE approach provide additional data:
The data used for fine-tuning the approaches to the specific tasks is also provided:
The Entity-aware BLSTM approach uses pre-trained GloVe vectors for word representation (the extracted file should be located in a 'resource' folder inside the approach's folder):
The downloaded data can be extracted and moved into the corresponding folder of the approach in the repository.
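As an illustration, the extracted GloVe text file can be loaded into a simple word-to-vector dictionary; the file name below is an assumption and depends on which GloVe archive was downloaded.

```python
import numpy as np

def load_glove(path="resource/glove.6B.300d.txt"):
    """Minimal sketch of loading pre-trained GloVe vectors; the path is an
    assumed location following the 'resource' folder convention above."""
    embeddings = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            # Each line is: word followed by its vector components.
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return embeddings
```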
Each of the above approaches is implemented in its own Jupyter notebook, where it can be trained (fine-tuned) on one of the data sets. At the end of each notebook, all information needed later, including the trained model weights and additional resources, is stored in checkpoint files. This training step is a prerequisite for using the models later for inference on new sentences in the Text2RelationGraph notebook.
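The following is a minimal, self-contained sketch of this checkpointing pattern with PyTorch; the file name, dictionary keys, and the stand-in classifier are illustrative assumptions, not the notebooks' actual code.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 64)                    # stand-in for a fine-tuned relation classifier
label2id = {"no_relation": 0, "uses": 1}      # hypothetical relation-label mapping

# Store the trained weights together with the resources needed for inference.
torch.save(
    {"model_state_dict": model.state_dict(), "label2id": label2id},
    "relation_model_checkpoint.pt",
)

# Later (e.g. in the Text2RelationGraph notebook) the stored state is restored.
checkpoint = torch.load("relation_model_checkpoint.pt")
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()
```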
The notebook Text2RelationGraph implements the complete pipeline from unannotated text to RDF triples forming a knowledge graph. For this, one of the approaches can be chosen dynamically within the notebook. The notebook uses the previously trained and stored information from the approaches' individual notebooks.
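A minimal sketch of this final step with rdflib; the namespace and the example prediction are illustrative assumptions, not the notebook's actual vocabulary.

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/fe/")      # hypothetical namespace for the graph
g = Graph()

predictions = [("Volkswagen", "produces", "ID3")]  # hypothetical extraction result
for head, relation, tail in predictions:
    # Each predicted (head, relation, tail) tuple becomes one RDF triple.
    g.add((EX[head], EX[relation], EX[tail]))

g.serialize(destination="relation_graph.ttl", format="turtle")
```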
Additionally, all approaches can be evaluated on different data sets. Metrics such as accuracy, precision, recall, and F1 score are calculated, and a confusion matrix is plotted.
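A sketch of such an evaluation on hypothetical labels, assuming scikit-learn is available (it is pulled in as a dependency of lime):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Hypothetical gold labels and model predictions for illustration only.
y_true = ["uses", "produces", "no_relation", "uses"]
y_pred = ["uses", "no_relation", "no_relation", "uses"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print("accuracy:", accuracy_score(y_true, y_pred))
print("precision:", precision, "recall:", recall, "F1:", f1)
print(confusion_matrix(y_true, y_pred, labels=["uses", "produces", "no_relation"]))
```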