- Introduction
- Getting started
- Performances
- Developer guide
- Training and evaluation
- Training data
- Generation of training data
- Acknowledgment
- Licence
- References
The goal of this GROBID module is to identify, extract, and link materials with their properties from scientific literature.
In particular, the tool currently targets superconductor materials and their properties: the critical temperature (Tc), its type (the measurement technique, or whether the value is a prediction/calculation), and the applied pressure when available. The tool also identifies space groups and crystal structures when they are specified.
A running demo is available at https://lfoppiano-grobid-superconductors.hf.space. NOTE: if the space is sleeping (after 30 minutes without any request), it may take a few seconds to start back up.
The system is divided into two main steps (Extraction and Linking):
The Extraction step is a Named Entity Recognition (NER) task and is performed using machine learning. Like other GROBID modules, it can use a linear CRF (via the Wapiti JNI integration) or Deep Learning models such as BiLSTM-CRF or transformers like BERT or SciBERT (via the DeLFT JNI integration).
The Linking step is a relation extraction (RE) task, implemented with a rule-based approach using the SpaCy library. The implementation is integrated via microservices and can be found here.
Grobid-superconductors provides both an API and a User Interface (UI).
The extracted materials and properties are summarised in a table with a snippet of the original sentence:
and each extracted entity is visualised directly on the PDF document:
As an experimental feature, the system provides a summary of all the materials, based on their composition and the form in which they appear in the document:
The response is a JSON representation of the document. It includes the main bibliographic data (title, authors, DOI, publisher and journal), which are extracted via the underlying GROBID library.
See the References for more information.
The quickest way to get started is to use the Docker Compose file contained in the project directory.
Just run the command:
docker compose up
This should start grobid-superconductors and its microservices.
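Once the containers are starting, it can be handy to wait until the services accept connections before sending documents. The following is a minimal sketch, not part of the project: the host `localhost` and the assumption that the compose setup exposes the same ports as the individual `docker run` commands in this README (8072 and 8090) are illustrative and may need adjusting.

```python
import socket
import time

# Ports taken from the docker commands in this README (assumption: the
# compose setup publishes the same ones). Adjust to your deployment.
SERVICE_PORTS = {
    "grobid-superconductors": 8072,
    "linking-module": 8090,
}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_services(host="localhost", ports=None, retries=30, delay=2.0):
    """Poll the given ports until all accept connections or retries run out."""
    pending = dict(SERVICE_PORTS if ports is None else ports)
    for _ in range(retries):
        # Keep only the services that are still unreachable.
        pending = {name: port for name, port in pending.items()
                   if not is_port_open(host, port)}
        if not pending:
            return True
        time.sleep(delay)
    return False
```

Calling `wait_for_services()` returns `True` as soon as every listed port accepts a TCP connection, and `False` if some service is still unreachable after the last retry.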
It is also possible to run each service individually:
-
Chem data Extractor
docker run -t --rm --init -p 8076:8080 lfoppiano/chemdataextractor:1.0
-
Python service for linking and other functions:
docker run -t --rm --init -p 8090:8080 lfoppiano/linking-module:1.0
-
Grobid superconductors core service
- no GPU
docker run -t --rm --init -p 8072:8072 -p 8073:8073 -v grobid-superconductors/resources/config/config-docker.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro lfoppiano/grobid-superconductors:0.3.0
- GPU
docker run --rm --gpus all --init -p 8072:8072 -p 8073:8073 -v grobid-superconductors/resources/config/config-docker.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro lfoppiano/grobid-superconductors:0.3.0
Note: the file resources/config/config-docker.yml can be edited, and the configuration is applied directly to the Docker container.
For example, it is possible to switch between Deep Learning and CRF just by changing the configuration of the individual models.
This works only if the model for the requested architecture has been provided.
The following table lists the models (in resources/models/) that are currently provided.
Model name | Description | Provided architecture |
---|---|---|
superconductors | extract the superconductors materials and properties such as temperature, pressure | CRF, BidLSTM_CRF, BidLSTM_CRF_FEATURE, scibert |
material | segment the material names | CRF, BidLSTM_CRF, BidLSTM_CRF_FEATURE |
entityLinking-material-tcValue | links materials and superconducting critical temperature | CRF |
entityLinking-tcValue-pressure | links superconducting critical temperature and pressure | CRF |
entityLinking-tcValue-me_methods | links superconducting critical temperature and measurement method | CRF |
The accuracies of each model are presented below, in the Accuracy section.
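Before switching a model to another architecture in the configuration, it may help to check the combination against the table above. The sketch below simply encodes that table as a Python dictionary; the function name `is_provided` is hypothetical and not part of grobid-superconductors.

```python
# Provided architectures per model, copied from the table in this README.
PROVIDED_ARCHITECTURES = {
    "superconductors": {"CRF", "BidLSTM_CRF", "BidLSTM_CRF_FEATURE", "scibert"},
    "material": {"CRF", "BidLSTM_CRF", "BidLSTM_CRF_FEATURE"},
    "entityLinking-material-tcValue": {"CRF"},
    "entityLinking-tcValue-pressure": {"CRF"},
    "entityLinking-tcValue-me_methods": {"CRF"},
}


def is_provided(model_name, architecture):
    """Return True if the README lists a pre-trained model for this pair."""
    return architecture in PROVIDED_ARCHITECTURES.get(model_name, set())
```

For instance, `is_provided("material", "scibert")` is `False`, so selecting that architecture for the material model would require training it first.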
GROBID is designed for fast processing using a lightweight and tightly integrated system. Grobid-superconductors contains more moving parts, which are separated from the main application: the linking module, the class classifier and the ChemDataExtractor service are provided as separate microservices. To reduce the overhead of the HTTP connections, all the sentences of a document are bundled together and sent in a single HTTP request.
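The bundling idea can be illustrated with a short sketch. The JSON schema used here (`sentences`, `id`, `text`) is purely hypothetical and does not describe the actual payload exchanged with the microservices; the point is only that one request carries the whole document instead of one request per sentence.

```python
import json


def build_batch_payload(sentences):
    """Serialise all sentences of one document into a single JSON body.

    The schema is a hypothetical illustration, not the real service format.
    """
    items = [{"id": index, "text": text} for index, text in enumerate(sentences)]
    return json.dumps({"sentences": items})


def requests_needed(sentence_count, bundled):
    """Number of HTTP round trips required to process one document."""
    if sentence_count == 0:
        return 0
    return 1 if bundled else sentence_count
```

With a document of 250 sentences, bundling reduces the number of round trips from 250 to 1, so the per-request connection overhead is paid only once.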
The performance is summarised below (RPS: requests per second, FPS: failures per second). The detailed reports and explanations can be found here.
- PDF processing:
  - CRF: 1.2/0.1 RPS/FPS
  - BidLSTM_CRF_FEATURE: 1.2/0.1 RPS/FPS (4Gb GPU), 1.1/0.2 RPS/FPS (16Gb GPU)
  - scibert: 1.0/0.9 RPS/FPS (4Gb GPU), 1.0/0.8 RPS/FPS (16Gb GPU)
- Python services: ~40 RPS
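For readers who want to reproduce this kind of figure from their own load test, RPS and FPS can be derived from raw counts as sketched below. This assumes that RPS counts all requests (successful or not) and FPS counts only the failed ones; the function name is hypothetical and the detailed reports remain the authoritative definition.

```python
def throughput(total_requests, failed_requests, duration_seconds):
    """Return (RPS, FPS) for a load test of the given duration.

    Assumption: RPS includes every request sent, FPS only the failures.
    """
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    if failed_requests > total_requests:
        raise ValueError("failed_requests cannot exceed total_requests")
    rps = total_requests / duration_seconds
    fps = failed_requests / duration_seconds
    return rps, fps
```

For example, 120 requests with 10 failures over 100 seconds correspond to 1.2 RPS and 0.1 FPS.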
Evaluation performed on 15/07/2022. The results (Precision, Recall, F-score) have been obtained by training and evaluating with a holdout validation set of 34 papers. The DL results are the average of 5 train/evaluation runs.
Labels | CRF | | | BidLSTM_CRF | | | BidLSTM_CRF_FEATURES | | | SciBERT | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Metrics | Precision | Recall | F1-Score | Precision | Recall | F1-Score | Precision | Recall | F1-Score | Precision | Recall | F1-Score |
<class> | 79.74 | 66.79 | 72.69 | 79.01 | 72.62 | 75.66 | 77.84 | 72.40 | 74.97 | 72.95 | 75.28 | 74.09 |
<material> | 79.00 | 72.15 | 75.42 | 79.25 | 76.94 | 78.06 | 81.07 | 75.10 | 77.94 | 80.15 | 81.42 | 80.77 |
<me_method> | 60.25 | 68.73 | 64.21 | 56.41 | 79.49 | 65.92 | 55.86 | 80.45 | 65.90 | 56.26 | 81.52 | 66.56 |
<pressure> | 46.15 | 29.27 | 35.82 | 49.45 | 58.05 | 52.53 | 50.25 | 60.49 | 54.36 | 41.72 | 52.68 | 46.51 |
<tc> | 84.36 | 83.57 | 83.96 | 78.61 | 82.54 | 80.48 | 79.19 | 82.07 | 80.60 | 74.46 | 82.66 | 78.35 |
<tcValue> | 69.80 | 66.24 | 67.97 | 70.36 | 75.16 | 72.67 | 68.95 | 76.56 | 72.52 | 70.90 | 79.74 | 75.06 |
All (micro avg) | 76.88 | 72.77 | 74.77 | 74.59 | 77.67 | 76.09 | 75.17 | 76.79 | 75.96 | 73.69 | 80.69 | 77.03 |
Labels | CRF | | | BidLSTM_CRF | | | BidLSTM_CRF_FEATURES | | | SciBERT | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Metrics | Precision | Recall | F1-Score | Precision | Recall | F1-Score | Precision | Recall | F1-Score | Precision | Recall | F1-Score |
<doping> | 67.98 | 62.42 | 64.95 | 67.98 | 62.42 | 64.95 | 69.00 | 62.34 | 65.43 | 63.58 | 62.79 | 63.16 |
<fabrication> | 23.61 | 5.91 | 9.24 | 23.61 | 5.91 | 9.24 | 37.33 | 9.09 | 14.48 | 22.51 | 13.18 | 16.52 |
<formula> | 82.59 | 84.14 | 83.35 | 82.59 | 84.14 | 83.35 | 83.83 | 85.14 | 84.47 | 84.53 | 86.56 | 85.53 |
<name> | 76.29 | 78.76 | 77.43 | 76.29 | 78.76 | 77.43 | 74.51 | 80.38 | 77.33 | 77.18 | 81.86 | 79.44 |
<shape> | 90.93 | 95.79 | 93.29 | 90.93 | 95.79 | 93.29 | 90.33 | 95.74 | 92.96 | 89.67 | 97.20 | 93.28 |
<substrate> | 54.31 | 32.43 | 40.44 | 54.31 | 32.43 | 40.44 | 60.08 | 33.38 | 42.82 | 56.32 | 41.22 | 47.59 |
<value> | 84.81 | 89.33 | 86.99 | 84.81 | 89.33 | 86.99 | 85.16 | 90.15 | 87.58 | 83.14 | 85.92 | 84.50 |
<variable> | 95.19 | 97.77 | 96.46 | 95.19 | 97.77 | 96.46 | 96.32 | 97.90 | 97.10 | 96.22 | 96.52 | 96.37 |
All (micro avg) | 67.98 | 62.42 | 64.95 | 82.76 | 83.50 | 83.13 | 83.20 | 84.33 | 83.76 | 83.11 | 85.23 | 84.15 |
Detailed evaluation measures are tracked here.
See the documentation of DeLFT for more details about the models and reproducing all these evaluations.
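As a reminder of how the figures in the tables above relate to each other, the sketch below computes the F1-score from precision and recall, and a micro-average from per-label counts. It is a generic illustration and not the evaluation code used by GROBID or DeLFT; the count tuple layout `(correct, predicted, expected)` is an assumption made for the example. Note that for results averaged over several runs, the reported F1 is not necessarily the harmonic mean of the averaged precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_average(counts):
    """Micro-averaged (precision, recall, F1) over all labels.

    counts maps each label to a tuple (correct, predicted, expected):
    entities correctly extracted, entities extracted, entities annotated.
    """
    correct = sum(c[0] for c in counts.values())
    predicted = sum(c[1] for c in counts.values())
    expected = sum(c[2] for c in counts.values())
    precision = correct / predicted if predicted else 0.0
    recall = correct / expected if expected else 0.0
    return precision, recall, f1_score(precision, recall)
```

Applied to the CRF results of the `<tc>` label, `f1_score(84.36, 83.57)` rounds to 83.96, which matches the table.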
Name | Method | Task | Description | Precision | Recall | F1 |
---|---|---|---|---|---|---|
rb-supermat-baseline | Rule-based | material-tcValue | eval against SuperMat | 88 | 74 | 81 |
crf-10fold-baseline | CRF | material-tcValue | 10 fold cross-validation | 68.52 | 70.11 | 69.16 |
crf-10fold-baseline | CRF | tcValue-pressure | 10 fold cross-validation | 72.92 | 67.67 | 69.76 |
crf-10fold-baseline | CRF | tcValue-me_method | 10 fold cross-validation | 49.99 | 45.21 | 44.65 |
Corpus of 500 PDF papers (500-papers) from the American Institute of Physics (AIP), the American Physical Society (APS) and the Institute of Physics (IOP). Results are calculated based on a manual correction of the output data:
Precision | Support |
---|---|
72.60 | 847 |
The application is composed of three components:
- the grobid-superconductors Java web application
- the chemdataextraction API (we recommend running it with Docker)
- the Python utilities (linking, formula classifier, etc.)
NOTE: This module requires 8 < JDK < 11.
-
Install and build the latest development version of GROBID as explained by the documentation.
-
The modules should be installed inside the grobid directory
cd grobid
-
Clone the grobid-superconductors repository inside the grobid directory
git clone ....
-
Copy the provided pre-trained models to the standard grobid-home path:
cd grobid/grobid-superconductors/
./gradlew copyModels
-
Try compiling everything with:
./gradlew clean build
-
To run the service:
java -jar build/libs/grobid-superconductor-{version}.onejar.jar server config/config.yml
The linking module and the other Python utilities are used as microservices by the grobid-superconductors Java application.
The URL can be configured from the configuration file resources/config/config.yml
or via environment variables.
To install the python utilities:
-
create a virtual environment (for example with conda, specifying python 3.7):
conda create --name grobidSuperconductors pip python=3.7
-
activate your environment
conda activate grobidSuperconductors
-
make sure you are using the pip of the conda environment and not the global one:
which pip
which should return a path inside your environment, for example
/Users/lfoppiano/opt/anaconda3/envs/test/bin/pip
-
clone grobid-superconductors-tools
git clone https://github.com/lfoppiano/grobid-superconductors-tools
-
install the requirements using pip (feel free to find your way using conda; however, it may cause trouble)
cd grobid-superconductors-tools/linking
pip install -r requirements.linux.txt
The grobid home will be used from the default location ../grobid-home
.
For training the superconductors model with all the available training data:
cd PATH-TO-GROBID/grobid/grobid-superconductors/
./gradlew train_superconductors
or
java -jar build/libs/grobid-superconductor-*onejar.jar training -a train resources/config/config.yml
The training data must be under grobid-superconductors/resources/dataset/superconductors/corpus
.
The following command automatically and randomly splits the available annotated data (under resources/dataset/superconductors/corpus/) into a training set and an evaluation set, trains a model on the first set and launches an evaluation on the second set.
The current implementation only supports an 80/20 partition.
java -jar build/libs/grobid-superconductor-*onejar.jar training -a train_eval resources/config/config.yml
In this mode, by default, 80% of the available data is used for training and the remainder for evaluation. This default ratio can be changed with the parameter -Ps. By default, the training uses all the available threads of the machine; the number of threads can also be specified with the parameter -Pt. The grobid home can optionally be specified with the parameter -PgH; by default it is ../grobid-home.
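The random holdout split described above can be sketched as follows. This is only an illustration of the principle, with an 80/20 ratio matching the partition mentioned above; the function name and the `seed` parameter are hypothetical and the trainer's own implementation may differ.

```python
import random


def split_corpus(files, train_ratio=0.8, seed=None):
    """Randomly split a list of annotated files into (training, evaluation)."""
    if not 0 < train_ratio < 1:
        raise ValueError("train_ratio must be between 0 and 1 (exclusive)")
    shuffled = list(files)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

With 10 annotated files and the default ratio, 8 files go to the training set and 2 to the evaluation set, and no file appears in both.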
For n-fold evaluation using the available annotated data (under resources/dataset/superconductors/corpus/), use the command:
java -jar build/libs/grobid-superconductor-*onejar.jar training -a nfold --fold-count n resources/config/config.yml
where --fold-count is the parameter for the number of folds (10 by default).
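For readers unfamiliar with n-fold evaluation, the sketch below shows how a corpus can be partitioned so that each item is used exactly once for evaluation. It is a generic illustration under the assumption of a simple round-robin assignment to folds; it does not reproduce how the trainer actually builds its folds.

```python
def n_fold_partitions(items, fold_count=10):
    """Yield (training, evaluation) pairs, one per fold."""
    items = list(items)
    if fold_count < 2 or fold_count > len(items):
        raise ValueError("fold_count must be between 2 and the number of items")
    # Round-robin assignment: item i goes to fold i % fold_count.
    folds = [items[i::fold_count] for i in range(fold_count)]
    for index, evaluation in enumerate(folds):
        training = [item
                    for other, fold in enumerate(folds) if other != index
                    for item in fold]
        yield training, evaluation
```

The scores of the individual folds are then typically averaged to obtain the final figures.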
To evaluate against the labeled data under grobid-superconductors/resources/dataset/superconductors/evaluation (fixed "holdout set" approach), use the command:
java -jar build/libs/grobid-superconductor-*onejar.jar training -a holdout resources/config/config.yml
By default, the report is written to files placed in the logs directory under grobid-superconductors. To disable writing to the logs directory, use the option --onlyPrint:
java -jar build/libs/grobid-superconductor-0.1.onejar.jar training -a 10fold -m superconductors --onlyPrint config/config.yml
The training data are located in a private repository (for copyright reasons). A reduced corpus of Open Access documents will be made available. Feel free to contact us if you need the data.
Grobid supports the automatic generation of pre-annotated training data in XML/TEI, using the current model, from a list of text or PDF files in an input directory:
java -Xmx4G -jar build/libs/grobid-superconductor-0.1.onejar.jar trainingGeneration -dIn input_directory -dOut output_directory -m superconductors resources/config/config.yml
It's possible to create training data for DeLFT via the command line:
java -jar build/libs/grobid-superconductor-0.1.onejar.jar prepare-delft-training --delft /delft/root/path -m superconductors config/config.yml
or with a general output directory
java -jar build/libs/grobid-superconductor-0.1.onejar.jar prepare-delft-training --output /a/directory -m superconductors config/config.yml
It is also possible to specify an input directory for the corpus. NOTE: the script will look for files in a subdirectory called final:
java -jar build/libs/grobid-superconductor-0.1.onejar.jar prepare-delft-training --input /b/directory --output /a/directory -m superconductors config/config.yml
This command looks for training data in /b/directory/final.
The Inter-Annotator Agreement (IAA) is calculated from a directory structured as follows:
- root
  - annotator1
    - file1
    - file2
  - annotator2
    - file1
    - file2
  - annotator3
    - file1
    - file2

The file names (file1, file2) must match across the annotator directories: the name is used to match the different annotations of the same original file.
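Before running the agreement computation, the directory layout above can be checked with a few lines of Python. This is a hypothetical helper, not part of the project: it lists the files that are present, under the same name, in every annotator directory.

```python
from pathlib import Path


def files_shared_by_all_annotators(root):
    """Map each file name found in every annotator directory to its paths."""
    annotator_dirs = sorted(d for d in Path(root).iterdir() if d.is_dir())
    if not annotator_dirs:
        return {}
    names_per_annotator = [
        {f.name for f in directory.iterdir() if f.is_file()}
        for directory in annotator_dirs
    ]
    # Only names present in all directories can be compared across annotators.
    shared = set.intersection(*names_per_annotator)
    return {name: [directory / name for directory in annotator_dirs]
            for name in sorted(shared)}
```

Files that are missing from the returned mapping exist only for some annotators, which usually points to a naming mismatch.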
The IAA can be calculated using the following command:
java -Xmx4G -jar build/libs/grobid-superconductor-0.1.onejar.jar iia --input root --verbose --mode {coding, unitizing} [--one-vs-all reference_directory] [--output output_directory] resources/config/config.yml
The argument --one-vs-all reference_directory restricts the IAA computation to the comparison between each directory and the reference directory.
The result is structured in four sections:
- the list of "annotators" (in the example above: annotator1, annotator2, annotator3)
- the general results (average and by label)
- the pairwise comparison between annotators (when there are more than 2)
- the debugging information showing the detailed annotations of each annotator in the text
See an example of a detailed result from the IAA processing:
Calculating IAA between the following directories:
/Users/lfoppiano/development/projects/grobid/grobid-superconductors/resources/dataset/superconductors/guidelines/test3_annotated/Luca,
/Users/lfoppiano/development/projects/grobid/grobid-superconductors/resources/dataset/superconductors/guidelines/test3_annotated/Suzuki,
/Users/lfoppiano/development/projects/grobid/grobid-superconductors/resources/dataset/superconductors/guidelines/test3_annotated/Dieb
> /Users/lfoppiano/development/projects/grobid/grobid-superconductors/resources/dataset/superconductors/guidelines/test3_annotated/Luca/1609.04957.xml
INFO [2020-03-02 05:07:44,669] org.grobid.trainer.annotationAgreement.InterAnnotationAgreementUnitizingProcessor: 2 files to be processed.
INFO [2020-03-02 05:07:44,671] org.grobid.trainer.annotationAgreement.InterAnnotationAgreementUnitizingProcessor: Processing:
> /Users/lfoppiano/development/projects/grobid/grobid-superconductors/resources/dataset/superconductors/guidelines/test3_annotated/Luca/1903.04321.xml
== General evaluation (considering all the annotators) ==
Krippendorf alpha agreements: 0.8140284745559881
Krippendorf alpha agreement by category:
material: 0.8741470581075966
pressure: 0.3946813981022542
class: 0.7267966596591673
tcValue: 0.9500274286972142
tc: 0.8416155306220746
0 vs 1
General Agreement: 0.7994996668504567
Agreement by categories:
material: 0.8147732510674073
pressure: 0.7229604056428696
class: 0.7350360835185756
tcValue: 0.9211328346104319
tc: 0.7981279136158688
0 vs 2
[..]
1 vs 2
[..]
Though the superconducting phase [...] Department of Sceience and Technology, Govt of India.
class0 *********************** *********************** *********************** **********************
class1 ****************** ******************* *********************** **********************
class2 *********************** *********************** *** *** ******** *** *** *** *********************** *******
material0 ***** ********************* ****** ****** ****** ****** ****** ****** ****** ********* ****** ********* ********************** ********* ****** *********************** ****************** *********************** ****** *********** ********************** ********* ********* ********************** *********** ********* ****** ********* ******************* ******** *********** *********** *********** ********* ********************************* ****** ********************** *********** ******
material
material
tc0 ******************** ************* ******************** ********** *** *** *** *** *** *** *** *** *** *** *** ******************** ***************
tc1 ******************** ************* ******************** *** *** ********** ********* *** ********* ********** ********** *** *** *** *** ******************** ***
tc2 ******************** ************* ******************** ********** *** ********** ********* *** ********* ********** ********** ********** ********** *** *** *** *** ******************** ***
tcValue0 ******************* *********** ***** ******* ******* ***************
tcValue1 ******************* *********** ***** ******* *******
tcValue2 ******************* *********** ***** ******* ******* ***************
pressure0
pressure1
pressure2
[..]
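To give an intuition of the Krippendorff's alpha values reported above, the sketch below computes alpha for nominal labels, which corresponds to the simpler "coding" case where each unit receives one label per annotator. It is a textbook formulation written for illustration; it is not the implementation used by grobid-superconductors and it does not cover the "unitizing" mode shown in the example output.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units is a list of units; each unit is the list of labels assigned to it
    by the annotators who coded it (units with fewer than 2 labels are skipped).
    """
    coincidences = Counter()
    for labels in units:
        if len(labels) < 2:
            continue
        weight = 1.0 / (len(labels) - 1)
        # Every ordered pair of labels within a unit is one coincidence.
        for first, second in permutations(labels, 2):
            coincidences[(first, second)] += weight
    totals = Counter()
    for (first, _second), weight in coincidences.items():
        totals[first] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("at least one unit with two labels is required")
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        return 1.0  # a single label everywhere: no disagreement is possible
    return 1.0 - observed / expected
```

A value of 1.0 means perfect agreement, while values close to 0 indicate agreement no better than chance. For two annotators labelling four units as `[["a", "a"], ["a", "b"], ["b", "b"], ["b", "b"]]`, the function returns 8/15, about 0.53.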
Our warmest thanks to Patrice Lopez from Science-miner: Author of Grobid, Delft and tons of other interesting open source projects.
This project has been developed at the National Institute for Materials Science, in Tsukuba, Japan.
GROBID and grobid-superconductors are distributed under the Apache 2.0 license.
Contact: Luca Foppiano (FOPPIANO.Luca AT nims.go.jp)
We described the framework around the system in the following articles (the latest on top):
-
Automatic Extraction of Materials and Properties from Superconductors Scientific Literature
@article{doi:10.1080/27660400.2022.2153633, author = {Luca Foppiano and Pedro Baptista Castro and Pedro Ortiz Suarez and Kensei Terashima and Yoshihiko Takano and Masashi Ishii}, title = {Automatic extraction of materials and properties from superconductors scientific literature}, journal = {Science and Technology of Advanced Materials: Methods}, volume = {3}, number = {1}, pages = {2153633}, year = {2023}, publisher = {Taylor & Francis}, doi = {10.1080/27660400.2022.2153633}, URL = { https://doi.org/10.1080/27660400.2022.2153633 }, eprint = { https://doi.org/10.1080/27660400.2022.2153633 } }
-
SuperMat: construction of a linked annotated dataset from superconductors-related publications
@article{doi:10.1080/27660400.2021.1918396, author = {Luca Foppiano and Sae Dieb and Akira Suzuki and Pedro Baptista de Castro and Suguru Iwasaki and Azusa Uzuki and Miren Garbine Esparza Echevarria and Yan Meng and Kensei Terashima and Laurent Romary and Yoshihiko Takano and Masashi Ishii}, title = {SuperMat: construction of a linked annotated dataset from superconductors-related publications}, journal = {Science and Technology of Advanced Materials: Methods}, volume = {1}, number = {1}, pages = {34-44}, year = {2021}, publisher = {Taylor & Francis}, doi = {10.1080/27660400.2021.1918396}, URL = { https://doi.org/10.1080/27660400.2021.1918396 }, eprint = { https://doi.org/10.1080/27660400.2021.1918396 } }
-
"Proposal for Automatic Extraction of Superconductors properties from scientific literature": PDF
@inproceedings{foppiano2019proposal, address = {Tsukuba}, title = {Proposal for {Automatic} {Extraction} {Framework} of {Superconductors} {Related} {Information} from {Scientific} {Literature}}, volume = {119}, copyright = {All rights reserved}, abstract = {The automatic collection of materials information from research papers using Natural Language Processing (NLP) is highly required for rapid materials development using big data, namely materials informatics (MI). The difficulty of this automatic collection is mainly caused by the variety of expressions in the papers, a robust system with tolerance to such variety is required to be developed. In this paper, we report an ongoing interdisciplinary work to construct a system for automatic collection of superconductor-related information from scientific literature using text mining techniques. We focused on the identification of superconducting material names and their critical temperature (Tc) key property. We discuss the construction of a prototype for extraction and linking using machine learning (ML) techniques for the physical information collection. From the evaluation using 500 sample documents, we define a baseline and a direction for future improvements.}, language = {eng}, booktitle = {Letters and {Technology} {News}, vol. 119, no. 66, {SC}2019-1 (no.66)}, author = {Foppiano, Luca and Thaer, M. Dieb and Suzuki, Akira and Ishii, Masashi}, month = may, year = {2019}, note = {ISSN: 2432-6380}, pages = {1--5} }