Citation Parser is a Python package designed to process raw citation texts and link them to scholarly knowledge graphs like OpenAlex, OpenAIRE, and PubMed. It leverages advanced natural language processing techniques powered by three small, fine-tuned language models to deliver accurate and robust citation parsing and linking.
Citation Parser follows a structured multi-step process to achieve accurate citation linking:
- Pre-Screening: a classification model based on `distilbert/distilbert-base-multilingual-cased` determines whether the given text is a valid citation.
- Citation Parsing (NER): a fine-tuned Named Entity Recognition (NER) model parses the citation into structured fields. The extracted fields can include `TITLE`, `AUTHORS`, `VOLUME`, `ISSUE`, `YEAR`, `DOI`, `ISSN`, `ISBN`, `FIRST_PAGE`, `LAST_PAGE`, `JOURNAL`, and `EDITOR`.
- Candidate Identification: the parsed fields are used to construct a series of carefully crafted queries to the OpenAlex API, retrieving one or more candidate publications for the citation.
- Pairwise Classification: a pairwise classification model predicts the likelihood that each retrieved candidate matches the original citation. The model is fine-tuned on a dataset of citation pairs in the format `"CITATION 1 [SEP] CITATION 2"`. The candidate with the highest likelihood score is selected and returned as the final linked publication (see the sketch after this list).
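Because the three fine-tuned models are published on the Hugging Face Hub (see the model links further below), the individual stages can also be exercised directly with `transformers`. The following is a rough sketch of that idea, not the package's internal code; the exact label names and entity tags it prints depend on each model's configuration:

```python
from transformers import pipeline

raw = "Murakami, H. et al. (2010). Unique thermal behavior of acrylic PSAs bearing long alkyl side groups and crosslinked by aluminum chelate. European Polymer Journal."

# 1. Pre-screening: is the text a citation at all?
type_clf = pipeline("text-classification", model="SIRIS-Lab/citation-parser-TYPE")
print(type_clf(raw))  # e.g. [{'label': ..., 'score': ...}]

# 2. Citation parsing: extract TITLE, AUTHORS, YEAR, DOI, ... as entities.
ner = pipeline(
    "token-classification",
    model="SIRIS-Lab/citation-parser-ENTITY",
    aggregation_strategy="simple",
)
print([(e["entity_group"], e["word"]) for e in ner(raw)])

# 3. Candidate selection: score an OpenAlex candidate against the input,
#    using the "CITATION 1 [SEP] CITATION 2" pair format described above.
candidate = (
    "Hiroto Murakami, Keisuke Futashima, Minoru Nanchi (2010). "
    "Unique thermal behavior of acrylic PSAs bearing long alkyl side groups "
    "and crosslinked by aluminum chelate. European Polymer Journal."
)
select = pipeline("text-classification", model="SIRIS-Lab/citation-parser-SELECT")
print(select(f"{raw} [SEP] {candidate}"))
```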
```bash
pip install git+https://github.com/sirisacademic/citation-parser.git
```
Here’s a basic example of how to use Citation Parser:
```python
from citation_parser import CitationParser

# Initialize the parser
parser = CitationParser()

# Raw citation text
citation = "MURAKAMI, H等: 'Unique thermal behavior of acrylic PSAs bearing long alkyl side groups and crosslinked by aluminum chelate', 《EUROPEAN POLYMER JOURNAL》"

# Parse and link the citation
result = parser.link_citation(citation, api_target="openalex", output="simple")
```
The output would look like this:
```python
{'result': 'Hiroto Murakami, Keisuke Futashima, Minoru Nanchi, et al. (2010). Unique thermal behavior of acrylic PSAs bearing long alkyl side groups and crosslinked by aluminum chelate. European Polymer Journal, 47 378-384. doi: 10.1016/j.eurpolymj.2010.12.012',
 'score': 0.9997150301933289,
 'id': 'https://openalex.org/W2082866977'}
```
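The returned `score` can double as a confidence filter when linking many citations; a minimal sketch, where the 0.9 threshold is an arbitrary choice rather than a value recommended by the package:

```python
# Hypothetical confidence filter on top of the returned dict; the threshold is
# an assumption, tune it for your own precision/recall trade-off.
MIN_SCORE = 0.9

linked_id = result["id"] if result and result["score"] >= MIN_SCORE else None
print(linked_id)  # 'https://openalex.org/W2082866977' for the example above
```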
- `api_target`: Specifies which knowledge graph to query. Options include:
  - `openalex` [default]: Links to OpenAlex
  - `openaire`: Links to OpenAIRE
  - `pubmed`: Links to PubMed
- `output`: Specifies the type of result returned:
  - `simple`: Returns a concise, structured citation match.
  - `full`: Returns a detailed, full citation with additional metadata.
- `device`: Selects the hardware used for model inference (a combined usage example follows this list):
  - `cpu`: Utilises the CPU for model inference, suitable for environments without GPU support. Recommended for smaller workloads or when a GPU is unavailable.
  - `cuda`: Utilises GPUs via CUDA for faster inference. Recommended for environments where GPUs are available and high performance is required.
- `result`: dict with the following attributes:
  - `result`: Citation from the linked source.
  - `score`: Similarity score with the input citation.
  - `id`: `publication_id` in the target Scholarly Knowledge Graph (OpenAlex, OpenAIRE, or PubMed).
  - `full-publication` (if `output='full'`): Publication object from the target API.
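Putting these options together, a minimal sketch follows. It assumes `device` is passed to the `CitationParser` constructor; where exactly `device` is supplied is an assumption here, not documented above:

```python
from citation_parser import CitationParser

# Assumption: device is configured on the parser itself; adjust if the package
# expects it elsewhere (e.g. as a link_citation argument).
parser = CitationParser(device="cuda")

citation = "Murakami, H. et al. (2010). Unique thermal behavior of acrylic PSAs bearing long alkyl side groups and crosslinked by aluminum chelate. European Polymer Journal."

# Query PubMed instead of the default OpenAlex, and request the full record.
result = parser.link_citation(citation, api_target="pubmed", output="full")
if result:
    print(result["id"])                # publication_id in the target knowledge graph
    print(result["full-publication"])  # raw publication object from the target API
```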
Ensure you have all necessary dependencies installed. You can install them using the following command:
```bash
pip install -r requirements.txt
```
Citation Parser is ideal for:
- Automated Metadata Enrichment: extract structured metadata from raw citation texts (see the batch sketch after this list).
- Citation Validation: verify the correctness of citations in manuscripts.
- Scholarly Database Integration: link citations to knowledge graphs like OpenAlex and OpenAIRE.
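For the enrichment use case, here is a minimal batch-processing sketch; the input list, the output file name, and the decision to skip unlinked citations are illustrative assumptions rather than package features:

```python
import json

from citation_parser import CitationParser

parser = CitationParser()

# Hypothetical batch of raw reference strings, e.g. extracted from manuscripts.
raw_citations = [
    "MURAKAMI, H等: 'Unique thermal behavior of acrylic PSAs bearing long alkyl side groups and crosslinked by aluminum chelate', 《EUROPEAN POLYMER JOURNAL》",
    "this string is not a citation at all",
]

enriched = []
for text in raw_citations:
    linked = parser.link_citation(text, api_target="openalex", output="simple")
    # Keep only citations the parser managed to link; everything else is skipped.
    if linked:
        enriched.append({"input": text, **linked})

# Persist the enriched metadata for downstream use.
with open("enriched_citations.json", "w", encoding="utf-8") as fh:
    json.dump(enriched, fh, ensure_ascii=False, indent=2)
```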
- 🤗 TYPE model available at: https://huggingface.co/SIRIS-Lab/citation-parser-TYPE
- 🤗 NER model available at: https://huggingface.co/SIRIS-Lab/citation-parser-ENTITY
- 🤗 SELECT model available at: https://huggingface.co/SIRIS-Lab/citation-parser-SELECT
The performance of each model used in Citation Parser is evaluated using the F1 score. Below are the F1 scores for the key models involved in citation parsing and linking:

| Model | F1 Score |
|---|---|
| TYPE Model (Citation Pre-screening) | 0.941638 |
| NER Model (Citation Parsing) | 0.949772 |
| SELECT Model (Candidate Selection) | 0.846972 |
- Improved candidate retrieval: advanced query strategies for ambiguous or incomplete citations.
- Multilingual query translation: translate non-English input so that searches can be run in both the input language and English.
For further information, please contact [email protected].
This work is distributed under the Apache License, Version 2.0.