This repository contains a proof-of-concept workflow for optical character recognition (OCR), named entity recognition (NER), named entity disambiguation and linking (NED), and the transformation of digitized historical newspapers for historical network analysis (HNA).
The workflow was developed by the Berlin State Library (SBB), the Berlin School of Library and Information Science (IBI) and the German Research Center for Artificial Intelligence (DFKI) in the context of the SoNAR (IDH) project.
The main aims were to explore the technical feasibility, quality and usability of the results for scholarly use cases in historical network analysis and data visualization.
The individual components are based on state-of-the-art open source technologies from the OCR-D and QURATOR projects.
The workflow was evaluated and the results are published here (German).
The workflow includes the following steps:
1. Access images of digitized newspapers from Zefys
2. Apply OCR to the images using the OCR-D framework
3. Transform the OCR output into TSV format
4. Recognize named entities in the OCRed text
5. Disambiguate and link entities to Wikidata-IDs
6. Manually inspect or edit the results in a browser
7. Transform the results for use in a graph database
To install and test the workflow, the following prerequisites must be met.
Set up a Python 3 virtualenv and activate it
python3 -m venv /path_to_venv
source /path_to_venv/bin/activate
Update pip
pip install -U pip
You need either local or remote access to the digitized newspaper images from Zefys. For local access, mount the archive read-only:
mkdir zefys
mount -o ro,noload /zefys/archive zefys
Download images using the API
Install OCR-D via ocrd-galley
git clone https://github.com/qurator-spk/ocrd-galley
cd ocrd-galley
./build
You can now use zdb2ocr to OCR digitized newspapers from Zefys based on their zdb-id (with any hyphens removed) and date of issue (yyyymmdd)
zdb2ocr 27974534 19010712
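The two arguments can be derived from a hyphenated identifier and a calendar date. The following is a minimal Python sketch, assuming for illustration that the ID above was originally written as 2797453-4. The helper name zdb2ocr_args is hypothetical and not part of ocrd-galley.

```python
from datetime import date


def zdb2ocr_args(zdb_id: str, issue_date: date) -> list[str]:
    """Return the two positional arguments in the form shown above.

    Hypothetical helper: strips hyphens from the ID and formats the
    date of issue as yyyymmdd.
    """
    return [zdb_id.replace("-", ""), issue_date.strftime("%Y%m%d")]


# The hyphenated ID is an assumed example, not taken from the ZDB catalogue.
args = zdb2ocr_args("2797453-4", date(1901, 7, 12))
print("zdb2ocr", *args)  # prints: zdb2ocr 27974534 19010712
```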
Install page2tsv
git clone https://github.com/qurator-spk/page2tsv
cd page2tsv
pip install .
You can now use page2tsv to transform the PAGE-XML output of the OCR into a tab-separated-values (tsv) format
page2tsv SNP27974534-19010712-0-1-0-0.xml SNP27974534-19010712-0-1-0-0.tsv
If images are served via IIIF, the OCR coordinates can be used to generate the corresponding image URLs by also providing the --image-url option
page2tsv SNP27974534-19010712-0-1-0-0.xml SNP27974534-19010712-0-1-0-0.tsv \
--image-url=https://content.staatsbibliothek-berlin.de/zefys/SNP27974534-19010712-0-1-0-0/full/full/0/default.jpg
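To illustrate how pixel coordinates map onto such a URL, the sketch below swaps the region segment of the URL for a left,top,width,height box, following the IIIF Image API layout of region/size/rotation/quality.format. Whether page2tsv builds its URLs exactly this way is an assumption, the function name iiif_region_url is hypothetical, and the coordinates are made up.

```python
BASE = (
    "https://content.staatsbibliothek-berlin.de/zefys/"
    "SNP27974534-19010712-0-1-0-0/full/full/0/default.jpg"
)


def iiif_region_url(image_url: str, left: int, top: int, width: int, height: int) -> str:
    """Replace the region segment of an IIIF Image API URL with a pixel box."""
    # Layout: {prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    prefix, _region, size, rotation, quality_format = image_url.rsplit("/", 4)
    region = f"{left},{top},{width},{height}"
    return "/".join([prefix, region, size, rotation, quality_format])


# Made-up coordinates, for illustration only.
print(iiif_region_url(BASE, 100, 200, 300, 50))
```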
Apply named entity recognition with sbb_ner
page2tsv SNP27974534-19010712-0-1-0-0.tsv --ner-rest-endpoint
Apply named entity disambiguation and linking with sbb_ned
page2tsv SNP27974534-19010712-0-1-0-0.tsv --ned-rest-endpoint
Use the browser-based neat to inspect, correct or annotate tsv files
git clone https://github.com/qurator-spk/neat
cd neat
firefox neat.html
Install trs
git clone https://github.com/sonar-idh/Transformer
Follow the instructions provided in that repository
Information provided by the tsv filename:
SNP{zdb-id}-{yyyymmdd}-{issue}-{page}-{article}-{version}.tsv
- zdb-id (any - removed)
- date of issue (yyyymmdd)
- issue number (0 = morning issue, 1 = evening issue etc., default 0)
- page/image number
- article id (not used, default 0)
- version number (not used, default 0)
Example: SNP27974534-19010712-0-1-0-0.tsv
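The naming scheme can be unpacked with a regular expression. This is a minimal sketch, not part of the workflow tools: the names PageFile and parse_tsv_name are hypothetical, and allowing a trailing X in the zdb-id is an assumption.

```python
import re
from typing import NamedTuple


class PageFile(NamedTuple):
    zdb_id: str
    date: str      # yyyymmdd, kept as a string
    issue: int     # 0 = morning issue, 1 = evening issue etc.
    page: int
    article: int   # not used, default 0
    version: int   # not used, default 0


# Assumption: the zdb-id consists of digits, optionally with an X character.
PATTERN = re.compile(r"SNP([0-9X]+)-(\d{8})-(\d+)-(\d+)-(\d+)-(\d+)\.tsv")


def parse_tsv_name(name: str) -> PageFile:
    """Split a tsv file name into the six fields described above."""
    match = PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    zdb_id, day, *numbers = match.groups()
    return PageFile(zdb_id, day, *map(int, numbers))


print(parse_tsv_name("SNP27974534-19010712-0-1-0-0.tsv"))
```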
Information provided in the tsv file columns:
- iiif_url placeholder injected as a comment under the column headers
- No. indicates the sentence position (≥1, 0 marks sentence boundaries)
- TOKEN contains the token text (utf-8 encoded)
- NE-TAG contains the surface entity label (BIO chunking)
- NE-EMB contains the embedded entity label (BIO chunking)
- ID contains the surface entity wikidata ID (ranked candidates are separated by |)
- url_id is replaced with the iiif_url
- left, top, width, height hold the token OCR coordinates as absolute pixel values
Example (see also example):
No.  TOKEN       NE-TAG  NE-EMB  ID         url_id  left  top  width  height
# https://iiif.url
36   bekannter   O       O       -          -       157   181  643    660
37   Comédie     B-ORG   B-LOC   Q61460498  -       197   262  643    661
38   françaiſe   I-ORG   I-LOC   Q61460498  -       277   345  642    661
39   anvertraut  O       O       -          -       359   440  644    659
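A file in this layout can be read with the Python standard library. The sketch below skips the comment line, then groups consecutive B-/I- tokens from the NE-TAG column into surface entities with their candidate IDs. The function names are hypothetical, and treating - as "no ID" is an assumption based on the example rows.

```python
import csv
import io

SAMPLE = (
    "No.\tTOKEN\tNE-TAG\tNE-EMB\tID\turl_id\tleft\ttop\twidth\theight\n"
    "# https://iiif.url\n"
    "36\tbekannter\tO\tO\t-\t-\t157\t181\t643\t660\n"
    "37\tComédie\tB-ORG\tB-LOC\tQ61460498\t-\t197\t262\t643\t661\n"
    "38\tfrançaiſe\tI-ORG\tI-LOC\tQ61460498\t-\t277\t345\t642\t661\n"
    "39\tanvertraut\tO\tO\t-\t-\t359\t440\t644\t659\n"
)


def read_rows(handle):
    """Yield one dict per token row, skipping '#' comment lines."""
    lines = (line for line in handle if not line.startswith("#"))
    yield from csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)


def surface_entities(rows):
    """Group BIO-tagged tokens into (label, text, candidate IDs) tuples."""
    entities, current = [], None
    for row in rows:
        tag = row["NE-TAG"]
        if tag.startswith("B-"):
            # Ranked candidates are separated by '|'; '-' is assumed to mean none.
            ids = [] if row["ID"] == "-" else row["ID"].split("|")
            current = [tag[2:], [row["TOKEN"]], ids]
            entities.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            current[1].append(row["TOKEN"])
        else:
            current = None  # an O tag or a stray I- tag closes the open entity
    return [(label, " ".join(tokens), ids) for label, tokens, ids in entities]


print(surface_entities(read_rows(io.StringIO(SAMPLE))))
```

The same grouping could be applied to the NE-EMB column to collect embedded entities.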