This project aims to provide automated reading of natural history labels such as herbarium labels and archive cards.
The project includes Python scripts and ideas for automated reading of machine-typed labels (handwritten labels are not yet supported) and of Data Matrix codes or QR codes.
The following must be installed on the system. On macOS, I install these via MacPorts.
For OCR using tesseract:
tesseract
tesseract-dan
tesseract-eng
tesseract-deu
tesseract-lat
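As a rough illustration of how the language packs listed above can be combined, here is a minimal sketch that calls the Tesseract command-line tool from Python. It is not taken from this project's scripts: the function names build_tesseract_command and ocr_image, and the file name label.png, are made up for the example. It assumes the tesseract executable is on the PATH and that the four language packs above are installed.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, languages=("dan", "eng", "deu", "lat")):
    """Return the argument list for a Tesseract call that prints text to stdout."""
    # Tesseract accepts several languages joined with '+', for example "dan+eng".
    # Using "stdout" as the output base makes Tesseract print the recognised text.
    return ["tesseract", str(image_path), "stdout", "-l", "+".join(languages)]


def ocr_image(image_path, languages=("dan", "eng", "deu", "lat")):
    """Run Tesseract on one image file and return the recognised text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, languages),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract exits with an error
    )
    return result.stdout
```

Calling ocr_image("label.png", languages=("dan", "lat")) would then restrict recognition to Danish and Latin, which may suit labels that mix a local language with scientific names.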
For reading PDF files:
imagemagick
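Before creating the virtual environment it can help to confirm that the system packages above are actually reachable. The following is a small sketch, not part of this project. The helper name find_missing_tools is hypothetical, and the executable names are assumptions: ImageMagick 7 usually installs a magick command while older versions install convert, and your package manager may differ.

```python
import shutil

# Assumed executable names for each required system package.
# ImageMagick 7 typically provides "magick", ImageMagick 6 provides "convert".
REQUIRED_TOOLS = {
    "tesseract": ("tesseract",),
    "imagemagick": ("magick", "convert"),
}


def find_missing_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of required packages with no executable found on PATH."""
    missing = []
    for name, candidates in required.items():
        # A package counts as present if any of its candidate executables is found.
        if not any(which(exe) for exe in candidates):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing system packages:", ", ".join(missing))
    else:
        print("All required system packages were found.")
```

The check only looks for executables; it does not verify that the Tesseract language data files are installed.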
Create a virtual environment
python3 -m venv venv
Install the requirements via pip into the virtual environment
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
The binary wheels in the PyPI repository for the current version 2.2.0 of zxing-cpp have a problem, so the package must be built from the source code package by
pip uninstall zxing-cpp
python -m pip install zxing-cpp==2.2.0 --no-binary zxing-cpp
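After reinstalling, one way to confirm which version ended up in the active environment is to query the package metadata. This is only a sketch using the standard library; the helper name installed_version is hypothetical. It reports the installed version and cannot tell whether the package came from a binary wheel or from a source build.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("zxing-cpp")
    if found is None:
        print("zxing-cpp is not installed in this environment")
    else:
        print("zxing-cpp version:", found)
```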
To run the tests using pytest, do the following from the same directory as this README file.
source venv/bin/activate
pytest tests
Check the output for any failures.
To create wheel and source packages ready for distribution do:
source venv/bin/activate
pip install --upgrade -r build_requirements.txt
python -m build
This creates a dist directory containing the two package files. To install the wheel file into another virtual environment, do
python -m venv venv2
source venv2/bin/activate
pip install --upgrade pip
pip install dist/NHMDlabelreader-0.0.1-py3-none-any.whl
For more instructions on how to configure setup.cfg, see the setuptools quickstart.
We are not currently publishing this package to PyPI. To upload to PyPI, follow these instructions.
Currently there are two GitHub Actions workflows, both of which need to be activated manually in the repository on github.com. For more advanced workflows, see Ole Engstrøm's IKPLS repository.
Additional documentation can be found in docs.
This script parses archive cards from the Ole Bøggild collection of Danish spiders.
This script parses a table of taxa from the butterfly atlas book.
This script attempts to parse information on archive cards from the C-SAD Botany collection at NHMD.