The goal of this project is to create a combined part-of-speech tagger and lemmatizer for Icelandic using the revised fine-grained tagging schema for Icelandic. For further information about the schema see MIM-Gold on CLARIN-IS (the description pdf).
This work is based on the ABLTagger (see References) but with considerable modifications to the model. It runs on Python 3.8, PyTorch 1.7.0+ and transformers 4.1.1+.
See releases
To use a pretrained model follow the instructions below.
# Using v3.1.0 - consider using the latest version: [releases](https://github.com/cadia-lvl/POS/releases)
pip install git+https://github.com/cadia-lvl/POS.git@v3.1.0
The models are downloaded automatically when needed and stored in ~/.cache/torch/hub; for more information see the Torch hub documentation.
Instructions for further development can be found in Contributing.
The models expect tokenized input, and a tokenizer is not bundled with this package. We recommend tokenizer version 2.0+.
There are three pretrained models available.
- A small PoS tagger:
pos tag example.txt tagged.txt
- A large PoS tagger:
pos tag-large example.txt tagged.txt
- A small lemmatizer:
pos lemma example.txt tagged.txt
Below is a table with some rough numbers (they are dependent on hardware and text domain).
Model | Accuracy (MIM-Gold) | Disk space | CPU speed | GPU speed |
---|---|---|---|---|
PoS small | ~96.7% | ~60MB | 360 | 10000 |
PoS large | ~97.8% | ~425MB | 20 | 1100 |
Lemmatizer small | ~98.3% | ~72MB | 360 | 10000 |
- The models are currently not trained on "noisy" text, so they might not perform as well on text that is far from the data in MIM-Gold.
- The batch_size parameter works best with GPUs.
- The accuracy of the lemmatizer is acceptable on MIM-GOLD, but it does not generalize well and the errors returned by the model are sometimes hard to accept. We recommend using Nefnir as the main lemmatizer, with the neural lemmatizer as a fallback.
Note that the input and output should be paths (i.e. not stdin or stdout).
example.txt is a correctly formatted input file: one token per line, with sentences separated by an empty line.
cat example.txt
Þar
sem
jökulinn
ber
við
loft
hættir
landið
að
vera
jarðneskt
,
en
jörðin
fær
hlutdeild
í
himninum
,
þar
búa
ekki
framar
neinar
sorgir
og
þess
vegna
er
gleðin
ekki
nauðsynleg
,
þar
ríkir
fegurðin
ein
,
ofar
hverri
kröfu
.
Halldór
Laxness
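A file in this format can also be produced from Python when the tokens are already in memory. The following is a minimal sketch and not part of the pos package; the function name write_token_file is hypothetical.

```python
from pathlib import Path
from typing import Iterable, Sequence


def write_token_file(sentences: Iterable[Sequence[str]], path: str) -> None:
    """Write one token per line, with an empty line between sentences."""
    # Each sentence becomes a block of lines; blocks are joined by a blank line.
    blocks = ["\n".join(tokens) for tokens in sentences]
    Path(path).write_text("\n\n".join(blocks) + "\n", encoding="utf-8")
```

Calling write_token_file([("Þetta", "er", "setning", "."), ("Og", "önnur", "!")], "two_sentences.txt") writes seven token lines with a single empty line after the first sentence.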
Tagging this file:
pos tag-large example.txt example_tagged.txt
...
cat example_tagged.txt
Þar aa
sem c
jökulinn nkeog
ber sfg3en
við af
loft nheo
hættir sfg3en
landið nheng
að cn
vera sng
jarðneskt lhensf
, pk
en c
jörðin nveng
fær sfg3en
hlutdeild nveo
í af
himninum nkeþg
, pk
þar aa
búa sfg3fn
ekki aa
framar aam
neinar fovfn
sorgir nvfn
og c
þess fphee
vegna af
er sfg3en
gleðin nveng
ekki aa
nauðsynleg lvensf
, pk
þar aa
ríkir sfg3en
fegurðin nveng
ein lvensf
, pk
ofar afm
hverri foveþ
kröfu nveþ
. pl
Halldór nken-s
Laxness nken-s
And then adding the lemmas:
pos lemma example_tagged.txt # If you have previously been using an older version of the PoS tagger and this fails, try adding the "--force_reload" flag to this command (once).
...
Þar aa þar
sem c sem
jökulinn nkeog jökull
ber sfg3en bera
við af við
loft nheo loft
hættir sfg3en hætta
landið nheng land
að cn að
vera sng vera
jarðneskt lhensf jarðneskur
, pk ,
en c en
jörðin nveng jörð
fær sfg3en fá
hlutdeild nveo ílutdeild
í af í
himninum nkeþg himinn
, pk ,
þar aa þar
búa sfg3fn búa
ekki aa ekki
framar aam framar
neinar fovfn neinn
sorgir nvfn sorg
og c og
þess fphee það
vegna af vegna
er sfg3en vera
gleðin nveng gleði
ekki aa ekki
nauðsynleg lvensf nauðsynlegur
, pk ,
þar aa þar
ríkir sfg3en ríkja
fegurðin nveng regurð
ein lvensf einn
, pk ,
ofar afm ofar
hverri foveþ hver
kröfu nveþ krafa
. pl .
Halldór nken-s Ialldór
Laxness nken-s Laxness
For additional flags and further details see pos tag --help
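The tagged and lemmatized output can be read back into Python for further processing. The sketch below is dependency-free and assumes that the fields on a line are separated by whitespace and that tokens contain no whitespace; read_fielded_file is a hypothetical helper, not a function of the pos package (which provides pos.FieldedDataset.from_file for its own use).

```python
from typing import Iterator, List, Tuple


def read_fielded_file(path: str) -> Iterator[List[Tuple[str, ...]]]:
    """Yield sentences as lists of per-token field tuples.

    An empty line ends a sentence; fields on a line are split on whitespace.
    """
    sentence: List[Tuple[str, ...]] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = tuple(line.split())
            if fields:
                sentence.append(fields)
            elif sentence:
                yield sentence
                sentence = []
    if sentence:  # The last sentence may not be followed by an empty line
        yield sentence
```

For a lemmatized file each tuple holds (token, tag, lemma); for a file that was only tagged it holds (token, tag).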
Usage example of the tagger in another Python module example.py.
"""An example of the POS tagger as a module."""
import torch
import pos
# Initialize the tagger
device = torch.device("cpu") # CPU
tagger: pos.Tagger = torch.hub.load(
    repo_or_dir="cadia-lvl/POS",
    model="tag",  # This specifies which model to use. Set to 'tag_large' for large model.
    device=device,
    force_reload=False,
    force_download=False,
)
# Tag a single sentence
tags = tagger.tag_sent(("Þetta", "er", "setning", "."))
print(tags)
# ('fahen', 'sfg3en', 'nven', 'pl')
# Tuple[str, ...]
# Tag multiple sentences at the same time (faster).
tags = tagger.tag_bulk(
    (("Þetta", "er", "setning", "."), ("Og", "önnur", "!")), batch_size=2
)  # Batch size works best with GPUs
print(tags)
# (('fahen', 'sfg3en', 'nven', 'pl'), ('c', 'foven', 'pl'))
# Tuple[Tuple[str, ...], ...]
# Tag a correctly formatted file.
dataset = pos.FieldedDataset.from_file("example.txt")
tags = tagger.tag_bulk(dataset)
print(tags)
# (('aa', 'ct', 'nkeog', 'sfg3en', 'af', 'nheo', 'sfg3en', 'nheng', 'cn', 'sng', 'lhensf', 'pk', 'c', 'nveng', 'sfg3en', 'nveo', 'af', 'nkeþg', 'pk', 'aa', 'sfg3fn', 'aa', 'aam', 'fovfn', 'nvfn', 'c', 'fphee', 'af', 'sfg3en', 'nveng', 'aa', 'lvensf', 'pk', 'aa', 'sfg3en', 'nveng', 'lvensf', 'pk', 'afm', 'foveþ', 'nveþ', 'pl'), ('nken-s', 'nken-s'))
# Tuple[Tuple[str, ...], ...]
For additional information, see the docstrings provided.
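Since the tagger returns only the tags, the caller has to align them with the tokens. The following sketch shows one way to do that; pair_tokens_with_tags is a hypothetical helper, and the tags are copied from the tag_bulk output shown above so that the example runs without loading a model.

```python
from typing import List, Sequence, Tuple


def pair_tokens_with_tags(
    sentences: Sequence[Sequence[str]], tags: Sequence[Sequence[str]]
) -> List[List[Tuple[str, str]]]:
    """Zip each sentence with its tag sequence, checking that lengths agree."""
    if len(sentences) != len(tags):
        raise ValueError("Number of sentences and tag sequences differ")
    paired = []
    for tokens, sent_tags in zip(sentences, tags):
        if len(tokens) != len(sent_tags):
            raise ValueError("A sentence and its tags differ in length")
        paired.append(list(zip(tokens, sent_tags)))
    return paired


sentences = (("Þetta", "er", "setning", "."), ("Og", "önnur", "!"))
# Tags copied from the tag_bulk example output, not computed here
tags = (("fahen", "sfg3en", "nven", "pl"), ("c", "foven", "pl"))
for sentence in pair_tokens_with_tags(sentences, tags):
    print(sentence)
```

The explicit length checks are a design choice: a silent zip would truncate to the shorter sequence and hide a mismatch between input and output.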
- Haukur Páll Jónsson (current maintainer)
- Örvar Kárason
- Steinþór Steingrímsson
- Reykjavík University
This project was funded (partly) by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, Science and Culture.
This section contains more involved installation instructions and explains how to train different models.
We use poetry to manage dependencies and to build wheels. Install poetry and run poetry install.
To activate the environment within the current shell, call poetry shell.
To run the tests, run pytest within the poetry environment.
To run them without activating the environment, run poetry run pytest.
This runs all the unit tests and skips a few tests that rely on external data (model files).
To include these tests, add the following options to the pytest command:
- pytest --electra_model="electra_model/": a directory containing all the files necessary to load an electra model.
- pytest --tagger="tagger.pt" --dictionaries="dictionaries.pickle": the files necessary to load a pretrained tagging model.
This project uses GitHub actions to run a number of checks (linting, testing) when a change is pushed to GitHub.
If a change does not pass the checks, a code fix is expected.
See .github/workflows/python-package.yml for the checks involved.
The training data is a text file which contains PoS-tagged sentences. The file has one token per line, followed by its corresponding tag. Sentences are separated by an empty line.
Við fp1fn
höfum sfg1fn
góða lveosf
aðstöðu nveo
fyrir af
barnavagna nkfo
og c
kerrur nvfo
. pl
Börnin nhfng
geta sfg3fn
sofið sþghen
úti aa
ef c
vill sfg3en
. pl
For Icelandic we used the IDF and MIM-GOLD. We use the 10th fold (in either dataset) for hyperparameter selection.
We provide some additional data which is used to train the model:
- data/extra/characters_training.txt contains all the characters which the model knows. Unknown characters are mapped to <unk>.
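The mapping of unknown characters to <unk> can be illustrated with a small sketch. It uses a toy alphabet instead of the real file, and the index order as well as the helper names build_char_index and encode are assumptions made for illustration, not the package's implementation.

```python
from typing import Dict, Iterable, List

UNK = "<unk>"


def build_char_index(known_chars: Iterable[str]) -> Dict[str, int]:
    """Map <unk> to index 0 and every known character to a unique index."""
    index = {UNK: 0}
    for char in known_chars:
        # Only assigns a new index the first time a character is seen
        index.setdefault(char, len(index))
    return index


def encode(token: str, char_index: Dict[str, int]) -> List[int]:
    """Replace each character with its index, falling back to <unk>."""
    return [char_index.get(char, char_index[UNK]) for char in token]


char_index = build_char_index("aábdð")  # Toy alphabet, not the real file
print(encode("bað", char_index))  # Every character is known
print(encode("bað!", char_index))  # "!" is mapped to the <unk> index
```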
We represent the information contained in the morphological lexicon with n-hot vectors.
To generate the n-hot vectors, different scripts will have to be written for different morphological lexicons.
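To make the idea of an n-hot vector concrete: every label in the lexicon gets a fixed position, and a word's vector has a 1 at the position of each label the lexicon lists for it. The sketch below uses a toy lexicon whose words and labels are illustrative and not taken from DMII; the function names are hypothetical.

```python
from typing import Dict, List, Set


def build_label_index(lexicon: Dict[str, Set[str]]) -> Dict[str, int]:
    """Assign a fixed vector position to every label found in the lexicon."""
    labels = sorted({label for labels in lexicon.values() for label in labels})
    return {label: position for position, label in enumerate(labels)}


def n_hot(
    word: str, lexicon: Dict[str, Set[str]], label_index: Dict[str, int]
) -> List[int]:
    """Return a vector with a 1 for every label the lexicon lists for the word."""
    vector = [0] * len(label_index)
    for label in lexicon.get(word, set()):
        vector[label_index[label]] = 1
    return vector


# Toy lexicon: the words and labels are illustrative, not taken from DMII
toy_lexicon = {
    "búa": {"verb", "plural"},
    "loft": {"noun", "neuter", "singular"},
}
label_index = build_label_index(toy_lexicon)
print(n_hot("loft", toy_lexicon, label_index))
print(n_hot("óþekkt", toy_lexicon, label_index))  # Not in the lexicon: all zeros
```

Unlike a one-hot vector, several positions can be set at once, which is what allows an ambiguous word form to carry all of its possible analyses.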
We use the DMII morphological lexicon for Icelandic.
The script pos/vectorize_dim.py is used to create n-hot vectors from DMII.
We first download the data in SHsnid format and unpack SHsnid.csv to ./data/extra.
To generate the n-hot vectors we then run the script:
python3 ./pos/vectorize_dim.py
The script takes two parameters:
Parameters | Default | Description |
---|---|---|
-i --input | ./data/extra/SHsnid.csv | The file containing the DIM morphological lexicon in SHsnid format. |
-o --output | ./data/extra/dmii.vectors | The file containing the DIM n-hot vectors. |
Since the morphological lexicon contains more words than will be seen during training and testing, it is useful to filter out unseen words.
pos filter-embedding data/raw/mim/* data/raw/otb/* data/extra/dmii.vectors data/extra/dmii.vectors_filtered bin
For an explanation of the parameters run pos filter-embedding --help
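The idea behind the filtering step can be sketched in a few lines. This is an illustration only and not the implementation of pos filter-embedding: it assumes exact, case-sensitive matching of word forms, and filter_vectors is a hypothetical name.

```python
from typing import Dict, Iterable, List, Sequence


def filter_vectors(
    vectors: Dict[str, List[int]], sentences: Iterable[Sequence[str]]
) -> Dict[str, List[int]]:
    """Keep only the vectors of words that occur in the given sentences."""
    seen = {token for sentence in sentences for token in sentence}
    return {word: vector for word, vector in vectors.items() if word in seen}


# Toy data: three lexicon entries, of which only two occur in the text
toy_vectors = {"loft": [1, 0], "búa": [0, 1], "hestur": [1, 1]}
toy_sentences = [("þar", "búa"), ("við", "loft")]
print(filter_vectors(toy_vectors, toy_sentences))
```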
A model can be trained by invoking the following command.
pos train-and-tag \
training_data/*.tsv \
testing_data.tsv \
out # A directory to write out training results
For a description of all the arguments and options, run pos train-and-tag --help.
Parameters with default values (options) are prefixed with --.
It is also useful to look at the BASH scripts in bin/.
Augmenting a BiLSTM Tagger with a Morphological Lexicon and a Lexical Category Identification Step
@inproceedings{steingrimsson-etal-2019-augmenting,
title = "Augmenting a {B}i{LSTM} Tagger with a Morphological Lexicon and a Lexical Category Identification Step",
author = {Steingr{\'\i}msson, Stein{\th}{\'o}r and
K{\'a}rason, {\"O}rvar and
Loftsson, Hrafn},
booktitle = "Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)",
month = sep,
year = "2019",
address = "Varna, Bulgaria",
url = "https://www.aclweb.org/anthology/R19-1133",
doi = "10.26615/978-954-452-056-4_133",
pages = "1161--1168",
}