All of the experiments were conducted on Universal Dependencies:
- corpus main page
- data download page
- Universal Dependencies v2.2 (we use 46 languages in v2.2 for training; see data/train_langs.txt)
- Universal Dependencies v2.5 (we use 8 treebanks in v2.5 for testing; see data/test_tbs-v2.5.txt)
- Clone this repository:
git clone https://github.com/Chung-I/maml-parsing.git
- Set up conda environment:
conda create -n maml-parsing python=3.6
conda activate maml-parsing
- Install python package requirements:
pip install -r requirements.txt
- UD_GT: Root path of the ground-truth Universal Dependencies treebank files used for evaluation.
- UD_ROOT: Root path of the treebank files used for training. To train on the ground-truth Universal Dependencies treebank files, set it to the same value as UD_GT. To use the output of your own POS tagger as input features for training, put all POS-tagged conllu files in a single folder and set UD_ROOT to it. To compare results with the paper, we provide Universal Dependencies v2.2 preprocessed by stanfordnlp (stanfordnlp package); these files carry the tags predicted by its POS taggers, which are used for training.
- CONFIG_NAME: JSON (Jsonnet) file storing the training configuration, such as dataset paths, model hyperparameter settings, and the training schedule. See delexicalized parsing models and lexicalized parsing models for examples of configuration files to choose from.
- Normal usage: Simply extract Universal Dependencies v2.2 to some folder, then set UD_GT="folder/**/" and UD_ROOT="folder/**/".
UD_GT="path/to/your/ud-treebanks-v2.2/**/" UD_ROOT="path/to/your/pos-tagged/conllu-files/" python -W ignore run.py train $CONFIG_NAME -s <serialization_dir> --include-package src
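Before launching training, it can help to check which files a pattern such as folder/**/ actually reaches. The sketch below is not part of this repository. It assumes the pattern is expanded with recursive glob semantics (an assumption about how the training code resolves it), and the helper name list_conllu_files is hypothetical.

```python
import glob
import os


def list_conllu_files(root_pattern, split="train"):
    """Return the sorted *-<split>.conllu files reachable from root_pattern.

    Hypothetical helper: assumes root_pattern (e.g. "folder/**/") is meant
    to be expanded recursively, so "**" matches any depth of subfolders.
    """
    pattern = os.path.join(root_pattern, f"*-{split}.conllu")
    return sorted(glob.glob(pattern, recursive=True))


# Example: inspect what UD_ROOT would match (falls back to a placeholder).
ud_root = os.environ.get("UD_ROOT", "folder/**/")
matches = list_conllu_files(ud_root)
print(f"{len(matches)} training files matched by {ud_root}")
```

If the count is zero, the pattern most likely points one level too high or too low in the extracted treebank folder.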
Delexicalized parsing models:
- Multi-task baseline
  - CONFIG_NAME=training_config/multi-pos.jsonnet
  - pre-trained model: multi-pos.tar.gz
- MAML
  - CONFIG_NAME=training_config/maml-pos.jsonnet
  - pre-trained model: maml-pos.tar.gz
- FOMAML
  - CONFIG_NAME=training_config/fomaml-pos.jsonnet
  - pre-trained model: fomaml-pos.tar.gz
- Reptile
  - CONFIG_NAME=training_config/reptile-pos.jsonnet
  - pre-trained model: reptile-pos.tar.gz
Lexicalized parsing models:
- Multi-task baseline
  - CONFIG_NAME=training_config/multi-lex.jsonnet
  - pre-trained model: multi-lex.tar.gz
- Reptile
  - CONFIG_NAME=training_config/reptile-lex.jsonnet
  - pre-trained model (inner step K=2): reptile-lex-K2.tar.gz
  - pre-trained model (inner step K=4): reptile-lex-K4.tar.gz
- num_gradient_accumulation_steps: number of meta-learning inner steps (K)
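To make the role of the inner-step count concrete, here is a toy Reptile update on scalar quadratic tasks. It is a generic illustration of Reptile with k inner SGD steps, not this repository's implementation. The task losses, learning rates, and function names are all assumptions chosen for illustration; k plays the role that num_gradient_accumulation_steps is described as having.

```python
def inner_sgd(theta, target, k, inner_lr):
    """Run k SGD steps on the toy task loss 0.5 * (theta - target) ** 2."""
    for _ in range(k):
        grad = theta - target  # derivative of the toy loss w.r.t. theta
        theta = theta - inner_lr * grad
    return theta


def reptile_step(theta, targets, k, inner_lr=0.1, meta_lr=0.5):
    """One Reptile meta-update: move theta toward the mean adapted parameter."""
    adapted = [inner_sgd(theta, t, k, inner_lr) for t in targets]
    mean_adapted = sum(adapted) / len(adapted)
    return theta + meta_lr * (mean_adapted - theta)


# Two toy "tasks" with optima at 1.0 and 3.0; k=2 inner steps per task.
theta = 0.0
for _ in range(200):
    theta = reptile_step(theta, targets=[1.0, 3.0], k=2)
print(round(theta, 3))
```

A larger k lets each task pull the parameters further during adaptation before the meta-update averages the results.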
- UD_GT: Same as in pre-training.
- UD_ROOT: Root path of the treebank files used for testing. To feed ground-truth text segmentation and POS tags to the parser, set it to the same value as UD_GT. To compare results with CoNLL 2018 shared task submissions, which score not only parser accuracy but the whole preprocessing pipeline before dependency parsing (tokenization, lemmatization, POS/morphological feature tagging, multi-word expansion), process the raw text with your own preprocessing pipeline, put all preprocessed conllu files in a single folder, and set UD_ROOT to it. The parser reads the test files in that folder to generate the system output. For users who do not want to develop their own preprocessing pipeline but still want to compare with CoNLL 2018 submissions, we provide Universal Dependencies v2.2 preprocessed by the stanfordnlp pipeline (stanfordnlp package). We also provide Universal Dependencies v2.5 preprocessed by the stanza pipeline (stanza package) for users who would like to parse treebanks in UD v2.5 and compare their results with stanza, Stanford's multilingual NLP system trained on UD v2.5.
- EPOCH_NUM: The pre-training epoch whose checkpoint is used for zero-shot transfer.
- ZS_LANG: Language code of the target transfer language (e.g. wo, te, cop).
- SUFFIX: Suffix of the folder name storing the results.
- <serialization_dir>: Directory of the model to perform zero-shot transfer from. For example, to perform zero-shot transfer from the POS-only multi-task baseline model, extract the pre-trained model multi-pos.tar.gz and set <serialization_dir> to that folder.
UD_GT="path/to/your/ud-treebanks-v2.x/**/" UD_ROOT="path/to/your/preprocessed/conllu-files/" bash zs-eval.sh <serialization_dir> $EPOCH_NUM $ZS_LANG 0 $SUFFIX
Results will be stored in the log directory <serialization_dir>_${EPOCH_NUM}_${ZS_LANG}_${SUFFIX}.
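When collecting results from many runs, the log directory name can be rebuilt from the same four values. The following is a minimal sketch that only mirrors the naming pattern stated above; the function name result_dir and the example values are hypothetical.

```python
def result_dir(serialization_dir, epoch_num, lang, suffix):
    """Mirror the documented pattern <serialization_dir>_<EPOCH_NUM>_<LANG>_<SUFFIX>.

    Hypothetical helper; it does not strip a trailing slash from
    serialization_dir, so pass the path without one.
    """
    return f"{serialization_dir}_{epoch_num}_{lang}_{suffix}"


# Example values are placeholders, not settings taken from the repository.
print(result_dir("multi-pos", 5, "wo", "zs"))
```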
- UD_GT: Same as in pre-training.
- UD_ROOT: Same as in zero-shot transfer.
- EPOCH_NUM: The pre-training epoch whose checkpoint is used as the starting point for fine-tuning.
- FT_LANG: Language code of the target transfer language (e.g. wo, te, cop).
- NUM_EPOCHS: Number of epochs to fine-tune for.
- SUFFIX: Suffix of the folder name storing the results.
- <serialization_dir>: Directory of the model to fine-tune from. For example, to fine-tune from the POS-only multi-task baseline model, extract the pre-trained model multi-pos.tar.gz and set <serialization_dir> to that folder.
UD_GT="path/to/your/ud-treebanks-v2.x/**/" UD_ROOT="path/to/your/preprocessed/testset/" bash fine-tune.sh <serialization_dir> $EPOCH_NUM $FT_LANG $NUM_EPOCHS $SUFFIX
Results will be stored in the log directory <serialization_dir>_${EPOCH_NUM}_${FT_LANG}_${SUFFIX}.
- train-result.conllu: System prediction for the training set ($UD_GT/$ZS_LANG*-train.conllu).
- dev-result.conllu: System prediction for the development set ($UD_GT/$ZS_LANG*-dev.conllu).
- result.conllu: System prediction for the test set ($UD_ROOT/$ZS_LANG*-test.conllu).
- result-gt.conllu: System prediction for the test set ($UD_GT/$ZS_LANG*-test.conllu).
- result.txt: Performance (LAS, UAS, etc.) of result.conllu computed by utils/conll18_ud_eval.py, which is provided by the CoNLL 2018 Shared Task.
- result-gt.txt: Performance (LAS, UAS, etc.) of result-gt.conllu computed by utils/error_analysis.py, which is modified from the CoNLL 2018 Shared Task evaluation script. Scores grouped by sentence length (LASlen[sentence length lower bound][sentence length upper bound]) and by dependency length (LASdep[dependency length]) are added.
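For readers unfamiliar with the metrics, the sketch below shows how UAS, LAS, and a LAS breakdown by dependency length can be computed from two CoNLL-U strings. It is a simplified illustration, not a reimplementation of utils/conll18_ud_eval.py or utils/error_analysis.py. It assumes gold and system files share the same tokenization, it compares labels without the subtype after ":", and it defines dependency length as |head - id| with root arcs placed in bucket 0. All of these are choices of this sketch, and every function name is hypothetical.

```python
from collections import defaultdict


def read_arcs(conllu_text):
    """Return (token_id, head, deprel) for every syntactic word."""
    arcs = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank sentence separators and comment lines
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # multi-word token ranges ("1-2") and empty nodes ("1.1")
        arcs.append((int(cols[0]), int(cols[6]), cols[7].split(":")[0]))
    return arcs


def attachment_scores(gold_text, system_text):
    """Compute UAS, LAS, and LAS per dependency length (sketch assumptions apply)."""
    gold, system = read_arcs(gold_text), read_arcs(system_text)
    if not gold or len(gold) != len(system):
        raise ValueError("sketch assumes non-empty input with identical tokenization")
    uas_hits = las_hits = 0
    buckets = defaultdict(lambda: [0, 0])  # dependency length -> [correct, total]
    for (tok_id, g_head, g_rel), (_, s_head, s_rel) in zip(gold, system):
        head_ok = g_head == s_head
        label_ok = head_ok and g_rel == s_rel
        uas_hits += head_ok
        las_hits += label_ok
        dep_len = abs(g_head - tok_id) if g_head != 0 else 0
        buckets[dep_len][0] += label_ok
        buckets[dep_len][1] += 1
    n = len(gold)
    return {
        "UAS": uas_hits / n,
        "LAS": las_hits / n,
        "LASdep": {k: c / t for k, (c, t) in sorted(buckets.items())},
    }


def row(i, form, head, rel):
    """Build one 10-column CoNLL-U line with unused fields set to '_'."""
    return "\t".join([str(i), form, "_", "_", "_", "_", str(head), rel, "_", "_"])


gold = "\n".join([row(1, "The", 2, "det"), row(2, "dog", 3, "nsubj"), row(3, "barks", 0, "root")])
system = "\n".join([row(1, "The", 2, "det"), row(2, "dog", 3, "obj"), row(3, "barks", 0, "root")])
scores = attachment_scores(gold, system)
print(scores)
```

In this toy sentence every head is correct but one label is wrong, so UAS is higher than LAS, and the error shows up in the length-1 bucket.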