This repository contains the code for our ACL 2022 paper. It synthesizes monolingual and labeled data for languages with limited or no textual data. We use this synthetic data to adapt pretrained multilingual models to languages with limited textual data, which leads to significant improvements for these languages.
We have prepared the extracted lexicons and other data used in the experiments, and you can get them here. To extract new lexicons directly from the PanLex database, first download the SQL database from here (in case the website doesn't work, here is the older version of the database we used), then run:
python src/panlex_extract_lexicon.py --source_language=eng --target_language=$LAN --output_directory=data/lexicons
The lexicon extraction code is adapted from this repository.
To generate synthetic monolingual data in another language from the preprocessed English Wikipedia sentences, run:
python src/make_pseudo_mono.py $LAN
where $LAN is a language code that has a corresponding lexicon file under the folder data/lexicons/.
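The idea behind lexicon-based synthesis can be illustrated with a minimal sketch. It is not the code in src/make_pseudo_mono.py: the function name pseudo_translate, the in-memory lexicon format (a dict mapping a lowercased English word to a list of translations), and the choice to keep words without an entry unchanged are all assumptions made for illustration.

```python
import random


def pseudo_translate(tokens, lexicon, rng=None):
    """Replace each English token with a translation from the lexicon.

    Hypothetical helper: `lexicon` is assumed to map a lowercased English
    word to a list of target-language translations. Tokens without an
    entry are copied through unchanged.
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    translated = []
    for token in tokens:
        candidates = lexicon.get(token.lower(), [])
        # Pick one translation at random when several are available.
        translated.append(rng.choice(candidates) if candidates else token)
    return translated


# Toy lexicon for illustration only, not taken from PanLex.
toy_lexicon = {"the": ["el"], "cat": ["gato"]}
print(" ".join(pseudo_translate("The cat sleeps".split(), toy_lexicon)))
```

Because the replacement is word by word, the output keeps the English word order; "sleeps" has no entry in the toy lexicon and is left as-is.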
For finetuning, please first download and prepare the task-specific data following the XTREME repo. To generate synthetic data for a language and a task, use
python src/make_pseudo_label.py $LAN $TASK
where $TASK is either pos (POS tagging) or panx (Wiki NER).
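For token-level tasks such as POS tagging and NER, word-by-word replacement keeps the number of tokens fixed, so the English labels can be reused for the synthetic sentence. The sketch below shows that idea under stated assumptions: pseudo_label is a hypothetical helper, the examples are (tokens, labels) pairs held in memory, and the first lexicon entry is always chosen. The actual src/make_pseudo_label.py may read and write data differently.

```python
def pseudo_label(examples, lexicon):
    """Translate tokens word by word and carry the labels over unchanged.

    Hypothetical helper: `examples` is a list of (tokens, labels) pairs and
    `lexicon` maps a lowercased English word to a list of translations.
    """
    synthetic = []
    for tokens, labels in examples:
        if len(tokens) != len(labels):
            raise ValueError("each token needs exactly one label")
        new_tokens = []
        for token in tokens:
            candidates = lexicon.get(token.lower(), [])
            # Use the first translation; keep the token if none exists.
            new_tokens.append(candidates[0] if candidates else token)
        # One-to-one replacement preserves the token/label alignment.
        synthetic.append((new_tokens, list(labels)))
    return synthetic


# Toy data for illustration only.
toy_lexicon = {"is": ["es"], "big": ["grande"]}
toy_examples = [(["Paris", "is", "big"], ["B-LOC", "O", "O"])]
for tokens, labels in pseudo_label(toy_examples, toy_lexicon):
    print(list(zip(tokens, labels)))
```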
Please cite our paper as:
@inproceedings{wang2022expand,
title={Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation},
author={Wang, Xinyi and
Ruder, Sebastian and
Neubig, Graham},
booktitle={ACL},
year={2022}
}