forked from CPJKU/wechsel

Commit bbc4bfd (parent: 451649d)

Showing 8 changed files with 967 additions and 294 deletions.
# WECHSEL
Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.

ArXiv: https://arxiv.org/abs/2112.06598

Models from the paper will be available on the Huggingface Hub.

# Installation

We distribute a Python package via PyPI:

```
pip install wechsel
```

Alternatively, clone the repository, install the dependencies from `requirements.txt`, and run the code in `wechsel/`.

# Example usage

Transferring English `roberta-base` to Swahili:

```python
import torch
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset
from wechsel import WECHSEL, load_embeddings

source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

# train a new tokenizer on the Swahili OSCAR corpus,
# keeping the source vocabulary size
target_tokenizer = source_tokenizer.train_new_from_iterator(
    load_dataset("oscar", "unshuffled_deduplicated_sw", split="train")["text"],
    vocab_size=len(source_tokenizer),
)

wechsel = WECHSEL(
    load_embeddings("en"),
    load_embeddings("sw"),
    bilingual_dictionary="swahili",
)

# initialize target-language embeddings from the source embeddings
target_embeddings, info = wechsel.apply(
    source_tokenizer,
    target_tokenizer,
    model.get_input_embeddings().weight.detach().numpy(),
)

model.get_input_embeddings().weight.data = torch.from_numpy(target_embeddings)

# use `model` and `target_tokenizer` to continue training in Swahili!
```
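The final assignment above swaps the model's input embedding matrix in place. A minimal, self-contained sketch of that pattern, using a toy `nn.Embedding` with made-up dimensions rather than actual WECHSEL output:

```python
import numpy as np
import torch
import torch.nn as nn

# toy embedding table standing in for a model's input embeddings
emb = nn.Embedding(num_embeddings=8, embedding_dim=4)

# placeholder for embeddings computed by WECHSEL (here: random values)
new_weights = np.random.rand(8, 4).astype(np.float32)

# replace the weights in place, as in the example above
emb.weight.data = torch.from_numpy(new_weights)

# looking up token id 3 now returns the new row
row = emb(torch.tensor([3])).detach().numpy()[0]
print(np.allclose(row, new_weights[3]))  # True
```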

# Bilingual dictionaries

We distribute 3276 bilingual dictionaries from English to other languages for use with WECHSEL in `dicts/`.
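The on-disk format of the dictionaries is not specified here. Assuming each file is a plain-text list of one tab-separated `english<TAB>target` word pair per line (an assumption — check the files in `dicts/` for the actual layout), a loader might look like:

```python
import tempfile
from pathlib import Path

def load_bilingual_dictionary(path):
    """Parse a bilingual dictionary of tab-separated word pairs.

    ASSUMPTION: one `source<TAB>target` pair per line; the real files
    in `dicts/` may use a different layout.
    """
    pairs = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        source, target = line.split("\t", 1)
        pairs.append((source, target))
    return pairs

# demo with a tiny made-up English-Swahili dictionary
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as f:
    f.write("book\tkitabu\nwater\tmaji\n")

print(load_bilingual_dictionary(f.name))
# [('book', 'kitabu'), ('water', 'maji')]
```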
|
||
# Citation | ||
|
||
Please cite WECHSEL as | ||
|
||
``` | ||
@misc{minixhofer2021wechsel, | ||
title={WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models}, | ||
author={Benjamin Minixhofer and Fabian Paischer and Navid Rekabsaz}, | ||
year={2021}, | ||
eprint={2112.06598}, | ||
archivePrefix={arXiv}, | ||
primaryClass={cs.CL} | ||
} | ||
``` |
# Bilingual dictionaries for WECHSEL

This directory contains 3276 bilingual dictionaries from English <-> Target Language scraped from https://en.wiktionary.org and the code used to do so. The Wiktionary dump at https://dumps.wikimedia.org/enwiktionary/20211201/enwiktionary-20211201-pages-articles.xml.bz2 was used to generate the dictionaries.