add python package
bminixhofer committed Dec 14, 2021
1 parent 451649d commit bbc4bfd
Showing 8 changed files with 967 additions and 294 deletions.
70 changes: 68 additions & 2 deletions README.md
@@ -1,4 +1,70 @@
# wechsel
# WECHSEL
Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.

Code will be published here once it is ready.
ArXiv: https://arxiv.org/abs/2112.06598

Models from the paper will be available on the Hugging Face Hub.

# Installation

We distribute a Python package via PyPI:

```
pip install wechsel
```

Alternatively, clone the repository, install the dependencies listed in `requirements.txt`, and run the code in `wechsel/`.

# Example usage

Transferring English `roberta-base` to Swahili:

```python
import torch
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset
from wechsel import WECHSEL, load_embeddings

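# Load the English source model and its tokenizer.
source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

# Train a Swahili tokenizer with the same vocabulary size as the source tokenizer.
target_tokenizer = source_tokenizer.train_new_from_iterator(
    load_dataset("oscar", "unshuffled_deduplicated_sw", split="train")["text"],
    vocab_size=len(source_tokenizer)
)

# Set up WECHSEL with English and Swahili embeddings and the `swahili` bilingual dictionary.
wechsel = WECHSEL(
    load_embeddings("en"),
    load_embeddings("sw"),
    bilingual_dictionary="swahili"
)

# Compute embeddings for the target vocabulary from the source embedding matrix.
target_embeddings, info = wechsel.apply(
    source_tokenizer,
    target_tokenizer,
    model.get_input_embeddings().weight.detach().numpy(),
)

# Replace the model's input embeddings with the Swahili embeddings.
model.get_input_embeddings().weight.data = torch.from_numpy(target_embeddings)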

# use `model` and `target_tokenizer` to continue training in Swahili!
```
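
To continue training later, the transferred model and the new tokenizer can be written to disk together. The sketch below is an assumption about workflow and not part of WECHSEL: `save_transferred` and the output directory name are hypothetical, and the helper relies only on the `save_pretrained` method that `transformers` models and tokenizers provide.

```python
from pathlib import Path


def save_transferred(model, tokenizer, out_dir):
    """Save a model and its tokenizer side by side (hypothetical helper).

    Both objects only need a `save_pretrained(path)` method, as provided by
    `transformers` models and tokenizers.
    """
    out_path = Path(out_dir)
    # Create the output directory (and any missing parents) if needed.
    out_path.mkdir(parents=True, exist_ok=True)
    model.save_pretrained(str(out_path))
    tokenizer.save_pretrained(str(out_path))
    return out_path
```

For instance, calling `save_transferred(model, target_tokenizer, "roberta-base-sw")` after the embeddings have been replaced should leave a directory that can be passed to `AutoModel.from_pretrained` and `AutoTokenizer.from_pretrained`, provided the target vocabulary size matches the model configuration.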

# Bilingual dictionaries

We distribute 3276 bilingual dictionaries from English to other languages for use with WECHSEL in `dicts/`.
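
To inspect a dictionary outside of WECHSEL, a small parser may be enough. The sketch below assumes, hypothetically, that a dictionary file holds one tab-separated `source<TAB>target` pair per line; check the files in `dicts/` for the actual layout before relying on it. `parse_dictionary` is an illustrative name, not part of the package.

```python
def parse_dictionary(lines):
    """Parse word pairs from lines of the form 'source<TAB>target'.

    The tab-separated layout is an assumption made for this sketch.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # skip lines that do not match the assumed layout
        source, target = (part.strip() for part in parts)
        if source and target:
            pairs.append((source, target))
    return pairs
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, e.g. `with open(path, encoding="utf-8") as f: pairs = parse_dictionary(f)`, where `path` points to one of the dictionary files.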

# Citation

Please cite WECHSEL as

```
@misc{minixhofer2021wechsel,
      title={WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models},
      author={Benjamin Minixhofer and Fabian Paischer and Navid Rekabsaz},
      year={2021},
      eprint={2112.06598},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
2 changes: 1 addition & 1 deletion dicts/README.md
@@ -1,3 +1,3 @@
# Bilingual dictionaries for WECHSEL

This directory contains 3249 bilingual dictionaries from English <-> Target Language scraped from https://en.wiktionary.org and the code used to do so. The wiktionary dump at https://dumps.wikimedia.org/enwiktionary/20211201/enwiktionary-20211201-pages-articles.xml.bz2 was used to generate the dictionaries.
This directory contains 3276 bilingual dictionaries (English <-> target language) scraped from https://en.wiktionary.org, along with the code used to scrape them. The dictionaries were generated from the Wiktionary dump at https://dumps.wikimedia.org/enwiktionary/20211201/enwiktionary-20211201-pages-articles.xml.bz2.