forked from CPJKU/wechsel

Commit bbc4bfd (parent: 451649d)

Showing 8 changed files with 967 additions and 294 deletions.
# WECHSEL
Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.

ArXiv: https://arxiv.org/abs/2112.06598

Models from the paper will be available on the Huggingface Hub.

# Installation

We distribute a Python package via PyPI:

```
pip install wechsel
```

Alternatively, clone the repository, install the dependencies from `requirements.txt`, and run the code in `wechsel/`.

# Example usage

Transferring English `roberta-base` to Swahili:

```python
import torch
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset
from wechsel import WECHSEL, load_embeddings

source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

# train a new tokenizer on the Swahili OSCAR corpus,
# keeping the source vocabulary size
target_tokenizer = source_tokenizer.train_new_from_iterator(
    load_dataset("oscar", "unshuffled_deduplicated_sw", split="train")["text"],
    vocab_size=len(source_tokenizer),
)

wechsel = WECHSEL(
    load_embeddings("en"),
    load_embeddings("sw"),
    bilingual_dictionary="swahili",
)

# initialize target-language embeddings from the source embeddings
target_embeddings, info = wechsel.apply(
    source_tokenizer,
    target_tokenizer,
    model.get_input_embeddings().weight.detach().numpy(),
)

model.get_input_embeddings().weight.data = torch.from_numpy(target_embeddings)

# use `model` and `target_tokenizer` to continue training in Swahili!
```
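The final assignment above swaps the model's input embedding matrix in place. A minimal, self-contained sketch of that pattern, using a toy `nn.Embedding` with made-up dimensions rather than actual WECHSEL output:

```python
import numpy as np
import torch
import torch.nn as nn

# toy embedding table standing in for a model's input embeddings
emb = nn.Embedding(num_embeddings=8, embedding_dim=4)

# placeholder for embeddings computed by WECHSEL (here: random values)
new_weights = np.random.rand(8, 4).astype(np.float32)

# replace the weights in place, as in the example above
emb.weight.data = torch.from_numpy(new_weights)

# looking up token id 3 now returns the new row
row = emb(torch.tensor([3])).detach().numpy()[0]
print(np.allclose(row, new_weights[3]))  # True
```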

# Bilingual dictionaries

We distribute 3276 bilingual dictionaries from English to other languages for use with WECHSEL in `dicts/`.
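The on-disk format of the dictionaries is not specified here. Assuming each file is a plain-text list of one tab-separated `english<TAB>target` word pair per line (an assumption — check the files in `dicts/` for the actual layout), a loader might look like:

```python
import tempfile
from pathlib import Path

def load_bilingual_dictionary(path):
    """Parse a bilingual dictionary of tab-separated word pairs.

    ASSUMPTION: one `source<TAB>target` pair per line; the real files
    in `dicts/` may use a different layout.
    """
    pairs = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        source, target = line.split("\t", 1)
        pairs.append((source, target))
    return pairs

# demo with a tiny made-up English-Swahili dictionary
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as f:
    f.write("book\tkitabu\nwater\tmaji\n")

print(load_bilingual_dictionary(f.name))
# [('book', 'kitabu'), ('water', 'maji')]
```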
|
||
# Citation | ||
|
||
Please cite WECHSEL as | ||
|
||
``` | ||
@misc{minixhofer2021wechsel, | ||
title={WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models}, | ||
author={Benjamin Minixhofer and Fabian Paischer and Navid Rekabsaz}, | ||
year={2021}, | ||
eprint={2112.06598}, | ||
archivePrefix={arXiv}, | ||
primaryClass={cs.CL} | ||
} | ||
``` |
# Bilingual dictionaries for WECHSEL

This directory contains 3276 bilingual dictionaries from English <-> Target Language scraped from https://en.wiktionary.org and the code used to do so. The Wiktionary dump at https://dumps.wikimedia.org/enwiktionary/20211201/enwiktionary-20211201-pages-articles.xml.bz2 was used to generate the dictionaries.