- First, obtain the English and French word embedding matrices as described in the paper.
- Run make_en.py and make_fr.py.
- Get lex.1.e2f and lex.1.f2e from the CLSP Grid and put them in the same directory.
- Uncomment the commented-out line in create_language_pairs.py and run the script.
- Comment that line out again afterwards if you want to speed up later runs.
- Run bilingual.py. Check its arguments first; the file locations used in the code may need to be modified.
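
As a rough illustration of what the lexical tables contain, the sketch below reads lines in the three-column form `word word probability` and keeps the most probable counterpart for each word. The column layout, the function name `read_lex`, and the sample lines are assumptions made for this example; they are not taken from create_language_pairs.py, so check the actual files before relying on it.

```python
from collections import defaultdict


def read_lex(lines):
    """Parse 'word word probability' lines into a nested dict (hypothetical helper)."""
    table = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        first, second, prob = parts
        table[first][second] = float(prob)
    return table


# Made-up sample lines, standing in for the contents of a lex file.
sample = ["house maison 0.8", "house domicile 0.2", "cat chat 1.0"]
lex = read_lex(sample)

# For each first-column word, keep the second-column word with the highest probability.
best = {word: max(cands, key=cands.get) for word, cands in lex.items()}
```

To use it on a real file, pass an open file object instead of the sample list, for example `read_lex(open("lex.1.e2f", encoding="utf-8"))`.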
The remaining scripts are for training word embeddings and evaluating the results.
- Get any corpus.
- Tokenize it with Moses.
- Run remove.py, which replaces names and numbers with placeholder tokens (see remove.py for the exact tokens).
- Run window.py, which writes each run of 5 consecutive words on one line.
- If the corpus is too large, run shrink.py and specify the size you want.
- Then run train.py. Its arguments are specified in the same way as for bilingual.py.
- Once training is done, run evaluation.py to evaluate the result.
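
The windowing and shrinking steps can be sketched as follows. This is a minimal illustration and not the code in window.py or shrink.py: it assumes overlapping windows that advance one word at a time, and it assumes shrinking means keeping a random sample of lines. The function names `windows` and `shrink` and the `seed` parameter are hypothetical.

```python
import random


def windows(tokens, size=5):
    """Yield every run of `size` consecutive tokens (stride of one, an assumption)."""
    for i in range(len(tokens) - size + 1):
        yield tokens[i:i + size]


def shrink(lines, n, seed=0):
    """Keep at most n lines, sampled without replacement, in their original order."""
    if n >= len(lines):
        return list(lines)
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(lines)), n))
    return [lines[i] for i in keep]


sentence = "the quick brown fox jumps over the lazy dog".split()

# One 5-word window per output line.
window_lines = [" ".join(w) for w in windows(sentence)]

# Reduce the windowed corpus to 2 lines.
small = shrink(window_lines, 2)
```

Sentences shorter than the window size produce no lines in this sketch; the real script may pad or handle them differently.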