Code Accompanying Local Byte Fusion for Machine Translation

Adapted from https://github.com/UriSha/EmbeddinglessNMT/.

Requirements

PyTorch version >= 1.4.0
Python version >= 3.6
NVIDIA GPU and NCCL
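
The requirements above can be checked before installing anything else. The following is a minimal sketch, not part of the repository: the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and the NCCL check assumes a PyTorch build that ships `torch.distributed`.

```python
import re
import sys


def parse_version(text):
    """Turn a version string such as '1.4.0' or '2.1.0+cu121' into (major, minor, patch)."""
    parts = []
    for piece in text.split("+")[0].split("."):  # drop local build tags such as '+cu121'
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:  # pad '1.4' to (1, 4, 0) so tuples compare as expected
        parts.append(0)
    return tuple(parts[:3])


def meets_minimum(found, minimum):
    """Return True if version string `found` is at least `minimum`."""
    return parse_version(found) >= parse_version(minimum)


def check_environment():
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python >= 3.6 is required")
    try:
        import torch
    except ImportError:
        problems.append("PyTorch is not installed")
        return problems
    if not meets_minimum(torch.__version__, "1.4.0"):
        problems.append("PyTorch >= 1.4.0 is required")
    if not torch.cuda.is_available():
        problems.append("no CUDA-capable GPU is visible to PyTorch")
    # Assumption: torch.distributed is compiled into this PyTorch build.
    if not (torch.distributed.is_available() and torch.distributed.is_nccl_available()):
        problems.append("NCCL backend is not available")
    return problems
```

Calling `check_environment()` and printing the returned list gives a quick summary of what is missing on the current machine.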

Setting up environment

git clone https://github.com/makeshn/LOBEF_Byte_NMT.git
cd LOBEF_Byte_NMT
pip install -e .
mkdir results
cd examples/translation
git clone https://github.com/moses-smt/mosesdecoder.git
git clone https://github.com/rsennrich/subword-nmt.git
cd ../..

Edit the word_count subroutine in examples/translation/mosesdecoder/scripts/training/clean-corpus-n.perl so that it measures line length in bytes instead of counting space-separated words:

From:

sub word_count {
 my ($line) = @_;
 if ($ignore_xml) {
  $line =~ s/<\S[^>]*\S>/ /g;
  $line =~ s/\s+/ /g;
  $line =~ s/^ //g;
  $line =~ s/ $//g;
 }
 my @w = split(/ /,$line);
 return scalar @w;
}

To:

sub word_count {
 use bytes;
 my ($line) = @_;
 if ($ignore_xml) {
  $line =~ s/<\S[^>]*\S>/ /g;
  $line =~ s/\s+/ /g;
  $line =~ s/^ //g;
  $line =~ s/ $//g;
 }
 return length($line);
}
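
To see what this change does to the length measure, here is a small Python sketch of the two behaviours. It is only an illustration: the function names are hypothetical, the word count uses Python's whitespace splitting rather than Perl's exact `split(/ /, ...)` semantics, and UTF-8 input is assumed.

```python
def word_count(line):
    """Approximate the original subroutine: number of whitespace-separated tokens."""
    return len(line.split())


def byte_length(line):
    """Approximate the edited subroutine: length of the line in UTF-8 bytes."""
    return len(line.encode("utf-8"))


sample = "Grüße aus Köln"
print(word_count(sample))   # 3 tokens
print(byte_length(sample))  # 17 bytes: ü, ß and ö take two bytes each
print(len(sample))          # 14 characters, for comparison
```

Since the cleaning script filters sentence pairs by the value this subroutine returns, the minimum and maximum lengths passed to it should presumably be chosen as byte counts once the edit is in place.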

Download raw and pre-processed data from:

The data can be downloaded from this link. It is organized into the following folders:
- OPUS data: opus.tar.gz contains the raw data, byte-bin contains the pre-processed data
- Cross-domain data: cross_domain_adaptation.zip
- Cross-lingual transfer data: cross_lingual_transfer.zip
Move the folders into the data/ folder and update the paths in the scripts accordingly.
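
Unpacking the downloaded archives into data/ can be scripted. The sketch below is an assumption about the workflow, not a script from the repository: `extract_archive` is a hypothetical helper, and it assumes the archives sit in the current directory and come from a trusted source.

```python
import tarfile
import zipfile
from pathlib import Path


def extract_archive(archive, dest="data"):
    """Extract a .tar.gz/.tgz or .zip archive into `dest` and return `dest` as a Path."""
    archive = Path(archive)
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    if archive.name.endswith((".tar.gz", ".tgz")):
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest)  # only use on archives you trust
    elif archive.suffix == ".zip":
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
    else:
        raise ValueError(f"unsupported archive type: {archive.name}")
    return dest


# Example usage with the archive names listed above:
# for name in ["opus.tar.gz", "cross_domain_adaptation.zip", "cross_lingual_transfer.zip"]:
#     extract_archive(name, "data")
```

The data paths inside the training scripts still need to be edited by hand to point at the extracted folders.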

Train and evaluate:

bash embeddingless_scripts/train_byte_ncf.sh
bash embeddingless_scripts/train_byte_wsf.sh
