Code Accompanying Local Byte Fusion for Machine Translation
Adapted from https://github.com/UriSha/EmbeddinglessNMT/.
git clone https://github.com/makeshn/LOBEF_Byte_NMT.git
cd LOBEF_Byte_NMT
pip install -e .
mkdir results
cd examples/translation
git clone https://github.com/moses-smt/mosesdecoder.git
git clone https://github.com/rsennrich/subword-nmt.git
cd ../..
Edit examples/translation/mosesdecoder/scripts/training/clean-corpus-n.perl so that word_count measures sentence length in bytes instead of whitespace-separated tokens:
From:
sub word_count {
  my ($line) = @_;
  if ($ignore_xml) {
    $line =~ s/<\S[^>]*\S>/ /g;
    $line =~ s/\s+/ /g;
    $line =~ s/^ //g;
    $line =~ s/ $//g;
  }
  my @w = split(/ /,$line);
  return scalar @w;
}
To:
sub word_count {
  use bytes;
  my ($line) = @_;
  if ($ignore_xml) {
    $line =~ s/<\S[^>]*\S>/ /g;
    $line =~ s/\s+/ /g;
    $line =~ s/^ //g;
    $line =~ s/ $//g;
  }
  return length($line);
}
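To illustrate what this change does to the length filter, here is a minimal Python sketch. It is not part of the repository; the function names and the sample sentence are illustrative assumptions. It contrasts a whitespace token count, roughly what the original word_count computes, with the UTF-8 byte length that the patched version returns.

```python
def token_count(line: str) -> int:
    """Number of whitespace-separated tokens (approximates the original word_count)."""
    return len(line.split())


def byte_count(line: str) -> int:
    """Length of the line in UTF-8 bytes (approximates the patched word_count)."""
    return len(line.encode("utf-8"))


sample = "Grüße aus Köln"  # illustrative sentence, not taken from the data
print(token_count(sample))  # 3 tokens
print(len(sample))          # 14 characters
print(byte_count(sample))   # 17 bytes: ü, ß and ö take two bytes each
```

Byte counts are several times larger than token counts, so any minimum and maximum lengths passed to clean-corpus-n.perl presumably need to be chosen in bytes as well.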
- Data can be downloaded from this link
- Data can be found in the following folders:
  - OPUS data - opus.tar.gz for raw data, byte-bin for preprocessed data
  - Cross domain data - cross_domain_adaptation.zip
  - Cross lingual transfer - cross_lingual_transfer.zip
- Move the folders to the data/ folder and update the paths in the scripts.
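The unpacking step could be scripted along these lines. This is only a sketch: it assumes the archives listed above were saved to a local download directory (a hypothetical location) and that they unpack cleanly. The helper name unpack_into_data is not part of the repository. The paths inside the training scripts still have to be edited by hand.

```python
import shutil
from pathlib import Path

# Archive names as listed above; where they were downloaded to is an assumption.
ARCHIVES = ["opus.tar.gz", "cross_domain_adaptation.zip", "cross_lingual_transfer.zip"]


def unpack_into_data(download_dir: str, data_dir: str = "data") -> list:
    """Unpack every listed archive found in download_dir into data_dir.

    Returns the names of the archives that were actually unpacked.
    """
    target = Path(data_dir)
    target.mkdir(parents=True, exist_ok=True)
    unpacked = []
    for name in ARCHIVES:
        archive = Path(download_dir) / name
        if archive.is_file():  # skip archives that were not downloaded
            shutil.unpack_archive(str(archive), str(target))
            unpacked.append(name)
    return unpacked
```

For example, unpack_into_data("downloads") would extract whichever of the three archives exist under downloads/ into data/ and return their names.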
bash embeddingless_scripts/train_byte_ncf.sh
bash embeddingless_scripts/train_byte_wsf.sh