
NMT: Usage

Michael A. Martin edited this page Jun 14, 2021 · 25 revisions

Setting up and running an experiment

This section describes the tools most commonly used to set up and run an experiment.

config

The config tool can be used to set up a simple configuration file (config.yml) for an experiment. The configuration settings are specified on the command line, and the tool generates a valid config.yml file with those settings in the specified experiment subfolder (SIL_NLP_DATA_PATH > MT > experiments > <experiment>).

usage: config.py [-h] [--src-langs [lang [lang ...]]]
[--trg-langs [lang [lang ...]]] [--vocab-size VOCAB_SIZE]
[--src-vocab-size SRC_VOCAB_SIZE]
[--trg-vocab-size TRG_VOCAB_SIZE] [--parent PARENT]
[--mirror] [--force] [--seed SEED] [--model MODEL]
experiment

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder.
--src-langs [lang [lang ...]] Source language files The name of one (or more) files in the source language(s). Each file must be located in the SIL_NLP_DATA_PATH > MT > corpora folder or the SIL_NLP_DATA_PATH > MT > scripture folder. Only the base of the file name is specified; e.g., to use the file `abp-ABP.txt`, specify `abp-ABP`.
--trg-langs [lang [lang ...]] Target language files The name of one (or more) files in the target language(s). Each file must be located in the SIL_NLP_DATA_PATH > MT > corpora folder or the SIL_NLP_DATA_PATH > MT > scripture folder. Only the base of the file name is specified; e.g., to use the file `en-ABPBTE.txt`, specify `en-ABPBTE`.
--vocab-size VOCAB_SIZE Shared vocabulary size Specifies the size (e.g., '32000') of the shared SentencePiece vocabulary that will be constructed from the text in the source and target files.
--src-vocab-size SRC_VOCAB_SIZE Source vocabulary size Specifies the size (e.g., '32000') of a SentencePiece vocabulary that will be constructed from the text in the source files (only). This option should be used in combination with the --trg-vocab-size argument.
--trg-vocab-size TRG_VOCAB_SIZE Target vocabulary size Specifies the size (e.g., '32000') of a SentencePiece vocabulary that will be constructed from the text in the target files (only). This option should be used in combination with the --src-vocab-size argument.
--parent PARENT Parent experiment name The name of an experiment subfolder with a trained parent model. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder.
--mirror Mirror train and validation data sets (default: False) Specifies that the training and validation data sets constructed from the source and target files should be mirrored. With mirroring, each source/target sentence pair is added to the training (or validation) data set as both a source/target pair and as a target/source pair. Without mirroring, each sentence pair is only added as a source/target pair.
--force Overwrite existing config file By default, the tool reports an error if a configuration file already exists in the specified experiment subfolder. If this argument is provided, the tool overwrites the existing configuration file instead.
--seed SEED Randomization seed Specifies the randomization seed that will be used during preprocessing and training.
--model MODEL Neural network model Specifies the neural network model that will be trained. Options: TransformerBase (default), TransformerBig, SILTransformerBaseNoResidual, or SILTransformerBaseAlignmentEnhanced.
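As an illustration, a config invocation might look like the following. The experiment name MyExperiment is hypothetical, and the file names are the abp-ABP / en-ABPBTE examples used above; the exact way the script is launched may differ in your environment.

```
python config.py --src-langs abp-ABP \
                 --trg-langs en-ABPBTE \
                 --vocab-size 32000 \
                 --seed 111 \
                 MyExperiment
```

This would write a config.yml into the SIL_NLP_DATA_PATH > MT > experiments > MyExperiment subfolder, which must already exist.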

preprocess

The preprocess tool prepares the various data files needed to train a model. Preprocessing steps include:

  • creating SentencePiece vocabulary models from the experiment's source and target files;
  • splitting the source and target files into the training, validation, and test data sets;
  • writing the train/validate/test data sets to files in the experiment's subfolder;
  • adapting the parent model (if one is specified) to be used by this experiment.

usage: preprocess.py [-h] [--stats] experiment

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment subfolder to be preprocessed. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder and must contain a config.yml configuration file.
--stats Output corpus statistics Using a statistical model, calculate an alignment score for the source and target texts. Use of this option requires the SIL.Machine library to be available.
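A minimal sketch of a preprocess invocation, assuming a hypothetical experiment subfolder named MyExperiment that already contains a config.yml:

```
python preprocess.py --stats MyExperiment
```

The --stats flag can be dropped if the SIL.Machine library is not available.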

train

The train tool trains a neural model for one or more specified experiments. The experiment's configuration file (config.yml) and the data files created by the preprocess tool are used to control the training process.

usage: train.py [-h] [--mixed-precision] [--memory-growth]
[--num-devices NUM_DEVICES] [--eager-execution]
experiments [experiments ...]

Arguments:

Argument Purpose Description
experiments Experiment names The names of the experiments to train. Each experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder.
--mixed-precision Enable mixed precision
--memory-growth Enable memory growth
--num-devices NUM_DEVICES Number of devices to train on
--eager-execution Enable TensorFlow eager execution
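As a sketch, a train invocation for a single hypothetical experiment, MyExperiment, using two devices and mixed precision, might look like:

```
python train.py --mixed-precision --num-devices 2 MyExperiment
```

Multiple experiment names can be listed to train them in sequence.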

test

The test tool evaluates a trained model for a specified experiment. It uses a model checkpoint to generate target language predictions for the experiment's test set and then scores those predictions against the reference text.

usage: test.py [-h] [--memory-growth] [--checkpoint CHECKPOINT] [--last]
[--best] [--avg] [--ref-projects [project [project ...]]]
[--force-infer] [--scorers [scorer [scorer ...]]]
[--books [book [book ...]]] [--by-book]
experiment

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment to test. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder.
--memory-growth Enable memory growth
--checkpoint CHECKPOINT Test specified checkpoint Use the specified checkpoint (e.g., '--checkpoint 6000') to generate target language predictions from the test set. The specified checkpoint must be available in the run subfolder of the specified experiment.
--last Test the last checkpoint Use the last training checkpoint to generate target language predictions.
--best Test the best checkpoint Use the best training checkpoint to generate target language predictions. The best checkpoint must be available in the run > export subfolder of the specified experiment.
--avg Test the averaged checkpoint Use the averaged training checkpoint to generate target language predictions. The averaged checkpoint must be available in the 'run > avg' subfolder of the specified experiment. An averaged checkpoint can be automatically generated during training using the 'train: average_last_checkpoints: <n>' option, or it can be manually generated after training by using the average_checkpoints tool.
--ref-projects [project [project ...]] Reference projects The generated target language predictions are typically scored using the target language test set as the reference. If multiple reference projects were configured, this option can be used to specify which of these reference projects should be considered when scoring the predictions.
--force-infer Force inferencing If the test tool has already been used to generate and score predictions for an experiment's checkpoint, it will only score the predictions when it is run again on that same checkpoint. This option can be used to force the tool to re-generate the target language predictions.
--scorers [scorer [scorer ...]] List of scorers Specifies the list of scorers to be used on the predictions. Options are 'bleu' (default), 'chrf3', 'meteor', 'ter', and 'wer'.
--books [book [book ...]] Books to score Specifies one or more books to be scored. When this option is used, the test tool will generate predictions for the entire target language test set, but provide a score only for the specified book(s). Books must be specified using the 3-character abbreviations from the USFM 3.0 standard (e.g., "GEN" for Genesis).
--by-book Score individual books In addition to providing an overall score for all the books in the test set, provide individual scores for each book in the test set. If this option is used in combination with the --books option, individual scores are provided for each of the specified books.
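As a sketch, a test invocation that scores the best checkpoint of a hypothetical experiment, MyExperiment, with both BLEU and chrF3, reporting per-book scores, might look like:

```
python test.py --best --scorers bleu chrf3 --by-book MyExperiment
```

Adding --force-infer would regenerate the predictions rather than rescoring existing ones.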

translate

The translate tool uses a trained neural model to translate text to a new language. Three translation scenarios are supported, with differing command line arguments for each scenario. The supported scenarios are:

  1. Using a trained model to translate the text in a file from the source language to a target language.
  2. Using a trained model to translate the text in a sequence of files into a target language.
  3. Using Google Translate to translate a USFM-formatted book in a Paratext project into a target language.

The command line arguments for each of these scenarios are described below.

usage: translate.py [-h] [--memory-growth] [--checkpoint CHECKPOINT]
[--src SRC] [--trg TRG] [--src-prefix SRC_PREFIX]
[--trg-prefix TRG_PREFIX] [--start-seq START_SEQ]
[--end-seq END_SEQ] [--src-project SRC_PROJECT]
[--book BOOK] [--trg-lang TRG_LANG]
[--output-usfm OUTPUT_USFM] [--eager-execution]
experiment

Text file

Using the combination of command line arguments described in this section, the translate command can be used to translate the sentences in a simple text file from the source language to the target language, using the requested checkpoint from a trained model.

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment folder with the model to be used for translating the source text. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. The model must be one that supports a single target language (i.e., there is no target language argument for this scenario).
--memory-growth Enable memory growth
--checkpoint CHECKPOINT Use specified checkpoint Use the specified checkpoint to generate the target language translations. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment.
--src SRC Source file Name of a text file with the source language sentences to be translated. The translate tool looks for the file in the current working directory or, if a full/relative path is specified, it looks for the file in the specified folder. Each line in the specified source file is translated and written to the specified target file.
--trg TRG Target file Name of the text file where the translated sentences will be written (one per line).
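A sketch of this scenario, with hypothetical experiment and file names:

```
python translate.py --checkpoint best \
                    --src sentences.src.txt \
                    --trg sentences.trg.txt \
                    MyExperiment
```

Each line of sentences.src.txt would be translated and written to the corresponding line of sentences.trg.txt.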

Sequence of Text Files

Using the combination of command line arguments described in this section, the translate command can be used to translate sentences from a sequence of source language text files. The sentences in these source language text files are translated to the target language using the requested checkpoint from a trained model, and written to a corresponding sequence of target language text files.

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment folder with the model to be used for translating the source text. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. The model must be one that supports a single target language (i.e., there is no target language argument for this scenario).
--checkpoint CHECKPOINT Use specified checkpoint Use the specified checkpoint to generate the target language translations. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment.
--src-prefix SRC_PREFIX Source file prefix (e.g., de-news2019-) The file name prefix for the source files. The translate tool looks for the sequence of source files in the current working directory.
--trg-prefix TRG_PREFIX Target file prefix (e.g., en-news2019-) The file name prefix for the target files. The translate tool will write the translated text to a series of files with this specified file name prefix; the translated files will be written to the current working directory.
--start-seq START_SEQ Starting file sequence # The first source language file sequence number to translate (e.g., '--start-seq 0'). The source files must use a 4-digit, zero-padded numbering sequence ('de-news2019-0000.txt', 'de-news2019-0001.txt', etc.).
--end-seq END_SEQ Ending file sequence # The final source language file sequence number to translate.
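The zero-padded naming convention can be previewed with a small shell sketch; the prefix below is the hypothetical de-news2019- example from the table:

```shell
# Print the zero-padded source file names the translate tool would look for
# given --src-prefix de-news2019- --start-seq 0 --end-seq 3.
for i in $(seq 0 3); do
  printf 'de-news2019-%04d.txt\n' "$i"
done
```

A matching (hypothetical) invocation would then be `python translate.py --checkpoint avg --src-prefix de-news2019- --trg-prefix en-news2019- --start-seq 0 --end-seq 3 MyExperiment`.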

Paratext book (USFM file)

Using the combination of command line arguments described in this section, the translate command can be used to translate a USFM-formatted book from a Paratext project into the target language.

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment folder with the model to be used for the translation. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder.
--checkpoint CHECKPOINT Use specified checkpoint Use the specified checkpoint to generate the target language translations. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment.
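A sketch of this scenario, combining the flags shown in the usage string above; the project name, book, language code, and output file name here are all hypothetical:

```
python translate.py --checkpoint best \
                    --src-project MYPROJ \
                    --book MAT \
                    --trg-lang en \
                    --output-usfm MAT-draft.SFM \
                    MyExperiment
```

The book is specified with its 3-character USFM abbreviation, as in the test tool's --books option.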

Analyzing the results of an experiment

analyze

check_train_val_test_split

diff_predictions

Miscellaneous commands

average_checkpoints

export_embeddings