
NMT: Usage

Michael A. Martin edited this page Jun 14, 2021 · 25 revisions

Setting up and running an experiment

This section describes the tools most commonly used to set up and run an experiment.

config

The config tool can be used to set up a simple configuration file (config.yml) for an experiment. The configuration settings are specified on the command line, and the tool generates a valid config.yml file with those settings in the specified experiment subfolder (SIL_NLP_DATA_PATH > MT > experiments > <experiment>).

usage: config.py [-h] [--src-langs [lang [lang ...]]]
[--trg-langs [lang [lang ...]]] [--vocab-size VOCAB_SIZE]
[--src-vocab-size SRC_VOCAB_SIZE]
[--trg-vocab-size TRG_VOCAB_SIZE] [--parent PARENT]
[--mirror] [--force] [--seed SEED] [--model MODEL]
experiment

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder.
--src-langs [lang [lang ...]] Source language files The name of one (or more) files in the source language(s). Each file must be located in the SIL_NLP_DATA_PATH > MT > corpora folder or the SIL_NLP_DATA_PATH > MT > scripture folder. Only the base of the file name is specified; e.g., to use the file `abp-ABP.txt`, specify `abp-ABP`.
--trg-langs [lang [lang ...]] Target language files The name of one (or more) files in the target language(s). Each file must be located in the SIL_NLP_DATA_PATH > MT > corpora folder or the SIL_NLP_DATA_PATH > MT > scripture folder. Only the base of the file name is specified; e.g., to use the file `en-ABPBTE.txt`, specify `en-ABPBTE`.
--vocab-size VOCAB_SIZE Shared vocabulary size Specifies the size (e.g., '32000') of the shared SentencePiece vocabulary that will be constructed from the text in the source and target files.
--src-vocab-size SRC_VOCAB_SIZE Source vocabulary size Specifies the size (e.g., '32000') of a SentencePiece vocabulary that will be constructed from the text in the source files (only). This option should be used in combination with the --trg-vocab-size argument.
--trg-vocab-size TRG_VOCAB_SIZE Target vocabulary size Specifies the size (e.g., '32000') of a SentencePiece vocabulary that will be constructed from the text in the target files (only). This option should be used in combination with the --src-vocab-size argument.
--parent PARENT Parent experiment name The name of an experiment subfolder with a trained parent model. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder.
--mirror Mirror train and validation data sets (default: False) Specifies that the training and validation data sets constructed from the source and target files should be mirrored. With mirroring, each source/target sentence pair is added to the training (or validation) data set as both a source/target pair and as a target/source pair. Without mirroring, each sentence pair is only added as a source/target pair.
--force Overwrite existing config file By default, the tool reports an error if a configuration file already exists in the specified experiment subfolder. If this argument is provided, the tool overwrites the existing configuration file instead.
--seed SEED Randomization seed Specifies the randomization seed that will be used during preprocessing and training.
--model MODEL Neural network model Specifies the neural network model that will be trained. Options: TransformerBase (default), TransformerBig, SILTransformerBaseNoResidual, or SILTransformerBaseAlignmentEnhanced.
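As an illustration, a config invocation might look like the following. The experiment name MyExperiment is hypothetical, and the file names are the abp-ABP / en-ABPBTE examples used above; the exact way the script is launched may differ in your environment.

```
python config.py --src-langs abp-ABP \
                 --trg-langs en-ABPBTE \
                 --vocab-size 32000 \
                 --seed 111 \
                 MyExperiment
```

This would write a config.yml into the SIL_NLP_DATA_PATH > MT > experiments > MyExperiment subfolder, which must already exist.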

preprocess

The preprocess tool prepares the various data files needed to train a model. Preprocessing steps include:

  • creating SentencePiece vocabulary models from the experiment's source and target files;
  • splitting the source and target files into the training, validation, and test data sets;
  • writing the train/validate/test data sets to files in the experiment's subfolder;
  • adapting the parent model (if one is specified) to be used by this experiment.

usage: preprocess.py [-h] [--stats] experiment

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment subfolder to be preprocessed. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder and must contain a config.yml configuration file.
--stats Output corpus statistics Using a statistical model, calculate an alignment score for the source and target texts. Use of this option requires the SIL.Machine library to be available.
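A minimal sketch of a preprocess invocation, assuming a hypothetical experiment subfolder named MyExperiment that already contains a config.yml:

```
python preprocess.py --stats MyExperiment
```

The --stats flag can be dropped if the SIL.Machine library is not available.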

train

The train tool trains a neural model for one or more specified experiments. The experiment's configuration file (config.yml) and the data files created by the preprocess tool are used to control the training process.

usage: train.py [-h] [--mixed-precision] [--memory-growth]
[--num-devices NUM_DEVICES] [--eager-execution]
experiments [experiments ...]

Arguments:

Argument Purpose Description
experiments Experiment names The names of the experiments to train. Each experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder.
--mixed-precision Enable mixed precision
--memory-growth Enable memory growth
--num-devices NUM_DEVICES Number of devices to train on
--eager-execution Enable TensorFlow eager execution
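As a sketch, a train invocation for a single hypothetical experiment, MyExperiment, using two devices and mixed precision, might look like:

```
python train.py --mixed-precision --num-devices 2 MyExperiment
```

Multiple experiment names can be listed to train them in sequence.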

test

The test tool evaluates a trained model for a specified experiment. It uses a model checkpoint to generate target language predictions for the experiment's test set and then scores those predictions against the reference text.

usage: test.py [-h] [--memory-growth] [--checkpoint CHECKPOINT] [--last]
[--best] [--avg] [--ref-projects [project [project ...]]]
[--force-infer] [--scorers [scorer [scorer ...]]]
[--books [book [book ...]]] [--by-book]
experiment

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment to test. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder.
--memory-growth Enable memory growth
--checkpoint CHECKPOINT Test specified checkpoint Use the specified checkpoint (e.g., '--checkpoint 6000') to generate target language predictions from the test set. The specified checkpoint must be available in the run subfolder of the specified experiment.
--last Test the last checkpoint Use the last training checkpoint to generate target language predictions.
--best Test the best checkpoint Use the best training checkpoint to generate target language predictions. The best checkpoint must be available in the run > export subfolder of the specified experiment.
--avg Test the averaged checkpoint Use the averaged training checkpoint to generate target language predictions. The averaged checkpoint must be available in the 'run > avg' subfolder of the specified experiment. An averaged checkpoint can be automatically generated during training using the 'train: average_last_checkpoints: <n>' option, or it can be manually generated after training by using the average_checkpoints tool.
--ref-projects [project [project ...]] Reference projects The generated target language predictions are typically scored using the target language test set as the reference. If multiple reference projects were configured, this option can be used to specify which of these reference projects should be considered when scoring the predictions.
--force-infer Force inferencing If the test tool has already been used to generate and score predictions for an experiment's checkpoint, it will only score the predictions when it is run again on that same checkpoint. This option can be used to force the tool to re-generate the target language predictions.
--scorers [scorer [scorer ...]] List of scorers Specifies the list of scorers to be used on the predictions. Options are 'bleu' (default), 'chrf3', 'meteor', 'ter', and 'wer'.
--books [book [book ...]] Books to score Specifies one or more books to be scored. When this option is used, the test tool will generate predictions for the entire target language test set, but provide a score only for the specified book(s). Books must be specified using the 3-character abbreviations from the USFM 3.0 standard (e.g., "GEN" for Genesis).
--by-book Score individual books In addition to providing an overall score for all the books in the test set, provide individual scores for each book in the test set. If this option is used in combination with the --books option, individual scores are provided for each of the specified books.
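As a sketch, a test invocation that scores the best checkpoint of a hypothetical experiment, MyExperiment, with both BLEU and chrF3, reporting per-book scores, might look like:

```
python test.py --best --scorers bleu chrf3 --by-book MyExperiment
```

Adding --force-infer would regenerate the predictions rather than rescoring existing ones.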

translate

The translate tool uses a trained neural model to translate text to a new language. Three translation scenarios are supported, with differing command line arguments for each scenario. The supported scenarios are:

  1. Using a trained model to translate the text in a file from the source language to a target language.
  2. Using a trained model to translate the text in a sequence of files into a target language.
  3. Using Google Translate to translate a USFM-formatted book in a Paratext project into a target language.

The command line arguments for each of these scenarios are described below.

usage: translate.py [-h] [--memory-growth] [--checkpoint CHECKPOINT]
[--src SRC] [--trg TRG] [--src-prefix SRC_PREFIX]
[--trg-prefix TRG_PREFIX] [--start-seq START_SEQ]
[--end-seq END_SEQ] [--src-project SRC_PROJECT]
[--book BOOK] [--trg-lang TRG_LANG]
[--output-usfm OUTPUT_USFM] [--eager-execution]
experiment

Text file

Using the combination of command line arguments described in this section, the translate command can be used to translate the sentences in a simple text file from the source language to the target language, using the requested checkpoint from a trained model.

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment folder with the model to be used for translating the source text. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. The model must be one that supports a single target language (i.e., there is no target language argument for this scenario).
--memory-growth Enable memory growth
--checkpoint CHECKPOINT Use specified checkpoint Use the specified checkpoint to generate the target language translations. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment.
--src SRC Source file Name of a text file with the source language sentences to be translated. The translate tool looks for the file in the current working directory or, if a full/relative path is specified, it looks for the file in the specified folder. Each line in the specified source file is translated and written to the specified target file.
--trg TRG Target file Name of the text file where the translated sentences will be written (one per line).
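A sketch of this scenario, with hypothetical experiment and file names:

```
python translate.py --checkpoint best \
                    --src sentences.src.txt \
                    --trg sentences.trg.txt \
                    MyExperiment
```

Each line of sentences.src.txt would be translated and written to the corresponding line of sentences.trg.txt.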

Sequence of Text Files

Using the combination of command line arguments described in this section, the translate command can be used to translate sentences from a sequence of source language text files. The sentences in these source language text files are translated to the target language using the requested checkpoint from a trained model, and written to a corresponding sequence of target language text files.

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment folder with the model to be used for translating the source text. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. The model must be one that supports a single target language (i.e., there is no target language argument for this scenario).
--checkpoint CHECKPOINT Use specified checkpoint Use the specified checkpoint to generate the target language translations. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment.
--src-prefix SRC_PREFIX Source file prefix (e.g., de-news2019-) The file name prefix for the source files. The translate tool looks for the sequence of source files in the current working directory.
--trg-prefix TRG_PREFIX Target file prefix (e.g., en-news2019-) The file name prefix for the target files. The translate tool will write the translated text to a series of files with this specified file name prefix; the translated files will be written to the current working directory.
--start-seq START_SEQ Starting file sequence # The first source language file sequence number to translate (e.g., '--start-seq 0'). The source files must use a 4-digit, zero-padded numbering sequence ('de-news2019-0000.txt', 'de-news2019-0001.txt', etc.).
--end-seq END_SEQ Ending file sequence # The final source language file sequence number to translate.
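The zero-padded naming convention can be previewed with a small shell sketch; the prefix below is the hypothetical de-news2019- example from the table:

```shell
# Print the zero-padded source file names the translate tool would look for
# given --src-prefix de-news2019- --start-seq 0 --end-seq 3.
for i in $(seq 0 3); do
  printf 'de-news2019-%04d.txt\n' "$i"
done
```

A matching (hypothetical) invocation would then be `python translate.py --checkpoint avg --src-prefix de-news2019- --trg-prefix en-news2019- --start-seq 0 --end-seq 3 MyExperiment`.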

Paratext book (USFM file)

Using the combination of command line arguments described in this section, the translate command can be used to translate a USFM-formatted book from a Paratext project into the target language.

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment folder with the model to be used for the translation. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder.
--checkpoint CHECKPOINT Use specified checkpoint Use the specified checkpoint to generate the target language translations. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment.
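A sketch of this scenario, combining the flags shown in the usage string above; the project name, book, language code, and output file name here are all hypothetical:

```
python translate.py --checkpoint best \
                    --src-project MYPROJ \
                    --book MAT \
                    --trg-lang en \
                    --output-usfm MAT-draft.SFM \
                    MyExperiment
```

The book is specified with its 3-character USFM abbreviation, as in the test tool's --books option.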

Analyzing the results of an experiment

analyze

check_train_val_test_split

diff_predictions

Miscellaneous commands

average_checkpoints

export_embeddings