NMT: Usage
The tools described in this section are the tools that are most commonly used in setting up and running an experiment.
The config tool can be used to set up a simple configuration file (config.yml) for an experiment. The configuration settings are specified on the command line, and the tool generates a valid config.yml file with those settings in the specified experiment subfolder (SIL_NLP_DATA_PATH > MT > experiments > <experiment>).
usage: config.py [-h] [--src-langs [lang [lang ...]]]
[--trg-langs [lang [lang ...]]] [--vocab-size VOCAB_SIZE]
[--src-vocab-size SRC_VOCAB_SIZE]
[--trg-vocab-size TRG_VOCAB_SIZE] [--parent PARENT]
[--mirror] [--force] [--seed SEED] [--model MODEL]
experiment
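The folder layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's own code: it assumes that SIL_NLP_DATA_PATH is an environment variable, and the helper names `experiment_dir` and `config_path` are hypothetical.

```python
import os
from pathlib import Path
from typing import Optional


def experiment_dir(experiment: str, data_root: Optional[str] = None) -> Path:
    """Return <data root>/MT/experiments/<experiment>."""
    # Assumption: the data root comes from the SIL_NLP_DATA_PATH environment
    # variable unless it is passed in explicitly.
    root = Path(data_root if data_root is not None else os.environ["SIL_NLP_DATA_PATH"])
    return root / "MT" / "experiments" / experiment


def config_path(experiment: str, data_root: Optional[str] = None) -> Path:
    """Return the location of the experiment's config.yml file."""
    return experiment_dir(experiment, data_root) / "config.yml"
```

For example, `config_path("demo", "/data")` points at `/data/MT/experiments/demo/config.yml`.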
Arguments:
Argument | Purpose | Description |
---|---|---|
experiment | Experiment name | The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
--src-langs [lang [lang ...]] | Source language files | The names of one or more files in the source language(s). Each file must be located in the SIL_NLP_DATA_PATH > MT > corpora folder or the SIL_NLP_DATA_PATH > MT > scripture folder. Only the base of the file name is specified; e.g., to use the file 'abp-ABP.txt', specify 'abp-ABP'. |
--trg-langs [lang [lang ...]] | Target language files | The names of one or more files in the target language(s). Each file must be located in the SIL_NLP_DATA_PATH > MT > corpora folder or the SIL_NLP_DATA_PATH > MT > scripture folder. Only the base of the file name is specified; e.g., to use the file 'en-ABPBTE.txt', specify 'en-ABPBTE'. |
--vocab-size VOCAB_SIZE | Shared vocabulary size | Specifies the size (e.g., '32000') of the shared SentencePiece vocabulary that will be constructed from the text in the source and target files. |
--src-vocab-size SRC_VOCAB_SIZE | Source vocabulary size | Specifies the size (e.g., '32000') of a SentencePiece vocabulary that will be constructed from the text in the source files only. This option should be used in combination with the --trg-vocab-size argument. |
--trg-vocab-size TRG_VOCAB_SIZE | Target vocabulary size | Specifies the size (e.g., '32000') of a SentencePiece vocabulary that will be constructed from the text in the target files only. This option should be used in combination with the --src-vocab-size argument. |
--parent PARENT | Parent experiment name | The name of an experiment subfolder with a trained parent model. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
--mirror | Mirror train and validation data sets (default: False) | Specifies that the training and validation data sets constructed from the source and target files should be mirrored. With mirroring, each source/target sentence pair is added to the training (or validation) data set as both a source/target pair and as a target/source pair. Without mirroring, each sentence pair is only added as a source/target pair. |
--force | Overwrite existing config file | By default, the tool reports an error if a configuration file already exists in the specified experiment subfolder. If this argument is provided, the tool overwrites the existing configuration file instead. |
--seed SEED | Randomization seed | Specifies the randomization seed that will be used during preprocessing and training. |
--model MODEL | Neural network model | Specifies the neural network model that will be trained. Options: TransformerBase (default), TransformerBig, SILTransformerBaseNoResidual, or SILTransformerBaseAlignmentEnhanced. |
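The effect of the --mirror option can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than the tool's implementation: the function name `mirror` is hypothetical, and the order in which the tool emits the mirrored pairs is not documented here.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def mirror(pairs: List[Pair]) -> List[Pair]:
    """Add each sentence pair as both (source, target) and (target, source)."""
    mirrored: List[Pair] = []
    for src, trg in pairs:
        mirrored.append((src, trg))  # the original direction
        mirrored.append((trg, src))  # the reversed direction added by mirroring
    return mirrored
```

A mirrored data set therefore contains twice as many pairs as the unmirrored one.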
The preprocess tool prepares the various data files needed to train a model. Preprocessing steps include:
- creating SentencePiece vocabulary models from the experiment's source and target files;
- splitting the source and target files into the training, validation, and test data sets;
- writing the train/validate/test data sets to files in the subfolder;
- adapting the parent model (if one is specified) to be used by this experiment.
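The splitting step listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the preprocess tool's code: the function name `split_pairs` and the `val_size` and `test_size` parameters are hypothetical, and the sketch simply draws a seeded random sample for the test and validation sets.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]


def split_pairs(
    pairs: List[Pair], val_size: int, test_size: int, seed: int
) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Split sentence pairs into (train, validation, test) lists."""
    if val_size + test_size > len(pairs):
        raise ValueError("validation and test sizes exceed the corpus size")
    # Shuffle the indices with a fixed seed so the split is repeatable.
    indices = list(range(len(pairs)))
    random.Random(seed).shuffle(indices)
    test_idx = set(indices[:test_size])
    val_idx = set(indices[test_size:test_size + val_size])
    train, val, test = [], [], []
    # Walk the corpus in its original order so each set keeps corpus order.
    for i, pair in enumerate(pairs):
        if i in test_idx:
            test.append(pair)
        elif i in val_idx:
            val.append(pair)
        else:
            train.append(pair)
    return train, val, test
```

Using the same seed gives the same split, which is the role the --seed setting plays in making experiments repeatable.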
usage: preprocess.py [-h] [--stats] experiment
Arguments:
Argument | Purpose | Description |
---|---|---|
experiment | Experiment name | The name of the experiment subfolder to preprocess; the generated data files are written to this subfolder. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
--stats | Output corpus statistics | Using a statistical model, calculate an alignment score for the source and target texts. Use of this option requires the SIL.Machine library to be available. |
The train tool trains a neural model for one or more specified experiments. The experiment's configuration file (config.yml) and the data files created by the preprocess tool are used to control the training process.
usage: train.py [-h] [--mixed-precision] [--memory-growth]
[--num-devices NUM_DEVICES] [--eager-execution]
experiments [experiments ...]
Arguments:
Argument | Purpose | Description |
---|---|---|
experiments | Experiment names | The names of the experiments to train. Each experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
--mixed-precision | Enable mixed precision | |
--memory-growth | Enable memory growth | |
--num-devices NUM_DEVICES | Number of devices to train on | |
--eager-execution | Enable TensorFlow eager execution | |
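The command line interface documented above can be mirrored with a small argparse sketch. This is not the source of train.py; it only reproduces the usage line for illustration. The function name `build_train_parser`, the integer type, and the default of 1 for --num-devices are assumptions.

```python
import argparse


def build_train_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the arguments shown in the usage line."""
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("--mixed-precision", action="store_true", help="Enable mixed precision")
    parser.add_argument("--memory-growth", action="store_true", help="Enable memory growth")
    # Assumption: the device count is an integer that defaults to 1.
    parser.add_argument("--num-devices", type=int, default=1, help="Number of devices to train on")
    parser.add_argument("--eager-execution", action="store_true", help="Enable TensorFlow eager execution")
    # One or more experiment names, matching "experiments [experiments ...]".
    parser.add_argument("experiments", nargs="+", help="Experiment names")
    return parser
```

Parsing `["--mixed-precision", "exp1", "exp2"]` with this parser yields two experiment names with mixed precision enabled and the other switches left off.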
The test tool tests a trained neural model for the specified experiment. It uses a checkpoint from the trained model to generate target language predictions from the test set, and then scores those predictions against the reference text.
usage: test.py [-h] [--memory-growth] [--checkpoint CHECKPOINT] [--last]
[--best] [--avg] [--ref-projects [project [project ...]]]
[--force-infer] [--scorers [scorer [scorer ...]]]
[--books [book [book ...]]] [--by-book]
experiment
Arguments:
Argument | Purpose | Description |
---|---|---|
experiment | Experiment name | The name of the experiment to test. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
--memory-growth | Enable memory growth | |
--checkpoint CHECKPOINT | Test specified checkpoint | Use the specified checkpoint (e.g., '--checkpoint 6000') to generate target language predictions from the test set. The specified checkpoint must be available in the run subfolder of the specified experiment. |
--last | Test the last checkpoint | Use the last training checkpoint to generate target language predictions. |
--best | Test the best checkpoint | Use the best training checkpoint to generate target language predictions. The best checkpoint must be available in the run > export subfolder of the specified experiment. |
--avg | Test the averaged checkpoint | Use the averaged training checkpoint to generate target language predictions. The averaged checkpoint must be available in the run > avg subfolder of the specified experiment. An averaged checkpoint can be automatically generated during training using the train: average_last_checkpoints: <n> option, or it can be manually generated after training by using the average_checkpoints tool. |
--ref-projects [project [project ...]] | Reference projects | The generated target language predictions are typically scored using the target language test set as the reference. If multiple reference projects were configured, this option can be used to specify which of these reference projects should be considered when scoring the predictions. |
--force-infer | Force inferencing | If the test tool has already been used to generate and score predictions for an experiment's checkpoint, it will only score the predictions when it is run again on that same checkpoint. This option can be used to force the tool to re-generate the target language predictions. |
--scorers [scorer [scorer ...]] | List of scorers | Specifies the list of scorers to be used on the predictions. Options are 'bleu' (default), 'chrf3', 'meteor', 'ter', and 'wer'. |
--books [book [book ...]] | Books to score | Specifies one or more books to be scored. When this option is used, the test tool will generate predictions for the entire target language test set, but provide a score only for the specified book(s). Books must be specified using the three-character abbreviations from the USFM 3.0 standard (e.g., "GEN" for Genesis). |
--by-book | Score individual books | In addition to providing an overall score for all the books in the test set, provide individual scores for each book in the test set. If this option is used in combination with the --books option, individual scores are provided for each of the specified books. |
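The checkpoint locations named in the table above can be summarized in a short Python sketch. This is an illustration under assumptions, not the test tool's code: the function name `checkpoint_location` is hypothetical, and the location of the last checkpoint is assumed to be the run subfolder because the table does not state it.

```python
from pathlib import Path


def checkpoint_location(exp_dir: Path, checkpoint: str) -> Path:
    """Return the folder where the requested checkpoint is expected."""
    run_dir = exp_dir / "run"
    if checkpoint == "best":
        return run_dir / "export"  # best checkpoint: run > export
    if checkpoint == "avg":
        return run_dir / "avg"  # averaged checkpoint: run > avg
    if checkpoint == "last" or checkpoint.isdigit():
        # A numbered checkpoint lives in the run subfolder; "last" is
        # assumed to live there too.
        return run_dir
    raise ValueError("unknown checkpoint: " + checkpoint)
```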
The translate tool uses a trained neural model to translate text to a new language. Three translation scenarios are supported, with differing command line arguments for each scenario. The supported scenarios are:
- Using a trained model to translate the text in a file from the source language to a target language.
- Using a trained model to translate the text in a sequence of files into a target language.
- Using Google Translate to translate a USFM-formatted book in a Paratext project into a target language.
The command line arguments for each of these scenarios are described below.
usage: translate.py [-h] [--memory-growth] [--checkpoint CHECKPOINT]
[--src SRC] [--trg TRG] [--src-prefix SRC_PREFIX]
[--trg-prefix TRG_PREFIX] [--start-seq START_SEQ]
[--end-seq END_SEQ] [--src-project SRC_PROJECT]
[--book BOOK] [--trg-lang TRG_LANG]
[--output-usfm OUTPUT_USFM] [--eager-execution]
experiment
Using the combination of command line arguments described in this section, the translate command can be used to translate the sentences in a simple text file from the source language to the target language, using the requested checkpoint from a trained model.
Arguments:
Argument | Purpose | Description |
---|---|---|
experiment | Experiment name | The name of the experiment folder with the model to be used for translating the source text. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. The model must be one that supports a single target language (i.e., there is no target language argument for this scenario). |
--memory-growth | Enable memory growth | |
--checkpoint CHECKPOINT | Use specified checkpoint | Use the specified checkpoint to translate the source text. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment. |
--src SRC | Source file | Name of a text file with the source language sentences to be translated. The translate tool looks for the file in the current working directory or, if a full/relative path is specified, it looks for the file in the specified folder. Each line in the specified source file is translated and written to the specified target file. |
--trg TRG | Target file | Name of the text file where the translated sentences will be written (one per line). |
Using the combination of command line arguments described in this section, the translate command can be used to translate sentences from a sequence of source language text files. The sentences in these source language text files are translated to the target language using the requested checkpoint from a trained model, and written to a corresponding sequence of target language text files.
Arguments:
Argument | Purpose | Description |
---|---|---|
experiment | Experiment name | The name of the experiment folder with the model to be used for translating the source text. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. The model must be one that supports a single target language (i.e., there is no target language argument for this scenario). |
--checkpoint CHECKPOINT | Use specified checkpoint | Use the specified checkpoint to translate the source text. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment. |
--src-prefix SRC_PREFIX | Source file prefix (e.g., de-news2019-) | The file name prefix for the source files. The translate tool looks for the sequence of source files in the current working directory. |
--trg-prefix TRG_PREFIX | Target file prefix (e.g., en-news2019-) | The file name prefix for the target files. The translate tool will write the translated text to a series of files with this specified file name prefix; the translated files will be written to the current working directory. |
--start-seq START_SEQ | Starting file sequence # | The sequence number of the first source language file to translate (e.g., '--start-seq 0'). The source files must use a 4-digit, zero-padded numbering sequence ('de-news2019-0000.txt', 'de-news2019-0001.txt', etc.). |
--end-seq END_SEQ | Ending file sequence # | The sequence number of the final source language file to translate. |
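The file naming scheme described above can be sketched in Python. This is an illustration under assumptions, not the tool's code: the function name `sequence_files` is hypothetical, the target files are assumed to use the same 4-digit numbering and '.txt' extension as the source files, and the ending sequence number is treated as inclusive.

```python
from typing import List, Tuple


def sequence_files(
    src_prefix: str, trg_prefix: str, start_seq: int, end_seq: int, ext: str = ".txt"
) -> List[Tuple[str, str]]:
    """Return (source file, target file) names for start_seq..end_seq inclusive."""
    if end_seq < start_seq:
        raise ValueError("end_seq must not be smaller than start_seq")
    return [
        # ":04d" produces the 4-digit, zero-padded sequence number.
        (f"{src_prefix}{seq:04d}{ext}", f"{trg_prefix}{seq:04d}{ext}")
        for seq in range(start_seq, end_seq + 1)
    ]
```

For example, the prefixes 'de-news2019-' and 'en-news2019-' with a start of 0 and an end of 1 give the pairs ('de-news2019-0000.txt', 'en-news2019-0000.txt') and ('de-news2019-0001.txt', 'en-news2019-0001.txt').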
Arguments:
Argument | Purpose | Description |
---|---|---|
experiment | Experiment name | The name of the experiment folder with the model to be used for translating the source text. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
--checkpoint CHECKPOINT | Use specified checkpoint | Use the specified checkpoint to translate the source text. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment. |