Improving Back-Translation with Uncertainty-based Confidence Estimation

Introduction

This repository contains the implementation of our paper, Improving Back-Translation with Uncertainty-based Confidence Estimation:

@inproceedings{Wang:2019:EMNLP,
    title = "Improving Back-Translation with Uncertainty-based Confidence Estimation",
    author = "Wang, Shuo and Liu, Yang and Wang, Chao and Luan, Huanbo and Sun, Maosong",
    booktitle = "EMNLP",
    year = "2019"
}

The implementation is built on top of THUMT.

Prerequisites

This repository runs in the same environment as THUMT. Please refer to the THUMT user manual to configure the environment.

Usage

Note: The usage is not yet user-friendly and may be improved later.
Suppose the local path to this repository is CODE_DIR.

Standard training:

python [CODE_DIR]/thumt/bin/trainer.py \
	--input [source corpus] [target corpus] \
	--side none \
	--vocabulary [source vocabulary] [target vocabulary] \
	--model transformer \
	--parameters=train_steps=60000,constant_batch_size=false,batch_size=6250,device_list=[0,1,2,3]

To train a target-to-source translation model, swap the source and target corpora and the source and target vocabularies in the command above.

Translate target-side monolingual corpus:

python [CODE_DIR]/thumt/bin/translator.py \
	--input [monolingual corpus] \
	--output [translated corpus] \
	--vocabulary [target vocabulary] [source vocabulary] \
	--model transformer \
	--checkpoint [path to the target-source model] \
	--parameters=device_list=[0]

If the monolingual corpus is very large, we recommend splitting it into smaller corpora before translation.
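
The splitting step is not part of this repository. A minimal sketch of one way to do it follows, assuming the corpus is a UTF-8 text file with one sentence per line. The helper names `split_corpus` and `merge_shards` and the `.partNNNN` file suffix are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def split_corpus(path, lines_per_shard, out_dir):
    """Split a one-sentence-per-line corpus into numbered shard files.

    Returns the shard paths in their original order.
    """
    if lines_per_shard <= 0:
        raise ValueError("lines_per_shard must be positive")
    src = Path(path)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    shards = []
    writer = None
    with src.open("r", encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            # Start a new shard every `lines_per_shard` lines.
            if index % lines_per_shard == 0:
                if writer is not None:
                    writer.close()
                shard = out / f"{src.name}.part{len(shards):04d}"
                shards.append(shard)
                writer = shard.open("w", encoding="utf-8")
            writer.write(line)
    if writer is not None:
        writer.close()
    return shards


def merge_shards(shards, merged_path):
    """Concatenate shard files in the given order into one file."""
    with open(merged_path, "w", encoding="utf-8") as writer:
        for shard in shards:
            with open(shard, "r", encoding="utf-8") as reader:
                for line in reader:
                    writer.write(line)
```

Each shard can then be passed to translator.py as [monolingual corpus]. Merging the translated shards in the same order keeps the translated corpus line-aligned with the original monolingual corpus.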

Uncertainty estimation for the translated corpus:

python [CODE_DIR]/thumt/bin/scorer.py \
	--input [monolingual corpus] [translated corpus] \
	--vocabulary [target vocabulary] [source vocabulary] \
	--mean_file [word-level mean] \
	--var_file [word-level var] \
	--rv_file [word-level var/mean] \
	--sen_mean [sentence-level mean] \
	--sen_var [sentence-level var] \
	--sen_rv [sentence-level var/mean] \
	--model transformer \
	--checkpoint [path to the target-source model] \
	--parameters=model_uncertainty=true,device_list=[0]
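
The output file labels (mean, var, var/mean) suggest three statistics per word. As a rough illustration only, the sketch below computes them from a list of probabilities that a model assigned to the same word over several stochastic forward passes. It assumes Monte Carlo dropout-style sampling and a population variance; the function name `word_uncertainty` and the sample values are hypothetical, and this is not the code that scorer.py runs.

```python
from statistics import fmean, pvariance


def word_uncertainty(samples):
    """Summarize sampled probabilities of one word.

    `samples` holds the probability assigned to the same word in each
    stochastic forward pass. Returns (mean, variance, variance / mean).
    """
    if not samples:
        raise ValueError("at least one sample is required")
    mean = fmean(samples)
    var = pvariance(samples, mu=mean)  # population variance (assumption)
    # Guard against a zero mean; real softmax outputs are strictly positive.
    ratio = var / mean if mean > 0.0 else float("inf")
    return mean, var, ratio


# Hypothetical probabilities of one word from three sampled passes.
mean, var, ratio = word_uncertainty([0.5, 0.7, 0.9])
print(f"mean={mean:.4f} var={var:.4f} var/mean={ratio:.4f}")
```

A high mean with a low variance indicates that the model is consistently confident about the word, while a large var/mean ratio flags words whose probability fluctuates strongly between passes.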

Confidence-aware training:

python [CODE_DIR]/thumt/bin/trainer.py \
	--input [source corpus] [target corpus] \
	--word_confidence [word-level uncertainty file] \
	--sen_confidence [sentence-level uncertainty file] \
	--side source_sentence_source_word \
	--vocabulary [source vocabulary] [target vocabulary] \
	--model transformer \
	--checkpoint [path to the source-target checkpoint] \
	--parameters=train_steps=60000,constant_batch_size=false,batch_size=6250,device_list=[0,1,2,3]
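
Before starting confidence-aware training, it can help to verify that the corpora and the uncertainty files are line-aligned. The sketch below assumes every file holds one sentence (or one sentence's scores) per line; that layout is an assumption rather than something stated here, and the helper names `count_lines` and `check_alignment` are hypothetical.

```python
def count_lines(path):
    """Count the lines of a UTF-8 text file."""
    with open(path, "r", encoding="utf-8") as reader:
        return sum(1 for _ in reader)


def check_alignment(paths):
    """Raise ValueError unless all files have the same number of lines.

    Returns the shared line count (0 if `paths` is empty).
    """
    counts = {str(path): count_lines(path) for path in paths}
    if len(set(counts.values())) > 1:
        raise ValueError(f"line counts differ: {counts}")
    return next(iter(counts.values()), 0)
```

For example, `check_alignment([source_corpus, target_corpus, word_file, sentence_file])` would raise an error when a shard was lost or merged out of order during the translation or scoring steps.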

Contact

If you have questions, suggestions, or bug reports, please email [email protected].
