
Add example for NMT without attention #22

Merged — 15 commits merged into PaddlePaddle:develop from seq2seq_demo_dev on May 24, 2017

Conversation

@kuke (Collaborator) commented May 3, 2017:

Resolves #21


import sys
import gzip
import sqlite3
Collaborator:

This package is not used in this script; please remove it.

# Embedding of the last generated word is automatically gotten by
# GeneratedInputs, which is initialized by a start mark, such as <s>,
# and must be included in generation.

Collaborator:

Lines 111 to 118 are copied from explanations written for the user documentation; they can be deleted.

input=src_word_id,
size=word_vector_dim,
param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))

Collaborator:

The param_attr only specifies the parameter name, and this parameter is not referred to by its name, so it should be removed.

reverse=True)
encoded_vector = paddle.layer.concat(
input=[encoder_forward, encoder_backward])

Collaborator:

Please rewrite lines 29 to 42 by invoking https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/trainer_config_helpers/networks.py#L1096, and please help check issue PaddlePaddle/Paddle#2001. Thanks very much.

Collaborator Author:

Thanks for the reminder. I will rewrite this part with `bidirectional_gru` and clean up the other parts.

# paddle.dataset.wmt14.test(source_dict_dim),
# buf_size=8192), batch_size=1)

#test_result = trainer.test(wmt14_test_batch)
Collaborator:

Why are lines 164 to 169 commented out?

Collaborator Author:

For some unknown reason, the test phase dragged down the running speed even with a very small batch size. I will enable these lines later once I fix that.


beam_gen = seq2seq_net(source_dict_dim, target_dict_dim, True)
# get the pretrained model, whose bleu = 26.92
# parameters = paddle.dataset.wmt14.model()
Collaborator:

Please remove line 196; it is copied from the old demos and is not correct for this script.

Collaborator:

Please remove the unused code at line 197, and please keep the code clean.

paddle.init(use_gpu=False, trainer_count=4)
source_language_dict_dim = 30000
target_language_dict_dim = 30000
generating = True
Collaborator:

Please make `generating` a parameter and read it from the command line. It is currently set to True; if a user runs this example directly, that is not logical, because we do not provide a trained model, and the user cannot generate without training first.
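The change requested above — reading the train/generate switch from the command line instead of hard-coding `generating = True` — can be sketched with Python's standard `argparse`. The flag name `--generate` matches the invocation shown later in this conversation; everything else here is illustrative, not the script's actual code:

```python
import argparse

def parse_args(argv=None):
    # Expose the train/generate switch on the command line instead of
    # hard-coding `generating = True` inside the script.
    parser = argparse.ArgumentParser(description="NMT without attention")
    parser.add_argument(
        "--generate",
        action="store_true",
        help="generate translations with a trained model; "
             "the default (flag absent) is to train")
    return parser.parse_args(argv)

print(parse_args([]).generate)              # False -> training mode
print(parse_args(["--generate"]).generate)  # True  -> generation mode
```

With this, `main()` can dispatch on `args.generate` rather than a module-level constant.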



def generate(source_dict_dim, target_dict_dim):
# use the first 3 samples for generation
Collaborator:

This code is copied directly from PaddleBook, where it demonstrates restricting the number of samples to generate so that generation finishes in a short time. Here, however, we want to provide an example that users can run directly, so the constraint should be removed.

After copying, please delete inappropriate code and comments.

@lcy-seso lcy-seso self-assigned this May 4, 2017
@lcy-seso (Collaborator) left a comment:

Request changes.


@kuke (Collaborator Author) commented May 8, 2017:

Changes done. Hi @lcy-seso, please review.

@lcy-seso (Collaborator) left a comment:

The configuration looks good to me. Please help to add the README.md.

cost = seq2seq_net(source_dict_dim, target_dict_dim)
parameters = paddle.parameters.create(cost)

# define optimize method and trainer
Collaborator:

optimize --> optimization.

@kuke kuke force-pushed the seq2seq_demo_dev branch 2 times, most recently from cd2feb7 to 520c229 Compare May 15, 2017 05:17
@kuke (Collaborator Author) commented May 15, 2017:

Doc done. Please review the README.md file. @lcy-seso

@lcy-seso (Collaborator) left a comment:

Please refine the doc.

@@ -0,0 +1,242 @@
# Neural Machine Translation Model

Machine translation uses computers to convert expressions in a source language into equivalent expressions in a target language, and is a very important research direction in natural language processing. Machine translation has broad application demands, and its implementation has kept evolving. Traditional machine translation methods are mainly based on rules or statistical models, requiring manually specified translation rules or hand-designed language features, and their effectiveness depends on how well humans understand the source and target languages. In recent years, the emergence and rapid development of deep learning has made automatic feature learning possible. Deep learning first succeeded in image recognition and speech recognition, and then set off a wave of research in natural language processing fields such as machine translation. Deep learning models for machine translation directly learn the mapping from the source language to the target language, greatly reducing human involvement in the learning process while significantly improving translation quality. This tutorial mainly introduces how to use a recurrent neural network (RNN) in Paddlepaddle to build an end-to-end Neural Machine Translation model.
Collaborator:

  1. Paddlepaddle --> PaddlePaddle

Collaborator Author:

done



#
Collaborator:

Delete line 5.

Collaborator Author:

done


RNN-based machine translation models commonly use an Encoder-Decoder structure, in which both the encoder and the decoder are recurrent neural networks. Unrolling the two RNNs that make up the encoder and the decoder over time, we obtain the following model structure diagram

<p align="center"><img src="figures/Encoder-Decoder.png" width = "90%" align="center"/><br/>Figure 1. Encoder-Decoder framework </p>
Collaborator:

Let's name the folder images uniformly, to stay consistent with the other examples.

Collaborator Author:

done



- **Encoder**: encodes the sentence expressed in the source language into a vector, which serves as the decoder's input. The decoder's raw input is a sequence of character or word ids <img src="http://chart.googleapis.com/chart?cht=tx&chl= w=\langle w_1, w_2, ..., w_T\rangle" style="border:none;">,
Collaborator:

  1. 'encodes the sentence expressed in the source language into a vector' --> 'encodes the source-language sentence into a vector'
  2. The image does not display properly for me here.

Collaborator Author:

done


- **Encoder**: encodes the sentence expressed in the source language into a vector, which serves as the decoder's input. The decoder's raw input is a sequence of character or word ids <img src="http://chart.googleapis.com/chart?cht=tx&chl= w=\langle w_1, w_2, ..., w_T\rangle" style="border:none;">,
represented with one-hot encoding. To reduce the dimensionality of the input and establish semantic associations between words, we then convert the one-hot representation into a word embedding (Word Embedding) representation. Finally the RNN unit processes the input character by character or word by word, producing an encoding vector for the whole sentence.
Collaborator:

  1. For '紧接着我们' (avoid 'we' where possible; direct detailed introductions to PaddleBook, avoid expanding on details, and stay precise) --> 'The model learns a word embedding representation, i.e. the commonly used word vector, for each one-hot encoded word; for a detailed introduction to word vectors please refer to the word-vector chapter of PaddleBook.'

Collaborator Author:

done

```

## Data Preparation
The data used in this tutorial comes from [WMT14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/), a French-English parallel corpus. [bitexts](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) is used as the training data, and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) as the validation and test data. Paddlepaddle already wraps a reading interface for this dataset; on the first run the program downloads it automatically, so users do not need to prepare the data manually.
Collaborator:

Paddlepaddle-->PaddlePaddle

Collaborator Author:

done

* `prob` is the score of the generated sentence, followed by the translated sentence;
* `<s>` marks the beginning of a sentence and `<e>` marks the end of a sentence; words not contained in the dictionary are replaced with `<unk>`.

At this point we have implemented a basic machine translation model on Paddlepaddle. As we can see, Paddlepaddle provides a rich and flexible set of APIs to choose from and use, allowing us to configure all kinds of complex networks conveniently. Machine translation itself is a fast-developing field, with new methods and ideas constantly emerging. After finishing this tutorial, interested readers with time to spare can implement more complex, better-performing machine translation models on the Paddlepaddle platform.
Collaborator:

Replace Paddlepaddle with PaddlePaddle throughout the document.

Collaborator Author:

done

```

The corresponding English translations are:

Collaborator:

Is this probability right? How can the log prob be so small?

Collaborator Author:

Modified. done
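On the question of why the reported log probabilities look so small: a sentence score in this kind of model is typically the sum of per-word log probabilities, so it grows more negative with sentence length even when every individual word is fairly likely. The per-word probabilities below are invented purely to show the effect:

```python
import math

# Hypothetical per-word probabilities assigned by a decoder to a
# 5-word sentence (made up for illustration).
word_probs = [0.4, 0.3, 0.5, 0.2, 0.35]

# The sentence score is the sum of log probabilities ...
log_prob = sum(math.log(p) for p in word_probs)
# ... which equals the log of the product of the probabilities,
# a very small number for any non-trivial sentence length.
prob = math.exp(log_prob)

print(round(log_prob, 3))  # -5.473
print(prob)                # roughly 0.0042
```

So a log prob of, say, -20 or -30 for a full sentence is entirely normal; it does not by itself indicate a modeling problem.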

python nmt_without_attention_v2.py --generate
```
the corresponding translations for the test data are generated automatically.
If the beam search size is set to 3, given the input French sentence
Collaborator:

Say: set the beam size to 3.

Collaborator Author:

done
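To make "beam size 3" concrete, here is a minimal, framework-free beam search over a toy step function. The `NEXT` table of next-word probabilities is invented for illustration and stands in for the real decoder step `gru_decoder_without_attention()`:

```python
import math

# Toy next-word distribution: given the last word, each candidate next
# word gets a fixed probability. Stands in for the real decoder step.
NEXT = {
    "<s>":  {"the": 0.6, "a": 0.3, "cats": 0.1},
    "the":  {"cats": 0.7, "cat": 0.3},
    "a":    {"cat": 0.8, "dog": 0.2},
    "cats": {"<e>": 1.0},
    "cat":  {"<e>": 1.0},
    "dog":  {"<e>": 1.0},
    "<e>":  {},          # no expansions: the hypothesis is finished
}

def beam_search(beam_size=3, max_len=4):
    # Each hypothesis is (accumulated log prob, word sequence).
    beams = [(0.0, ["<s>"])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            expansions = NEXT.get(seq[-1], {})
            if not expansions:           # finished: carry over unchanged
                candidates.append((score, seq))
            for word, p in expansions.items():
                candidates.append((score + math.log(p), seq + [word]))
        # Keep only the beam_size best partial translations.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beams

for score, seq in beam_search():
    print(round(score, 3), " ".join(seq))
```

With `beam_size=3` the search keeps the three highest-scoring hypotheses at every step, which is exactly what the `prob` values printed by the generation script rank.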

- **train():** builds the network, defines the data reader, sets up the trainer and the event handler, and finally trains the network;

- **generate():** initializes the network from the specified model path, loads the test data, and completes the translation with `beam_search`.

Collaborator:

For lines 185 to 189, please explain the core logic rather than giving this kind of play-by-play account; for an ordinary user, such an account is no different from not having read the code at all.

Collaborator Author:

done

@kuke kuke force-pushed the seq2seq_demo_dev branch from 35346b3 to e816864 Compare May 16, 2017 08:23
@kuke (Collaborator Author) left a comment:

Improved the doc accordingly. Please review the changes, @lcy-seso.

@kuke replied "done" on each of the review threads above.

@lcy-seso (Collaborator) left a comment:

Some small modifications of the README are needed.

# Neural Network Machine Translation Model
## Background

Machine translation uses computers to convert expressions in a source language into equivalent expressions in a target language, and is a very important research direction in natural language processing. Machine translation has broad application demands, and its implementation has kept evolving. Traditional machine translation methods are mainly based on rules or statistical models, requiring manually specified translation rules or hand-designed language features, and their effectiveness depends on how well humans understand the source and target languages. In recent years, the emergence and rapid development of deep learning has made automatic feature learning possible. Deep learning first succeeded in image recognition and speech recognition, and then set off a wave of research in natural language processing fields such as machine translation. Deep learning models for machine translation directly learn the mapping from the source language to the target language, greatly reducing human involvement in the learning process while significantly improving translation quality. This tutorial mainly introduces how to use a recurrent neural network (RNN) in PaddlePaddle to build an end-to-end Neural Network Machine Translation model.
Collaborator:

Some small wording changes:

  1. 'a very important research direction in natural language processing' --> 'an important research direction in natural language processing'; let's drop 'very'.
  2. '经历了不断的演化' --> '经历了不断地演化' (fix the particle 的 --> 地).
  3. '传统的机器翻译方法' --> '传统机器翻译方法' (drop the extra 的).
  4. Delete the sentence 'The emergence and rapid development of deep learning has made automatic feature learning possible.'
  5. '本教程' (this tutorial) --> '本例' (this example), for consistency with the other examples.
  6. Rewrite the last sentence as: 'This example introduces how to use a Recurrent Neural Network (RNN) in PaddlePaddle to build an end-to-end Neural Machine Translation (NMT) model.'

Collaborator Author:

done


## Model Overview

Collaborator:

Delete the extra line at line 8.

Collaborator Author:

done

## Model Overview


The RNN-based machine translation model commonly uses an Encoder-Decoder structure, in which both the encoder and the decoder are recurrent neural networks. If the two RNNs that make up the encoder and the decoder are unrolled over time, the following model structure diagram is obtained
Collaborator:

  1. 'The RNN-based machine translation model commonly uses an Encoder-Decoder' --> 'The RNN-based neural network machine translation model follows the Encoder-Decoder structure.'
  2. 'If the two RNNs that make up the encoder and the decoder are unrolled over time, the following model structure diagram is obtained' --> 'Unrolling the two RNNs that make up the encoder and the decoder along the time steps gives the following model structure diagram:'

Collaborator Author:

done


<p align="center"><img src="images/Encoder-Decoder.png" width = "90%" align="center"/><br/>Figure 1. Encoder-Decoder framework </p>

The basic input and output units of this translation model can be characters, words, or phrases. Without loss of generality, the following uses a word-based model as an example to explain how the encoder/decoder works:
Collaborator:

  1. '该翻译模型输入输出的基本单位可以是字符' --> 'The inputs and outputs of a neural machine translation model can be characters,'
  2. '不失一般性,下面' ('without loss of generality, the following') --> '不失一般性,本例' ('without loss of generality, this example')

Collaborator Author:

done



- **Encoder**: encodes the source-language sentence into a vector, which serves as the decoder's input. The decoder's raw input is the word `id` sequence $w = {w_1, w_2, ..., w_T}$, represented with one-hot encoding. To reduce the dimensionality of the input and establish semantic associations between words, the model learns a word embedding (Word Embedding) representation, i.e. the commonly used word vector, for each one-hot encoded word; for a detailed introduction to word vectors please refer to the [word vector](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec) chapter of PaddleBook. Finally the RNN unit processes the input word by word, producing an encoding vector for the whole sentence.
Collaborator:

  1. '用独热码(One-hot encoding)表示' --> '用独热(One-hot)码表示' (move the parenthesis).
  2. Change the word-vector link to the Chinese markdown version; the BOOK currently defaults to the English edition, while models is a Chinese tutorial, so for now models should reference the Chinese content throughout.

Collaborator Author:

done
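The encoder description under review compresses two ideas — one-hot input ids and a learned embedding lookup — that a few lines of plain Python make concrete. The tiny vocabulary and 3-dimensional embedding table here are made up for illustration; in the real script the table is a trainable parameter:

```python
# A toy vocabulary: each word gets an integer id.
vocab = {"<s>": 0, "les": 1, "chats": 2, "<e>": 3}

def one_hot(word_id, vocab_size):
    """Dense 0/1 vector with a single 1 at position word_id."""
    v = [0] * vocab_size
    v[word_id] = 1
    return v

# A (vocab_size x 3) embedding table; trained in practice, fixed here.
embedding_table = [
    [0.1, 0.2, 0.3],   # <s>
    [0.4, 0.5, 0.6],   # les
    [0.7, 0.8, 0.9],   # chats
    [1.0, 1.1, 1.2],   # <e>
]

def embed(word_id):
    # Multiplying a one-hot vector by the table selects a single row,
    # which is why embedding layers are implemented as table lookups.
    return embedding_table[word_id]

sentence = ["<s>", "les", "chats", "<e>"]
ids = [vocab[w] for w in sentence]
print(ids)                  # [0, 1, 2, 3]
print(embed(vocab["les"]))  # [0.4, 0.5, 0.6]
```

The RNN encoder then consumes these embedding vectors one per time step to produce the sentence encoding.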

The decoder behaves very differently in the training and testing phases of the model:

- **Training phase**: the word vectors `trg_embedding` of the target translation are passed as a parameter to the step function `gru_decoder_without_attention()`; the function `recurrent_group()` calls the step function in a loop, and finally the cost between the target translation and the actual decoding is computed and returned;
- **Testing phase**: the decoder predicts the next word from the last generated word; `GeneratedInputV2()` automatically generates the embedding of the last generated word and passes it to the step function, and the `beam_search()` function calls the step function `gru_decoder_without_attention()` to perform the beam search and returns the result.
Collaborator:

'GeneratedInputV2() automatically generates the embedding of the last generated word and passes it to the step function' --> 'GeneratedInputV2() automatically fetches the word vectors of the $k$ words predicted with the highest probability by the model and passes them to the step function.'
- The text above consistently says 'word vector'; use that term.
- GeneratedInputV2() does not generate anything inside Paddle; it only fetches results, so the wording needs a slight change.

Collaborator Author:

done


The logic of these two parts is implemented in the following `if-else` conditional branches:
Collaborator:

'这两部分的逻辑' ('the logic of these two parts') --> '训练和生成的逻辑' ('the training and generation logic').

  • Unless it is necessary (two tightly connected sentences, to avoid redundancy), avoid anaphoric references; a referring expression that spans a large block of text risks being ambiguous.

Collaborator Author:

done

```

## Data Preparation
The data used in this tutorial comes from [WMT14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/), a French-English parallel corpus. [bitexts](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) is used as the training data, and [dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) as the validation and test data. PaddlePaddle already wraps a reading interface for this dataset; on the first run the program downloads it automatically, so users do not need to prepare the data manually.
Collaborator:

  1. '本教程' (this tutorial) --> '本例' (this example)
  2. '该数据集是法文到英文互译的平行语料数据' --> '该数据集是法文到英文互译的平行语料' (drop the redundant trailing '数据').

Collaborator Author:

done


## Model Training and Testing

After the network structure is defined, the model can be trained and tested. Depending on the command the user enters, the `main()` function calls `train()` or `generate()` to train or test the model respectively.
Collaborator:

Add a sentence explaining which variable to modify in order to switch between train and generate.

Collaborator Author:

done

* `prob` is the score of the generated sentence, followed by the translated sentence;
* `<s>` marks the beginning of a sentence and `<e>` marks the end of a sentence; words not contained in the dictionary are replaced with `<unk>`.

At this point we have implemented a basic machine translation model on PaddlePaddle. As we can see, PaddlePaddle provides a rich and flexible set of APIs to choose from and use, allowing us to configure all kinds of complex networks conveniently. Machine translation itself is a fast-developing field, with new methods and ideas constantly emerging. After finishing this tutorial, interested readers with time to spare can implement more complex, better-performing machine translation models on the PaddlePaddle platform.
Collaborator:

  • '供选择和使用' --> '供大家选择和使用' ('to choose from and use' --> 'for everyone to choose from and use'); the phrase is missing a subject.

Collaborator Author:

done

@kuke kuke force-pushed the seq2seq_demo_dev branch from 89894a5 to 725a051 Compare May 24, 2017 07:28
@lcy-seso (Collaborator) left a comment:

The last modification. LGTM.


## Background
Machine translation uses computers to convert expressions in a source language into equivalent expressions in a target language, and is an important research direction in natural language processing. Machine translation has broad application demands, and its implementation has kept evolving. Traditional machine translation methods are mainly based on rules or statistical models, requiring manually specified translation rules or hand-designed language features, and their effectiveness depends on how well humans understand the source and target languages. In recent years, the emergence and rapid development of deep learning has made automatic feature learning possible. Deep learning first succeeded in image recognition and speech recognition, and then set off a wave of research in natural language processing fields such as machine translation. Deep learning models for machine translation directly learn the mapping from the source language to the target language, greatly reducing human involvement in the learning process while significantly improving translation quality. This example introduces how to use a Recurrent Neural Network (RNN) in PaddlePaddle to build an end-to-end Neural Machine Translation (NMT) model.
Collaborator:

'...is an important research direction in natural language processing. Machine translation has broad application demands, and its implementation has kept evolving.' --> '...is an important research direction in natural language processing, has broad application demands, and its implementation has kept evolving.' (merge into one sentence and drop the repeated subject).

@kuke kuke force-pushed the seq2seq_demo_dev branch from 725a051 to 0233e87 Compare May 24, 2017 07:53
@lcy-seso (Collaborator) left a comment:

LGTM

@lcy-seso lcy-seso merged commit ea6378f into PaddlePaddle:develop May 24, 2017
@kuke kuke deleted the seq2seq_demo_dev branch May 24, 2017 10:19
HongyuLi2018 pushed a commit that referenced this pull request Apr 25, 2019

Successfully merging this pull request may close these issues.

Add example nmt_without_attention for seq2seq demo
2 participants