Add example for NMT without attention #22
Conversation
seq2seq/nmt_without_attention_v2.py
Outdated

import sys
import gzip
import sqlite3
This package is not used in this script; please remove it.
seq2seq/nmt_without_attention_v2.py
Outdated

# Embedding of the last generated word is automatically gotten by
# GeneratedInputs, which is initialized by a start mark, such as <s>,
# and must be included in generation.
Lines 111 to 118 are copied from explanations written for the user documentation; they can be deleted.
seq2seq/nmt_without_attention_v2.py
Outdated

input=src_word_id,
size=word_vector_dim,
param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
The `param_attr` argument only specifies the parameter name, and this parameter is not referred to by its name anywhere, so it should be removed.
seq2seq/nmt_without_attention_v2.py
Outdated

reverse=True)
encoded_vector = paddle.layer.concat(
    input=[encoder_forward, encoder_backward])
Please rewrite lines 29 to 42 by invoking https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/trainer_config_helpers/networks.py#L1096, and please help check the issue PaddlePaddle/Paddle#2001. Thanks very much.
Thanks for the reminder. I will rewrite this part with `bidirectional_gru` and clean up the other parts.
seq2seq/nmt_without_attention_v2.py
Outdated

# paddle.dataset.wmt14.test(source_dict_dim),
# buf_size=8192), batch_size=1)

#test_result = trainer.test(wmt14_test_batch)
Why are lines 164 to 169 commented out?
For some unknown reason, the test phase slowed the run down even with a very small batch size. I will enable these lines later if I can fix that.
seq2seq/nmt_without_attention_v2.py
Outdated

beam_gen = seq2seq_net(source_dict_dim, target_dict_dim, True)
# get the pretrained model, whose bleu = 26.92
# parameters = paddle.dataset.wmt14.model()
Please remove line 196; it is copied from the old demos and is not right for this script.
Please remove the unused code on line 197, and please keep the code clean.
seq2seq/nmt_without_attention_v2.py
Outdated

paddle.init(use_gpu=False, trainer_count=4)
source_language_dict_dim = 30000
target_language_dict_dim = 30000
generating = True
Please make `generating` a parameter and read it from the command line. It is currently hard-coded to `True`; if a user runs this example directly, that does not make sense, because we do not provide a trained model, so the user cannot generate without training first.
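The suggested change can be sketched with the standard-library `argparse` module. The `--generate` flag name is an assumption for illustration; the final script may name it differently.

```python
# Minimal sketch: read the train/generate switch from the command line
# instead of hard-coding `generating = True`.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="NMT without attention")
    parser.add_argument(
        "--generate",
        action="store_true",
        help="run beam-search generation instead of training")
    return parser.parse_args(argv)

args = parse_args([])               # no flag: defaults to training
gen_args = parse_args(["--generate"])  # flag given: generation mode
```

With this in place, `main()` can dispatch to `train()` or `generate()` based on `args.generate`.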
seq2seq/nmt_without_attention_v2.py
Outdated

def generate(source_dict_dim, target_dict_dim):
    # use the first 3 samples for generation
This code is copied directly from PaddleBook, where it restricts the number of samples to generate so that the demo finishes quickly. Here we expect to provide an example that users can run directly, so the constraint should be removed. After copying code, please delete inappropriate code and comments.
Request changes.
Changes done. Hi @lcy-seso, please have a review.
The configuration looks good to me. Please help to add the README.md.
seq2seq/nmt_without_attention_v2.py
Outdated

cost = seq2seq_net(source_dict_dim, target_dict_dim)
parameters = paddle.parameters.create(cost)

# define optimize method and trainer
optimize --> optimization.
cd2feb7 to 520c229
Doc done. Please review the README.md file. @lcy-seso
Please refine the doc.
seq2seq/basic_nmt/README.md
Outdated
@@ -0,0 +1,242 @@
# Neural Machine Translation Model

Machine translation uses computers to convert expressions in a source language into equivalent expressions in a target language, and is a very important research direction in natural language processing. Machine translation has broad application demand, and its implementation has kept evolving. Traditional machine translation methods are mainly based on rules or statistical models; they require manually specified translation rules or hand-designed linguistic features, and their quality depends on how well people understand the source and target languages. In recent years, the emergence and rapid development of deep learning have made automatic feature learning possible. Deep learning first succeeded in image recognition and speech recognition, and then set off a wave of research in natural language processing fields such as machine translation. Deep learning models for machine translation directly learn the mapping from the source language to the target language, greatly reducing human involvement in the learning process while significantly improving translation quality. This tutorial mainly introduces how to use a recurrent neural network (RNN) in Paddlepaddle to build an end-to-end (End-to-End) Neural Machine Translation model.
- Paddlepaddle --> PaddlePaddle
done
seq2seq/basic_nmt/README.md
Outdated

…

#
Delete line 5.
done
seq2seq/basic_nmt/README.md
Outdated

An RNN-based machine translation model commonly takes an encoder-decoder (Encoder-Decoder) structure, in which the encoder and the decoder are each a recurrent neural network. Unrolling the two RNNs that form the encoder and the decoder over time, we obtain the following model structure diagram

<p align="center"><img src="figures/Encoder-Decoder.png" width = "90%" align="center"/><br/>Figure 1. Encoder-decoder framework</p>
Please name the folder `images` uniformly, to stay consistent with the other examples.
done
seq2seq/basic_nmt/README.md
Outdated

<p align="center"><img src="figures/Encoder-Decoder.png" width = "90%" align="center"/><br/>Figure 1. Encoder-decoder framework</p>

- **Encoder**: encodes the sentence expressed in the source language into a vector, which serves as the decoder's input. The decoder's raw input is a sequence of character or word ids <img src="http://chart.googleapis.com/chart?cht=tx&chl= w=\langle w_1, w_2, ..., w_T\rangle" style="border:none;">,
- "encodes the sentence expressed in the source language into a vector" --> "encodes the source-language sentence into a vector"
- The formula image does not render for me here.
done
seq2seq/basic_nmt/README.md
Outdated
<p align="center"><img src="figures/Encoder-Decoder.png" width = "90%" align="center"/><br/>图 1. 编码器-解码器框架 </p> | ||
|
||
- **编码器**:将用源语言表达的句子编码成一个向量,作为解码器的输入。解码器的原始输入是表示字符或词的 id 序列<img src="http://chart.googleapis.com/chart?cht=tx&chl= w=\langle w_1, w_2, ..., w_T\rangle" style="border:none;">, | ||
用独热码(One-hot encoding)表示。为了对输入进行降维,同时建立词语之间的语义关联,紧接着我们将独热码表示转换为词嵌入(Word Embedding)表示。最后 RNN 单元逐字符或逐词处理输入,得到完整句子的编码向量。 |
- "we then …" (try to avoid "we"; refer readers to PaddleBook for details rather than expanding on them here, which also keeps the text precise) --> "The model learns a word embedding representation, i.e. the commonly used word vector, for each one-hot encoded word; for a detailed introduction to word vectors, please refer to the word-vector chapter of PaddleBook."
done
seq2seq/basic_nmt/README.md
Outdated
``` | ||
|
||
## 数据准备 | ||
本教程所用到的数据来自[WMT14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/),该数据集是法文到英文互译的平行语料数据。用[bitexts](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz)作为训练数据,[dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz)作为验证与测试数据。在Paddlepaddle中已经封装好了该数据集的读取接口,在首次运行的时候,程序会自动完成下载,用户无需手动完成相关的数据准备。 |
Paddlepaddle-->PaddlePaddle
done
seq2seq/basic_nmt/README.md
Outdated
* `prob`表示生成句子的得分,随之其后则是翻译生成的句子; | ||
* `<s>` 表示句子的开始,`<e>`表示一个句子的结束,如果出现了在词典中未包含的词,则用`<unk>`替代。 | ||
|
||
至此,我们在Paddlepaddle上实现了一个初步的机器翻译模型。我们可以看到,Paddlepaddle提供了灵活丰富的API供选择和使用,使得我们能够很方便完成各种复杂网络的配置。机器翻译本身也是个快速发展的领域,各种新方法新思想在不断涌现。在学习完本教程后,读者若有兴趣和余力,可基于Paddlepaddle平台实现更为复杂、性能更优的机器翻译模型。 |
Replace all occurrences of Paddlepaddle in the document with PaddlePaddle.
done
seq2seq/basic_nmt/README.md
Outdated
``` | ||
|
||
对应的英文翻译结果为, | ||
|
Is this probability right? How can the log prob be so small?
Modified. done
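For context on the log-prob question: the score of a generated sentence is the sum of per-word log probabilities, so it grows more negative with sentence length. The per-word probabilities below are made up for illustration.

```python
# Why a sentence-level log prob can look "very small": it is a sum of
# per-word log probabilities, each of which is negative.
import math

word_probs = [0.4, 0.1, 0.25, 0.05, 0.2]   # model prob of each generated word
log_prob = sum(math.log(p) for p in word_probs)
# Five words with modest per-word probabilities already give a log prob
# around -9.2; a 20-word sentence easily reaches -40 or lower.
```

So a very negative score is not by itself a bug, but it should scale plausibly with sentence length and per-word confidence.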
seq2seq/basic_nmt/README.md
Outdated
python nmt_without_attention_v2.py --generate | ||
``` | ||
则自动为测试数据生成了对应的翻译结果。 | ||
如果设置beam search的大小为3,输入法文句子 |
Say: set the beam size to 3.
done
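What "beam size 3" means can be sketched with a toy beam search. The tiny bigram "model" below is invented for illustration; it maps the previous word to candidate next words with probabilities.

```python
# Toy beam search with beam size 3: at each step, extend every surviving
# hypothesis with all candidate next words, then keep only the 3 most
# probable unfinished hypotheses (by accumulated log probability).
import math

NEXT = {
    "<s>": {"les": 0.6, "le": 0.3, "la": 0.1},
    "les": {"chats": 0.5, "<e>": 0.4, "chiens": 0.1},
    "le":  {"chat": 0.7, "<e>": 0.3},
    "la":  {"<e>": 1.0},
    "chats": {"<e>": 1.0},
    "chat": {"<e>": 1.0},
    "chiens": {"<e>": 1.0},
}

def beam_search(beam_size=3, max_len=5):
    beams = [(0.0, ["<s>"])]          # (log prob, words), start mark first
    finished = []
    for _ in range(max_len):
        candidates = []
        for lp, words in beams:
            for w, p in NEXT[words[-1]].items():
                hyp = (lp + math.log(p), words + [w])
                (finished if w == "<e>" else candidates).append(hyp)
        beams = sorted(candidates, reverse=True)[:beam_size]
        if not beams:
            break
    return sorted(finished, reverse=True)

best_lp, best_words = beam_search()[0]
```

The `prob` column in the example output corresponds to `best_lp` here: the accumulated log probability of the whole hypothesis.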
seq2seq/basic_nmt/README.md
Outdated
- **train():** 依次完成网络的构建、data reader的定义、trainer的建议和事件句柄的定义,最后训练网络; | ||
|
||
- **generate():**: 根据指定的模型路径初始化网络,导入测试数据并由`beam_search`完成翻译过程。 | ||
|
For lines 185 to 189, please explain the core logic instead of giving a "running account" like this; to an ordinary user, a running account is no different from not having read the code at all.
done
Improved the doc accordingly. Please review the changes @lcy-seso
seq2seq/basic_nmt/README.md
Outdated

The logic of these two parts is implemented in the following `if-else` conditional branches:

```python
done
seq2seq/basic_nmt/README.md
Outdated

The decoder behaves very differently in the model's training and testing stages:

- **Training stage:** the word embeddings of the target translation, `trg_embedding`, are passed as a parameter to the per-step logic `gru_decoder_without_attention()`; the function `recurrent_group()` calls the per-step logic in a loop, and finally the cost between the target and the actual results is computed and returned;
Done
Some small modifications of the README are needed.
seq2seq/basic_nmt/README.md
Outdated
# Neural Network Machine Translation Model
## Background

Machine translation uses computers to convert expressions in a source language into equivalent expressions in a target language, and is a very important research direction in natural language processing. Machine translation has broad application demand, and its implementation has kept evolving. Traditional machine translation methods are mainly based on rules or statistical models; they require manually specified translation rules or hand-designed linguistic features, and their quality depends on how well people understand the source and target languages. In recent years, the emergence and rapid development of deep learning have made automatic feature learning possible. Deep learning first succeeded in image recognition and speech recognition, and then set off a wave of research in natural language processing fields such as machine translation. Deep learning models for machine translation directly learn the mapping from the source language to the target language, greatly reducing human involvement in the learning process while significantly improving translation quality. This tutorial mainly introduces how to use a recurrent neural network (RNN) in PaddlePaddle to build an end-to-end (End-to-End) neural network machine translation (Neural Machine Translation) model.
A few small wording changes:
- "is a very important research direction in natural language processing" --> "is an important research direction in natural language processing"; let's drop "very".
- "its implementation has kept evolving": fix the particle (不断的演化 --> 不断地演化).
- "Traditional machine translation methods": drop the redundant particle (传统的机器翻译方法 --> 传统机器翻译方法).
- "the emergence and rapid development of deep learning have made automatic feature learning possible": delete the redundant 了 (成为了可能 --> 成为可能).
- "this tutorial" --> "this example", to stay consistent with the other examples.
- "This tutorial mainly introduces how to use a recurrent neural network (RNN) in PaddlePaddle" --> "This example introduces how to build an end-to-end (End-to-End) neural network machine translation (Neural Machine Translation, NMT) model in PaddlePaddle using a recurrent neural network (Recurrent Neural Network, RNN)."
done
seq2seq/basic_nmt/README.md
Outdated
…

## Model Overview

Please delete the extra line at line 8.
done
seq2seq/basic_nmt/README.md
Outdated
## Model Overview


An RNN-based machine translation model commonly takes an encoder-decoder (Encoder-Decoder) structure, in which the encoder and the decoder are each a recurrent neural network. If the two RNNs that form the encoder and the decoder are unrolled over time, we obtain the following model structure diagram
- "An RNN-based machine translation model commonly takes an encoder-decoder structure" --> "An RNN-based neural network machine translation model follows the encoder-decoder structure."
- "If the two RNNs that form the encoder and the decoder are unrolled over time, we obtain the following model structure diagram" --> "Unrolling the two RNNs that form the encoder and the decoder along the time steps yields the following model structure diagram:"
done
seq2seq/basic_nmt/README.md
Outdated

<p align="center"><img src="images/Encoder-Decoder.png" width = "90%" align="center"/><br/>Figure 1. Encoder-decoder framework</p>

The basic input and output units of this translation model can be characters, words, or phrases. Without loss of generality, the following takes a word-based model as an example to explain how the encoder/decoder works:
- "The basic input and output units of this translation model can be characters" --> "The input and output of a neural machine translation model can be characters,"
- "Without loss of generality, the following" --> "Without loss of generality, this example"
done
seq2seq/basic_nmt/README.md
Outdated

The basic input and output units of this translation model can be characters, words, or phrases. Without loss of generality, the following takes a word-based model as an example to explain how the encoder/decoder works:

- **Encoder**: encodes the source-language sentence into a vector that serves as the decoder's input. The decoder's raw input is a sequence of word `id`s $w = {w_1, w_2, ..., w_T}$, represented with one-hot encoding (One-hot encoding). To reduce the dimensionality of the input and to build semantic associations between words, the model learns a word embedding (Word Embedding) representation, i.e. the commonly used word vector, for each one-hot encoded word; for a detailed introduction to word vectors, please refer to the [word vector](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec) chapter of PaddleBook. Finally, the RNN unit processes the input word by word to obtain the encoding vector of the complete sentence.
- "represented with one-hot encoding (One-hot encoding)" --> "represented with a one-hot (One-hot) code"
- Change the word-vector link to the Chinese markdown version; the BOOK currently defaults to the English version, while models is a Chinese tutorial, so for now models should reference Chinese content throughout.
done
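The one-hot-to-embedding step described in the quoted paragraph can be illustrated in miniature: multiplying a one-hot vector by the embedding matrix simply selects a row, which is why an embedding layer is implemented as a table lookup. The vocabulary and weights below are made up.

```python
# One-hot encoding vs. word embedding: the matrix product with a
# one-hot vector equals a plain row lookup in the embedding table.
import numpy as np

vocab = {"<s>": 0, "les": 1, "chats": 2, "<e>": 3}
emb_dim = 3
rng = np.random.default_rng(0)
embedding = rng.standard_normal((len(vocab), emb_dim))

word_id = vocab["chats"]
one_hot = np.zeros(len(vocab))
one_hot[word_id] = 1.0

via_matmul = one_hot @ embedding   # one-hot times the matrix...
via_lookup = embedding[word_id]    # ...selects exactly one row
```

This also shows why the embedding reduces dimensionality: a vocabulary-sized one-hot vector (here length 4, in practice tens of thousands) becomes a dense vector of length `emb_dim`.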
seq2seq/basic_nmt/README.md
Outdated
在模型训练和测试阶段,解码器的行为有很大的不同: | ||
|
||
- **训练阶段**:目标翻译结果的词向量`trg_embedding`作为参数传递给单步逻辑`gru_decoder_without_attention()`,函数`recurrent_group()`循环调用单步逻辑执行,最后计算目标翻译与实际解码的差异cost并返回; | ||
- **测试阶段**:解码器根据最后一个生成的词预测下一个词,`GeneratedInputV2()`自动生成最后一个词的词嵌入并传递给单步逻辑,`beam_search()`函数调用单步逻辑函数`gru_decoder_without_attention()`完成柱搜索并作为结果返回。 |
- "`GeneratedInputV2()` automatically produces the word embedding of the last word and passes it to the per-step logic" --> "`GeneratedInputV2()` automatically fetches the word vectors of the $k$ most probable words predicted by the model and passes them to the per-step logic."
- Use the term "word vector" consistently with the text above.
- `GeneratedInputV2()` does not generate anything in Paddle; it only fetches results, so the wording needs a slight change.
done
seq2seq/basic_nmt/README.md
Outdated
- **训练阶段**:目标翻译结果的词向量`trg_embedding`作为参数传递给单步逻辑`gru_decoder_without_attention()`,函数`recurrent_group()`循环调用单步逻辑执行,最后计算目标翻译与实际解码的差异cost并返回; | ||
- **测试阶段**:解码器根据最后一个生成的词预测下一个词,`GeneratedInputV2()`自动生成最后一个词的词嵌入并传递给单步逻辑,`beam_search()`函数调用单步逻辑函数`gru_decoder_without_attention()`完成柱搜索并作为结果返回。 | ||
|
||
这两部分的逻辑分别实现在如下的`if-else`条件分支中: |
- "The logic of these two parts" --> "The training and generation logic".
- Unless two sentences are tightly connected (to avoid redundancy), avoid pronoun references; a reference that spans a large block of text risks being unclear.
done
seq2seq/basic_nmt/README.md
Outdated
``` | ||
|
||
## 数据准备 | ||
本教程所用到的数据来自[WMT14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/),该数据集是法文到英文互译的平行语料数据。用[bitexts](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz)作为训练数据,[dev+test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz)作为验证与测试数据。在PaddlePaddle中已经封装好了该数据集的读取接口,在首次运行的时候,程序会自动完成下载,用户无需手动完成相关的数据准备。 |
- "this tutorial" --> "this example"
- "the dataset is French-English parallel translation corpus data" --> "the dataset is a French-English parallel corpus."
done
seq2seq/basic_nmt/README.md
Outdated

## Model Training and Testing

After the network structure is defined, model training and testing can proceed. Depending on the command the user enters, model training and testing are completed by `main()` calling `train()` and `generate()` respectively.
Add a sentence about which variable to modify to switch between train and generate.
done
seq2seq/basic_nmt/README.md
Outdated
* `prob`表示生成句子的得分,随之其后则是翻译生成的句子; | ||
* `<s>` 表示句子的开始,`<e>`表示一个句子的结束,如果出现了在词典中未包含的词,则用`<unk>`替代。 | ||
|
||
至此,我们在PaddlePaddle上实现了一个初步的机器翻译模型。我们可以看到,PaddlePaddle提供了灵活丰富的API供选择和使用,使得我们能够很方便完成各种复杂网络的配置。机器翻译本身也是个快速发展的领域,各种新方法新思想在不断涌现。在学习完本教程后,读者若有兴趣和余力,可基于PaddlePaddle平台实现更为复杂、性能更优的机器翻译模型。 |
- "APIs to choose from and use" --> "APIs for everyone to choose from and use"; the sentence lacks a subject.
done
One last modification, then LGTM.
seq2seq/basic_nmt/README.md
Outdated

…
## Background
Machine translation uses computers to convert expressions in a source language into equivalent expressions in a target language; it is an important research direction in natural language processing. Machine translation has broad application demand, and its implementation has kept evolving. Traditional machine translation methods are mainly based on rules or statistical models; they require manually specified translation rules or hand-designed linguistic features, and their quality depends on how well people understand the source and target languages. In recent years, the emergence and rapid development of deep learning have made automatic feature learning possible. Deep learning first succeeded in image recognition and speech recognition, and then set off a wave of research in natural language processing fields such as machine translation. Deep learning models for machine translation directly learn the mapping from the source language to the target language, greatly reducing human involvement in the learning process while significantly improving translation quality. This example introduces how to build an end-to-end (End-to-End) neural network machine translation (Neural Machine Translation, NMT) model in PaddlePaddle using a recurrent neural network (Recurrent Neural Network, RNN).
"Machine translation uses computers to convert expressions in a source language into equivalent expressions in a target language; it is an important research direction in natural language processing. Machine translation has broad application demand, and its implementation has kept evolving." --> "Machine translation uses computers to convert the source language into equivalent expressions in a target language; it is an important research direction in natural language processing, has broad application demand, and its implementation has kept evolving."
LGTM
resolve #21