
add ner model #30

Merged: 8 commits merged into PaddlePaddle:develop on May 24, 2017
Conversation

guoshengCS (Collaborator):

fix #13

guoshengCS requested a review from lcy-seso on May 8, 2017 at 08:18
luotao1 (Contributor) left a comment:

The dataset and model files are quite large and should not be committed to the repository; please clean them up. This commit should keep only two files, conll03.py and ner_final.py. For the dataset, either provide a download link or integrate it into paddle.dataset.

lcy-seso (Collaborator) left a comment:

Requesting changes.

predict = paddle.layer.crf_decoding(
    size=label_dict_len,
    input=output,
    param_attr=paddle.attr.Param(name='crfw'))
Collaborator:

The original configuration file contains a chunk evaluator.

Collaborator (author):

To be done.
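For context, a minimal sketch of attaching a chunk evaluator in the v2 API, assuming `paddle.evaluator.chunk` and reusing the argument names that appear later in this thread (`feature_out`, `target`, `label_dict_len`):

```python
# Decode the best tag sequence and compare it against the gold labels,
# so the evaluator can compute chunk-level Precision/Recall/F1.
crf_dec = paddle.layer.crf_decoding(
    size=label_dict_len,
    input=feature_out,
    label=target,
    param_attr=paddle.attr.Param(name='crfw'))

# With the IOB scheme there are two tag types (B and I) plus one O label,
# hence (label_dict_len - 1) / 2 chunk (entity) types.
paddle.evaluator.chunk(
    input=crf_dec,
    label=target,
    chunk_scheme='IOB',
    num_chunk_types=(label_dict_len - 1) / 2)
```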

hidden_para_attr = paddle.attr.Param(
    initial_std=default_std, learning_rate=mix_hidden_lr)

lstm_1_1 = paddle.layer.lstmemory(
Collaborator:

The original configuration file uses a simple RNN. Considering that we do not yet provide any example of how to use the simple recurrent layer, I suggest not modifying the original configuration.

Collaborator (author):

Done.


def ner_net():
    word = paddle.layer.data(name='word', type=d_type(word_dict_len))
    #ws = paddle.layer.data(name='ws', type=d_type(num_ws))
Collaborator:

If the comment has no special purpose, please remove it.

Collaborator (author):

Done.

size=word_dim,
input=paddle.layer.table_projection(input=word, param_attr=emb_para))
#ws_embedding = paddle.layer.mixed(name='ws_embedding', size=caps_dim,
# input=paddle.layer.table_projection(input=ws))
Collaborator:

If the comment has no special purpose, please remove it.

Collaborator (author):

Done.

lcy-seso (Collaborator) left a comment:

A bug occurs.


def reader():
    for sentence, labels in corpus_reader():
        #word_idx = [word_dict.get(w, UNK_IDX) for w in sentence]
Collaborator:

Please remove useless comments.

Collaborator (author):

Done.

gate_act=paddle.activation.Sigmoid(),
state_act=paddle.activation.Sigmoid(),
bias_attr=std_0,
param_attr=lstm_para_attr)
Collaborator:

Please rewrite lines 51 to 109 using a for loop.

Collaborator (author):

To be done. I have not found a good way to rewrite this yet, so I am following the original configuration for now.

Collaborator:

What is the problem?

Collaborator (author):

I am not sure which structure should be factored out with a for loop. For example, the SRL demo uses a for loop to factor out an "fc + lstm" block to build a stacked bidirectional LSTM. Could you give some guidance? Thanks for the review.
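For reference, a minimal sketch of the SRL-style pattern being discussed, assuming the PaddlePaddle v2 API used elsewhere in this PR; `word_embedding`, `hidden_dim`, `depth`, and the parameter attributes are placeholders, not the exact configuration under review:

```python
# Build a stacked bidirectional LSTM by repeating an "fc + lstm" block
# in a for loop, as in the SRL example.
hidden_0 = paddle.layer.fc(
    input=word_embedding, size=hidden_dim, act=paddle.activation.Tanh())
lstm_0 = paddle.layer.lstmemory(input=hidden_0)

input_tmp = [hidden_0, lstm_0]
for i in range(1, depth):
    fc = paddle.layer.fc(
        input=input_tmp, size=hidden_dim, act=paddle.activation.Tanh())
    lstm = paddle.layer.lstmemory(
        input=fc,
        # Reverse the direction of every other LSTM layer so that the
        # stack as a whole is bidirectional.
        reverse=(i % 2 == 1))
    input_tmp = [fc, lstm]

feature_out = paddle.layer.fc(
    input=input_tmp, size=label_dict_len, act=paddle.activation.Linear())
```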

if len(test_data) == 10:
    break

feature_out, target, crf_cost, predict = ner_net()
Collaborator:

crf_cost cannot be used in inference because the label is unknown.
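A minimal sketch of the distinction the reviewer is pointing at, assuming the v2 API used in this PR: the CRF cost needs the ground-truth label and is only usable for training, while inference decodes with `crf_decoding` and passes its output to `paddle.infer`. `feature_out`, `parameters`, and `test_data` are placeholders here:

```python
# Training: the CRF cost layer requires the ground-truth label.
crf_cost = paddle.layer.crf(
    size=label_dict_len,
    input=feature_out,
    label=target,
    param_attr=paddle.attr.Param(name='crfw'))

# Inference: decode the best tag sequence without any label input,
# and feed `predict` (not `crf_cost`) to paddle.infer.
predict = paddle.layer.crf_decoding(
    size=label_dict_len,
    input=feature_out,
    param_attr=paddle.attr.Param(name='crfw'))

lab_ids = paddle.infer(
    output_layer=predict,
    parameters=parameters,
    input=test_data,
    field='id')
```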

Collaborator (author):

Done.

lcy-seso self-assigned this on May 9, 2017

guoshengCS (Collaborator, author) commented on May 10, 2017:

@lcy-seso I have a question about using the chunk evaluator. Below are partial screenshots from the documentation. When using the chunk evaluator, do the label ids have to satisfy these conditions: tagType = label % numTagType and chunkType = label / numTagType?
[screenshot from the chunk evaluator documentation]
Also, when configuring it, should numChunkTypes be set as num_chunk_types = (number_classes - 1) / len(schema)?
[screenshot from the chunk evaluator documentation]
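For concreteness, a small worked example of the label-id layout this question refers to, assuming the CoNLL-2003 setup discussed in this PR (four entity types, IOB scheme, so label_dict_len = 9); the exact id assignment is illustrative:

```python
# With the IOB scheme, len(schema) == 2 (B and I tags), plus a single O label.
# Ids are laid out so that tag_type = label % num_tag_types and
# chunk_type = label // num_tag_types, with O as the last id.
label_dict = {
    'B-PER': 0, 'I-PER': 1,
    'B-LOC': 2, 'I-LOC': 3,
    'B-ORG': 4, 'I-ORG': 5,
    'B-MISC': 6, 'I-MISC': 7,
    'O': 8,
}
label_dict_len = len(label_dict)              # 9
num_chunk_types = (label_dict_len - 1) // 2   # 4 entity types
```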

lcy-seso (Collaborator) left a comment:

The code and the doc need some modifications.

Named Entity Recognition (NER), also known as "proper name recognition", is the task of identifying entities with specific meanings in text, mainly person names, place names, organization names, and other proper nouns; it is a fundamental problem in natural language processing research. An NER task usually has two parts, detecting entity boundaries and determining entity categories. It can be cast as a sequence labeling problem, from whose results the entity boundaries and categories can be read off directly.
## Data description
In this example we use the dataset released for the CoNLL 2003 NER task. For copyright reasons we do not provide a download of this dataset for now; it can be obtained for free by following the instructions on [this page](http://www.clips.uantwerpen.be/conll2003/ner/). The training and test data in this dataset are formatted as follows:
<img src="image/data_format.png" width = "60%" align=center /><br>
Collaborator:

Please turn this into a table, or redraw the figure.

Collaborator (author):

Done.

<div align="center">
<img src="image/ner_network.png" width = "60%" align=center /><br>
Figure 1. Network architecture of the NER model
</div>
Collaborator:

Please make the figure smaller; it is too large.

Collaborator (author):

Done.

bias_attr=std_0,
param_attr=lstm_para_attr)
param_attr=rnn_para_attr)
Collaborator:

Please rewrite lines 62 to 107 using a for loop.

Collaborator (author):

I am not sure which structure should be factored out with a for loop; could you explain more specifically? For example, the SRL demo uses a for loop to factor out an "fc + lstm" block to build a stacked bidirectional LSTM.

ner_net_train()
ner_net_infer()
parameters = ner_net_train(train_data_reader, 1)
ner_net_infer(test_data_reader, parameters)
Collaborator:

  • As in the other examples, save the trained model locally; currently the model is not being saved.
  • At inference time, load the saved model and run prediction with it.

Collaborator (author):

Done.

guoshengCS (Collaborator, author):

Thanks for the review. The item about rewriting with a for loop has not been addressed yet; I still need more specific guidance on which structure should be factored out with a for loop. For example, the SRL demo uses a for loop to factor out an "fc + lstm" block to build a stacked bidirectional LSTM.

guoshengCS (Collaborator, author):

Also, if the data is moved out of the repository as @luotao1 suggested, could you provide a place to host it, similar to the one used for SRL, e.g. http://paddlepaddle.bj.bcebos.com/demo/srl_dict_and_embedding/wordDict.txt?

lcy-seso (Collaborator) left a comment:

The README doc needs modifications.

# Named Entity Recognition

## Background

Collaborator:

The background section needs to be improved. The goal of this example is to show NER as one instance of sequence labeling, and that sequence labeling can be used for other tasks as well.

  1. Briefly explain what sequence labeling is.
  2. Explain the idea of solving sequence labeling with neural networks.
  3. Explain what else sequence labeling can do, and what other tasks the configuration in this example can be used for.
  4. For further background on sequence labeling, point readers directly to the semantic role labeling chapter of the PaddleBook.

Collaborator (author):

Done. The background section has been rewritten and the corresponding content added.

. . O O
```

The first column is the original sentence (the second and third columns are the part-of-speech tags and the chunk tags from syntactic analysis, which are not used here), and the fourth column is the NER label in the I-TYPE scheme. The main difference between I-TYPE and the [BIO scheme](https://github.com/PaddlePaddle/book/tree/develop/07.label_semantic_roles) lies in how the beginning of a chunk is marked: I-TYPE uses the B tag only when two adjacent entities have the same type, marking the second one with B, and uses I tags everywhere else. We will use the label set in the BIO scheme here; the conversion between the two schemes is done in the `conll03.py` file we provide. In addition, we provide three files, the word dictionary, the label dictionary, and pretrained word vectors (the word dictionary and word vectors come from the [Stanford cs224d](http://cs224d.stanford.edu/) course assignments), for convenience.
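For illustration, a minimal sketch of the kind of I-TYPE-to-BIO conversion described above; the real logic lives in `conll03.py`, and the helper name here is hypothetical:

```python
def itype_to_bio(tags):
    """Convert one sentence's I-TYPE NER tags to BIO tags.

    In I-TYPE, B- is used only when a chunk directly follows a chunk of the
    same type; in BIO, every chunk starts with B-.
    """
    bio = []
    prev = 'O'
    for tag in tags:
        if tag == 'O':
            bio.append('O')
        elif tag.startswith('B-') or prev == 'O' or prev[2:] != tag[2:]:
            # First token of a chunk: force a B- prefix.
            bio.append('B-' + tag[2:])
        else:
            bio.append('I-' + tag[2:])
        prev = tag
    return bio

# Example with two adjacent LOC entities:
print(itype_to_bio(['I-LOC', 'I-LOC', 'B-LOC', 'O', 'I-PER']))
# ['B-LOC', 'I-LOC', 'B-LOC', 'O', 'B-PER']
```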
Collaborator:

  1. Please explain here what processing conll03.py performs.
  2. Explain how to run conll03.py.
  3. The data processing description is quite unclear. For someone who actually wants to run this example:
    • What does the raw data look like after downloading?
    • Where should the data be placed after downloading?
    • Which script should be run first for preprocessing? Does it need any arguments? Do any paths need to be set?
    • What does the data look like after preprocessing?
    • Which variables should be modified for the preprocessed data, and to what values? After modifying them, which script should be run to start training?
  4. Please organize the documentation step by step.
  5. Please refer to the custom data section of the text classification example to explain how to switch to one's own data for training.

Collaborator (author):

Done. The data description section has been rewritten with a more detailed explanation. A standalone preprocessing script is not provided yet; instead, the preprocessing is built into the functions that return the data generators. A section on using custom data has also been added to the usage instructions.
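For illustration, a minimal sketch of such a generator-based reader; `corpus_reader`, `word_dict`, `UNK_IDX`, and `label_dict` are the names used elsewhere in this thread, and the capitalization mark follows the preprocessing described later in the conversation:

```python
def reader():
    # Preprocessing happens inline: each raw (sentence, labels) pair is
    # converted to ids on the fly, so no separate preprocessing script is needed.
    for sentence, labels in corpus_reader():
        word_idx = [word_dict.get(w.lower(), UNK_IDX) for w in sentence]
        # 1 if the original word was capitalized, used as an extra feature.
        mark = [1 if w != w.lower() else 0 for w in sentence]
        label_idx = [label_dict[l] for l in labels]
        yield word_idx, mark, label_idx
```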


## Model description

The model structure used in this example is shown in Figure 1. Its input is a sentence; after the words are looked up and converted into a sequence of word vectors, several groups of fully connected layers and bidirectional RNNs extract features, and finally a CRF takes the learned features as input and the tag sequence as the supervision signal to complete the sequence labeling. More about RNNs and their variants can be found on [this page](http://book.paddlepaddle.org/06.understand_sentiment/).
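For illustration, a minimal sketch of this pipeline in the PaddlePaddle v2 API used throughout this PR; the layer sizes, depth, and variable names are assumptions, not the exact configuration under review:

```python
# 1. One-hot word ids as input.
word = paddle.layer.data(
    name='word', type=paddle.data_type.integer_value_sequence(word_dict_len))
target = paddle.layer.data(
    name='target', type=paddle.data_type.integer_value_sequence(label_dict_len))

# 2. Word ids -> word embeddings.
word_embedding = paddle.layer.embedding(input=word, size=word_dim)

# 3. Fully connected layer + bidirectional LSTM to learn sentence features.
hidden = paddle.layer.fc(
    input=word_embedding, size=hidden_dim, act=paddle.activation.Tanh())
lstm_fwd = paddle.layer.lstmemory(input=hidden)
lstm_bwd = paddle.layer.lstmemory(input=hidden, reverse=True)
feature_out = paddle.layer.fc(
    input=[hidden, lstm_fwd, lstm_bwd],
    size=label_dict_len,
    act=paddle.activation.Linear())

# 4. CRF for sequence labeling: crf for training, crf_decoding for prediction.
crf_cost = paddle.layer.crf(
    size=label_dict_len, input=feature_out, label=target,
    param_attr=paddle.attr.Param(name='crfw'))
predict = paddle.layer.crf_decoding(
    size=label_dict_len, input=feature_out,
    param_attr=paddle.attr.Param(name='crfw'))
```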
Collaborator:

Break the model pipeline into numbered steps, for example:
  1. The input features are one-hot representations ...
  2. Convert them into word vectors ...
  3. An RNN learns sentence features.
  4. A CRF completes the sequence labeling.
The description does not need to be long, but the logic should be clear.

Collaborator (author):

Done.


### Data settings

To run `ner.py`, the data settings section needs to be changed: simply set the variables in the following code to the correct file paths.
Collaborator:

Where should this be modified, and how? Please write the instructions as a hands-on, step-by-step walkthrough that requires no guesswork from the user.

Collaborator (author):

Done.

To run `ner.py`, the data settings section needs to be changed: simply set the variables in the following code to the correct file paths.

```python
# init dataset
Collaborator:

Do not use abbreviations in the comment here.

Collaborator (author):

Done.


`ner.py` provides the following two interfaces for model training and prediction, respectively:

1. The `ner_net_train(data_reader, num_passes)` function implements model training. The parameter `data_reader` is the iterator over the training data (the default value can be used), and `num_passes` is the number of training passes. During training, model training information is printed every 100 iterations, the model is saved after each pass, and the final model is saved as `ner_net.tar.gz`.
Collaborator:

Save the model by pass id; the documentation here needs to be updated accordingly.

Collaborator (author):

Done. The model is now saved by pass id, and the model from the last pass is additionally saved as ner_model.tar.gz.
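A minimal sketch of pass-id-based saving with a v2 event handler, as described above; `parameters` is assumed to be the paddle.parameters.Parameters object created for the trainer, and the file-name convention follows the thread:

```python
import gzip

def event_handler(event):
    # Print training information every 100 iterations.
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            print("Pass %d, Batch %d, Cost %f" %
                  (event.pass_id, event.batch_id, event.cost))
    # Save the parameters after each pass, named by pass id.
    if isinstance(event, paddle.event.EndPass):
        with gzip.open('params_pass_%d.tar.gz' % event.pass_id, 'w') as f:
            parameters.to_tar(f)
```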


1. The `ner_net_train(data_reader, num_passes)` function implements model training. The parameter `data_reader` is the iterator over the training data (the default value can be used), and `num_passes` is the number of training passes. During training, model training information is printed every 100 iterations, the model is saved after each pass, and the final model is saved as `ner_net.tar.gz`.

2. The `ner_net_infer(data_reader, model_file)` function implements prediction. The parameter `data_reader` is the iterator over the test data (the default value can be used), and `model_file` is the model file saved locally. The prediction results are printed in the following format:
Collaborator:

What does the user need to do to run prediction? This is not described clearly here.

  1. How is it executed? (For instance, if it is simply a matter of running python A.py, then say so plainly; if some variable values need to be modified, please explain that as well.)
  2. Does the data need to be replaced?

Collaborator (author):

Done.

for O
Baghdad B-LOC
. O
```
Collaborator:

Please remove the indentation before this text block.

Collaborator (author):

Done.

guoshengCS (Collaborator, author):

@lcy-seso Thanks for the review. The README has been revised substantially based on the previous review comments. In addition, to make the example more convincing, the model input has been modified following the paper Natural Language Processing (Almost) from Scratch: words are lowercased and a capitalization mark is added as an extra feature.

lcy-seso (Collaborator) left a comment:

Good writing, almost LGTM.


<div align="center">
<img src="image/ner_label_ins.png" width = "80%" align=center /><br>
Figure 1. An example of NER labeling
Collaborator:

Please change the figure title: "An example of NER labeling" --> "An example of the BIO labeling scheme".

Collaborator (author):

Done.


The entity boundaries and categories can be read off directly from the sequence labeling results. Similarly, tasks such as word segmentation, part-of-speech tagging, chunking, and [semantic role labeling](http://book.paddlepaddle.org/07.label_semantic_roles/) can also be cast as sequence labeling problems.

Because sequence labeling problems are so widespread, classical sequence models such as the [CRF](http://book.paddlepaddle.org/07.label_semantic_roles/) were developed; most of these models can use only local information or require hand-crafted features. In the deep learning era, various network structures can perform complex feature extraction; the Recurrent Neural Network (RNN, see [this page](http://book.paddlepaddle.org/07.label_semantic_roles/) for more background) can model the dependencies between elements of the input sequence and is therefore better suited to sequence data. The usual approach to solving a problem with neural networks is that the earlier layers learn a feature representation of the input while the last layer completes the final task on top of those features; for sequence labeling, the common practice is to learn features with an RNN-based network and feed the learned features into a CRF for labeling. This effectively replaces the linear model in the traditional CRF with a nonlinear neural network; the reason for keeping the CRF is that it uses sentence-level likelihood and can better address the label bias problem [[2](#参考文献)]. The model in this example is built along these lines. In addition, although the task here is NER, the model given can also be applied to other sequence labeling tasks.
Collaborator:

  • "More background can be found on [this page]" --> change "this page" to "the semantic role labeling chapter of the PaddleBook".
  • Change the semantic role labeling link to the Chinese README; the documentation in the models repo is currently all in Chinese.

Collaborator (author):

Done.


## Model description

In an NER task the input is a sentence, and the goal is to identify the entity boundaries and categories in it. Here we use only the original sentence as the feature (following paper \[[2](#参考文献)\], some preprocessing is applied: each word is converted to lowercase, and whether the original word was capitalized is kept as an additional feature). Following the approach to sequence labeling described above, a model with the following structure can be built (Figure 2 shows the model architecture):
Collaborator:

"我们这里仅使用原始句子作为特征" --> "我们这里仅对原始句子作为特征" (a wording tweak to the sentence about using only the original sentence as the feature)

Collaborator (author):

Done.


### Running the program

This example also provides the complete running workflow in `ner.py`, including the use of the data interfaces and model training and prediction. Following the interface usage described above, the variables in the following data settings section of `ner.py` need to be modified to the correct file paths:
Collaborator:

  1. "本示例" --> "本例" (use the shorter form of "this example")
  2. "修改为正确的文件路径" --> "修改为对应文件路径" ("modify to the correct file paths" --> "modify to the corresponding file paths")

Collaborator (author):

Done.

# Set the following variables to the corresponding file paths
train_data_file = 'data/train'       # training data file
test_data_file = 'data/test'         # test data file
vocab_file = 'data/vocab.txt'        # word_dict file
Collaborator:

In the comment, please describe the word_dict variable in plain words, e.g. "the path of the dictionary file for the input sentences"; otherwise the reader still has to go and work out what this variable means.

Collaborator (author):

Done.


```python
# Set the following variables to the corresponding file paths
train_data_file = 'data/train'       # training data file
Collaborator:

"training data file" --> "the path to the training data file"

Collaborator (author):

Done.

```python
# Set the following variables to the corresponding file paths
train_data_file = 'data/train'       # training data file
test_data_file = 'data/test'         # test data file
Collaborator:

"test data file" --> "the path to the test data file"

Collaborator (author):

Done.

train_data_file = 'data/train'       # training data file
test_data_file = 'data/test'         # test data file
vocab_file = 'data/vocab.txt'        # word_dict file
target_file = 'data/target.txt'      # label_dict file
Collaborator:

Please explain label_dict in plain words in the comment as well.

Collaborator (author):

Done.

test_data_file = 'data/test'         # test data file
vocab_file = 'data/vocab.txt'        # word_dict file
target_file = 'data/target.txt'      # label_dict file
emb_file = 'data/wordVectors.txt'    # word vector file
Collaborator:

"word vector file" --> "the path to the pretrained word vector parameters"

Collaborator (author):

Done.

# model training
ner_net_train(data_reader=train_data_reader, num_passes=1)
# prediction
ner_net_infer(data_reader=test_data_reader, model_file='ner_model.tar.gz')
Collaborator:

Please change both the saved and the loaded model file names here to be based on the pass_id.
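A minimal sketch of loading a model saved under a pass-id-based name, assuming the v2 `paddle.parameters.Parameters.from_tar` API; the file name here is illustrative:

```python
import gzip

# Load the parameters saved after a given pass (file named by pass id).
with gzip.open('params_pass_0.tar.gz', 'r') as f:
    parameters = paddle.parameters.Parameters.from_tar(f)
```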

Collaborator (author):

Done.

lcy-seso (Collaborator) left a comment:

Some small modifications.


## Model description

In an NER task the input is a sentence, and the goal is to identify the entity boundaries and categories in it. Here we use only the original sentence as the feature (following paper \[[2](#参考文献)\], some preprocessing is applied: each word is converted to lowercase, and whether the original word was capitalized is kept as an additional feature). Following the approach to sequence labeling described above, a model with the following structure can be built (Figure 2 shows the model architecture):
In an NER task the input is a sentence, and the goal is to identify the entity boundaries and categories in it. Here we only treat the original sentence as the feature (following paper \[[2](#参考文献)\], some preprocessing is applied: each word is converted to lowercase, and whether the original word was capitalized is kept as an additional feature). Following the approach to sequence labeling described above, a model with the following structure can be built (Figure 2 shows the model architecture):
Collaborator:

  • Sorry, I did not read carefully in the last review and my suggestion broke the first sentence.
  • "Here we only treat the original sentence as the feature (following paper [2], some preprocessing is applied: each word is converted to lowercase and whether the original word was capitalized is kept as an additional feature)" --> "Following paper [2], we only apply some preprocessing to the original sentence: each word is converted to lowercase, and whether the original word was capitalized is added as an extra feature; together these form the model input."

Collaborator (author):

Done.

@@ -43,7 +43,7 @@
| eng.testa | validation data, which can be used for hyperparameter tuning |
| eng.testb | evaluation data, used for the final evaluation |

The data format of these three files is as follows:
(To keep this example self-contained, we extract a few samples from them into the `data/train` and `data/test` files for use as training and test examples; for copyright reasons, please obtain the complete data yourself.) The data format of these three files is as follows:
Collaborator:

"(To keep this example self-contained, we extract a few samples from them into the data/train and data/test files for use as training and test examples; for copyright reasons, please obtain the complete data yourself.)" --> "To keep this example self-contained, we extract a few samples from the original data into the data/train and data/test files for use as examples; for copyright reasons, please obtain the complete data yourself."

Collaborator (author):

Done.

@@ -85,7 +85,7 @@
| baghdad | 1 | B-LOC |
| . | 0 | O |

In addition, we provide three files, the word dictionary, the label dictionary, and the pretrained word vectors (the word dictionary and word vectors come from the [Stanford cs224d](http://cs224d.stanford.edu/) course assignments), for convenience.
In addition, we provide three files, the word dictionary, the label dictionary, and the pretrained word vectors (the word dictionary and word vectors come from the [Stanford CS224d](http://cs224d.stanford.edu/) course assignments), for convenience.
Collaborator:

Would it be possible to add a download shell script under the data folder for these three files, instead of putting them directly into the repository?

Collaborator (author):

Done.

@@ -120,7 +120,7 @@ test_data_reader = conll03.test(test_data_file, vocab_file, target_file)

`ner.py` provides the following two interfaces for model training and prediction, respectively:

1. The `ner_net_train(data_reader, num_passes)` function implements model training. The parameter `data_reader` is the iterator over the training data and `num_passes` is the number of training passes. During training, model training information is printed every 100 iterations; after each pass the model is saved as a `params_pass_***.tar.gz` file (`***` is the pass id), and the final model is additionally saved as `ner_model.tar.gz`.
1. The `ner_net_train(data_reader, num_passes)` function implements model training. The parameter `data_reader` is the iterator over the training data and `num_passes` is the number of training passes. During training, model training information is printed every 100 iterations (since a chunk evaluator has been added, the Precision, Recall, and F1 of the current model computed per chunk are also printed; see the [documentation](http://www.paddlepaddle.org/develop/doc/api/v2/config/evaluators.html#chunk) for its detailed usage); after each pass the model is saved as a `params_pass_***.tar.gz` file (`***` is the pass id), and the final model is additionally saved as .
Collaborator:

"During training, model training information is printed every 100 iterations (since a chunk evaluator has been added, the Precision, Recall, and F1 of the current model computed per chunk are also printed; see the documentation for its detailed usage)" --> "During training, model training information is printed every 100 iterations. We have also added a chunk evaluator to the model configuration, which outputs the current model's Precision, Recall, and F1 for chunk recognition. See the documentation for detailed usage of the chunk evaluator."

Collaborator (author):

Done.

input=crf_dec,
label=target,
chunk_scheme='IOB',
num_chunk_types=(label_dict_len - 1) / 2)
Collaborator:

Lines 221 to 222 are a bug; they should be deleted.

Collaborator (author):

Done.

@@ -260,4 +266,5 @@ def ner_net_infer(data_reader=test_data_reader, model_file='ner_model.tar.gz'):
if __name__ == '__main__':
    paddle.init(use_gpu=False, trainer_count=1)
Collaborator:

Please slightly modify lines 268 to 270, e.g.:

is_test = False

if is_test:
    ner_net_infer(data_reader=test_data_reader, model_file='ner_model.tar.gz')
else:
    ner_net_train(data_reader=train_data_reader, num_passes=1)

Collaborator (author):

Done for now.

lcy-seso (Collaborator) left a comment:

LGTM

lcy-seso merged commit bf929d8 into PaddlePaddle:develop on May 24, 2017
HongyuLi2018 pushed a commit that referenced this pull request Apr 25, 2019
Successfully merging this pull request may close these issues.

example configuration for NER.
3 participants