
add ner model #30

Merged: 8 commits merged into PaddlePaddle:develop on May 24, 2017
Conversation

guoshengCS (Collaborator):

fix #13

guoshengCS requested a review from lcy-seso on May 8, 2017 at 08:18
luotao1 (Contributor) left a comment:

The dataset and model files are quite large and should not be committed to the repository; please clean them up. This commit should keep only two files, conll03.py and ner_final.py. For the dataset, either provide a download link or integrate it into paddle.dataset.

lcy-seso (Collaborator) left a comment:

Requesting changes.

predict = paddle.layer.crf_decoding(
    size=label_dict_len,
    input=output,
    param_attr=paddle.attr.Param(name='crfw'))
Collaborator:

The original configuration file contains a chunk evaluator.

Collaborator (author):

To be done.
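For context, a minimal sketch of attaching a chunk evaluator in the v2 API, assuming `paddle.evaluator.chunk` and reusing the argument names that appear later in this thread (`feature_out`, `target`, `label_dict_len`):

```python
# Decode the best tag sequence and compare it against the gold labels,
# so the evaluator can compute chunk-level Precision/Recall/F1.
crf_dec = paddle.layer.crf_decoding(
    size=label_dict_len,
    input=feature_out,
    label=target,
    param_attr=paddle.attr.Param(name='crfw'))

# With the IOB scheme there are two tag types (B and I) plus one O label,
# hence (label_dict_len - 1) / 2 chunk (entity) types.
paddle.evaluator.chunk(
    input=crf_dec,
    label=target,
    chunk_scheme='IOB',
    num_chunk_types=(label_dict_len - 1) / 2)
```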

hidden_para_attr = paddle.attr.Param(
    initial_std=default_std, learning_rate=mix_hidden_lr)

lstm_1_1 = paddle.layer.lstmemory(
Collaborator:

The original configuration file uses a simple RNN. Considering that we do not yet provide any example of how to use the simple recurrent layer, I suggest not modifying the original configuration.

Collaborator (author):

Done.


def ner_net():
    word = paddle.layer.data(name='word', type=d_type(word_dict_len))
    #ws = paddle.layer.data(name='ws', type=d_type(num_ws))
Collaborator:

If the comment has no special purpose, please remove it.

Collaborator (author):

Done.

size=word_dim,
input=paddle.layer.table_projection(input=word, param_attr=emb_para))
#ws_embedding = paddle.layer.mixed(name='ws_embedding', size=caps_dim,
# input=paddle.layer.table_projection(input=ws))
Collaborator:

If the comment has no special purpose, please remove it.

Collaborator (author):

Done.

lcy-seso (Collaborator) left a comment:

A bug occurs.


def reader():
    for sentence, labels in corpus_reader():
        #word_idx = [word_dict.get(w, UNK_IDX) for w in sentence]
Collaborator:

Please remove useless comments.

Collaborator (author):

Done.

gate_act=paddle.activation.Sigmoid(),
state_act=paddle.activation.Sigmoid(),
bias_attr=std_0,
param_attr=lstm_para_attr)
Collaborator:

Please rewrite lines 51 to 109 using a for loop.

Collaborator (author):

To be done. I have not found a good way to rewrite this yet, so I am following the original configuration for now.

Collaborator:

What is the problem?

Collaborator (author):

I am not sure which structure should be factored out with a for loop. For example, the SRL demo uses a for loop to factor out an "fc + lstm" block to build a stacked bidirectional LSTM. Could you give some guidance? Thanks for the review.
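For reference, a minimal sketch of the SRL-style pattern being discussed, assuming the PaddlePaddle v2 API used elsewhere in this PR; `word_embedding`, `hidden_dim`, `depth`, and the parameter attributes are placeholders, not the exact configuration under review:

```python
# Build a stacked bidirectional LSTM by repeating an "fc + lstm" block
# in a for loop, as in the SRL example.
hidden_0 = paddle.layer.fc(
    input=word_embedding, size=hidden_dim, act=paddle.activation.Tanh())
lstm_0 = paddle.layer.lstmemory(input=hidden_0)

input_tmp = [hidden_0, lstm_0]
for i in range(1, depth):
    fc = paddle.layer.fc(
        input=input_tmp, size=hidden_dim, act=paddle.activation.Tanh())
    lstm = paddle.layer.lstmemory(
        input=fc,
        # Reverse the direction of every other LSTM layer so that the
        # stack as a whole is bidirectional.
        reverse=(i % 2 == 1))
    input_tmp = [fc, lstm]

feature_out = paddle.layer.fc(
    input=input_tmp, size=label_dict_len, act=paddle.activation.Linear())
```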

if len(test_data) == 10:
    break

feature_out, target, crf_cost, predict = ner_net()
Collaborator:

crf_cost cannot be used in inference because the label is unknown.
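A minimal sketch of the distinction the reviewer is pointing at, assuming the v2 API used in this PR: the CRF cost needs the ground-truth label and is only usable for training, while inference decodes with `crf_decoding` and passes its output to `paddle.infer`. `feature_out`, `parameters`, and `test_data` are placeholders here:

```python
# Training: the CRF cost layer requires the ground-truth label.
crf_cost = paddle.layer.crf(
    size=label_dict_len,
    input=feature_out,
    label=target,
    param_attr=paddle.attr.Param(name='crfw'))

# Inference: decode the best tag sequence without any label input,
# and feed `predict` (not `crf_cost`) to paddle.infer.
predict = paddle.layer.crf_decoding(
    size=label_dict_len,
    input=feature_out,
    param_attr=paddle.attr.Param(name='crfw'))

lab_ids = paddle.infer(
    output_layer=predict,
    parameters=parameters,
    input=test_data,
    field='id')
```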

Collaborator (author):

Done.

lcy-seso self-assigned this on May 9, 2017

guoshengCS (Collaborator, author) commented on May 10, 2017:

@lcy-seso I have a question about using the chunk evaluator. Below are partial screenshots from the documentation. When using the chunk evaluator, do the label ids have to satisfy these conditions: tagType = label % numTagType and chunkType = label / numTagType?
[screenshot from the chunk evaluator documentation]
Also, when configuring it, should numChunkTypes be set as num_chunk_types = (number_classes - 1) / len(schema)?
[screenshot from the chunk evaluator documentation]
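For concreteness, a small worked example of the label-id layout this question refers to, assuming the CoNLL-2003 setup discussed in this PR (four entity types, IOB scheme, so label_dict_len = 9); the exact id assignment is illustrative:

```python
# With the IOB scheme, len(schema) == 2 (B and I tags), plus a single O label.
# Ids are laid out so that tag_type = label % num_tag_types and
# chunk_type = label // num_tag_types, with O as the last id.
label_dict = {
    'B-PER': 0, 'I-PER': 1,
    'B-LOC': 2, 'I-LOC': 3,
    'B-ORG': 4, 'I-ORG': 5,
    'B-MISC': 6, 'I-MISC': 7,
    'O': 8,
}
label_dict_len = len(label_dict)              # 9
num_chunk_types = (label_dict_len - 1) // 2   # 4 entity types
```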

lcy-seso (Collaborator) left a comment:

The code and the doc need some modifications.

Named Entity Recognition (NER), also known as "proper name recognition", is the task of identifying entities with specific meanings in text, mainly person names, place names, organization names, and other proper nouns; it is a fundamental problem in natural language processing research. An NER task usually has two parts, detecting entity boundaries and determining entity categories. It can be cast as a sequence labeling problem, from whose results the entity boundaries and categories can be read off directly.
## Data description
In this example we use the dataset released for the CoNLL 2003 NER task. For copyright reasons we do not provide a download of this dataset for now; it can be obtained for free by following the instructions on [this page](http://www.clips.uantwerpen.be/conll2003/ner/). The training and test data in this dataset are formatted as follows:
<img src="image/data_format.png" width = "60%" align=center /><br>
Collaborator:

Please turn this into a table, or redraw the figure.

Collaborator (author):

Done.

<div align="center">
<img src="image/ner_network.png" width = "60%" align=center /><br>
Figure 1. Network architecture of the NER model
</div>
Collaborator:

Please make the figure smaller; it is too large.

Collaborator (author):

Done.

bias_attr=std_0,
param_attr=lstm_para_attr)
param_attr=rnn_para_attr)
Collaborator:

Please rewrite lines 62 to 107 using a for loop.

Collaborator (author):

I am not sure which structure should be factored out with a for loop; could you explain more specifically? For example, the SRL demo uses a for loop to factor out an "fc + lstm" block to build a stacked bidirectional LSTM.

ner_net_train()
ner_net_infer()
parameters = ner_net_train(train_data_reader, 1)
ner_net_infer(test_data_reader, parameters)
Collaborator:

  • As in the other examples, save the trained model locally; currently the model is not being saved.
  • At inference time, load the saved model and run prediction with it.

Collaborator (author):

Done.

guoshengCS (Collaborator, author):

Thanks for the review. The item about rewriting with a for loop has not been addressed yet; I still need more specific guidance on which structure should be factored out with a for loop. For example, the SRL demo uses a for loop to factor out an "fc + lstm" block to build a stacked bidirectional LSTM.

guoshengCS (Collaborator, author):

Also, if the data is moved out of the repository as @luotao1 suggested, could you provide a place to host it, similar to the one used for SRL, e.g. http://paddlepaddle.bj.bcebos.com/demo/srl_dict_and_embedding/wordDict.txt?

lcy-seso (Collaborator) left a comment:

The README doc needs modifications.

# Named Entity Recognition

## Background

Collaborator:

The background section needs to be improved. The goal of this example is to show NER as one instance of sequence labeling, and that sequence labeling can be used for other tasks as well.

  1. Briefly explain what sequence labeling is.
  2. Explain the idea of solving sequence labeling with neural networks.
  3. Explain what else sequence labeling can do, and what other tasks the configuration in this example can be used for.
  4. For further background on sequence labeling, point readers directly to the semantic role labeling chapter of the PaddleBook.

Collaborator (author):

Done. The background section has been rewritten and the corresponding content added.

. . O O
```

The first column is the original sentence (the second and third columns are the part-of-speech tags and the chunk tags from syntactic analysis, which are not used here), and the fourth column is the NER label in the I-TYPE scheme. The main difference between I-TYPE and the [BIO scheme](https://github.com/PaddlePaddle/book/tree/develop/07.label_semantic_roles) lies in how the beginning of a chunk is marked: I-TYPE uses the B tag only when two adjacent entities have the same type, marking the second one with B, and uses I tags everywhere else. We will use the label set in the BIO scheme here; the conversion between the two schemes is done in the `conll03.py` file we provide. In addition, we provide three files, the word dictionary, the label dictionary, and pretrained word vectors (the word dictionary and word vectors come from the [Stanford cs224d](http://cs224d.stanford.edu/) course assignments), for convenience.
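For illustration, a minimal sketch of the kind of I-TYPE-to-BIO conversion described above; the real logic lives in `conll03.py`, and the helper name here is hypothetical:

```python
def itype_to_bio(tags):
    """Convert one sentence's I-TYPE NER tags to BIO tags.

    In I-TYPE, B- is used only when a chunk directly follows a chunk of the
    same type; in BIO, every chunk starts with B-.
    """
    bio = []
    prev = 'O'
    for tag in tags:
        if tag == 'O':
            bio.append('O')
        elif tag.startswith('B-') or prev == 'O' or prev[2:] != tag[2:]:
            # First token of a chunk: force a B- prefix.
            bio.append('B-' + tag[2:])
        else:
            bio.append('I-' + tag[2:])
        prev = tag
    return bio

# Example with two adjacent LOC entities:
print(itype_to_bio(['I-LOC', 'I-LOC', 'B-LOC', 'O', 'I-PER']))
# ['B-LOC', 'I-LOC', 'B-LOC', 'O', 'B-PER']
```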
Collaborator:

  1. Please explain here what processing conll03.py performs.
  2. Explain how to run conll03.py.
  3. The data processing description is quite unclear. For someone who actually wants to run this example:
    • What does the raw data look like after downloading?
    • Where should the data be placed after downloading?
    • Which script should be run first for preprocessing? Does it need any arguments? Do any paths need to be set?
    • What does the data look like after preprocessing?
    • Which variables should be modified for the preprocessed data, and to what values? After modifying them, which script should be run to start training?
  4. Please organize the documentation step by step.
  5. Please refer to the custom data section of the text classification example to explain how to switch to one's own data for training.

Collaborator (author):

Done. The data description section has been rewritten with a more detailed explanation. A standalone preprocessing script is not provided yet; instead, the preprocessing is built into the functions that return the data generators. A section on using custom data has also been added to the usage instructions.
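For illustration, a minimal sketch of such a generator-based reader; `corpus_reader`, `word_dict`, `UNK_IDX`, and `label_dict` are the names used elsewhere in this thread, and the capitalization mark follows the preprocessing described later in the conversation:

```python
def reader():
    # Preprocessing happens inline: each raw (sentence, labels) pair is
    # converted to ids on the fly, so no separate preprocessing script is needed.
    for sentence, labels in corpus_reader():
        word_idx = [word_dict.get(w.lower(), UNK_IDX) for w in sentence]
        # 1 if the original word was capitalized, used as an extra feature.
        mark = [1 if w != w.lower() else 0 for w in sentence]
        label_idx = [label_dict[l] for l in labels]
        yield word_idx, mark, label_idx
```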


## Model description

The model structure used in this example is shown in Figure 1. Its input is a sentence; after the words are looked up and converted into a sequence of word vectors, several groups of fully connected layers and bidirectional RNNs extract features, and finally a CRF takes the learned features as input and the tag sequence as the supervision signal to complete the sequence labeling. More about RNNs and their variants can be found on [this page](http://book.paddlepaddle.org/06.understand_sentiment/).
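For illustration, a minimal sketch of this pipeline in the PaddlePaddle v2 API used throughout this PR; the layer sizes, depth, and variable names are assumptions, not the exact configuration under review:

```python
# 1. One-hot word ids as input.
word = paddle.layer.data(
    name='word', type=paddle.data_type.integer_value_sequence(word_dict_len))
target = paddle.layer.data(
    name='target', type=paddle.data_type.integer_value_sequence(label_dict_len))

# 2. Word ids -> word embeddings.
word_embedding = paddle.layer.embedding(input=word, size=word_dim)

# 3. Fully connected layer + bidirectional LSTM to learn sentence features.
hidden = paddle.layer.fc(
    input=word_embedding, size=hidden_dim, act=paddle.activation.Tanh())
lstm_fwd = paddle.layer.lstmemory(input=hidden)
lstm_bwd = paddle.layer.lstmemory(input=hidden, reverse=True)
feature_out = paddle.layer.fc(
    input=[hidden, lstm_fwd, lstm_bwd],
    size=label_dict_len,
    act=paddle.activation.Linear())

# 4. CRF for sequence labeling: crf for training, crf_decoding for prediction.
crf_cost = paddle.layer.crf(
    size=label_dict_len, input=feature_out, label=target,
    param_attr=paddle.attr.Param(name='crfw'))
predict = paddle.layer.crf_decoding(
    size=label_dict_len, input=feature_out,
    param_attr=paddle.attr.Param(name='crfw'))
```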
Collaborator:

Break the model pipeline into numbered steps, for example:
  1. The input features are one-hot representations ...
  2. Convert them into word vectors ...
  3. An RNN learns sentence features.
  4. A CRF completes the sequence labeling.
The description does not need to be long, but the logic should be clear.

Collaborator (author):

Done.


### Data settings

To run `ner.py`, the data settings section needs to be changed: simply set the variables in the following code to the correct file paths.
Collaborator:

Where should this be modified, and how? Please write the instructions as a hands-on, step-by-step walkthrough that requires no guesswork from the user.

Collaborator (author):

Done.

To run `ner.py`, the data settings section needs to be changed: simply set the variables in the following code to the correct file paths.

```python
# init dataset
Collaborator:

Do not use abbreviations in the comment here.

Collaborator (author):

Done.


`ner.py` provides the following two interfaces for model training and prediction, respectively:

1. The `ner_net_train(data_reader, num_passes)` function implements model training. The parameter `data_reader` is the iterator over the training data (the default value can be used), and `num_passes` is the number of training passes. During training, model training information is printed every 100 iterations, the model is saved after each pass, and the final model is saved as `ner_net.tar.gz`.
Collaborator:

Save the model by pass id; the documentation here needs to be updated accordingly.

Collaborator (author):

Done. The model is now saved by pass id, and the model from the last pass is additionally saved as ner_model.tar.gz.
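A minimal sketch of pass-id-based saving with a v2 event handler, as described above; `parameters` is assumed to be the paddle.parameters.Parameters object created for the trainer, and the file-name convention follows the thread:

```python
import gzip

def event_handler(event):
    # Print training information every 100 iterations.
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            print("Pass %d, Batch %d, Cost %f" %
                  (event.pass_id, event.batch_id, event.cost))
    # Save the parameters after each pass, named by pass id.
    if isinstance(event, paddle.event.EndPass):
        with gzip.open('params_pass_%d.tar.gz' % event.pass_id, 'w') as f:
            parameters.to_tar(f)
```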


1. The `ner_net_train(data_reader, num_passes)` function implements model training. The parameter `data_reader` is the iterator over the training data (the default value can be used), and `num_passes` is the number of training passes. During training, model training information is printed every 100 iterations, the model is saved after each pass, and the final model is saved as `ner_net.tar.gz`.

2. The `ner_net_infer(data_reader, model_file)` function implements prediction. The parameter `data_reader` is the iterator over the test data (the default value can be used), and `model_file` is the model file saved locally. The prediction results are printed in the following format:
Collaborator:

What does the user need to do to run prediction? This is not described clearly here.

  1. How is it executed? (For instance, if it is simply a matter of running python A.py, then say so plainly; if some variable values need to be modified, please explain that as well.)
  2. Does the data need to be replaced?

Collaborator (author):

Done.

for O
Baghdad B-LOC
. O
```
Collaborator:

Please remove the indentation before this text block.

Collaborator (author):

Done.

guoshengCS (Collaborator, author):

@lcy-seso Thanks for the review. The README has been revised substantially based on the previous review comments. In addition, to make the example more convincing, the model input has been modified following the paper Natural Language Processing (Almost) from Scratch: words are lowercased and a capitalization mark is added as an extra feature.

lcy-seso (Collaborator) left a comment:

Good writing, almost LGTM.


<div align="center">
<img src="image/ner_label_ins.png" width = "80%" align=center /><br>
Figure 1. An example of NER labeling
Collaborator:

Please change the figure title: "An example of NER labeling" --> "An example of the BIO labeling scheme".

Collaborator (author):

Done.


The entity boundaries and categories can be read off directly from the sequence labeling results. Similarly, tasks such as word segmentation, part-of-speech tagging, chunking, and [semantic role labeling](http://book.paddlepaddle.org/07.label_semantic_roles/) can also be cast as sequence labeling problems.

Because sequence labeling problems are so widespread, classical sequence models such as the [CRF](http://book.paddlepaddle.org/07.label_semantic_roles/) were developed; most of these models can use only local information or require hand-crafted features. In the deep learning era, various network structures can perform complex feature extraction; the Recurrent Neural Network (RNN, see [this page](http://book.paddlepaddle.org/07.label_semantic_roles/) for more background) can model the dependencies between elements of the input sequence and is therefore better suited to sequence data. The usual approach to solving a problem with neural networks is that the earlier layers learn a feature representation of the input while the last layer completes the final task on top of those features; for sequence labeling, the common practice is to learn features with an RNN-based network and feed the learned features into a CRF for labeling. This effectively replaces the linear model in the traditional CRF with a nonlinear neural network; the reason for keeping the CRF is that it uses sentence-level likelihood and can better address the label bias problem [[2](#参考文献)]. The model in this example is built along these lines. In addition, although the task here is NER, the model given can also be applied to other sequence labeling tasks.
Collaborator:

  • "More background can be found on [this page]" --> change "this page" to "the semantic role labeling chapter of the PaddleBook".
  • Change the semantic role labeling link to the Chinese README; the documentation in the models repo is currently all in Chinese.

Collaborator (author):

Done.


## Model description

In an NER task the input is a sentence, and the goal is to identify the entity boundaries and categories in it. Here we use only the original sentence as the feature (following paper \[[2](#参考文献)\], some preprocessing is applied: each word is converted to lowercase, and whether the original word was capitalized is kept as an additional feature). Following the approach to sequence labeling described above, a model with the following structure can be built (Figure 2 shows the model architecture):
Collaborator:

"我们这里仅使用原始句子作为特征" --> "我们这里仅对原始句子作为特征" (a wording tweak to the sentence about using only the original sentence as the feature)

Collaborator (author):

Done.


### Running the program

This example also provides the complete running workflow in `ner.py`, including the use of the data interfaces and model training and prediction. Following the interface usage described above, the variables in the following data settings section of `ner.py` need to be modified to the correct file paths:
Collaborator:

  1. "本示例" --> "本例" (use the shorter form of "this example")
  2. "修改为正确的文件路径" --> "修改为对应文件路径" ("modify to the correct file paths" --> "modify to the corresponding file paths")

Collaborator (author):

Done.

# Set the following variables to the corresponding file paths
train_data_file = 'data/train'       # training data file
test_data_file = 'data/test'         # test data file
vocab_file = 'data/vocab.txt'        # word_dict file
Collaborator:

In the comment, please describe the word_dict variable in plain words, e.g. "the path of the dictionary file for the input sentences"; otherwise the reader still has to go and work out what this variable means.

Collaborator (author):

Done.


```python
# Set the following variables to the corresponding file paths
train_data_file = 'data/train'       # training data file
Collaborator:

"training data file" --> "the path to the training data file"

Collaborator (author):

Done.

```python
# Set the following variables to the corresponding file paths
train_data_file = 'data/train'       # training data file
test_data_file = 'data/test'         # test data file
Collaborator:

"test data file" --> "the path to the test data file"

Collaborator (author):

Done.

train_data_file = 'data/train'       # training data file
test_data_file = 'data/test'         # test data file
vocab_file = 'data/vocab.txt'        # word_dict file
target_file = 'data/target.txt'      # label_dict file
Collaborator:

Please explain label_dict in plain words in the comment as well.

Collaborator (author):

Done.

test_data_file = 'data/test'         # test data file
vocab_file = 'data/vocab.txt'        # word_dict file
target_file = 'data/target.txt'      # label_dict file
emb_file = 'data/wordVectors.txt'    # word vector file
Collaborator:

"word vector file" --> "the path to the pretrained word vector parameters"

Collaborator (author):

Done.

# model training
ner_net_train(data_reader=train_data_reader, num_passes=1)
# prediction
ner_net_infer(data_reader=test_data_reader, model_file='ner_model.tar.gz')
Collaborator:

Please change both the saved and the loaded model file names here to be based on the pass_id.
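A minimal sketch of loading a model saved under a pass-id-based name, assuming the v2 `paddle.parameters.Parameters.from_tar` API; the file name here is illustrative:

```python
import gzip

# Load the parameters saved after a given pass (file named by pass id).
with gzip.open('params_pass_0.tar.gz', 'r') as f:
    parameters = paddle.parameters.Parameters.from_tar(f)
```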

Collaborator (author):

Done.

lcy-seso (Collaborator) left a comment:

Some small modifications.


## Model description

In an NER task the input is a sentence, and the goal is to identify the entity boundaries and categories in it. Here we use only the original sentence as the feature (following paper \[[2](#参考文献)\], some preprocessing is applied: each word is converted to lowercase, and whether the original word was capitalized is kept as an additional feature). Following the approach to sequence labeling described above, a model with the following structure can be built (Figure 2 shows the model architecture):
In an NER task the input is a sentence, and the goal is to identify the entity boundaries and categories in it. Here we only treat the original sentence as the feature (following paper \[[2](#参考文献)\], some preprocessing is applied: each word is converted to lowercase, and whether the original word was capitalized is kept as an additional feature). Following the approach to sequence labeling described above, a model with the following structure can be built (Figure 2 shows the model architecture):
Collaborator:

  • Sorry, I did not read carefully in the last review and my suggestion broke the first sentence.
  • "Here we only treat the original sentence as the feature (following paper [2], some preprocessing is applied: each word is converted to lowercase and whether the original word was capitalized is kept as an additional feature)" --> "Following paper [2], we only apply some preprocessing to the original sentence: each word is converted to lowercase, and whether the original word was capitalized is added as an extra feature; together these form the model input."

Collaborator (author):

Done.

@@ -43,7 +43,7 @@
| eng.testa | validation data, which can be used for hyperparameter tuning |
| eng.testb | evaluation data, used for the final evaluation |

The data format of these three files is as follows:
(To keep this example self-contained, we extract a few samples from them into the `data/train` and `data/test` files for use as training and test examples; for copyright reasons, please obtain the complete data yourself.) The data format of these three files is as follows:
Collaborator:

"(To keep this example self-contained, we extract a few samples from them into the data/train and data/test files for use as training and test examples; for copyright reasons, please obtain the complete data yourself.)" --> "To keep this example self-contained, we extract a few samples from the original data into the data/train and data/test files for use as examples; for copyright reasons, please obtain the complete data yourself."

Collaborator (author):

Done.

@@ -85,7 +85,7 @@
| baghdad | 1 | B-LOC |
| . | 0 | O |

In addition, we provide three files, the word dictionary, the label dictionary, and the pretrained word vectors (the word dictionary and word vectors come from the [Stanford cs224d](http://cs224d.stanford.edu/) course assignments), for convenience.
In addition, we provide three files, the word dictionary, the label dictionary, and the pretrained word vectors (the word dictionary and word vectors come from the [Stanford CS224d](http://cs224d.stanford.edu/) course assignments), for convenience.
Collaborator:

Would it be possible to add a download shell script under the data folder for these three files, instead of putting them directly into the repository?

Collaborator (author):

Done.

@@ -120,7 +120,7 @@ test_data_reader = conll03.test(test_data_file, vocab_file, target_file)

`ner.py` provides the following two interfaces for model training and prediction, respectively:

1. The `ner_net_train(data_reader, num_passes)` function implements model training. The parameter `data_reader` is the iterator over the training data and `num_passes` is the number of training passes. During training, model training information is printed every 100 iterations; after each pass the model is saved as a `params_pass_***.tar.gz` file (`***` is the pass id), and the final model is additionally saved as `ner_model.tar.gz`.
1. The `ner_net_train(data_reader, num_passes)` function implements model training. The parameter `data_reader` is the iterator over the training data and `num_passes` is the number of training passes. During training, model training information is printed every 100 iterations (since a chunk evaluator has been added, the Precision, Recall, and F1 of the current model computed per chunk are also printed; see the [documentation](http://www.paddlepaddle.org/develop/doc/api/v2/config/evaluators.html#chunk) for its detailed usage); after each pass the model is saved as a `params_pass_***.tar.gz` file (`***` is the pass id), and the final model is additionally saved as .
Collaborator:

"During training, model training information is printed every 100 iterations (since a chunk evaluator has been added, the Precision, Recall, and F1 of the current model computed per chunk are also printed; see the documentation for its detailed usage)" --> "During training, model training information is printed every 100 iterations. We have also added a chunk evaluator to the model configuration, which outputs the current model's Precision, Recall, and F1 for chunk recognition. See the documentation for detailed usage of the chunk evaluator."

Collaborator (author):

Done.

input=crf_dec,
label=target,
chunk_scheme='IOB',
num_chunk_types=(label_dict_len - 1) / 2)
Collaborator:

Lines 221 to 222 are a bug; they should be deleted.

Collaborator (author):

Done.

@@ -260,4 +266,5 @@ def ner_net_infer(data_reader=test_data_reader, model_file='ner_model.tar.gz'):
if __name__ == '__main__':
    paddle.init(use_gpu=False, trainer_count=1)
Collaborator:

Please slightly modify lines 268 to 270, e.g.:

is_test = False

if is_test:
    ner_net_infer(data_reader=test_data_reader, model_file='ner_model.tar.gz')
else:
    ner_net_train(data_reader=train_data_reader, num_passes=1)

Collaborator (author):

Done for now.

lcy-seso (Collaborator) left a comment:

LGTM

lcy-seso merged commit bf929d8 into PaddlePaddle:develop on May 24, 2017
HongyuLi2018 pushed a commit that referenced this pull request Apr 25, 2019
Successfully merging this pull request may close these issues.

example configuration for NER.
3 participants