add ner model #30
Conversation
The dataset and the model are quite large and should not be uploaded to the repository; please clean this up. This commit should keep only two files: conll03.py and ner_final.py. For the dataset, either provide a download link or integrate it into paddle.dataset.

Request changes.
predict = paddle.layer.crf_decoding(
    size=label_dict_len,
    input=output,
    param_attr=paddle.attr.Param(name='crfw'))
The original configuration file contains a chunk evaluator.

To be done.
hidden_para_attr = paddle.attr.Param(
    initial_std=default_std, learning_rate=mix_hidden_lr)

lstm_1_1 = paddle.layer.lstmemory(
The original configuration file uses a simple RNN. Considering that we do not provide any other example of how to use the simple recurrent layer, I suggest not modifying the original configuration.

Done.
def ner_net():
    word = paddle.layer.data(name='word', type=d_type(word_dict_len))
    #ws = paddle.layer.data(name='ws', type=d_type(num_ws))
If the comment has no special purpose, please remove it.

Done.
    size=word_dim,
    input=paddle.layer.table_projection(input=word, param_attr=emb_para))
#ws_embedding = paddle.layer.mixed(name='ws_embedding', size=caps_dim,
#    input=paddle.layer.table_projection(input=ws))
If the comment has no special purpose, please remove it.

Done.
Bug occurs.

sequence_tagging_for_ner/conll03.py (outdated)
def reader():
    for sentence, labels in corpus_reader():
        #word_idx = [word_dict.get(w, UNK_IDX) for w in sentence]
Please remove useless comments.

Done.
    gate_act=paddle.activation.Sigmoid(),
    state_act=paddle.activation.Sigmoid(),
    bias_attr=std_0,
    param_attr=lstm_para_attr)
Lines 51 to 109: please rewrite these using a for loop.

To be done — I haven't found a good way to rewrite it yet, so I am following the original configuration for now.

What is the problem?

I'm not sure which structure the for loop is supposed to factor out. For example, the SRL example uses a for loop to extract an "fc + lstm" block in order to build a stacked bidirectional LSTM. Could you give a hint? Thanks for the review.
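My current guess at what such a loop would look like, loosely following the "fc + lstm" block from the SRL (label_semantic_roles) configuration. This is only a sketch: `hidden_dim`, `depth`, `std_default` and `word_embedding` are placeholder names, while `std_0`, `hidden_para_attr` and `lstm_para_attr` are the attributes already defined in this config.

```python
hidden_0 = paddle.layer.mixed(
    size=hidden_dim,
    bias_attr=std_default,
    input=paddle.layer.full_matrix_projection(
        input=word_embedding, param_attr=hidden_para_attr))

lstm_0 = paddle.layer.lstmemory(
    input=hidden_0,
    act=paddle.activation.Relu(),
    gate_act=paddle.activation.Sigmoid(),
    state_act=paddle.activation.Sigmoid(),
    bias_attr=std_0,
    param_attr=lstm_para_attr)

input_tmp = [hidden_0, lstm_0]
for i in range(1, depth):
    # Each level mixes the previous fc output with the previous LSTM output,
    # and alternates the LSTM direction to form a stacked bidirectional structure.
    mix_hidden = paddle.layer.mixed(
        size=hidden_dim,
        bias_attr=std_default,
        input=[
            paddle.layer.full_matrix_projection(
                input=input_tmp[0], param_attr=hidden_para_attr),
            paddle.layer.full_matrix_projection(
                input=input_tmp[1], param_attr=lstm_para_attr)
        ])
    lstm = paddle.layer.lstmemory(
        input=mix_hidden,
        act=paddle.activation.Relu(),
        gate_act=paddle.activation.Sigmoid(),
        state_act=paddle.activation.Sigmoid(),
        reverse=(i % 2 == 1),
        bias_attr=std_0,
        param_attr=lstm_para_attr)
    input_tmp = [mix_hidden, lstm]
```

Is this the kind of structure you have in mind?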
        if len(test_data) == 10:
            break

feature_out, target, crf_cost, predict = ner_net()
crf_cost cannot be used in inference because label is unknown.

Done.
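For reference, the inference path now roughly follows the sketch below — `crf_decoding` only needs the shared `crfw` parameter, not the label, and produces the predicted tag ids. Variable names here are illustrative rather than the exact ones in the final code.

```python
# Decode with the CRF transition parameters only; no label input is required.
predict = paddle.layer.crf_decoding(
    size=label_dict_len,
    input=feature_out,
    param_attr=paddle.attr.Param(name='crfw'))

# field='id' returns the decoded label ids for each test sequence.
lab_ids = paddle.infer(
    output_layer=predict,
    parameters=parameters,
    input=test_data,
    field='id')
```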
@lcy-seso A question about how to use the chunk evaluator: the documentation states tagType = label % numTagType and chunkType = label / numTagType — do the label ids need to satisfy these conditions when using the chunk evaluator?
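For concreteness, my current reading of that layout is the sketch below: with the IOB scheme there are two tag types (B and I), so each chunk type occupies two consecutive ids and O takes the last id. The label ordering here is hypothetical; only the id arithmetic matters.

```python
# Hypothetical label ordering; the point is the id arithmetic.
labels = ['B-LOC', 'I-LOC', 'B-MISC', 'I-MISC', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER', 'O']
num_tag_types = 2                          # B and I
num_chunk_types = (len(labels) - 1) // 2   # matches num_chunk_types=(label_dict_len - 1) / 2

for label_id, name in enumerate(labels[:-1]):   # every label except the trailing 'O'
    tag_type = label_id % num_tag_types         # 0 -> B, 1 -> I
    chunk_type = label_id // num_tag_types      # 0 -> LOC, 1 -> MISC, 2 -> ORG, 3 -> PER
    print(name, tag_type, chunk_type)
```

Is that the required arrangement?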
Some modifications are needed to the code and the doc.
sequence_tagging_for_ner/README.md (outdated)

命名实体识别(Named Entity Recognition,NER)又称作“专名识别”,是指识别文本中具有特定意义的实体,主要包括人名、地名、机构名、专有名词等,是自然语言处理研究的一个基础问题。NER任务通常包括实体边界识别、确定实体类别两部分,可以将其作为序列标注问题,根据序列标注结果可以直接得到实体边界和实体类别。
##数据说明
在本示例中,我们将使用CoNLL 2003 NER任务中开放出的数据集。由于版权原因,我们暂不提供此数据集的下载,可以按照[此页面](http://www.clips.uantwerpen.be/conll2003/ner/)中的说明免费获取该数据。该数据集中训练和测试数据格式如下
<img src="image/data_format.png" width = "60%" align=center /><br>
Please convert this into a table, or redraw it.

Done.
sequence_tagging_for_ner/README.md (outdated)

<div align="center">
<img src="image/ner_network.png" width = "60%" align=center /><br>
图1. NER模型网络结构
</div>
Please make the figure smaller; it is too large.

Done.
sequence_tagging_for_ner/ner.py (outdated)

    bias_attr=std_0,
    param_attr=lstm_para_attr)
    param_attr=rnn_para_attr)
Please rewrite lines 62 ~ 107 using a for loop.

I'm not sure which structure the for loop should factor out; could you be more specific? For example, the SRL example uses a for loop to extract an "fc + lstm" block to build a stacked bidirectional LSTM.
sequence_tagging_for_ner/ner.py (outdated)

ner_net_train()
ner_net_infer()
parameters = ner_net_train(train_data_reader, 1)
ner_net_infer(test_data_reader, parameters)
- As in the other examples, save the trained model locally; currently the model is not stored.
- At prediction time, load the saved model and then run inference.

Done.
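Roughly, the saving and loading now follow the usual v2 pattern — a minimal sketch (the file name is the one used in this example; error handling omitted):

```python
import gzip

# After training: store the parameters locally.
with gzip.open('ner_model.tar.gz', 'w') as f:
    parameters.to_tar(f)

# Before inference: load the stored parameters back.
with gzip.open('ner_model.tar.gz', 'r') as f:
    parameters = paddle.parameters.Parameters.from_tar(f)
```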
Thanks for the review. The for-loop rewrite is the one item not yet fixed; I need more specific guidance on it — I'm not sure which structure the for loop should factor out (for example, the SRL example uses a for loop to extract an "fc + lstm" block to build a stacked bidirectional LSTM).
Also, if the data is moved out of the repository as @luotao1 suggested, could you provide a place to host it, similar to the SRL example, e.g. http://paddlepaddle.bj.bcebos.com/demo/srl_dict_and_embedding/wordDict.txt ?
The readme doc needs modifications.
# 命名实体识别

## 背景说明
The background section needs more work. NER is one example of sequence tagging, and showing that sequence tagging can be applied to other tasks as well is a goal of this example.
- Briefly explain what sequence tagging is.
- Explain the idea of solving sequence tagging with neural networks.
- Explain what else sequence tagging can do — which other tasks can be solved with the configuration in this example.
- For the necessary background on sequence tagging, point readers directly to the semantic role labeling chapter of PaddleBook.

Done. Rewrote the background section and added the corresponding content.
sequence_tagging_for_ner/README.md (outdated)

. . O O
```

其中第一列为原始句子序列(第二、三列分别为词性标签和句法分析中的语块标签,这里暂时不用),第四列为采用了I-TYPE方式表示的NER标签(I-TYPE和[BIO方式](https://github.com/PaddlePaddle/book/tree/develop/07.label_semantic_roles)的主要区别在于语块开始标记的使用上,I-TYPE只有在出现相邻的同类别实体时对后者使用B标记,其他均使用I标记),而我们这里将使用BIO方式表示的标签集,这两种方式的转换过程在我们提供的`conll03.py`文件中进行。另外,我们附上word词典、label词典和预训练的词向量三个文件(word词典和词向量来源于[Stanford cs224d](http://cs224d.stanford.edu/)课程作业)以供使用。
- Please explain what processing `conll03.py` does, and how to run `conll03.py`.
- The data-processing description is very unclear. For example, if someone actually wants to run this example:
  - What does the raw downloaded data look like?
  - Where should the data be placed after downloading?
  - Which script should be run first for preprocessing? Does it need arguments? Do paths need to be set?
  - What does the data look like after preprocessing?
  - For the preprocessed data, which variables should be modified, and to what? After that, which script is run for training?
- Please organize the documentation as step-by-step instructions.
- Please refer to the "custom data" section of the text classification example to explain how to switch to one's own data for training.

Done. Rewrote the data section with a much more detailed description. There is no standalone preprocessing script for now; preprocessing is built into the function that returns the data generator. A section on custom data was also added to the usage instructions.
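The I-TYPE → BIO conversion mentioned above is done inside the reader; in essence it is the following (a simplified sketch — the actual code in `conll03.py` may differ in details):

```python
def to_bio(tags):
    # Convert I-TYPE (IOB1) tags to BIO (IOB2): the first token of every chunk
    # gets a B- prefix, not only a chunk that directly follows one of the same type.
    bio = []
    prev_type = None
    for tag in tags:
        if tag == 'O':
            bio.append('O')
            prev_type = None
            continue
        prefix, chunk_type = tag.split('-', 1)
        if prefix == 'I' and chunk_type != prev_type:
            prefix = 'B'    # this token starts a new chunk
        bio.append(prefix + '-' + chunk_type)
        prev_type = chunk_type
    return bio
```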
sequence_tagging_for_ner/README.md (outdated)

## 模型说明

在本示例中,我们所使用的模型结构如图1所示。其输入为句子序列,在取词向量转换为词向量序列后,经过多组全连接层、双向RNN进行特征提取,最后接入CRF以学习到的特征为输入,以标记序列为监督信号,完成序列标注。更多关于RNN及其变体的知识可见[此页面](http://book.paddlepaddle.org/06.understand_sentiment/)。
Break the model pipeline down into steps such as:
1. The input features are one-hot representations of ...
2. Convert them to word embeddings ...
3. An RNN learns sentence features ...
4. A CRF completes the sequence tagging
Numbered steps like this do not need to be long, but keep the logic clear.

Done.
sequence_tagging_for_ner/README.md (outdated)

### 数据设置

运行`ner.py`需要对数据设置部分进行更改,将以下代码中的变量值修改为正确的文件路径即可。
Modify it where, and how? Please write this from a hands-on standpoint, so that the user does not have to guess anything.

Done.
sequence_tagging_for_ner/README.md (outdated)

运行`ner.py`需要对数据设置部分进行更改,将以下代码中的变量值修改为正确的文件路径即可。

```python
# init dataset
Please do not use abbreviations in this comment.

Done.
sequence_tagging_for_ner/README.md (outdated)

`ner.py`提供了以下两个接口分别进行模型训练和预测:

1. `ner_net_train(data_reader, num_passes)`函数实现了模型训练功能,参数`data_reader`表示训练数据的迭代器(使用默认值即可)、`num_passes`表示训练pass的轮数。训练过程中每100个iteration会打印模型训练信息,每个pass后会将模型保存下来,并将最终模型保存为`ner_net.tar.gz`。
Store the model by pass id; the documentation here also needs to be updated accordingly.

Done. The model is now saved by pass id, and the model from the final pass is additionally saved as ner_model.tar.gz.
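The per-pass saving lives in the trainer's event handler, roughly as sketched below. Printing training information every 100 iterations is the behaviour described in the README; `parameters` is the Parameters object being trained.

```python
import gzip

def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            # Print training information every 100 iterations.
            print("pass %d, batch %d, cost %f" % (
                event.pass_id, event.batch_id, event.cost))
    if isinstance(event, paddle.event.EndPass):
        # Save the model after every pass, named by the pass id.
        with gzip.open('params_pass_%d.tar.gz' % event.pass_id, 'w') as f:
            parameters.to_tar(f)
```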
sequence_tagging_for_ner/README.md (outdated)

1. `ner_net_train(data_reader, num_passes)`函数实现了模型训练功能,参数`data_reader`表示训练数据的迭代器(使用默认值即可)、`num_passes`表示训练pass的轮数。训练过程中每100个iteration会打印模型训练信息,每个pass后会将模型保存下来,并将最终模型保存为`ner_net.tar.gz`。

2. `ner_net_infer(data_reader, model_file)`函数实现了预测功能,参数`data_reader`表示测试数据的迭代器(使用默认值即可)、`model_file`表示保存在本地的模型文件,预测过程会按如下格式打印预测结果:
What exactly does the user need to do to run prediction? This is not described clearly.
- How is it run? (For instance, if it is simply `python A.py`, then say so plainly and directly; if the value of some variable needs to be changed, please explain that as well.)
- Does the data need to be replaced?

Done.
sequence_tagging_for_ner/README.md (outdated)

for O
Baghdad B-LOC
. O
```
Please remove the indentation before the text block.

Done.
@lcy-seso Thanks for the review. The README has been revised substantially based on the last round of comments. In addition, to make the example more convincing, the model input was changed following the paper Natural Language Processing (Almost) from Scratch: each word is lowercased, and a capitalization mark is added as an extra feature.
Good writing, almost LGTM.
sequence_tagging_for_ner/README.md (outdated)

<div align="center">
<img src="image/ner_label_ins.png" width = "80%" align=center /><br>
图1. NER标注示例
Please rename the figure caption: NER标注示例 --> BIO标注方法示例

Done.
sequence_tagging_for_ner/README.md (outdated)

根据序列标注结果可以直接得到实体边界和实体类别。类似的,分词、词性标注、语块识别、[语义角色标注](http://book.paddlepaddle.org/07.label_semantic_roles/)等任务同样可作为序列标注问题。

由于序列标注问题的广泛性,产生了[CRF](http://book.paddlepaddle.org/07.label_semantic_roles/)等经典的序列模型,这些模型多只能使用局部信息或需要人工设计特征。发展到深度学习阶段,各种网络结构能够实现复杂的特征抽取功能,循环神经网络(Recurrent Neural Network,RNN,更多相关知识见[此页面](http://book.paddlepaddle.org/07.label_semantic_roles/))能够处理输入序列元素之间前后关联的问题而更适合序列数据。使用神经网络模型解决问题的思路通常是:前层网络学习输入的特征表示,网络的最后一层在特征基础上完成最终的任务;对于序列标注问题的通常做法是:使用基于RNN的网络结构学习特征,将学习到的特征接入CRF进行序列标注。这实际上是将传统CRF中的线性模型换成了非线性神经网络,沿用CRF的出发点是:CRF使用句子级别的似然概率,能够更好的解决标记偏置问题[[2](#参考文献)]。本示例中也将基于此思路建立模型,另外,虽然这里使用的是NER任务,但是所给出的模型也可以应用到其他序列标注任务中。
- 更多相关知识见[此页面] --> change "此页面" to "PaddleBook中语义角色标注一课" (the semantic role labeling chapter of PaddleBook).
- Change the semantic role labeling link to the Chinese README; the docs in models are currently all in Chinese.

Done.
sequence_tagging_for_ner/README.md (outdated)

## 模型说明

在NER任务中,输入是"一句话",目标是识别句子中的实体边界及类别,我们这里仅使用原始句子作为特征(参照论文\[[2](#参考文献)\]进行了一些预处理工作:将每个词转换为小写并将原词是否大写另作为一个特征)。按照上文所述处理序列标注问题的思路,可以构造如下结构的模型(图2是模型结构示意图):
我们这里仅使用原始句子作为特征 --> 我们这里仅对原始句子作为特征

Done.
sequence_tagging_for_ner/README.md (outdated)

### 运行程序

本示例另在`ner.py`中提供了完整的运行流程,包括数据接口的使用和模型训练、预测。根据上文所述的接口使用方法,使用时需要将`ner.py`中如下的数据设置部分中的各变量修改为正确的文件路径:
- 本示例 --> 本例
- 修改为正确的文件路径 --> 修改为对应文件路径

Done.
sequence_tagging_for_ner/README.md (outdated)

# 修改以下变量为对应文件路径
train_data_file = 'data/train'     # 训练数据文件
test_data_file = 'data/test'       # 测试数据文件
vocab_file = 'data/vocab.txt'      # word_dict文件
Please describe the word_dict variable in the comment in Chinese, e.g. "输入句子对应的字典文件的路径"; otherwise readers still have to work out what this variable means.

Done.
sequence_tagging_for_ner/README.md (outdated)

```python
# 修改以下变量为对应文件路径
train_data_file = 'data/train'     # 训练数据文件
训练数据文件 --> 训练数据文件的路径

Done.
sequence_tagging_for_ner/README.md (outdated)

```python
# 修改以下变量为对应文件路径
train_data_file = 'data/train'     # 训练数据文件
test_data_file = 'data/test'       # 测试数据文件
测试数据文件 --> 测试数据文件的路径

Done.
sequence_tagging_for_ner/README.md (outdated)

train_data_file = 'data/train'     # 训练数据文件
test_data_file = 'data/test'       # 测试数据文件
vocab_file = 'data/vocab.txt'      # word_dict文件
target_file = 'data/target.txt'    # label_dict文件
Please explain label_dict in the comment in Chinese as well.

Done.
sequence_tagging_for_ner/README.md (outdated)

test_data_file = 'data/test'         # 测试数据文件
vocab_file = 'data/vocab.txt'        # word_dict文件
target_file = 'data/target.txt'      # label_dict文件
emb_file = 'data/wordVectors.txt'    # 词向量文件
词向量文件 --> 预训练的词向量参数的路径

Done.
sequence_tagging_for_ner/README.md (outdated)

# 模型训练
ner_net_train(data_reader=train_data_reader, num_passes=1)
# 预测
ner_net_infer(data_reader=test_data_reader, model_file='ner_model.tar.gz')
Please make the saved/loaded model file names here depend on the pass_id as well.

Done.
Some small modifications.
sequence_tagging_for_ner/README.md (outdated)

## 模型说明

在NER任务中,输入是"一句话",目标是识别句子中的实体边界及类别,我们这里仅使用原始句子作为特征(参照论文\[[2](#参考文献)\]进行了一些预处理工作:将每个词转换为小写并将原词是否大写另作为一个特征)。按照上文所述处理序列标注问题的思路,可以构造如下结构的模型(图2是模型结构示意图):
在NER任务中,输入是"一句话",目标是识别句子中的实体边界及类别,我们这里仅对原始句子作为特征(参照论文\[[2](#参考文献)\]进行了一些预处理工作:将每个词转换为小写并将原词是否大写另作为一个特征)。按照上文所述处理序列标注问题的思路,可以构造如下结构的模型(图2是模型结构示意图):
Done.
sequence_tagging_for_ner/README.md (outdated)

@@ -43,7 +43,7 @@
| eng.testa | 验证数据,可用来进行参数调优 |
| eng.testb | 评估数据,用来进行最终效果评估 |

这三个文件数据格式如下:
(为保证本例的完整性,我们从中抽取少量样本放在`data/train`和`data/test`文件中作为训练和测试示例使用;由于版权原因完整数据还请自行获取)这三个文件数据格式如下:
(为保证本例的完整性,我们从中抽取少量样本放在`data/train`和`data/test`文件中作为训练和测试示例使用;由于版权原因完整数据还请自行获取) --> 为保证本例的完整性,我们从中原始数据抽取少量样本放在`data/train`和`data/test`文件中,作为示例使用;由于版权原因,完整数据还请大家自行获取。

Done.
sequence_tagging_for_ner/README.md (outdated)

@@ -85,7 +85,7 @@
| baghdad | 1 | B-LOC |
| . | 0 | O |

另外,我们附上word词典、label词典和预训练的词向量三个文件(word词典和词向量来源于[Stanford cs224d](http://cs224d.stanford.edu/)课程作业)以供使用。
另外,我们附上word词典、label词典和预训练的词向量三个文件(word词典和词向量来源于[Stanford CS224d](http://cs224d.stanford.edu/)课程作业)以供使用。
Could a download shell script be added under the data folder for these three files, instead of putting them directly in the repository?

Done.
sequence_tagging_for_ner/README.md (outdated)

@@ -120,7 +120,7 @@ test_data_reader = conll03.test(test_data_file, vocab_file, target_file)

`ner.py`提供了以下两个接口分别进行模型训练和预测:

1. `ner_net_train(data_reader, num_passes)`函数实现了模型训练功能,参数`data_reader`表示训练数据的迭代器、`num_passes`表示训练pass的轮数。训练过程中每100个iteration会打印模型训练信息,每个pass后会将模型保存为`params_pass_***.tar.gz`的文件(`***`表示pass的id),并将最终模型另存为`ner_model.tar.gz`。
1. `ner_net_train(data_reader, num_passes)`函数实现了模型训练功能,参数`data_reader`表示训练数据的迭代器、`num_passes`表示训练pass的轮数。训练过程中每100个iteration会打印模型训练信息(由于加入了chunk evaluator,会按语块计算当前模型识别的Precision、Recall和F1值,这里也会打印出来,其详细使用说明请参照[文档](http://www.paddlepaddle.org/develop/doc/api/v2/config/evaluators.html#chunk)),每个pass后会将模型保存为`params_pass_***.tar.gz`的文件(`***`表示pass的id),并将最终模型另存为。
Done.
    input=crf_dec,
    label=target,
    chunk_scheme='IOB',
    num_chunk_types=(label_dict_len - 1) / 2)
Lines 221 ~ 222 are a bug and should be deleted.

Done.
sequence_tagging_for_ner/ner.py (outdated)

@@ -260,4 +266,5 @@ def ner_net_infer(data_reader=test_data_reader, model_file='ner_model.tar.gz'):
if __name__ == '__main__':
    paddle.init(use_gpu=False, trainer_count=1)
Please tweak lines 268 ~ 270 slightly, along these lines:

is_test = False
if not is_test:
    ner_net_train(data_reader=train_data_reader, num_passes=1)
else:
    ner_net_infer(data_reader=test_data_reader, model_file='ner_model.tar.gz')

Done temporarily.
LGTM
fix #13