
add analysis module in text classification application #3011

Merged: 8 commits, merged into develop on Aug 22, 2022

Conversation

@lugimzzz (Contributor) commented on Aug 10, 2022

PR types

Others

PR changes

Others

Description

Add an analysis module to the text classification application:

  • Add an evaluation script
  • Add three optimization schemes: sparse data identification, dirty data cleaning, and data augmentation
  • Add documentation for the analysis module
  • Adapt the doccano annotation code and docs for dirty-data and sparse-data annotation
  • The original dataset's annotation quality was low; to better validate the analysis module's schemes, the hierarchical classification dataset is replaced with the 2020 Language and Intelligence Challenge event extraction dataset


**Install TrustAI**
```shell
pip install trustai
```
Collaborator: Can the version be pinned down here?

Author (lugimzzz): Pinned: `pip install trustai==0.1.4`.


## Introduction to the analysis module

The analysis module provides a **model evaluation** script that evaluates each class separately, helping developers analyze model performance. Building on the trustworthy-AI toolkit (TrustAI) and the data augmentation API, it also provides three optimization schemes, **sparse data identification, dirty data cleaning, and data augmentation**, to help improve model performance.
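The per-class evaluation described above can be illustrated with scikit-learn, which the PR's scripts already import; the labels and class names below are invented for the example, and the PR's own evaluation script may differ in detail:

```python
# Illustrative sketch of per-class evaluation (labels/names invented for
# the example; not the PR's actual evaluation script).
from sklearn.metrics import classification_report

y_true = [0, 1, 1, 2, 2, 2]  # gold labels
y_pred = [0, 1, 2, 2, 2, 1]  # model predictions

# Reports precision/recall/F1 per class, plus macro and weighted averages,
# so weak classes can be identified individually.
print(classification_report(y_true, y_pred, target_names=["news", "sports", "finance"]))
```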
Collaborator: The analysis module seems to lack a concept diagram: what exactly is analyzed, and what strategies does the analysis lead to?

Author (lugimzzz): Added a concept diagram.

Author (lugimzzz): Added.

```shell
    --params_path "../checkpoint/" \
    --batch_size 16 \
    --sparse_num 100 \
    --valid_num 100
```
Collaborator: Since the recommendation is 10-20%, a fixed value here seems inconsistent; wouldn't a ratio be more appropriate?

Author (lugimzzz): The scheme-effectiveness experiments were all run with 100 sparse examples and 100 valid examples; changing this to a ratio would mean redoing those experiments. My suggestion is to keep the fixed-value form for now.

* `dev_file`: dev set file name in the local dataset; defaults to "dev.txt".
* `label_file`: label set file name in the local dataset; defaults to "label.txt".
* `sparse_file`: file name for sparse data saved under the local dataset path; defaults to "sparse.txt".
* `valid_file`: file name for positively influential (valid) training data saved under the local dataset path; defaults to "valid.txt".
Collaborator: What does `valid_file` mean here? It feels redundant with `dev_file` and could be misread.

Author (lugimzzz): Renamed it to `support_file` and changed all mentions of "valid data" to "support data"; "support data" also matches the positive-influence scoring method better.

|Train set (500 + 25% dirty data) |65.58|
|Train set (500 + 25% dirty data) + dirty data cleaning (50)|68.90|
|Train set (500 + 25% dirty data) + dirty data cleaning (100)|69.36|
|Train set (500 + 25% dirty data) + dirty data cleaning (150)|73.15|
Collaborator: The framing here is odd. You mean that adding dirty data hurts performance, but the user's mental model should be: the dataset already contains dirty data, and after cleaning, performance improves.

Author (lugimzzz): Revised to the more intuitive phrasing "contains N dirty examples".

```python
parser.add_argument("--dirty_threshold",
                    type=float,
                    default=0.0,  # argparse does not coerce defaults via `type`, so use a float, not "0"
                    help="The threshold to select dirty data.")
```
Collaborator: Is 0 an appropriate default here? I would also suggest making this argument required.

Author (lugimzzz): The default is 0; users can adjust it as needed.

```python
set_seed(args.seed)
paddle.set_device(args.device)
# Define model & tokenizer
if os.path.exists(args.params_path):
```
Collaborator: It would be better to also check that the params files themselves exist.

Author (lugimzzz): Changed to:

    if os.path.exists(os.path.join(args.params_path, "model_state.pdparams")) and os.path.exists(os.path.join(args.params_path, "model_config.json")) and os.path.exists(os.path.join(args.params_path, "tokenizer_config.json")):
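The quoted check can be factored into a small helper; a minimal sketch, where the helper name is an assumption and not the PR's code:

```python
import os

# Files a fine-tuned checkpoint directory is expected to contain
# (taken from the existence check quoted in the reply above).
REQUIRED_FILES = ["model_state.pdparams", "model_config.json", "tokenizer_config.json"]

def params_ready(params_path):
    """Return True only if every required checkpoint file exists."""
    return all(os.path.exists(os.path.join(params_path, f)) for f in REQUIRED_FILES)
```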

```python
    return ret_idxs, ret_scores


class LocalDataCollatorWithPadding(DataCollatorWithPadding):
```
Collaborator: What is the purpose of overriding the base class here?

Author (lugimzzz): It converts the dict batch into a list to match the model's input format.
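The dict-to-list conversion the author describes can be sketched without PaddleNLP installed; `DictCollator` below is a stand-in for `DataCollatorWithPadding`, not the real class:

```python
class DictCollator:
    """Stand-in for DataCollatorWithPadding: batches features into a dict."""
    def __call__(self, features):
        keys = features[0].keys()
        return {k: [f[k] for f in features] for k in keys}

class LocalCollator(DictCollator):
    """Mimics the idea behind LocalDataCollatorWithPadding: turn the dict
    batch into a positional list so the model's forward(...) can unpack it."""
    def __call__(self, features):
        batch = super().__call__(features)
        return [batch[k] for k in batch]
```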

```python
    train_data_loader,
    classifier_layer_name="classifier")
# Feature similarity analysis & select sparse data
analysis_result = []
```
Collaborator: Please explain what result FeatureSimilarityModel actually produces here.

Author (lugimzzz): It builds a similarity model over the training-set feature representations.
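One way to picture what a feature-similarity model computes; this is a NumPy sketch of the general idea, not TrustAI's actual `FeatureSimilarityModel` API:

```python
import numpy as np

def max_train_similarity(train_feats, query_feats):
    """For each query example, return the cosine similarity of its closest
    training-set feature vector. Low values suggest 'sparse' examples with
    little supporting evidence in the training data."""
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    return (q @ t.T).max(axis=1)  # best match per query example
```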

@ZHUI self-requested a review on August 17, 2022 06:04

@ZHUI (Collaborator) left a comment:

LGTM


The annotation quality of training data has a large impact on model performance, but because of annotator skill, task difficulty, and other factors, a certain proportion of poorly annotated examples (dirty data) is always present. When the dataset is large, checking annotations becomes a hard problem. In this project, dirty data cleaning is built on the TrustAI (trustworthy AI) toolkit: using instance-level evidence analysis based on the representer point method, it computes each training example's influence score on the model. Examples with high scores have a large influence on the model and are likely to be dirty data (mislabeled examples). See [TrustAI](https://github.com/PaddlePaddle/TrustAI) and [instance-level evidence analysis](https://github.com/PaddlePaddle/TrustAI/blob/main/trustai/interpretation/example_level/README.md) for details.

**Install TrustAI**

Collaborator: This "Install TrustAI" section appears many times in the docs.

Author (lugimzzz): Only sparse data identification and dirty data cleaning depend on TrustAI, so the install note is placed right before those two sections. Using model evaluation or data augmentation alone does not require TrustAI.

* `dirty_file`: file name for saving dirty data; defaults to "train_dirty.txt".
* `rest_file`: file name for saving the remaining (non-dirty) data; defaults to "train_dirty_rest.txt".
* `train_file`: training set file name in the local dataset; defaults to "train.txt".
* `dirty_threshold`: threshold for selecting dirty data for re-annotation; only examples whose influence score exceeds the threshold are selected; defaults to 0.
Collaborator: This looks like a raw score, not a ratio. How is "dirty data cleaning (100)" done; does the user manually remove the top-100 by score?

Author (lugimzzz): Yes. With the default threshold of 0, the user simply takes the top-100 by score. If a threshold is set, a dirty example must both score above the threshold and rank in the top 100.
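The selection rule the author describes (top-k by influence score, gated by an optional threshold) can be sketched as follows; the function name is an assumption, not the PR's API:

```python
def select_dirty_candidates(scores, top_k=100, threshold=0.0):
    """Rank examples by influence score (descending) and keep the indices
    that are both in the top-k and strictly above the threshold."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:top_k] if scores[i] > threshold]
```

With `threshold=0.0` this reduces to plain top-k selection, matching the default behavior described above.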

@@ -204,8 +216,11 @@ python doccano.py \
- ``is_shuffle``: whether to shuffle the dataset; defaults to True.
- ``seed``: random seed; defaults to 1000.
- ``separator``: separator between labels at different levels; only effective for hierarchical text classification; defaults to "##".
- ``valid``: whether this is support (valid) annotation data for sparse data identification; defaults to False.
- ``dirty``: whether this is annotation data for the dirty data cleaning strategy; defaults to False.
Collaborator: As discussed, `valid` should not require user involvement; augmentation can be automatic. `dirty` does require user judgment; should that be explained in more detail here?

|"ernie-3.0-mini-zh" |6-layer, 384-hidden, 12-heads|61.70|42.30| 0.83|
|"ernie-3.0-micro-zh" | 4-layer, 384-hidden, 12-heads|61.14|42.97| 0.60|
|"ernie-3.0-nano-zh" |4-layer, 312-hidden, 12-heads|60.15|32.62|0.25|
|"ernie-3.0-xbase-zh" |20-layer, 1024-hidden, 12-heads|95.12|92.77| 12.51 |
Collaborator: xbase performs worse than base.

Collaborator: Maybe just remove it.

Comment on lines +15 to +18

Suggested change (regroup the imports):

    import os
    import argparse
    import paddle
    from paddlenlp.dataaug import WordSubstitute, WordInsert, WordDelete, WordSwap

Comment on lines +20 to +22

Suggested change (remove the blank line between third-party imports):

    from sklearn.metrics import accuracy_score, classification_report, f1_score
    import paddle

Collaborator: paddle and sklearn both count as third-party libraries, so no blank line is needed between them.

    import argparse

    import numpy as np

Collaborator: Same as before.

@lugimzzz lugimzzz merged commit f1634e5 into PaddlePaddle:develop Aug 22, 2022
@lugimzzz lugimzzz deleted the analysis branch September 19, 2022 04:52