
add analysis module in text classification application #3011

Merged: 8 commits, merged into develop on Aug 22, 2022

Conversation

@lugimzzz (Contributor) commented on Aug 10, 2022

PR types

Others

PR changes

Others

Description

Add an analysis module to the text classification application:

  • Add an evaluation script
  • Add three optimization schemes: sparse data identification, dirty data cleaning, and data augmentation
  • Add documentation for the analysis module
  • Adapt the doccano annotation code and docs for dirty-data and sparse-data annotation
  • The original dataset's annotation quality was low; to better validate the analysis module's schemes, the hierarchical classification dataset is replaced with the 2020 Language and Intelligence Challenge event extraction dataset


**Install TrustAI**
```shell
pip install trustai
```
Collaborator: Can the version be pinned down here?

Author (lugimzzz): Pinned: `pip install trustai==0.1.4`.


## Introduction to the analysis module

The analysis module provides a **model evaluation** script that evaluates each class separately, helping developers analyze model performance. Building on the trustworthy-AI toolkit (TrustAI) and the data augmentation API, it also provides three optimization schemes, **sparse data identification, dirty data cleaning, and data augmentation**, to help improve model performance.
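The per-class evaluation described above can be illustrated with scikit-learn, which the PR's scripts already import; the labels and class names below are invented for the example, and the PR's own evaluation script may differ in detail:

```python
# Illustrative sketch of per-class evaluation (labels/names invented for
# the example; not the PR's actual evaluation script).
from sklearn.metrics import classification_report

y_true = [0, 1, 1, 2, 2, 2]  # gold labels
y_pred = [0, 1, 2, 2, 2, 1]  # model predictions

# Reports precision/recall/F1 per class, plus macro and weighted averages,
# so weak classes can be identified individually.
print(classification_report(y_true, y_pred, target_names=["news", "sports", "finance"]))
```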
Collaborator: The analysis module seems to lack a concept diagram: what exactly is analyzed, and what strategies does the analysis lead to?

Author (lugimzzz): Added a concept diagram.

Author (lugimzzz): Added.

```shell
    --params_path "../checkpoint/" \
    --batch_size 16 \
    --sparse_num 100 \
    --valid_num 100
```
Collaborator: Since the recommendation is 10-20%, a fixed value here seems inconsistent; wouldn't a ratio be more appropriate?

Author (lugimzzz): The scheme-effectiveness experiments were all run with 100 sparse examples and 100 valid examples; changing this to a ratio would mean redoing those experiments. My suggestion is to keep the fixed-value form for now.

* `dev_file`: dev set file name in the local dataset; defaults to "dev.txt".
* `label_file`: label set file name in the local dataset; defaults to "label.txt".
* `sparse_file`: file name for sparse data saved under the local dataset path; defaults to "sparse.txt".
* `valid_file`: file name for positively influential (valid) training data saved under the local dataset path; defaults to "valid.txt".
Collaborator: What does `valid_file` mean here? It feels redundant with `dev_file` and could be misread.

Author (lugimzzz): Renamed it to `support_file` and changed all mentions of "valid data" to "support data"; "support data" also matches the positive-influence scoring method better.

|Train set (500 + 25% dirty data) |65.58|
|Train set (500 + 25% dirty data) + dirty data cleaning (50)|68.90|
|Train set (500 + 25% dirty data) + dirty data cleaning (100)|69.36|
|Train set (500 + 25% dirty data) + dirty data cleaning (150)|73.15|
Collaborator: The framing here is odd. You mean that adding dirty data hurts performance, but the user's mental model should be: the dataset already contains dirty data, and after cleaning, performance improves.

Author (lugimzzz): Revised to the more intuitive phrasing "contains N dirty examples".

```python
parser.add_argument("--dirty_threshold",
                    type=float,
                    default=0.0,  # argparse does not coerce defaults via `type`, so use a float, not "0"
                    help="The threshold to select dirty data.")
```
Collaborator: Is 0 an appropriate default here? I would also suggest making this argument required.

Author (lugimzzz): The default is 0; users can adjust it as needed.

```python
set_seed(args.seed)
paddle.set_device(args.device)
# Define model & tokenizer
if os.path.exists(args.params_path):
```
Collaborator: It would be better to also check that the params files themselves exist.

Author (lugimzzz): Changed to:

    if os.path.exists(os.path.join(args.params_path, "model_state.pdparams")) and os.path.exists(os.path.join(args.params_path, "model_config.json")) and os.path.exists(os.path.join(args.params_path, "tokenizer_config.json")):
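The quoted check can be factored into a small helper; a minimal sketch, where the helper name is an assumption and not the PR's code:

```python
import os

# Files a fine-tuned checkpoint directory is expected to contain
# (taken from the existence check quoted in the reply above).
REQUIRED_FILES = ["model_state.pdparams", "model_config.json", "tokenizer_config.json"]

def params_ready(params_path):
    """Return True only if every required checkpoint file exists."""
    return all(os.path.exists(os.path.join(params_path, f)) for f in REQUIRED_FILES)
```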

```python
    return ret_idxs, ret_scores


class LocalDataCollatorWithPadding(DataCollatorWithPadding):
```
Collaborator: What is the purpose of overriding the base class here?

Author (lugimzzz): It converts the dict batch into a list to match the model's input format.
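The dict-to-list conversion the author describes can be sketched without PaddleNLP installed; `DictCollator` below is a stand-in for `DataCollatorWithPadding`, not the real class:

```python
class DictCollator:
    """Stand-in for DataCollatorWithPadding: batches features into a dict."""
    def __call__(self, features):
        keys = features[0].keys()
        return {k: [f[k] for f in features] for k in keys}

class LocalCollator(DictCollator):
    """Mimics the idea behind LocalDataCollatorWithPadding: turn the dict
    batch into a positional list so the model's forward(...) can unpack it."""
    def __call__(self, features):
        batch = super().__call__(features)
        return [batch[k] for k in batch]
```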

```python
    train_data_loader,
    classifier_layer_name="classifier")
# Feature similarity analysis & select sparse data
analysis_result = []
```
Collaborator: Please explain what result FeatureSimilarityModel actually produces here.

Author (lugimzzz): It builds a similarity model over the training-set feature representations.
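One way to picture what a feature-similarity model computes; this is a NumPy sketch of the general idea, not TrustAI's actual `FeatureSimilarityModel` API:

```python
import numpy as np

def max_train_similarity(train_feats, query_feats):
    """For each query example, return the cosine similarity of its closest
    training-set feature vector. Low values suggest 'sparse' examples with
    little supporting evidence in the training data."""
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    return (q @ t.T).max(axis=1)  # best match per query example
```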

@ZHUI self-requested a review on August 17, 2022 06:04

@ZHUI (Collaborator) left a comment:

LGTM


The annotation quality of training data has a large impact on model performance, but because of annotator skill, task difficulty, and other factors, a certain proportion of poorly annotated examples (dirty data) is always present. When the dataset is large, checking annotations becomes a hard problem. In this project, dirty data cleaning is built on the TrustAI (trustworthy AI) toolkit: using instance-level evidence analysis based on the representer point method, it computes each training example's influence score on the model. Examples with high scores have a large influence on the model and are likely to be dirty data (mislabeled examples). See [TrustAI](https://github.com/PaddlePaddle/TrustAI) and [instance-level evidence analysis](https://github.com/PaddlePaddle/TrustAI/blob/main/trustai/interpretation/example_level/README.md) for details.

**Install TrustAI**

Collaborator: This "Install TrustAI" section appears many times in the docs.

Author (lugimzzz): Only sparse data identification and dirty data cleaning depend on TrustAI, so the install note is placed right before those two sections. Using model evaluation or data augmentation alone does not require TrustAI.

* `dirty_file`: file name for saving dirty data; defaults to "train_dirty.txt".
* `rest_file`: file name for saving the remaining (non-dirty) data; defaults to "train_dirty_rest.txt".
* `train_file`: training set file name in the local dataset; defaults to "train.txt".
* `dirty_threshold`: threshold for selecting dirty data for re-annotation; only examples whose influence score exceeds the threshold are selected; defaults to 0.
Collaborator: This looks like a raw score, not a ratio. How is "dirty data cleaning (100)" done; does the user manually remove the top-100 by score?

Author (lugimzzz): Yes. With the default threshold of 0, the user simply takes the top-100 by score. If a threshold is set, a dirty example must both score above the threshold and rank in the top 100.
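The selection rule the author describes (top-k by influence score, gated by an optional threshold) can be sketched as follows; the function name is an assumption, not the PR's API:

```python
def select_dirty_candidates(scores, top_k=100, threshold=0.0):
    """Rank examples by influence score (descending) and keep the indices
    that are both in the top-k and strictly above the threshold."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:top_k] if scores[i] > threshold]
```

With `threshold=0.0` this reduces to plain top-k selection, matching the default behavior described above.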

@@ -204,8 +216,11 @@ python doccano.py \
- ``is_shuffle``: whether to shuffle the dataset; defaults to True.
- ``seed``: random seed; defaults to 1000.
- ``separator``: separator between labels at different levels; only effective for hierarchical text classification; defaults to "##".
- ``valid``: whether this is support (valid) annotation data for sparse data identification; defaults to False.
- ``dirty``: whether this is annotation data for the dirty data cleaning strategy; defaults to False.
Collaborator: As discussed, `valid` should not require user involvement; augmentation can be automatic. `dirty` does require user judgment; should that be explained in more detail here?

|"ernie-3.0-mini-zh" |6-layer, 384-hidden, 12-heads|61.70|42.30| 0.83|
|"ernie-3.0-micro-zh" | 4-layer, 384-hidden, 12-heads|61.14|42.97| 0.60|
|"ernie-3.0-nano-zh" |4-layer, 312-hidden, 12-heads|60.15|32.62|0.25|
|"ernie-3.0-xbase-zh" |20-layer, 1024-hidden, 12-heads|95.12|92.77| 12.51 |
Collaborator: xbase performs worse than base.

Collaborator: Maybe just remove it.

Comment on lines +15 to +18

Suggested change (regroup the imports):

    import os
    import argparse
    import paddle
    from paddlenlp.dataaug import WordSubstitute, WordInsert, WordDelete, WordSwap

Comment on lines +20 to +22

Suggested change (remove the blank line between third-party imports):

    from sklearn.metrics import accuracy_score, classification_report, f1_score
    import paddle

Collaborator: paddle and sklearn both count as third-party libraries, so no blank line is needed between them.

    import argparse

    import numpy as np

Collaborator: Same as before.

@lugimzzz lugimzzz merged commit f1634e5 into PaddlePaddle:develop Aug 22, 2022
@lugimzzz lugimzzz deleted the analysis branch September 19, 2022 04:52