add analysis module in text classification application #3011
Conversation
**Install TrustAI**

```shell
pip install trustai
```
Can the version be pinned down here?
Pinned to `pip install trustai==0.1.4`.
## Introduction to the analysis module

The analysis module provides a **model evaluation** script that evaluates each class separately, helping developers analyze model performance. Building on the TrustAI toolkit and the data augmentation API, it also provides three optimization strategies, **sparse data selection, dirty data cleaning, and data augmentation**, to help improve model performance.
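As a small illustration of the per-class evaluation idea (the labels and predictions below are made up; this is not the module's actual script), sklearn's `classification_report` gives exactly this kind of per-class breakdown:

```python
# Toy per-class evaluation: precision/recall/F1 reported separately for
# each label, the same kind of breakdown a per-class evaluation script gives.
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 1, 2, 1, 0]
print(classification_report(y_true, y_pred, digits=4))
```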
The analysis module seems to lack a concept diagram: what exactly is analyzed, and what optimization strategies does it produce?
A concept diagram has been added.
Added.
    --params_path "../checkpoint/" \
    --batch_size 16 \
    --sparse_num 100 \
    --valid_num 100
Since the recommendation is 10~20%, using a fixed value here seems inconsistent; would a ratio be more appropriate?
Because the experiments in the results section all use 100 sparse samples and 100 support samples, changing this here would mean rerunning those experiments. My preference is to keep the fixed-value form for now.
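For reference, the ratio-based variant discussed above would be a thin wrapper; `ratio_to_count` and `sparse_ratio` are hypothetical names, not existing flags in the script:

```python
# Hypothetical sketch: turn a recommended ratio (e.g. 10~20% of the
# training set) into the fixed sample count the current scripts expect.
def ratio_to_count(dataset_size, sparse_ratio=0.1):
    return max(1, int(dataset_size * sparse_ratio))  # keep at least one sample

ratio_to_count(1000, 0.1)  # -> 100
```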
* `dev_file`: filename of the dev set in the local dataset; defaults to "dev.txt".
* `label_file`: filename of the label set in the local dataset; defaults to "label.txt".
* `sparse_file`: filename for the sparse data saved under the local dataset path; defaults to "sparse.txt".
* `valid_file`: filename for training data with a valid positive influence, saved under the local dataset path; defaults to "valid.txt".
What does `valid_file` mean here? It feels like it overlaps with `dev_file`, which is confusing.
Renamed it to `support_file` and changed all "valid data" wording to "support data"; "support data" also fits the way positive influence scores are computed better.
|Train set (500 + 25% dirty data)|65.58|
|Train set (500 + 25% dirty data) + dirty data cleaning (50)|68.90|
|Train set (500 + 25% dirty data) + dirty data cleaning (100)|69.36|
|Train set (500 + 25% dirty data) + dirty data cleaning (150)|73.15|
The logic here reads oddly. You mean that adding dirty data degrades performance, but from the user's point of view the story should be: the dataset contains dirty data, and after cleaning it, performance improves.
Updated to phrase it as "contains N dirty samples", which is more intuitive.
parser.add_argument("--dirty_threshold",
                    type=float,
                    default=0.0,
                    help="The threshold to select dirty data.")
Is a default of 0 appropriate here? Also, consider making this a required argument.
It defaults to 0; users can adjust it to their needs.
set_seed(args.seed)
paddle.set_device(args.device)
# Define model & tokenizer
if os.path.exists(args.params_path):
It would be better to also check that the params files themselves exist.
Changed to `if os.path.exists(os.path.join(args.params_path, "model_state.pdparams")) and os.path.exists(os.path.join(args.params_path, "model_config.json")) and os.path.exists(os.path.join(args.params_path, "tokenizer_config.json")):`
return ret_idxs, ret_scores


class LocalDataCollatorWithPadding(DataCollatorWithPadding):
What is the purpose of overriding the base class here?
It converts the dict into a list to match the model's input format.
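To illustrate the dict-to-list adaptation described in the reply, here is a self-contained toy; `DictCollator` is a stand-in for paddlenlp's `DataCollatorWithPadding`, not the real class:

```python
# Toy stand-in for a collator that returns a dict of padded fields.
class DictCollator:
    def __call__(self, features):
        max_len = max(len(f["input_ids"]) for f in features)
        return {"input_ids": [f["input_ids"] + [0] * (max_len - len(f["input_ids"]))
                              for f in features]}

# The override unpacks the dict into a positional list, matching models
# that take their inputs positionally rather than as keyword arguments.
class ListCollator(DictCollator):
    def __call__(self, features):
        batch = super().__call__(features)
        return [batch[key] for key in batch]

ListCollator()([{"input_ids": [1, 2]}, {"input_ids": [3]}])
# -> [[[1, 2], [3, 0]]]
```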
applications/text_classification/multi_class/analysis/sparse.py (outdated, resolved)
    train_data_loader,
    classifier_layer_name="classifier")
# Feature similarity analysis & select sparse data
analysis_result = []
Could you explain what result `FeatureSimilarityModel` actually produces here?
It produces a similarity model built from the training-set features.
LGTM
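A rough sketch of what a feature-similarity score looks like, using assumed 2-D toy features (TrustAI's `FeatureSimilarityModel` is more involved than this): score a query example by its cosine similarity to the nearest training example.

```python
import numpy as np

# Cosine similarity of a query feature vector to its nearest neighbor in
# the training features; low scores flag "sparse" (poorly covered) data.
def max_similarity(train_feats, query_feat):
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    query = query_feat / np.linalg.norm(query_feat)
    return float(np.max(train @ query))

train_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
max_similarity(train_feats, np.array([1.0, 0.0]))  # -> 1.0
```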
The quality of training-data annotation has a large impact on model performance, but due to annotator skill, task difficulty, and other factors, training data always contains some proportion of poorly annotated samples (dirty data). When the annotated dataset is large, checking annotation quality becomes a real problem. In this project, dirty data cleaning builds on the TrustAI toolkit: using an instance-level evidence analysis method based on representer points, it computes each training sample's influence score on the model. Samples with high scores influence the model strongly and have a high probability of being dirty data (mislabeled samples). For more details, see [TrustAI](https://github.com/PaddlePaddle/TrustAI) and [instance-level evidence analysis](https://github.com/PaddlePaddle/TrustAI/blob/main/trustai/interpretation/example_level/README.md).

**Install TrustAI**
This "Install TrustAI" section appears many times in the document.
Only sparse data selection and dirty data cleaning use TrustAI; the other features do not. That is why the TrustAI installation note appears before those two sections; users who only run model evaluation or data augmentation do not need to install TrustAI.
* `dirty_file`: filename for saving dirty data; defaults to "train_dirty.txt".
* `rest_file`: filename for saving the remaining (non-dirty) data; defaults to "train_dirty_rest.txt".
* `train_file`: filename of the training set in the local dataset; defaults to "train.txt".
* `dirty_threshold`: threshold for selecting dirty data for re-annotation; only samples with an influence score above the threshold are selected. Defaults to 0.
This looks like a score, not a ratio. How does "dirty data cleaning (100)" work then? Does the user manually remove the top-100 scored samples?
Right. With the default of 0, the top-100 scored samples are selected. If a threshold is set, a dirty sample must both score above the threshold and rank in the top 100.
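The selection rule described above (score above the threshold and within the top N) can be sketched as follows; `select_dirty` and its arguments are illustrative names, not the script's actual API:

```python
# Keep at most dirty_num samples, ranked by influence score, and only
# those whose score exceeds dirty_threshold (default 0, as in the script).
def select_dirty(scores, dirty_threshold=0.0, dirty_num=100):
    ranked = sorted(enumerate(scores), key=lambda kv: kv[1], reverse=True)
    return [idx for idx, score in ranked[:dirty_num] if score > dirty_threshold]

select_dirty([0.9, 0.1, 0.5, -0.2], dirty_threshold=0.2, dirty_num=2)
# -> [0, 2]
```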
@@ -204,8 +216,11 @@ python doccano.py \
- ``is_shuffle``: whether to shuffle the dataset randomly; defaults to True.
- ``seed``: random seed; defaults to 1000.
- ``separator``: separator between labels of different levels; only effective for hierarchical text classification tasks. Defaults to "##".
- ``valid``: whether the data is support data annotated for sparse data selection; defaults to False.
- ``dirty``: whether the data is annotated for the dirty data cleaning strategy; defaults to False.
As discussed, `valid` should not require user intervention and can be augmented automatically, but `dirty` needs user judgment. Should that be explained in more detail here?
|"ernie-3.0-mini-zh" |6-layer, 384-hidden, 12-heads|61.70|42.30|0.83|
|"ernie-3.0-micro-zh" |4-layer, 384-hidden, 12-heads|61.14|42.97|0.60|
|"ernie-3.0-nano-zh" |4-layer, 312-hidden, 12-heads|60.15|32.62|0.25|
|"ernie-3.0-xbase-zh" |20-layer, 1024-hidden, 12-heads|95.12|92.77|12.51|
xbase performs worse than base here.
Let's just remove it then.
import os
import argparse
import paddle
from paddlenlp.dataaug import WordSubstitute, WordInsert, WordDelete, WordSwap
Suggested change:

    import os
    import argparse
    import paddle
    from paddlenlp.dataaug import WordSubstitute, WordInsert, WordDelete, WordSwap
from sklearn.metrics import accuracy_score, classification_report, f1_score

import paddle
Suggested change:

    from sklearn.metrics import accuracy_score, classification_report, f1_score
    import paddle

paddle and sklearn both count as third-party libraries, so no blank line is needed between them.
import argparse

import numpy as np
Same as before.
PR types
Others
PR changes
Others
Description
Add an analysis module to the text classification application.