Add data distillation for UIE #3136

linjieccc · 2022-08-24T14:25:58Z

PR types

New features

PR changes

APIs

Description

Add an example of training a closed-domain information extraction model using UIE and data distillation.

wawltor · 2022-08-25T00:34:15Z

model_zoo/uie/data_distill/README.md

@@ -0,0 +1,31 @@
+# UIE数据蒸馏


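"UIE data distillation" needs to be explained here. The document as a whole is also too brief and needs work: the concept, the data sources, and deployment are all insufficiently covered.

Added the concepts and data sources for UIE data distillation. Concrete examples and deployment instructions are still to be added.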

wawltor · 2022-08-25T00:40:03Z

model_zoo/uie/data_distill/data_generate.py

+    # yapf: enable
+
+    # Define your schema here
+    schema = ["观点词", {"评价维度": ["观点词", "情感倾向[正向,负向]"]}]


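Users need to define their own schema here. See whether there is a way to convert a string into Python code,
so that the schema can be passed in as an argument.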
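One possible way to do this, as a minimal sketch: accept the schema as a string on the command line and parse it with `ast.literal_eval`, which evaluates Python literals only and does not run arbitrary code. The `--schema` flag name and the `parse_schema` helper are assumptions made for illustration; they are not taken from the PR.

```python
import argparse
import ast


def parse_schema(text):
    """Parse a schema written as a Python literal (hypothetical helper)."""
    try:
        # literal_eval accepts only literals (lists, dicts, strings, ...).
        schema = ast.literal_eval(text)
    except (ValueError, SyntaxError) as err:
        raise argparse.ArgumentTypeError(f"invalid schema literal: {err}")
    if not isinstance(schema, (list, dict, str)):
        raise argparse.ArgumentTypeError("schema must be a list, dict, or str")
    return schema


parser = argparse.ArgumentParser()
# `--schema` is an assumed flag name for illustration.
parser.add_argument("--schema", type=parse_schema, required=True)

# Simulate a command line that passes the schema from the diff as a string.
args = parser.parse_args(
    ["--schema", '["观点词", {"评价维度": ["观点词", "情感倾向[正向,负向]"]}]'])
print(args.schema)
```

Using `literal_eval` rather than `eval` keeps the flag safe to expose to users, at the cost of only supporting plain literals.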

wawltor · 2022-08-25T00:51:25Z

model_zoo/uie/data_distill/data_generate.py

+                    relation2id[child.name] = len(relation2id)
+                schema_list.append(child)
+
+        entity2id['OBJECT'] = len(entity2id)


Why is the key uppercase here?

wawltor · 2022-08-25T00:56:58Z

model_zoo/uie/data_distill/data_generate.py

+    for text in tqdm(infer_texts, desc="Predicting: ", leave=False):
+        infer_results.append(uie(text))
+
+    train_synthetic_lines = synthetic2distill(texts, infer_results,


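I don't quite follow the logic here. Do the results for the unlabeled data come from the Taskflow output?

Yes. The unlabeled data is run through a Taskflow that loads the custom model, and the inference results become the synthetic data.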

wawltor · 2022-08-25T00:59:02Z

model_zoo/uie/data_distill/data_generate.py

+    return schema_tree
+
+
+def schema2label_maps(task_type, schema=None):


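When parsing the schema, it should be stated explicitly that sentiment classification is outside the scope of this data distillation. The code could print a warning or raise an error.

This should also be called out explicitly in the documentation.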
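A rough sketch of such a check, under one stated assumption: classification prompts are recognizable because they end with a bracketed option list, as in `"情感倾向[正向,负向]"` from the schema in `data_generate.py`. Both function names are hypothetical and the PR may detect this case differently.

```python
import re

# Assumption: a classification prompt ends with "[option,option,...]".
_CLS_PATTERN = re.compile(r"\[[^\[\]]+\]$")


def find_classification_prompts(schema):
    """Return every schema entry that looks like a classification prompt."""
    found = []
    if isinstance(schema, str):
        if _CLS_PATTERN.search(schema):
            found.append(schema)
    elif isinstance(schema, dict):
        for key, value in schema.items():
            found.extend(find_classification_prompts(key))
            found.extend(find_classification_prompts(value))
    elif isinstance(schema, (list, tuple)):
        for item in schema:
            found.extend(find_classification_prompts(item))
    return found


def check_schema(schema):
    """Raise if the schema contains entries outside the distillation scope."""
    bad = find_classification_prompts(schema)
    if bad:
        raise ValueError(
            "Sentiment classification is not covered by data distillation: "
            + ", ".join(bad))
```

Raising early, before any inference on unlabeled data, would surface the problem before compute is spent on it.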

wawltor · 2022-08-30T02:23:32Z

model_zoo/uie/data_distill/README.md

+
+#### UIE数据蒸馏三步
+
+- **Step 1**: 使用UIE模型对标注数据进行fine-tune，得到Teacher Model。


fine-tune -> finetune. Alternatively, check across the project which English term is used for 微调 (fine-tuning) and apply it consistently.

wawltor · 2022-08-30T02:26:44Z

model_zoo/uie/data_distill/README.md

@@ -0,0 +1,50 @@
+# UIE Slim 数据蒸馏
+
+在UIE强大的抽取能力背后，是需要同样强大的算力才能支撑起如此大规模模型的训练和预测。很多工业应用场景对性能要求较高，若不能有效压缩则无法实际应用。因此，我们基于数据蒸馏技术构建了UIE Slim数据蒸馏系统。其原理是通过数据作为桥梁，将UIE模型的知识迁移到小模型，以达到精度损失较小的情况下却能达到大幅度预测速度提升的效果。


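Suggested rewording: "在UIE强大的抽取能力背后，是需要同样强大的算力才能支撑起如此大规模模型的训练和预测" (behind UIE's powerful extraction capability, equally powerful compute is needed to support training and inference of such a large model) -> "在UIE强大的抽取能力背后，同样需要较大的算力支持计算" (UIE's powerful extraction capability likewise requires substantial compute).

Suggested rewording: "很多工业应用场景对性能要求较高" (many industrial scenarios have high performance requirements) -> "在一些工业应用场景中对性能的要求较高" (in some industrial scenarios, performance requirements are high).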

wawltor · 2022-08-30T02:30:17Z

model_zoo/uie/data_distill/criterion.py

+import paddle.nn as nn
+
+
+class Criterion(nn.Layer):


The docstrings here need to be in English.

wawltor · 2022-08-30T02:30:24Z

model_zoo/uie/data_distill/criterion.py

+                                                    y_true,
+                                                    y_pred,
+                                                    mask_zero=False):
+        """稀疏版多标签分类的交叉熵


wawltor · 2022-08-30T02:38:02Z

model_zoo/uie/data_distill/criterion.py

+            y_pred = paddle.concat([-infs, y_pred[..., 1:]], axis=-1)
+            y_pos_2 = paddle.take_along_axis(y_pred, y_true, axis=-1)
+
+        pos_loss = (-y_pos_1).exp().sum(axis=-1).log()


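Paddle appears to have a logsumexp API; you could try using it here.

When I profiled this earlier, logsumexp was extremely slow on 4-D Tensor inputs. I'll put together a minimal reproduction and report it to the framework team.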
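For context, the line under review computes log Σ exp(−y), which is the quantity a logsumexp reduction over −y returns. The sketch below is plain Python rather than Paddle code, and only illustrates that identity together with the usual max-subtraction trick for numerical stability; the function names are made up for this example.

```python
import math


def naive_log_sum_exp(values):
    """log(sum(exp(v))), written the same way as the loss line in the diff."""
    return math.log(sum(math.exp(v) for v in values))


def stable_log_sum_exp(values):
    """Same quantity, with the maximum factored out to avoid overflow."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


scores = [-1.5, 0.3, 2.0]
neg = [-s for s in scores]  # the diff applies exp() to the negated scores

# Both forms agree on moderate inputs.
print(math.isclose(naive_log_sum_exp(neg), stable_log_sum_exp(neg)))
```

The two forms differ only on extreme inputs, where the naive version overflows, so a performance-motivated choice of the explicit form trades away that stability.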

wawltor · 2022-08-30T02:46:25Z

model_zoo/uie/data_distill/train.py

+
+
+def do_train():
+    paddle.disable_static()


Why is disable_static called both here and in the metric code? It looks like static mode is already disabled by default.

wawltor · 2022-08-30T02:46:56Z

model_zoo/uie/data_distill/train.py

+
+    train_ds = load_dataset(reader, data_path=args.train_path, lazy=False)
+    dev_ds = load_dataset(reader, data_path=args.dev_path, lazy=False)
+    tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh")


I suggest not hard-coding model_name; let the user choose it.
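A minimal sketch of what that could look like, assuming a command-line flag named `--encoder` (a hypothetical name, not confirmed by the PR) whose default is the value currently hard-coded in the diff:

```python
import argparse

parser = argparse.ArgumentParser()
# `--encoder` is an assumed flag name; the default mirrors the value in the diff.
parser.add_argument("--encoder", default="ernie-3.0-base-zh", type=str,
                    help="Name of the pretrained encoder to load.")

args = parser.parse_args([])  # no flags given, so the default is used
print(args.encoder)

# The parsed value would then replace the string literal, for example:
# tokenizer = AutoTokenizer.from_pretrained(args.encoder)
```

Keeping the current model as the default preserves existing behavior while letting users swap in a smaller or larger encoder.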

wawltor · 2022-08-30T02:48:02Z

model_zoo/uie/data_distill/train.py

+                                       task_type=args.task_type)
+
+    encoder = AutoModel.from_pretrained("ernie-3.0-base-zh")
+    if args.task_type == "entity_extraction":


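What I don't quite understand: entity extraction alone takes this branch, but would a task that combines entity extraction and relation extraction also work?

For joint entity and relation extraction, set the task type to relation_extraction.

What I don't quite understand: entity extraction alone takes this branch, but would a task that combines entity extraction and relation extraction also work?

Should event extraction also use task_type "relation_extraction"?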

wawltor · 2022-08-30T02:49:48Z

model_zoo/uie/data_distill/train.py

+    # All bias and LayerNorm parameters are excluded.
+    decay_params = [
+        p.name for n, p in model.named_parameters()
+        if not any(nd in n for nd in ["bias", "norm", "LayerNorm.weight"])


There is an extra LayerNorm.weight here. In past experience "norm" already covers it. Was a LayerNorm added to models such as GPLinkerForRelationExtraction?

Removed. GP does not add a LayerNorm here.
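To make the name filter concrete, here is a small sketch of the same substring test applied to a few illustrative parameter names. The names are assumptions chosen for the example, not read from the actual model. The test is case-sensitive, so `"norm"` matches names such as `layer_norm` or `norm1` but not a name spelled `LayerNorm`.

```python
# Keys whose presence in a parameter name excludes it from weight decay.
NO_DECAY_KEYS = ["bias", "norm"]


def uses_weight_decay(param_name):
    """True when no excluded key occurs in the name (case-sensitive match)."""
    return not any(key in param_name for key in NO_DECAY_KEYS)


# Illustrative names only; real names depend on the model implementation.
names = [
    "encoder.layers.0.linear1.weight",
    "encoder.layers.0.linear1.bias",
    "encoder.layers.0.norm1.weight",
    "embeddings.layer_norm.weight",
]
decay_names = [n for n in names if uses_weight_decay(n)]
print(decay_names)
```

Under these assumed names only the linear weight keeps weight decay, which matches the intent stated in the code comment in the diff.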

wawltor · 2022-08-30T02:51:11Z

model_zoo/uie/data_distill/evaluate.py

+    label_maps = get_label_maps(args.task_type, args.label_maps_path)
+
+    tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh")
+    encoder = AutoModel.from_pretrained("ernie-3.0-base-zh")


wawltor · 2022-08-30T02:51:24Z

model_zoo/uie/data_distill/evaluate.py

+    else:
+        model = GPLinkerForRelationExtraction(encoder, label_maps)
+
+    state_dict = paddle.load(


Need to check here whether the file exists.

wawltor · 2022-08-30T03:12:44Z

paddlenlp/layers/gp.py

+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software


gp.py -> globalpoint.py

wawltor

LGTM

linjieccc added 2 commits August 24, 2022 14:16

Add data distill for UIE

7ee1828

Update README.md and remove unsed code

1504d41

linjieccc requested a review from wawltor August 24, 2022 14:26

linjieccc self-assigned this Aug 24, 2022

linjieccc added the ie (Issues related to Information Extraction) and taskflow (Taskflow) labels Aug 24, 2022

wawltor reviewed Aug 25, 2022

Update README.md and doccano process

c0f91c9

wawltor reviewed Aug 30, 2022

linjieccc added 6 commits August 30, 2022 13:23

Update data distillation example

66cacee

fix entity extraction

f9dbd5c

update

77a7974

update

d83a9a8

add data_distill.py

c2b6a26

Update README.md

0ff0ca2

wawltor approved these changes Sep 5, 2022

wawltor merged commit 5401f01 into PaddlePaddle:develop Sep 5, 2022

linjieccc deleted the distill_uie branch September 5, 2022 07:35

linjieccc mentioned this pull request Sep 5, 2022

PaddleNLP 2.4.0 Release Note Candidate #3190

Closed

Alone749-i mentioned this pull request Sep 23, 2022

Question about the label_maps["label2id"] parameter of paddlenlp.layers.GPLinkerForEventExtraction, and a question about UIE distillation. #3354

Closed
