Add Retrieval based multi class classification #3180

Merged
merged 4 commits into from
Sep 5, 2022

w5688414
Contributor

@w5688414 w5688414 commented Sep 1, 2022

PR types

  • New features

PR changes

  • APIs

Description

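The retrieval-based classification system uses a client-server architecture: the model that extracts the vectors is deployed on the server, and a client is then started to query it.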

python run_system.py

The test case built into the code is:

list_data = [{"sentence": "谈谈去西安旅游,哪些地方让你觉得不虚此行?"}]

It prints output like the following:

......
PipelineClient::predict pack_data time:1658988661.507715
PipelineClient::predict before time:1658988661.5081818
Extract feature time to cost :0.02322244644165039 seconds
Search milvus time cost is 0.06801486015319824 seconds
{'sentence': '谈谈去西安旅游,哪些地方让你觉得不虚此行?'} 旅行 0.3969537019729614
{'sentence': '谈谈去西安旅游,哪些地方让你觉得不虚此行?'} 西安 0.7750667333602905
{'sentence': '谈谈去西安旅游,哪些地方让你觉得不虚此行?'} 陕西 0.8064634799957275
{'sentence': '谈谈去西安旅游,哪些地方让你觉得不虚此行?'} 火车上 0.8384211659431458
{'sentence': '谈谈去西安旅游,哪些地方让你觉得不虚此行?'} 山西 0.9251932501792908
.....

@w5688414 w5688414 requested a review from lugimzzz September 1, 2022 12:32
@w5688414 w5688414 self-assigned this Sep 1, 2022

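In earlier classification tasks, labels were represented as standalone one-hot encodings that carry no meaning of their own, which can discard the labels' semantic information. This approach instead converts the labels of a text classification task into semantic vectors, turning text classification into a vector retrieval and matching task. This suits scenarios where the set of class labels is not fixed or where new classes are added frequently. In addition, for new but related classification tasks, the model does not need to be retrained, and no new model architecture has to be designed for the new task. Overall, retrieval-based text classification extends well, exploits the semantic information contained in the labels, and does not require retraining. It can be applied to areas such as similar-label recommendation, text labeling, financial risk event classification, and classification of government petitions.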
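
The idea of matching a text vector against label vectors can be sketched as follows. This is a minimal illustration, not the code from this PR: the 3-dimensional vectors, the English label names, and the `classify` helper are all hypothetical stand-ins for embeddings that a trained model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical hand-made "semantic vectors" for the labels. In a real
# system these would come from the embedding model, not be typed in.
LABEL_VECTORS = {
    "travel": [0.9, 0.1, 0.0],
    "finance": [0.0, 0.9, 0.1],
    "sports": [0.1, 0.0, 0.9],
}


def classify(text_vector, label_vectors, top_k=2):
    """Return the top_k (label, similarity) pairs, most similar first."""
    scored = [
        (label, cosine_similarity(text_vector, vec))
        for label, vec in label_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embedding of an input sentence about travel.
query_vector = [0.8, 0.2, 0.1]
print(classify(query_vector, LABEL_VECTORS))

# Adding a class only requires adding its vector; nothing is retrained.
LABEL_VECTORS["food"] = [0.5, 0.5, 0.5]
print(classify(query_vector, LABEL_VECTORS))
```

Because the label set is just a lookup table of vectors, new classes are added by inserting entries, which is the extensibility property described above.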

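This approach performs classification with a semantic indexing model. The goal of a semantic indexing model is: given an input text, the model can **quickly and accurately** recall a batch of semantically related texts from a massive candidate recall library. There are two classification methods based on semantic indexing. The first turns the labels themselves into the recall library, matching the input text against the label texts. The second relies on the recalled texts carrying class labels, and assigns the labels of the recalled texts to the input text. This approach uses a dual-tower model, introduces the In-batch Negatives strategy during training, builds the index library with hnswlib, and runs recall tests. Finally, the recall results are scored with the Accuracy metric to evaluate the classification quality of the semantic indexing model.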
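
For readers unfamiliar with the In-batch Negatives strategy, here is a minimal pure-Python sketch of the loss it usually implies. It is an assumption-laden illustration, not the training code in this PR: the function name, the `scale` parameter, and the toy vectors are hypothetical, and the actual implementation may add a margin or other details. Within a batch, the i-th text is paired with the i-th label as the positive, and every other label in the batch serves as a negative.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def in_batch_negatives_loss(query_vecs, doc_vecs, scale=1.0):
    """Mean softmax cross-entropy where row i's positive is column i.

    query_vecs[i] and doc_vecs[i] form a positive pair; doc_vecs[j]
    for j != i act as negatives for query i.
    """
    total = 0.0
    for i, query in enumerate(query_vecs):
        logits = [scale * dot(query, doc) for doc in doc_vecs]
        # Numerically stable log-sum-exp over the whole batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        # Cross-entropy with the diagonal entry as the target class.
        total += log_sum_exp - logits[i]
    return total / len(query_vecs)


# Toy batch: two text vectors and two label vectors (hypothetical values).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]
misaligned_docs = [[0.0, 1.0], [1.0, 0.0]]

print(in_batch_negatives_loss(queries, aligned_docs))
print(in_batch_negatives_loss(queries, misaligned_docs))
```

The loss is lower when each text vector is closest to its own label vector, which is what pushes the two towers toward a shared embedding space suitable for retrieval.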
Contributor

If the first method is the default here, I suggest mentioning that the labels are used as the recall library by default.

Contributor Author

Updated.

@tianxin1860 tianxin1860 left a comment

LGTM

@w5688414 w5688414 merged commit 54df619 into PaddlePaddle:develop Sep 5, 2022
@w5688414 w5688414 deleted the rb1 branch September 8, 2022 10:25
3 participants