Integrate Neural Search models into Pipelines #3172

Merged
merged 7 commits into from
Sep 5, 2022
69 changes: 43 additions & 26 deletions applications/neural_search/recall/in_batch_negative/README.md
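@@ -42,7 +42,7 @@ The training data for the In-batch Negatives strategy consists of semantically similar pairs; the strategy's core

### Technical approach

A two-tower model warm-started from ERNIE 1.0. The In-batch Negatives strategy is introduced during recall training, and hnswlib is used to build the index for recall testing.
A two-tower model. The In-batch Negatives strategy is introduced during recall training, and hnswlib is used to build the index for recall testing.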
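
As a rough illustration of the In-batch Negatives idea, the sketch below scores every query against every title in the batch, treats the diagonal entry as the positive, and applies cross-entropy. It is an assumption-laden sketch, not the repository's implementation: the function name `in_batch_negative_loss` is hypothetical, the embeddings are assumed to be L2-normalized, and the `margin` and `scale` defaults are taken from the training configuration listed in this README (`margin:0.2 scale:30`).

```python
import math


def in_batch_negative_loss(query_embs, title_embs, margin=0.2, scale=30.0):
    """Mean cross-entropy over an in-batch similarity matrix.

    Row i treats title i as the positive and every other title in the
    batch as a negative. Embeddings are assumed to be L2-normalized, so
    the dot product equals the cosine similarity.
    """
    total = 0.0
    for i, query in enumerate(query_embs):
        logits = []
        for j, title in enumerate(title_embs):
            sim = sum(a * b for a, b in zip(query, title))
            if i == j:
                sim -= margin  # the positive must beat negatives by the margin
            logits.append(sim * scale)
        # Numerically stable log-sum-exp, then negative log-softmax at index i.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum - logits[i]
    return total / len(query_embs)


# Toy batch (made up): two orthogonal unit vectors.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_titles = [[1.0, 0.0], [0.0, 1.0]]
swapped_titles = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = in_batch_negative_loss(queries, aligned_titles)
loss_swapped = in_batch_negative_loss(queries, swapped_titles)
print(loss_aligned < loss_swapped)
```

When each query matches its own title the loss is close to zero; when the titles are swapped, every positive scores below its in-batch negative and the loss is large.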


### Evaluation metrics
@@ -53,10 +53,10 @@ Recall@K recall refers to the predicted top K (top-k means that, from the final score-sorted

**Evaluation results**

| Model | Recall@1 | Recall@5 |Recall@10 |Recall@20 |Recall@50 | Strategy summary |
| Strategy | Model | Recall@1 | Recall@5 |Recall@10 |Recall@20 |Recall@50 |
| ------------ | ------------ | ------------ |--------- |--------- |--------- |--------- |
| In-batch Negatives | 51.301 | 65.309| 69.878| 73.996|78.881| In-batch negative supervised training |

| In-batch Negatives | ernie 1.0 | 51.301 | 65.309| 69.878| 73.996|78.881|
| In-batch Negatives | rocketqa-zh-base-query-encoder | **59.622** | **75.089**| **79.668**| **83.404**|**87.773**|
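
To make the Recall@K columns concrete, here is a minimal sketch of how such a number could be computed. It assumes each query has exactly one labelled similar text in the corpus and that the table values are percentages; the function name `recall_at_k` and the toy data are illustrative and are not taken from the repository.

```python
def recall_at_k(ranked_relevance, k):
    """Percentage of queries with a relevant text in the top k results.

    ranked_relevance holds one list per query; entry j is 1 if the j-th
    retrieved text is the labelled similar text, else 0.
    Assumes each query has exactly one relevant text in the corpus.
    """
    hits = sum(1 for labels in ranked_relevance if any(labels[:k]))
    return 100.0 * hits / len(ranked_relevance)


# Toy data (made up): four queries, three retrieved texts each.
ranked_relevance = [
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
    [0, 1, 0],
]
for k in (1, 2, 3):
    print("recall@{}={}".format(k, recall_at_k(ranked_relevance, k)))
```

Recall@K can only stay flat or grow as K grows, which matches the left-to-right increase in each table row.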


<a name="环境依赖"></a>
@@ -166,10 +166,10 @@ Recall@K recall refers to the predicted top K (top-k means that, from the final score-sorted

|Model|Training configuration|Hardware|MD5|
| ------------ | ------------ | ------------ |-----------|
|[batch_neg](https://bj.bcebos.com/v1/paddlenlp/models/inbatch_model.zip)|<div style="width: 150pt">margin:0.2 scale:30 epoch:3 lr:5E-5 bs:64 max_len:64 </div>|<div style="width: 100pt">4x V100-16G</div>|f3e5c7d7b0b718c2530c5e1b136b2d74|
|[batch_neg](https://bj.bcebos.com/v1/paddlenlp/models/inbatch_model.zip)|<div style="width: 150pt">ernie 1.0 margin:0.2 scale:30 epoch:3 lr:5E-5 bs:64 max_len:64 </div>|<div style="width: 100pt">4x V100-16G</div>|f3e5c7d7b0b718c2530c5e1b136b2d74|

### Training environment

### Training environment

- NVIDIA Driver Version: 440.64.00
- Ubuntu 16.04.6 LTS (Docker)
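@@ -185,7 +185,7 @@ Recall@K recall refers to the predicted top K (top-k means that, from the final score-sorted
Then run the following command to train on GPU and obtain the semantic indexing model: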

```
root_path=recall
root_path=inbatch
python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
train_batch_neg.py \
--device gpu \
@@ -194,11 +194,11 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
--learning_rate 5E-5 \
--epochs 3 \
--output_emb_size 256 \
--model_name_or_path rocketqa-zh-base-query-encoder \
--save_steps 10 \
--max_seq_length 64 \
--margin 0.2 \
--train_set_file recall/train.csv \
--evaluate \
--recall_result_dir "recall_result_dir" \
--recall_result_file "recall_result.txt" \
--hnsw_m 100 \
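@@ -217,6 +217,7 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
* `learning_rate`: learning rate used for training
* `epochs`: number of training epochs
* `output_emb_size`: dimension of the text vector output by the top Transformer layer
* `model_name_or_path`: pretrained model used to initialize the model and `Tokenizer` parameters
* `save_steps`: number of steps between saved model checkpoints
* `max_seq_length`: maximum length of the input sequence
* `margin`: target gap between the positive-sample similarity and the negative-sample similarity
@@ -234,7 +235,7 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
You can also use the bash script: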

```
sh scripts/train_batch_neg.sh
sh scripts/train.sh
```


@@ -270,6 +271,7 @@ python -u -m paddle.distributed.launch --gpus "3" --log_dir "recall_log/" \
--recall_result_dir "recall_result_dir" \
--recall_result_file "recall_result.txt" \
--params_path "${root_dir}/model_40/model_state.pdparams" \
--model_name_or_path rocketqa-zh-base-query-encoder \
--hnsw_m 100 \
--hnsw_ef 100 \
--batch_size 64 \
@@ -280,16 +282,17 @@ python -u -m paddle.distributed.launch --gpus "3" --log_dir "recall_log/" \
--corpus_file "recall/corpus.csv"
```
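Parameter descriptions
* `device`: use cpu/gpu for training
* `recall_result_dir`: directory where recall results are stored
* `recall_result_file`: file name of the recall results
* `device` use cpu/gpu for training
* `recall_result_dir` directory where recall results are stored
* `recall_result_file` file name of the recall results
* `params_path`: parameter file name of the model to evaluate
* `hnsw_m`: hnsw algorithm parameter; keep the default
* `hnsw_ef`: hnsw algorithm parameter; keep the default
* `output_emb_size`: dimension of the text vector output by the top Transformer layer
* `recall_num`: number of similar texts to recall for one text
* `similar_text_pair`: evaluation set made up of similar text pairs
* `corpus_file`: data for the recall corpus
* `model_name_or_path`: pretrained model used to initialize the model and `Tokenizer` parameters
* `hnsw_m`: hnsw algorithm parameter; keep the default
* `hnsw_ef`: hnsw algorithm parameter; keep the default
* `output_emb_size`: dimension of the text vector output by the top Transformer layer
* `recall_num`: number of similar texts to recall for one text
* `similar_text_pair`: evaluation set made up of similar text pairs
* `corpus_file`: data for the recall corpus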
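
For intuition about what `recall_num` controls, the sketch below performs the exhaustive version of the lookup: score one query vector against every corpus vector and keep the best `recall_num` entries. An HNSW index, tuned by `hnsw_m` and `hnsw_ef`, approximates this search without scanning the whole corpus. The sketch does not use hnswlib; the function name `top_k_inner_product` and the toy vectors are assumptions made for illustration.

```python
import heapq


def top_k_inner_product(query, corpus, recall_num):
    """Return (corpus_index, score) pairs for the best-scoring vectors.

    Exact brute-force search by inner product; results are ordered from
    the highest score to the lowest.
    """
    scored = (
        (sum(a * b for a, b in zip(query, vec)), idx)
        for idx, vec in enumerate(corpus)
    )
    return [(idx, score) for score, idx in heapq.nlargest(recall_num, scored)]


# Toy corpus (made up): three 2-dimensional unit vectors.
corpus = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query = [1.0, 0.0]

results = top_k_inner_product(query, corpus, recall_num=2)
print([idx for idx, _ in results])
```

The exhaustive scan costs one inner product per corpus entry for every query, which is the cost an approximate index is meant to avoid on large corpora.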

You can also use the bash script below:

@@ -383,10 +386,11 @@ python inference.py
```
root_dir="checkpoints/inbatch"

python -u -m paddle.distributed.launch --gpus "3" \
python -u -m paddle.distributed.launch --gpus "0" \
predict.py \
--device gpu \
--params_path "${root_dir}/model_40/model_state.pdparams" \
--model_name_or_path rocketqa-zh-base-query-encoder \
--output_emb_size 256 \
--batch_size 128 \
--max_seq_length 64 \
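@@ -396,6 +400,7 @@ python -u -m paddle.distributed.launch --gpus "3" \
Parameter descriptions
* `device`: use cpu/gpu for training
* `params_path`: parameter file name of the pretrained model
* `model_name_or_path`: pretrained model used to initialize the model and `Tokenizer` parameters
* `output_emb_size`: dimension of the text vector output by the top Transformer layer
* `text_pair_file`: dataset to predict, made up of text pairs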

@@ -423,7 +428,9 @@ The predict.sh file contains scripts for running on cpu and gpu; the gpu script is the default
First convert the dynamic graph model to a static graph:

```
python export_model.py --params_path checkpoints/inbatch/model_40/model_state.pdparams --output_path=./output
python export_model.py --params_path checkpoints/inbatch/model_40/model_state.pdparams \
--model_name_or_path rocketqa-zh-base-query-encoder \
--output_path=./output
```
You can also run the bash script below:

@@ -449,7 +456,9 @@ corpus_list=[['中西方语言与文化的差异','中西方文化差异以及
Then use PaddleInference

```
python deploy/python/predict.py --model_dir=./output
python deploy/python/predict.py \
--model_dir=./output \
--model_name_or_path rocketqa-zh-base-query-encoder
```
You can also run the bash script below:

@@ -501,9 +510,16 @@ Paddle Serving can be deployed in two ways. The first is the Pipeline approach,

#### Pipeline approach

Start the Pipeline Server:
Modify the `Tokenizer` that the model uses

```
self.tokenizer = AutoTokenizer.from_pretrained("rocketqa-zh-base-query-encoder")
```

Then start the Pipeline Server:

```
cd deploy/python
python web_service.py
```

@@ -520,7 +536,7 @@ list_data = [
Then run:

```
python rpc_client.py
python deploy/python/rpc_client.py
```
The model output is:

@@ -547,12 +563,12 @@ python -m paddle_serving_server.serve --model serving_server --port 9393 --gpu_i
You can also use the script:

```
sh deploy/C++/start_server.sh
sh deploy/cpp/start_server.sh
```
The client can use either http or rpc. The rpc approach is:

```
python deploy/C++/rpc_client.py
python deploy/cpp/rpc_client.py
```
The output is:
```
@@ -571,7 +587,7 @@ time to cost :0.3960278034210205 seconds
Or use the http client access mode:

```
python deploy/C++/http_client.py
python deploy/cpp/http_client.py
```
The output is:

@@ -599,6 +615,7 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
train_batch_neg.py \
--device gpu \
--save_dir ./checkpoints/simcse_inbatch_negative \
--model_name_or_path rocketqa-zh-base-query-encoder \
--batch_size 64 \
--learning_rate 5E-5 \
--epochs 3 \
Original file line number Diff line number Diff line change
@@ -54,7 +54,7 @@ def convert_example(example,
print(fetch_names)

# Create the tokenizer
tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
tokenizer = AutoTokenizer.from_pretrained('rocketqa-zh-base-query-encoder')
max_seq_len = 64

# Data preprocessing
Original file line number Diff line number Diff line change
@@ -50,7 +50,7 @@ def convert_example(example,
print(fetch_names)

# Create the tokenizer
tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
tokenizer = AutoTokenizer.from_pretrained('rocketqa-zh-base-query-encoder')
max_seq_len = 64

# Data preprocessing
Original file line number Diff line number Diff line change
@@ -40,7 +40,7 @@
help="Batch size per GPU/CPU for training.")
parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu",
help="Select which device to train model, defaults to gpu.")

parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="model name.")
parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False],
help='Enable to use tensorrt to speed up.')
parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"],
@@ -156,22 +156,21 @@ def __init__(self,
if args.benchmark:
import auto_log
pid = os.getpid()
self.autolog = auto_log.AutoLogger(model_name="ernie-3.0-medium-zh",
model_precision=precision,
batch_size=self.batch_size,
data_shape="dynamic",
save_path=args.save_log_path,
inference_config=config,
pids=pid,
process_name=None,
gpu_ids=0,
time_keys=[
'preprocess_time',
'inference_time',
'postprocess_time'
],
warmup=0,
logger=logger)
self.autolog = auto_log.AutoLogger(
model_name=args.model_name_or_path,
model_precision=precision,
batch_size=self.batch_size,
data_shape="dynamic",
save_path=args.save_log_path,
inference_config=config,
pids=pid,
process_name=None,
gpu_ids=0,
time_keys=[
'preprocess_time', 'inference_time', 'postprocess_time'
],
warmup=0,
logger=logger)

def extract_embedding(self, data, tokenizer):
"""
@@ -279,7 +278,7 @@ def predict(self, data, tokenizer):

# ErnieTinyTokenizer is special for ernie-tiny pretained model.
output_emb_size = 256
tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
id2corpus = {0: '国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'}
corpus_list = [{idx: text} for idx, text in id2corpus.items()]
res = predictor.extract_embedding(corpus_list, tokenizer)
Original file line number Diff line number Diff line change
@@ -40,7 +40,8 @@ class ErnieOp(Op):

def init_op(self):
from paddlenlp.transformers import AutoTokenizer
self.tokenizer = AutoTokenizer.from_pretrained('ernie-1.0')
self.tokenizer = AutoTokenizer.from_pretrained(
"rocketqa-zh-base-query-encoder")

def preprocess(self, input_dicts, data_id, log_id):
from paddlenlp.data import Stack, Tuple, Pad
@@ -56,7 +57,7 @@ def preprocess(self, input_dicts, data_id, log_id):
batchify_fn = lambda samples, fn=Tuple(
Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"
), # input
Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"
Pad(axis=0, pad_val=self.tokenizer.pad_token_type_id, dtype="int64"
), # segment
): fn(samples)
input_ids, segment_ids = batchify_fn(examples)
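
The hunk above changes the pad value for the segment field from `pad_token_id` to `pad_token_type_id`. The sketch below shows why the distinction matters, using a plain-Python stand-in for the padding step. The helper `pad_batch` and the two pad values are hypothetical and chosen only so that the difference is visible; real values come from the tokenizer.

```python
def pad_batch(sequences, pad_val):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]


# Hypothetical pad values, chosen to differ for the sake of the example.
PAD_TOKEN_ID = 3
PAD_TOKEN_TYPE_ID = 0

# Toy examples (made up): (input_ids, segment_ids) pairs of unequal length.
examples = [
    ([101, 7, 102], [0, 0, 0]),
    ([101, 7, 8, 9, 102], [0, 0, 0, 0, 0]),
]

input_ids = pad_batch([ids for ids, _ in examples], PAD_TOKEN_ID)
segment_ids = pad_batch([seg for _, seg in examples], PAD_TOKEN_TYPE_ID)
wrong_segment_ids = pad_batch([seg for _, seg in examples], PAD_TOKEN_ID)

print(input_ids[0])
print(segment_ids[0])
print(wrong_segment_ids[0])
```

With these assumed values, padding the segment field with the token pad id would write `3` into positions that should hold a segment id, whereas the dedicated pad value keeps the field within its expected range.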
Original file line number Diff line number Diff line change
@@ -76,8 +76,6 @@ def recall(rs, N=10):
relevance_labels.append(1)
else:
relevance_labels.append(0)
# print(len(rs))
# print(rs[:50])

recall_N = []
recall_num = [1, 5, 10, 20, 50]
@@ -92,4 +90,3 @@ def recall(rs, N=10):
print('recall@{}={}'.format(key, val))
res.append(str(val))
result.write('\t'.join(res) + '\n')
# print("\t".join(recall_N))
Original file line number Diff line number Diff line change
@@ -28,15 +28,16 @@
parser = argparse.ArgumentParser()
parser.add_argument("--params_path", type=str, required=True,
default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.")
parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="Select model to train, defaults to rocketqa-zh-base-query-encoder.")
parser.add_argument("--output_path", type=str, default='./output',
help="The path of model parameter in static graph to be saved.")
args = parser.parse_args()
# yapf: enable

if __name__ == "__main__":
output_emb_size = 256
pretrained_model = AutoModel.from_pretrained("ernie-1.0")
tokenizer = AutoTokenizer.from_pretrained('ernie-1.0')
pretrained_model = AutoModel.from_pretrained(args.model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
model = SemanticIndexBaseStatic(pretrained_model,
output_emb_size=output_emb_size)
if args.params_path and os.path.isfile(args.params_path):
Original file line number Diff line number Diff line change
@@ -26,9 +26,10 @@
batch_size = 1
params_path = 'checkpoints/inbatch/model_40/model_state.pdparams'
id2corpus = {0: '国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'}
model_name_or_path = "rocketqa-zh-base-query-encoder"
paddle.set_device(device)

tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
trans_func = partial(convert_example,
tokenizer=tokenizer,
max_seq_length=max_seq_length)
@@ -38,7 +39,7 @@
Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_segment
): [data for data in fn(samples)]

pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh")
pretrained_model = AutoModel.from_pretrained(model_name_or_path)

model = SemanticIndexBaseStatic(pretrained_model,
output_emb_size=output_emb_size)