PaddlePaddle · w5688414 · Sep 5, 2022 · Sep 1, 2022
diff --git a/applications/neural_search/recall/in_batch_negative/README.md b/applications/neural_search/recall/in_batch_negative/README.md
@@ -42,7 +42,7 @@ The training data for the In-batch Negatives strategy consists of semantically similar pairs; the strategy's core
 
 ### Technical Approach
 
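-A dual-tower model warm-started from ERNIE 1.0. The In-batch Negatives strategy is introduced during recall training, and hnswlib is used to build the index for recall testing.
+A dual-tower model. The In-batch Negatives strategy is introduced during recall training, and hnswlib is used to build the index for recall testing.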
 
 
 ### Evaluation Metrics
@@ -53,10 +53,10 @@ Recall@K is the recall over the predicted top-K (top-k refers to, from the final results sorted by score
 
 **Results**
 
-|  Model |  Recall@1 | Recall@5 |Recall@10 |Recall@20 |Recall@50 |Strategy summary|
+|  Strategy | Model |  Recall@1 | Recall@5 |Recall@10 |Recall@20 |Recall@50 |
 | ------------ | ------------ | ------------ |--------- |--------- |--------- |--------- |
-|  In-batch Negatives |  51.301 | 65.309| 69.878| 73.996|78.881| In-batch negatives supervised training|
-
+|  In-batch Negatives | ernie 1.0 | 51.301 | 65.309| 69.878| 73.996|78.881|
+|  In-batch Negatives | rocketqa-zh-base-query-encoder | **59.622** | **75.089**| **79.668**| **83.404**|**87.773**|
 
 
 <a name="环境依赖"></a>
@@ -166,10 +166,10 @@ Recall@K is the recall over the predicted top-K (top-k refers to, from the final results sorted by score
 
 |Model|Training configuration|Hardware|MD5|
 | ------------ | ------------ | ------------ |-----------|
-|[batch_neg](https://bj.bcebos.com/v1/paddlenlp/models/inbatch_model.zip)|<div style="width: 150pt">margin:0.2 scale:30 epoch:3 lr:5E-5 bs:64 max_len:64 </div>|<div style="width: 100pt">4x v100-16g</div>|f3e5c7d7b0b718c2530c5e1b136b2d74|
+|[batch_neg](https://bj.bcebos.com/v1/paddlenlp/models/inbatch_model.zip)|<div style="width: 150pt">ernie 1.0 margin:0.2 scale:30 epoch:3 lr:5E-5 bs:64 max_len:64 </div>|<div style="width: 100pt">4x v100-16g</div>|f3e5c7d7b0b718c2530c5e1b136b2d74|
 
-### Training Environment
 
+### Training Environment
 
 - NVIDIA Driver Version: 440.64.00
 - Ubuntu 16.04.6 LTS (Docker)
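@@ -185,7 +185,7 @@ Recall@K is the recall over the predicted top-K (top-k refers to, from the final results sorted by score
 Then run the following command to train on GPU and obtain the semantic indexing model: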
 
 ```
-root_path=recall
+root_path=inbatch
 python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
     train_batch_neg.py \
     --device gpu \
@@ -194,11 +194,11 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
     --learning_rate 5E-5 \
     --epochs 3 \
     --output_emb_size 256 \
+    --model_name_or_path rocketqa-zh-base-query-encoder \
     --save_steps 10 \
     --max_seq_length 64 \
     --margin 0.2 \
     --train_set_file recall/train.csv \
-    --evaluate \
     --recall_result_dir "recall_result_dir" \
     --recall_result_file "recall_result.txt" \
     --hnsw_m 100 \
@@ -217,6 +217,7 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
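 * `learning_rate`: learning rate used for training
 * `epochs`: number of training epochs
 * `output_emb_size`: dimension of the text embedding output by the top Transformer layer
+* `model_name_or_path`: pretrained model used to initialize the parameters of the model and the `Tokenizer`
 * `save_steps`： number of steps between checkpoint saves
 * `max_seq_length`: maximum length of the input sequence
 * `margin`: target gap between the positive-sample similarity and the negative-sample similarity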
@@ -234,7 +235,7 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
 You can also use the bash script:
 
 ```
-sh scripts/train_batch_neg.sh
+sh scripts/train.sh
 ```
 
 
@@ -270,6 +271,7 @@ python -u -m paddle.distributed.launch --gpus "3" --log_dir "recall_log/" \
         --recall_result_dir "recall_result_dir" \
         --recall_result_file "recall_result.txt" \
         --params_path "${root_dir}/model_40/model_state.pdparams" \
+        --model_name_or_path rocketqa-zh-base-query-encoder \
         --hnsw_m 100 \
         --hnsw_ef 100 \
         --batch_size 64 \
@@ -280,16 +282,17 @@ python -u -m paddle.distributed.launch --gpus "3" --log_dir "recall_log/" \
         --corpus_file "recall/corpus.csv"
 ```
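 Parameter descriptions
-* `device`: use cpu/gpu for training
-* `recall_result_dir`: directory where recall results are stored
-* `recall_result_file`: file name of the recall results
+* `device`： use cpu/gpu for training
+* `recall_result_dir`： directory where recall results are stored
+* `recall_result_file`： file name of the recall results
 * `params_path`： parameter file name of the model to evaluate
-* `hnsw_m`: hnsw algorithm parameter; keep the default
-* `hnsw_ef`: hnsw algorithm parameter; keep the default
-* `output_emb_size`: dimension of the text embedding output by the top Transformer layer
-* `recall_num`: number of similar texts to recall for each text
-* `similar_text_pair`: evaluation set made up of similar text pairs
-* `corpus_file`: recall corpus data (corpus_file)
+* `model_name_or_path`: pretrained model used to initialize the parameters of the model and the `Tokenizer`
+* `hnsw_m`： hnsw algorithm parameter; keep the default
+* `hnsw_ef`： hnsw algorithm parameter; keep the default
+* `output_emb_size`： dimension of the text embedding output by the top Transformer layer
+* `recall_num`： number of similar texts to recall for each text
+* `similar_text_pair`： evaluation set made up of similar text pairs
+* `corpus_file`： recall corpus data (corpus_file)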
 
 You can also use the following bash script:
 
@@ -383,10 +386,11 @@ python inference.py
 ```
 root_dir="checkpoints/inbatch"
 
-python -u -m paddle.distributed.launch --gpus "3" \
+python -u -m paddle.distributed.launch --gpus "0" \
     predict.py \
     --device gpu \
     --params_path "${root_dir}/model_40/model_state.pdparams" \
+    --model_name_or_path rocketqa-zh-base-query-encoder \
     --output_emb_size 256 \
     --batch_size 128 \
     --max_seq_length 64 \
@@ -396,6 +400,7 @@ python -u -m paddle.distributed.launch --gpus "3" \
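 Parameter descriptions
 * `device`: use cpu/gpu for training
 * `params_path`： parameter file name of the pretrained model
+* `model_name_or_path`: pretrained model used to initialize the parameters of the model and the `Tokenizer`
 * `output_emb_size`: dimension of the text embedding output by the top Transformer layer
 * `text_pair_file`: dataset of text pairs to run prediction on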
 
@@ -423,7 +428,9 @@ predict.sh contains scripts for both cpu and gpu; the gpu script is the default
 First convert the dynamic-graph model to a static graph:
 
 ```
-python export_model.py --params_path checkpoints/inbatch/model_40/model_state.pdparams --output_path=./output
+python export_model.py --params_path checkpoints/inbatch/model_40/model_state.pdparams \
+                       --model_name_or_path rocketqa-zh-base-query-encoder \
+                       --output_path=./output
 ```
 You can also run the following bash script:
 
@@ -449,7 +456,9 @@ corpus_list=[['中西方语言与文化的差异','中西方文化差异以及
 Then use PaddleInference
 
 ```
-python deploy/python/predict.py --model_dir=./output
+python deploy/python/predict.py \
+                             --model_dir=./output \
+                             --model_name_or_path rocketqa-zh-base-query-encoder
 ```
 You can also run the following bash script:
 
@@ -501,9 +510,16 @@ Paddle Serving can be deployed in two ways; the first is the Pipeline approach,
 
 #### Pipeline Mode
 
-Start the Pipeline Server:
+Update the `Tokenizer` that the model uses
+
+```
+self.tokenizer = AutoTokenizer.from_pretrained("rocketqa-zh-base-query-encoder")
+```
+
+Then start the Pipeline Server:
 
 ```
+cd deploy/python
 python web_service.py
 ```
 
@@ -520,7 +536,7 @@ list_data = [
 Then run:
 
 ```
-python rpc_client.py
+python deploy/python/rpc_client.py
 ```
 The model output is:
 
@@ -547,12 +563,12 @@ python -m paddle_serving_server.serve --model serving_server --port 9393 --gpu_i
 You can also use the script:
 
 ```
-sh deploy/C++/start_server.sh
+sh deploy/cpp/start_server.sh
 ```
 The client can use either http or rpc; the rpc approach is:
 
 ```
-python deploy/C++/rpc_client.py
+python deploy/cpp/rpc_client.py
 ```
 The output is:
 ```
@@ -571,7 +587,7 @@ time to cost :0.3960278034210205 seconds
 Or use the http client access mode:
 
 ```
-python deploy/C++/http_client.py
+python deploy/cpp/http_client.py
 ```
 The output is:
 
@@ -599,6 +615,7 @@ python -u -m paddle.distributed.launch --gpus "0,1,2,3" \
     train_batch_neg.py \
     --device gpu \
     --save_dir ./checkpoints/simcse_inbatch_negative \
+    --model_name_or_path rocketqa-zh-base-query-encoder \
     --batch_size 64 \
     --learning_rate 5E-5 \
     --epochs 3 \
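
Reviewer note: the README passes `--margin 0.2` and lists `scale:30` in the model table, but this diff does not show how those two values enter the training loss. The sketch below illustrates one common in-batch negatives formulation under stated assumptions: cosine similarity between every query and every title in the batch, the margin subtracted from the positive (diagonal) pair, the result multiplied by the scale, then cross-entropy with the diagonal as the target. The function name `in_batch_negatives_loss` and the toy vectors are hypothetical, and the code in `train_batch_neg.py` may differ.

```python
import math


def in_batch_negatives_loss(query_embs, title_embs, margin=0.2, scale=30.0):
    """Mean cross-entropy where title i is the positive for query i
    and every other title in the batch serves as a negative."""

    def normalize(vec):
        # L2-normalize so the dot product below is a cosine similarity.
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    queries = [normalize(q) for q in query_embs]
    titles = [normalize(t) for t in title_embs]

    total = 0.0
    for i, query in enumerate(queries):
        logits = []
        for j, title in enumerate(titles):
            sim = sum(a * b for a, b in zip(query, title))
            if i == j:
                # Penalize the positive pair by the margin (assumption).
                sim -= margin
            logits.append(sim * scale)
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(queries)


# Toy batch of two pairs: aligned positives versus swapped positives.
aligned = in_batch_negatives_loss([[1.0, 0.0], [0.0, 1.0]],
                                  [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_negatives_loss([[1.0, 0.0], [0.0, 1.0]],
                                  [[0.0, 1.0], [1.0, 0.0]])
```

With these toy vectors the aligned batch yields a loss near zero and the swapped batch a large one, which is the behavior the strategy relies on: each query is pushed toward its own title and away from the other titles in the batch.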

diff --git a/..._batch_negative/deploy/C++/http_client.py → ..._batch_negative/deploy/cpp/http_client.py b/..._batch_negative/deploy/C++/http_client.py → ..._batch_negative/deploy/cpp/http_client.py
@@ -54,7 +54,7 @@ def convert_example(example,
 print(fetch_names)
 
 # Create the tokenizer
-tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
+tokenizer = AutoTokenizer.from_pretrained('rocketqa-zh-base-query-encoder')
 max_seq_len = 64
 
 # Data preprocessing

diff --git a/...n_batch_negative/deploy/C++/rpc_client.py → ...n_batch_negative/deploy/cpp/rpc_client.py b/...n_batch_negative/deploy/C++/rpc_client.py → ...n_batch_negative/deploy/cpp/rpc_client.py
@@ -50,7 +50,7 @@ def convert_example(example,
 print(fetch_names)
 
 # Create the tokenizer
-tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
+tokenizer = AutoTokenizer.from_pretrained('rocketqa-zh-base-query-encoder')
 max_seq_len = 64
 
 # Data preprocessing

diff --git a/...batch_negative/deploy/C++/start_server.sh → ...batch_negative/deploy/cpp/start_server.sh b/...batch_negative/deploy/C++/start_server.sh → ...batch_negative/deploy/cpp/start_server.sh
diff --git a/applications/neural_search/recall/in_batch_negative/deploy/python/predict.py b/applications/neural_search/recall/in_batch_negative/deploy/python/predict.py
@@ -40,7 +40,7 @@
     help="Batch size per GPU/CPU for training.")
 parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu",
     help="Select which device to train model, defaults to gpu.")
-
+parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="model name.")
 parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False],
     help='Enable to use tensorrt to speed up.')
 parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"],
@@ -156,22 +156,21 @@ def __init__(self,
         if args.benchmark:
             import auto_log
             pid = os.getpid()
-            self.autolog = auto_log.AutoLogger(model_name="ernie-3.0-medium-zh",
-                                               model_precision=precision,
-                                               batch_size=self.batch_size,
-                                               data_shape="dynamic",
-                                               save_path=args.save_log_path,
-                                               inference_config=config,
-                                               pids=pid,
-                                               process_name=None,
-                                               gpu_ids=0,
-                                               time_keys=[
-                                                   'preprocess_time',
-                                                   'inference_time',
-                                                   'postprocess_time'
-                                               ],
-                                               warmup=0,
-                                               logger=logger)
+            self.autolog = auto_log.AutoLogger(
+                model_name=args.model_name_or_path,
+                model_precision=precision,
+                batch_size=self.batch_size,
+                data_shape="dynamic",
+                save_path=args.save_log_path,
+                inference_config=config,
+                pids=pid,
+                process_name=None,
+                gpu_ids=0,
+                time_keys=[
+                    'preprocess_time', 'inference_time', 'postprocess_time'
+                ],
+                warmup=0,
+                logger=logger)
 
     def extract_embedding(self, data, tokenizer):
         """
@@ -279,7 +278,7 @@ def predict(self, data, tokenizer):
 
     # ErnieTinyTokenizer is special for ernie-tiny pretained model.
     output_emb_size = 256
-    tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
+    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
     id2corpus = {0: '国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'}
     corpus_list = [{idx: text} for idx, text in id2corpus.items()]
     res = predictor.extract_embedding(corpus_list, tokenizer)

diff --git a/applications/neural_search/recall/in_batch_negative/deploy/python/web_service.py b/applications/neural_search/recall/in_batch_negative/deploy/python/web_service.py
@@ -40,7 +40,8 @@ class ErnieOp(Op):
 
     def init_op(self):
         from paddlenlp.transformers import AutoTokenizer
-        self.tokenizer = AutoTokenizer.from_pretrained('ernie-1.0')
+        self.tokenizer = AutoTokenizer.from_pretrained(
+            "rocketqa-zh-base-query-encoder")
 
     def preprocess(self, input_dicts, data_id, log_id):
         from paddlenlp.data import Stack, Tuple, Pad
@@ -56,7 +57,7 @@ def preprocess(self, input_dicts, data_id, log_id):
         batchify_fn = lambda samples, fn=Tuple(
             Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"
                 ),  # input
-            Pad(axis=0, pad_val=self.tokenizer.pad_token_id, dtype="int64"
+            Pad(axis=0, pad_val=self.tokenizer.pad_token_type_id, dtype="int64"
                 ),  # segment
         ): fn(samples)
         input_ids, segment_ids = batchify_fn(examples)
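
Reviewer note: the hunk above pads the segment ids with `pad_token_type_id` instead of `pad_token_id`. The two values can coincide for a given tokenizer, but they belong to different id spaces (vocabulary ids versus token-type ids), so each field should be padded with its own value. The sketch below illustrates the idea with a plain-Python stand-in for `Pad`; the helper `pad_batch` and the constants `PAD_TOKEN_ID` and `PAD_TOKEN_TYPE_ID` are hypothetical values chosen to make the difference visible, not values read from any real tokenizer.

```python
def pad_batch(sequences, pad_val):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]


# Hypothetical padding values, deliberately different from each other.
PAD_TOKEN_ID = 1        # stand-in for tokenizer.pad_token_id
PAD_TOKEN_TYPE_ID = 0   # stand-in for tokenizer.pad_token_type_id

# Each example is (input_ids, segment_ids), as in the service's preprocess step.
examples = [
    ([101, 7, 8, 102], [0, 0, 0, 0]),
    ([101, 9, 102], [0, 0, 0]),
]

input_ids = pad_batch([ids for ids, _ in examples], PAD_TOKEN_ID)
segment_ids = pad_batch([seg for _, seg in examples], PAD_TOKEN_TYPE_ID)
```

Padding `segment_ids` with `PAD_TOKEN_ID` here would have produced a `1` in the last position of the second row, which the model would read as a second-segment token rather than as padding.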

diff --git a/applications/neural_search/recall/in_batch_negative/evaluate.py b/applications/neural_search/recall/in_batch_negative/evaluate.py
@@ -76,8 +76,6 @@ def recall(rs, N=10):
                 relevance_labels.append(1)
             else:
                 relevance_labels.append(0)
-        # print(len(rs))
-        # print(rs[:50])
 
     recall_N = []
     recall_num = [1, 5, 10, 20, 50]
@@ -92,4 +90,3 @@ def recall(rs, N=10):
         print('recall@{}={}'.format(key, val))
         res.append(str(val))
     result.write('\t'.join(res) + '\n')
-    # print("\t".join(recall_N))
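
Reviewer note: since the README table now compares two encoders by Recall@K, a compact illustration of the metric may help readers check the numbers. The sketch below assumes, as the `relevance_labels` list in the hunk above suggests, that each query yields a list of 0/1 labels over its ranked results and that each query has at most one relevant document. The function name `recall_at_k` and the sample data are hypothetical; this is not the body of `recall()` in `evaluate.py`.

```python
def recall_at_k(relevance_lists, k):
    """Fraction of queries whose relevant document appears in the top k.

    Assumes at most one relevant document per query, so the sum of the
    first k labels is either 0 or 1.
    """
    hits = [sum(labels[:k]) for labels in relevance_lists]
    return sum(hits) / len(hits)


# Three queries, each with ranked 0/1 relevance labels.
ranked_labels = [
    [1, 0, 0],  # relevant document ranked first
    [0, 0, 1],  # relevant document ranked third
    [0, 0, 0],  # relevant document not retrieved
]

recall_at_1 = recall_at_k(ranked_labels, 1)
recall_at_3 = recall_at_k(ranked_labels, 3)
```

The values in the README table look like percentages, so a fraction computed this way would need to be multiplied by 100 to be compared with them.
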
diff --git a/applications/neural_search/recall/in_batch_negative/export_model.py b/applications/neural_search/recall/in_batch_negative/export_model.py
@@ -28,15 +28,16 @@
 parser = argparse.ArgumentParser()
 parser.add_argument("--params_path", type=str, required=True,
                     default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.")
+parser.add_argument('--model_name_or_path', default="rocketqa-zh-base-query-encoder", help="Select model to train, defaults to rocketqa-zh-base-query-encoder.")
 parser.add_argument("--output_path", type=str, default='./output',
                     help="The path of model parameter in static graph to be saved.")
 args = parser.parse_args()
 # yapf: enable
 
 if __name__ == "__main__":
     output_emb_size = 256
-    pretrained_model = AutoModel.from_pretrained("ernie-1.0")
-    tokenizer = AutoTokenizer.from_pretrained('ernie-1.0')
+    pretrained_model = AutoModel.from_pretrained(args.model_name_or_path)
+    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
     model = SemanticIndexBaseStatic(pretrained_model,
                                     output_emb_size=output_emb_size)
     if args.params_path and os.path.isfile(args.params_path):
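
Reviewer note: this PR replaces hard-coded model names with a `--model_name_or_path` argument in `export_model.py` and `deploy/python/predict.py`. The sketch below shows the pattern in isolation so the default and the override are easy to see. The helper `build_parser` is hypothetical, and only the two arguments shown in the hunk are reproduced.

```python
import argparse


def build_parser():
    """Build a parser mirroring the two optional arguments in the hunk."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name_or_path", type=str,
                        default="rocketqa-zh-base-query-encoder",
                        help="Pretrained model used for the model and tokenizer.")
    parser.add_argument("--output_path", type=str, default="./output",
                        help="Where the static-graph model is saved.")
    return parser


# No flags: the new default encoder is used.
default_args = build_parser().parse_args([])

# Explicit flag: fall back to another encoder, e.g. for an older checkpoint.
override_args = build_parser().parse_args(["--model_name_or_path", "ernie-1.0"])
```

Because the tokenizer and the encoder weights are both loaded from this name, the same value presumably needs to be passed at training, export, and prediction time so that the checkpoint in `--params_path` matches the architecture being built.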

diff --git a/applications/neural_search/recall/in_batch_negative/inference.py b/applications/neural_search/recall/in_batch_negative/inference.py
@@ -26,9 +26,10 @@
     batch_size = 1
     params_path = 'checkpoints/inbatch/model_40/model_state.pdparams'
     id2corpus = {0: '国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'}
+    model_name_or_path = "rocketqa-zh-base-query-encoder"
     paddle.set_device(device)
 
-    tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
+    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
     trans_func = partial(convert_example,
                          tokenizer=tokenizer,
                          max_seq_length=max_seq_length)
@@ -38,7 +39,7 @@
         Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # text_segment
     ): [data for data in fn(samples)]
 
-    pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh")
+    pretrained_model = AutoModel.from_pretrained(model_name_or_path)
 
     model = SemanticIndexBaseStatic(pretrained_model,
                                     output_emb_size=output_emb_size)