# GLM4-9B-chat LoRA Fine-Tuning

This guide describes how to LoRA-fine-tune the glm-4-9b-chat model with the llama-factory framework. LoRA is an efficient fine-tuning method; for an in-depth look at how it works, see this blog post: [知乎|深入浅出 Lora](https://zhuanlan.zhihu.com/p/650197598).

## 1. Environment Preparation

We have tried two platforms; pick either one:

* On the [autodl](https://www.autodl.com/) platform, rent a machine with a 24 GB GPU such as a 3090, and choose the image `PyTorch`-->`2.0.0`-->`3.8(ubuntu20.04)`-->`11.8`, as shown below:
![autodl](../xtuner_config/images/autodl.png)

* On the [InternStudio](https://studio.intern-ai.org.cn/) platform, choose an A100 (1/4) configuration and select the `Cuda11.7-conda` image, as shown below:
![internstudio](../xtuner_config/images/internstudio.png)

Then, in the terminal, switch the pip index and install the dependency packages.

## Environment Configuration

With the basic environment configured and the model deployed locally, you still need to install some third-party libraries, which you can do with the following commands:

```bash
python -m pip install --upgrade pip
# Switch the pip index to a mirror to speed up package installation
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

# Install LLaMA-Factory
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
# This step installs torch, transformers, datasets and the other required dependencies

# modelscope is needed for the model download in the next section
pip install modelscope
```
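
Before moving on, it is worth a quick check that PyTorch was installed with GPU support. This short snippet is an optional sanity check, not part of the original setup:

```python
# Quick sanity check: confirm PyTorch is installed and can see the GPU.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```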

## 2. Model Download

Use the `snapshot_download` function from `modelscope` to download the model. The first argument is the model name, and the `cache_dir` argument is the download path.

Create a `download.py` file under the `/root/autodl-tmp` path, paste in the following code, and remember to save the file. Then run `python /root/autodl-tmp/download.py` to start the download. The model is about 14 GB, and downloading takes roughly 10-20 minutes.

```python
from modelscope import snapshot_download

# Download glm-4-9b-chat; model_dir is the local path of the downloaded snapshot
model_dir = snapshot_download('ZhipuAI/glm-4-9b-chat', cache_dir='/root/autodl-tmp', revision='master')
print(model_dir)
```
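
As a quick check that the snapshot is complete, you can try loading the tokenizer from the downloaded directory. This is an optional sketch; it assumes modelscope's default cache layout, under which `cache_dir='/root/autodl-tmp'` places the files at `/root/autodl-tmp/ZhipuAI/glm-4-9b-chat`:

```python
# Optional sanity check: load the tokenizer from the local snapshot.
# The path below assumes modelscope's default cache layout (an assumption).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    '/root/autodl-tmp/ZhipuAI/glm-4-9b-chat', trust_remote_code=True
)
print(tokenizer('你好'))
```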

## 3. Building the Instruction Dataset — Alpaca Format

LLaMA-Factory supports datasets in both the alpaca and sharegpt formats; for this fine-tuning run we use the alpaca format.

### Supervised Fine-Tuning Data Format

During instruction-supervised fine-tuning, the content of the `instruction` column is concatenated with the content of the `input` column to form the human instruction, i.e. the human instruction is `instruction\ninput`. The content of the `output` column is the model's answer.

If specified, the content of the `system` column is used as the system prompt.

The `history` column is a list of string pairs, each representing the instruction and answer of one turn in the conversation history. Note that during instruction-supervised fine-tuning, the answers in the history **are also used for model learning**.

```json
[
  {
    "instruction": "human instruction (required)",
    "input": "human input (optional)",
    "output": "model answer (required)",
    "system": "system prompt (optional)",
    "history": [
      ["first-turn instruction (optional)", "first-turn answer (optional)"],
      ["second-turn instruction (optional)", "second-turn answer (optional)"]
    ]
  }
]
```
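
To make the concatenation rule concrete, here is a small sketch (with made-up data) of how a record's fields combine into the human turn described above:

```python
# Illustration of the rule above: the human turn is instruction + "\n" + input,
# with the "\n" and input omitted when input is empty. The data is made up.
record = {
    "instruction": "把下面的句子翻译成英文",
    "input": "今天天气很好",
    "output": "The weather is nice today.",
}
human_turn = record["instruction"] + ("\n" + record["input"] if record["input"] else "")
print(human_turn)  # -> "把下面的句子翻译成英文\n今天天气很好"
```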

### Converting Single-Turn Conversation Data

Use the following script to convert the [datasets](../datasets/) into alpaca format:

```python
import json
import os
import re

# Choose the dataset to convert
file_name = "single_turn_dataset_1.json"
#file_name = "single_turn_dataset_2.json"

system_prompt = "Put your system prompt here, if you want one"

with open(f'../{file_name}', 'rt', encoding='utf-8') as file:
    data = json.load(file)

converted_data = [{"instruction": item["prompt"],
                   "input": "",
                   "output": item["completion"],
                   "system": system_prompt
                   } for item in data]

for i in range(len(converted_data)):

    # Data cleaning: remove special symbols
    if "🐳" in converted_data[i]["output"]:
        converted_data[i]["output"] = converted_data[i]["output"].replace("🐳", "")

    # Data cleaning: drop self-introductions such as "你好,我是红烧肉",
    # which would interfere with the model's self-identity
    if '好,我是' in converted_data[i]["output"]:
        converted_data[i]["output"] = converted_data[i]["output"].strip()
        intro_pattern = r"^[^\n]+\n"
        converted_data[i]["output"] = re.sub(intro_pattern, "", converted_data[i]["output"]).strip()

os.makedirs('./processed', exist_ok=True)  # make sure the output directory exists
with open(f'./processed/{file_name}', 'w', encoding='utf-8') as f:
    json.dump(converted_data, f, ensure_ascii=False, indent=4)
print(f'./processed/{file_name} Done')
```

### Converting Multi-Turn Conversation Data

Use the following script to convert the [datasets](../datasets/) into alpaca format:

```python
import json
import os

from tqdm import tqdm

# Choose the dataset to convert
file_name = "data.json"
#file_name = "data_pro.json"
#file_name = "multi_turn_dataset_1.json"
#file_name = "multi_turn_dataset_2.json"
#file_name = "aiwei.json"

system_prompt = "Put your system prompt here, if you want one"

with open(f'../{file_name}', 'rt', encoding='utf-8') as file:
    data = json.load(file)

# Walk through the original data and convert each record: every turn except
# the last goes into "history"; the last turn provides instruction and output.
converted_data = []
for item in tqdm(data):
    conversation = item['conversation']
    history = [(c['input'], c['output']) for c in conversation[:-1]]
    last_item = conversation[-1]
    converted_data.append({
        "instruction": last_item['input'],
        "input": "",
        "output": last_item['output'],
        "system": system_prompt,
        "history": history
    })

os.makedirs('./processed', exist_ok=True)  # make sure the output directory exists
with open(f'./processed/{file_name}', 'w', encoding='utf-8') as f:
    json.dump(converted_data, f, ensure_ascii=False, indent=4)
```
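
A worked example (with made-up data) of what this mapping produces: for a two-turn conversation, only the final turn becomes the instruction/output pair, and the earlier turn is carried in `history`:

```python
# Made-up two-turn conversation in the source format
item = {"conversation": [
    {"input": "你好", "output": "你好,很高兴见到你。"},
    {"input": "我最近压力很大", "output": "愿意和我聊聊具体发生了什么吗?"},
]}
# After conversion:
# history     = [("你好", "你好,很高兴见到你。")]
# instruction = "我最近压力很大"
# output      = "愿意和我聊聊具体发生了什么吗?"
```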

### Converting Role-Play Data

The code is the same as above; pick the single-turn or multi-turn version depending on the original dataset. Be sure to set the `system_prompt` for each role.

### Merging the Datasets

For convenience (to avoid registering too many datasets in LLaMA-Factory), we merge all of the processed alpaca-format datasets (each dataset file is one JSON array) into a single file (one large JSON array). The merging code is as follows:

```python
import json

# Initialize an empty list to hold all records
merged_data = []
file_list = [
    "single_turn_dataset_1.json",
    "single_turn_dataset_2.json",
    "self_cognition_EmoLLM.json",
    "ruozhiba_raw.json",
    "data.json",
    "data_pro.json",
    "multi_turn_dataset_1.json",
    "multi_turn_dataset_2.json",
    "aiwei.json",
    "tiangou.json",
    "SoulStar_data.json",
    "mother_v2.json",
    "scientist.json"
]

# Iterate over all files and read their data
for filename in file_list:
    with open(f"./processed/{filename}", 'r', encoding='utf-8') as file:
        data = json.load(file)
        merged_data.extend(data)

# Write the merged data to a new JSON file
with open('emo_glm4_merged_data.json', 'w', encoding='utf-8') as output_file:
    json.dump(merged_data, output_file, ensure_ascii=False, indent=4)

print("Merging complete; saved to emo_glm4_merged_data.json.")
```
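
Before registering the merged file, a quick validation pass (an optional sketch, not part of the original pipeline) can catch records that are missing the required alpaca keys:

```python
# Optional sanity check: every merged record should carry the required keys.
import json

with open('emo_glm4_merged_data.json', 'r', encoding='utf-8') as f:
    merged = json.load(f)

bad = [i for i, r in enumerate(merged) if 'instruction' not in r or 'output' not in r]
print(f"{len(merged)} records, {len(bad)} missing required keys")
```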

### Registering the Dataset in LLaMA-Factory

Edit the data/dataset_info.json file in the LLaMA-Factory directory and add the following entry:

```json
"emo_merged": {
  "file_name": "absolute path to the emo_glm4_merged_data.json file"
}
```

## 4. Fine-Tuning the Model

Create a configuration file emo_glm4_lora_sft.yaml in the LLaMA-Factory directory:

```yaml
### model
model_name_or_path: absolute path to the glm-4-9b-chat model

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset
# dataset must match the name registered in data/dataset_info.json
dataset: emo_merged
template: glm4
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
# output_dir is where checkpoints, training logs, etc. are saved
output_dir: saves/emo-glm4-epoch10/lora/sft
logging_steps: 10
#save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_strategy: epoch

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 10.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true

### eval
do_eval: false
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 10
```
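
As a rough sanity check on these settings (our arithmetic, not from the original guide): with `max_samples: 1000`, a per-device batch size of 1 and 8 gradient-accumulation steps, the effective batch size is 8, so one epoch is about 125 optimizer steps and 10 epochs about 1250:

```python
# Back-of-envelope step count for the config above (illustrative arithmetic)
samples, batch_size, grad_accum, epochs = 1000, 1, 8, 10
effective_batch = batch_size * grad_accum         # 8
steps_per_epoch = samples // effective_batch      # 125
print(steps_per_epoch, steps_per_epoch * epochs)  # 125 1250
```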

Run the following command to start fine-tuning:

```bash
cd LLaMA-Factory
llamafactory-cli train emo_glm4_lora_sft.yaml
```

After training finishes, create a configuration file emo_glm4_lora_sft_export.yaml in the LLaMA-Factory directory:

```yaml
### model
model_name_or_path: absolute path to the glm-4-9b-chat model
# the output_dir from emo_glm4_lora_sft.yaml above
adapter_name_or_path: saves/emo-glm4-epoch10/lora/sft
template: glm4
finetuning_type: lora

### export
export_dir: models/EmoLLM-glm-4-9b-chat
export_size: 2
export_device: cpu
export_legacy_format: false
```

## 5. Merging the Model

Run the following command to merge the LoRA weights into the base model:

```bash
cd LLaMA-Factory
llamafactory-cli export emo_glm4_lora_sft_export.yaml
```

The complete LoRA fine-tuned model is then available in the models/EmoLLM-glm-4-9b-chat directory.

The model weights are open-sourced at [ModelScope](https://www.modelscope.cn/models/wwewwt/EmoLLM-glm-4-9b-chat/summary).
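
To try the merged model, you can load it directly with `transformers`. The snippet below is a minimal sketch: it assumes the export above finished, that the glm-4 chat template is available via `trust_remote_code`, and the prompt is just an example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "models/EmoLLM-glm-4-9b-chat"  # the export_dir from the step above
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Build a single-turn chat prompt with the model's chat template
messages = [{"role": "user", "content": "最近压力很大,该怎么调节?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```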