[DOC] Fix bugs in docs #60

Merged: 3 commits, Mar 11, 2024

Changes from all commits
evaluate/General_evaluation_EN.md (2 changes: 1 addition & 1 deletion)
@@ -39,7 +39,7 @@ The `eval.py` script is used to generate the doctor's response and evaluate it,

The `metric.py` script contains functions to calculate evaluation metrics, which can be set to evaluate by character level or word level, currently including BLEU and ROUGE scores.

-## Test results
+## Results

Test the data in data.json with the following results:

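Neither file in this PR shows the metric code itself, so here is a minimal sketch of the character-level versus word-level switch described for `metric.py`; it assumes the third-party nltk, rouge, and jieba packages and is not the repository's implementation.

```python
# Minimal sketch of char-/word-level BLEU and ROUGE, as described for metric.py.
# Assumes: pip install nltk rouge jieba   (this is NOT the repo's metric.py)
import jieba
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge import Rouge

def tokenize(text: str, char_level: bool = True) -> list:
    """Split text into characters, or into jieba-segmented words."""
    return list(text) if char_level else list(jieba.cut(text))

def score(reference: str, hypothesis: str, char_level: bool = True) -> dict:
    ref_toks = tokenize(reference, char_level)
    hyp_toks = tokenize(hypothesis, char_level)
    bleu = sentence_bleu([ref_toks], hyp_toks,
                         smoothing_function=SmoothingFunction().method3)
    # The rouge package scores whitespace-separated token strings.
    rouge = Rouge().get_scores(" ".join(hyp_toks), " ".join(ref_toks))[0]
    return {"bleu": bleu,
            "rouge-1": rouge["rouge-1"]["f"],
            "rouge-l": rouge["rouge-l"]["f"]}

print(score("医生建议患者多休息,保持心情愉快。", "医生建议多休息,心情保持愉快。"))
```
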
generate_data/tutorial.md (45 changes: 24 additions & 21 deletions)
@@ -1,69 +1,71 @@
-# EMO Psychological large model fine-tuning data generation tutorial
+# EmoLLM fine-tuning data generation tutorial

**I. Objectives and background**

-To give our psychological large model better expressive quality, we must have a high-quality dataset. To achieve this goal, we decided to use four powerful AI large models, Wenxin Yiyan, Tongyi Qianwen, iFlytek Spark, and Zhipu AI, to generate conversation data. In addition, we will deepen the dataset's cognitive coverage by adding a small self-cognition dataset to improve the model's generalization ability.
+To give our psychological large model better expressive quality, we must have a high-quality dataset. To achieve this goal, we decided to use four powerful Chinese large models, Wenxin Yiyan, Tongyi Qianwen, iFlytek Spark, and Zhipu GLM, to generate conversation data. In addition, we will deepen the dataset's cognitive coverage by adding a small self-cognition dataset to improve the model's generalization ability.

**II. Dataset generation method**

1. **Model selection and data preparation**

-Choose the four large language models Wenxin Yiyan, Tongyi Qianwen, iFlytek Spark, and Zhipu, obtain API access to the corresponding interfaces, and prepare to generate dialogue data.
-2. **Single-turn and multi-turn dialogue data generation**
+Choose the four large language models Wenxin Yiyan, Tongyi Qianwen, iFlytek Spark, and Zhipu GLM, obtain API access to the corresponding interfaces, and prepare to generate dialogue data.
+3. **Single-turn and multi-turn dialogue data generation**

Using these four models, we generated 10,000 single-turn and multi-turn dialogue samples. Throughout this process, we ensured the diversity, complexity, and validity of the data.

Because mental activity is often complex, and to ensure data diversity, we selected 16 * 28 = `448` scenarios for dataset generation; for the specific scenario names, see the `emotions_list` and `areas_of_life` parameters configured in config.yml (a sketch of how these combine follows this section).
-3. **Adding the self-cognition dataset**
+5. **Adding the self-cognition dataset**

To enhance the model's cognitive ability, we deliberately added a portion of self-cognition data. These data help the model better understand context and improve the naturalness and coherence of its dialogue.
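
As the sketch promised above, the `448` scenarios presumably come from the cross product of the two config.yml lists. The loop below is an assumption about how the generation scripts combine them, not code from the repository; the prompt wording is invented.

```python
# Hypothetical sketch: building prompts from the cross product of the two
# config.yml lists (16 emotions x 28 areas of life = 448 scenarios).
import itertools
import yaml  # pip install pyyaml

with open("config.yml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

emotions = cfg["emotions_list"]   # 16 entries per the tutorial
areas = cfg["areas_of_life"]      # 28 entries per the tutorial

prompts = [
    f"Generate a counseling dialogue in which the visitor feels {emo} "
    f"about problems in the area of {area}."
    for emo, area in itertools.product(emotions, areas)
]
assert len(prompts) == len(emotions) * len(areas)  # 16 * 28 = 448
```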

**III. Practical steps**

1. **Initialization**

* Install the required software and libraries

```bash
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
* Prepare the input data and configuration parameters

See `config.yml`; every option is annotated with comments.

2. **Model selection and configuration**

* Select a suitable model according to your needs
To let everyone play with a large model, we chose InternLM2-7B as our baseline model (it can be deployed and fine-tuned even on a consumer GPU).
* Make the necessary configurations and adjustments to the model
Fine-tune with XTuner according to our dataset and configuration strategy.

3. **Data generation**

* Generate data with Tongyi Qianwen
```bash
# run in the terminal
bash run_qwen.bash
```
* Generate data with Baidu Wenxin Yiyan
```bash
# run in the terminal
python ernie_gen_data.py
```
-* Generate data with Zhipu AI.
+* Generate data with Zhipu GLM
```bash
# run in the terminal
python zhipuai_gen_data.py
```
* Generate data with iFlytek Spark
```bash
# run in the terminal
python ./xinghuo/gen_data.py
```

4. **Integrating the self-cognition dataset**

-* The self-cognition dataset just has to be written out by hand in the right format, ya~ the format below is fine
+* The self-cognition dataset must be generated manually according to the format; the format below works
```json
[
{
@@ -85,16 +87,17 @@
]
```

5. **Dataset integration**

Before integrating the datasets, we check the generated data for formatting errors, type mismatches, and similar issues; check.py does this. Finally, merge_json.py merges all the JSON files into one overall JSON file.

-6. **Evaluation and optimization**
+7. **Evaluation and optimization**

* Evaluate the generated dataset with appropriate evaluation metrics
* Optimize and adjust as needed based on the evaluation results

7. **Testing and deployment**

* Evaluate the trained model on an independent test set
* Adjust and optimize as needed based on the test results
* Deploy the final model to a real application
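
Each script in step 3 wraps one vendor's API. As a rough illustration of what such a script does, here is a hypothetical sketch in the spirit of zhipuai_gen_data.py; it assumes the zhipuai v2 Python SDK and a ZHIPUAI_API_KEY environment variable, and the prompt text, model name, and output file name are invented rather than taken from the repository.

```python
# Hypothetical sketch of a generation script like zhipuai_gen_data.py.
# Assumes: pip install zhipuai   (v2 SDK) and ZHIPUAI_API_KEY set in the env.
import json
import os

from zhipuai import ZhipuAI

client = ZhipuAI(api_key=os.environ["ZHIPUAI_API_KEY"])

def gen_dialogue(emotion: str, area: str) -> str:
    """Ask the GLM model for one counseling dialogue for a scenario."""
    prompt = (f"You are a psychological counselor. Write a single-turn "
              f"counseling dialogue about {emotion} in the context of {area}.")
    resp = client.chat.completions.create(
        model="glm-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

samples = [{"emotion": "anxiety", "area": "work",
            "dialogue": gen_dialogue("anxiety", "work")}]
with open("zhipuai_samples.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```
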
generate_data/tutorial_EN.md (58 changes: 32 additions & 26 deletions)
@@ -1,69 +1,73 @@
-# EMO Psychological large model fine-tuning data generation tutorial
+# EmoLLM fine-tuning data generation tutorial

**I. Objectives and Background**

-In order to have a better representation of our large mental models, we must have high quality data sets. To achieve this goal, we decided to use four powerful AI grand models: Wenxin Yiyi, Tongyi Qianwen, Feifei Spark, and NXP AI to generate conversation data. In addition, we will enhance the cognitive depth of the dataset and improve the generalization ability of the model by adding a small number of self-cognitive datasets.
+In order to give our psychological large model better expressive power, we must have high-quality datasets. To achieve this goal, we decided to use four powerful AI large models, **Wenxin Yiyan**, **Tongyi Qianwen**, **iFlytek Spark**, and **Zhipu GLM**, to generate conversation data. In addition, we will enhance the cognitive depth of the dataset and improve the model's generalization ability by adding a small number of self-cognition datasets.

-**II. Data set generation method**
+**II. Dataset generation method**

1. **Model selection and data preparation**

-Choose four big language models, namely Wenxin Yiyi, Tongyi Qianwen, IFei Spark and Zhipu, obtain the API to call the corresponding interface, and prepare to generate dialogue data.
-2. **Single round and multiple round dialogue data generation**
+Choose the four large language models Wenxin Yiyan, Tongyi Qianwen, iFlytek Spark, and Zhipu GLM, obtain the APIs for the corresponding interfaces, and prepare to generate dialogue data.
+3. **Single-turn and multi-turn dialogue data generation**

-Using these four models, we generated 10,000 single - and multi-round conversation data. In doing so, we ensure the diversity, complexity and validity of our data.
+Using these four models, we generated 10,000 single-turn and multi-turn conversation samples. In doing so, we ensured the diversity, complexity, and validity of our data.

-Because mental activity is often complex, in order to ensure the diversity of data. We selected a total of 16 * 28 `448` scenarios for data set generation. For specific scenario names, please refer to the configuration of the two parameters`emotions_list and areas_of_life`in config.yml.
-3. **Inclusion of self-perception datasets**
+Because mental activity is often complex, and to ensure data diversity, we selected a total of 16 * 28 = `448` scenarios for dataset generation. For the specific scenario names, refer to the `emotions_list` and `areas_of_life` parameters in config.yml (see the cross-product sketch after the Chinese tutorial above).
+4. **Inclusion of self-perception datasets**

-In order to enhance the cognitive ability of the model, we specially added a part of self-cognitive data set. These data sets help the model better understand the context and improve the naturalness and coherence of the conversation.
+In order to enhance the model's cognitive ability, we added a small self-cognition dataset. These data help the model better understand context and improve the naturalness and coherence of the conversation.
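
Since the JSON sample in step 4 of the practical steps below is collapsed in this diff, here is a hedged illustration of what an XTuner-style self-cognition record generally looks like; the field values are invented, and the exact keys in the repository's file may differ.

```json
[
    {
        "conversation": [
            {
                "system": "You are EmoLLM, a mental-health assistant.",
                "input": "Who are you?",
                "output": "I am EmoLLM, a mental-health large model fine-tuned to offer supportive counseling."
            }
        ]
    }
]
```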

**III. Practical steps**

1. **Initialize**

* Install the required software and libraries

```bash
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
* Prepare the input data and configuration parameters

See `config.yml`; every option is annotated with comments.

2. **Model selection and configuration**

* Select the right model for your needs

In order to let everyone play with a large model, we chose InternLM2-7B as our baseline model (it can be deployed and fine-tuned even on a consumer GPU).

-* Make necessary configuration and adjustments to the model.
-Use XTuner for fine-tuning based on our data set and configuration strategy
+* Make necessary configurations and adjustments to the model
+Use XTuner for fine-tuning based on our dataset and configuration strategy.

3. **Data generation**

-* Data generation using Tongyi Qianwen large model.
+* Data generation using Tongyi Qianwen
```bash
# run in the terminal
bash run_qwen.bash
```
-* Use Baidu Wenxin large model for data generation.
+* Data generation using Wenxin Yiyan
```bash
# run in the terminal
python ernie_gen_data.py
```
-* Data generation using the NXP AI large model.
+* Data generation using Zhipu GLM
```bash
# run in the terminal
python zhipuai_gen_data.py
```
-* Use IFlystar Fire model for data generation.
+* Data generation using iFlytek Spark
```bash
# run in the terminal
python ./xinghuo/gen_data.py
```

4. **Integration of self-cognition datasets**

-* Self-cognition data set this needs to be manually generated in accordance with the format, the following format can be.
+* The self-cognition dataset must be generated manually according to the format; the format below works
```json
[
{
@@ -85,16 +89,18 @@
]
```

-5. **Data set integration.**
+5. **Dataset integration**

-Before data set integration, we need to check whether the generated data has formatting errors, type mismatches, etc. We need check.py to check the data. Finally, merge_json.py is used to combine all the json into one overall json file.
+Before dataset integration, we check the generated data for formatting errors, type mismatches, and similar issues; check.py does this. Finally, merge_json.py combines all the JSON files into one overall JSON file. (A sketch of this check-and-merge step appears at the end of this tutorial.)
6. **Evaluation and optimization**

* Evaluate the generated dataset using appropriate evaluation metrics
* Make necessary optimizations and adjustments based on the evaluation results

7. **Testing and deployment**

* Evaluate the trained model using an independent test set
* Make necessary adjustments and optimizations based on test results
* Deploy the final model into a real application
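
Step 5 of both tutorials describes a check-then-merge flow (check.py followed by merge_json.py), the sketch promised above. The code below is a hypothetical stand-in for that flow, not the repository's code: the generated/ folder, the required conversation/input/output keys, and the merged.json output name are all assumptions.

```python
# Hypothetical sketch of the check (check.py) + merge (merge_json.py) step:
# validate every generated JSON file, then concatenate them into one file.
import glob
import json

def check_file(path: str) -> list:
    """Load one generated file and verify the expected structure."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)                      # raises on malformed JSON
    assert isinstance(data, list), f"{path}: top level must be a list"
    for item in data:
        conv = item["conversation"]              # assumed required key
        assert isinstance(conv, list) and conv, f"{path}: empty conversation"
        for turn in conv:
            assert "input" in turn and "output" in turn, \
                f"{path}: each turn needs input/output"
    return data

merged = []
for path in sorted(glob.glob("generated/*.json")):  # assumed output folder
    merged.extend(check_file(path))

with open("merged.json", "w", encoding="utf-8") as f:
    json.dump(merged, f, ensure_ascii=False, indent=2)
print(f"merged {len(merged)} samples")
```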