Update RAG README #66

Merged: 4 commits, Mar 14, 2024
1 change: 1 addition & 0 deletions README.md
@@ -218,6 +218,7 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
| [Mxoder](https://github.com/Mxoder) | Beihang University, Undergraduate student | | |
| [Anooyman](https://github.com/Anooyman) | Nanjing University of Science and Technology, Master's student | | |
| [Vicky-3021](https://github.com/Vicky-3021) | Xidian University, Master's student (Research Year 0) | | |
| [SantiagoTOP](https://github.com/santiagoTOP) | Taiyuan University of Technology, Master's student | | |

### Copyright Notice

23 changes: 13 additions & 10 deletions README_EN.md
@@ -45,14 +45,14 @@

<div align="center">

| 模型 | 类型 |
| Model | Type |
| :-------------------: | :------: |
| InternLM2_7B_chat | QLORA |
| InternLM2_7B_chat | 全量微调 |
| InternLM2_1_8B_chat | 全量微调 |
| InternLM2_7B_chat | full fine-tuning |
| InternLM2_1_8B_chat | full fine-tuning |
| InternLM2_20B_chat | LORA |
| Qwen_7b_chat | QLORA |
| Qwen1_5-0_5B-Chat | 全量微调 |
| Qwen1_5-0_5B-Chat | full fine-tuning |
| Baichuan2_13B_chat | QLORA |
| ChatGLM3_6B | LORA |
| DeepSeek MoE_16B_chat | QLORA |
@@ -120,23 +120,25 @@ The Model aims to fully understand and promote the mental health of individuals,
## Contents

- [EmoLLM - Large Language Model for Mental Health](#emollm---large-language-model-for-mental-health)
- [Everyone is welcome to contribute to this project ~](#everyone-is-welcome-to-contribute-to-this-project-)
- [Recent Updates](#recent-updates)
- [Roadmap](#roadmap)
- [Contents](#contents)
- [Pre-development Configuration Requirements.](#pre-development-configuration-requirements)
- [**User Guide**](#user-guide)
- [Pre-development Configuration Requirements.](#pre-development-configuration-requirements)
- [**User Guide**](#user-guide)
- [File Directory Explanation](#file-directory-explanation)
- [Data Construction](#data-construction)
- [Fine-tuning Guide](#fine-tuning-guide)
- [Deployment Guide](#deployment-guide)
- [RAG]()
- [RAG (Retrieval Augmented Generation) Pipeline](#rag-retrieval-augmented-generation-pipeline)
- [Frameworks Used](#frameworks-used)
- [How to participate in this project](#how-to-participate-in-this-project)
- [How to participate in this project](#how-to-participate-in-this-project)
- [Version control](#version-control)
- [Authors (in no particular order)](#authors-in-no-particular-order)
- [Copyright Notice](#copyright-notice)
- [Acknowledgments](#acknowledgments)
- [Star History](#star-history)
- [🌟 Contributors](#-contributors)
- [Communication group](#communication-group)

###### Pre-development Configuration Requirements.

@@ -235,10 +237,11 @@ This project uses Git for version control. You can see the currently available v
| [ZeyuBa](https://github.com/ZeyuBa) | Institute of Automation, Master's student | | |
| [aiyinyuedejustin](https://github.com/aiyinyuedejustin) | University of Pennsylvania, Master's student | | |
| [Nobody-ML](https://github.com/Nobody-ML) | China University of Petroleum (East China), Undergraduate student | | |
| [chg0901](https://github.com/chg0901) | Kongju University, Doctoral student (South Korea) | | |
| [chg0901](https://github.com/chg0901) | [MiniSora](https://github.com/mini-sora/minisora) |Maintainer and Admin|Data Cleaning and Docs Translation|
| [Mxoder](https://github.com/Mxoder) | Beihang University, Undergraduate student | | |
| [Anooyman](https://github.com/Anooyman) | Nanjing University of Science and Technology, Master's student | | |
| [Vicky-3021](https://github.com/Vicky-3021) | Xidian University, Master's student (Research Year 0) | | |
| [SantiagoTOP](https://github.com/santiagoTOP) | Taiyuan University of Technology, Master's student | | |


### Copyright Notice
102 changes: 80 additions & 22 deletions scripts/qa_generation/README.md
@@ -1,37 +1,95 @@
# QA Generation Pipeline
# RAG Database Construction Process

## 1. Usage
## **Purpose**

1. Check that the dependencies in `requirements.txt` are satisfied.
2. Adjust the `system_prompt` in the code to match the latest version of the repo, to ensure the diversity and stability of the generated QA pairs.
3. Put the txt files into the `data` folder at the same level as `model`.
4. Configure the required API KEY in `config/config.py` and launch from `main.py`. The generated QA pairs are stored in jsonl format under `data/generated`.
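The jsonl output under `data/generated` can be consumed with a few lines of Python; this is an illustrative sketch only (the `question`/`answer` field names are assumptions — the actual schema depends on the `system_prompt` used during generation):

```python
import json

# Hypothetical sample line from data/generated/*.jsonl; the real field
# names depend on the system_prompt used during generation.
sample_line = '{"question": "What is cognitive behavioral therapy?", "answer": "CBT is a structured, evidence-based talk therapy."}'

def load_qa_pairs(lines):
    """Parse jsonl lines into a list of QA dicts, skipping blank lines."""
    pairs = []
    for line in lines:
        line = line.strip()
        if line:
            pairs.append(json.loads(line))
    return pairs

pairs = load_qa_pairs([sample_line, ""])
```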
Build QA knowledge pairs from professional psychology books to give RAG a psychological-counseling knowledge base, making EmoLLM's answers more professional and reliable. To achieve this goal we use dozens of psychology books to build the RAG knowledge base. The main construction process is as follows:

### 1.1 How to obtain an API KEY
## **Build Process**

Currently only qwen is supported.
## **Step 1: PDF to TXT**

#### 1.1.1 Qwen
- Purpose
- Convert the collected PDF psychology books into TXT text files to facilitate later information extraction.

Go to [DashScope API-KEY Management (aliyun.com)](https://dashscope.console.aliyun.com/apiKey), click "Create new API-KEY", and fill the obtained API KEY into `DASHSCOPE_API_KEY` in `config/config.py`.
- Required tools

## 2. Notes
- [pdf2txt](https://github.com/SmartFlowAI/EmoLLM/blob/main/scripts/pdf2txt.py)

### 2.1 System Prompt
- [PaddleOCR PDF processing usage reference](https://github.com/SmartFlowAI/EmoLLM/blob/main/generate_data/OCR.md)

- Install the necessary Python libraries

```shell
pip install paddlepaddle
pip install opencv-python
pip install paddleocr
```

Note that the current parsing scheme assumes the model generates markdown-wrapped JSON blocks; when changing the system prompt, make sure this assumption still holds.
- Notes
- If paddleocr cannot be installed with **pip install paddleocr**, consider installing from a whl file, [download address](https://pypi.org/project/paddleocr/#files)
- The script is launched from the command line: python pdf2txt.py [PDF file name]

### 2.2 Sliding Window
## **Step 2: Filter PDFs**

Both `window_size` and `overlap_size` of the sliding window can be changed in the `get_txt_content` function in `util/data_loader.py`. Currently the sliding window is split by sentence.
- Purpose of filtering

### 2.3 Corpus Format
- Use an LLM to remove books that are not professional psychology books

Currently only txt format is supported. Put the cleaned book texts into the `data` folder, and the program will recursively retrieve all txt files under that folder.
- Filtering criteria: the book contains counseling-related content, such as:

## TODO
- Schools of counseling - specific counseling methods
- Mental illness - disease characteristics
- Mental illness - treatment methods

1. Support more models (Gemini, GPT, ChatGLM...)
2. Support multi-threaded model calls
3. Support more text formats (PDF...)
4. Support more ways of splitting text
- Filtering method:

- Initial screening by title

- If you cannot tell whether a book is counseling-related, use kimi/GLM-4 to check whether it contains counseling-related knowledge (checking one book at a time is recommended)

- ```markdown
Reference prompt:
You are an experienced psychology professor, familiar with psychology and psychological counseling. I need your help with the task "identify whether a book contains counseling knowledge"; please take a deep breath, think step by step, and give your answer. If your answer satisfies me, I will give you a 100,000 tip!
The specific task is as follows:
Determine whether the book contains the following counseling-related knowledge:
'''
Schools of counseling - specific counseling methods
Mental illness - disease characteristics
Mental illness - treatment methods
'''
Please take a deep breath, review the book step by step, and complete the task carefully.
```
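A hypothetical helper for assembling this screening prompt for a single book (the condensed template wording, the `excerpt` parameter, and the truncation limit are all illustrative assumptions, not part of the original pipeline):

```python
# Condensed version of the reference screening prompt; the wording and the
# 2000-character excerpt limit are assumptions for illustration only.
SCREEN_PROMPT_TEMPLATE = (
    "You are an experienced psychology professor, familiar with psychology "
    "and psychological counseling. Determine whether the following book "
    "contains counseling-related knowledge (schools of counseling, "
    "disease characteristics, treatment methods).\n"
    "Book title: {title}\n"
    "Excerpt:\n{excerpt}\n"
)

def build_screening_prompt(title, excerpt, max_chars=2000):
    """Fill the template, truncating the excerpt to keep the prompt short."""
    return SCREEN_PROMPT_TEMPLATE.format(title=title, excerpt=excerpt[:max_chars])

prompt = build_screening_prompt("The Gift of Therapy", "Sample text. " * 500)
```

The resulting string would then be sent to kimi/GLM-4, one book at a time as recommended above.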


## **Step 3: Extract QA pairs**

- Use an LLM to efficiently construct QA knowledge pairs from the book content
- Extraction process

- Prepare the processed txt text data
- Configure the [script files](https://github.com/SmartFlowAI/EmoLLM/tree/main/scripts/qa_generation) as required
- Adjust window_size and overlap_size according to your needs or the extraction results

- Usage
- Check that the dependencies in `requirements.txt` are satisfied.
- Adjust the `system_prompt` in the code to match the latest version of the repo, to ensure the diversity and stability of the generated QA pairs.
- Put the txt files into the `data` folder at the same level as `model`.
- Configure the required API KEY in `config/config.py` and launch from `main.py`. The generated QA pairs are stored in jsonl format under `data/generated`.

- How to obtain an API KEY
- Currently only qwen is supported.
- Qwen
- Go to [DashScope API-KEY Management (aliyun.com)](https://dashscope.console.aliyun.com/apiKey), click "Create new API-KEY", and fill the obtained API KEY into `DASHSCOPE_API_KEY` in `config/config.py`.

- Notes
- System Prompt
- Note that the current parsing scheme assumes the model generates markdown-wrapped JSON blocks; when changing the system prompt, make sure this assumption still holds.
- Sliding Window
- Both `window_size` and `overlap_size` of the sliding window can be changed in the `get_txt_content` function in `util/data_loader.py`. Currently the sliding window is split by sentence.

- Corpus Format
- Currently only txt format is supported. Put the cleaned book texts into the `data` folder, and the program will recursively retrieve all txt files under that folder.

## **Step 4: Clean QA pairs**

- Purpose of cleaning
102 changes: 80 additions & 22 deletions scripts/qa_generation/README_EN.md
@@ -1,37 +1,95 @@
# QA Generation Pipeline
# RAG Database Building Process

## 1. Usage
## **Purpose**

1. Check that the dependencies in `requirements.txt` are satisfied.
2. Adjust the `system_prompt` in the code to match the latest version of the repo, to ensure the diversity and stability of the generated QA pairs.
3. Put the txt files into the `data` folder at the same level as `model`.
4. Configure the required API KEY in `config/config.py` and launch from `main.py`. The generated QA pairs are stored in jsonl format under `data/generated`.
Build QA knowledge pairs from professional psychology books to give RAG a psychological-counseling knowledge base, making EmoLLM's answers more professional and reliable. To achieve this goal we use dozens of psychology books to build the RAG knowledge base. The main construction process is as follows:

### 1.1 How to obtain an API KEY
## **Build Process**

Currently only qwen is supported.
## **Step 1: PDF to TXT**

#### 1.1.1 Qwen
- Purpose
- Convert the collected PDF psychology books into TXT text files to facilitate later information extraction.

Go to [DashScope API-KEY Management (aliyun.com)](https://dashscope.console.aliyun.com/apiKey), click "Create new API-KEY", and fill the obtained API KEY into `DASHSCOPE_API_KEY` in `config/config.py`.
- Required tools

## 2. Notes
- [pdf2txt](https://github.com/SmartFlowAI/EmoLLM/blob/main/scripts/pdf2txt.py)

### 2.1 System Prompt
- [PaddleOCR PDF processing usage reference](https://github.com/SmartFlowAI/EmoLLM/blob/main/generate_data/OCR.md)

- Install the necessary Python libraries

```shell
pip install paddlepaddle
pip install opencv-python
pip install paddleocr
```

Note that the current parsing scheme assumes the model generates markdown-wrapped JSON blocks; when changing the system prompt, make sure this assumption still holds.
- Notes
- If paddleocr cannot be installed with **pip install paddleocr**, consider installing from a whl file, [download address](https://pypi.org/project/paddleocr/#files)
- The script is launched from the command line: python pdf2txt.py [PDF file name]

### 2.2 Sliding Window
## **Step 2: Filter PDFs**

Both `window_size` and `overlap_size` of the sliding window can be changed in the `get_txt_content` function in `util/data_loader.py`. Currently the sliding window is split by sentence.
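The sentence-split sliding window can be sketched as follows; this is a simplified stand-in for `get_txt_content`, not the repo's actual implementation:

```python
import re

def sliding_window_chunks(text, window_size=4, overlap_size=1):
    # Split on sentence-ending punctuation (ASCII and fullwidth), then
    # emit windows of window_size sentences overlapping by overlap_size.
    sentences = [s for s in re.split(r"(?<=[.!?。!?])\s*", text) if s]
    step = window_size - overlap_size
    chunks = []
    for i in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[i:i + window_size]))
        if i + window_size >= len(sentences):
            break
    return chunks

chunks = sliding_window_chunks("One. Two. Three. Four. Five.", window_size=3, overlap_size=1)
```

A larger `overlap_size` repeats more context between adjacent chunks, at the cost of more model calls per book.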
- Purpose of filtering

### 2.3 Corpus Format
- Use an LLM to remove books that are not professional psychology books

At present, only txt format is supported, and the cleaned book text can be placed under the `data` folder, and the program will recursively retrieve all txt files under the folder.
- Filtering criteria: the book contains counseling-related content, such as:

## TODO
- Schools of Counseling - Specific Counseling Methods
- Mental Illness - Characteristics of the Disease
- Mental Illness - Treatment

1. Support more models (Gemini, GPT, ChatGLM...)
2. Support multi-threaded model calls
3. Support more text formats (PDF...)
4. Support more ways to split text
- Filtering method:

- Initial screening based on title

- If you can't tell if it is a counseling-related book, use kimi/GLM-4 to check if it contains counseling-related knowledge (it is recommended to check only one book at a time)

- ```markdown
Reference prompt:
You are an experienced psychology professor who is familiar with psychology and counseling. I need you to help me with the task "identify whether a book contains counseling knowledge"; take a deep breath, think step by step, and give me your answer. If your answer satisfies me, I will give you a 100,000 tip!
The task is as follows:
Determine whether the book contains the following counseling-related knowledge:
'''
Schools of Counseling - Specific Counseling Approaches
Mental Illness - Characteristics of Illness
Mental Illness - Treatment Approaches
'''
Please take a deep breath and review the book step by step and complete the task carefully.
```


## **Step 3: Extract QA pairs**

- Use an LLM to efficiently construct QA knowledge pairs from the book content
- Extraction process

- Prepare processed txt text data
- Configure the [script files](https://github.com/SmartFlowAI/EmoLLM/tree/main/scripts/qa_generation) as required
- Modify window_size and overlap_size reasonably according to your own needs or extraction results.

- Usage
- Check that the dependencies in `requirements.txt` are satisfied.
- Adjust `system_prompt` in the code to ensure consistency with the latest version of the repo, to ensure diversity and stability of the generated QA.
- Place the txt file in the `data` folder in the same directory as the `model`.
- Configure the required API KEYs in `config/config.py` and start from `main.py`. The generated QA pairs are stored in jsonl format under `data/generated`.

- How to obtain an API KEY
- Currently only qwen is included.
- Qwen
- Go to [DashScope API-KEY Management (aliyun.com)](https://dashscope.console.aliyun.com/apiKey), click "Create new API-KEY", and fill the obtained API KEY into `DASHSCOPE_API_KEY` in `config/config.py`.

- Notes
- System Prompt
- Note that the current parsing scheme is based on the premise that the model generates markdown-wrapped json blocks, and you need to make sure that this remains true when you change the system prompt.
- Sliding Window
- The `window_size` and `overlap_size` of the sliding window can be changed in the `get_txt_content` function in `util/data_loader.py`. Currently the sliding window is split by sentence.

- Corpus Format
- Currently only the txt format is supported. Put the cleaned book texts in the `data` folder, and the program will recursively retrieve all the txt files in that folder.
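The markdown-wrapped JSON assumption in the System Prompt note above can be handled by a parser along these lines (a sketch, not the repo's actual parsing code):

```python
import json
import re

def extract_json_blocks(model_output):
    # Find every ```json ... ``` (or bare ```) fenced block in the model's
    # reply and parse it; json.loads raises if a block is not valid JSON.
    blocks = re.findall(r"```(?:json)?\s*(.*?)```", model_output, re.DOTALL)
    return [json.loads(block) for block in blocks]

reply = 'Here are the pairs:\n```json\n[{"question": "Q1", "answer": "A1"}]\n```'
qa = extract_json_blocks(reply)
```

If the system prompt is changed so the model no longer wraps JSON in a fence, this extraction step breaks, which is why the note above asks to preserve that behavior.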

## **Step 4: Clean QA pairs**

- Purpose of cleaning