update RAG pipeline #106

Merged
merged 22 commits into from
Mar 19, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -8,6 +8,7 @@ pdf/

*.jsonl
*.json
*.txt
# ./generate_data/*.jsonl
# ./generate_data/*/*/*.jsonl

40 changes: 21 additions & 19 deletions README.md
@@ -210,25 +210,27 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git

### Authors (in no particular order)

-| Username | School/Organization | Remarks | Contributions |
-| :----------: | :--------------------: | :-------------------: | :----------: |
-| [aJupyter](https://github.com/aJupyter) | Nankai University, Master's student | DataWhale member | Project initiator |
-| [jujimeizuo](https://github.com/jujimeizuo) | Jiangnan University, Master's student | | |
-| [Smiling-Weeping-zhr](https://github.com/Smiling-Weeping-zhr) | Harbin Institute of Technology (Weihai), Undergraduate student | | |
-| [8baby8](https://github.com/8baby8) | PaddlePaddle Pilot Team Regional Director | Wenxin Large Model core developer | |
-| [zxazys](https://github.com/zxazys) | Nankai University, Master's student | | |
-| [MING-ZCH](https://github.com/MING-ZCH) | Huazhong University of Science and Technology, Undergraduate student | | |
-| [JasonLLLLLLLLLLL](https://github.com/JasonLLLLLLLLLLL) | swufe | | |
-| [MrCatAI](https://github.com/MrCatAI) | AI Mover | | |
-| [ZeyuBa](https://github.com/ZeyuBa) | Institute of Automation, Master's student | | |
-| [aiyinyuedejustin](https://github.com/aiyinyuedejustin) | University of Pennsylvania, Master's student | | |
-| [Nobody-ML](https://github.com/Nobody-ML) | China University of Petroleum (East China), Undergraduate student | | |
-| [chg0901](https://github.com/chg0901) | [MiniSora](https://github.com/mini-sora/minisora/) |MiniSora main maintainer|Data cleaning, document translation|
-| [Mxoder](https://github.com/Mxoder) | Beihang University, Undergraduate student | | |
-| [Anooyman](https://github.com/Anooyman) | Nanjing University of Science and Technology, Master's student | | |
-| [Vicky-3021](https://github.com/Vicky-3021) | Xidian University, Master's student (Research Year 0) | | |
-| [SantiagoTOP](https://github.com/santiagoTOP) | Taiyuan University of Technology, Master's student | | |
-| [zealot52099](https://github.com/zealot52099) | AI Mover | |Data cleaning, RAG|
+| Username | School/Organization | Remarks | Contributions |
+|:-------------------------------------------------------------:|:--------------------------------------------------:| :-------------------: | :----------: |
+| [aJupyter](https://github.com/aJupyter) | Nankai University, Master's student | DataWhale member | Project initiator |
+| [jujimeizuo](https://github.com/jujimeizuo) | Jiangnan University, Master's student | | |
+| [Smiling-Weeping-zhr](https://github.com/Smiling-Weeping-zhr) | Harbin Institute of Technology (Weihai), Undergraduate student | | |
+| [8baby8](https://github.com/8baby8) | PaddlePaddle Pilot Team Regional Director | Wenxin Large Model core developer | |
+| [zxazys](https://github.com/zxazys) | Nankai University, Master's student | | |
+| [MING-ZCH](https://github.com/MING-ZCH) | Huazhong University of Science and Technology, Undergraduate student | | |
+| [JasonLLLLLLLLLLL](https://github.com/JasonLLLLLLLLLLL) | swufe | | |
+| [MrCatAI](https://github.com/MrCatAI) | AI Mover | | |
+| [ZeyuBa](https://github.com/ZeyuBa) | Institute of Automation, Master's student | | |
+| [aiyinyuedejustin](https://github.com/aiyinyuedejustin) | University of Pennsylvania, Master's student | | |
+| [Nobody-ML](https://github.com/Nobody-ML) | China University of Petroleum (East China), Undergraduate student | | |
+| [chg0901](https://github.com/chg0901) | [MiniSora](https://github.com/mini-sora/minisora/) |MiniSora main maintainer|Data cleaning, document translation|
+| [Mxoder](https://github.com/Mxoder) | Beihang University, Undergraduate student | | |
+| [Anooyman](https://github.com/Anooyman) | Nanjing University of Science and Technology, Master's student | | |
+| [Vicky-3021](https://github.com/Vicky-3021) | Xidian University, Master's student (Research Year 0) | | |
+| [SantiagoTOP](https://github.com/santiagoTOP) | Taiyuan University of Technology, Master's student | | |
+| [zealot52099](https://github.com/zealot52099) | AI Mover | |Data cleaning, RAG|
+| [wwwyfff](https://github.com/wwwyfff) | Fudan University, Master's student | ||
+| [jkhumor](https://github.com/jkhumor) | Nankai University, Master's student | |RAG|

### Copyright Notice

40 changes: 21 additions & 19 deletions README_EN.md
@@ -226,25 +226,27 @@ This project uses Git for version control. You can see the currently available v

### Authors (in no particular order)

-| Username | School/Organization | Remarks | Contributions |
-| :-------: | :-------------------: | :------------------: | :--------: |
-| [aJupyter](https://github.com/aJupyter) | Nankai University, Master's student | DataWhale member | Project initiator |
-| [jujimeizuo](https://github.com/jujimeizuo) | Jiangnan University, Master's student | | |
-| [Smiling-Weeping-zhr](https://github.com/Smiling-Weeping-zhr) | Harbin Institute of Technology (Weihai), Undergraduate student | | |
-| [8baby8](https://github.com/8baby8) | PaddlePaddle Pilot Team Regional Director | Wenxin Large Model core developer | |
-| [zxazys](https://github.com/zxazys) | Nankai University, Master's student | | |
-| [MING-ZCH](https://github.com/MING-ZCH) | Huazhong University of Science and Technology, Undergraduate student | | |
-| [JasonLLLLLLLLLLL](https://github.com/JasonLLLLLLLLLLL) | SWUFE (Southwestern University of Finance and Economics) | | |
-| [MrCatAI](https://github.com/MrCatAI) | AI Mover | | |
-| [ZeyuBa](https://github.com/ZeyuBa) | Institute of Automation, Master's student | | |
-| [aiyinyuedejustin](https://github.com/aiyinyuedejustin) | University of Pennsylvania, Master's student | | |
-| [Nobody-ML](https://github.com/Nobody-ML) | China University of Petroleum (East China), Undergraduate student | | |
-| [chg0901](https://github.com/chg0901) | [MiniSora](https://github.com/mini-sora/minisora) |Maintainer and Admin|Data Cleaning and Docs Translation|
-| [Mxoder](https://github.com/Mxoder) | Beihang University, Undergraduate student | | |
-| [Anooyman](https://github.com/Anooyman) | Nanjing University of Science and Technology, Master's student | | |
-| [Vicky-3021](https://github.com/Vicky-3021) | Xidian University, Master's student (Research Year 0) | | |
-| [SantiagoTOP](https://github.com/santiagoTOP) | Taiyuan University of Technology, Master's student | | |
-| [zealot52099](https://github.com/zealot52099) | AI Mover | |Data Processing and RAG|
+| Username | School/Organization | Remarks | Contributions |
+|:-------------------------------------------------------------:|:--------------------------------------------------------------------:| :------------------: | :--------: |
+| [aJupyter](https://github.com/aJupyter) | Nankai University, Master's student | DataWhale member | Project initiator |
+| [jujimeizuo](https://github.com/jujimeizuo) | Jiangnan University, Master's student | | |
+| [Smiling-Weeping-zhr](https://github.com/Smiling-Weeping-zhr) | Harbin Institute of Technology (Weihai), Undergraduate student | | |
+| [8baby8](https://github.com/8baby8) | PaddlePaddle Pilot Team Regional Director | Wenxin Large Model core developer | |
+| [zxazys](https://github.com/zxazys) | Nankai University, Master's student | | |
+| [MING-ZCH](https://github.com/MING-ZCH) | Huazhong University of Science and Technology, Undergraduate student | | |
+| [JasonLLLLLLLLLLL](https://github.com/JasonLLLLLLLLLLL) | SWUFE (Southwestern University of Finance and Economics) | | |
+| [MrCatAI](https://github.com/MrCatAI) | AI Mover | | |
+| [ZeyuBa](https://github.com/ZeyuBa) | Institute of Automation, Master's student | | |
+| [aiyinyuedejustin](https://github.com/aiyinyuedejustin) | University of Pennsylvania, Master's student | | |
+| [Nobody-ML](https://github.com/Nobody-ML) | China University of Petroleum (East China), Undergraduate student | | |
+| [chg0901](https://github.com/chg0901) | [MiniSora](https://github.com/mini-sora/minisora) |Maintainer and Admin|Data Cleaning and Docs Translation|
+| [Mxoder](https://github.com/Mxoder) | Beihang University, Undergraduate student | | |
+| [Anooyman](https://github.com/Anooyman) | Nanjing University of Science and Technology, Master's student | | |
+| [Vicky-3021](https://github.com/Vicky-3021) | Xidian University, Master's student (Research Year 0) | | |
+| [SantiagoTOP](https://github.com/santiagoTOP) | Taiyuan University of Technology, Master's student | | |
+| [zealot52099](https://github.com/zealot52099) | AI Mover | |Data Processing and RAG|
+| [wwwyfff](https://github.com/wwwyfff) | Fudan University, Master's student | ||
+| [jkhumor](https://github.com/jkhumor) | Nankai University, Master's student | |RAG|

### Copyright Notice

68 changes: 68 additions & 0 deletions datasets/deduplicate.py
@@ -0,0 +1,68 @@
import json
import os
from hashlib import md5

from datasketch import MinHash
from loguru import logger


def is_json_file(filename):
    return filename.endswith('.json')


# Exact match: two strings are duplicates if their MD5 digests are equal
def is_duplicate_absolutely(d1, d2):
    return md5(d1.encode('utf-8')).hexdigest() == md5(d2.encode('utf-8')).hexdigest()


# Compute a MinHash signature for a dict
def hash_dict(dict_obj):
    m = MinHash()
    for key, value in sorted(dict_obj.items()):
        # Non-str values must be converted to str before hashing
        m.update(str(value).encode('utf8'))
    return m


# Deduplicate a list of dicts using exact matching and MinHash similarity
def deduplicate_json(data_list, threshold=0.8):
    seen_hashes = []
    duplicates_removed = []

    for item in data_list:
        min_hash = hash_dict(item)

        # Exact-match deduplication
        if not any(is_duplicate_absolutely(str(item), str(existing)) for existing in duplicates_removed):
            # MinHash similarity deduplication
            has_similar = False
            for stored_min_hash, stored_text in seen_hashes:
                if stored_min_hash.jaccard(min_hash) > threshold:
                    has_similar = True
                    break
            if not has_similar:
                seen_hashes.append((min_hash, item))
                duplicates_removed.append(item)

    return duplicates_removed


if __name__ == '__main__':
    data_ai = 'qwen'
    root_dir = rf'./{data_ai}/'
    dedup_output_dir = os.path.join(root_dir, 'dedup')
    if not os.path.exists(root_dir):
        logger.error(f"folder {root_dir} does not exist")
    else:
        # Create the output folder only after confirming root_dir exists;
        # os.mkdir would fail if the parent folder were missing
        os.makedirs(dedup_output_dir, exist_ok=True)
        for file in os.listdir(root_dir):
            file_path = os.path.join(root_dir, file)
            if os.path.isfile(file_path):
                print(f'file name: {file_path}')
                if is_json_file(file_path):
                    with open(file_path, 'r', encoding='utf-8') as f:
                        data = json.load(f)
                    dedup_data = deduplicate_json(data)
                    with open(os.path.join(dedup_output_dir, 'dedup_' + file), 'w', encoding='utf-8') as output_file:
                        json.dump(dedup_data, output_file, ensure_ascii=False, indent=4)
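
The two-stage filter in `deduplicate.py` (exact match first, then a MinHash similarity check against a threshold) can be illustrated without `datasketch`. The following is a minimal standard-library sketch, not the script's implementation: the helper names (`canonical_digest`, `shingles`, `minhash_signature`, `estimate_jaccard`, `deduplicate`) are hypothetical, and hashing character 3-grams is an assumption made here, whereas the script feeds each whole value to `MinHash.update`.

```python
import hashlib
import json


def canonical_digest(record: dict) -> str:
    """MD5 of the record's canonical JSON, so key order does not matter."""
    text = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def shingles(record: dict, n: int = 3) -> set:
    """Character n-grams over all values (assumption: not what the script does)."""
    text = " ".join(str(value) for _, value in sorted(record.items()))
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_signature(items: set, num_perm: int = 128) -> list:
    """For each seeded hash function, keep the smallest hash over all items."""
    signature = []
    for seed in range(num_perm):
        prefix = str(seed).encode("utf-8") + b":"
        signature.append(
            min(hashlib.md5(prefix + item.encode("utf-8")).digest() for item in items)
        )
    return signature


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(records: list, threshold: float = 0.8) -> list:
    seen_digests = set()    # exact duplicates, checked in O(1)
    kept_signatures = []    # MinHash signatures of the records kept so far
    kept = []
    for record in records:
        digest = canonical_digest(record)
        if digest in seen_digests:
            continue
        signature = minhash_signature(shingles(record))
        if any(estimate_jaccard(signature, s) > threshold for s in kept_signatures):
            continue
        seen_digests.add(digest)
        kept_signatures.append(signature)
        kept.append(record)
    return kept
```

Keeping exact digests in a set avoids re-hashing every kept record for each new item. The similarity pass is still quadratic in the number of kept records, as it is in the script.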

40 changes: 40 additions & 0 deletions generate_data/merge_json.py
@@ -0,0 +1,40 @@
import json
import os


def save_merge_json(data_lis, file_path):
    with open(file_path, 'wt', encoding='utf-8') as file:
        json.dump(data_lis, file, ensure_ascii=False)


def get_all_file_paths(folder_path):
    # Make sure the argument is a directory
    if not os.path.isdir(folder_path):
        raise ValueError(f"{folder_path} is not a valid directory")

    # Collect the paths of all files directly inside the folder
    file_paths = [os.path.join(folder_path, file) for file in os.listdir(
        folder_path) if os.path.isfile(os.path.join(folder_path, file))]
    return file_paths


if __name__ == '__main__':
    conversion_lis = []

    for path in get_all_file_paths(r'data\res-aiwei'):
        print(path)

        with open(path, 'rt', encoding='utf-8') as file:
            for line in file:
                # Strip the trailing newline
                line = line.rstrip('\n')
                # Parse the line as JSON
                try:
                    data = json.loads(line)
                    conversion_lis.append(data)
                except json.JSONDecodeError as e:
                    print(f"Error decoding JSON: {e}")
    save_merge_json(data_lis=conversion_lis,
                    file_path=r'.\merge.json')
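
The merge step above reads each file line by line, parses every line as one JSON object, and writes the collected objects out as a single JSON array. A small self-contained sketch of that flow, run against a temporary folder, might look like this. The function names `merge_jsonl_files` and `demo`, the `.jsonl` suffix filter, and the sample records are assumptions for illustration; `merge_json.py` itself reads every file in the folder.

```python
import json
import tempfile
from pathlib import Path


def merge_jsonl_files(folder: Path, suffix: str = ".jsonl") -> list:
    """Parse every line of every matching file as JSON and collect the results."""
    merged = []
    for path in sorted(folder.iterdir()):
        if not path.is_file() or path.suffix != suffix:
            continue
        with path.open("rt", encoding="utf-8") as fh:
            for line_no, line in enumerate(fh, start=1):
                line = line.strip()
                if not line:
                    continue  # ignore blank lines
                try:
                    merged.append(json.loads(line))
                except json.JSONDecodeError as exc:
                    # Report and skip malformed lines instead of aborting
                    print(f"{path.name}:{line_no}: skipped invalid JSON ({exc})")
    return merged


def demo() -> list:
    """Write two small JSONL files, merge them, and read the merged file back."""
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        (folder / "a.jsonl").write_text('{"id": 1}\n{"id": 2}\n', encoding="utf-8")
        (folder / "b.jsonl").write_text('{"id": 3}\nnot json\n', encoding="utf-8")
        merged = merge_jsonl_files(folder)
        out = folder / "merge.json"
        out.write_text(json.dumps(merged, ensure_ascii=False), encoding="utf-8")
        return json.loads(out.read_text(encoding="utf-8"))


if __name__ == "__main__":
    print(demo())
```

Using `pathlib` keeps the paths portable, whereas the hard-coded `r'data\res-aiwei'` and `r'.\merge.json'` in the script only resolve as intended on Windows.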
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
+# -*- coding: utf-8 -*-
+
 import json
 import os

@@ -21,7 +23,7 @@ def get_all_file_paths(folder_path, file_type='.jsonl'):
 if __name__ == '__main__':
     conversion_lis = []

-    folder_path = r'./'
+    folder_path = r'./'  # python merge_jsonl.py > curr.txt

     merge_path = folder_path.split('/')[-1]
     try:
@@ -32,7 +34,7 @@ def get_all_file_paths(folder_path, file_type='.jsonl'):


     for path in get_all_file_paths(folder_path):
-        print(path)
+        print(path.encode("utf-8"))

         with open(path, 'rt', encoding='utf-8') as file:
             for line in file:
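
One detail worth noting in the hunk above: with `folder_path = r'./'`, the expression `folder_path.split('/')[-1]` evaluates to an empty string, because the path ends in a slash. The sketch below contrasts the string split with `pathlib`. Both helper names are hypothetical and not part of the PR.

```python
from pathlib import PurePosixPath


def last_component_by_split(folder_path: str) -> str:
    """Same expression as in the diff: the text after the final '/'."""
    return folder_path.split('/')[-1]


def folder_name(folder_path: str) -> str:
    """pathlib alternative: a trailing slash does not hide the folder name."""
    return PurePosixPath(folder_path).name


if __name__ == "__main__":
    for candidate in ("./", "data/qwen", "data/qwen/"):
        print(repr(candidate), repr(last_component_by_split(candidate)), repr(folder_name(candidate)))
```

For `'data/qwen/'` the split yields `''` while `PurePosixPath(...).name` yields `'qwen'`; for `'./'` both yield an empty string.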
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
+# -*- coding: utf-8 -*-
+
 import json
 import os

@@ -36,11 +38,11 @@ def get_all_file_paths(folder_path, file_type='.jsonl'):
         merge_last_path = folder_path.split('/')[-2] if folder_path.split('/')[-2]!='.' else ''
     except:
         merge_last_path = ''
-    print(f'merge_path={merge_path},merge_last_path={merge_last_path}')
+    print(f'merge_path={merge_path},merge_last_path={merge_last_path}'.encode("utf-8"))


     for path in get_all_file_paths(folder_path):
-        print(path)
+        print(path.encode("utf-8"))

         with open(path, 'rt', encoding='utf-8') as file:
             for line in file:
@@ -67,9 +69,9 @@ def get_all_file_paths(folder_path, file_type='.jsonl'):
                         file_path=save_merge_json_path)

         final_list = final_list+conversion_lis
-        print(len(conversion_lis),len(final_list),save_merge_json_path)
+        print(f'{len(conversion_lis)},{len(final_list)},{save_merge_json_path}'.encode("utf-8"))

     save_merge_json(data_lis=final_list,file_path=save_final_merge_json_path)
-    print(save_final_merge_json_path)
+    print(len(conversion_lis),save_final_merge_json_path.encode("utf-8"))
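
The `print(... .encode("utf-8"))` changes in these hunks make `print` emit the `repr` of a `bytes` object, which is pure ASCII. That is presumably meant to avoid `UnicodeEncodeError` when output containing non-ASCII paths is redirected (for example `python merge_jsonl.py > curr.txt`) on a console with a non-UTF-8 encoding. That motivation is an inference, not something the PR states. The sketch below shows what such a call prints, plus one possible alternative that keeps the output readable. Both helper names are hypothetical.

```python
import sys


def printable_bytes(text: str) -> str:
    """What print(text.encode('utf-8')) displays: the repr of a bytes object."""
    return repr(text.encode("utf-8"))


def use_utf8_stdout() -> bool:
    """Try to switch stdout to UTF-8 so str values can be printed directly.

    Returns False when the current stdout object cannot be reconfigured.
    """
    reconfigure = getattr(sys.stdout, "reconfigure", None)
    if reconfigure is None:
        return False
    try:
        reconfigure(encoding="utf-8", errors="replace")
    except (ValueError, OSError):
        return False
    return True


if __name__ == "__main__":
    print(printable_bytes("a.jsonl"))     # ASCII names stay legible inside b'...'
    print(printable_bytes("数据.jsonl"))  # non-ASCII characters become \x.. escapes
    if use_utf8_stdout():
        print("数据.jsonl")               # readable once stdout is UTF-8
```

The trade-off is that the byte-string form is safe on any console but hard to read for Chinese file names, while reconfiguring stdout keeps names readable at the cost of depending on `TextIOWrapper.reconfigure` (Python 3.7+).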

