Awesome training/finetuning datasets for LLMs
- alpaca_data.json: 52K instruction-following examples generated by text-davinci-003 (from tatsu-lab/stanford_alpaca)
- Anthropic HH RLHF: human preference data on helpfulness and harmlessness, plus human-generated red-teaming data
- Stanford Human Preferences Dataset (SHP): 385K collective human preferences over responses to questions/instructions in 18 subject areas, from cooking to legal advice
- BELLE/1.0M, 0.5M: 1.5M Chinese examples generated by ChatGPT, organized into subsets covering different instruction types and domains (from LianjiaTech/BELLE)
- BELLE/School Math, Multiturn Chat: a 10M-example dataset released by the BELLE project; the math-exercise and multiturn-dialogue subsets are available so far (from LianjiaTech/BELLE)
- Alpaca-CoT datasets: a collection for finetuning based on the Stanford Alpaca project; includes the CoT, dialog, FastChat, and instinwild datasets (from PhoebusSi/Alpaca-CoT)
- ShareGPT Vicuna unfiltered: 48K ShareGPT conversations used to train Vicuna (from lm-sys/FastChat, issue #90)
- instinwild_en.json, instinwild_ch.json: a larger instruction set expanded from 479 seed instructions (from XueFuzhao/InstructionWild)
- lvwerra/stack-exchange-paired: a processed version of HuggingFaceH4/stack-exchange-preferences (from trl-lib/llama-7b-se-rl-peft)
- HealthCareMagic-200k: 200K real conversations between patients and doctors from HealthCareMagic.com (from Kent0n-Li/ChatDoctor)
- icliniq-15k: 15K real conversations between patients and doctors from icliniq.com (from Kent0n-Li/ChatDoctor)
- ADGEN dataset (Tsinghua mirror): generate advertising copy (summary) from a given product description (content) (from THUDM/ChatGLM-6B)
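Several of the instruction datasets above (alpaca_data.json, the instinwild files, Alpaca-CoT) share the Alpaca record schema: a JSON list of objects with `instruction`, `input`, and `output` keys, where `input` may be empty. The sketch below shows one way to render such records into (prompt, target) pairs for finetuning; the two prompt templates follow the ones published in the stanford_alpaca repo, and the function names are illustrative, not part of any of these projects.

```python
import json

# Prompt templates as published in tatsu-lab/stanford_alpaca; one variant
# for records with a non-empty "input" field, one for records without it.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_example(record):
    """Render one Alpaca-style record into (prompt, target) strings."""
    if record.get("input"):
        prompt = PROMPT_WITH_INPUT.format(**record)
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=record["instruction"])
    return prompt, record["output"]

def load_alpaca(path):
    """Load an alpaca_data.json-style file into (prompt, target) pairs."""
    with open(path) as f:
        return [build_example(r) for r in json.load(f)]
```

The same pair of templates is what the Alpaca training script uses, which is why finetunes on these datasets typically keep the `### Instruction:` / `### Response:` markers at inference time.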
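The preference datasets in the list (Anthropic HH RLHF, SHP, stack-exchange-paired) pair a better and a worse response for reward-model training. In the Anthropic hh-rlhf release, each record's `chosen` and `rejected` fields are full transcripts that share the same prompt prefix; a minimal sketch of recovering (prompt, chosen_response, rejected_response) triples, assuming that transcript format (the helper name is illustrative):

```python
import os

def split_preference_pair(chosen: str, rejected: str):
    """Split two transcripts sharing a prompt prefix into
    (prompt, chosen_response, rejected_response).

    Assumes the hh-rlhf style "\n\nHuman: ...\n\nAssistant: ..." layout,
    where the two transcripts diverge only in the final Assistant turn.
    """
    prefix = os.path.commonprefix([chosen, rejected])
    # The responses may start with the same characters, so cut the prefix
    # back to the last "Assistant:" marker to get a clean prompt boundary.
    cut = prefix.rfind("Assistant:")
    if cut != -1:
        prefix = prefix[: cut + len("Assistant:")]
    return prefix, chosen[len(prefix):], rejected[len(prefix):]
```

Other schemas differ: SHP stores the two responses in separate fields with vote counts, and stack-exchange-paired already ships prompt/response columns, so no prefix splitting is needed there.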