Awesome training/finetuning datasets for LLMs
- alpaca_data.json: 52K instruction-following examples generated by text-davinci-003 (from tatsu-lab/stanford_alpaca)
- Anthropic HH RLHF: human preference data on helpfulness and harmlessness, plus human-generated red-teaming data
- Stanford Human Preferences Dataset (SHP): 385K collective human preferences over responses to questions/instructions in 18 subject areas, from cooking to legal advice
- BELLE/1.0M, 0.5M: 1.5M Chinese examples generated by ChatGPT, organized into subsets covering different instruction types and domains (from LianjiaTech/BELLE)
- BELLE/School Math, Multiturn Chat: a 10M-example dataset released by the BELLE project; the math-exercise and multiturn-dialogue subsets are available so far (from LianjiaTech/BELLE)
- Alpaca-CoT datasets: a collection for finetuning based on the Stanford Alpaca project; includes the CoT, dialog, FastChat, and instinwild datasets (from PhoebusSi/Alpaca-CoT)
- ShareGPT Vicuna unfiltered: 48K ShareGPT conversations used to train Vicuna (from lm-sys/FastChat, issue #90)
- instinwild_en.json, instinwild_ch.json: a larger instruction set expanded from 479 seed instructions (from XueFuzhao/InstructionWild)
- lvwerra/stack-exchange-paired: a processed version of HuggingFaceH4/stack-exchange-preferences (from trl-lib/llama-7b-se-rl-peft)
- HealthCareMagic-200k: 200K real conversations between patients and doctors from HealthCareMagic.com (from Kent0n-Li/ChatDoctor)
- icliniq-15k: 15K real conversations between patients and doctors from icliniq.com (from Kent0n-Li/ChatDoctor)
- ADGEN dataset (Tsinghua mirror): generate advertising copy (summary) from a given product description (content) (from THUDM/ChatGLM-6B)
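Several of the instruction datasets above (alpaca_data.json, the instinwild files, Alpaca-CoT) share the Alpaca record schema: a JSON list of objects with `instruction`, `input`, and `output` keys, where `input` may be empty. The sketch below shows one way to render such records into (prompt, target) pairs for finetuning; the two prompt templates follow the ones published in the stanford_alpaca repo, and the function names are illustrative, not part of any of these projects.

```python
import json

# Prompt templates as published in tatsu-lab/stanford_alpaca; one variant
# for records with a non-empty "input" field, one for records without it.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_example(record):
    """Render one Alpaca-style record into (prompt, target) strings."""
    if record.get("input"):
        prompt = PROMPT_WITH_INPUT.format(**record)
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=record["instruction"])
    return prompt, record["output"]

def load_alpaca(path):
    """Load an alpaca_data.json-style file into (prompt, target) pairs."""
    with open(path) as f:
        return [build_example(r) for r in json.load(f)]
```

The same pair of templates is what the Alpaca training script uses, which is why finetunes on these datasets typically keep the `### Instruction:` / `### Response:` markers at inference time.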
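The preference datasets in the list (Anthropic HH RLHF, SHP, stack-exchange-paired) pair a better and a worse response for reward-model training. In the Anthropic hh-rlhf release, each record's `chosen` and `rejected` fields are full transcripts that share the same prompt prefix; a minimal sketch of recovering (prompt, chosen_response, rejected_response) triples, assuming that transcript format (the helper name is illustrative):

```python
import os

def split_preference_pair(chosen: str, rejected: str):
    """Split two transcripts sharing a prompt prefix into
    (prompt, chosen_response, rejected_response).

    Assumes the hh-rlhf style "\n\nHuman: ...\n\nAssistant: ..." layout,
    where the two transcripts diverge only in the final Assistant turn.
    """
    prefix = os.path.commonprefix([chosen, rejected])
    # The responses may start with the same characters, so cut the prefix
    # back to the last "Assistant:" marker to get a clean prompt boundary.
    cut = prefix.rfind("Assistant:")
    if cut != -1:
        prefix = prefix[: cut + len("Assistant:")]
    return prefix, chosen[len(prefix):], rejected[len(prefix):]
```

Other schemas differ: SHP stores the two responses in separate fields with vote counts, and stack-exchange-paired already ships prompt/response columns, so no prefix splitting is needed there.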