Synthetic data is artificially generated data that mimics real-world usage. It helps overcome data limitations by expanding or enhancing existing datasets. Although synthetic data was already used in some applications, large language models have made synthetic datasets more popular for pre-training, post-training, and evaluating language models.
We'll use distilabel, a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers. For a deeper dive into the package and best practices, check out the documentation.
Synthetic data for language models can be grouped into three categories: instructions, preferences and critiques. We will focus on the first two, which cover the generation of datasets for instruction tuning and preference alignment. In both, we will also touch on the third category, which focuses on improving existing data with model critiques and rewrites.
Learn how to generate instruction datasets for instruction tuning. We will explore creating instruction tuning datasets through basic prompting and through more refined prompting techniques from papers. Instruction tuning datasets with seed data for in-context learning can be created through methods like SelfInstruct and Magpie. Additionally, we will explore instruction evolution through EvolInstruct. Start learning.
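To make the idea of seed data for in-context learning concrete, here is a minimal sketch in plain Python. It is not distilabel's implementation of SelfInstruct or Magpie: the function names, the prompt wording, and the `fake_llm` stand-in are all assumptions made for illustration. It only shows the general pattern of formatting seed instructions as examples, asking a model for new ones, and dropping duplicates.

```python
from typing import Callable, List


def build_few_shot_prompt(seed_instructions: List[str], num_new: int = 3) -> str:
    """Format seed instructions as numbered in-context examples (hypothetical prompt wording)."""
    examples = "\n".join(
        f"{i}. {text}" for i, text in enumerate(seed_instructions, start=1)
    )
    return (
        "Here are some example instructions:\n"
        f"{examples}\n\n"
        f"Write {num_new} new, diverse instructions in the same style, one per line."
    )


def generate_instructions(
    seed_instructions: List[str],
    llm: Callable[[str], str],
    num_new: int = 3,
) -> List[str]:
    """Ask the model for new instructions and filter out repeats of the seeds."""
    raw_output = llm(build_few_shot_prompt(seed_instructions, num_new))
    candidates = [line.strip() for line in raw_output.splitlines() if line.strip()]

    # Track what we have already seen (case-insensitive), starting with the seeds.
    seen = {seed.lower() for seed in seed_instructions}
    new_instructions = []
    for candidate in candidates:
        key = candidate.lower()
        if key not in seen:
            seen.add(key)
            new_instructions.append(candidate)
    return new_instructions[:num_new]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned response for the demo."""
    return (
        "Explain what overfitting is.\n"
        "Summarize this paragraph.\n"
        "Translate 'hello' to French."
    )


seeds = ["Summarize this paragraph.", "List three uses of Python."]
new_instructions = generate_instructions(seeds, fake_llm)
print(new_instructions)
```

In a real pipeline, `fake_llm` would be replaced by a call to an actual model, and the deduplication step would typically be stricter than an exact, case-insensitive match.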
Learn how to generate preference datasets for preference alignment. We will build on the methods and techniques introduced in section 1 by generating additional responses. Next, we will learn how to improve such responses with the EvolQuality prompt. Finally, we will explore how to evaluate responses with the UltraFeedback prompt, which produces a score and a critique, allowing us to create preference pairs. Start learning.
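As a rough illustration of the last step, the sketch below turns scored responses into a preference pair. It assumes each response already has a numeric score from a judge model. The `prompt`/`chosen`/`rejected` field names, the `to_preference_pair` helper, and the `min_gap` threshold are hypothetical choices for this example, not something prescribed by distilabel or the UltraFeedback prompt.

```python
from typing import Dict, List, Optional


def to_preference_pair(
    prompt: str,
    responses: List[str],
    scores: List[float],
    min_gap: float = 1.0,
) -> Optional[Dict[str, str]]:
    """Pick the highest- and lowest-scored responses as chosen and rejected.

    Returns None when there are fewer than two responses, when the inputs are
    misaligned, or when the score gap is too small to be a clear preference.
    """
    if len(responses) != len(scores) or len(responses) < 2:
        return None

    # Sort (score, response) tuples from best to worst by score only.
    ranked = sorted(zip(scores, responses), key=lambda item: item[0], reverse=True)
    best_score, chosen = ranked[0]
    worst_score, rejected = ranked[-1]

    # Skip near-ties: they carry little preference signal (assumed heuristic).
    if best_score - worst_score < min_gap:
        return None
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


pair = to_preference_pair(
    prompt="What is synthetic data?",
    responses=[
        "Data.",
        "Synthetic data is artificially generated data that mimics real-world usage.",
    ],
    scores=[2.0, 5.0],
)
print(pair)
```

Filtering out near-ties is one possible design choice: pairs with almost equal scores may add noise during preference alignment, but the right threshold depends on the judge and the dataset.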
Title | Description | Exercise | Link | Colab |
---|---|---|---|---|
Instruction Dataset | Generate a dataset for instruction tuning | 🐢 Generate an instruction tuning dataset <br> 🐕 Generate a dataset for instruction tuning with seed data <br> 🦁 Generate a dataset for instruction tuning with seed data and with instruction evolution | Link | Colab |
Preference Dataset | Generate a dataset for preference alignment | 🐢 Generate a preference alignment dataset <br> 🐕 Generate a preference alignment dataset with response evolution <br> 🦁 Generate a preference alignment dataset with response evolution and critiques | Link | Colab |