[FEATURE] basic use of pipeline to generate SFT dataset from documents #1076

Merged
21 commits
ab8c385
feat: implement abstraction on pipeline form datasets
burtenshaw Nov 25, 2024
81697ca
docs: update class doc string and examples
burtenshaw Dec 2, 2024
ff18c78
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 2, 2024
3266e70
feat: respond to small changes
burtenshaw Dec 16, 2024
d8f3310
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 16, 2024
45e10f1
add kwargs to docstring
burtenshaw Dec 16, 2024
6e69361
Merge branch 'feat/dataset-instruction-response-pipeline' of https://…
burtenshaw Dec 16, 2024
68524f5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 16, 2024
a2b7356
remove notebook
burtenshaw Dec 16, 2024
f76bc38
Merge branch 'feat/dataset-instruction-response-pipeline' of https://…
burtenshaw Dec 16, 2024
a0c23f6
Merge branch 'develop' into feat/dataset-instruction-response-pipeline
davidberenstein1957 Jan 20, 2025
f1fb538
remove ununsed import
burtenshaw Jan 28, 2025
d8e71ea
Merge branch 'feat/dataset-instruction-response-pipeline' of https://…
burtenshaw Jan 28, 2025
636e2c2
add documentation
burtenshaw Jan 28, 2025
fb55221
Update Hugging Face Inference Endpoints tests to mock both sync and a…
davidberenstein1957 Jan 29, 2025
da1720f
Merge branch 'develop' into feat/dataset-instruction-response-pipeline
davidberenstein1957 Jan 29, 2025
2134f90
Add BasePipelineTemplate and update pipeline documentation
davidberenstein1957 Jan 29, 2025
f0a1c46
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 29, 2025
a2e092d
fix: change liscence spacing
burtenshaw Jan 29, 2025
786be3e
docs: Add HuggingFaceHubCheckpointer to documentation
davidberenstein1957 Jan 30, 2025
0303d0f
docs: Update checkpointing documentation numbering and clarify input …
davidberenstein1957 Jan 30, 2025
3 changes: 2 additions & 1 deletion docs/api/step_gallery/hugging_face.md
Expand Up @@ -5,4 +5,5 @@ This section contains the existing steps integrated with `Hugging Face` so as to
::: distilabel.steps.LoadDataFromDisk
::: distilabel.steps.LoadDataFromFileSystem
::: distilabel.steps.LoadDataFromHub
::: distilabel.steps.PushToHub
::: distilabel.steps.PushToHub
::: distilabel.steps.HuggingFaceHubCheckpointer
31 changes: 30 additions & 1 deletion docs/sections/getting_started/quickstart.md
Expand Up @@ -28,7 +28,11 @@ To install the latest release with `hf-inference-endpoints` extra of the package
pip install distilabel[hf-inference-endpoints] --upgrade
```

## Use a generic pipeline
## Use a generic pipeline template

Distilabel comes with some built-in templates for tasks like Supervised Fine-Tuning. You can use these templates to generate data for your tasks. The templates are built using the `InstructionResponsePipeline` class, which uses the `InferenceEndpointsLLM` class to generate data based on the input data and the model.

### Generate Instructions and Responses

To use a generic pipeline for an ML task, you can use the `InstructionResponsePipeline` class. This class is a generic pipeline that can be used to generate data for supervised fine-tuning tasks. It uses the `InferenceEndpointsLLM` class to generate data based on the input data and the model.

Expand All @@ -41,6 +45,31 @@ dataset = pipeline.run()

The `InstructionResponsePipeline` class will use the `InferenceEndpointsLLM` class with the model `meta-llama/Meta-Llama-3.1-8B-Instruct` to generate data based on the system prompt. The output data will be a dataset with the columns `instruction` and `response`. The class uses a generic system prompt, but you can customize it by passing the `system_prompt` parameter to the class.

### Generate based on seed data

You can also use distilabel to generate data from seed data. This is useful when you have an unstructured dataset that represents your domain and you want instruction-response pairs for fine-tuning a model. To do so, create a `DatasetInstructionResponsePipeline` and pass the seed data through the `dataset` argument of its `run` method.

```python
from datasets import Dataset
from distilabel.pipeline import DatasetInstructionResponsePipeline

pipeline = DatasetInstructionResponsePipeline(num_instructions=5)  # define the number of instructions to generate per sample

distiset = pipeline.run(
    use_cache=False,
    dataset=Dataset.from_list(
        mapping=[
            {
                "input": "<document>",
            }
        ]
    ),
)
```
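As a rough sketch of preparing seed data, the snippet below turns a list of raw document strings into the list of `{"input": ...}` records that `Dataset.from_list` receives in the example above. The helper name `documents_to_rows` and the decision to skip blank documents are assumptions made for illustration; they are not part of distilabel.

```python
from typing import Dict, List


def documents_to_rows(documents: List[str]) -> List[Dict[str, str]]:
    """Wrap each non-empty document in a record with the `input` column."""
    rows = []
    for document in documents:
        text = document.strip()
        if text:  # skipping blank documents is an assumption, not a distilabel rule
            rows.append({"input": text})
    return rows


rows = documents_to_rows(["First document.", "   ", "Second document.\n"])
print(rows)
```

The resulting list could then be passed as the `mapping` argument of `Dataset.from_list`.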



!!! note
    We're actively working on building more pipelines for different tasks. If you have any suggestions or requests, please let us know! We're currently working on pipelines for classification, Direct Preference Optimization, and Information Retrieval tasks.

Expand Down
8 changes: 4 additions & 4 deletions docs/sections/how_to_guides/advanced/checkpointing.md
@@ -1,14 +1,14 @@
# Push data to the hub while the pipeline is running

Long-running pipelines can be resource-intensive, and ensuring everything is functioning as expected is crucial. To make this process seamless, the [HuggingFaceHubCheckpointer][distilabel.steps.checkpointer.HuggingFaceHubCheckpointer] step has been designed to integrate directly into the pipeline workflow.
Long-running pipelines can be resource-intensive, and ensuring everything is functioning as expected is crucial. To make this process seamless, the [`HuggingFaceHubCheckpointer`][distilabel.steps.HuggingFaceHubCheckpointer] step has been designed to integrate directly into the pipeline workflow.

The [`HuggingFaceHubCheckpointer`](https://distilabel.argilla.io/dev/sections/getting_started/quickstart/) allows you to periodically save your generated data as a Hugging Face Dataset at configurable intervals (every `input_batch_size` examples generated).
The [`HuggingFaceHubCheckpointer`][distilabel.steps.HuggingFaceHubCheckpointer] allows you to periodically save your generated data as a Hugging Face Dataset at configurable intervals (every `input_batch_size` examples generated).

Just add the [`HuggingFaceHubCheckpointer`](https://distilabel.argilla.io/dev/sections/getting_started/quickstart/) as any other step in your pipeline.
Just add the [`HuggingFaceHubCheckpointer`][distilabel.steps.HuggingFaceHubCheckpointer] as any other step in your pipeline.
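
To make the "every `input_batch_size` examples" interval concrete, here is a minimal, library-free sketch. It assumes the step receives its input in consecutive batches of at most `input_batch_size` rows and performs one upload per batch; the `iter_batches` helper and the list that stands in for the Hub upload are hypothetical and only illustrate the cadence, not the step's actual implementation.

```python
from typing import Dict, Iterator, List

Row = Dict[str, str]


def iter_batches(rows: List[Row], input_batch_size: int) -> Iterator[List[Row]]:
    """Yield consecutive slices of at most `input_batch_size` rows."""
    if input_batch_size < 1:
        raise ValueError("input_batch_size must be at least 1")
    for start in range(0, len(rows), input_batch_size):
        yield rows[start : start + input_batch_size]


# Stand-in for the Hub upload: record the size of each batch that would be pushed.
uploaded_batch_sizes = []
rows = [{"instruction": f"task {i}"} for i in range(7)]
for batch in iter_batches(rows, input_batch_size=3):
    uploaded_batch_sizes.append(len(batch))

print(uploaded_batch_sizes)
```

Under these assumptions, a smaller `input_batch_size` means more frequent checkpoints at the cost of more uploads.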

## Sample pipeline with dummy data to see the checkpoint strategy in action

The following pipeline starts from a fake dataset with dummy data, passes that through a fake `DoNothing` step (any other step/s work here, but this can be useful to explore the behavior), and makes use of the [`HuggingFaceHubCheckpointer`](https://distilabel.argilla.io/dev/sections/getting_started/quickstart/) step to push the data to the hub.
The following pipeline starts from a fake dataset with dummy data, passes that through a fake `DoNothing` step (any other step/s work here, but this can be useful to explore the behavior), and makes use of the [`HuggingFaceHubCheckpointer`][distilabel.steps.HuggingFaceHubCheckpointer] step to push the data to the hub.

```python
from datasets import Dataset
Expand Down
2 changes: 2 additions & 0 deletions src/distilabel/pipeline/__init__.py
Expand Up @@ -19,10 +19,12 @@
    sample_n_steps,
)
from distilabel.pipeline.templates import (
    DatasetInstructionResponsePipeline,
    InstructionResponsePipeline,
)

__all__ = [
    "DatasetInstructionResponsePipeline",
    "InstructionResponsePipeline",
    "Pipeline",
    "RayPipeline",
Expand Down
1 change: 1 addition & 0 deletions src/distilabel/pipeline/templates/__init__.py
Expand Up @@ -12,4 +12,5 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from .dataset_instruction import DatasetInstructionResponsePipeline # noqa: F401
from .instruction import InstructionResponsePipeline # noqa: F401
17 changes: 17 additions & 0 deletions src/distilabel/pipeline/templates/base.py
@@ -0,0 +1,17 @@
# Copyright 2023-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


class BasePipelineTemplate:  # defined for recursive subclass finder mkdocs
    pass
167 changes: 167 additions & 0 deletions src/distilabel/pipeline/templates/dataset_instruction.py
@@ -0,0 +1,167 @@
# Copyright 2023-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import Optional

from distilabel.distiset import Distiset
from distilabel.llms import LLM, InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.pipeline.templates.base import BasePipelineTemplate
from distilabel.steps import ExpandColumns, KeepColumns
from distilabel.steps.tasks import SelfInstruct, TextGeneration

MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"


class DatasetInstructionResponsePipeline(BasePipelineTemplate):
    """Generates instructions and responses for a dataset with input documents.

    This example pipeline can be used for a Supervised Fine-Tuning dataset which you
    could use to train or evaluate a model. The pipeline generates instructions with the
    SelfInstruct step and responses with the TextGeneration step.

    Attributes:
        llm: The LLM to use for generating instructions and responses. Defaults to
            InferenceEndpointsLLM with Meta-Llama-3.1-8B-Instruct.
        system_prompt: The system prompt to use for generating instructions and responses.
            Defaults to "You are a creative AI Assistant writer."
        hf_token: The Hugging Face token to use for accessing the model. Defaults to None.
        num_instructions: The number of instructions to generate per input document.
            Defaults to 2.
        batch_size: The batch size to use for generation. Defaults to 1.

    Input columns:
        - input (`str`): The input document to generate instructions and responses for.

    Output columns:
        - input (`str`): the input document.
        - instruction (`str`): an instruction generated from the input document.
        - response (`str`): the response generated for the instruction, based on the
            input document.

    References:
        - [Self-Instruct: Aligning Language Models with Self-Generated Instructions](https://arxiv.org/abs/2212.10560)

    Examples:

        Generate instructions and responses for a dataset of input documents:

        ```python
        from datasets import Dataset
        from distilabel.pipeline import DatasetInstructionResponsePipeline

        pipeline = DatasetInstructionResponsePipeline(num_instructions=5)

        distiset = pipeline.run(
            use_cache=False,
            dataset=Dataset.from_list(
                mapping=[
                    {
                        "input": "<document>",
                    }
                ]
            ),
        )
        ```
    """

    def __init__(
        self,
        llm: Optional[LLM] = None,
        system_prompt: str = "You are a creative AI Assistant writer.",
        hf_token: Optional[str] = None,
        num_instructions: int = 2,
        batch_size: int = 1,
    ) -> None:
        """Initializes the pipeline.

        Args:
            llm (Optional[LLM], optional): The language model to use. Defaults to None.
            system_prompt (str, optional): The system prompt to use. Defaults to "You are a creative AI Assistant writer.".
            hf_token (Optional[str], optional): The Hugging Face API token to use. Defaults to None.
            num_instructions (int, optional): The number of instructions to generate. Defaults to 2.
            batch_size (int, optional): The batch size to use. Defaults to 1.
        """
        if llm is None:
            self.llm: LLM = InferenceEndpointsLLM(
                model_id=MODEL,
                tokenizer_id=MODEL,
                generation_kwargs={
                    "temperature": 0.9,
                    "do_sample": True,
                    "max_new_tokens": 2048,
                },
                api_key=hf_token,
            )
        else:
            self.llm = llm

        self.pipeline: Pipeline = self._get_pipeline(
            system_prompt=system_prompt,
            num_instructions=num_instructions,
            batch_size=batch_size,
        )

    def run(self, dataset, **kwargs) -> Distiset:
        """Runs the pipeline and returns a Distiset.

        Args:
            dataset: The dataset to run the pipeline on.
            **kwargs: Additional arguments to pass to the pipeline.
        """
        # Pass the dataset by keyword so it cannot bind to another positional
        # parameter of `Pipeline.run`.
        return self.pipeline.run(dataset=dataset, **kwargs)

    def _get_pipeline(
        self, system_prompt: str, num_instructions: int, batch_size: int
    ) -> Pipeline:
        """Returns a pipeline that generates instructions and responses for a given system prompt."""
        with Pipeline(name="dataset_chat") as pipeline:
            self_instruct = SelfInstruct(
                llm=self.llm,
                num_instructions=num_instructions,
            )

            expand_columns = ExpandColumns(
                columns=["instructions"],
                output_mappings={"instructions": "instruction"},
            )

            keep_instruction = KeepColumns(
                columns=["instruction", "input"],
            )

            response_generation = TextGeneration(
                name="exam_generation",
                system_prompt=system_prompt,
                template="Respond to the instruction based on the document. Document:\n{{ input }} \nInstruction: {{ instruction }}",
                llm=self.llm,
                input_batch_size=batch_size,
                output_mappings={"generation": "response"},
            )

            keep_response = KeepColumns(
                columns=["input", "instruction", "response"],
            )

            (
                self_instruct
                >> expand_columns
                >> keep_instruction
                >> response_generation
                >> keep_response
            )

        return pipeline
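
The row-level data flow of this pipeline can be sketched without distilabel or an LLM. The snippet below is only an illustration under stated assumptions: `expand_instructions` mimics what the `ExpandColumns` and `KeepColumns` steps are expected to do to a single row (one output row per generated instruction, keeping `input` and `instruction`), and `render_prompt` fills the `TextGeneration` template by plain string replacement as a stand-in for Jinja2 rendering. Both helper names and the sample row are hypothetical.

```python
from typing import Any, Dict, List

# Same template string as the one passed to TextGeneration in the pipeline.
TEMPLATE = (
    "Respond to the instruction based on the document. "
    "Document:\n{{ input }} \nInstruction: {{ instruction }}"
)


def expand_instructions(row: Dict[str, Any]) -> List[Dict[str, str]]:
    """Turn one row holding a list of instructions into one row per instruction."""
    return [
        {"input": str(row["input"]), "instruction": str(instruction)}
        for instruction in row["instructions"]
    ]


def render_prompt(row: Dict[str, str]) -> str:
    """Fill both placeholders by string replacement (stand-in for Jinja2)."""
    return TEMPLATE.replace("{{ input }}", row["input"]).replace(
        "{{ instruction }}", row["instruction"]
    )


# A made-up row shaped like the output of the instruction-generation step.
self_instruct_row = {
    "input": "Cats sleep most of the day.",
    "instructions": ["Summarize the document.", "How long do cats sleep?"],
}

expanded = expand_instructions(self_instruct_row)
prompts = [render_prompt(row) for row in expanded]
print(len(expanded))
print(prompts[0])
```

With `num_instructions=N`, each input document is therefore expected to yield up to N rows, each of which gets its own response.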
26 changes: 23 additions & 3 deletions src/distilabel/pipeline/templates/instruction.py
Expand Up @@ -17,23 +17,43 @@
from distilabel.distiset import Distiset
from distilabel.llms import LLM, InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.pipeline.templates.base import BasePipelineTemplate
from distilabel.steps.tasks import MagpieGenerator

MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"


class InstructionResponsePipeline:
class InstructionResponsePipeline(BasePipelineTemplate):
    """Generates instructions and responses for a given system prompt.

    This example pipeline can be used for a Supervised Fine-Tuning dataset which you
    could use to train or evaluate a model. The pipeline generates instructions using the
    MagpieGenerator and responses for a given system prompt. The pipeline then keeps only
    the instruction, response, and model_name columns.

    Attributes:
        llm: The LLM to use for generating instructions and responses. Defaults to
            InferenceEndpointsLLM with Meta-Llama-3.1-8B-Instruct.
        system_prompt: The system prompt to use for generating instructions and responses.
            Defaults to "You are a creative AI Assistant writer."
        hf_token: The Hugging Face token to use for accessing the model. Defaults to None.
        n_turns: The number of turns to generate for each conversation. Defaults to 1.
        num_rows: The number of rows to generate. Defaults to 10.
        batch_size: The batch size to use for generation. Defaults to 1.

    Output columns:
        - conversation (`ChatType`): the generated conversation which is a list of chat
            items with a role and a message.
        - instruction (`str`): the generated instructions if `only_instruction=True`.
        - response (`str`): the generated response if `n_turns==1`.
        - system_prompt_key (`str`, optional): the key of the system prompt used to generate
            the conversation or instruction. Only if `system_prompt` is a dictionary.
        - model_name (`str`): The model name used to generate the `conversation` or `instruction`.

    References:
        - [Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing](https://arxiv.org/abs/2406.08464)
        - [Self-Instruct: Aligning Language Models with Self-Generated Instructions](https://arxiv.org/abs/2212.10560)

Example:
Examples:

Generate instructions and responses for a given system prompt:

Expand Down
3 changes: 1 addition & 2 deletions src/distilabel/steps/checkpointer.py
Expand Up @@ -16,15 +16,14 @@
import tempfile
from typing import TYPE_CHECKING, Optional

from huggingface_hub import HfApi
from pydantic import PrivateAttr

from distilabel.steps.base import Step, StepInput

if TYPE_CHECKING:
from distilabel.typing import StepOutput

from huggingface_hub import HfApi


class HuggingFaceHubCheckpointer(Step):
"""Special type of step that uploads the data to a Hugging Face Hub dataset.
Expand Down
21 changes: 21 additions & 0 deletions src/distilabel/utils/export_components_info.py
Expand Up @@ -18,6 +18,7 @@
from distilabel.models.embeddings.base import Embeddings
from distilabel.models.image_generation.base import ImageGenerationModel
from distilabel.models.llms.base import LLM
from distilabel.pipeline.templates.base import BasePipelineTemplate
from distilabel.steps.base import _Step
from distilabel.steps.tasks.base import _Task
from distilabel.steps.tasks.generate_embeddings import GenerateEmbeddings
Expand Down Expand Up @@ -68,6 +69,13 @@ def export_components_info() -> ComponentsInfo:
            }
            for embeddings_type in _get_embeddings()
        ],
        "pipelines": [
            {
                "name": pipeline_type.__name__,
                "docstring": parse_google_docstring(pipeline_type),
            }
            for pipeline_type in _get_pipelines()
        ],
    }


Expand Down Expand Up @@ -148,6 +156,19 @@ def _get_embeddings() -> List[Type["Embeddings"]]:
]


def _get_pipelines() -> List[Type["BasePipelineTemplate"]]:
    """Get all `BasePipelineTemplate` subclasses that are not abstract classes.

    Returns:
        A list of `BasePipelineTemplate` subclasses
    """
    return [
        pipeline_type
        for pipeline_type in _recursive_subclasses(BasePipelineTemplate)
        if not inspect.isabstract(pipeline_type)
    ]
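
The discovery pattern used by `_get_pipelines` can be tried in isolation with the standard library only. The sketch below is a stand-in under assumptions: `BaseTemplate` and its subclasses are toy classes rather than distilabel's, `recursive_subclasses` is a hypothetical re-implementation of the truncated `_recursive_subclasses` helper, and `inspect.getdoc` replaces `parse_google_docstring`, whose output format is not shown in this diff.

```python
import inspect
from abc import ABC, abstractmethod
from typing import Dict, Generator, List, Type


class BaseTemplate:  # hypothetical stand-in for BasePipelineTemplate
    pass


class AbstractTemplate(BaseTemplate, ABC):
    """Abstract intermediate class that should be filtered out."""

    @abstractmethod
    def run(self) -> None: ...


class ConcreteTemplate(AbstractTemplate):
    """Generates toy data."""

    def run(self) -> None:
        return None


class OtherTemplate(BaseTemplate):
    """Another toy template."""


def recursive_subclasses(klass: Type) -> Generator[Type, None, None]:
    """Yield every direct and indirect subclass of `klass`."""
    for subclass in klass.__subclasses__():
        yield subclass
        yield from recursive_subclasses(subclass)


def get_templates() -> List[Dict[str, str]]:
    """Collect name and docstring for each non-abstract subclass."""
    return [
        {"name": cls.__name__, "docstring": inspect.getdoc(cls) or ""}
        for cls in recursive_subclasses(BaseTemplate)
        if not inspect.isabstract(cls)
    ]


info = get_templates()
print(sorted(entry["name"] for entry in info))
```

Because discovery relies on `__subclasses__()`, a template class only shows up once the module that defines it has been imported.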


# Reference: https://adamj.eu/tech/2024/05/10/python-all-subclasses/
def _recursive_subclasses(klass: Type[T]) -> Generator[Type[T], None, None]:
"""Recursively get all subclasses of a class.
Expand Down