Add Tasks to replicate APIGen #925

Merged on Oct 7, 2024 · 74 commits
Changes from 42 commits

Commits
edc1e06
Add apigen task module
plaguss Aug 21, 2024
4b372fb
Add tests for apigen
plaguss Aug 22, 2024
3a0da42
Fix default name for dataset info when requesting the number of examples
plaguss Aug 22, 2024
01b43ab
checkpoint
plaguss Aug 22, 2024
5058f65
Add tests for apigen generator
plaguss Aug 23, 2024
8ee19e9
Create jinja template, split methods and add docstrings
plaguss Aug 23, 2024
d95375b
Update string format
plaguss Aug 23, 2024
02d6803
Simplify function setting and move it to load method
plaguss Aug 23, 2024
3371a37
Add tests for semantic checker
plaguss Aug 23, 2024
19b9576
Add prompt template for semantic checker
plaguss Aug 23, 2024
9f90191
Redirect import for semantic checker
plaguss Aug 23, 2024
ef7f263
Fix docstrins for output columns
plaguss Aug 23, 2024
76a6da0
Add semantic checker task from apigen
plaguss Aug 23, 2024
050a744
Add notes for execution checker
plaguss Aug 24, 2024
e4be16d
Merge with develop and fix conflicts
plaguss Sep 11, 2024
b8c356c
Remove extra jump of line
plaguss Sep 12, 2024
94ef973
Add first version of data sampler, step helper for apigen
plaguss Sep 16, 2024
5cffc3f
Add tests for data sampler
plaguss Sep 16, 2024
952a640
Add integration test to check the sampler can be mixed with another g…
plaguss Sep 16, 2024
f5994d8
Draft tests for new execution checker
plaguss Sep 17, 2024
5c8974a
Move helper functions
plaguss Sep 17, 2024
dab8a8b
Draft for execution checker functionality
plaguss Sep 17, 2024
d17cbde
Add first version of execution checker and tests
plaguss Sep 18, 2024
a2ae5f2
Add tests for utils module of apigen
plaguss Sep 18, 2024
18be0b8
Remove unnecessary step for transformation and rename files for clarity
plaguss Sep 18, 2024
71c0729
Fix import
plaguss Sep 18, 2024
cc25c8f
Change function results name to show the original results from the ex…
plaguss Sep 19, 2024
c5fae66
Remove print when the url for a reference doesn't contain https://arxiv
plaguss Sep 19, 2024
127e377
first working version
plaguss Sep 19, 2024
a2279cd
Merge branch 'develop' of https://github.com/argilla-io/distilabel in…
plaguss Sep 20, 2024
bdbd8b3
Fix tests including previous columns
plaguss Sep 23, 2024
7648535
Go back to previous name for dummy llm
plaguss Sep 23, 2024
421125c
Change dummy llm names on tests
plaguss Sep 23, 2024
051de38
Read the answers from the model parsed instead of dumped string
plaguss Sep 23, 2024
614a817
Add option to include the tools if available for few shot
plaguss Sep 23, 2024
fa3cbf4
Allow extra checks for the parameter types and tests for those
plaguss Sep 23, 2024
0cd8bc6
Add docs for the execution checker
plaguss Sep 24, 2024
f27186d
Add new icon for execution
plaguss Sep 24, 2024
70260f4
Fix return type for outputs column
plaguss Sep 24, 2024
c49092a
Fix docstrings
plaguss Sep 24, 2024
93fb319
Redirect imports to top level
plaguss Sep 24, 2024
3e64236
Update docstrings to render on components gallery
plaguss Sep 24, 2024
1560e0c
Improve docstrings for fields in the data sampler
plaguss Sep 24, 2024
0fdde41
Remove unnecesary data from docstrings and remove TODO
plaguss Sep 24, 2024
8d76bbe
Add missing data variable in example
plaguss Sep 24, 2024
cf74fae
Update src/distilabel/steps/tasks/apigen/execution_checker.py
plaguss Sep 24, 2024
ae3e4e2
Refactor to return formatted json string instead of dict to simplify …
plaguss Sep 25, 2024
0ea95a7
Draft tutorial to replicate paper
plaguss Sep 25, 2024
d7c6a64
Allow number to be a dict with values and probabilities
plaguss Sep 26, 2024
21e0757
Update pipeline run call
plaguss Sep 26, 2024
82aa352
Add functionality to load functions from a folder with .py files
plaguss Sep 26, 2024
e70a258
Fix comment for arg
plaguss Sep 26, 2024
cbc288c
Add example implementation
plaguss Sep 26, 2024
71a3517
Add dependency for vllm
plaguss Sep 26, 2024
8dceb11
Fix dependency name
plaguss Sep 26, 2024
f363292
Add setuptools-scm in the script with the dependencies to install it …
plaguss Sep 26, 2024
a43b8e9
Another attempt with system
plaguss Sep 26, 2024
7325cef
Add tests to take into account casting methods
plaguss Sep 27, 2024
2f7418a
Avoid casting and update prompt to ensure argument order is respected
plaguss Sep 27, 2024
4ac735c
Inform error type on generator
plaguss Sep 27, 2024
60c1cd9
Add extra checks and safeguards for failed answer generation
plaguss Sep 27, 2024
bf1baed
Ensure the error is of the expected type
plaguss Sep 27, 2024
740d3fe
Fix unstructured generation
plaguss Sep 27, 2024
841d985
Remove json fences and fix semantic checker
plaguss Sep 27, 2024
dcded6a
Control case of functions without arguments
plaguss Sep 27, 2024
2a76812
Add additional checks to run the execution checker
plaguss Oct 1, 2024
58b92be
Remove additional dependency
plaguss Oct 2, 2024
f2eb160
Merge branch 'develop' of https://github.com/argilla-io/distilabel in…
plaguss Oct 2, 2024
8a9743f
Try fixing CI error with dependencies
plaguss Oct 3, 2024
c26cca4
Install dependency for the system
plaguss Oct 3, 2024
9c756b2
Undo fix attempt
plaguss Oct 3, 2024
55ccc1e
Try fixing llvmlite dependency issue
plaguss Oct 3, 2024
c5ccf5a
Remove additional dependency as it breaks other tests
plaguss Oct 3, 2024
776b36c
Merge with develop and fix conflict
plaguss Oct 7, 2024
5 changes: 3 additions & 2 deletions src/distilabel/distiset.py
@@ -695,8 +695,9 @@ def _grab_citations(dag: "DAG") -> List[str]:
     for ref in references.values():
         try:
             bibtex_refs.append(get_bibtex(ref))
-        except ValueError as e:
-            print(f"Error: {e}")
+        except ValueError:
+            # No need to inform in this case, it's noise
+            pass
         except AttributeError as e:
             print(
                 f"Couldn't obtain the bibtex format for the ref: '{ref}', error: {e}"
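The hunk above can be read as the following pattern: a `ValueError` (expected when a reference cannot be converted) is skipped without printing anything. What follows is a minimal, self-contained sketch and not distilabel's code: `fake_get_bibtex` and `collect_bibtex` are hypothetical stand-ins, the arXiv-only check is an assumption based on the commit message about URLs that don't contain `https://arxiv`, and the arXiv identifier is a placeholder.

```python
from typing import Dict, List


def fake_get_bibtex(ref: str) -> str:
    """Hypothetical stand-in for the real helper: only accepts arXiv URLs."""
    if not ref.startswith("https://arxiv"):
        raise ValueError(f"Not an arXiv URL: {ref}")
    # Build a toy entry from the last URL segment (illustration only).
    return f"@misc{{{ref.rsplit('/', 1)[-1]}}}"


def collect_bibtex(references: Dict[str, str]) -> List[str]:
    """Collect BibTeX entries, silently skipping references that raise ValueError."""
    bibtex_refs = []
    for ref in references.values():
        try:
            bibtex_refs.append(fake_get_bibtex(ref))
        except ValueError:
            # Expected for non-arXiv links, so nothing is printed.
            pass
    return bibtex_refs


refs = {
    "paper": "https://arxiv.org/abs/0000.00000",  # placeholder identifier
    "repo": "https://github.com/argilla-io/distilabel",
}
entries = collect_bibtex(refs)
```

Only the arXiv-style reference produces an entry here; the repository URL is dropped without any output.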
70 changes: 70 additions & 0 deletions src/distilabel/llms/_dummy.py
Member:

Do we need this file?

Contributor Author:

Not at all. I added it because I thought it could simplify testing the examples with a dummy LLM without having to access the tests. WDYT? No problem removing it.

@@ -0,0 +1,70 @@
# Copyright 2023-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import TYPE_CHECKING, Any, List

from distilabel.llms.base import LLM, AsyncLLM
from distilabel.llms.mixins.magpie import MagpieChatTemplateMixin

if TYPE_CHECKING:
    from distilabel.llms.typing import GenerateOutput
    from distilabel.steps.tasks.typing import FormattedInput


class DummyAsyncLLM(AsyncLLM):
    structured_output: Any = None

    def load(self) -> None:
        pass

    @property
    def model_name(self) -> str:
        return "test"

    async def agenerate(  # type: ignore
        self, input: "FormattedInput", num_generations: int = 1
    ) -> "GenerateOutput":
        return ["output" for _ in range(num_generations)]


class DummySyncLLM(LLM):
    structured_output: Any = None

    def load(self) -> None:
        super().load()

    @property
    def model_name(self) -> str:
        return "test"

    def generate(  # type: ignore
        self, inputs: "FormattedInput", num_generations: int = 1
    ) -> "GenerateOutput":
        return [["output" for _ in range(num_generations)] for _ in range(len(inputs))]


class DummyMagpieLLM(LLM, MagpieChatTemplateMixin):
    def load(self) -> None:
        pass

    @property
    def model_name(self) -> str:
        return "test"

    def generate(
        self, inputs: List["FormattedInput"], num_generations: int = 1, **kwargs: Any
    ) -> List["GenerateOutput"]:
        return [
            ["Hello Magpie" for _ in range(num_generations)] for _ in range(len(inputs))
        ]
2 changes: 2 additions & 0 deletions src/distilabel/steps/__init__.py
@@ -45,6 +45,7 @@
     FormatTextGenerationSFT,
 )
 from distilabel.steps.generators.data import LoadDataFromDicts
+from distilabel.steps.generators.data_sampler import DataSampler
 from distilabel.steps.generators.huggingface import (
     LoadDataFromDisk,
     LoadDataFromFileSystem,
@@ -83,6 +84,7 @@
     "FormatChatGenerationSFT",
     "FormatTextGenerationSFT",
     "LoadDataFromDicts",
+    "DataSampler",
     "LoadDataFromDisk",
     "LoadDataFromFileSystem",
     "LoadDataFromHub",
144 changes: 144 additions & 0 deletions src/distilabel/steps/generators/data_sampler.py
@@ -0,0 +1,144 @@
# Copyright 2023-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import random
from typing import TYPE_CHECKING, Any, Dict, List

from pydantic import Field
from typing_extensions import override

from distilabel.steps.base import GeneratorStep

if TYPE_CHECKING:
    from distilabel.steps.base import GeneratorStepOutput


class DataSampler(GeneratorStep):
    """Step to sample from a dataset.

    `GeneratorStep` that samples from a dataset and yields it in batches.

    Output columns:
        - dynamic (based on the keys found on the first dictionary of the list): The columns
            of the dataset.

    Categories:
        - load

    Attributes:
        data: The list of dictionaries to sample from.
        size: The number of samples per example.
        samples: The number of examples.

    Examples:
        Sample data from a list of dictionaries:

        ```python
        from distilabel.steps import DataSampler

        sampler = DataSampler(
            data=[{"sample": f"sample {i}"} for i in range(30)],
            samples=10,
            size=2,
            batch_size=4
        )
        sampler.load()

        result = next(sampler.process())
        # >>> result
        # ([{'sample': ['sample 7', 'sample 0']}, {'sample': ['sample 2', 'sample 21']}, {'sample': ['sample 17', 'sample 12']}, {'sample': ['sample 2', 'sample 14']}], False)
        ```

        Pipeline with a loader and a sampler combined in a single stream:

        ```python
        from datasets import load_dataset

        from distilabel.steps import LoadDataFromDicts, DataSampler
        from distilabel.steps.tasks.apigen.utils import PrepareExamples
        from distilabel.pipeline import Pipeline

        ds = (
            load_dataset("Salesforce/xlam-function-calling-60k", split="train")
            .shuffle(seed=42)
            .select(range(500))
            .to_list()
        )

        # NOTE: `data` (the seed rows) and `combine_steps` are assumed to be defined elsewhere.
        with Pipeline(name="APIGenPipeline") as pipeline:
            loader_seeds = LoadDataFromDicts(data=data)
            sampler = DataSampler(
                data=ds,
                size=2,
                samples=len(data),
                batch_size=8,
            )
            prep_examples = PrepareExamples()

            sampler >> prep_examples
            (
                [loader_seeds, prep_examples]
                >> combine_steps
            )
            # Now we have a single stream of data with the loader and the sampler data
        ```
    """

    data: List[Dict[str, Any]] = Field(default_factory=list, exclude=True)
    size: int = 2  # Number of samples per example
    samples: int = 100  # Number of examples

    @override
    def process(self, offset: int = 0) -> "GeneratorStepOutput":  # type: ignore
        """Yields batches from a list of dictionaries.

        Args:
            offset: The offset to start the generation from. Defaults to `0`.

        Yields:
            A list of Python dictionaries as read from the inputs (propagated in batches)
            and a flag indicating whether the yielded batch is the last one.
        """

        total_samples = 0

        while total_samples < self.samples:
            batch = []
            # The last batch may be smaller than `batch_size`.
            bs = min(self.batch_size, self.samples - total_samples)
            for _ in range(bs):
                # Draw `size` rows with replacement and merge them into one example.
                choices = random.choices(self.data, k=self.size)
                choices = self._transform_data(choices)
                batch.extend(choices)
            total_samples += bs
            yield (batch, total_samples >= self.samples)

    @staticmethod
    def _transform_data(data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        if not data:
            return []

        # Turn a list of rows into a single dict mapping each key to a list of values.
        result = {key: [] for key in data[0].keys()}

        for item in data:
            for key, value in item.items():
                result[key].append(value)

        return [result]

    @property
    def outputs(self) -> List[str]:
        # Guard against an empty `data` list (the default) to avoid an IndexError.
        return list(self.data[0].keys()) if self.data else []
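The sampling logic of `DataSampler` can be sketched without distilabel. The snippet below is an illustration under assumptions, not the step itself: `to_columns` and `sample_batches` are hypothetical names that mirror `_transform_data` and `process`, each example groups `size` rows drawn with replacement via `random.choices`, and the last batch is shortened so that exactly `samples` examples are produced.

```python
import random
from typing import Any, Dict, Iterator, List, Tuple


def to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn a list of rows into a single dict of column lists."""
    columns: Dict[str, List[Any]] = {key: [] for key in rows[0]}
    for row in rows:
        for key, value in row.items():
            columns[key].append(value)
    return columns


def sample_batches(
    data: List[Dict[str, Any]], size: int, samples: int, batch_size: int
) -> Iterator[Tuple[List[Dict[str, List[Any]]], bool]]:
    """Yield (batch, is_last) pairs until `samples` examples were produced."""
    total = 0
    while total < samples:
        # The final batch may hold fewer than `batch_size` examples.
        bs = min(batch_size, samples - total)
        # Each example groups `size` rows drawn with replacement.
        batch = [to_columns(random.choices(data, k=size)) for _ in range(bs)]
        total += bs
        yield batch, total >= samples


data = [{"sample": f"sample {i}"} for i in range(30)]
batches = list(sample_batches(data, size=2, samples=10, batch_size=4))
sizes = [len(batch) for batch, _ in batches]
flags = [is_last for _, is_last in batches]
```

With `samples=10` and `batch_size=4` this yields batches of 4, 4 and 2 examples, and only the last one carries the `True` flag. Because sampling is with replacement, the same row can appear twice within one example.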
10 changes: 5 additions & 5 deletions src/distilabel/steps/generators/huggingface.py
@@ -218,11 +218,11 @@ def _get_dataset_num_examples(self) -> int:
         Returns:
             The number of examples in the dataset.
         """
-        return (
-            self._dataset_info[self.config if self.config else "default"]
-            .splits[self.split]
-            .num_examples
-        )
+        default_config = self.config
+        if not default_config:
+            default_config = list(self._dataset_info.keys())[0]
+
+        return self._dataset_info[default_config].splits[self.split].num_examples
 
     def _get_dataset_columns(self) -> List[str]:
         """Get the columns of the dataset, based on the `config` runtime parameter provided.
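The change above replaces a hard-coded `"default"` key with a fallback to the first config found in the dataset info, presumably because some datasets expose a single config under a different name. A minimal sketch of that lookup, assuming a plain dict in place of the real dataset info object (`resolve_config` and the config names are hypothetical):

```python
from typing import Dict, Optional


def resolve_config(dataset_info: Dict[str, object], config: Optional[str]) -> str:
    """Return `config` if set, otherwise the first config name available."""
    if config:
        return config
    # Dicts preserve insertion order, so this picks the first listed config.
    return next(iter(dataset_info))


info = {"main": object(), "extra": object()}
chosen_default = resolve_config(info, None)
chosen_explicit = resolve_config(info, "extra")
```

With the previous lookup, `info["default"]` would raise a `KeyError` for this dict; the fallback returns `"main"` instead, while an explicit config is still honored.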
6 changes: 6 additions & 0 deletions src/distilabel/steps/tasks/__init__.py
@@ -12,6 +12,9 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+from distilabel.steps.tasks.apigen.execution_checker import APIGenExecutionChecker
+from distilabel.steps.tasks.apigen.generator import APIGenGenerator
+from distilabel.steps.tasks.apigen.semantic_checker import APIGenSemanticChecker
 from distilabel.steps.tasks.base import GeneratorTask, Task
 from distilabel.steps.tasks.complexity_scorer import ComplexityScorer
 from distilabel.steps.tasks.evol_instruct.base import EvolInstruct
@@ -52,6 +55,9 @@
 __all__ = [
     "GeneratorTask",
     "Task",
+    "APIGenExecutionChecker",
+    "APIGenGenerator",
+    "APIGenSemanticChecker",
     "ComplexityScorer",
     "EvolInstruct",
     "EvolComplexity",
14 changes: 14 additions & 0 deletions src/distilabel/steps/tasks/apigen/__init__.py
@@ -0,0 +1,14 @@
# Copyright 2023-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
