Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distilabel 1.3.0 #857

Merged
merged 48 commits into from
Aug 6, 2024
Merged

distilabel 1.3.0 #857

merged 48 commits into from
Aug 6, 2024

Conversation

gabrielmbmb
Copy link
Member

No description provided.

gabrielmbmb and others added 30 commits June 18, 2024 14:37
* Add step to combine keys in a dict

* Redirect import

* Add tests

* Add internal function to combine keys in a dict

* Fix docstrings per code review
#758)

* Update: naming of CombineKeys to MergeColumns

* Update: CombineColumns to GroupColumns

* Fix: broken tests after refactor to columns directory

* Add: deprecation test CombineColumns

* Update src/distilabel/pipeline/utils.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

---------

Co-authored-by: Gabriel Martín Blázquez <[email protected]>
* Add requirements list for a pipeline

* Add tests for the new requirements of a Pipeline

* Create a RequirementsMixin class to contain common requirements functionality for Step and BasePipeline

* Create a decorator to add requirements to Steps

* Make the _Step inherit from RequirementsMixin to contain the needed functionality

* Implement functionality to check requirements before starting running a Pipeline

* Fix test to run with DummyPipeline

* Add test for requirements to step created via decorator

* Add requirements info to dump and ensure it's loaded back (if found

* Update src/distilabel/mixins/requirements.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Apply suggestions from code review!

* Update src/distilabel/pipeline/base.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update requirements to store the list of Requirement instances to avoid reinstantiation

* Update tests

* Fix doc errors from column step refactor

* Add missing llm serving/sharing in how to guides

* Fix error on internal requirements variable

* Include guide to use the requirements decorator

* Update docs/sections/how_to_guides/advanced/pipeline_requirements.md

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update docs/sections/how_to_guides/advanced/pipeline_requirements.md

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update docs/sections/how_to_guides/advanced/pipeline_requirements.md

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update docs/sections/how_to_guides/advanced/pipeline_requirements.md

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update mkdocs.yml

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Change ValueError with ModuleNotFoundError when stopping a pipeline due to requirements not installed

---------

Co-authored-by: Gabriel Martín Blázquez <[email protected]>
* Create N replicas per `Step`

* Update `_BatchManager` to handle batch sorting uncertainty

* Add multiple replicas test

* Fix unit tests

* Fix `next_expected_seq_no` needed to be updated if
`routing_batch_function`

* Update `set_next_expected_batch_seq_no` only if no `data`

* Fix `next_expected_seq_no` with `routing_batch_function`

* Remove prints

* Add `StepResource` import

* Add missing return type hint

* Add `StepResources` docs

* Fix typos

Co-authored-by: Agus <[email protected]>

---------

Co-authored-by: Agus <[email protected]>
* Create N replicas per `Step`

* Update `_BatchManager` to handle batch sorting uncertainty

* Add multiple replicas test

* Fix unit tests

* Fix `next_expected_seq_no` needed to be updated if
`routing_batch_function`

* Update `set_next_expected_batch_seq_no` only if no `data`

* Fix `next_expected_seq_no` with `routing_batch_function`

* Remove prints

* Add `StepResource` import

* Add missing return type hint

* Add `StepResources` docs

* Add `get_steps_load_stages` method

* Update to load steps in stages

* Add `_teardown` method

* Add load stages

* Add printing info about stages

* Refactor load stages to avoid race conditions

* Add load stages integration test

* Fix unit tests

* Add unit tests for new methods

* Move send last batch message

* Refactor to make it work with routing batch function

* Add integration test for load stages & routing batch function

* Update docs to tell about resources as runtime parameters

* Add missing doc pages

* Update to load stages from cache

* Fix bugs requesting initial batches

* Add integration tests for recovering states from cache

* Remove atexit

* Fix docstring typos

Co-authored-by: Agus <[email protected]>

---------

Co-authored-by: Agus <[email protected]>
* Deprecate `python==3.8`

* Fix format
…istiset (#762)

* Add option to include the pipeline script as another artifact when pushing a distiset to the hub

* Add documentation for the pipeline script uploaded

* Inform of the new pieline script uploaded to the repository in the README

* Add docs explaining how to run a pipeline using the CLI

* Run python file with distilabel pipeline from CLI

* Update docs with new running method

* Run script by importing the pipeline from the remote module

* Update src/distilabel/cli/pipeline/app.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update src/distilabel/cli/pipeline/utils.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update docs/sections/how_to_guides/advanced/cli/index.md

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update to importerror as per code review

* Add missing import

---------

Co-authored-by: Gabriel Martín Blázquez <[email protected]>
* Add `docs-pr.yml` workflow

* Remove if condition

* Add workflow to remove PR docs on close

* Add `GITHUB_TOKEN`
* Create N replicas per `Step`

* Update `_BatchManager` to handle batch sorting uncertainty

* Add multiple replicas test

* Fix unit tests

* Fix `next_expected_seq_no` needed to be updated if
`routing_batch_function`

* Update `set_next_expected_batch_seq_no` only if no `data`

* Fix `next_expected_seq_no` with `routing_batch_function`

* Remove prints

* Add `StepResource` import

* Add missing return type hint

* Add `StepResources` docs

* Add `get_steps_load_stages` method

* Update to load steps in stages

* Add `_teardown` method

* Add load stages

* Add printing info about stages

* Refactor load stages to avoid race conditions

* Add load stages integration test

* Fix unit tests

* Add unit tests for new methods

* Move send last batch message

* Refactor to make it work with routing batch function

* Add integration test for load stages & routing batch function

* Update docs to tell about resources as runtime parameters

* Add missing doc pages

* Add `ray>=2.31.0` optional dependency

* Initial work for `RayPipeline`

* Update to load stages from cache

* Fix bugs requesting initial batches

* Add integration tests for recovering states from cache

* Remove atexit

* Move `_ProcessWrapper` to different file

* `RayPipeline` mvp

* Install `ray` if `python!=3.12`

* Assign ray actor name

* Fix setting `options` for Ray actor

* Set name for all the queues

* Add requirements

* Add docstrings

* Remove unit test

* Add extra `resources`

* Add `ray` method

* Add `ray[default]` as dependency

* Add `script_executed_in_ray_cluster` function

* Fix step load fail didn't stop the pipeline

* Run with `RayPipeline` if detected Ray cluster

* Set built dag

* Fix unit tests

* Add `Pipeline` to `RayPipeline` unit tests

* Add `ray_init_kwargs` argument

* Add `memory` attribute

* Add simple `RayPipeline` integration test

* Override `RayPipeline.dump` method

* Add docs for `RayPipeline`

* Fix close PR docs
* Move `CudaDevicePlacementMixin` to new module

* Initial work for implementing Magpie

* Simplify magpie implementation

* Remove `use_open_ai` and add `MagpieChatTemplateMixin` to
`InferenceEndpointsLLM`

* Add `MagpieChatTemplateMixin` to `vLLM`

* Add `MagpieGenerator` task

* Fix unit tests

* Fix docstrings

* Mock `HF_TOKEN` environment variable

* Fix list index out of range

* Fix `MagpieGenerator` last batch

* Add `only_instruction` attribute

* Update categories

* testing

* Worth trying

* Add examples

* Add magpie unit tests

* Fix docstring

* Update docstrings

* Apply suggestions from code review

Co-authored-by: Agus <[email protected]>

* Update to `huggingface_hub >= 0.22.0`

* Add generation with `chat_completion`

* Update `agenerate` arguments

* Update unit tests

* Fix `tools` were not being used

* Update unit tests

* Fix list of tuples instead of list of list

* Add missing docstring

* Add `chat_completion` unit tests

* Fix `GroqLLM.generate` unit test after updating `_agenerate`

---------

Co-authored-by: Agus <[email protected]>
* Fix input columns not included in output

* Include `model_name` column as output

* Include `model_name` column

* Fix unit tests magpie

* Fix typo in docstring
…ks and handle `None`s. (#784)

* Add `end_with_user` flag

* Add `include_system_prompt` attribute to `Magpie`

* Update docstrings

* Update `MagpieBase` to handle `None`s

* Fix `InferenceEndpointsLLM` unit tests after release of
`huggingface_hub==0.24.0`
* Add `_NoDaemonPool` class

* Use `Union`

* Update src/distilabel/pipeline/local.py

Co-authored-by: Agus <[email protected]>

* Update dependency version to `vllm>=0.5.3` and add `setuptools`

* Remove pinned `outlines==0.34.0`

* Fix docstring

* Add docs about `vLLM` with `ray`

---------

Co-authored-by: Agus <[email protected]>
* Update default names in GroupColumns

* Fix integration test
* Add generating batches to `GeneratorStep` if unique step in the pipeline

* Remove print
* Add default name for a pipeline

* Move to uuid instead

* Fix test and update final name based on uuid
* Update distilabel phrasing based on PR hugging face hub

* Update README.md

* Update index.md

* Fix typos
* Return `instruction` and `response` if `n_turns==1`

* Update `system_prompt` so it can be also a list

* Update outputs in docstrings
…EmbeddingGeneration` and `FaissNearestNeighbour` steps (#830)

* Add `Embeddings` base class and `SentenceTransformers` class

* Add `EmbeddingGeneration` step

* Add `precision` attribute

* Add docstrings

* Add example to docstring

* Update component gallery to include `Embeddings` models

* Add `sentence-transformers` extra

* Add `FaissNearestNeighbour` step

* Add category and example

* Merge category to icons dictionaries

* Add missing unit tests

* Add `faiss-cpu` and `faiss-gpu` extras

* Update unit tests
* Create file per hostname

* Set default `_desired_num_gpus` to `1`

* Fix `GeneratorTask`s not getting assigned gpus and name

* Add `_init_cuda_device_placement` method

* Remove info message

* Add disabling `CudaDevicePlacementMixin` if `RayPipeline`

* Fix unit test
plaguss and others added 18 commits July 29, 2024 10:09
* Add helper function to create generator step from dataset

* Add integration tests for make_generator_step

* Redirect import

* Update LoadDataFromHub to not call load if a dataset is already defined

* Update docs

* Add unit tests for the new helper function

* Update filename to utils

* Add helper method to insert a root step

* Add logic to create a generator step internally from a dataset

* Pass the dataset variable from all the pipeline implementations

* Add type for the input datasets

* Avoid circular imports

* Add test for pipelines with generator step and dataset

* Add integration tests for dataset passed via run method

* Fix error evaluation dataframe

* Add example on quickstart and entry on how to guide

* Update docs/sections/getting_started/quickstart.md

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update docs/sections/getting_started/quickstart.md

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update src/distilabel/pipeline/base.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update src/distilabel/pipeline/ray.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update src/distilabel/steps/generators/utils.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update src/distilabel/steps/generators/utils.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update src/distilabel/pipeline/local.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Respect import order

* Move functionality to a proper internal method

* Run linter

* Fix format

---------

Co-authored-by: Gabriel Martín Blázquez <[email protected]>
Co-authored-by: David Berenstein <[email protected]>
… signature (#838)

* Do not take into account `disable_cuda_device_placement` for pipeline
signature

* Fix unit test
* Add `RewardModelScore` step

* Use logits

* Update docstring

* Fix unit test

* Adjust abs tolerance
…instead of `None` (#841)

* Fix default value was `ellipsis` instead of `None`

* Fix unit test
* Create placement group for `vLLM`

* Use `SPREAD` if `pipeline_parallel_size>1`

* Fix bundle initialization

* Fix wrong dictionary

* Remove using `SPMD` from ray docs

* Refactor creating `PlacementGroup` for `vLLM`
* Update `_Argilla` base and `TextGenerationToArgilla`

* Fix `_dataset.records.log` and rename to `ArgillaBase`

Co-authored-by: Ben Burtenshaw <[email protected]>

* Update `TextGenerationToArgilla` subclass inheritance

* Remove unused `logger.info` message

* Update `PreferenceToArgilla`

* Update `argilla` extra to install `argilla_sdk`

For the moment it's being installed as `pip install git+https://github.com/argilla-io/argilla-python.git@main`

* Add `ArgillaBase` and subclasses unit tests

* Install `argilla_sdk` from source and add `ipython`

* upgrade argilla dep to latest rc

* udate code with latest changes

* chore: remove unnecessary workspace definition

* fix: wrong argilla module import

* Update docstrings

* Fix lint

* Add check for `api_url` and `api_key`

* Fix unit tests

* Fix unit tests

* Update argilla dependency version

---------

Co-authored-by: Ben Burtenshaw <[email protected]>
Co-authored-by: Francisco Aranda <[email protected]>
Co-authored-by: Gabriel Martín Blázquez <[email protected]>
* Use `CudaDevicePlacementMixin` in `RewardModelScore` step

* Update `_init_cuda_device_placement` to be `LLM` attribute agnostic

* Check if `Step` is instance of `CudaDevicePlacementMixin`
* Allow getting GPUs from several nodes

* Fix multiply by float

* Fix 0 gpus

* Rename variable
* Add Google Analytics and feedback form per page

* Remove duplicate extra tag
* Add `ClientvLLM` class

* Update `ClientvLLM` to use `openai` clients

* Fix lint

* Add unit tests

* Use unrestricted tokenizer
…iplets (#856)

* Add hard-negative flag to include similar challenging negatives on triplets

* Update src/distilabel/steps/tasks/sentence_transformers.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

---------

Co-authored-by: Gabriel Martín Blázquez <[email protected]>
* Grab citations from dag

* Include citations in README template

* Add test to check citations are parsed

* Pass dag to create_distiset function

* Update citation section in steps that are backed by a paper

* Add reference in the docs for the Citations section

* Update docs/sections/how_to_guides/advanced/distiset.md

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Update src/distilabel/distiset.py

Co-authored-by: Gabriel Martín Blázquez <[email protected]>

* Refactor function to grab citations when creating a distiset

---------

Co-authored-by: Gabriel Martín Blázquez <[email protected]>
Copy link

github-actions bot commented Aug 6, 2024

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-857/

Copy link

codspeed-hq bot commented Aug 6, 2024

CodSpeed Performance Report

Merging #857 will not alter performance

Comparing develop (4cbcb90) with develop (ebab004)

Summary

✅ 1 untouched benchmarks

@gabrielmbmb gabrielmbmb merged commit 63f948b into main Aug 6, 2024
13 checks passed
@gabrielmbmb gabrielmbmb deleted the develop branch August 6, 2024 12:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants