Release 1.4.0

argilla-io · Oct 8, 2024 · c0d798a
2 parents ed88585 + 6ef15f4
commit c0d798a

Showing 298 changed files with 18,132 additions and 2,191 deletions.
diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
@@ -42,6 +42,9 @@ jobs:
         if: steps.cache.outputs.cache-hit != 'true'
         run: pip install -e .[docs]
 
+      - name: Check no warnings
+        run: mkdocs build --strict
+
       - name: Set git credentials
         run: |
           git config --global user.name "${{ github.actor }}"

diff --git a/README.md b/README.md
@@ -78,6 +78,8 @@ Requires Python 3.9+
 
 In addition, the following extras are available:
 
+### LLMs
+
 - `anthropic`: for using models available in [Anthropic API](https://www.anthropic.com/api) via the `AnthropicLLM` integration.
 - `cohere`: for using models available in [Cohere](https://cohere.ai/) via the `CohereLLM` integration.
 - `argilla`: for exporting the generated datasets to [Argilla](https://argilla.io/).
@@ -91,19 +93,32 @@ In addition, the following extras are available:
 - `openai`: for using [OpenAI API](https://openai.com/blog/openai-api) models via the `OpenAILLM` integration, or the rest of the integrations based on OpenAI and relying on its client as `AnyscaleLLM`, `AzureOpenAILLM`, and `TogetherLLM`.
 - `vertexai`: for using [Google Vertex AI](https://cloud.google.com/vertex-ai) proprietary models via the `VertexAILLM` integration.
 - `vllm`: for using [vllm](https://github.com/vllm-project/vllm) serving engine via the `vLLM` integration.
+- `sentence-transformers`: for generating sentence embeddings using [sentence-transformers](https://github.com/UKPLab/sentence-transformers).
+
+### Structured generation
+
+- `outlines`: for using structured generation of LLMs with [outlines](https://github.com/outlines-dev/outlines).
+- `instructor`: for using structured generation of LLMs with [Instructor](https://github.com/jxnl/instructor/).
+
+### Data processing
+
+- `ray`: for scaling and distributing a pipeline with [Ray](https://github.com/ray-project/ray).
+- `faiss-cpu` and `faiss-gpu`: for generating sentence embeddings using [faiss](https://github.com/facebookresearch/faiss).
+- `text-clustering`: for using text clustering with [UMAP](https://github.com/lmcinnes/umap) and [Scikit-learn](https://github.com/scikit-learn/scikit-learn).
+- `minhash`: for using minhash for duplicate detection with [datasketch](https://github.com/datasketch/datasketch) and [nltk](https://github.com/nltk/nltk).
 
 ### Example
 
-To run the following example you must install `distilabel` with both `openai` extra:
+To run the following example you must install `distilabel` with the `hf-inference-endpoints` extra:
 
 ```sh
-pip install "distilabel[openai]" --upgrade
+pip install "distilabel[hf-inference-endpoints]" --upgrade
 ```
 
 Then run:
 
 ```python
-from distilabel.llms import OpenAILLM
+from distilabel.llms import InferenceEndpointsLLM
 from distilabel.pipeline import Pipeline
 from distilabel.steps import LoadDataFromHub
 from distilabel.steps.tasks import TextGeneration
@@ -114,9 +129,14 @@ with Pipeline(
 ) as pipeline:
     load_dataset = LoadDataFromHub(output_mappings={"prompt": "instruction"})
 
-    generate_with_openai = TextGeneration(llm=OpenAILLM(model="gpt-3.5-turbo"))
+    text_generation = TextGeneration(
+        llm=InferenceEndpointsLLM(
+            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
+            tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
+        ),
+    )
 
-    load_dataset >> generate_with_openai
+    load_dataset >> text_generation
 
 if __name__ == "__main__":
     distiset = pipeline.run(
@@ -125,7 +145,7 @@ if __name__ == "__main__":
                 "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                 "split": "test",
             },
-            generate_with_openai.name: {
+            text_generation.name: {
                 "llm": {
                     "generation_kwargs": {
                         "temperature": 0.7,
@@ -135,6 +155,7 @@ if __name__ == "__main__":
             },
         },
     )
+    distiset.push_to_hub(repo_id="distilabel-example")
 ```
 
 ## Badges
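
Note on the README example above: the line `load_dataset >> text_generation` connects two steps with the `>>` operator. The following is a toy sketch of how such an operator can be built with Python's `__rshift__` hook. `ToyStep` and its `downstream` attribute are hypothetical names made up for illustration; this is not distilabel's implementation.

```python
class ToyStep:
    """Hypothetical stand-in for a pipeline step (not a distilabel class)."""

    def __init__(self, name):
        self.name = name
        self.downstream = []  # steps that receive this step's output

    def __rshift__(self, other):
        # `a >> b` records b as a successor of a and returns b,
        # so that `a >> b >> c` builds a chain from left to right.
        self.downstream.append(other)
        return other


load_dataset = ToyStep("load_dataset")
text_generation = ToyStep("text_generation")
push = ToyStep("push")

load_dataset >> text_generation >> push

successors = [step.name for step in load_dataset.downstream]
```

Returning the right-hand operand is what makes chaining work: each `>>` hands the next step to the following `>>`.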

diff --git a/docs/api/embedding/embedding_gallery.md b/docs/api/embedding/embedding_gallery.md
@@ -0,0 +1,8 @@
+# Embedding Gallery
+
+This section contains the existing [`Embeddings`][distilabel.embeddings] subclasses implemented in `distilabel`.
+
+::: distilabel.embeddings
+    options:
+        filters:
+        - "!^Embeddings$"
diff --git a/docs/api/embedding/index.md b/docs/api/embedding/index.md
@@ -0,0 +1,7 @@
+# Embedding
+
+This section contains the API reference for the `distilabel` embeddings.
+
+For more information on how the [`Embeddings`][distilabel.steps.tasks.Task] works and see some examples.
+
+::: distilabel.embeddings.base
diff --git a/docs/api/errors.md b/docs/api/errors.md
@@ -0,0 +1,8 @@
+# Errors
+
+This section contains the `distilabel` custom errors. Unlike [exceptions](exceptions.md), errors in `distilabel` are used to handle unexpected situations that can't be anticipated and that can't be handled in a controlled way.
+
+:::distilabel.errors.DistilabelError
+:::distilabel.errors.DistilabelUserError
+:::distilabel.errors.DistilabelTypeError
+:::distilabel.errors.DistilabelNotImplementedError
diff --git a/docs/api/exceptions.md b/docs/api/exceptions.md
@@ -0,0 +1,7 @@
+# Exceptions
+
+This section contains the `distilabel` custom exceptions. Unlike [errors](errors.md), exceptions in `distilabel` are used to handle specific situations that can be anticipated and that can be handled in a controlled way internally by the library.
+
+:::distilabel.exceptions.DistilabelException
+:::distilabel.exceptions.DistilabelGenerationException
+:::distilabel.exceptions.DistilabelOfflineBatchGenerationNotFinishedException
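
Note on the two pages above: `errors.md` and `exceptions.md` distinguish errors (unexpected, not handled in a controlled way) from exceptions (anticipated, handled internally by the library). A minimal sketch of that split follows. All class and function names here are hypothetical and defined locally; they only mirror the idea and make no claim about the actual `distilabel.errors` or `distilabel.exceptions` class hierarchy.

```python
class SketchGenerationException(Exception):
    """Hypothetical: an anticipated failure the library can recover from."""


class SketchUserError(Exception):
    """Hypothetical: an unexpected misuse that must reach the caller."""


def generate_with_retry(generate, prompt, max_attempts=3):
    # Misuse is not recoverable here, so it is raised as an error.
    if not isinstance(prompt, str):
        raise SketchUserError("prompt must be a string")
    for _ in range(max_attempts):
        try:
            return generate(prompt)
        except SketchGenerationException:
            # Anticipated failure: handled internally by trying again.
            continue
    # Every attempt failed; signal "no result" instead of crashing.
    return None


calls = {"count": 0}


def flaky_generate(prompt):
    # Fails on the first two calls, succeeds on the third.
    calls["count"] += 1
    if calls["count"] < 3:
        raise SketchGenerationException("temporary failure")
    return f"ok: {prompt}"


result = generate_with_retry(flaky_generate, "hi")
```

The caller never sees the two `SketchGenerationException` instances, while a non-string prompt would still raise `SketchUserError`.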
diff --git a/docs/api/llm/anthropic.md b/docs/api/llm/anthropic.md
diff --git a/docs/api/llm/anyscale.md b/docs/api/llm/anyscale.md
diff --git a/docs/api/llm/azure.md b/docs/api/llm/azure.md
diff --git a/docs/api/llm/cohere.md b/docs/api/llm/cohere.md
diff --git a/docs/api/llm/groq.md b/docs/api/llm/groq.md
diff --git a/docs/api/llm/huggingface.md b/docs/api/llm/huggingface.md
diff --git a/docs/api/llm/litellm.md b/docs/api/llm/litellm.md
diff --git a/docs/api/llm/llamacpp.md b/docs/api/llm/llamacpp.md
diff --git a/docs/api/llm/llm_gallery.md b/docs/api/llm/llm_gallery.md
@@ -0,0 +1,10 @@
+# LLM Gallery
+
+This section contains the existing [`LLM`][distilabel.llms] subclasses implemented in `distilabel`.
+
+::: distilabel.llms
+    options:
+        filters:
+        - "!^LLM$"
+        - "!^AsyncLLM$"
+        - "!typing"
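
Note on the `filters` option above: each entry is a regular expression matched against object names, and a leading `!` turns the pattern into an exclusion. The sketch below approximates that behaviour so the effect of the three patterns can be checked. `keep_name` is a hypothetical helper written for this note, and the real mkdocstrings handler may differ in details.

```python
import re


def keep_name(name, filters):
    """Approximate include/exclude filtering; not the mkdocstrings source."""
    keep = None
    saw_include = False
    saw_exclude = False
    for pattern in filters:
        exclude = pattern.startswith("!")
        regex = pattern[1:] if exclude else pattern
        saw_exclude = saw_exclude or exclude
        saw_include = saw_include or not exclude
        if re.search(regex, name):
            # The last matching pattern decides.
            keep = not exclude
    if keep is None:
        # Nothing matched: keep the name unless the list only has includes.
        return saw_exclude or not saw_include
    return keep


llm_filters = ["!^LLM$", "!^AsyncLLM$", "!typing"]
names = ["LLM", "AsyncLLM", "OpenAILLM", "vLLM", "typing"]
kept = [name for name in names if keep_name(name, llm_filters)]
```

Under these assumptions only the concrete subclasses survive, which matches the page's stated purpose of listing the existing `LLM` subclasses without the base classes.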
diff --git a/docs/api/llm/mistral.md b/docs/api/llm/mistral.md
diff --git a/docs/api/llm/ollama.md b/docs/api/llm/ollama.md
diff --git a/docs/api/llm/openai.md b/docs/api/llm/openai.md
diff --git a/docs/api/llm/together.md b/docs/api/llm/together.md
diff --git a/docs/api/llm/vertexai.md b/docs/api/llm/vertexai.md
diff --git a/docs/api/llm/vllm.md b/docs/api/llm/vllm.md
diff --git a/docs/api/pipeline/step_wrapper.md b/docs/api/pipeline/step_wrapper.md
@@ -0,0 +1,4 @@
+# Step Wrapper
+
+::: distilabel.pipeline.step_wrapper._StepWrapper
+::: distilabel.pipeline.step_wrapper._StepWrapperException
diff --git a/docs/api/pipeline/utils.md b/docs/api/pipeline/utils.md
diff --git a/docs/api/step/typing.md b/docs/api/step/typing.md
@@ -0,0 +1,3 @@
+# Step Typing
+
+::: distilabel.steps.typing
diff --git a/docs/api/step_gallery/columns.md b/docs/api/step_gallery/columns.md
@@ -6,3 +6,4 @@ This section contains the existing steps intended to be used for common column o
 ::: distilabel.steps.columns.keep
 ::: distilabel.steps.columns.merge
 ::: distilabel.steps.columns.group
+::: distilabel.steps.columns.utils
diff --git a/docs/api/step_gallery/extra.md b/docs/api/step_gallery/extra.md
@@ -1,6 +1,11 @@
 # Extra
 
-::: distilabel.steps.generators.data
-::: distilabel.steps.deita
-::: distilabel.steps.formatting
-::: distilabel.steps.typing
+::: distilabel.steps
+    options:
+        filters:
+        - "!Argilla"
+        - "!Columns"
+        - "!From(Disk|FileSystem)"
+        - "!Hub"
+        - "![Ss]tep"
+        - "!typing"
diff --git a/docs/api/step_gallery/hugging_face.md b/docs/api/step_gallery/hugging_face.md
@@ -5,3 +5,4 @@ This section contains the existing steps integrated with `Hugging Face` so as to
 ::: distilabel.steps.LoadDataFromDisk
 ::: distilabel.steps.LoadDataFromFileSystem
 ::: distilabel.steps.LoadDataFromHub
+::: distilabel.steps.PushToHub
diff --git a/docs/api/task_gallery/index.md b/docs/api/task/task_gallery.md
diff --git a/docs/assets/images/sections/caching/caching_1.png b/docs/assets/images/sections/caching/caching_1.png
diff --git a/docs/assets/images/sections/caching/caching_2.png b/docs/assets/images/sections/caching/caching_2.png
diff --git a/docs/assets/images/sections/caching/caching_pipe_1.png b/docs/assets/images/sections/caching/caching_pipe_1.png
diff --git a/docs/assets/images/sections/caching/caching_pipe_2.png b/docs/assets/images/sections/caching/caching_pipe_2.png
diff --git a/docs/assets/images/sections/caching/caching_pipe_3.png b/docs/assets/images/sections/caching/caching_pipe_3.png
diff --git a/docs/assets/images/sections/caching/caching_pipe_4.png b/docs/assets/images/sections/caching/caching_pipe_4.png
diff --git a/docs/assets/images/sections/community/compare-pull-request.PNG b/docs/assets/images/sections/community/compare-pull-request.PNG
diff --git a/docs/assets/images/sections/community/create-branch.PNG b/docs/assets/images/sections/community/create-branch.PNG
diff --git a/docs/assets/images/sections/community/edit-file.PNG b/docs/assets/images/sections/community/edit-file.PNG
diff --git a/docs/assets/images/sections/how_to_guides/basic/pipeline.png b/docs/assets/images/sections/how_to_guides/basic/pipeline.png
diff --git a/docs/assets/images/sections/how_to_guides/tasks/task_print.png b/docs/assets/images/sections/how_to_guides/tasks/task_print.png
diff --git a/docs/assets/pipelines/arena-hard.png b/docs/assets/pipelines/arena-hard.png
diff --git a/docs/assets/pipelines/clair.png b/docs/assets/pipelines/clair.png
diff --git a/docs/assets/pipelines/clean-dataset.png b/docs/assets/pipelines/clean-dataset.png
diff --git a/docs/assets/pipelines/deepseek.png b/docs/assets/pipelines/deepseek.png
diff --git a/docs/assets/pipelines/deita.png b/docs/assets/pipelines/deita.png
diff --git a/docs/assets/pipelines/generate-preference-dataset.png b/docs/assets/pipelines/generate-preference-dataset.png
diff --git a/docs/assets/pipelines/instruction_backtranslation.png b/docs/assets/pipelines/instruction_backtranslation.png
diff --git a/docs/assets/pipelines/knowledge_graphs.png b/docs/assets/pipelines/knowledge_graphs.png
diff --git a/docs/assets/pipelines/prometheus.png b/docs/assets/pipelines/prometheus.png
diff --git a/docs/assets/pipelines/sentence-transformer.png b/docs/assets/pipelines/sentence-transformer.png
diff --git a/docs/assets/pipelines/ultrafeedback.png b/docs/assets/pipelines/ultrafeedback.png
diff --git a/docs/assets/tutorials-assets/overview-apigen.jpg b/docs/assets/tutorials-assets/overview-apigen.jpg
diff --git a/docs/index.md b/docs/index.md
@@ -38,21 +38,39 @@ hide:
 
 Distilabel is the framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
 
-If you just want to get started, we recommend you check the [documentation](http://distilabel.argilla.io/). Curious, and want to know more? Keep reading!
+<div class="grid cards" markdown>
+
+-  __Get started in 5 minutes!__
+
+    ---
+
+    Install distilabel with `pip` and run your first `Pipeline` to generate and evaluate synthetic data.
+
+    [:octicons-arrow-right-24: Quickstart](./sections/getting_started/quickstart.md)
+
+-  __How-to guides__
+
+    ---
+
+    Get familiar with the basics of distilabel. Learn how to define `steps`, `tasks` and `llms` and run your `Pipeline`.
+
+    [:octicons-arrow-right-24: Learn more](./sections/how_to_guides/index.md)
+
+</div>
 
 ## Why use distilabel?
 
 Distilabel can be used for generating synthetic data and AI feedback for a wide variety of projects including traditional predictive NLP (classification, extraction, etc.), or generative and large language model scenarios (instruction following, dialogue generation, judging etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback.
 
-### Improve your AI output quality through data quality
+<p style="font-size:20px">Improve your AI output quality through data quality</p>
 
 Compute is expensive and output quality is important. We help you **focus on data quality**, which tackles the root cause of both of these problems at once. Distilabel helps you to synthesize and judge data to let you spend your valuable time **achieving and keeping high-quality standards for your synthetic data**.
 
-### Take control of your data and models
+<p style="font-size:20px">Take control of your data and models</p>
 
 **Ownership of data for fine-tuning your own LLMs** is not easy but distilabel can help you to get started. We integrate **AI feedback from any LLM provider out there** using one unified API.
 
-### Improve efficiency by quickly iterating on the right research and LLMs
+<p style="font-size:20px">Improve efficiency by quickly iterating on the right data and models</p>
 
 Synthesize and judge data with **latest research papers** while ensuring **flexibility, scalability and fault tolerance**. So you can focus on improving your data and training your models.