docs: API reference review #932

Merged: 19 commits merged on Aug 29, 2024
18 changes: 12 additions & 6 deletions README.md
@@ -94,16 +94,16 @@ In addition, the following extras are available:

### Example

To run the following example you must install `distilabel` with both `openai` extra:
To run the following example you must install `distilabel` with the `hf-inference-endpoints` extra:

```sh
pip install "distilabel[openai]" --upgrade
pip install "distilabel[hf-inference-endpoints]" --upgrade
```

Then run:

```python
from distilabel.llms import OpenAILLM
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration
@@ -114,9 +114,14 @@ with Pipeline(
) as pipeline:
load_dataset = LoadDataFromHub(output_mappings={"prompt": "instruction"})

generate_with_openai = TextGeneration(llm=OpenAILLM(model="gpt-3.5-turbo"))
text_generation = TextGeneration(
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
),
)

load_dataset >> generate_with_openai
load_dataset >> text_generation

if __name__ == "__main__":
distiset = pipeline.run(
@@ -125,7 +130,7 @@ if __name__ == "__main__":
"repo_id": "distilabel-internal-testing/instruction-dataset-mini",
"split": "test",
},
generate_with_openai.name: {
text_generation.name: {
"llm": {
"generation_kwargs": {
"temperature": 0.7,
@@ -135,6 +140,7 @@
},
},
)
distiset.push_to_hub(repo_id="distilabel-example")
```

## Badges
8 changes: 8 additions & 0 deletions docs/api/embedding/embedding_gallery.md
@@ -0,0 +1,8 @@
# Embedding Gallery

This section contains the existing [`Embeddings`][distilabel.embeddings] subclasses implemented in `distilabel`.

::: distilabel.embeddings
options:
filters:
- "!^Embeddings$"
7 changes: 7 additions & 0 deletions docs/api/embedding/index.md
@@ -0,0 +1,7 @@
# Embedding

This section contains the API reference for the `distilabel` embeddings.

For more information on how the [`Embeddings`][distilabel.embeddings] work, and to see some examples, check the Embedding Gallery.

::: distilabel.embeddings.base
3 changes: 0 additions & 3 deletions docs/api/llm/anthropic.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/anyscale.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/azure.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/cohere.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/groq.md

This file was deleted.

6 changes: 0 additions & 6 deletions docs/api/llm/huggingface.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/litellm.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/llamacpp.md

This file was deleted.

10 changes: 10 additions & 0 deletions docs/api/llm/llm_gallery.md
@@ -0,0 +1,10 @@
# LLM Gallery

This section contains the existing [`LLM`][distilabel.llms] subclasses implemented in `distilabel`.

::: distilabel.llms
options:
filters:
- "!^LLM$"
- "!^AsyncLLM$"
- "!typing"
3 changes: 0 additions & 3 deletions docs/api/llm/mistral.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/ollama.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/openai.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/together.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/vertexai.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/llm/vllm.md

This file was deleted.

3 changes: 3 additions & 0 deletions docs/api/step/typing.md
@@ -0,0 +1,3 @@
# Step Typing

::: distilabel.steps.typing
13 changes: 9 additions & 4 deletions docs/api/step_gallery/extra.md
@@ -1,6 +1,11 @@
# Extra

::: distilabel.steps.generators.data
::: distilabel.steps.deita
::: distilabel.steps.formatting
::: distilabel.steps.typing
::: distilabel.steps
options:
filters:
- "!Argilla"
- "!Columns"
- "!From(Disk|FileSystem)"
- "!Hub"
- "![Ss]tep"
- "!typing"
1 change: 1 addition & 0 deletions docs/api/step_gallery/hugging_face.md
@@ -5,3 +5,4 @@ This section contains the existing steps integrated with `Hugging Face` so as to
::: distilabel.steps.LoadDataFromDisk
::: distilabel.steps.LoadDataFromFileSystem
::: distilabel.steps.LoadDataFromHub
::: distilabel.steps.PushToHub
File renamed without changes.
Three binary files changed (cannot be displayed in the diff view).
159 changes: 159 additions & 0 deletions docs/sections/community/contributor.md
@@ -0,0 +1,159 @@
---
description: This is a step-by-step guide to help you contribute to the distilabel project. We are excited to have you on board! 🚀
hide:
- footer
---

Thank you for investing your time in contributing to the project! Any contribution you make will be reflected in the most recent version of distilabel 🤩.

??? Question "New to contributing in general?"
If you're a new contributor, read the [README](https://github.com/argilla-io/distilabel/blob/develop/README.md) to get an overview of the project. In addition, here are some resources to help you get started with open-source contributions:

* **Discord**: You are welcome to join the [distilabel Discord community](http://hf.co/join/discord), where you can keep in touch with other users, contributors and the distilabel team. In the following [section](#first-contact-in-discord), you can find more information on how to get started in Discord.
* **Git**: This is a very useful tool to keep track of the changes in your files. Using the command-line interface (CLI), you can make your contributions easily. For that, you need to have it [installed and updated](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) on your computer.
* **GitHub**: It is a platform and cloud-based service that uses Git and allows developers to collaborate on projects. To contribute to distilabel, you'll need to create an account. Check the [Contributor Workflow with Git and GitHub](#contributor-workflow-with-git-and-github) for more info.
* **Developer Documentation**: To collaborate, you'll need to set up an efficient environment. Check the [Installation](../getting_started/installation.md) guide to know how to do it.

## First Contact in Discord

Discord is a handy tool for more casual conversations and to answer day-to-day questions. As part of Hugging Face, we have set up some distilabel channels on the server. Click [here](http://hf.co/join/discord) to join the Hugging Face Discord community effortlessly.

When part of the Hugging Face Discord, you can select "Channels & roles" and select "Argilla" along with any of the other groups that are interesting to you. "Argilla" will cover anything about argilla and distilabel. You can join the following channels:

* **#argilla-distilabel-announcements**: 📣 Stay up-to-date.
* **#argilla-distilabel-general**: 💬 For general discussions.
* **#argilla-distilabel-help**: 🙋‍♀️ Need assistance? We're always here to help. Select the appropriate label (argilla or distilabel) for your issue and post it.

So now there is only one thing left to do: introduce yourself and talk to the community. You'll always be welcome! 🤗👋


## Contributor Workflow with Git and GitHub

If you're working with distilabel and suddenly a new idea comes to your mind or you find an issue that can be improved, it's time to actively participate and contribute to the project!

### Report an issue

If you spot a problem, [search if an issue already exists](https://github.com/argilla-io/distilabel/issues?q=is%3Aissue); you can use the `Label` filter to narrow the search. If the issue already exists, join the conversation. If it does not, create a new one by clicking `New Issue`. This will show several templates; choose the one that best suits your issue and fill it in following the guidelines, being as clear as possible. In addition, you can assign yourself to the issue and add or choose the right labels. Finally, click on `Submit new issue`.


### Work with a fork

#### Fork the distilabel repository

After having reported the issue, you can start working on it. For that, you will need to create a fork of the project. To do that, click on the `Fork` button. Now, fill in the information. Remember to uncheck the `Copy develop branch only` if you are going to work in or from another branch (for instance, to fix documentation, the `main` branch is used). Then, click on `Create fork`.

You will be redirected to your fork. You can see that you are in your fork because the name of the repository will be your `username/distilabel`, and it will indicate `forked from argilla-io/distilabel`.


#### Clone your forked repository

In order to make the required adjustments, clone the forked repository to your local machine. Choose the destination folder and run the following command:

```sh
git clone https://github.com/[your-github-username]/distilabel.git
cd distilabel
```

To keep your fork's `main`/`develop` branches up to date with our repo, add it as an upstream remote:

```sh
git remote add upstream https://github.com/argilla-io/distilabel.git
```
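With the `upstream` remote in place, a sync before starting new work might look like this (a sketch; substitute `main` for `develop` when working on documentation):

```sh
# Download the latest commits from the upstream repository
git fetch upstream

# Bring your local develop branch up to date with upstream
git checkout develop
git merge upstream/develop

# Optionally, push the updated branch to your fork
git push origin develop
```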


### Create a new branch

For each issue you're addressing, it's advisable to create a new branch. GitHub offers a straightforward method to streamline this process.

> ⚠️ Never work directly on the `main` or `develop` branch. Always create a new branch for your changes.

Navigate to your issue, and on the right column, select `Create a branch`.

![Create a branch](../../assets/images/sections/community/create-branch.PNG)

After the new window pops up, the branch will be named after the issue and include a prefix such as `feature/`, `bug/`, or `docs/` to facilitate quick recognition of the issue type. In the `Repository destination`, pick your fork (`[your-github-username]/distilabel`), and then select `Change branch source` to specify the source branch for creating the new one. Complete the process by clicking `Create branch`.

> 🤔 Remember that the `main` branch is only used to work with the documentation. For any other changes, use the `develop` branch.

Now, locally, change to the new branch you just created.

```sh
git fetch origin
git checkout [branch-name]
```

### Make changes and push them

Make the changes you want in your local repository, and test that everything works and you are following the guidelines.

Once you have finished, check the status of your repository and synchronize it with the upstream repo with the following commands:

```sh
# Check the status of your repository
git status

# Synchronize with the upstream repo
git checkout [branch-name]
git rebase [default-branch]
```

If everything is right, commit and push the changes to your fork by running the following commands:

```sh
# Add the changes to the staging area
git add filename

# Commit the changes by writing a proper message
git commit -m "commit-message"

# Push the changes to your fork
git push origin [branch-name]
```

When pushing, you will be asked to enter your GitHub login credentials. Once the push is complete, all local commits will be on your GitHub repository.


### Create a pull request

Come back to GitHub, navigate to the original repository where you created your fork, and click on `Compare & pull request`.

![compare-and-pr](../../assets/images/sections/community/compare-pull-request.PNG)

First, click on `compare across forks` and select the right repositories and branches.

> In the base repository, keep in mind that you should select either `main` or `develop` based on the modifications made. In the head repository, indicate your forked repository and the branch corresponding to the issue.

Then, fill in the pull request template. You should add a prefix to the PR name, as we did with the branch above. If you are working on a new feature, you can name your PR as `feat: TITLE`. If your PR consists of a solution for a bug, you can name your PR as `bug: TITLE`. And, if your work is for improving the documentation, you can name your PR as `docs: TITLE`.

In addition, on the right side, you can select a reviewer (for instance, if you discussed the issue with a member of the team) and assign the pull request to yourself. It is highly advisable to add labels to the PR as well; you can do this in the labels section on the right of the screen. For instance, if you are addressing a bug, add the `bug` label, or if the PR is related to the documentation, add the `documentation` label. This way, PRs can be easily filtered.

Finally, fill in the template carefully and follow the guidelines. Remember to link the original issue and enable the checkbox to allow maintainer edits so the branch can be updated for a merge. Then, click on `Create pull request`.


### Review your pull request

Once you submit your PR, a team member will review your proposal. We may ask questions, request additional information, or ask for changes to be made before a PR can be merged, either using [suggested changes](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/incorporating-feedback-in-your-pull-request) or pull request comments.

You can apply the changes directly through the UI (check the files changed and click on the right-corner three dots; see image below) or from your fork, and then commit them to your branch. The PR will be updated automatically, and the suggestions will appear as `outdated`.

![edit-file-from-UI](../../assets/images/sections/community/edit-file.PNG)

> If you run into any merge issues, check out this [git tutorial](https://github.com/skills/resolve-merge-conflicts) to help you resolve merge conflicts and other issues.


### Your PR is merged!

Congratulations 🎉🎊 We thank you 🤩

Once your PR is merged, your contributions will be publicly visible on the [distilabel GitHub](https://github.com/argilla-io/distilabel#contributors).

Additionally, we will include your changes in the next release based on our [development branch](https://github.com/argilla-io/distilabel/tree/develop).

## Additional resources

Here are some helpful resources for your reference.

* [Configuring Discord](https://support.discord.com/hc/en-us/categories/115000217151), a guide to learning how to get started with Discord.
* [Pro Git](https://git-scm.com/book/en/v2), a book to learn Git.
* [Git in VSCode](https://code.visualstudio.com/docs/sourcecontrol/overview), a guide to learning how to easily use Git in VSCode.
* [GitHub Skills](https://skills.github.com/), an interactive course for learning GitHub.
14 changes: 7 additions & 7 deletions docs/sections/getting_started/faq.md
@@ -7,20 +7,20 @@ hide:
# Frequently Asked Questions (FAQ)

??? faq "How can I rename the columns in a batch?"
Every [`Step`][distilabel.steps.base.Step] has both `input_mappings` and `output_mappings` attributes, that can be used to rename the columns in each batch.
Every [`Step`][distilabel.steps.base.Step] has both `input_mappings` and `output_mappings` attributes that can be used to rename the columns in each batch.

But `input_mappings` will only map, meaning that if you have a batch with the column `A` and you want to rename to `B`, you should use `input_mappings={"A": "B"}`, but that will only be applied to that specific [`Step`][distilabel.steps.base.Step] meaning that the next step in the pipeline will still have the column `A` instead of `B`.
But `input_mappings` will only map, meaning that if you have a batch with the column `A` and you want to rename it to `B`, you should use `input_mappings={"A": "B"}`, but that will only be applied to that specific [`Step`][distilabel.steps.base.Step] meaning that the next step in the pipeline will still have the column `A` instead of `B`.

While `output_mappings` will indeed apply the rename, meaning that if the [`Step`][distilabel.steps.base.Step] produces the column `A` and you want to rename to `B`, you should use `output_mappings={"A": "B"}`, and that will be applied to the next [`Step`][distilabel.steps.base.Step] in the pipeline.
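The difference can be illustrated with a plain-Python sketch of the renaming behavior (a conceptual model, not the actual `distilabel` internals):

```python
def apply_input_mappings(batch: dict, input_mappings: dict) -> dict:
    # Renames columns only for this step's own view of the batch;
    # the batch passed to the next step keeps the original names.
    return {input_mappings.get(col, col): v for col, v in batch.items()}


def apply_output_mappings(batch: dict, output_mappings: dict) -> dict:
    # Renames columns in the batch that is handed to the next step.
    return {output_mappings.get(col, col): v for col, v in batch.items()}


batch = {"A": [1, 2, 3]}

# input_mappings={"A": "B"}: this step sees "B", but downstream still sees "A"
step_view = apply_input_mappings(batch, {"A": "B"})

# output_mappings={"A": "B"}: the next step now sees "B"
next_batch = apply_output_mappings(batch, {"A": "B"})
```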

??? faq "Will the API Keys be exposed when sharing the pipeline?"
No, those will be masked out using `pydantic.SecretStr`, meaning that those won't be exposed when sharing the pipeline.

This also means that if you want to re-run your own pipeline and the API keys have not been provided via environment variable but either via attribute or runtime parameter, you will need to provide them again.
This also means that if you want to re-run your own pipeline and the API keys have not been provided via environment variable but either via an attribute or runtime parameter, you will need to provide them again.
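The masking behavior of `pydantic.SecretStr` can be approximated with a minimal stdlib sketch (a hypothetical class, not the distilabel implementation):

```python
class MaskedSecret:
    """Holds a secret value that is hidden from repr/str output."""

    def __init__(self, value: str) -> None:
        self._value = value

    def get_secret_value(self) -> str:
        # The real value must be requested explicitly.
        return self._value

    def __repr__(self) -> str:
        return "MaskedSecret('**********')"

    __str__ = __repr__


api_key = MaskedSecret("sk-very-secret")
print(api_key)  # the secret is masked when printed or dumped
```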

??? faq "Does it work for Windows?"

Yes, but you may need to set the `multiprocessing` context in advance, to ensure that the `spawn` method is used, since the default method `fork` is not available on Windows.
Yes, but you may need to set the `multiprocessing` context in advance to ensure that the `spawn` method is used since the default method `fork` is not available on Windows.

```python
import multiprocessing as mp
@@ -29,16 +29,16 @@ hide:
```
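A complete version of the snippet might look like this (a sketch; the `__main__` guard matters because `spawn` re-imports the main module in each worker process):

```python
import multiprocessing as mp


def run_pipeline() -> None:
    # Build and run your distilabel Pipeline here.
    ...


if __name__ == "__main__":
    # Must be called once, before any process is started.
    mp.set_start_method("spawn")
    run_pipeline()
```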

??? faq "Will the custom Steps / Tasks / LLMs be serialized too?"
No, at the moment only the references to the classes within the `distilabel` library will be serialized, meaning that if you define a custom class used within the pipeline, the serialization won't break, but the deserialize will fail since the class won't be available, unless used from the same file.
No, at the moment, only the references to the classes within the `distilabel` library will be serialized, meaning that if you define a custom class used within the pipeline, the serialization won't break, but deserialization will fail since the class won't be available, unless it is used from the same file.

??? faq "What happens if `Pipeline.run` fails? Do I lose all the data?"
No, indeed we're using a cache mechanism to store all the intermediate results in disk, so that if a [`Step`][distilabel.steps.base.Step] fails, the pipeline can be re-run from that point without losing the data, only if nothing is changed in the `Pipeline`.
No, we're using a cache mechanism to store all the intermediate results on disk, so if a [`Step`][distilabel.steps.base.Step] fails, the pipeline can be re-run from that point without losing the data, provided nothing is changed in the `Pipeline`.

All the data will be stored in `.cache/distilabel`, but the only data that will persist at the end of the `Pipeline.run` execution is the one from the leaf step/s, so bear that in mind.

For more information on the caching mechanism in `distilabel`, you can check the [Learn - Advanced - Caching](../how_to_guides/advanced/caching.md) section.

Also note that when running a [`Step`][distilabel.steps.base.Step] or a [`Task`][distilabel.steps.tasks.Task] standalone, the cache mechanism won't be used, so if you want to use that, you should use the `Pipeline` context manager.
Also, note that when running a [`Step`][distilabel.steps.base.Step] or a [`Task`][distilabel.steps.tasks.Task] standalone, the cache mechanism won't be used, so if you want to use that, you should use the `Pipeline` context manager.

??? faq "How can I use the same `LLM` across several tasks without having to load it several times?"
You can serve the LLM using a solution like TGI or vLLM, and then connect to it using an `AsyncLLM` client like `InferenceEndpointsLLM` or `OpenAILLM`. Please refer to the [Serving LLMs guide](../how_to_guides/advanced/serving_an_llm_for_reuse.md) for more information.