[FEATURE] argilla: add support to distribution (#5187)
# Description

This PR adds support for configuring the task distribution strategy when creating or updating datasets.

We can create datasets with a specific task distribution setup:
```python
from argilla import Dataset, LabelQuestion, Settings, TaskDistribution, TextField

dataset_name = "my_dataset"

task_distribution = TaskDistribution(min_submitted=4)

settings = Settings(
    fields=[TextField(name="text", title="text")],
    questions=[LabelQuestion(name="label", title="text", labels=["positive", "negative"])],
    distribution=task_distribution,
)
dataset = Dataset(dataset_name, settings=settings).create()
```
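
With the overlap-based strategy (currently the only one available), `min_submitted=4` means each record needs four submitted responses before its status changes to `completed` and it leaves everyone's pending queue.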

or update an existing dataset (as long as it has no user responses yet):
```python
dataset = client.datasets(...)

dataset.settings.distribution.min_submitted = 100
# or 
dataset.distribution.min_submitted = 100
# or 
dataset.distribution = TaskDistribution(min_submitted=100)
dataset.update()
```


Closes #5033
Closes #5034


Refs: #5246


**Type of change**

- New feature (non-breaking change which adds functionality)
- Improvement (change adding some improvement to an existing
functionality)

**How Has This Been Tested**

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature
works
- I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)

---------

Co-authored-by: José Francisco Calvo <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Damián Pumar <[email protected]>
Co-authored-by: José Francisco Calvo <[email protected]>
Co-authored-by: Leire <[email protected]>
Co-authored-by: David Berenstein <[email protected]>
Co-authored-by: Natalia Elvira <[email protected]>
Co-authored-by: Sara Han <[email protected]>
9 people authored Jul 19, 2024
1 parent 4237e68 commit e640924
Showing 21 changed files with 391 additions and 49 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/argilla.yml
@@ -21,7 +21,7 @@ jobs:
build:
services:
argilla-quickstart:
image: argilla/argilla-quickstart:main
image: argilladev/argilla-quickstart:develop
ports:
- 6900:6900
env:
26 changes: 10 additions & 16 deletions argilla/docs/how_to_guides/annotate.md
@@ -72,7 +72,13 @@ If you are starting an annotation effort, all the records are initially kept in
- **Pending**: The records without a response.
- **Draft**: The records with partial responses. They can be submitted or discarded later. You can’t move them back to the pending queue.
- **Discarded**: The records may or may not have responses. They can be edited but you can’t move them back to the pending queue.
- **Submitted**: The records have been fully annotated and have already been submitted.
- **Submitted**: The records have been fully annotated and have already been submitted. You can remove them from this queue and send them to the draft or discarded queues, but never back to the pending queue.

!!! note
If you are working as part of a team, the number of records in your Pending queue may change as other members of the team submit responses and those records get completed.

!!! tip
If you are working as part of a team, the records in the draft queue that have been completed by other team members will show a check mark to indicate that there is no need to provide a response.

### Suggestions

@@ -115,9 +121,9 @@ The bulk view displays the records in a vertical list. Once this view is active,

### Annotation progress

The global progress of the annotation task from all users is displayed in the dataset list. This is indicated in the `Global progress` column, which shows the number of records still to be annotated, along with a progress bar. The progress bar displays the percentage and number of records submitted, conflicting (i.e., those with both submitted and discarded responses), discarded and pending by hovering your mouse over it.
You can track the progress of an annotation task in the progress bar shown in the dataset list and in the progress panel inside the dataset. This bar shows the number of records that have been completed (i.e., those that have the minimum number of submitted responses) and those left to be completed.

You can track your annotation progress in real time from the right-bottom panel inside the dataset page. This means that, while you are annotating, the progress bar updates as you submit or discard a record. Expanding the panel, the distribution of `Pending`, `Draft`, `Submitted` and `Discarded` responses is displayed in a donut chart.
You can also track your own progress in real time by expanding the right-bottom panel inside the dataset page. There you can see the number of records for which you have `Pending`, `Draft`, `Submitted` and `Discarded` responses.

## Use search, filters, and sort

@@ -173,16 +179,4 @@ You can sort your records according to one or several attributes.

The insertion time and last update are general to all records.

The suggestion scores, response, and suggestion values for rating questions and metadata properties are available only when they were provided.

## Annotate in teams

!!! note
Argilla 2.1 will come with automatic task distribution, which will allow you to distribute the work across several users more efficiently.

### Edit guidelines in the settings

As an `owner` or `admin`, you can edit the guidelines as much as you need from the icon settings on the header. Markdown format is enabled.

!!! tip
If you want further guidance on good practices for guidelines during the project development, check this [blog post](https://argilla.io/blog/annotation-guidelines-practices/).
The suggestion scores, response, and suggestion values for rating questions and metadata properties are available only when they were provided.
15 changes: 15 additions & 0 deletions argilla/docs/how_to_guides/dataset.md
@@ -42,6 +42,7 @@ A **dataset** is a collection of records that you can configure for labelers to
vectors=[rg.VectorField(name="vector", dimensions=10)],
guidelines="guidelines",
allow_extra_metadata=True,
    distribution=rg.TaskDistribution(min_submitted=2)
)
```

@@ -96,6 +97,7 @@ settings = rg.Settings(
guidelines="Select the sentiment of the prompt.",
fields=[rg.TextField(name="prompt", use_markdown=True)],
questions=[rg.LabelQuestion(name="sentiment", labels=["positive", "negative"])],
distribution=rg.TaskDistribution(min_submitted=3)
)

dataset1 = rg.Dataset(name="sentiment_analysis_1", settings=settings)
@@ -395,6 +397,19 @@ It is good practice to use at least the dataset guidelines if not both methods.
!!! tip
If you want further guidance on good practices for guidelines during the project development, check our [blog post](https://argilla.io/blog/annotation-guidelines-practices/).

### Distribution

When working as a team, you may want to distribute the annotation task to ensure efficiency and quality. You can use the `TaskDistribution` settings to configure the minimum number of submitted responses expected for each record. Argilla will use this setting to automatically handle records in your team members' pending queues.

```python
rg.TaskDistribution(
    min_submitted=2
)
```
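
If no `distribution` is provided, Argilla uses the default of a minimum of 1 submitted response per record.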

> To learn more about how to distribute the task among team members, check the [Distribute the annotation guide](../how_to_guides/distribution.md).

## List datasets

You can list all the datasets available in a workspace using the `datasets` attribute of the `Workspace` class. You can also use `len(workspace.datasets)` to get the number of datasets in a workspace.
76 changes: 76 additions & 0 deletions argilla/docs/how_to_guides/distribution.md
@@ -0,0 +1,76 @@
---
description: In this section, we will provide a step-by-step guide to show how to distribute the annotation task among team members.
---

# Distribute the annotation task among the team

This guide explains how you can use Argilla’s **automatic task distribution** to efficiently divide the task of annotating a dataset among multiple team members.

Owners and admins can define the minimum number of submitted responses expected for each record, depending on whether the dataset should have annotation overlap and how much. Argilla will use this setting to automatically handle the records shown in the pending queues of all users with access to the dataset.

When a record has met the minimum number of submissions, the status of the record will change to `completed` and the record will be removed from the `Pending` queue of all team members, so they can focus on providing responses where they are most needed. The dataset’s annotation task will be fully completed once all records have the `completed` status.

![Task Distribution diagram](../assets/images/how_to_guides/distribution/taskdistribution.svg)

!!! note
The status of a record can be either `completed`, when it has the required number of responses with `submitted` status, or `pending`, when it doesn’t meet this requirement.

Each record can have multiple responses and each of those can have the status `submitted`, `discarded` or `draft`.

!!! info "Main Class"

```python
rg.TaskDistribution(
    min_submitted=2
)
```
> Check the [Task Distribution - Python Reference](../reference/argilla/settings/task_distribution.md) to see the attributes, arguments, and methods of the `TaskDistribution` class in detail.
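
Once annotation is underway, a minimal sketch like the following (assuming a reachable Argilla server and the example `my_dataset` in `my_workspace`) counts how many records have already reached the `completed` status, using the same status filter available when querying records:

```python
import argilla as rg

client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")

workspace = client.workspaces("my_workspace")
dataset = client.datasets(name="my_dataset", workspace=workspace)

# Records whose status is `completed`, i.e. they already reached the minimum
# number of submitted responses defined in the distribution settings.
completed_filter = rg.Query(filter=rg.Filter(("status", "==", "completed")))
completed_records = list(dataset.records(completed_filter))

print(f"{len(completed_records)} records completed so far")
```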

## Configure task distribution settings

By default, Argilla will set the required minimum submitted responses to 1. This means that whenever a record has at least 1 response with the status `submitted`, the record's status becomes `completed` and it is removed from the `Pending` queue of all other team members.

!!! tip
Leave the default value of minimum submissions (1) if you are working on your own or when you don't require more than one submitted response per record.

If you wish to set a different number, you can do so through the `distribution` setting in your dataset settings:

```python
settings = rg.Settings(
guidelines="These are some guidelines.",
fields=[
rg.TextField(
name="text",
),
],
questions=[
rg.LabelQuestion(
name="label",
labels=["label_1", "label_2", "label_3"]
),
],
distribution=rg.TaskDistribution(min_submitted=3)
)
```

> Learn more about configuring dataset settings in the [Dataset management guide](../how_to_guides/dataset.md).

!!! tip
Increase the minimum number of submissions if you’d like to ensure you get more than one submitted response per record. Make sure that this number is never higher than the number of members in your team. Note that the lower this number is, the faster the task will be completed.
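For example, with 500 records and `min_submitted=3`, the team as a whole needs 1,500 submitted responses before every record reaches the `completed` status.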

!!! note
Note that some records may have more responses than expected if multiple team members submit responses on the same record simultaneously.

## Change task distribution settings

If you wish to change the minimum submitted responses required in a dataset, you can do so as long as the annotation hasn’t started, i.e., the dataset has no responses for any records.

Admins and owners can change this value from the dataset settings page in the UI or from the SDK:

```python
dataset = client.datasets(...)

dataset.settings.distribution.min_submitted = 4

dataset.update()
```
16 changes: 16 additions & 0 deletions argilla/docs/how_to_guides/index.md
@@ -59,6 +59,22 @@ These guides provide step-by-step instructions for common scenarios, including d

[:octicons-arrow-right-24: How-to guide](import_export.md)

- __Annotate a dataset__

---

Learn how to use the Argilla UI to navigate datasets and submit responses.

[:octicons-arrow-right-24: How-to guide](annotate.md)

- __Distribute the annotation__

---

Learn how to use Argilla's automatic task distribution to annotate as a team efficiently.

[:octicons-arrow-right-24: How-to guide](distribution.md)

</div>

## Advanced
9 changes: 7 additions & 2 deletions argilla/docs/how_to_guides/query.md
@@ -122,7 +122,7 @@ You can use the `Filter` class to define the conditions and pass them to the `Da

## Filter by status

You can filter records based on their status. The status can be `pending`, `draft`, `submitted`, or `discarded`.
You can filter records based on record or response status. Record status can be `pending` or `completed` and response status can be `pending`, `draft`, `submitted`, or `discarded`.

```python
import argilla as rg
@@ -134,7 +134,12 @@ workspace = client.workspaces("my_workspace")
dataset = client.datasets(name="my_dataset", workspace=workspace)

status_filter = rg.Query(
filter=rg.Filter(("response.status", "==", "submitted"))
filter=rg.Filter(
[
("status", "==", "completed"),
("response.status", "==", "discarded")
]
)
)

filtered_records = list(dataset.records(status_filter))
1 change: 1 addition & 0 deletions argilla/docs/reference/argilla/SUMMARY.md
@@ -8,6 +8,7 @@
* [Questions](settings/questions.md)
* [Metadata](settings/metadata_property.md)
* [Vectors](settings/vectors.md)
* [Distribution](settings/task_distribution.md)
* [rg.Record](records/records.md)
* [rg.Response](records/responses.md)
* [rg.Suggestion](records/suggestions.md)
2 changes: 1 addition & 1 deletion argilla/docs/reference/argilla/settings/settings.md
@@ -30,7 +30,7 @@ dataset.create()

```

> To define the settings for fields, questions, metadata, or vectors, refer to the [`rg.TextField`](fields.md), [`rg.LabelQuestion`](questions.md), [`rg.TermsMetadataProperty`](metadata_property.md), and [`rg.VectorField`](vectors.md) class documentation.
> To define the settings for fields, questions, metadata, vectors, or distribution, refer to the [`rg.TextField`](fields.md), [`rg.LabelQuestion`](questions.md), [`rg.TermsMetadataProperty`](metadata_property.md), [`rg.VectorField`](vectors.md), and [`rg.TaskDistribution`](task_distribution.md) class documentation.
---

42 changes: 42 additions & 0 deletions argilla/docs/reference/argilla/settings/task_distribution.md
@@ -0,0 +1,42 @@
---
hide: footer
---
# Distribution

Distribution settings are used to define the criteria used by the tool to automatically manage records in the dataset depending on the expected number of submitted responses per record.

## Usage Examples

The default minimum submitted responses per record is 1. If you wish to increase this value, you can define it through the `TaskDistribution` class and pass it to the `Settings` class.

```python
settings = rg.Settings(
guidelines="These are some guidelines.",
fields=[
rg.TextField(
name="text",
),
],
questions=[
rg.LabelQuestion(
name="label",
labels=["label_1", "label_2", "label_3"]
),
],
distribution=rg.TaskDistribution(min_submitted=3)
)

dataset = rg.Dataset(
name="my_dataset",
settings=settings
)
```

---

## `rg.TaskDistribution`

::: src.argilla.settings._task_distribution.OverlapTaskDistribution
options:
heading_level: 3
show_root_toc_entry: false
2 changes: 1 addition & 1 deletion argilla/mkdocs.yml
@@ -142,7 +142,7 @@ nav:
- Query and filter records: how_to_guides/query.md
- Importing and exporting datasets: how_to_guides/import_export.md
- Annotate a dataset: how_to_guides/annotate.md
- Migrate your legacy datasets to Argilla V2: how_to_guides/migrate_from_legacy_datasets.md
- Distribute the annotation task: how_to_guides/distribution.md
- Advanced:
- Use Markdown to format rich content: how_to_guides/use_markdown_to_format_rich_content.md
- Migrate your legacy datasets to Argilla V2: how_to_guides/migrate_from_legacy_datasets.md
8 changes: 4 additions & 4 deletions argilla/src/argilla/_api/_datasets.py
@@ -35,7 +35,7 @@ class DatasetsAPI(ResourceAPI[DatasetModel]):

@api_error_handler
def create(self, dataset: "DatasetModel") -> "DatasetModel":
json_body = dataset.model_dump()
json_body = dataset.model_dump(exclude_unset=True)
response = self.http_client.post(
url=self.url_stub,
json=json_body,
@@ -48,13 +48,13 @@ def create(self, dataset: "DatasetModel") -> "DatasetModel":

@api_error_handler
def update(self, dataset: "DatasetModel") -> "DatasetModel":
json_body = dataset.model_dump()
json_body = dataset.model_dump(exclude_unset=True)
dataset_id = json_body["id"] # type: ignore
response = self.http_client.patch(f"{self.url_stub}/{dataset_id}", json=json_body)
response.raise_for_status()
response_json = response.json()
dataset = self._model_from_json(response_json=response_json)
self._log_message(message=f"Updated dataset {dataset.url}")
self._log_message(message=f"Updated dataset {dataset.id}")
return dataset

@api_error_handler
@@ -63,7 +63,7 @@ def get(self, dataset_id: UUID) -> "DatasetModel":
response.raise_for_status()
response_json = response.json()
dataset = self._model_from_json(response_json=response_json)
self._log_message(message=f"Got dataset {dataset.url}")
self._log_message(message=f"Got dataset {dataset.id}")
return dataset

@api_error_handler
9 changes: 5 additions & 4 deletions argilla/src/argilla/_models/_dataset.py
@@ -12,28 +12,29 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import Optional
from datetime import datetime
from uuid import UUID
from typing import Literal
from typing import Optional
from uuid import UUID

from pydantic import field_serializer, ConfigDict

from argilla._models import ResourceModel

__all__ = ["DatasetModel"]

from argilla._models._settings._task_distribution import TaskDistributionModel


class DatasetModel(ResourceModel):
name: str
status: Literal["draft", "ready"] = "draft"

guidelines: Optional[str] = None
allow_extra_metadata: bool = True # Ideally, the default value should be provided by the server

distribution: Optional[TaskDistributionModel] = None
workspace_id: Optional[UUID] = None
last_activity_at: Optional[datetime] = None
url: Optional[str] = None

model_config = ConfigDict(
validate_assignment=True,
29 changes: 29 additions & 0 deletions argilla/src/argilla/_models/_settings/_task_distribution.py
@@ -0,0 +1,29 @@
# Copyright 2024-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

__all__ = ["TaskDistributionModel", "OverlapTaskDistributionModel"]

from typing import Literal

from pydantic import BaseModel, PositiveInt, ConfigDict


class OverlapTaskDistributionModel(BaseModel):
strategy: Literal["overlap"]
min_submitted: PositiveInt

model_config = ConfigDict(validate_assignment=True)


TaskDistributionModel = OverlapTaskDistributionModel
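
As a quick illustration (not a definitive usage pattern), the `strategy` field only accepts the literal `"overlap"` and `min_submitted` must be a positive integer, so the model can be exercised directly:

```python
from argilla._models._settings._task_distribution import OverlapTaskDistributionModel

# Valid: strategy is "overlap" and min_submitted is a positive integer.
model = OverlapTaskDistributionModel(strategy="overlap", min_submitted=2)
print(model.model_dump())  # {'strategy': 'overlap', 'min_submitted': 2}

# Invalid values, e.g. min_submitted=0, raise a pydantic ValidationError.
```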