[FEATURE] argilla: add support to distribution #5187

Merged
merged 60 commits into from
Jul 19, 2024
Changes from 58 commits
f62d58a
feat: add dataset support to be created using distribution settings (…
jfcalvo Jul 1, 2024
017001f
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 1, 2024
f084ab7
✨ Remove unused method
damianpumar Jul 4, 2024
c8ef4c6
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 4, 2024
6df5256
feat: improve Records `responses_submitted` relationship to be view o…
jfcalvo Jul 4, 2024
dbae135
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 4, 2024
cf3408c
feat: change metrics to support new distribution task logic (#5140)
jfcalvo Jul 4, 2024
8e8b116
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
frascuchon Jul 5, 2024
808c837
[ENHANCEMENT]: `argilla-server`: allow update distribution for non an…
frascuchon Jul 8, 2024
3d74a33
chore: Add status field to record model
frascuchon Jul 9, 2024
7b7d2f5
feat: Add read-only property 'status' to the record resource
frascuchon Jul 9, 2024
736bfc9
tests: Update tests to reflect the status property
frascuchon Jul 9, 2024
f241e41
fix: wrong filter naming after merge from develop
frascuchon Jul 9, 2024
307b38c
Merge branch 'feat/add-dataset-automatic-task-distribution' into feat…
frascuchon Jul 9, 2024
67d4ee3
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 9, 2024
9b84dcf
chore: Remove message match (depends on python version
frascuchon Jul 9, 2024
08e5757
chore: Add task distribution model
frascuchon Jul 9, 2024
443b9d0
feat: Add support to task distribution
frascuchon Jul 9, 2024
303361a
tests: Update tests with task distribution
frascuchon Jul 9, 2024
43ba10f
chore: Use main TaskDistribution naning
frascuchon Jul 9, 2024
d6c186b
ci: Using feat branch docker image
frascuchon Jul 9, 2024
3e06890
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 9, 2024
aba06c7
Update argilla/src/argilla/_models/_dataset.py
frascuchon Jul 10, 2024
f2238e6
chore: Apply format suggestions
frascuchon Jul 10, 2024
b73004a
Merge branch 'feat/argilla/add-record-status-property' into feat/argi…
frascuchon Jul 10, 2024
2ea0a3e
chore: Export distribution in dataset
frascuchon Jul 10, 2024
b15de8f
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
frascuchon Jul 11, 2024
f497140
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 11, 2024
bec0b0d
feat: add session helper with serializable isolation level (#5165)
jfcalvo Jul 12, 2024
8bf8abb
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 12, 2024
85e847f
[REFACTOR] `argilla-server`: remove deprecated records endpoint (#5206)
frascuchon Jul 12, 2024
1041487
chore: Add task distribution setter for dataset
frascuchon Jul 12, 2024
22263d8
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 12, 2024
c219764
[ENHANCEMENT] `argilla`: add record `status` property (#5184)
frascuchon Jul 12, 2024
ced0220
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 12, 2024
0c85b9d
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 15, 2024
a9375c1
[REFACTOR] cleaning list records endpoints (#5221)
frascuchon Jul 15, 2024
46f2640
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
frascuchon Jul 15, 2024
f77341e
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
frascuchon Jul 15, 2024
b456600
improvement: capture and retry database concurrent update errors (#5227)
jfcalvo Jul 16, 2024
8dd1c7e
chore: update CHANGELOG.md
jfcalvo Jul 16, 2024
4417af6
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 16, 2024
f284720
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 16, 2024
1a50c3a
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 16, 2024
ba3dc49
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
frascuchon Jul 17, 2024
08b29e0
Merge branch 'feat/add-dataset-automatic-task-distribution' into feat…
frascuchon Jul 17, 2024
20ae663
🔀 Update UI for distribution task (#5219)
leiyre Jul 18, 2024
d77e9a8
fixing tests
frascuchon Jul 18, 2024
c9b865b
chore: Add distribution check
frascuchon Jul 18, 2024
9dba7ef
chore: set tools line-height to 88 characters
jfcalvo Jul 18, 2024
103556e
Revert "chore: set tools line-height to 88 characters"
jfcalvo Jul 18, 2024
e7d4b75
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 18, 2024
128e8a0
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 18, 2024
504ff7b
[ENHANCEMENT] improve es mappings for responses (#5228)
frascuchon Jul 18, 2024
940a812
Merge branch 'feat/add-dataset-automatic-task-distribution' into feat…
frascuchon Jul 18, 2024
42b80aa
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 18, 2024
b0b1846
[Docs] task distribution (#5246)
nataliaElv Jul 19, 2024
46bf786
Merge branch 'develop' into feat/argilla/add-support-to-distribution
frascuchon Jul 19, 2024
b3bca5f
Apply suggestions from code review
frascuchon Jul 19, 2024
0665271
Merge branch 'develop' into feat/argilla/add-support-to-distribution
frascuchon Jul 19, 2024
2 changes: 1 addition & 1 deletion .github/workflows/argilla.yml
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ jobs:
build:
services:
argilla-quickstart:
image: argilla/argilla-quickstart:main
image: argilladev/argilla-quickstart:pr-5136
ports:
- 6900:6900
env:
Expand Down
26 changes: 10 additions & 16 deletions argilla/docs/how_to_guides/annotate.md
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,13 @@ If you are starting an annotation effort, all the records are initially kept in
- **Pending**: The records without a response.
- **Draft**: The records with partial responses. They can be submitted or discarded later. You can’t move them back to the pending queue.
- **Discarded**: The records may or may not have responses. They can be edited but you can’t move them back to the pending queue.
- **Submitted**: The records have been fully annotated and have already been submitted.
- **Submitted**: The records have been fully annotated and have already been submitted. You can remove them from this queue and send them to the draft or discarded queues, but never back to the pending queue.

!!! note
    If you are working as part of a team, the number of records in your Pending queue may change as other members of the team submit responses and those records get completed.

!!! tip
    If you are working as part of a team, the records in the draft queue that have been completed by other team members will show a check mark to indicate that there is no need to provide a response.

### Suggestions

Expand Down Expand Up @@ -115,9 +121,9 @@ The bulk view displays the records in a vertical list. Once this view is active,

### Annotation progress

The global progress of the annotation task from all users is displayed in the dataset list. This is indicated in the `Global progress` column, which shows the number of records still to be annotated, along with a progress bar. The progress bar displays the percentage and number of records submitted, conflicting (i.e., those with both submitted and discarded responses), discarded and pending by hovering your mouse over it.
You can track the progress of an annotation task in the progress bar shown in the dataset list and in the progress panel inside the dataset. This bar shows the number of records that have been completed (i.e., those that have the minimum number of submitted responses) and those left to be completed.

You can track your annotation progress in real time from the righ-bottom panel inside the dataset page. This means that, while you are annotating, the progress bar updates as you submit or discard a record. Expanding the panel, the distribution of `Pending`, `Draft`, `Submitted` and `Discarded` responses is displayed in a donut chart.
You can also track your own progress in real time by expanding the bottom-right panel inside the dataset page. There you can see the number of records for which you have `Pending`, `Draft`, `Submitted` and `Discarded` responses.
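
As a rough illustration of the two progress views described above, here is a plain-Python sketch. It is not Argilla's implementation: the data layout (a dict mapping record ids to per-user response statuses) and the function names `team_progress` and `user_progress` are assumptions made for this example.

```python
from collections import Counter
from typing import Dict

# Hypothetical layout: record id -> {user name -> response status}.
Responses = Dict[str, Dict[str, str]]


def _submitted(by_user: Dict[str, str]) -> int:
    """Number of responses on one record with the "submitted" status."""
    return sum(1 for status in by_user.values() if status == "submitted")


def team_progress(responses: Responses, min_submitted: int = 1) -> Counter:
    """Count records that are completed versus left to be completed."""
    progress = Counter(completed=0, pending=0)
    for by_user in responses.values():
        done = _submitted(by_user) >= min_submitted
        progress["completed" if done else "pending"] += 1
    return progress


def user_progress(responses: Responses, user: str, min_submitted: int = 1) -> Counter:
    """Count one user's responses by status.

    A record the user has not answered counts as pending only while it is
    still short of min_submitted submitted responses.
    """
    counts: Counter = Counter()
    for by_user in responses.values():
        if user in by_user:
            counts[by_user[user]] += 1
        elif _submitted(by_user) < min_submitted:
            counts["pending"] += 1
    return counts


responses = {
    "rec-1": {"ana": "submitted", "ben": "submitted"},
    "rec-2": {"ana": "draft"},
    "rec-3": {"ben": "discarded"},
}
print(team_progress(responses, min_submitted=2))
print(user_progress(responses, "ana", min_submitted=2))
```

With `min_submitted=2`, only `rec-1` counts as completed for the team, while `ana` has one submitted response, one draft, and one record still pending.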

## Use search, filters, and sort

Expand Down Expand Up @@ -173,16 +179,4 @@ You can sort your records according to one or several attributes.

The insertion time and last update are general to all records.

The suggestion scores, response, and suggestion values for rating questions and metadata properties are available only when they were provided.

## Annotate in teams

!!! note
Argilla 2.1 will come with automatic task distribution, which will allow you to distribute the work across several users more efficiently.

### Edit guidelines in the settings

As an `owner` or `admin`, you can edit the guidelines as much as you need from the icon settings on the header. Markdown format is enabled.

!!! tip
If you want further guidance on good practices for guidelines during the project development, check this [blog post](https://argilla.io/blog/annotation-guidelines-practices/).
The suggestion scores, response, and suggestion values for rating questions and metadata properties are available only when they were provided.
15 changes: 15 additions & 0 deletions argilla/docs/how_to_guides/dataset.md
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,7 @@ A **dataset** is a collection of records that you can configure for labelers to
vectors=[rg.VectorField(name="vector", dimensions=10)],
guidelines="guidelines",
allow_extra_metadata=True,
distribution=rg.TaskDistribution(min_submitted=2)
)
```

Expand Down Expand Up @@ -96,6 +97,7 @@ settings = rg.Settings(
guidelines="Select the sentiment of the prompt.",
fields=[rg.TextField(name="prompt", use_markdown=True)],
questions=[rg.LabelQuestion(name="sentiment", labels=["positive", "negative"])],
distribution=rg.TaskDistribution(min_submitted=3)
)

dataset1 = rg.Dataset(name="sentiment_analysis_1", settings=settings)
Expand Down Expand Up @@ -395,6 +397,19 @@ It is good practice to use at least the dataset guidelines if not both methods.
!!! tip
If you want further guidance on good practices for guidelines during the project development, check our [blog post](https://argilla.io/blog/annotation-guidelines-practices/).

### Distribution

When working as a team, you may want to distribute the annotation task to ensure efficiency and quality. You can use the `TaskDistribution` settings to configure the minimum number of submitted responses expected for each record. Argilla will use this setting to automatically handle records in your team members' pending queues.

```python
rg.TaskDistribution(
    min_submitted=2
)
```

> To learn more about how to distribute the task among team members, see the [Distribute the annotation guide](../how_to_guides/distribution.md).


## List datasets

You can list all the datasets available in a workspace using the `datasets` attribute of the `Workspace` class. You can also use `len(workspace.datasets)` to get the number of datasets in a workspace.
Expand Down
76 changes: 76 additions & 0 deletions argilla/docs/how_to_guides/distribution.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
---
description: In this section, we will provide a step-by-step guide to show how to distribute the annotation task among team members.
---

# Distribute the annotation task among the team

This guide explains how you can use Argilla’s **automatic task distribution** to efficiently divide the task of annotating a dataset among multiple team members.

Owners and admins can define the minimum number of submitted responses expected for each record, depending on whether the dataset should have annotation overlap and how much. Argilla will use this setting to automatically handle the records shown in the pending queues of all users with access to the dataset.

When a record has met the minimum number of submissions, the status of the record will change to `completed` and the record will be removed from the `Pending` queue of all team members, so they can focus on providing responses where they are most needed. The dataset’s annotation task will be fully completed once all records have the `completed` status.

![Task Distribution diagram](../assets/images/how_to_guides/distribution/taskdistribution.svg)

!!! note
    The status of a record can be either `completed`, when it has the required number of responses with `submitted` status, or `pending`, when it doesn’t meet this requirement.

    Each record can have multiple responses and each of those can have the status `submitted`, `discarded` or `draft`.
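
The status rule in the note above can be sketched in plain Python. This is an illustrative model only, not Argilla's server code; the function names `record_status` and `dataset_is_completed` are hypothetical.

```python
from typing import Iterable, List


def record_status(response_statuses: Iterable[str], min_submitted: int = 1) -> str:
    """Return "completed" once enough responses are submitted, else "pending"."""
    submitted = sum(1 for status in response_statuses if status == "submitted")
    return "completed" if submitted >= min_submitted else "pending"


def dataset_is_completed(records: List[List[str]], min_submitted: int = 1) -> bool:
    """The task is done when every record has reached the "completed" status."""
    return all(record_status(r, min_submitted) == "completed" for r in records)


# Drafts and discarded responses do not count towards min_submitted.
print(record_status(["submitted", "draft", "submitted"], min_submitted=2))  # completed
print(record_status(["submitted", "discarded"], min_submitted=2))  # pending
```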

!!! info "Main Class"

    ```python
    rg.TaskDistribution(
        min_submitted=2
    )
    ```
    > Check the [Task Distribution - Python Reference](../reference/argilla/settings/task_distribution.md) to see the attributes, arguments, and methods of the `TaskDistribution` class in detail.

## Configure task distribution settings

By default, Argilla will set the required minimum number of submitted responses to 1. This means that as soon as a record has at least 1 response with the status `submitted`, the record's status will change to `completed` and the record will be removed from the `Pending` queue of other team members.

!!! tip
    Leave the default value of minimum submissions (1) if you are working on your own or when you don't require more than one submitted response per record.

If you wish to set a different number, you can do so through the `distribution` setting in your dataset settings:

```python
settings = rg.Settings(
    guidelines="These are some guidelines.",
    fields=[
        rg.TextField(
            name="text",
        ),
    ],
    questions=[
        rg.LabelQuestion(
            name="label",
            labels=["label_1", "label_2", "label_3"]
        ),
    ],
    distribution=rg.TaskDistribution(min_submitted=3)
)
```

> Learn more about configuring dataset settings in the [Dataset management guide](../how_to_guides/dataset.md).

!!! tip
    Increase the minimum number of submissions if you’d like to ensure you get more than one submitted response per record. Make sure that this number is never higher than the number of members in your team. Note that the lower this number is, the faster the task will be completed.

!!! note
    Some records may have more responses than expected if multiple team members submit responses on the same record simultaneously.
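
The team-size advice in the tip can be checked before creating the dataset. The helper below is a hypothetical convenience written for this guide, not part of the Argilla SDK, and nothing here implies that Argilla enforces the team-size limit itself.

```python
def check_min_submitted(min_submitted: int, team_size: int) -> int:
    """Return min_submitted if it is usable for a team of the given size.

    Hypothetical pre-flight check: a record can never be completed if it
    needs more submitted responses than there are annotators.
    """
    if min_submitted < 1:
        raise ValueError("min_submitted must be at least 1")
    if min_submitted > team_size:
        raise ValueError(
            f"min_submitted={min_submitted} exceeds the team size ({team_size})"
        )
    return min_submitted


# Three required responses are fine for a team of five annotators.
min_submitted = check_min_submitted(3, team_size=5)
```

The validated value can then be passed on as `rg.TaskDistribution(min_submitted=min_submitted)`.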

## Change task distribution settings

If you wish to change the minimum number of submitted responses required in a dataset, you can do so as long as the annotation hasn’t started, i.e., the dataset has no responses for any records.

Admins and owners can change this value from the dataset settings page in the UI or from the SDK:

```python
dataset = client.datasets(...)

dataset.settings.distribution.min_submitted = 4

dataset.update()
```
16 changes: 16 additions & 0 deletions argilla/docs/how_to_guides/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -59,6 +59,22 @@ These guides provide step-by-step instructions for common scenarios, including d

[:octicons-arrow-right-24: How-to guide](import_export.md)

- __Annotate a dataset__

---

Learn how to use the Argilla UI to navigate datasets and submit responses.

[:octicons-arrow-right-24: How-to guide](annotate.md)

- __Distribute the annotation__

---

Learn how to use Argilla's automatic task distribution to annotate as a team efficiently.

[:octicons-arrow-right-24: How-to guide](distribution.md)

</div>

## Advanced
Expand Down
9 changes: 7 additions & 2 deletions argilla/docs/how_to_guides/query.md
Original file line number Diff line number Diff line change
Expand Up @@ -122,7 +122,7 @@ You can use the `Filter` class to define the conditions and pass them to the `Da

## Filter by status

You can filter records based on their status. The status can be `pending`, `draft`, `submitted`, or `discarded`.
You can filter records based on record or response status. Record status can be `pending` or `completed` and response status can be `pending`, `draft`, `submitted`, or `discarded`.

```python
import argilla as rg
Expand All @@ -134,7 +134,12 @@ workspace = client.workspaces("my_workspace")
dataset = client.datasets(name="my_dataset", workspace=workspace)

status_filter = rg.Query(
filter=rg.Filter(("response.status", "==", "submitted"))
filter=rg.Filter(
[
("status", "==", "completed"),
("response.status", "==", "discarded")
]
)
)

filtered_records = list(dataset.records(status_filter))
Expand Down
1 change: 1 addition & 0 deletions argilla/docs/reference/argilla/SUMMARY.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@
* [Questions](settings/questions.md)
* [Metadata](settings/metadata_property.md)
* [Vectors](settings/vectors.md)
* [Distribution](settings/task_distribution.md)
* [rg.Record](records/records.md)
* [rg.Response](records/responses.md)
* [rg.Suggestion](records/suggestions.md)
Expand Down
2 changes: 1 addition & 1 deletion argilla/docs/reference/argilla/settings/settings.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ dataset.create()

```

> To define the settings for fields, questions, metadata, or vectors, refer to the [`rg.TextField`](fields.md), [`rg.LabelQuestion`](questions.md), [`rg.TermsMetadataProperty`](metadata_property.md), and [`rg.VectorField`](vectors.md) class documentation.
> To define the settings for fields, questions, metadata, vectors, or distribution, refer to the [`rg.TextField`](fields.md), [`rg.LabelQuestion`](questions.md), [`rg.TermsMetadataProperty`](metadata_property.md), [`rg.VectorField`](vectors.md), and [`rg.TaskDistribution`](task_distribution.md) class documentation.

---

Expand Down
42 changes: 42 additions & 0 deletions argilla/docs/reference/argilla/settings/task_distribution.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
---
hide: footer
---
# Distribution

Distribution settings are used to define the criteria used by the tool to automatically manage records in the dataset depending on the expected number of submitted responses per record.

## Usage Examples

The default minimum submitted responses per record is 1. If you wish to increase this value, you can define it through the `TaskDistribution` class and pass it to the `Settings` class.

```python
settings = rg.Settings(
    guidelines="These are some guidelines.",
    fields=[
        rg.TextField(
            name="text",
        ),
    ],
    questions=[
        rg.LabelQuestion(
            name="label",
            labels=["label_1", "label_2", "label_3"]
        ),
    ],
    distribution=rg.TaskDistribution(min_submitted=3)
)

dataset = rg.Dataset(
    name="my_dataset",
    settings=settings
)
```

---

## `rg.TaskDistribution`

::: src.argilla.settings._task_distribution.OverlapTaskDistribution
    options:
        heading_level: 3
        show_root_toc_entry: false
2 changes: 1 addition & 1 deletion argilla/mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -142,7 +142,7 @@ nav:
- Query and filter records: how_to_guides/query.md
- Importing and exporting datasets: how_to_guides/import_export.md
- Annotate a dataset: how_to_guides/annotate.md
- Migrate your legacy datasets to Argilla V2: how_to_guides/migrate_from_legacy_datasets.md
- Distribute the annotation task: how_to_guides/distribution.md
- Advanced:
- Use Markdown to format rich content: how_to_guides/use_markdown_to_format_rich_content.md
- Migrate your legacy datasets to Argilla V2: how_to_guides/migrate_from_legacy_datasets.md
Expand Down
8 changes: 4 additions & 4 deletions argilla/src/argilla/_api/_datasets.py
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,7 @@ class DatasetsAPI(ResourceAPI[DatasetModel]):

@api_error_handler
def create(self, dataset: "DatasetModel") -> "DatasetModel":
json_body = dataset.model_dump()
json_body = dataset.model_dump(exclude_unset=True)
response = self.http_client.post(
url=self.url_stub,
json=json_body,
Expand All @@ -48,13 +48,13 @@ def create(self, dataset: "DatasetModel") -> "DatasetModel":

@api_error_handler
def update(self, dataset: "DatasetModel") -> "DatasetModel":
json_body = dataset.model_dump()
json_body = dataset.model_dump(exclude_unset=True)
dataset_id = json_body["id"] # type: ignore
response = self.http_client.patch(f"{self.url_stub}/{dataset_id}", json=json_body)
response.raise_for_status()
response_json = response.json()
dataset = self._model_from_json(response_json=response_json)
self._log_message(message=f"Updated dataset {dataset.url}")
self._log_message(message=f"Updated dataset {dataset.id}")
return dataset

@api_error_handler
Expand All @@ -63,7 +63,7 @@ def get(self, dataset_id: UUID) -> "DatasetModel":
response.raise_for_status()
response_json = response.json()
dataset = self._model_from_json(response_json=response_json)
self._log_message(message=f"Got dataset {dataset.url}")
self._log_message(message=f"Got dataset {dataset.id}")
return dataset

@api_error_handler
Expand Down
9 changes: 5 additions & 4 deletions argilla/src/argilla/_models/_dataset.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,28 +12,29 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import Optional
from datetime import datetime
from uuid import UUID
from typing import Literal
from typing import Optional
from uuid import UUID

from pydantic import field_serializer, ConfigDict

from argilla._models import ResourceModel

__all__ = ["DatasetModel"]

from argilla._models._settings._task_distribution import TaskDistributionModel


class DatasetModel(ResourceModel):
name: str
status: Literal["draft", "ready"] = "draft"

guidelines: Optional[str] = None
allow_extra_metadata: bool = True # Ideally, the default value should be provided by the server

distribution: Optional[TaskDistributionModel] = None
workspace_id: Optional[UUID] = None
last_activity_at: Optional[datetime] = None
url: Optional[str] = None

model_config = ConfigDict(
validate_assignment=True,
Expand Down
29 changes: 29 additions & 0 deletions argilla/src/argilla/_models/_settings/_task_distribution.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
# Copyright 2024-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

__all__ = ["TaskDistributionModel", "OverlapTaskDistributionModel"]

from typing import Literal

from pydantic import BaseModel, PositiveInt, ConfigDict


class OverlapTaskDistributionModel(BaseModel):
    strategy: Literal["overlap"]
    min_submitted: PositiveInt

    model_config = ConfigDict(validate_assignment=True)


TaskDistributionModel = OverlapTaskDistributionModel
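
To make the constraints declared by this model concrete, here is a standard-library approximation. It is a sketch under stated assumptions, not the pydantic model itself: the class name `OverlapDistributionSketch` is hypothetical, and the checks are stricter than pydantic's because no type coercion is attempted.

```python
from dataclasses import dataclass


@dataclass
class OverlapDistributionSketch:
    """Stdlib stand-in for the checks the pydantic model declares."""

    strategy: str
    min_submitted: int

    def __setattr__(self, name: str, value: object) -> None:
        # Runs during construction and on later assignment, loosely
        # mirroring validate_assignment=True.
        if name == "strategy" and value != "overlap":
            raise ValueError("strategy must be 'overlap'")
        if name == "min_submitted":
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value < 1:
                raise ValueError("min_submitted must be a positive integer")
        super().__setattr__(name, value)


distribution = OverlapDistributionSketch(strategy="overlap", min_submitted=2)
distribution.min_submitted = 4  # accepted: still a positive integer
```

Assigning `0`, a negative number, or any strategy other than `"overlap"` raises `ValueError` in this sketch, both at construction and on later assignment.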