[FEATURE] argilla: add support to distribution (#5187)
# Description

This PR adds support for configuring the task distribution strategy when creating or updating datasets.

We can create datasets with a specific task distribution setup:
```python
from argilla import Dataset, LabelQuestion, Settings, TaskDistribution, TextField

dataset_name = "my_dataset"

task_distribution = TaskDistribution(min_submitted=4)

settings = Settings(
    fields=[TextField(name="text", title="text")],
    questions=[LabelQuestion(name="label", title="text", labels=["positive", "negative"])],
    distribution=task_distribution,
)
dataset = Dataset(dataset_name, settings=settings).create()
```
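
With the overlap-based strategy (currently the only one available), `min_submitted=4` means each record needs four submitted responses before its status changes to `completed` and it leaves everyone's pending queue.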

or update an existing dataset (as long as it has no user responses yet):
```python
dataset = client.datasets(...)

dataset.settings.distribution.min_submitted = 100
# or 
dataset.distribution.min_submitted = 100
# or 
dataset.distribution = TaskDistribution(min_submitted=100)
dataset.update()
```


Closes #5033
Closes #5034


Refs: #5246


**Type of change**

- New feature (non-breaking change which adds functionality)
- Improvement (change adding some improvement to an existing
functionality)

**How Has This Been Tested**

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature
works
- I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)

---------

Co-authored-by: José Francisco Calvo <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Damián Pumar <[email protected]>
Co-authored-by: José Francisco Calvo <[email protected]>
Co-authored-by: Leire <[email protected]>
Co-authored-by: David Berenstein <[email protected]>
Co-authored-by: Natalia Elvira <[email protected]>
Co-authored-by: Sara Han <[email protected]>
9 people authored Jul 19, 2024
1 parent 4237e68 commit e640924
Showing 21 changed files with 391 additions and 49 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/argilla.yml
@@ -21,7 +21,7 @@ jobs:
build:
services:
argilla-quickstart:
image: argilla/argilla-quickstart:main
image: argilladev/argilla-quickstart:develop
ports:
- 6900:6900
env:
26 changes: 10 additions & 16 deletions argilla/docs/how_to_guides/annotate.md
@@ -72,7 +72,13 @@ If you are starting an annotation effort, all the records are initially kept in
- **Pending**: The records without a response.
- **Draft**: The records with partial responses. They can be submitted or discarded later. You can’t move them back to the pending queue.
- **Discarded**: The records may or may not have responses. They can be edited but you can’t move them back to the pending queue.
- **Submitted**: The records have been fully annotated and have already been submitted.
- **Submitted**: The records have been fully annotated and have already been submitted. You can remove them from this queue and send them to the draft or discarded queues, but never back to the pending queue.

!!! note
If you are working as part of a team, the number of records in your Pending queue may change as other members of the team submit responses and those records get completed.

!!! tip
If you are working as part of a team, the records in the draft queue that have been completed by other team members will show a check mark to indicate that there is no need to provide a response.

### Suggestions

@@ -115,9 +121,9 @@ The bulk view displays the records in a vertical list. Once this view is active,

### Annotation progress

The global progress of the annotation task from all users is displayed in the dataset list. This is indicated in the `Global progress` column, which shows the number of records still to be annotated, along with a progress bar. The progress bar displays the percentage and number of records submitted, conflicting (i.e., those with both submitted and discarded responses), discarded and pending by hovering your mouse over it.
You can track the progress of an annotation task in the progress bar shown in the dataset list and in the progress panel inside the dataset. This bar shows the number of records that have been completed (i.e., those that have the minimum number of submitted responses) and those left to be completed.

You can track your annotation progress in real time from the right-bottom panel inside the dataset page. This means that, while you are annotating, the progress bar updates as you submit or discard a record. Expanding the panel, the distribution of `Pending`, `Draft`, `Submitted` and `Discarded` responses is displayed in a donut chart.
You can also track your own progress in real time by expanding the right-bottom panel inside the dataset page. There you can see the number of records for which you have `Pending`, `Draft`, `Submitted` and `Discarded` responses.

## Use search, filters, and sort

@@ -173,16 +179,4 @@ You can sort your records according to one or several attributes.

The insertion time and last update are general to all records.

The suggestion scores, response, and suggestion values for rating questions and metadata properties are available only when they were provided.

## Annotate in teams

!!! note
Argilla 2.1 will come with automatic task distribution, which will allow you to distribute the work across several users more efficiently.

### Edit guidelines in the settings

As an `owner` or `admin`, you can edit the guidelines as much as you need from the icon settings on the header. Markdown format is enabled.

!!! tip
If you want further guidance on good practices for guidelines during the project development, check this [blog post](https://argilla.io/blog/annotation-guidelines-practices/).
The suggestion scores, response, and suggestion values for rating questions and metadata properties are available only when they were provided.
15 changes: 15 additions & 0 deletions argilla/docs/how_to_guides/dataset.md
@@ -42,6 +42,7 @@ A **dataset** is a collection of records that you can configure for labelers to
vectors=[rg.VectorField(name="vector", dimensions=10)],
guidelines="guidelines",
allow_extra_metadata=True,
    distribution=rg.TaskDistribution(min_submitted=2)
)
```

@@ -96,6 +97,7 @@ settings = rg.Settings(
guidelines="Select the sentiment of the prompt.",
fields=[rg.TextField(name="prompt", use_markdown=True)],
questions=[rg.LabelQuestion(name="sentiment", labels=["positive", "negative"])],
distribution=rg.TaskDistribution(min_submitted=3)
)

dataset1 = rg.Dataset(name="sentiment_analysis_1", settings=settings)
@@ -395,6 +397,19 @@ It is good practice to use at least the dataset guidelines if not both methods.
!!! tip
If you want further guidance on good practices for guidelines during the project development, check our [blog post](https://argilla.io/blog/annotation-guidelines-practices/).

### Distribution

When working as a team, you may want to distribute the annotation task to ensure efficiency and quality. You can use the `TaskDistribution` settings to configure the minimum number of submitted responses expected for each record. Argilla will use this setting to automatically handle records in your team members' pending queues.

```python
rg.TaskDistribution(
    min_submitted=2
)
```
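
If no `distribution` is provided, Argilla uses the default of a minimum of 1 submitted response per record.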

> To learn more about how to distribute the task among team members, check the [Distribute the annotation guide](../how_to_guides/distribution.md).

## List datasets

You can list all the datasets available in a workspace using the `datasets` attribute of the `Workspace` class. You can also use `len(workspace.datasets)` to get the number of datasets in a workspace.
76 changes: 76 additions & 0 deletions argilla/docs/how_to_guides/distribution.md
@@ -0,0 +1,76 @@
---
description: In this section, we will provide a step-by-step guide to show how to distribute the annotation task among team members.
---

# Distribute the annotation task among the team

This guide explains how you can use Argilla’s **automatic task distribution** to efficiently divide the task of annotating a dataset among multiple team members.

Owners and admins can define the minimum number of submitted responses expected for each record, depending on whether the dataset should have annotation overlap and how much. Argilla will use this setting to automatically handle the records shown in the pending queues of all users with access to the dataset.

When a record has met the minimum number of submissions, the status of the record will change to `completed` and the record will be removed from the `Pending` queue of all team members, so they can focus on providing responses where they are most needed. The dataset’s annotation task will be fully completed once all records have the `completed` status.

![Task Distribution diagram](../assets/images/how_to_guides/distribution/taskdistribution.svg)

!!! note
The status of a record can be either `completed`, when it has the required number of responses with `submitted` status, or `pending`, when it doesn’t meet this requirement.

Each record can have multiple responses and each of those can have the status `submitted`, `discarded` or `draft`.

!!! info "Main Class"

```python
rg.TaskDistribution(
    min_submitted=2
)
```
> Check the [Task Distribution - Python Reference](../reference/argilla/settings/task_distribution.md) to see the attributes, arguments, and methods of the `TaskDistribution` class in detail.
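
Once annotation is underway, a minimal sketch like the following (assuming a reachable Argilla server and the example `my_dataset` in `my_workspace`) counts how many records have already reached the `completed` status, using the same status filter available when querying records:

```python
import argilla as rg

client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")

workspace = client.workspaces("my_workspace")
dataset = client.datasets(name="my_dataset", workspace=workspace)

# Records whose status is `completed`, i.e. they already reached the minimum
# number of submitted responses defined in the distribution settings.
completed_filter = rg.Query(filter=rg.Filter(("status", "==", "completed")))
completed_records = list(dataset.records(completed_filter))

print(f"{len(completed_records)} records completed so far")
```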

## Configure task distribution settings

By default, Argilla will set the required minimum submitted responses to 1. This means that whenever a record has at least 1 response with the status `submitted`, the record's status becomes `completed` and it is removed from the `Pending` queue of all other team members.

!!! tip
Leave the default value of minimum submissions (1) if you are working on your own or when you don't require more than one submitted response per record.

If you wish to set a different number, you can do so through the `distribution` setting in your dataset settings:

```python
settings = rg.Settings(
guidelines="These are some guidelines.",
fields=[
rg.TextField(
name="text",
),
],
questions=[
rg.LabelQuestion(
name="label",
labels=["label_1", "label_2", "label_3"]
),
],
distribution=rg.TaskDistribution(min_submitted=3)
)
```

> Learn more about configuring dataset settings in the [Dataset management guide](../how_to_guides/dataset.md).

!!! tip
Increase the minimum number of submissions if you’d like to ensure you get more than one submitted response per record. Make sure that this number is never higher than the number of members in your team. Note that the lower this number is, the faster the task will be completed.
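For example, with 500 records and `min_submitted=3`, the team as a whole needs 1,500 submitted responses before every record reaches the `completed` status.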

!!! note
Note that some records may have more responses than expected if multiple team members submit responses on the same record simultaneously.

## Change task distribution settings

If you wish to change the minimum submitted responses required in a dataset, you can do so as long as the annotation hasn’t started, i.e., the dataset has no responses for any records.

Admins and owners can change this value from the dataset settings page in the UI or from the SDK:

```python
dataset = client.datasets(...)

dataset.settings.distribution.min_submitted = 4

dataset.update()
```
16 changes: 16 additions & 0 deletions argilla/docs/how_to_guides/index.md
@@ -59,6 +59,22 @@ These guides provide step-by-step instructions for common scenarios, including d

[:octicons-arrow-right-24: How-to guide](import_export.md)

- __Annotate a dataset__

---

Learn how to use the Argilla UI to navigate datasets and submit responses.

[:octicons-arrow-right-24: How-to guide](annotate.md)

- __Distribute the annotation__

---

Learn how to use Argilla's automatic task distribution to annotate as a team efficiently.

[:octicons-arrow-right-24: How-to guide](distribution.md)

</div>

## Advanced
9 changes: 7 additions & 2 deletions argilla/docs/how_to_guides/query.md
@@ -122,7 +122,7 @@ You can use the `Filter` class to define the conditions and pass them to the `Da

## Filter by status

You can filter records based on their status. The status can be `pending`, `draft`, `submitted`, or `discarded`.
You can filter records based on record or response status. Record status can be `pending` or `completed` and response status can be `pending`, `draft`, `submitted`, or `discarded`.

```python
import argilla as rg
@@ -134,7 +134,12 @@ workspace = client.workspaces("my_workspace")
dataset = client.datasets(name="my_dataset", workspace=workspace)

status_filter = rg.Query(
filter=rg.Filter(("response.status", "==", "submitted"))
filter=rg.Filter(
[
("status", "==", "completed"),
("response.status", "==", "discarded")
]
)
)

filtered_records = list(dataset.records(status_filter))
1 change: 1 addition & 0 deletions argilla/docs/reference/argilla/SUMMARY.md
@@ -8,6 +8,7 @@
* [Questions](settings/questions.md)
* [Metadata](settings/metadata_property.md)
* [Vectors](settings/vectors.md)
* [Distribution](settings/task_distribution.md)
* [rg.Record](records/records.md)
* [rg.Response](records/responses.md)
* [rg.Suggestion](records/suggestions.md)
2 changes: 1 addition & 1 deletion argilla/docs/reference/argilla/settings/settings.md
@@ -30,7 +30,7 @@ dataset.create()

```

> To define the settings for fields, questions, metadata, or vectors, refer to the [`rg.TextField`](fields.md), [`rg.LabelQuestion`](questions.md), [`rg.TermsMetadataProperty`](metadata_property.md), and [`rg.VectorField`](vectors.md) class documentation.
> To define the settings for fields, questions, metadata, vectors, or distribution, refer to the [`rg.TextField`](fields.md), [`rg.LabelQuestion`](questions.md), [`rg.TermsMetadataProperty`](metadata_property.md), [`rg.VectorField`](vectors.md), and [`rg.TaskDistribution`](task_distribution.md) class documentation.
---

42 changes: 42 additions & 0 deletions argilla/docs/reference/argilla/settings/task_distribution.md
@@ -0,0 +1,42 @@
---
hide: footer
---
# Distribution

Distribution settings are used to define the criteria used by the tool to automatically manage records in the dataset depending on the expected number of submitted responses per record.

## Usage Examples

The default minimum submitted responses per record is 1. If you wish to increase this value, you can define it through the `TaskDistribution` class and pass it to the `Settings` class.

```python
settings = rg.Settings(
guidelines="These are some guidelines.",
fields=[
rg.TextField(
name="text",
),
],
questions=[
rg.LabelQuestion(
name="label",
labels=["label_1", "label_2", "label_3"]
),
],
distribution=rg.TaskDistribution(min_submitted=3)
)

dataset = rg.Dataset(
name="my_dataset",
settings=settings
)
```

---

## `rg.TaskDistribution`

::: src.argilla.settings._task_distribution.OverlapTaskDistribution
options:
heading_level: 3
show_root_toc_entry: false
2 changes: 1 addition & 1 deletion argilla/mkdocs.yml
@@ -142,7 +142,7 @@ nav:
- Query and filter records: how_to_guides/query.md
- Importing and exporting datasets: how_to_guides/import_export.md
- Annotate a dataset: how_to_guides/annotate.md
- Migrate your legacy datasets to Argilla V2: how_to_guides/migrate_from_legacy_datasets.md
- Distribute the annotation task: how_to_guides/distribution.md
- Advanced:
- Use Markdown to format rich content: how_to_guides/use_markdown_to_format_rich_content.md
- Migrate your legacy datasets to Argilla V2: how_to_guides/migrate_from_legacy_datasets.md
8 changes: 4 additions & 4 deletions argilla/src/argilla/_api/_datasets.py
@@ -35,7 +35,7 @@ class DatasetsAPI(ResourceAPI[DatasetModel]):

@api_error_handler
def create(self, dataset: "DatasetModel") -> "DatasetModel":
json_body = dataset.model_dump()
json_body = dataset.model_dump(exclude_unset=True)
response = self.http_client.post(
url=self.url_stub,
json=json_body,
@@ -48,13 +48,13 @@ def create(self, dataset: "DatasetModel") -> "DatasetModel":

@api_error_handler
def update(self, dataset: "DatasetModel") -> "DatasetModel":
json_body = dataset.model_dump()
json_body = dataset.model_dump(exclude_unset=True)
dataset_id = json_body["id"] # type: ignore
response = self.http_client.patch(f"{self.url_stub}/{dataset_id}", json=json_body)
response.raise_for_status()
response_json = response.json()
dataset = self._model_from_json(response_json=response_json)
self._log_message(message=f"Updated dataset {dataset.url}")
self._log_message(message=f"Updated dataset {dataset.id}")
return dataset

@api_error_handler
@@ -63,7 +63,7 @@ def get(self, dataset_id: UUID) -> "DatasetModel":
response.raise_for_status()
response_json = response.json()
dataset = self._model_from_json(response_json=response_json)
self._log_message(message=f"Got dataset {dataset.url}")
self._log_message(message=f"Got dataset {dataset.id}")
return dataset

@api_error_handler
9 changes: 5 additions & 4 deletions argilla/src/argilla/_models/_dataset.py
@@ -12,28 +12,29 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import Optional
from datetime import datetime
from uuid import UUID
from typing import Literal
from typing import Optional
from uuid import UUID

from pydantic import field_serializer, ConfigDict

from argilla._models import ResourceModel

__all__ = ["DatasetModel"]

from argilla._models._settings._task_distribution import TaskDistributionModel


class DatasetModel(ResourceModel):
name: str
status: Literal["draft", "ready"] = "draft"

guidelines: Optional[str] = None
allow_extra_metadata: bool = True # Ideally, the default value should be provided by the server

distribution: Optional[TaskDistributionModel] = None
workspace_id: Optional[UUID] = None
last_activity_at: Optional[datetime] = None
url: Optional[str] = None

model_config = ConfigDict(
validate_assignment=True,
29 changes: 29 additions & 0 deletions argilla/src/argilla/_models/_settings/_task_distribution.py
@@ -0,0 +1,29 @@
# Copyright 2024-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

__all__ = ["TaskDistributionModel", "OverlapTaskDistributionModel"]

from typing import Literal

from pydantic import BaseModel, PositiveInt, ConfigDict


class OverlapTaskDistributionModel(BaseModel):
strategy: Literal["overlap"]
min_submitted: PositiveInt

model_config = ConfigDict(validate_assignment=True)


TaskDistributionModel = OverlapTaskDistributionModel
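
As a quick illustration (not a definitive usage pattern), the `strategy` field only accepts the literal `"overlap"` and `min_submitted` must be a positive integer, so the model can be exercised directly:

```python
from argilla._models._settings._task_distribution import OverlapTaskDistributionModel

# Valid: strategy is "overlap" and min_submitted is a positive integer.
model = OverlapTaskDistributionModel(strategy="overlap", min_submitted=2)
print(model.model_dump())  # {'strategy': 'overlap', 'min_submitted': 2}

# Invalid values, e.g. min_submitted=0, raise a pydantic ValidationError.
```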