[FEATURE] argilla: add support to distribution #5187

Merged: 60 commits, Jul 19, 2024

Changes from 1 commit

Commits
f62d58a
feat: add dataset support to be created using distribution settings (…
jfcalvo Jul 1, 2024
017001f
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 1, 2024
f084ab7
✨ Remove unused method
damianpumar Jul 4, 2024
c8ef4c6
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 4, 2024
6df5256
feat: improve Records `responses_submitted` relationship to be view o…
jfcalvo Jul 4, 2024
dbae135
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 4, 2024
cf3408c
feat: change metrics to support new distribution task logic (#5140)
jfcalvo Jul 4, 2024
8e8b116
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
frascuchon Jul 5, 2024
808c837
[ENHANCEMENT]: `argilla-server`: allow update distribution for non an…
frascuchon Jul 8, 2024
3d74a33
chore: Add status field to record model
frascuchon Jul 9, 2024
7b7d2f5
feat: Add read-only property 'status' to the record resource
frascuchon Jul 9, 2024
736bfc9
tests: Update tests to reflect the status property
frascuchon Jul 9, 2024
f241e41
fix: wrong filter naming after merge from develop
frascuchon Jul 9, 2024
307b38c
Merge branch 'feat/add-dataset-automatic-task-distribution' into feat…
frascuchon Jul 9, 2024
67d4ee3
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 9, 2024
9b84dcf
chore: Remove message match (depends on python version
frascuchon Jul 9, 2024
08e5757
chore: Add task distribution model
frascuchon Jul 9, 2024
443b9d0
feat: Add support to task distribution
frascuchon Jul 9, 2024
303361a
tests: Update tests with task distribution
frascuchon Jul 9, 2024
43ba10f
chore: Use main TaskDistribution naming
frascuchon Jul 9, 2024
d6c186b
ci: Using feat branch docker image
frascuchon Jul 9, 2024
3e06890
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 9, 2024
aba06c7
Update argilla/src/argilla/_models/_dataset.py
frascuchon Jul 10, 2024
f2238e6
chore: Apply format suggestions
frascuchon Jul 10, 2024
b73004a
Merge branch 'feat/argilla/add-record-status-property' into feat/argi…
frascuchon Jul 10, 2024
2ea0a3e
chore: Export distribution in dataset
frascuchon Jul 10, 2024
b15de8f
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
frascuchon Jul 11, 2024
f497140
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 11, 2024
bec0b0d
feat: add session helper with serializable isolation level (#5165)
jfcalvo Jul 12, 2024
8bf8abb
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 12, 2024
85e847f
[REFACTOR] `argilla-server`: remove deprecated records endpoint (#5206)
frascuchon Jul 12, 2024
1041487
chore: Add task distribution setter for dataset
frascuchon Jul 12, 2024
22263d8
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 12, 2024
c219764
[ENHANCEMENT] `argilla`: add record `status` property (#5184)
frascuchon Jul 12, 2024
ced0220
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 12, 2024
0c85b9d
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 15, 2024
a9375c1
[REFACTOR] cleaning list records endpoints (#5221)
frascuchon Jul 15, 2024
46f2640
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
frascuchon Jul 15, 2024
f77341e
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
frascuchon Jul 15, 2024
b456600
improvement: capture and retry database concurrent update errors (#5227)
jfcalvo Jul 16, 2024
8dd1c7e
chore: update CHANGELOG.md
jfcalvo Jul 16, 2024
4417af6
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 16, 2024
f284720
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 16, 2024
1a50c3a
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 16, 2024
ba3dc49
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
frascuchon Jul 17, 2024
08b29e0
Merge branch 'feat/add-dataset-automatic-task-distribution' into feat…
frascuchon Jul 17, 2024
20ae663
🔀 Update UI for distribution task (#5219)
leiyre Jul 18, 2024
d77e9a8
fixing tests
frascuchon Jul 18, 2024
c9b865b
chore: Add distribution check
frascuchon Jul 18, 2024
9dba7ef
chore: set tools line-height to 88 characters
jfcalvo Jul 18, 2024
103556e
Revert "chore: set tools line-height to 88 characters"
jfcalvo Jul 18, 2024
e7d4b75
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 18, 2024
128e8a0
Merge branch 'develop' into feat/add-dataset-automatic-task-distribution
jfcalvo Jul 18, 2024
504ff7b
[ENHANCEMENT] improve es mappings for responses (#5228)
frascuchon Jul 18, 2024
940a812
Merge branch 'feat/add-dataset-automatic-task-distribution' into feat…
frascuchon Jul 18, 2024
42b80aa
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 18, 2024
b0b1846
[Docs] task distribution (#5246)
nataliaElv Jul 19, 2024
46bf786
Merge branch 'develop' into feat/argilla/add-support-to-distribution
frascuchon Jul 19, 2024
b3bca5f
Apply suggestions from code review
frascuchon Jul 19, 2024
0665271
Merge branch 'develop' into feat/argilla/add-support-to-distribution
frascuchon Jul 19, 2024
feat: add dataset support to be created using distribution settings (#5013)

# Description

This PR is the first one related to the distribution task feature, adding the following changes:

* Added a `distribution` JSON column to the `datasets` table:
  * The column is non-nullable, so a value is always required when a dataset is created.
  * Existing datasets are backfilled with the default value `{"strategy": "overlap", "min_submitted": 1}`.
* Added a `distribution` attribute to the `DatasetCreate` schema:
  * `None` is not a valid value.
  * If no value is specified for this attribute, a `DatasetOverlapDistributionCreate` with `min_submitted` set to `1` is used.
* `DatasetOverlapDistributionCreate` only allows values greater than or equal to `1` for the `min_submitted` attribute.
* The context `create_dataset` function now receives a dictionary instead of a `DatasetCreate` schema.
* Moved dataset creation validations to a new `DatasetCreateValidator` class.

Updating the `distribution` attribute of existing datasets will be handled in a separate issue.
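
For illustration, a dataset creation request using these settings could look like the sketch below. The endpoint path, base URL, authentication header, and workspace id are assumptions for illustration, not values taken from this PR:

```python
import requests

# Illustrative sketch only: the endpoint path, base URL, API key header, and
# workspace id below are placeholders, not values from this PR.
response = requests.post(
    "http://localhost:6900/api/v1/datasets",
    headers={"X-Argilla-Api-Key": "argilla.apikey"},
    json={
        "name": "my-dataset",
        "workspace_id": "00000000-0000-0000-0000-000000000000",
        # Omitting "distribution" falls back to {"strategy": "overlap", "min_submitted": 1}.
        "distribution": {"strategy": "overlap", "min_submitted": 2},
    },
)
response.raise_for_status()
print(response.json()["distribution"])
```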

Closes #5005 

**Type of change**

(Please delete options that are not relevant. Remember to title the PR
according to the type of change)

- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to not work as expected)
- [ ] Refactor (change restructuring the codebase without changing
functionality)
- [ ] Improvement (change adding some improvement to an existing
functionality)
- [ ] Documentation update

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes. And
ideally, reference `tests`)

- [x] Adding new tests and passing old ones.
- [x] Check that migration works as expected with old datasets and
SQLite.
- [x] Check that migration works as expected with old datasets and
PostgreSQL.

**Checklist**

- [ ] I added relevant documentation
- [ ] My code follows the style guidelines of this project
- [ ] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK)
(see text above)
- [ ] I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Paco Aranda <[email protected]>
3 people authored Jul 1, 2024
commit f62d58a2f91e16eb4d02e56c4039a432070349c0
@@ -42,10 +42,8 @@ export class RecordRepository {
  constructor(private readonly axios: NuxtAxiosInstance) {}

  getRecords(criteria: RecordCriteria): Promise<BackendRecords> {
    if (criteria.isFilteringByAdvanceSearch)
      return this.getRecordsByAdvanceSearch(criteria);

    return this.getRecordsByDatasetId(criteria);
    return this.getRecordsByAdvanceSearch(criteria);
    // return this.getRecordsByDatasetId(criteria);
  }

  async getRecord(recordId: string): Promise<BackendRecord> {
@@ -264,6 +262,30 @@ export class RecordRepository {
      };
    }

    body.filters = {
      and: [
        {
          type: "terms",
          scope: {
            entity: "response",
            property: "status",
          },
          values: [status],
        },
      ],
    };

    if (status === "pending") {
      body.filters.and.push({
        type: "terms",
        scope: {
          entity: "record",
          property: "status",
        },
        values: ["pending"],
      });
    }

    if (
      isFilteringByMetadata ||
      isFilteringByResponse ||
7 changes: 6 additions & 1 deletion argilla-server/CHANGELOG.md
@@ -16,12 +16,17 @@ These are the section headers that we use:

## [Unreleased]()

## [2.0.0rc1](https://github.com/argilla-io/argilla/compare/v1.29.0...v2.0.0rc1)
### Added

- Added support to specify `distribution` attribute when creating a dataset. ([#5013](https://github.com/argilla-io/argilla/pull/5013))
- Added support to change `distribution` attribute when updating a dataset. ([#5028](https://github.com/argilla-io/argilla/pull/5028))

### Changed

- Change `responses` table to delete rows on cascade when a user is deleted. ([#5126](https://github.com/argilla-io/argilla/pull/5126))

## [2.0.0rc1](https://github.com/argilla-io/argilla/compare/v1.29.0...v2.0.0rc1)

### Removed

- Removed all API v0 endpoints. ([#4852](https://github.com/argilla-io/argilla/pull/4852))
@@ -0,0 +1,60 @@
# Copyright 2021-present, the Recognai S.L. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""add status column to records table

Revision ID: 237f7c674d74
Revises: 45a12f74448b
Create Date: 2024-06-18 17:59:36.992165

"""

from alembic import op
import sqlalchemy as sa


# revision identifiers, used by Alembic.
revision = "237f7c674d74"
down_revision = "45a12f74448b"
branch_labels = None
depends_on = None


record_status_enum = sa.Enum("pending", "completed", name="record_status_enum")


def upgrade() -> None:
    record_status_enum.create(op.get_bind())

    op.add_column("records", sa.Column("status", record_status_enum, server_default="pending", nullable=False))
    op.create_index(op.f("ix_records_status"), "records", ["status"], unique=False)

    # NOTE: Updating existent records to have "completed" status when they have
    # at least one response with "submitted" status.
    op.execute("""
        UPDATE records
        SET status = 'completed'
        WHERE id IN (
            SELECT DISTINCT record_id
            FROM responses
            WHERE status = 'submitted'
        );
    """)


def downgrade() -> None:
    op.drop_index(op.f("ix_records_status"), table_name="records")
    op.drop_column("records", "status")

    record_status_enum.drop(op.get_bind())
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

"""add record metadata column
"""add metadata column to records table

Revision ID: 3ff6484f8b37
Revises: ae5522b4c674
@@ -31,12 +31,8 @@


def upgrade() -> None:
    # ### commands auto generated by Alembic - please adjust! ###
    op.add_column("records", sa.Column("metadata", sa.JSON(), nullable=True))
    # ### end Alembic commands ###


def downgrade() -> None:
    # ### commands auto generated by Alembic - please adjust! ###
    op.drop_column("records", "metadata")
    # ### end Alembic commands ###
@@ -0,0 +1,45 @@
# Copyright 2021-present, the Recognai S.L. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""add distribution column to datasets table

Revision ID: 45a12f74448b
Revises: d00f819ccc67
Create Date: 2024-06-13 11:23:43.395093

"""

import json

import sqlalchemy as sa
from alembic import op

# revision identifiers, used by Alembic.
revision = "45a12f74448b"
down_revision = "d00f819ccc67"
branch_labels = None
depends_on = None

DISTRIBUTION_VALUE = json.dumps({"strategy": "overlap", "min_submitted": 1})


def upgrade() -> None:
    op.add_column("datasets", sa.Column("distribution", sa.JSON(), nullable=True))
    op.execute(f"UPDATE datasets SET distribution = '{DISTRIBUTION_VALUE}'")
    with op.batch_alter_table("datasets") as batch_op:
        batch_op.alter_column("distribution", nullable=False)


def downgrade() -> None:
    op.drop_column("datasets", "distribution")
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

"""add allow_extra_metadata column to dataset table
"""add allow_extra_metadata column to datasets table

Revision ID: b8458008b60e
Revises: 7cbcccf8b57a
@@ -31,14 +31,10 @@


def upgrade() -> None:
    # ### commands auto generated by Alembic - please adjust! ###
    op.add_column(
        "datasets", sa.Column("allow_extra_metadata", sa.Boolean(), server_default=sa.text("true"), nullable=False)
    )
    # ### end Alembic commands ###


def downgrade() -> None:
    # ### commands auto generated by Alembic - please adjust! ###
    op.drop_column("datasets", "allow_extra_metadata")
    # ### end Alembic commands ###
@@ -189,7 +189,7 @@ async def create_dataset(
):
    await authorize(current_user, DatasetPolicy.create(dataset_create.workspace_id))

    return await datasets.create_dataset(db, dataset_create)
    return await datasets.create_dataset(db, dataset_create.dict())


@router.post("/datasets/{dataset_id}/fields", status_code=status.HTTP_201_CREATED, response_model=Field)
@@ -302,4 +302,4 @@ async def update_dataset(

    await authorize(current_user, DatasetPolicy.update(dataset))

    return await datasets.update_dataset(db, dataset, dataset_update)
    return await datasets.update_dataset(db, dataset, dataset_update.dict(exclude_unset=True))
@@ -64,7 +64,9 @@ async def update_response(
    response = await Response.get_or_raise(
        db,
        response_id,
        options=[selectinload(Response.record).selectinload(Record.dataset).selectinload(Dataset.questions)],
        options=[
            selectinload(Response.record).selectinload(Record.dataset).selectinload(Dataset.questions),
        ],
    )

    await authorize(current_user, ResponsePolicy.update(response))
@@ -83,7 +85,9 @@ async def delete_response(
    response = await Response.get_or_raise(
        db,
        response_id,
        options=[selectinload(Response.record).selectinload(Record.dataset).selectinload(Dataset.questions)],
        options=[
            selectinload(Response.record).selectinload(Record.dataset).selectinload(Dataset.questions),
        ],
    )

    await authorize(current_user, ResponsePolicy.delete(response))
38 changes: 35 additions & 3 deletions argilla-server/src/argilla_server/api/schemas/v1/datasets.py
@@ -13,11 +13,11 @@
# limitations under the License.

from datetime import datetime
from typing import List, Optional
from typing import List, Literal, Optional, Union
from uuid import UUID

from argilla_server.api.schemas.v1.commons import UpdateSchema
from argilla_server.enums import DatasetStatus
from argilla_server.enums import DatasetDistributionStrategy, DatasetStatus
from argilla_server.pydantic_v1 import BaseModel, Field, constr

try:
@@ -44,6 +44,32 @@
]


class DatasetOverlapDistribution(BaseModel):
    strategy: Literal[DatasetDistributionStrategy.overlap]
    min_submitted: int


DatasetDistribution = DatasetOverlapDistribution


class DatasetOverlapDistributionCreate(BaseModel):
    strategy: Literal[DatasetDistributionStrategy.overlap]
    min_submitted: int = Field(
        ge=1,
        description="Minimum number of submitted responses to consider a record as completed",
    )


DatasetDistributionCreate = DatasetOverlapDistributionCreate


class DatasetOverlapDistributionUpdate(DatasetDistributionCreate):
    pass


DatasetDistributionUpdate = DatasetOverlapDistributionUpdate


class RecordMetrics(BaseModel):
    count: int

@@ -74,6 +100,7 @@ class Dataset(BaseModel):
    guidelines: Optional[str]
    allow_extra_metadata: bool
    status: DatasetStatus
    distribution: DatasetDistribution
    workspace_id: UUID
    last_activity_at: datetime
    inserted_at: datetime
@@ -91,12 +118,17 @@ class DatasetCreate(BaseModel):
    name: DatasetName
    guidelines: Optional[DatasetGuidelines]
    allow_extra_metadata: bool = True
    distribution: DatasetDistributionCreate = DatasetOverlapDistributionCreate(
        strategy=DatasetDistributionStrategy.overlap,
        min_submitted=1,
    )
    workspace_id: UUID


class DatasetUpdate(UpdateSchema):
    name: Optional[DatasetName]
    guidelines: Optional[DatasetGuidelines]
    allow_extra_metadata: Optional[bool]
    distribution: Optional[DatasetDistributionUpdate]

    __non_nullable_fields__ = {"name", "allow_extra_metadata"}
    __non_nullable_fields__ = {"name", "allow_extra_metadata", "distribution"}
5 changes: 3 additions & 2 deletions argilla-server/src/argilla_server/api/schemas/v1/records.py
@@ -23,7 +23,7 @@
from argilla_server.api.schemas.v1.metadata_properties import MetadataPropertyName
from argilla_server.api.schemas.v1.responses import Response, ResponseFilterScope, UserResponseCreate
from argilla_server.api.schemas.v1.suggestions import Suggestion, SuggestionCreate, SuggestionFilterScope
from argilla_server.enums import RecordInclude, RecordSortField, SimilarityOrder, SortOrder
from argilla_server.enums import RecordInclude, RecordSortField, SimilarityOrder, SortOrder, RecordStatus
from argilla_server.pydantic_v1 import BaseModel, Field, StrictStr, root_validator, validator
from argilla_server.pydantic_v1.utils import GetterDict
from argilla_server.search_engine import TextQuery
@@ -66,6 +66,7 @@ def get(self, key: str, default: Any) -> Any:

class Record(BaseModel):
    id: UUID
    status: RecordStatus
    fields: Dict[str, Any]
    metadata: Optional[Dict[str, Any]]
    external_id: Optional[str]
@@ -196,7 +197,7 @@ def _has_relationships(self):

class RecordFilterScope(BaseModel):
    entity: Literal["record"]
    property: Union[Literal[RecordSortField.inserted_at], Literal[RecordSortField.updated_at]]
    property: Union[Literal[RecordSortField.inserted_at], Literal[RecordSortField.updated_at], Literal["status"]]


class Records(BaseModel):
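
Because `RecordFilterScope` now accepts `status` as a record property, API clients can filter for pending records the same way the frontend hunk earlier in this commit does. An illustrative filter payload, with the shape inferred from the diffs in this commit rather than from separate API documentation:

```python
# Shape inferred from RecordFilterScope above and the frontend filter
# construction earlier in this commit; illustrative only.
pending_records_filter = {
    "and": [
        {
            "type": "terms",
            "scope": {"entity": "record", "property": "status"},
            "values": ["pending"],
        }
    ]
}
```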
4 changes: 4 additions & 0 deletions argilla-server/src/argilla_server/bulk/records_bulk.py
@@ -29,6 +29,7 @@
)
from argilla_server.api.schemas.v1.responses import UserResponseCreate
from argilla_server.api.schemas.v1.suggestions import SuggestionCreate
from argilla_server.contexts import distribution
from argilla_server.contexts.accounts import fetch_users_by_ids_as_dict
from argilla_server.contexts.records import (
    fetch_records_by_external_ids_as_dict,
@@ -67,6 +68,7 @@ async def create_records_bulk(self, dataset: Dataset, bulk_create: RecordsBulkCr

        await self._upsert_records_relationships(records, bulk_create.items)
        await _preload_records_relationships_before_index(self._db, records)
        await distribution.update_records_status(self._db, records)
        await self._search_engine.index_records(dataset, records)

        await self._db.commit()
@@ -207,6 +209,7 @@ async def upsert_records_bulk(self, dataset: Dataset, bulk_upsert: RecordsBulkUp

        await self._upsert_records_relationships(records, bulk_upsert.items)
        await _preload_records_relationships_before_index(self._db, records)
        await distribution.update_records_status(self._db, records)
        await self._search_engine.index_records(dataset, records)

        await self._db.commit()
@@ -237,6 +240,7 @@ async def _preload_records_relationships_before_index(db: "AsyncSession", record
        .filter(Record.id.in_([record.id for record in records]))
        .options(
            selectinload(Record.responses).selectinload(Response.user),
            selectinload(Record.responses_submitted),
            selectinload(Record.suggestions).selectinload(Suggestion.question),
            selectinload(Record.vectors),
        )
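
The `distribution.update_records_status` calls above are where the new task distribution logic recomputes each record's `status` once responses are written. That module is not part of this commit; based on the schema defaults and the migration backfill above, a purely illustrative sketch (not the actual argilla-server implementation) might look like:

```python
# Purely illustrative sketch of an overlap-strategy status update. The real
# argilla_server.contexts.distribution module is not shown in this commit, and
# ResponseStatus is an assumption about the existing enums module.
from argilla_server.enums import RecordStatus, ResponseStatus


async def update_records_status(db, records) -> None:
    for record in records:
        min_submitted = record.dataset.distribution["min_submitted"]
        submitted = [r for r in record.responses if r.status == ResponseStatus.submitted]

        record.status = (
            RecordStatus.completed if len(submitted) >= min_submitted else RecordStatus.pending
        )

    await db.flush(records)
```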