feat: automatic distribution task #5136
Conversation
…5013)

# Description

This PR is the first one related to the distribution task feature, adding the following changes:

* Added `distribution` JSON column to `datasets` table:
  * This column is non-nullable, so a value is always required when a dataset is created.
  * By default, old datasets will have the value `{"strategy": "overlap", "min_submitted": 1}`.
* Added `distribution` attribute to `DatasetCreate` schema:
  * `None` is not a valid value.
  * If no value is specified for this attribute, `DatasetOverlapDistributionCreate` with `min_submitted` set to `1` is used.
  * `DatasetOverlapDistributionCreate` only allows values greater than or equal to `1` for the `min_submitted` attribute.
* The context `create_dataset` function now receives a dictionary instead of a `DatasetCreate` schema.
* Moved dataset creation validations to a new `DatasetCreateValidator` class.

Updating the `distribution` attribute for datasets will be done in a different issue. (A hedged sketch of the schemas described above follows after this PR description.)

Closes #5005

**Type of change**

- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Refactor (change restructuring the codebase without changing functionality)
- [ ] Improvement (change adding some improvement to an existing functionality)
- [ ] Documentation update

**How Has This Been Tested**

- [x] Adding new tests and passing old ones.
- [x] Check that migration works as expected with old datasets and SQLite.
- [x] Check that migration works as expected with old datasets and PostgreSQL.

**Checklist**

- [ ] I added relevant documentation
- [ ] I followed the style guidelines of this project
- [ ] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [ ] I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Paco Aranda <[email protected]>
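For readers skimming the diff, here is a minimal sketch of how the schemas described above might look. Only the names mentioned in the description (`DatasetCreate`, `DatasetOverlapDistributionCreate`, `min_submitted`, the `overlap` strategy, the default of `1`) come from the PR; everything else is an assumption rather than the actual Argilla code.

```python
# Hedged sketch (not the actual Argilla source): how the schemas described
# above could be modeled with Pydantic. Field names beyond those mentioned in
# the PR description are illustrative only.
from enum import Enum

from pydantic import BaseModel, Field


class DistributionStrategy(str, Enum):
    overlap = "overlap"


class DatasetOverlapDistributionCreate(BaseModel):
    strategy: DistributionStrategy = DistributionStrategy.overlap
    # Values lower than 1 are rejected, matching the validation described above.
    min_submitted: int = Field(1, ge=1)


class DatasetCreate(BaseModel):
    name: str
    # None is not accepted; when omitted, the overlap distribution with
    # min_submitted=1 is used as the default.
    distribution: DatasetOverlapDistributionCreate = Field(
        default_factory=DatasetOverlapDistributionCreate
    )
```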
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #5136 +/- ##
===========================================
- Coverage 91.54% 90.41% -1.13%
===========================================
Files 135 137 +2
Lines 5865 5749 -116
===========================================
- Hits 5369 5198 -171
- Misses 496 551 +55
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
The URL of the deployed environment for this PR is https://argilla-quickstart-pr-5136-ki24f765kq-no.a.run.app
…nly (#5148)

# Description

Add changes to the `responses_submitted` relationship to avoid problems with the existing `responses` relationship and to avoid a warning message that SQLAlchemy was reporting. Refs #5000 (A hedged sketch of this kind of relationship declaration follows after this PR description.)

**Type of change**

- Improvement (change adding some improvement to an existing functionality)

**How Has This Been Tested**

- [x] Warning is not showing anymore.
- [x] Tests are passing.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)
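For context, here is a minimal, hedged sketch (not the actual Argilla models) of how a filtered `responses_submitted` relationship can live next to a plain `responses` relationship without triggering SQLAlchemy's overlapping-relationship warning. All class and column names are assumptions.

```python
# Hedged sketch, not the Argilla source: a viewonly relationship with an
# explicit primaryjoin is one common way to avoid the overlap warning.
from sqlalchemy import ForeignKey, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship


class Base(DeclarativeBase):
    pass


class Record(Base):
    __tablename__ = "records"

    id: Mapped[int] = mapped_column(primary_key=True)
    responses: Mapped[list["Response"]] = relationship(back_populates="record")
    # Read-only filtered view over the same foreign key as `responses`.
    responses_submitted: Mapped[list["Response"]] = relationship(
        primaryjoin="and_(Response.record_id == Record.id, Response.status == 'submitted')",
        viewonly=True,
    )


class Response(Base):
    __tablename__ = "responses"

    id: Mapped[int] = mapped_column(primary_key=True)
    status: Mapped[str] = mapped_column(String(20))
    record_id: Mapped[int] = mapped_column(ForeignKey("records.id"))
    record: Mapped["Record"] = relationship(back_populates="responses")
```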
# Description

This PR adds changes to the endpoints to get the dataset progress and current user metrics in the following way:

## `GET /datasets/:dataset_id/progress`

I have changed the endpoint to support the new business logic behind the distribution task, responding with only `completed` and `pending` record types and using `total` as the sum of the two.

Old response without distribution task:

```json
{
  "total": 8,
  "submitted": 2,
  "discarded": 2,
  "conflicting": 1,
  "pending": 3
}
```

New response with the changes from this PR supporting distribution task:

* The `completed` attribute will have the count of all the records with status `completed` for the dataset.
* The `pending` attribute will have the count of all the records with status `pending` for the dataset.
* The `total` attribute will have the sum of the `completed` and `pending` attributes.

```json
{
  "total": 5,
  "completed": 2,
  "pending": 3
}
```

@damianpumar some changes are required on the frontend to support this new endpoint structure.

## `GET /me/datasets/:dataset_id/metrics`

Old response without distribution task:

```json
{
  "records": {
    "count": 7
  },
  "responses": {
    "count": 4,
    "submitted": 1,
    "discarded": 2,
    "draft": 1
  }
}
```

New response with the changes from this PR supporting distribution task:

* The `records` section has been eliminated because it is no longer necessary.
* The `responses` `count` attribute has been renamed to `total`.
* A `pending` attribute has been added to the `responses` section.

```json
{
  "responses": {
    "total": 7,
    "submitted": 1,
    "discarded": 2,
    "draft": 1,
    "pending": 3
  }
}
```

The logic behind these attributes is the following (see the hedged sketch after this PR description):

* `total` is the sum of the `submitted`, `discarded`, `draft` and `pending` attribute values.
* `submitted` is the count of all responses belonging to the current user in the specified dataset with `submitted` status.
* `discarded` is the count of all responses belonging to the current user in the specified dataset with `discarded` status.
* `draft` is the count of all responses belonging to the current user in the specified dataset with `draft` status.
* `pending` is the count of all records with `pending` status for the dataset that have no responses belonging to the current user.

@damianpumar some changes are required on the frontend to support this new endpoint structure as well.

Closes #5139

**Type of change**

- Breaking change (fix or feature that would cause existing functionality to not work as expected)

**How Has This Been Tested**

- [x] Modifying existing tests.
- [x] Running the test suite with SQLite and PostgreSQL.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Paco Aranda <[email protected]>
Co-authored-by: Damián Pumar <[email protected]>
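A minimal, hedged sketch of how the progress counts described above could be computed; this is not the code from the PR, and the `record_model` argument and its columns are assumptions.

```python
# Hedged sketch, not the actual Argilla implementation: compute the progress
# payload described above from a `records.status` column. `record_model` is
# assumed to be an ORM model with `id`, `dataset_id` and `status` columns.
from sqlalchemy import func, select
from sqlalchemy.ext.asyncio import AsyncSession


async def dataset_progress(db: AsyncSession, record_model, dataset_id) -> dict:
    result = await db.execute(
        select(record_model.status, func.count(record_model.id))
        .where(record_model.dataset_id == dataset_id)
        .group_by(record_model.status)
    )
    counts = dict(result.all())
    completed = counts.get("completed", 0)
    pending = counts.get("pending", 0)
    # `total` is just the sum of the two record statuses.
    return {"total": completed + pending, "completed": completed, "pending": pending}
```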
…notated datasets (#5171)

# Description

This PR changes the current validator so that the distribution task settings can be updated for datasets whose records have no responses at all.

cc @nataliaElv

**Type of change**

- Improvement (change adding some improvement to an existing functionality)

**How Has This Been Tested**

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)
# Description

This PR adds a new `get_serializable_async_db` helper function that returns a session using the `SERIALIZABLE` isolation level. This session can be used in handlers where we require that specific isolation level. As an example, I have added that session helper to the handler deleting responses, and PostgreSQL shows the following received queries:

```sql
2024-07-04 17:09:40.417 CEST [83566] LOG: statement: BEGIN ISOLATION LEVEL READ COMMITTED;
2024-07-04 17:09:40.418 CEST [83566] LOG: execute __asyncpg_stmt_e__: SELECT users.first_name, users.last_name, users.username, users.role, users.api_key, users.password_hash, users.id, users.inserted_at, users.updated_at FROM users WHERE users.api_key = $1::VARCHAR
2024-07-04 17:09:40.418 CEST [83566] DETAIL: parameters: $1 = 'argilla.apikey'
2024-07-04 17:09:40.422 CEST [83566] LOG: execute __asyncpg_stmt_12__: SELECT users_1.id AS users_1_id, workspaces.name AS workspaces_name, workspaces.id AS workspaces_id, workspaces.inserted_at AS workspaces_inserted_at, workspaces.updated_at AS workspaces_updated_at FROM users AS users_1 JOIN workspaces_users AS workspaces_users_1 ON users_1.id = workspaces_users_1.user_id JOIN workspaces ON workspaces.id = workspaces_users_1.workspace_id WHERE users_1.id IN ($1::UUID) ORDER BY workspaces_users_1.inserted_at ASC
2024-07-04 17:09:40.422 CEST [83566] DETAIL: parameters: $1 = 'ed2d570f-cc9f-4d53-a433-74aa7a286a52'
2024-07-04 17:09:40.426 CEST [83566] LOG: execute __asyncpg_stmt_13__: SELECT users.first_name, users.last_name, users.username, users.role, users.api_key, users.password_hash, users.id, users.inserted_at, users.updated_at FROM users WHERE users.username = $1::VARCHAR
2024-07-04 17:09:40.426 CEST [83566] DETAIL: parameters: $1 = 'argilla'
2024-07-04 17:09:40.428 CEST [83566] LOG: execute __asyncpg_stmt_12__: SELECT users_1.id AS users_1_id, workspaces.name AS workspaces_name, workspaces.id AS workspaces_id, workspaces.inserted_at AS workspaces_inserted_at, workspaces.updated_at AS workspaces_updated_at FROM users AS users_1 JOIN workspaces_users AS workspaces_users_1 ON users_1.id = workspaces_users_1.user_id JOIN workspaces ON workspaces.id = workspaces_users_1.workspace_id WHERE users_1.id IN ($1::UUID) ORDER BY workspaces_users_1.inserted_at ASC
2024-07-04 17:09:40.428 CEST [83566] DETAIL: parameters: $1 = 'ed2d570f-cc9f-4d53-a433-74aa7a286a52'
2024-07-04 17:09:40.430 CEST [83563] LOG: statement: BEGIN ISOLATION LEVEL SERIALIZABLE;
2024-07-04 17:09:40.430 CEST [83563] LOG: execute __asyncpg_stmt_14__: SELECT responses.values, responses.status, responses.record_id, responses.user_id, responses.id, responses.inserted_at, responses.updated_at FROM responses WHERE responses.id = $1::UUID
2024-07-04 17:09:40.430 CEST [83563] DETAIL: parameters: $1 = 'fdea95a0-ee9a-43ea-b093-2e13f2473c19'
2024-07-04 17:09:40.431 CEST [83566] LOG: statement: ROLLBACK;
2024-07-04 17:09:40.432 CEST [83563] LOG: statement: ROLLBACK;
```

We can clearly see that there are two nested transactions:

1. The main one to get the current user, using the default `get_async_db` helper.
2. A nested one using `get_serializable_async_db` (and setting the `SERIALIZABLE` isolation level) trying to find the response by id.

The response id used is fake, so the transaction ends there and the deletion is not done. (A hedged sketch of such a helper follows after this PR description.)

## Missing things on this PR

- [x] Fix some failing tests.
- [ ] Tests are passing but still not changing the isolation level to `SERIALIZABLE`.
- [ ] Check that this works as expected and does not affect SQLite.
- [ ] Check that this works as expected with PostgreSQL (no concurrency errors).

Closes #5155

**Type of change**

- New feature (non-breaking change which adds functionality)
- Improvement (change adding some improvement to an existing functionality)

**How Has This Been Tested**

- [x] Manually inspecting PostgreSQL logs.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)
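The following is a minimal, hedged sketch of what a `get_serializable_async_db` dependency could look like; it is not the code from this PR, and the engine URL and session factory names are assumptions.

```python
# Hedged sketch, not the actual Argilla helper: a FastAPI-style dependency
# yielding an AsyncSession whose connection runs at SERIALIZABLE isolation,
# which produces the "BEGIN ISOLATION LEVEL SERIALIZABLE" statement in the
# logs above. The database URL is a placeholder.
from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine

engine = create_async_engine("postgresql+asyncpg://user:pass@localhost/argilla")
async_session_factory = async_sessionmaker(engine, expire_on_commit=False)


async def get_serializable_async_db():
    async with async_session_factory() as session:
        # Pin the session's connection to SERIALIZABLE before any query runs.
        await session.connection(execution_options={"isolation_level": "SERIALIZABLE"})
        try:
            yield session
            await session.commit()
        except Exception:
            await session.rollback()
            raise
```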
# Description

This PR removes deprecated endpoints working with records, so that records are no longer created without a proper record status computation. The affected endpoints are:

- `POST /api/v1/datasets/:dataset_id/records`
- `PATCH /api/v1/datasets/:dataset_id/records`

**Type of change**

- Bug fix (non-breaking change which fixes an issue)
- New feature (non-breaking change which adds functionality)
- Breaking change (fix or feature that would cause existing functionality to not work as expected)
- Refactor (change restructuring the codebase without changing functionality)
- Improvement (change adding some improvement to an existing functionality)
- Documentation update

**How Has This Been Tested**

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)
# Description

This PR adds the record status as a read-only property in the `Record` resource class (a hedged sketch of the idea follows after this PR description).

Closes #5141

**Type of change**

- Improvement (change adding some improvement to an existing functionality)

**How Has This Been Tested**

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)
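A minimal, hedged sketch of the read-only property idea; this is not the actual SDK code, and the `_model` attribute is an assumption.

```python
# Hedged sketch, not the actual Record resource class: expose the
# server-computed status without allowing client-side assignment.
from types import SimpleNamespace


class Record:
    def __init__(self, fields: dict, _model=None):
        self.fields = fields
        # _model stands in for the server-side representation; assumed name.
        self._model = _model or SimpleNamespace(status="pending")

    @property
    def status(self) -> str:
        # Read-only: no setter is defined, so `record.status = ...` raises AttributeError.
        return self._model.status


record = Record(fields={"text": "hello"})
print(record.status)  # -> "pending"
```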
# Description

This PR merges all the approved PRs related to cleaning the list and search records endpoints:

- #5153
- #5156
- #5163
- #5166

**Type of change**

- Bug fix (non-breaking change which fixes an issue)
- New feature (non-breaking change which adds functionality)
- Breaking change (fix or feature that would cause existing functionality to not work as expected)
- Refactor (change restructuring the codebase without changing functionality)
- Improvement (change adding some improvement to an existing functionality)
- Documentation update

**How Has This Been Tested**

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: José Francisco Calvo <[email protected]>
# Description

After investigating timeouts for PostgreSQL, I have found that timeouts should not affect the errors raised when a SERIALIZABLE transaction is rolled back due to a concurrent update error. So the only way to support concurrent updates with PostgreSQL and SERIALIZABLE transactions is to capture those errors and retry the transaction. This PR includes the following changes (a hedged sketch of the retry pattern follows after this PR description):

* Start using the `backoff` library to retry any of the CRUD context functions updating responses and record statuses that use SERIALIZABLE database sessions.
  * This change has the side effect of working with PostgreSQL and SQLite at the same time.
  * I have set a fixed maximum time of 15 seconds for retrying with exponential backoff.
* I have moved search engine updates outside of the transaction block.
* This should mitigate errors in high-concurrency scenarios for PostgreSQL and SQLite:
  * For SQLite we have the additional setting to set a timeout if necessary.
  * I have changed the `DEFAULT_DATABASE_SQLITE_TIMEOUT` value to `5` seconds so the backoff logic will handle possible database-locked errors with SQLite.

Refs #5000

**Type of change**

- Improvement (change adding some improvement to an existing functionality)

**How Has This Been Tested**

- [x] Manually testing with PostgreSQL and SQLite, running benchmarks using 20 concurrent requests.
- [x] Running the test suite for PostgreSQL and SQLite.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)
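A minimal, hedged sketch of the retry pattern described above; the function names, the model, and the search engine interface are assumptions, and the real code in this PR may catch a narrower set of exceptions.

```python
# Hedged sketch, not the actual Argilla context function: retry a SERIALIZABLE
# transaction with exponential backoff for up to 15 seconds, and keep the
# search engine update outside the transaction so retries never re-run it
# against a partially committed state.
import backoff
from sqlalchemy.exc import DBAPIError, OperationalError


@backoff.on_exception(backoff.expo, (DBAPIError, OperationalError), max_time=15)
async def _update_record_status(db, record, status):
    # Serialization failures raised inside this block trigger a retry.
    async with db.begin():
        record.status = status
        db.add(record)


async def update_record_status(db, search_engine, record, status):
    await _update_record_status(db, record, status)
    # The search engine update happens after the transaction has committed.
    await search_engine.index_records(record.dataset, [record])
```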
- [x] Update progress bar styles
- [x] Show two decimals in the progress bar of the dataset list
- [x] Remove donut chart and replace with small cards
- [x] Replace my progress bar with team progress bar
- [x] Show submitted info when panel is collapsed

---------

Co-authored-by: Damián Pumar <[email protected]>
Co-authored-by: David Berenstein <[email protected]>
This reverts commit 9dba7ef.
# Description

This PR reviews and improves the ES mapping definition for responses by storing responses as a list of user responses (a hedged sketch of the idea follows after this PR description). This change brings some improvements:

- Scales better when the number of annotators increases (the index mapping remains the same)
- Simplifies queries without users
- Supports computing distributions on response values (aggregations can be built on top of question response values)

**Type of change**

- Improvement (change adding some improvement to an existing functionality)

**How Has This Been Tested**

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

---------

Co-authored-by: Francisco Aranda <[email protected]>
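For illustration only, a hedged sketch of what "responses as a list of user responses" could look like as an Elasticsearch mapping fragment; the field names are assumptions, not the mapping shipped in this PR.

```python
# Hedged sketch (assumed field names): a single nested `responses` field keeps
# the mapping size constant as annotators are added, and nested aggregations
# can be built over per-question response values.
responses_mapping = {
    "properties": {
        "responses": {
            "type": "nested",
            "properties": {
                "user_id": {"type": "keyword"},
                "status": {"type": "keyword"},
                # Per-question values; dynamic so new questions don't require
                # one top-level field per user.
                "values": {"type": "object", "dynamic": True},
            },
        }
    }
}
```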
# Description

This PR adds changes to the server telemetry to gather metrics for API endpoint calls. This is the first iteration; some new usage metrics can be included later. The metrics gathered include the user ID and some system info such as the server ID (a UUID generated once when starting the Argilla server). It also deprecates the old telemetry key ("huggingface_hub includes a helper to send telemetry data. This information helps us debug issues and prioritize new features. Users can disable telemetry collection at any time by setting the HF_HUB_DISABLE_TELEMETRY=1 environment variable. Telemetry is also disabled in offline mode (i.e. when setting HF_HUB_OFFLINE=1).")

### OUTDATED

Adds telemetry for:

- [x] users
- [x] workspaces
- [x] datasets
- [x] login users
- [x] server errors
- [x] records
- [x] responses
- [x] suggestions
- [x] metadata
- [x] vectors
- [x] deprecate old telemetry key ("huggingface_hub includes a helper to send telemetry data.")
- [x] write documentation, done in #5253
- [x] add automatic task distribution, should be done after #5136
- [x] include gradio-app/gradio#8884

General idea: I've structured data to come in through URLs/topics like `dataset/settings/vectorsettings/create` or `dataset/records/suggestions/read`, along with some generalized metadata per URL/topic, like `count` or the `type` of suggestion or setting.

To discuss:

- What to do with `list` methods. I currently track `list-like` and send each individual item with `read`, along with a `read` with a count. I did this because it might be interesting to get the total number of users, workspaces, etc. Should we move this over to `list` as a separate CRUD action? Do we also want to capture each individual update?
- A similar logic applies to bulk operations. Should `bulk_crud` be a separate CRUD action?
- I don't track user/dataset/workspace-specific list operations, like list_users_workspace or list_datasets_user.
- I don't track metadata and vector updates on a record level; however, we DO keep track of operations on suggestions and responses.
- @frascuchon was there a reason to include the `header` along with `user/login` operations? Otherwise I will rewrite this a bit and include `user/login` as `user/read`.

Follow up - #5224 @frascuchon

Closes #5204

**Type of change**

- Improvement (change adding some improvement to an existing functionality)

**How Has This Been Tested**

NA

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Paco Aranda <[email protected]>
Co-authored-by: José Francisco Calvo <[email protected]>
Co-authored-by: Francisco Aranda <[email protected]>
Description
This is the feature branch with all the changes to support the new distribution task for datasets.
Changes included in this PR
* Added `distribution` column to `datasets` table.
* Added validation of `distribution` settings when a dataset is created.
* Added default `distribution` settings value when a dataset is created without one (using `overlap` as strategy and `min_submitted=1` by default).
* Allow updating `distribution` settings for a non published dataset.
* Added `status` column to `records` table.
* Update record `status` column when a response is created/updated/deleted for a record, indexing record `status` into the search engine when `status` is changed this way.
* Update record `status` column when a bulk of responses is created, indexing record `status` into the search engine when `status` is changed this way.
* Update record `status` column when a bulk of records is created, indexing record `status` into the search engine when `status` is changed this way.
* Added `status=pending` parameter (along with `response_status=pending`) to filter pending records.
* Changed dataset progress and current user metrics endpoints to report `completed`, `pending` and `total` records based on the `status` of the record, as expected.

Closes #5000
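For reference, a minimal, hedged sketch of the record-status rule these changes implement: a record becomes `completed` once it has at least `min_submitted` submitted responses and stays `pending` otherwise. The helper name below is illustrative, not taken from the PR.

```python
# Hedged sketch of the overlap distribution rule described above; the helper
# name is illustrative and not part of the actual codebase.
def compute_record_status(submitted_responses: int, min_submitted: int = 1) -> str:
    """Return "completed" once enough submitted responses exist, else "pending"."""
    return "completed" if submitted_responses >= min_submitted else "pending"


assert compute_record_status(0) == "pending"
assert compute_record_status(1) == "completed"
assert compute_record_status(1, min_submitted=2) == "pending"
```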
Type of change
How Has This Been Tested
Checklist