
Add priority field to queue #705

Merged: 5 commits merged into main from add-priority-level-to-queue on Jan 26, 2023

Conversation

@severo (Collaborator) commented Jan 26, 2023

In this PR:

  • jobs have a new field called priority: normal (default) or low
  • normal jobs are fetched before low ones (while still respecting max_jobs_per_namespace)
  • when updating a job (i.e., when a dataset has been updated again), the priority is inherited and never lowered, even if the update requests a lower priority. This way, launching a backfill (with low priority) on a dataset that was already waiting for an update does not deprioritize it (see the sketch below)
  • the same logic applies to the existing "force" field: when updating a job, if force was true, it stays true
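
A minimal sketch of this merge rule, for illustration only; the Job dataclass and merge_on_update helper are hypothetical names, not the actual implementation in libcommon's queue module:

from dataclasses import dataclass
from enum import Enum


class Priority(str, Enum):
    NORMAL = "normal"
    LOW = "low"


@dataclass
class Job:
    dataset: str
    priority: Priority = Priority.NORMAL
    force: bool = False


def merge_on_update(existing: Job, requested: Job) -> Job:
    # Keep the stronger value for each field: a job already waiting at
    # NORMAL priority is never demoted by a LOW-priority backfill, and
    # force=True is never reset to False.
    priority = (
        Priority.NORMAL
        if Priority.NORMAL in (existing.priority, requested.priority)
        else Priority.LOW
    )
    return Job(
        dataset=existing.dataset,
        priority=priority,
        force=existing.force or requested.force,
    )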

@severo mentioned this pull request on Jan 26, 2023
@AndreaFrancis (Contributor) left a comment:

I left a couple of comments regarding migration validation; besides that, everything looks good!

    raise ValueError("priority should be 'normal'")

check_documents(DocCls=JobSnapshot, sample_size=10, custom_validation=custom_validation)
if JobSnapshot.objects(force=False).count() != JobSnapshot.objects.count():
Comment (Contributor): Shouldn't it be priority="normal"?


check_documents(DocCls=JobSnapshot, sample_size=10, custom_validation=custom_validation)
if JobSnapshot.objects(force=False).count() != JobSnapshot.objects.count():
    raise ValueError('All the objects should have the "force" field, set to False')
Comment (Contributor): Also here: 'All the objects should have the "priority" field, set to "normal"'.
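
Putting both suggestions together, the corrected validation would presumably read as follows; this is a sketch assembled from the snippets above, and the body of custom_validation is inferred from its error message:

def custom_validation(doc: JobSnapshot) -> None:
    # Compare against the raw string value stored in MongoDB.
    if doc.priority != "normal":
        raise ValueError("priority should be 'normal'")

check_documents(DocCls=JobSnapshot, sample_size=10, custom_validation=custom_validation)
if JobSnapshot.objects(priority="normal").count() != JobSnapshot.objects.count():
    raise ValueError('All the objects should have the "priority" field, set to "normal"')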

@codecov-commenter commented:

Codecov Report

Base: 90.40% // Head: 83.46% // Decreases project coverage by 6.95% ⚠️

Coverage data is based on head (efca6b9) compared to base (7d4e4d6).
Patch coverage: 76.66% of modified lines in pull request are covered.

❗ Current head efca6b9 differs from pull request most recent head cd8d03b. Consider uploading reports for the commit cd8d03b to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #705      +/-   ##
==========================================
- Coverage   90.40%   83.46%   -6.95%     
==========================================
  Files          97       14      -83     
  Lines        4929      526    -4403     
==========================================
- Hits         4456      439    -4017     
+ Misses        473       87     -386     
Flag                     Coverage Δ
jobs_mongodb_migration   83.46% <76.66%> (-0.88%) ⬇️
libs_libcommon           ?
services_admin           ?
services_api             ?
workers_datasets_based   ?

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...grations/_20230126164900_queue_job_add_priority.py 76.27% <76.27%> (ø)
...ngodb_migration/src/mongodb_migration/collector.py 100.00% <100.00%> (ø)
libs/libcommon/src/libcommon/queue.py
libs/libcommon/tests/test_queue.py
...ets_based/src/datasets_based/workers/first_rows.py
...datasets_based/workers/parquet_and_dataset_info.py
workers/datasets_based/tests/test_worker_loop.py
services/api/src/api/routes/webhook.py
services/admin/src/admin/config.py
services/admin/tests/conftest.py
... and 76 more

☔ View full report at Codecov.

@severo force-pushed the add-priority-level-to-queue branch from cd8d03b to 3de1315 on January 26, 2023 at 17:57
@severo (Collaborator, Author) commented Jan 26, 2023

I rebased on main after #706, hence the force push. I'll also have to update the Docker images with the new value (3de1315).

@severo (Collaborator, Author) commented Jan 26, 2023

Thanks for the review!

@severo merged commit ebdb9fe into main on Jan 26, 2023
@severo deleted the add-priority-level-to-queue branch on January 26, 2023 at 18:12
@severo (Collaborator, Author) commented Jan 26, 2023

Interesting: the db migration job failed and rolled back to the previous state (I have tests, but had never seen it fail live 😅). It seems to work as intended:

The deployment failed because the migration failed, i.e., the new images were not deployed because the database was not ready:

make upgrade-prod
make[1]: Entering directory '/home/slesage/hf/datasets-server/chart'
helm upgrade --install datasets-server-prod . --values docker-images.yaml --values env/prod.yaml -n datasets-server
Error: UPGRADE FAILED: pre-upgrade hooks failed: timed out waiting for the condition
make[1]: *** [Makefile:18: upgrade] Error 1
make[1]: Leaving directory '/home/slesage/hf/datasets-server/chart'
make: *** [Makefile:42: upgrade-prod] Error 2

The logs of the migration job:

INFO: 2023-01-26 18:22:15,442 - root - Start migrations
INFO: 2023-01-26 18:22:15,596 - root - 3 migrations have already been applied. They will be skipped.
INFO: 2023-01-26 18:22:15,596 - root - Migrate 20230126164900: add to the migrations collection
INFO: 2023-01-26 18:22:15,605 - root - Migrate 20230126164900: apply
INFO: 2023-01-26 18:22:15,606 - root - Add the priority field, with the default value ('normal'), to all the jobs
INFO: 2023-01-26 18:25:00,630 - root - Migrate 20230126164900: validate
INFO: 2023-01-26 18:25:00,630 - root - Ensure that a random selection of jobs have the 'priority' field set to 'normal'
ERROR: 2023-01-26 18:25:54,426 - root - Validation error on document 636e0dc52af91cd735e0ad3b: priority should be 'normal'. Document: {"_id": {"$oid": "636e0dc52af91cd735e0ad3b"}, "type": "/splits", "dataset": "bigscience-data/roots_ar_tashkeela", "unicity_id": "Job[/splits][bigscience-data/roots_ar_tashkeela][None][None]", "namespace": "bigscience-data", "force": false, "priority": "normal", "status": "success", "created_at": {"$date": 1668156869463}, "started_at": {"$date": 1668156870203}, "finished_at": {"$date": 1668156871088}}
ERROR: 2023-01-26 18:25:54,427 - root - Migration failed: priority should be 'normal'
INFO: 2023-01-26 18:25:54,427 - root - Start rollback
INFO: 2023-01-26 18:25:54,427 - root - Rollback 20230126164900: roll back
INFO: 2023-01-26 18:25:54,427 - root - Remove the priority field from all the jobs
INFO: 2023-01-26 18:33:29,814 - root - Rollback 20230126164900: removed from the migrations collection
INFO: 2023-01-26 18:33:29,819 - root - Rollback 20230126164900: done
INFO: 2023-01-26 18:33:29,820 - root - All executed migrations have been rolled back

By the way, the bug is in the validation function, not in the migration itself... Nevermind, I'll fix it (the comparison was against the enum member itself instead of its value).
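
My reading of that parenthetical, as a minimal sketch; the Priority enum here is illustrative and may differ from the actual definition:

from enum import Enum

class Priority(Enum):
    NORMAL = "normal"
    LOW = "low"

stored = "normal"  # raw string value as read back from MongoDB

# Buggy check: a plain Enum member never compares equal to its raw
# string value, so every document failed validation even though the
# field was set correctly (as in the log above).
assert (stored == Priority.NORMAL) is False

# Fixed check: compare against the member's .value instead.
assert stored == Priority.NORMAL.value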

Another comment: it takes 3 minutes to add the field to all the jobs in the db (and 8 minutes to delete it!). That's a lot, and it's due to the size of the queue with the past jobs: at one point I created too many useless jobs that were skipped afterward, and we have several billion of them that we will never use... Maybe just cleaning the queue and starting from scratch would be better.
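
For reference, the add/remove steps amount to a pair of collection-wide updates along these lines; this is a sketch with pymongo, and the connection string, database, and collection names are assumptions:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
jobs = client["queue"]["jobs"]  # assumed database/collection names

# Migration: add the field with its default value to every job document.
jobs.update_many({}, {"$set": {"priority": "normal"}})

# Rollback: remove the field from every job document.
jobs.update_many({}, {"$unset": {"priority": ""}})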

(Screenshot: 2023-01-26 at 19:33)

@severo (Collaborator, Author) commented Jan 26, 2023

Fixed in #707.
