feat: 🎸 add a /backfill admin endpoint #708

severo · 2023-01-26T19:45:15Z

The logic is very basic: it updates all the datasets of the Hub, with a low priority. Note that most of the jobs will be skipped, because the response will already be in the cache.

We might want to take a more detailed approach later to reduce the number of unnecessary jobs by specifically creating jobs for the missing data only.
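As a rough illustration of the behaviour described above, here is a minimal in-memory sketch. It is not the code from this PR: `Priority`, `Job`, `InMemoryQueue`, `backfill` and `run_worker` are hypothetical names, and a plain dict stands in for the cache. It only shows the idea of queuing one low-priority job per dataset and letting the worker skip the ones that are already cached.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class Priority(Enum):
    NORMAL = "normal"
    LOW = "low"


@dataclass
class Job:
    dataset: str
    priority: Priority


@dataclass
class InMemoryQueue:
    # Hypothetical stand-in for the real job queue.
    jobs: List[Job] = field(default_factory=list)

    def add(self, dataset: str, priority: Priority) -> None:
        self.jobs.append(Job(dataset, priority))


def backfill(datasets: Iterable[str], queue: InMemoryQueue) -> int:
    """Queue one low-priority job per dataset and return the number created."""
    created = 0
    for dataset in datasets:
        queue.add(dataset, Priority.LOW)
        created += 1
    return created


def run_worker(queue: InMemoryQueue, cache: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Process queued jobs, skipping those whose response is already cached."""
    processed: List[str] = []
    skipped: List[str] = []
    for job in queue.jobs:
        if job.dataset in cache:
            skipped.append(job.dataset)
        else:
            cache[job.dataset] = "response"  # placeholder for the computed response
            processed.append(job.dataset)
    queue.jobs.clear()
    return processed, skipped
```

The more detailed approach mentioned above would move the cache check into `backfill` itself, so that jobs are only created for datasets missing from the cache.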

Apart from this, the PR also fixes the creation of child jobs: the priority is preserved (i.e. low-priority jobs create low-priority child jobs).
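The priority fix can be pictured with a small sketch. This is an assumption about the shape of the logic, not the actual diff: `Job`, `create_children` and the step names are hypothetical. The point is simply that a child copies its parent's priority instead of falling back to a default.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Job:
    job_type: str
    dataset: str
    priority: str  # "normal" or "low"


def create_children(parent: Job, child_types: List[str]) -> List[Job]:
    """Create one child job per type, inheriting the parent's priority."""
    return [
        Job(job_type=child_type, dataset=parent.dataset, priority=parent.priority)
        for child_type in child_types
    ]


# Usage: a low-priority parent only produces low-priority children.
parent = Job(job_type="step-a", dataset="user/dataset", priority="low")
children = create_children(parent, ["step-b", "step-c"])
```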


codecov-commenter · 2023-01-26T19:47:51Z

Codecov Report

Base: 83.46% // Head: 87.12% // Increases project coverage by +3.66% 🎉

Coverage data is based on head (bb27740) compared to base (290f5be).
Patch coverage: 58.33% of modified lines in pull request are covered.

❗ Current head bb27740 differs from pull request most recent head 6c12a0f. Consider uploading reports for the commit 6c12a0f to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #708      +/-   ##
==========================================
+ Coverage   83.46%   87.12%   +3.66%     
==========================================
  Files          14       20       +6     
  Lines         526      660     +134     
==========================================
+ Hits          439      575     +136     
+ Misses         87       85       -2

Flag	Coverage Δ
jobs_mongodb_migration	`?`
services_admin	`87.12% <58.33%> (?)`

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
services/admin/src/admin/routes/backfill.py	`56.52% <56.52%> (ø)`
services/admin/src/admin/app.py	`92.59% <100.00%> (ø)`
jobs/mongodb_migration/tests/test_plan.py
.../migrations/_20221116133500_queue_job_add_force.py
jobs/mongodb_migration/tests/test_migration.py
...ation/src/mongodb_migration/database_migrations.py
...db_migration/migrations/_20221110230400_example.py
...grations/_20221117223000_cache_generic_response.py
...ngodb_migration/src/mongodb_migration/collector.py
jobs/mongodb_migration/tests/test_collector.py
... and 25 more


severo · 2023-01-27T10:20:49Z

Merging...

severo · 2023-01-27T10:39:12Z

Followup: it does not work because the loop that creates the jobs one by one takes too long and the nginx reverse-proxy returns a gateway timeout error. Only part of the datasets are refreshed.

severo · 2023-01-30T10:51:07Z

Alternatives:

- reduce the duration by doing fewer checks (don't double-check if the dataset is private) and create the jobs as a batch
- find a way to increase the timeout for this endpoint
- return immediately, and launch the update asynchronously
- make /backfill be another kind of job

I think the first option is the simplest one. Trying here: #720
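For comparison, the third alternative (return immediately and launch the update asynchronously) could look roughly like the sketch below. It uses only the standard library and is not tied to the admin service's web framework; `start_backfill` and `create_job` are hypothetical names.

```python
import threading
from typing import Callable, Dict, List, Tuple


def start_backfill(
    datasets: List[str], create_job: Callable[[str], None]
) -> Tuple[Dict[str, str], threading.Thread]:
    """Start job creation in a background thread and return without waiting."""

    def work() -> None:
        for dataset in datasets:
            create_job(dataset)

    thread = threading.Thread(target=work, daemon=True)
    thread.start()
    # The caller can answer the HTTP request right away with this payload,
    # so the reverse proxy no longer waits for the whole loop to finish.
    return {"status": "accepted"}, thread
```

The trade-off is that the caller no longer learns from the response whether every job was created, so progress would have to be checked some other way.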

severo · 2023-01-30T13:43:26Z

OK. It's still far too slow, and it still times out:

$ curl -H 'Authorization: Bearer hf_...' -X POST "https://datasets-server.huggingface.co/admin/backfill"


<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.20.2</center>
</body>
</html>

Note that in #720 we were still doing a loop; it was not a batch operation.
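To make the loop versus batch distinction concrete, here is a toy comparison. `FakeJobStore` is a hypothetical stand-in for the jobs database that only counts round trips; its `insert_one` and `insert_many` method names mirror pymongo's collection methods, but nothing here talks to a real database or reflects the actual queue code.

```python
from typing import Dict, List


class FakeJobStore:
    """Stand-in for the jobs database; counts round trips instead of doing I/O."""

    def __init__(self) -> None:
        self.jobs: List[Dict[str, str]] = []
        self.round_trips = 0

    def insert_one(self, job: Dict[str, str]) -> None:
        self.round_trips += 1
        self.jobs.append(job)

    def insert_many(self, jobs: List[Dict[str, str]]) -> None:
        self.round_trips += 1
        self.jobs.extend(jobs)


def backfill_loop(store: FakeJobStore, datasets: List[str]) -> None:
    # One round trip per dataset: the cost grows with the number of datasets.
    for dataset in datasets:
        store.insert_one({"dataset": dataset, "priority": "low"})


def backfill_batch(store: FakeJobStore, datasets: List[str]) -> None:
    # Build every job document first, then send them in a single call.
    store.insert_many([{"dataset": dataset, "priority": "low"} for dataset in datasets])
```

Both functions leave the same jobs in the store; only the number of calls to the store differs.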

severo requested review from AndreaFrancis and lhoestq January 26, 2023 19:45

severo added 2 commits January 26, 2023 19:46

style: 💄 fix style

bb27740

feat: 🎸 update docker image

6c12a0f

severo mentioned this pull request Jan 27, 2023

Configs and splits #702

Merged

severo requested a review from albertvillanova January 27, 2023 10:02

severo merged commit 6e63de2 into main Jan 27, 2023

severo deleted the add-admin-endpoint-to-backfill branch January 27, 2023 10:20

severo mentioned this pull request Jan 31, 2023

Convert the /backfill endpoint to a kubernetes Job run periodically #740

Closed
