
🐛⚗️Computational backend: tasks in UNKNOWN state are sometimes only temporarily in that state #5195

Conversation


@sanderegg (Member) commented Dec 20, 2023

What do these changes do?

This PR tries to fix/improve the issue where:

  • a task is submitted to an external cluster (the cluster gets created); the task is in the WAITING_FOR_CLUSTER state
  • at some point the cluster primary (running the dask-scheduler) comes up and the dask client submits the job
  • under load, when the director-v2 asks for the task's status, the dask-scheduler may not yet know about the task, and the task is then correctly reported as UNKNOWN
  • the director-v2 then wrongly triages that task as STARTED and at the same time COMPLETED, and sends a start event over the wire to the ResourcesUsageTracker (RUT) (this is the wrong event)
  • while the task is actually waiting, no heartbeat events are sent to RUT, which therefore considers the task as FAILED
  • later the task appears in the dask-scheduler and is eventually processed by a dask-worker, which triggers another start event (the real one), and RUT correctly complains
  • the director-v2 then sends heartbeat events

Different issues were identified:

  • when the dask-scheduler does not "know" about a job, it returns the status RunningState.UNKNOWN (this makes sense)
  • when the director-v2 receives such a state, it considers the task as COMPLETED; in a way, if the task is unknown, the task is gone/lost, so from a higher-level point of view it is considered FAILED, which also makes sense
  • what does not make sense is sending the start event in this situation, since the task did not even start; it makes no sense to bill that (a sketch of the problematic triage follows this list)
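
For illustration, here is a minimal Python sketch of the problematic triage described above. All names (COMPLETED_STATES, triage_task, the helper functions) are hypothetical stand-ins and not the actual director-v2 identifiers; only RunningState.UNKNOWN comes from the description above.

```python
from enum import Enum, auto


class RunningState(Enum):
    # illustrative subset of states for this sketch
    WAITING_FOR_CLUSTER = auto()
    STARTED = auto()
    SUCCESS = auto()
    FAILED = auto()
    ABORTED = auto()
    UNKNOWN = auto()


# before this PR: UNKNOWN was lumped together with the terminal ("completed") states
COMPLETED_STATES = {
    RunningState.SUCCESS,
    RunningState.FAILED,
    RunningState.ABORTED,
    RunningState.UNKNOWN,  # <-- the problematic member
}


def send_start_event_to_rut(task_id: str) -> None:
    print(f"start event for {task_id} sent to RUT")


def process_task_completion(task_id: str) -> None:
    print(f"{task_id} treated as gone/lost -> FAILED")


def triage_task(task_id: str, state: RunningState) -> None:
    # a task the dask-scheduler has simply not registered yet is reported as
    # UNKNOWN and ends up triaged as STARTED and COMPLETED at the same time:
    # a start event is billed even though nothing ever ran
    if state in COMPLETED_STATES:
        send_start_event_to_rut(task_id)
        process_task_completion(task_id)


triage_task("task-1", RunningState.UNKNOWN)  # wrongly bills a task that never started
```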

This PR tries to correct that by removing the UNKNOWN state from the COMPLETED states. The UNKNOWN state is treated separately:

  • the start event is not sent anymore
  • but the pipeline is still stopped, which is the expected behavior in case the dask-scheduler was restarted or deleted (see the sketch below)
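
A corresponding sketch of the corrected handling, with the same caveat that all names except RunningState.UNKNOWN are hypothetical: UNKNOWN is removed from the completed set and gets its own branch that stops the pipeline without emitting a start event.

```python
from enum import Enum, auto


class RunningState(Enum):
    STARTED = auto()
    SUCCESS = auto()
    FAILED = auto()
    ABORTED = auto()
    UNKNOWN = auto()


# after this PR: UNKNOWN is no longer considered a completed state
COMPLETED_STATES = {
    RunningState.SUCCESS,
    RunningState.FAILED,
    RunningState.ABORTED,
}


def stop_pipeline(task_id: str) -> None:
    print(f"pipeline of {task_id} stopped")


def process_task_completion(task_id: str) -> None:
    print(f"{task_id} completed, result processed")


def triage_task(task_id: str, state: RunningState) -> None:
    if state is RunningState.UNKNOWN:
        # the task never actually started, so no start event is sent and
        # nothing is billed; the pipeline is still stopped, which is the
        # expected outcome if the dask-scheduler was restarted or deleted
        stop_pipeline(task_id)
        return
    if state in COMPLETED_STATES:
        # genuinely terminal states keep their previous handling
        process_task_completion(task_id)


triage_task("task-1", RunningState.UNKNOWN)  # stops the pipeline, no billing
```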

Bonuses:

Related issue/s

How to test

Dev Checklist

DevOps Checklist

@sanderegg sanderegg self-assigned this Dec 20, 2023
@sanderegg sanderegg added the a:director-v2 issue related with the director-v2 service label Dec 20, 2023
@sanderegg sanderegg added this to the Kobayashi Maru milestone Dec 20, 2023
@matusdrobuliak66 (Contributor) left a comment


Very nice, thanks! We will observe 👍


codecov bot commented Dec 20, 2023

Codecov Report

Attention: 11 lines in your changes are missing coverage. Please review.

Comparison is base (4749268) 81.9% compared to head (7abca60) 68.5%.

Additional details and impacted files


@@            Coverage Diff            @@
##           master   #5195      +/-   ##
=========================================
- Coverage    81.9%   68.5%   -13.4%     
=========================================
  Files        1290     532     -758     
  Lines       52877   26862   -26015     
  Branches     1157     202     -955     
=========================================
- Hits        43319   18410   -24909     
+ Misses       9308    8402     -906     
+ Partials      250      50     -200     
Flag               Coverage Δ
integrationtests   64.8% <55.8%> (+1.3%) ⬆️
unittests          84.5% <67.6%> (+4.9%) ⬆️

Flags with carried forward coverage won't be shown.

Files                                                     Coverage Δ
...imcore_service_director_v2/utils/comp_scheduler.py     100.0% <100.0%> (ø)
...rector_v2/modules/comp_scheduler/dask_scheduler.py      89.7% <66.6%> (-0.6%) ⬇️
...rector_v2/modules/comp_scheduler/base_scheduler.py      86.4% <66.6%> (-2.2%) ⬇️

... and 923 files with indirect coverage changes

@sanderegg sanderegg force-pushed the comp-backend/only-start-tasks-when-receiving-progress-update branch from f26c62c to b6e9915 on December 20, 2023 22:24

@bisgaard-itis (Contributor) left a comment


Thanks, let's hope this solves the issue 🤞🏻

@sanderegg sanderegg force-pushed the comp-backend/only-start-tasks-when-receiving-progress-update branch from 1af560e to f4eb886 on December 21, 2023 07:12
@sanderegg sanderegg force-pushed the comp-backend/only-start-tasks-when-receiving-progress-update branch from da7dfef to 7abca60 on December 21, 2023 10:19

Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@sanderegg sanderegg merged commit d4d9ec1 into ITISFoundation:master Dec 21, 2023
54 checks passed
@sanderegg sanderegg deleted the comp-backend/only-start-tasks-when-receiving-progress-update branch December 21, 2023 12:35