♻️✨ Comp backend task state reporting fixed #4775

Merged
17 commits merged on Sep 21, 2023


@sanderegg sanderegg commented Sep 19, 2023

What do these changes do?

This PR fixes cases where a task was reported as Running although it was not yet computing and had only been taken by a dask-sidecar. (The Dask scheduler states do not distinguish between a task that is being processed and one that has been taken by a worker and is queued in the worker's memory.)
It therefore fixes the following:

  • cases where multiple computational services were shown as Running
  • the Running state is now shown as soon as the computational service signals that it has started progressing, so it is tied directly to worker activity and not only to the dask-scheduler processing function
  • the pending <-> waiting_for_cluster state is now stable

This was achieved by:

  • properly separating the Dask task state from RunningState
  • since Dask cannot tell computing apart from queued in the worker, the director-v2 (dv-2) works this out by checking whether progress has already been signaled through the Dask Pub/Sub mechanism
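
The separation described above can be sketched roughly as follows. This is a minimal illustration and not the code of this PR: the reduced `RunningState` enum, the `to_running_state` helper and the `progress_signaled` flag are assumptions made for the example. Only the scheduler state names (`processing`, `memory`, `erred`) are Dask's own.

```python
from enum import Enum


class RunningState(str, Enum):
    # Reduced, illustrative set of states (assumption, not the full enum)
    PENDING = "PENDING"
    STARTED = "STARTED"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"


def to_running_state(dask_state: str, progress_signaled: bool) -> RunningState:
    """Map a Dask scheduler task state to a user-facing running state.

    `progress_signaled` is a hypothetical flag that would be set once the
    computational service has published its first progress event.
    """
    if dask_state == "memory":
        # Dask keeps the result in memory: the task finished successfully
        return RunningState.SUCCESS
    if dask_state == "erred":
        return RunningState.FAILED
    if dask_state == "processing" and progress_signaled:
        # The worker has really started computing, not merely queued the task
        return RunningState.STARTED
    # Anything else, including "processing" without progress, is not running yet
    return RunningState.PENDING
```

With this split, a task that a worker has taken but not yet started stays in a pending state: `to_running_state("processing", progress_signaled=False)` gives `RunningState.PENDING`, and the same call with `progress_signaled=True` gives `RunningState.STARTED`.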

Bonus:

  • ensure that RPC calls do not return internal exceptions
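
One possible way to keep internal exceptions from leaking through an RPC boundary is sketched below. It is an illustration under stated assumptions, not the router code changed in this PR: `RPCServerError` and `hide_internal_errors` are hypothetical names, and the handler is assumed to be a plain `async` function.

```python
import asyncio
import functools
import logging
from typing import Any, Awaitable, Callable

_logger = logging.getLogger(__name__)


class RPCServerError(Exception):
    """Hypothetical generic error that is safe to send back to an RPC caller."""

    def __init__(self, method_name: str, exc_type: str) -> None:
        super().__init__(f"RPC method '{method_name}' failed with {exc_type}")
        self.method_name = method_name
        self.exc_type = exc_type


def hide_internal_errors(
    handler: Callable[..., Awaitable[Any]]
) -> Callable[..., Awaitable[Any]]:
    """Wrap an async handler so callers only ever see RPCServerError."""

    @functools.wraps(handler)
    async def wrapper(*args: Any, **kwargs: Any) -> Any:
        try:
            return await handler(*args, **kwargs)
        except Exception as exc:
            # Keep the full traceback in the server logs only
            _logger.exception("Unhandled error in RPC method %s", handler.__name__)
            # 'from None' drops the original exception from the raised error
            raise RPCServerError(handler.__name__, type(exc).__name__) from None

    return wrapper
```

The caller then receives only the method name and the exception type, while the message and traceback of the original error stay on the server side.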

Related issue/s

How to test

DevOps Checklist

@sanderegg sanderegg added the a:director-v2 issue related with the director-v2 service label Sep 19, 2023
@sanderegg sanderegg added this to the the nameless milestone Sep 19, 2023
@sanderegg sanderegg requested a review from pcrespov as a code owner September 19, 2023 14:58
@sanderegg sanderegg self-assigned this Sep 19, 2023
@sanderegg sanderegg requested a review from GitHK as a code owner September 19, 2023 14:58

@elisabettai elisabettai left a comment


Very neat, thanks for the fix, dask-master! 🙃

@codecov

codecov bot commented Sep 19, 2023

Codecov Report

Merging #4775 (1e64f5d) into master (4af4b44) will increase coverage by 0.7%.
The diff coverage is 85.8%.


@@           Coverage Diff            @@
##           master   #4775     +/-   ##
========================================
+ Coverage    86.8%   87.5%   +0.7%     
========================================
  Files        1143    1097     -46     
  Lines       47699   45957   -1742     
  Branches     1015     861    -154     
========================================
- Hits        41425   40241   -1184     
+ Misses       6043    5517    -526     
+ Partials      231     199     -32     
Flag Coverage Δ
integrationtests 65.0% <72.3%> (+0.4%) ⬆️
unittests 85.2% <80.1%> (+0.5%) ⬆️


Files Changed Coverage Δ
...ore_service_director_v2/modules/clusters_keeper.py 45.8% <0.0%> (-4.2%) ⬇️
...imcore_service_director_v2/utils/comp_scheduler.py 100.0% <ø> (ø)
...rector_v2/modules/comp_scheduler/base_scheduler.py 89.8% <68.9%> (-1.1%) ⬇️
..._director_v2/modules/db/repositories/comp_tasks.py 96.6% <84.2%> (-1.0%) ⬇️
...rector_v2/modules/comp_scheduler/dask_scheduler.py 92.1% <97.2%> (+1.4%) ⬆️
...ice-library/src/servicelib/rabbitmq/_rpc_router.py 90.6% <100.0%> (-0.3%) ⬇️
...ore_service_director_v2/api/routes/computations.py 91.1% <100.0%> (+20.9%) ⬆️
...-v2/src/simcore_service_director_v2/core/errors.py 78.3% <100.0%> (+0.3%) ⬆️
...mcore_service_director_v2/models/dask_subsystem.py 100.0% <100.0%> (ø)
...simcore_service_director_v2/modules/dask_client.py 93.0% <100.0%> (+0.5%) ⬆️

... and 53 files with indirect coverage changes


@pcrespov pcrespov left a comment


thx!
Is there a reasonable way to test this bug so it does not happen again?

@sanderegg

thx! Is there a reasonable way to test this bug so it does not happen again?

@pcrespov you might remember that we talked about simplifying the director-v2 scheduler, and that was when we finally decided to remove one of the signals used for sending service states. That is actually when we broke it.

@sanderegg sanderegg force-pushed the bugfix/task-state-running branch 3 times, most recently from bfa12f0 to 534fe97 Compare September 20, 2023 20:50
@sanderegg sanderegg force-pushed the bugfix/task-state-running branch from 534fe97 to 3b278d3 Compare September 20, 2023 21:08
@sanderegg sanderegg enabled auto-merge (squash) September 20, 2023 21:14
@sanderegg sanderegg disabled auto-merge September 21, 2023 06:13
@codeclimate

codeclimate bot commented Sep 21, 2023

Code Climate has analyzed commit 1e64f5d and detected 0 issues on this pull request.


@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information
0.0% Duplication

@sanderegg sanderegg merged commit 0fb7103 into ITISFoundation:master Sep 21, 2023
@sanderegg sanderegg deleted the bugfix/task-state-running branch September 21, 2023 14:52