🐛 Computational services are not stopped when a project is deleted (Some computational services run forever with machine load) #3209
Labels
a:dask-service
Any of the dask services: dask-scheduler/sidecar or worker
bug
buggy, it does not work as expected
Long story short
This morning, master04/05/06 were at full CPU load. An investigation showed that each of the machines had 4+ instances of a registry.osparc-master.speag.com/simcore/services/comp/human-gb-2d-cardiac-model container running, each taking a long time to perform the solver tasks and consuming CPU. Some of the services had been running for a week. (Note that after deleting the study, the computational tasks linked to the study are not stopped/killed/removed.)
Expected behaviour
Actual behaviour
Computational services that take a long time to finish are allowed to run indefinitely and hold on to their allocated resources the whole time.
Suggested actionable changes in simcore:
COMP_SERVICE_TIMEOUT
timeout shell-command, which will stop the execution after the provided timeout.
Steps to reproduce
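To illustrate the timeout suggestion listed under the suggested changes above, here is a minimal Python sketch. It rests on assumptions that the issue does not confirm: that COMP_SERVICE_TIMEOUT would be an environment variable holding a number of seconds, and that the service is started as a local child process. The helper name `run_with_timeout` and the default value are hypothetical and are not part of simcore.

```python
import os
import subprocess
import sys

# Hypothetical fallback; the issue does not specify a default timeout.
DEFAULT_TIMEOUT_S = 3600.0


def run_with_timeout(cmd, settings=None):
    """Run cmd and kill it once the configured timeout has elapsed.

    The timeout is read from COMP_SERVICE_TIMEOUT (assumed to be seconds).
    Returns the exit code, or None if the command was killed on timeout.
    """
    settings = os.environ if settings is None else settings
    timeout_s = float(settings.get("COMP_SERVICE_TIMEOUT", DEFAULT_TIMEOUT_S))
    try:
        # subprocess.run kills the child and waits for it before re-raising.
        completed = subprocess.run(cmd, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


if __name__ == "__main__":
    # A stand-in for a slow solver: sleeps 5 s but is cut off after 1 s.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(slow, {"COMP_SERVICE_TIMEOUT": "1"}))
```

Note that this only terminates the direct child process. A containerized computational service would need the equivalent applied to the container itself, or to its entrypoint (for example via the timeout shell-command mentioned above).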
Your environment
Logs from the incident:
ps aux --forest
Proof: CPU on the container is properly limited to 4 CPUs:
Logs from the task 2D running, showing it to be slow:
Logs from the dask sidecar
The logs of the dask-sidecar contained nothing w.r.t. the UUID or docker container name of the long-running computational services. Logs of the same image running successfully do exist:
(last line was cut off)
Minor findings
Tasks