🐛 Computational services are not stopped when a project is deleted (Some computational services run forever with machine load) #3209

Closed
1 task done
Tracked by #950
mrnicegyu11 opened this issue Jul 21, 2022 · 2 comments
Labels: a:dask-service (Any of the dask services: dask-scheduler/sidecar or worker), bug (buggy, it does not work as expected)

Comments

@mrnicegyu11
Member

mrnicegyu11 commented Jul 21, 2022

Long story short

This morning, master04/05/06 were at full CPU load. Investigation showed that each machine was running 4+ containers of registry.osparc-master.speag.com/simcore/services/comp/human-gb-2d-cardiac-model, which were taking a long time to perform their solver tasks and were consuming CPU. Some of these services had been running for a week. (Note that after a study is deleted, the computational tasks linked to it are not stopped, killed, or removed.)

Expected behaviour

  • Desired: No computational service is able to run forever.
  • At the very least: No computational service keeps running once the associated project has been deleted.

Actual behaviour

Computational services that take a long time to finish are allowed to run forever and hold their allocated resources indefinitely.

Suggested actionable changes in simcore:

  • All computational services read an environment variable COMP_SERVICE_TIMEOUT.
  • A default value is set per image, but it can be overridden when running the image.
  • All computational services are executed as presented here, using the timeout shell command, which stops the execution after the provided timeout.
  • Upon timeout, the container terminates. A reasonable return code is provided, and the information that the computation timed out is passed to the frontend.
  • [Minor: Generated data (non-converged computations etc.) is passed back to the user after a timeout, so she can restart her calculations from the interim files.]
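
A minimal sketch of the timeout behaviour proposed in this list, assuming the service is launched from a Python wrapper instead of the `timeout` shell command. `COMP_SERVICE_TIMEOUT` comes from the proposal; the helper name `run_with_timeout`, the 3600 second default, and the reuse of exit code 124 (the code GNU `timeout` returns when the limit is reached) are illustrative assumptions, not existing simcore behaviour.

```python
import os
import subprocess
import sys

# Assumption: mirror GNU `timeout`, which exits with 124 when the limit is hit.
TIMEOUT_EXIT_CODE = 124
# Assumption: hypothetical per-image default, in seconds.
DEFAULT_TIMEOUT_S = 3600.0


def run_with_timeout(cmd, env=None):
    """Run cmd and return its exit code, or 124 if it exceeds COMP_SERVICE_TIMEOUT."""
    env = dict(os.environ if env is None else env)
    timeout_s = float(env.get("COMP_SERVICE_TIMEOUT", DEFAULT_TIMEOUT_S))
    try:
        # On expiry, subprocess.run kills the direct child and waits for it.
        # Grandchildren are not killed, so a real wrapper would also need
        # process-group handling.
        completed = subprocess.run(cmd, env=env, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return TIMEOUT_EXIT_CODE
    return completed.returncode


if __name__ == "__main__":
    # Usage: COMP_SERVICE_TIMEOUT=10 python wrapper.py ./solver args...
    sys.exit(run_with_timeout(sys.argv[1:]))
```

A distinct exit code lets the caller tell a timed-out computation apart from a failed one, which is what the frontend would need in order to report the timeout.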

Steps to reproduce

  • Run a slow computational service.
  • Delete the associated project.
  • Watch the computational service keep running, forever.

Your environment

  • oSparc master

Logs from the incident:

ps aux --forest

root     3437845  0.0  0.0 712392 10440 ?        Sl   Jul14   4:05 /usr/bin/containerd-shim-runc-v2 -namespace moby -id b366ac8c217dc934849a07b85e6784009e7a7
root     3437890  0.0  0.0   1104     0 ?        Ss   Jul14   0:07  \_ /sbin/docker-init -- /sbin/my_init -- /bin/bash docker/entrypoint.sh run
root     3437921  0.0  0.0  33108  7780 ?        S    Jul14   0:00      \_ /usr/bin/python3 -u /sbin/my_init -- /bin/bash docker/entrypoint.sh run
root     3437937  0.0  0.0   4548    64 ?        S    Jul14   0:03      |   \_ /usr/bin/runsvdir -P /etc/service
root     3437938  0.0  0.0   4396    64 ?        Ss   Jul14   0:00      |   |   \_ runsv cron
root     3437940  0.0  0.0  31588  1936 ?        S    Jul14   0:00      |   |   |   \_ /usr/sbin/cron -f
root     3437939  0.0  0.0   4396    60 ?        Ss   Jul14   0:00      |   |   \_ runsv sshd
root     3437941  0.0  0.0  21636  1840 ?        S    Jul14   0:00      |   \_ /bin/bash docker/entrypoint.sh run
root     3438023  0.0  0.0  52532  1516 ?        S    Jul14   0:00      |       \_ su --preserve-environment --command export PATH=/home/scu/service.cli:/usr
8004     3438024  0.0  0.0  21636  1724 ?        Ss   Jul14   0:00      |           \_ bash -c export PATH=/home/scu/service.cli:/usr/local/sbin:/usr/local/b
8004     3438025  0.0  0.0  21636  1768 ?        S    Jul14   0:00      |               \_ /bin/bash /home/scu/service.cli/run
8004     3438026  0.0  0.0  21636  1828 ?        S    Jul14   0:00      |                   \_ /bin/bash do_run
8004     3438060  0.0  0.0  21636  1764 ?        S    Jul14   0:00      |                       \_ bash execute
8004     3438063  218  0.1 101916 59428 ?        Rl   Jul14 21114:15      |                           \_ ./2D stim_param.txt /inputs model_INPUT.from1D /outp
root     3437930  0.0  0.0 564328  6700 ?        Sl   Jul14   0:00      \_ /usr/sbin/syslog-ng --pidfile /var/run/syslog-ng.pid -F --no-caps
root     2102946  0.0  0.0 712648  9924 ?        Sl   Jul18   1:28 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 322d7cf679737cf170e6139b1a8247c9788a7
root     2102971  0.0  0.0   1104     4 ?        Ss   Jul18   0:02  \_ /sbin/docker-init -- /sbin/my_init -- /bin/bash docker/entrypoint.sh run
root     2103009  0.0  0.0  33108  8380 ?        S    Jul18   0:00      \_ /usr/bin/python3 -u /sbin/my_init -- /bin/bash docker/entrypoint.sh run
root     2103059  0.0  0.0   4548    60 ?        S    Jul18   0:01      |   \_ /usr/bin/runsvdir -P /etc/service
root     2103061  0.0  0.0   4396    60 ?        Ss   Jul18   0:00      |   |   \_ runsv cron
root     2103065  0.0  0.0  31588  1948 ?        S    Jul18   0:00      |   |   |   \_ /usr/sbin/cron -f
root     2103062  0.0  0.0   4396    60 ?        Ss   Jul18   0:00      |   |   \_ runsv sshd
root     2103060  0.0  0.0  21636  2036 ?        S    Jul18   0:00      |   \_ /bin/bash docker/entrypoint.sh run
root     2103140  0.0  0.0  52532  1684 ?        S    Jul18   0:00      |       \_ su --preserve-environment --command export PATH=/home/scu/service.cli:/usr
8004     2103141  0.0  0.0  21636  1904 ?        Ss   Jul18   0:00      |           \_ bash -c export PATH=/home/scu/service.cli:/usr/local/sbin:/usr/local/b
8004     2103142  0.0  0.0  21636  1864 ?        S    Jul18   0:00      |               \_ /bin/bash /home/scu/service.cli/run
8004     2103143  0.0  0.0  21636  1856 ?        S    Jul18   0:00      |                   \_ /bin/bash do_run
8004     2103177  0.0  0.0  21636  2024 ?        S    Jul18   0:00      |                       \_ bash execute
8004     2103180  183  0.0  37608 11716 ?        Rl   Jul18 6406:06      |                           \_ ./2D stim_param.txt /inputs/model_INPUT.from1D /outpu
root     2103017  0.0  0.0 433240  6552 ?        Sl   Jul18   0:00      \_ /usr/sbin/syslog-ng --pidfile /var/run/syslog-ng.pid -F --no-caps
root     3130352  0.0  0.0 712456  9784 ?        Sl   Jul20   0:25 /usr/bin/containerd-shim-runc-v2 -namespace moby -id b78c94aa00a62380006d498c468bbf6477af8
root     3130377  0.0  0.0   1104     4 ?        Ss   Jul20   0:01  \_ /sbin/docker-init -- /sbin/my_init -- /bin/bash docker/entrypoint.sh run
root     3130467  0.0  0.0  33108 10368 ?        S    Jul20   0:00      \_ /usr/bin/python3 -u /sbin/my_init -- /bin/bash docker/entrypoint.sh run
root     3130608  0.0  0.0   4548   728 ?        S    Jul20   0:00      |   \_ /usr/bin/runsvdir -P /etc/service
root     3130610  0.0  0.0   4396   736 ?        Ss   Jul20   0:00      |   |   \_ runsv cron
root     3130612  0.0  0.0  31588  2976 ?        S    Jul20   0:00      |   |   |   \_ /usr/sbin/cron -f
root     3130611  0.0  0.0   4396   768 ?        Ss   Jul20   0:00      |   |   \_ runsv sshd
root     3130609  0.0  0.0  21636  3488 ?        S    Jul20   0:00      |   \_ /bin/bash docker/entrypoint.sh run
root     3130804  0.0  0.0  52532  3380 ?        S    Jul20   0:00      |       \_ su --preserve-environment --command export PATH=/home/scu/service.cli:/usr
8004     3130805  0.0  0.0  21636  3340 ?        Ss   Jul20   0:00      |           \_ bash -c export PATH=/home/scu/service.cli:/usr/local/sbin:/usr/local/b
8004     3130806  0.0  0.0  21636  3532 ?        S    Jul20   0:00      |               \_ /bin/bash /home/scu/service.cli/run
8004     3130807  0.0  0.0  21636  3344 ?        S    Jul20   0:00      |                   \_ /bin/bash do_run
8004     3130841  0.0  0.0  21636  3436 ?        S    Jul20   0:00      |                       \_ bash execute
8004     3130844  188  0.1 101916 60860 ?        Rl   Jul20 2292:16      |                           \_ ./2D stim_param.txt /inputs/model_INPUT.from1D /outpu
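
As a rough cross-check of the "running for a week" observation, the `%CPU` and `TIME` columns of the three `./2D` processes can be turned into an estimated wall-clock age. This is only a sketch: it assumes `TIME` is cumulative CPU time in `MMM:SS` form and that `%CPU` is the lifetime average (CPU time divided by elapsed time), as procps `ps` reports it. The function names are illustrative.

```python
def cpu_minutes(time_field: str) -> float:
    """Convert a BSD-style ps TIME value such as '21114:15' (MMM:SS) to minutes."""
    minutes, seconds = time_field.split(":")
    return int(minutes) + int(seconds) / 60


def estimated_wall_days(cpu_percent: float, time_field: str) -> float:
    """Estimate elapsed wall-clock days from the %CPU and TIME columns."""
    wall_minutes = cpu_minutes(time_field) / (cpu_percent / 100)
    return wall_minutes / (24 * 60)


# (PID, %CPU, TIME) of the three ./2D processes in the listing
PROCESSES = [
    (3438063, 218, "21114:15"),
    (2103180, 183, "6406:06"),
    (3130844, 188, "2292:16"),
]

for pid, cpu, cputime in PROCESSES:
    print(pid, round(estimated_wall_days(cpu, cputime), 1), "days")
```

Under these assumptions the estimates come out at roughly 6.7, 2.4 and 0.8 days, which matches the Jul14, Jul18 and Jul20 start dates in the listing.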

Proof: CPU on the container is properly limited to 4 CPUs:

ubuntu@osparc-master-05:~$ docker inspect c7ed | jq | grep CPU
        "SC_COMP_SERVICES_SCHEDULED_AS=CPU",
        "SIMCORE_NANO_CPUS_LIMIT=4000000000",
        "simcore.service.settings": "[{\"name\": \"Resources\", \"type\": \"Resources\", \"value\": {\"Limits\": {\"NanoCPUs\": 4000000000, \"MemoryBytes\": 4294967296}, \"Reservations\": {\"NanoCPUs\": 4000000000, \"MemoryBytes\": 4294967296}}}]"

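For reference, the `simcore.service.settings` label above can be decoded to confirm the limits in more familiar units. A small sketch, assuming the label value has already been unescaped into plain JSON (as `jq` prints it); `resource_limits` is an illustrative helper, not part of simcore.

```python
import json

# Label value from the docker inspect output above, unescaped.
SETTINGS_LABEL = (
    '[{"name": "Resources", "type": "Resources", "value": '
    '{"Limits": {"NanoCPUs": 4000000000, "MemoryBytes": 4294967296}, '
    '"Reservations": {"NanoCPUs": 4000000000, "MemoryBytes": 4294967296}}}]'
)


def resource_limits(settings_label: str) -> tuple[float, float]:
    """Return (CPUs, memory in GiB) from the 'Resources' entry of the label."""
    for entry in json.loads(settings_label):
        if entry.get("name") == "Resources":
            limits = entry["value"]["Limits"]
            # Docker expresses CPU limits in units of 1e-9 CPUs.
            return limits["NanoCPUs"] / 1e9, limits["MemoryBytes"] / 2**30
    raise ValueError("no 'Resources' entry found in settings label")


cpus, mem_gib = resource_limits(SETTINGS_LABEL)
print(f"{cpus} CPUs, {mem_gib} GiB")  # prints "4.0 CPUs, 4.0 GiB"
```
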
Logs from the task 2D running, showing it to be slow:

osparc-master-05:~$ docker exec -it c7ed bash
root@c7ed76a0457e:/home/scu# cat /logs/log.dat
number of threads is 4
0    0
1    1
2    10
3    60
4    150
5    60
Performing 2D simulation
From fiber initial file
t: 1, tStep: 0, runtime: 3330 min
/outputs/ap1.dat
t: 2, tStep: 0, runtime: 6319 min
/outputs/ap2.dat
t: 3, tStep: 0, runtime: 9251 min
/outputs/ap3.dat
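
To put numbers on "slow", the runtime lines of this log can be converted into a per-step duration. A sketch, assuming the `runtime` value is cumulative since the start of the simulation (the log does not say so explicitly); the regular expression and function names are illustrative.

```python
import re

LOG = """\
t: 1, tStep: 0, runtime: 3330 min
/outputs/ap1.dat
t: 2, tStep: 0, runtime: 6319 min
/outputs/ap2.dat
t: 3, tStep: 0, runtime: 9251 min
/outputs/ap3.dat
"""

RUNTIME_RE = re.compile(r"t:\s*\d+,\s*tStep:\s*\d+,\s*runtime:\s*(\d+)\s*min")


def cumulative_runtimes(log_text: str) -> list[int]:
    """Extract the runtime values (minutes) in the order they were logged."""
    return [int(match.group(1)) for match in RUNTIME_RE.finditer(log_text)]


def per_step_minutes(cumulative: list[int]) -> list[int]:
    """Turn cumulative runtimes into the duration of each individual step."""
    steps, previous = [], 0
    for value in cumulative:
        steps.append(value - previous)
        previous = value
    return steps


steps = per_step_minutes(cumulative_runtimes(LOG))
print(steps)  # [3330, 2989, 2932]
print(round(sum(steps) / len(steps) / 60, 1), "hours per step on average")
```

Under that assumption each step takes roughly two days, so a run with more than a handful of steps stays alive for well over a week.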

Logs from the dask sidecar

The dask-sidecar logs contained no entries for the UUID or Docker container name of the long-running computational services. Logs of the same image running successfully do exist:

2022-07-20 22:08:53,676 - distributed.worker.simcore_service_dask_sidecar.computational_sidecar.core - INFO - Starting task for simcore/services/comp/human-gb-2d-cardiac-model:1.0.1 on osparc-master-05-master-simcore_master_dask-sidecar...
2022-07-20 22:08:53,970 - distributed.worker.simcore_service_dask_sidecar.computational_sidecar.core - INFO - Pulling simcore/services/comp/human-gb-2d-cardiac-model:1.0.1: {'status': 'Pulling from simcore/services/comp/human-gb-2d-cardiac-model', 'id': '1.0.1'}...
2022-07-20 22:08:53,973 - distributed.worker.simcore_service_dask_sidecar.computational_sidecar.core - INFO - Pulling simcore/services/comp/human-gb-2d-cardiac-model:1.0.1: {'status': 'Digest: sha256:d0b6e27dd84a2b5622ec05d07d445a95fa5946b9bd147c2b55b6f81ec4a080b3'}...
2022-07-20 22:08:53,973 - distributed.worker.simcore_service_dask_sidecar.computational_sidecar.core - INFO - Pulling simcore/services/comp/human-gb-2d-cardiac-model:1.0.1: {'status': 'Status: Image is up to date for registry.osparc-master.speag.com/simcore/services/comp/human-gb-2d-cardiac-model:1.0.1'}...
2022-07-20 22:08:53,976 - distributed.worker.simcore_service_dask_sidecar.computational_sidecar.core - INFO - Docker image for simcore/services/comp/human-gb-2d-cardiac-model:1.0.1 ready  on osparc-master-05-master-simcore_master_dask-sidecar.
2022-07-20 22:08:53,979 - distributed.worker.simcore_service_dask_sidecar.computational_sidecar.docker_utils - INFO - simcore/services/comp/human-gb-2d-cardiac-model:1.0.1 has integration version 0
2022-07-20 22:08:54,089 - distributed.worker.simcore_service_dask_sidecar.computational_sidecar.docker_utils - INFO - Starting to parse information of task [simcore/services/comp/human-gb-2d-cardiac-model:1.0.1 - cb36fc3a90ea688fd0962c47199878de5ea2d53e832892ba308008f8725ccc22/sleepy_brahmagupta]
2022-07-20 22:08:58,057 - distributed.worker.simcore_service_dask_sidecar.dask_utils - INFO - [simcore/services/comp/human-gb-2d-cardiac-model:1.0.1 - cb36fc3a90ea688fd0962c47199878de5ea2d53e832892ba308008f8725ccc22/sleepy_brahmagupta - LOG]: number of threads is 4
2022-07-20 22:08:58,559 - distributed.worker.simcore_service_dask_sidecar.dask_utils - INFO - [simcore/services/comp/human-gb-2d-cardiac-model:1.0.1 - cb36fc3a90ea688fd0962c47199878de5ea2d53e832892ba308008f8725ccc22/sleepy_brahmagupta - LOG]: 0 0
2022-07-20 22:08:59,063 - distributed.worker.simcore_service_dask_sidecar.dask_utils - INFO - [simcore/services/comp/human-gb-2d-cardiac-model:1.0.1 - cb36fc3a90ea688fd0962c47199878de5ea2d53e832892ba308008f8725ccc22/sleepy_brahmagupta - LOG]: 1 1
2022-07-20 22:08:59,565 - distributed.worker.simcore_service_dask_sidecar.dask_utils - INFO - [simcore/services/comp/human-gb-2d-cardiac-model:1.0.1 - cb36fc3a90ea688fd0962c47199878de5ea2d53e832892ba308008f8725ccc22/sleepy_brahmagupta - LOG]: 2 10
2022-07-20 22:09:00,068 - distributed.worker.simcore_service_dask_sidecar.dask_utils - INFO - [simcore/services/comp/human-gb-2d-cardiac-model:1.0.1 - cb36fc3a90ea688fd0962c47199878de5ea2d53e832892ba308008f8725ccc22/sleepy_brahmagupta - LOG]: 3 60
2022-07-20 22:09:00,571 - distributed.worker.simcore_service_dask_sidecar.dask_utils - INFO - [simcore/services/comp/human-gb-2d-cardiac-model:1.0.1 - cb36fc3a90ea688fd0962c47199878de5ea2d53e832892ba308008f8725ccc22/sleepy_brahmagupta - LOG]: 4 150
2022-07-20 22:09:01,073 - distributed.worker.simcore_service_dask_sidecar.dask_utils - INFO - [simcore/services/comp/human-gb-2d-cardiac-model:1.0.1 - cb36fc3a90ea688fd0962c47199878de5ea2d53e832892ba308008f8725ccc22/sleepy_brahmagupta - LOG]: 5 60
2022-07-20 22:09:01,575 - distributed.worker.simcore_service_dask_sidecar.dask_utils - INFO - [simcore/services/comp/human-gb-2d-cardiac-model:1.0.1 - cb36fc3a90ea688fd0962c47199878de5ea2d53e832892ba308008f8725ccc22/sleepy_brahmagupta - LOG]: Performing 2D simulation
2022-07-20 22:09:02,077 - distributed.worker.simcore_service_dask_sidecar.dask_utils - INFO - [simcore/services/comp/human-gb-2d-cardiac-model:1.0.1 - cb36fc3a90ea688fd0962c47199878de5ea2d53e832892ba308008f8725ccc22/sleepy_brahmagupta - LOG]: From fiber initial file
2022-07-20 22:39:19,055 - distributed.worker.simcore_service_dask_sidecar.computational_sidecar.docker_utils - INFO - Completed run of registry.osparc-master.speag.com/simcore/services/comp/human-gb-2d-cardiac-model:1.0.1
Key:       simcore/services/comp/human-gb-2d-cardiac-model:1.0.1:userid_9389:projectid_a2ac1573-d8b8-56bb-94b6-6eb28ea3fa9c:nodeid_3e594d71-4754-5b20-80a1-19ae081466ea:uuid_1f5aea92-3fbe-48b3-933e-57cb9006587a
kwargs:    

(last line was cut off)

Minor findings

  • It seems that all long-running services were started before the dask-sidecar on each node (which is updated regularly due to the rolling releases on master from the deployment-agent). This means they might not be cleaned up, or might go rogue, if the dask-sidecar restarts.
  • The running computational services are plain Docker containers, not bound to any Docker swarm service.
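
One possible way to spot such containers is sketched below: it filters parsed `docker inspect` records for running containers that carry no swarm service label and have been up longer than a threshold. The `Config.Labels`, `State.Running`, `State.StartedAt` and `Name` fields and the `com.docker.swarm.service.name` label are standard Docker; the function names, the threshold, and the sample records are hypothetical and not taken from the incident.

```python
from datetime import datetime, timedelta, timezone

# Label Docker swarm sets on containers that belong to a service task.
SWARM_SERVICE_LABEL = "com.docker.swarm.service.name"


def parse_started_at(value: str) -> datetime:
    """Parse Docker's StartedAt (e.g. '2022-07-14T08:00:00.123456789Z') to UTC."""
    whole_seconds = value.split(".")[0].rstrip("Z")  # drop nanoseconds and 'Z'
    parsed = datetime.strptime(whole_seconds, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def find_stale_containers(inspected, now, max_age):
    """Return names of running non-swarm containers older than max_age."""
    stale = []
    for container in inspected:
        labels = (container.get("Config") or {}).get("Labels") or {}
        state = container.get("State") or {}
        if SWARM_SERVICE_LABEL in labels or not state.get("Running"):
            continue
        if now - parse_started_at(state["StartedAt"]) > max_age:
            stale.append(container["Name"].lstrip("/"))
    return stale


# Hypothetical records in the shape `docker inspect` returns (heavily trimmed).
SAMPLE = [
    {"Name": "/old_comp", "Config": {"Labels": {}},
     "State": {"Running": True, "StartedAt": "2022-07-14T08:00:00.123456789Z"}},
    {"Name": "/swarm_task", "Config": {"Labels": {SWARM_SERVICE_LABEL: "sidecar"}},
     "State": {"Running": True, "StartedAt": "2022-07-14T08:00:00Z"}},
    {"Name": "/fresh_comp", "Config": {"Labels": None},
     "State": {"Running": True, "StartedAt": "2022-07-21T07:00:00Z"}},
]

now = datetime(2022, 7, 21, 8, 0, tzinfo=timezone.utc)
print(find_stale_containers(SAMPLE, now, timedelta(days=2)))  # ['old_comp']
```

Such a check only reports candidates. Whether a container is truly orphaned still depends on whether its project and dask task exist, which this sketch does not look at.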

Tasks

@mrnicegyu11 mrnicegyu11 added the bug buggy, it does not work as expected label Jul 21, 2022
@sanderegg sanderegg added t:enhancement Improvement or request on an existing feature enhancement bug buggy, it does not work as expected a:dask-service Any of the dask services: dask-scheduler/sidecar or worker and removed bug buggy, it does not work as expected t:enhancement Improvement or request on an existing feature enhancement labels Feb 28, 2023
@sanderegg
Member

@mrnicegyu11: I am not so sure about the timeout now that we want to have wallets and such.
Basically, the timeout will now happen when the wallet comes down to its allowed minimum.
For the project deletion, I created a new issue (see the task list).

@sanderegg sanderegg added this to the Baklava milestone Aug 21, 2023
@sanderegg sanderegg changed the title 🐛 Some computational services run forever with machine load 🐛 Ensure computational services are stopped when a project is deleted (Some computational services run forever with machine load) Aug 22, 2023
@sanderegg sanderegg changed the title 🐛 Ensure computational services are stopped when a project is deleted (Some computational services run forever with machine load) 🐛 Computational services are not stopped when a project is deleted (Some computational services run forever with machine load) Aug 22, 2023
@sanderegg
Member

Tested on master today with long-running sleepers. When the project was deleted, the sleepers were stopped and removed as they should be. Closing.
