Change default systemd restart policy #93

giordano · 2025-01-09T11:54:58Z

On `amdci6` we observed that the systemd services fail somewhat frequently because of permission problems on `*.qcow2` files. While these issues seem to resolve themselves after a few minutes, restarting the services in quick succession only causes them to hit the maximum number of retries in the given interval (10 retries in 2 minutes by default), forcing someone to log into the machine and restart the systemd service manually.

With this change the services are restarted every minute rather than every second, allowing for more attempts over a longer time period. This could cause some agents to take longer to come back after a failure (1 minute vs 1 second), but it should also reduce the need for manual restarts, which cause much longer downtimes.
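
As a rough illustration of the policy described above, a drop-in along these lines would space restart attempts one minute apart. This is a sketch, not the actual diff: the start-limit values simply mirror the 10-retries-in-2-minutes figure quoted above, and `Restart=always` is an assumption about how the unit is configured.

```ini
# Hypothetical drop-in for illustration; the real unit file may differ.
[Unit]
# Refuse further starts after 10 attempts within 120 seconds.
StartLimitIntervalSec=120
StartLimitBurst=10

[Service]
# Assumed restart mode.
Restart=always
# Wait 60 seconds between attempts instead of 1 second.
RestartSec=60
```

With a 60-second delay only two or three attempts fit into any 120-second window, so the limit of 10 is never reached and systemd keeps retrying.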


staticfloat · 2025-01-13T01:55:04Z

Hmmm, I think this will also cause runners to wait 60s before picking up a new job, since the buildkite runners are configured to exit after each job (so that we can tear down the container they did the job in, then start up again in a new container).

What I'd really like is an exponential backoff, configurable via `RestartSteps`, but that's only available in systemd 254+, and we're on 245 on `amdci6`, at least.
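
For reference, a minimal sketch of what such a backoff could look like on a machine with systemd 254 or newer. `RestartSteps=` only takes effect together with `RestartMaxDelaySec=`; the specific values and `Restart=always` are assumptions chosen for illustration.

```ini
# Sketch only: requires systemd 254 or newer, so not usable on 245.
[Service]
# Assumed restart mode.
Restart=always
# Delay before the first restart attempt.
RestartSec=1
# Number of steps over which the delay grows.
RestartSteps=6
# Upper bound on the delay between attempts.
RestartMaxDelaySec=60
```

The delay between attempts grows from `RestartSec=` towards `RestartMaxDelaySec=` over the configured number of steps.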

> While these issues seem to resolve by themselves after a few minutes

In my time administering these machines, that has never been the case. I've only ever had this happen when there was some kind of abrupt end to the KVM virtual machine (such as a power outage on the machine), which prevented libvirt from re-chowning the files back to their original owner, and then it required manual intervention to `sudo chown` those files back to their original ownership.

This would all be solvable by just having a `sudo chown $(id -u):$(id -g)` as part of the systemd startup script, but I've been trying to avoid sprinkling `sudo` commands in here (and requiring passwordless sudo).

Alternative solutions could be:

- Continue allowing the service to try to restart once per second, for much longer.
- Create a small shell script that does the chowning and copying, save it owned by root and make it setuid, then invoke that from the systemd script. This would hopefully not be much of a security hole, as it only does that one action.

giordano · 2025-01-25T11:57:07Z

> In my time administering these machines, that has never been the case.

Ok, maybe I misunderstood the issue when reading the log and it wasn't related to the permissions of the `*.qcow2` files (which did happen yesterday, and I had to go and fix them manually). But what I observed when I opened the PR was still that we were trying to restart the jobs 10 times within 10 seconds and always failing for some other, still unidentified reason, which resolved itself after a much longer period than 10 seconds (on the order of ~10 minutes).

What I'm trying now is to retry every 10 seconds for 45 minutes. This should let us restart more quickly than retrying every minute, while still working around the agents failing to restart within 10 seconds total.
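
A sketch of how "every 10 seconds for 45 minutes" might translate into unit settings, assuming the limit is expressed through `StartLimitIntervalSec=` and `StartLimitBurst=`. The numbers come from the arithmetic 45 min × 60 s = 2700 s and 2700 s / 10 s = 270 attempts; the exact values in the PR may differ, and `Restart=always` is an assumption.

```ini
# Hypothetical values for illustration, not the actual diff.
[Unit]
# 45 minutes = 2700 seconds.
StartLimitIntervalSec=2700
# 2700 s / 10 s per attempt = 270 attempts.
StartLimitBurst=270

[Service]
# Assumed restart mode.
Restart=always
# Wait 10 seconds between attempts.
RestartSec=10
```

Since each attempt also takes some time to fail, slightly fewer than 270 attempts fit into the window. With exactly these values the limit would not be reached, so a lower `StartLimitBurst=` would be needed for systemd to give up around the 45-minute mark.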

giordano requested review from staticfloat and fredrikekre January 9, 2025 11:54
