Change default systemd restart policy #93

@giordano giordano commented Jan 9, 2025

On `amdci6` we observed that the systemd services fail fairly often because of permission problems on `*.qcow2` files. These issues seem to resolve by themselves after a few minutes, but restarting the services in quick succession only makes them hit the maximum number of retries in the given interval (max 10 retries in 2 minutes by default), which forces someone to log into the machine and restart the systemd service manually.

With this change the services are restarted every minute rather than every second, allowing for more attempts over a longer time period. This could cause some agents to take longer to come back after a failure (1 minute vs 1 second), but it also reduces the need for someone to manually restart them, which would cause much longer downtimes.
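
For illustration, the change described above could look roughly like the following unit fragment. This is only a sketch: the `Restart=always` policy and the section layout are assumptions, and the limit values simply restate the "10 retries in 2 minutes" quoted above rather than the actual contents of the diff.

```ini
[Unit]
# Start rate limit as described above: at most 10 starts per 2 minutes.
StartLimitIntervalSec=120
StartLimitBurst=10

[Service]
# Assumed restart policy; the real unit may use a different one.
Restart=always
# Wait 60 s between restart attempts instead of 1 s.
RestartSec=60
```

With a 60 second delay, only a few starts can fall inside any 2 minute window, so the limit of 10 is never reached and systemd keeps retrying.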
@staticfloat

Hmmm, I think this will also make runners wait 60s before picking up a new job, since the Buildkite runners are configured to exit after each job (so that we can tear down the container they did the job in, then start up again in a new container).

What I'd really like is an exponential backoff, configurable via `RestartSteps`, but that's only available in systemd 254+, and we're on 245, on `amdci6` at least.
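
As a rough sketch of what that backoff could look like on a host with systemd 254 or newer (all values here are hypothetical, and `Restart=always` is an assumption about the existing unit):

```ini
[Service]
# Assumed restart policy; the real unit may use a different one.
Restart=always
# Delay before the first restart attempt.
RestartSec=1
# Number of steps over which the delay grows from RestartSec=
# towards RestartMaxDelaySec= (only effective when the latter is set).
RestartSteps=8
# Longest delay between restart attempts.
RestartMaxDelaySec=5min
```

`RestartSteps=` has no effect unless `RestartMaxDelaySec=` is also set, so the two options need to be configured together.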

> While these issues seem to resolve by themselves after a few minutes

In my time administering these machines, that has never been the case. I've only ever had this happen when there was some kind of abrupt end to the KVM virtual machine (such as a power outage on the machine) which prevented libvirt from re-chowning the files back to their original owner, and then it required manual intervention to `sudo chown` those files back to their original ownership.

This would all be solvable by just having a `sudo chown $(id -u):$(id -g)` as part of the systemd startup script, but I've been trying to avoid sprinkling `sudo` commands in here (and requiring passwordless sudo).

Alternative solutions could be:

  • Continue allowing the service to try and restart once per second, for much longer.
  • Create a small shell script that does the chowning and copying, save it as owned by root, make it setuid, and then invoke it from the systemd script. This would hopefully not be much of a security hole, as it only does that one action.

@giordano

> In my time administering these machines, that has never been the case.

Ok, maybe I misunderstood the issue when reading the log and it wasn't related to the permissions of the `*.qcow2` files (that did happen yesterday, and I had to go and fix them manually). But what I observed when I opened the PR was still that we were trying to restart the jobs 10 times within 10 seconds and always failing, for some other still unidentified reason that resolved itself after a much longer period than 10 seconds (on the order of ~10 minutes).

What I'm trying now is to retry every 10 seconds for 45 minutes. This should let us restart quicker than retrying every minute, while still working around the agents failing to restart within 10 seconds total.
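
One possible way to express "every 10 seconds for 45 minutes" is sketched below. The numbers are derived from that description and are not taken from the actual diff; the `Restart=always` policy and the one hour window are assumptions, and the arithmetic assumes each failed start exits almost immediately.

```ini
[Unit]
# 45 minutes of retries at one attempt per 10 s: 45 * 60 / 10 = 270 attempts.
StartLimitBurst=270
# Window chosen longer than 45 minutes (assumption) so that the 270 attempts
# fall inside it even when each failed start takes a moment.
StartLimitIntervalSec=3600

[Service]
# Assumed restart policy; the real unit may use a different one.
Restart=always
# Wait 10 s between restart attempts.
RestartSec=10
```

If the window were exactly 2700 seconds, the counter would likely reset before the limit is reached and the service would keep retrying indefinitely, which may or may not be the intent here.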
