Allow graceful restarts of game server containers instead of shutdown #2781
Comments
This is an interesting idea 🤔 and I can see the value! For reference, this is where we track container failure and move containers to Unhealthy: agones/pkg/gameservers/health.go Lines 135 to 148 in d70793c
At first, I didn't think that the K8s API actually gave us access to status codes when a termination occurs, but I was wrong! As we can see here: There is an Additional thought/detail/questions:
In response to the questions:
Ideally it would be really nice to have a |
Not a huge change. Actually, that might be a better solution in many ways. Since if the GameServer is before being Ready, we allow the container to restart, and don't move it to One solution is that the game server binary watches for that state change from I actually think it's the least amount of work on the Agones side, since we already test that restarts are allowed before |
That's totally a manageable amount of integration! If that's an option, then that would actually be my preferred solution mostly because it makes it significantly easier to figure out why a game server pod behaved the way it did, and seems to fit in with the other ways that a game server moves between states. That being said, I just remembered that There's an issue open in the kubernetes repo to allow tuning of the backoff, but it's several years old without any real progress: kubernetes/kubernetes#57291 |
That is a really good point. You will be able to see from the
Erk! That sucks, I really liked this idea! I do like the workaround of having a shell script, something like:
Although probably with better exit code handling. Two thoughts:
I'd still like some way to restart the container when viable, just to clear out any changes we may have made to the state in the container. So ideally somewhere between the wrapper and the game server we'd be able to determine that we've crossed the restart threshold and it should be safe to restart the container without incurring any backoff penalties. I'll throw something together that does all that when I get a minute. |
Got bored and wrote some bash scripts, and they aren't that bad!

loop.sh

    #!/bin/bash
    while "$@"; do :; done

run.fail.sh

    #!/bin/bash
    echo "FAIL!"
    sleep 1
    exit 2

run.pass.sh

    #!/bin/bash
    echo "Running!"
    sleep 1

Results:

    ➜ shell ./loop.sh ./run.pass.sh
    Running!
    Running!
    Running!
    Running!
    Running!
    Running!
    Running!
    Running!
    ^C
    ➜ shell ./loop.sh ./run.fail.sh
    FAIL!
    ➜ shell
Actually yeah, for an example those are way simpler than what I was talking about and much easier to understand. I can build whatever container restart timing stuff I need on top of that. |
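A minimal sketch of how the loop above might grow the exit code handling and restart threshold discussed earlier in the thread. Everything in it is an assumption for illustration: the function name `run_loop`, the `MIN_UPTIME_SECONDS` variable, and the 600-second default (picked on the assumption that the kubelet resets its restart backoff after a container has run for roughly ten minutes) do not come from the thread or from Agones.

```bash
#!/bin/bash
# Hypothetical wrapper, not part of Agones.
# Reruns the game server command inside the container while it exits 0.
# Once the wrapper has been up for at least min_uptime seconds, it returns 0
# so the container itself can be restarted. A non-zero exit from the game
# server is passed straight through so the failure stays visible.

run_loop() {
  local min_uptime="$1"
  shift
  local start=$SECONDS code
  while true; do
    code=0
    "$@" || code=$?
    if [ "$code" -ne 0 ]; then
      echo "game server exited with code $code" >&2
      return "$code"
    fi
    if [ $((SECONDS - start)) -ge "$min_uptime" ]; then
      echo "uptime threshold reached, exiting so the container restarts" >&2
      return 0
    fi
  done
}

# Run only when a command was passed on the command line.
if [ "$#" -gt 0 ]; then
  run_loop "${MIN_UPTIME_SECONDS:-600}" "$@"
  exit $?
fi
```

Invoked as, say, `MIN_UPTIME_SECONDS=600 ./loop.sh ./run.pass.sh`, this keeps cycling the process in place and only hands control back to Kubernetes after the threshold has passed or the game server fails.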
The only downside I can see to exiting the process / container is that you would need to load all your data into memory again, rather than being able to reuse it if you switched back to a zero state. But if you don't have much / any data to load (thinking relay servers here), this could be an interesting approach. For a workaround for doing the above right now, rather than wait for an SDK to switch back to Scheduled, you could set a label that would move that |
I've been wondering if there are solutions we could do on the SDK that wouldn't even really look like an exit: if you had a "golden" point in your initialization that you wanted to preserve, we could potentially fork() at that point and keep a hot standby process. One approach that might work is to have the original process bind to the healthcheck port first, then the child of the fork would just wait on being able to bind to it. I thiiink (would have to check) that as long as there's a process receiving liveness checks, you should be able to flap between processes like this indefinitely. |
I assume you mean, this would be something you do inside your game server binary process itself? |
Yeah. It would require cooperation of the game server binary, so it's intrusive (unlike reporter's solution, which would work on legacy binaries), but it has a similar advantage: you no longer have to reason about whether you reset the game state sufficiently, the child process is effectively the reset state. As long as, when the child becomes the primary server, you fork a new child before accruing state, you'll be at the "golden" state. |
@zmerlynn that makes sense! Then you could reuse in-memory state at that point. So I'm thinking of two things for this ticket:
Some other ideas for method names, none of which I'm super happy about:
Howzat sound? |
I poked at my fork suggestion briefly and .. maybe it would work in a language other than Go, but at least in Go, fork support is pretty much limited to fork/exec and not "true" fork(), due to needing to copy state of goroutines/etc. |
@austin-space I loved the idea here so much, I put together a rather long proposal in #2794, which brings together a couple of different ideas (disruption controls and pod reuse). Rather than keep a separate issue, I'm going to close this issue out and redirect conversations towards #2794. |
We decided the (original) proposal in #2794 was biting off too much at once and stripped it back to the core of the issue, so I'm reopening this one. I'd like to keep looking into this, though, as I think there's a real opportunity here to offer simple between-container reuse. |
Hi! I am interested in using this feature, are there any updates on it? |
@miai10 I think we are blocked on kubernetes/kubernetes#57291 for the time being. We have discussed internally options to get around blocking on that - and I'll be discussing it directly with SIG-node soon. For the time being https://agones.dev/site/docs/integration-patterns/reusing-gameservers/ is the supported method for Pod re-use. |
I was at the sig node meeting today. Have you considered having a supervisord (or similar init) process managing the game server within the container? The supervisor would restart the game server process without restarting the entire container. |
@rphillips That's not too different from #2781 (comment), which is a hacky bash loop. I think it's an okay approach to handle it outside the process but within the container. That said, we've had specific requests for exiting the container fully instead - specifically, users don't want to have to reason about whether the process shut down "cleanly" within the container. In the game server context, a lot of users are using a framework (Unity, Unreal) that wraps their simulation, and users have seen cases where e.g. the framework modified Windows registry keys and left the container in a funky state. So a supervisor is feasible if you're willing to reason about the container state in-between restarts, but allowing the container to be restarted would be ideal. |
'This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale' please add 'awaiting-maintainer' label or add a comment. Thank you for your contributions ' |
There is some traction on kubernetes/kubernetes#57291, so keeping this from getting staled-out for now. If it can be changed upstream, we can do some pretty interesting patterns in Agones transparently. |
Is your feature request related to a problem? Please describe.
Pod churn (creating and deleting pods) is a relatively expensive operation. When pod churn is particularly high, it can get in the way of a fleet's ability to scale, since both contend for the same resources.
Describe the solution you'd like
Have some value in the game server health configuration like AllowGracefulContainerRestart. If this flag is set to true, instead of calling Shutdown() on the SDK, we'd just exit out of the game server process with a zero exit code. The health controller would not mark this as unhealthy (as long as we had a zero exit code). The game server would then go back to the Starting state and would run normally from there.
In the case where a game server exited with a non-zero exit code or exited before being allocated, that would indicate an unhealthy game server and the pod should be restarted.
This would remove the costly pod churn operation, while still allowing us to have a brand new game server container.
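The decision described above can be sketched as a small truth table. This is only an illustration of the proposal, not Agones code: the function `classify_exit` and its three arguments are hypothetical, and the state names are used as plain strings.

```bash
#!/bin/bash
# Hypothetical decision table for the proposal; not the Agones health controller.
# Arguments: container exit code, GameServer state at exit, and whether the
# proposed AllowGracefulContainerRestart flag is set ("true" or "false").
classify_exit() {
  local exit_code="$1" state="$2" allow="$3"
  if [ "$allow" = "true" ] && [ "$exit_code" -eq 0 ] && [ "$state" = "Allocated" ]; then
    # Clean exit after allocation: reuse the pod, restart the container.
    echo "Starting"
  else
    # Non-zero exit, exit before allocation, or flag not set.
    echo "Unhealthy"
  fi
}
```

For example, `classify_exit 0 Allocated true` yields `Starting`, while a non-zero exit code or an exit from a state other than `Allocated` yields `Unhealthy`.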
Describe alternatives you've considered
Additional context
This would be targeting a fairly specific case where the cost to start a new game server is very low, but the impact of starting a new pod is relatively high.