Dangling Game Servers left behind if allocation fails at the wrong moment #607

Open · jkowalski opened this issue Feb 20, 2019 · 12 comments
Labels: awaiting-maintainer, kind/bug, kind/design

@jkowalski (Contributor)

There's a rare race condition in Game Server Allocation which leads to dangling game servers that will never be cleaned up.

Basically, if a call to allocate a server partially succeeds by marking the GS as Allocated but fails to return the result to the user (for example, because of a call timeout or a connection dropped at the wrong moment), the user will never know which server was allocated by their call. Even worse, such a server will never be in use and will stay Allocated forever, preventing the fleet from scaling down, etc.

One solution would be to purge Allocated game servers that have not had an actual game played on them for a long while. To do that, we need to know which GSs are active and which are not.

How about the following strategy:

  • we add a new call to the SDK that indicates the game is really being played (perhaps we just extend SDK health reporting to periodically pass extra data such as the number of players in the game, some latency statistics, and maybe average frame time, tick rate, etc.)

  • any DGS that does not report a health signal indicating the game is being played within X amount of time will automatically be moved back to Ready, or perhaps even Shutdown

This would obviously be a breaking change for current DGSs that don't report the correct health information yet, so we'd need an opt-in or opt-out mechanism for the transition period to avoid getting those GSs killed.

@jkowalski added the kind/bug and kind/design labels on Feb 20, 2019
@markmandel (Collaborator)

Oh yep - this is fun. And in reality, likely to happen in any distributed system, because weird stuff always does.

2 thoughts on the above:

> we add a new call to the SDK that indicates the game is really being played

Long term I think we will need to add some form of player tracking to the SDK - it's coming up a lot in conversations I'm having with the Open Match team. How deep the feature set goes is an ongoing question.

> any DGS that does not report a health signal indicating the game is being played

So in theory, a DGS could do this now, without changes to Agones. It would be ideal if we handled this in the SDK as well, but you could tie your healthcheck ping to player count, and if nothing happens after N minutes, stop reporting Healthy() (or you could also call sdk.Shutdown()?)
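A minimal sketch of that workaround, assuming the Agones Go SDK, a game-supplied activePlayers() helper (not an SDK call), and an illustrative five-minute grace window:

```go
package main

import (
	"log"
	"time"

	sdk "agones.dev/agones/sdks/go"
)

// activePlayers is a stand-in for however the game itself tracks connected
// players; it is not part of the Agones SDK.
func activePlayers() int { return 0 }

func main() {
	s, err := sdk.NewSDK()
	if err != nil {
		log.Fatalf("could not connect to the Agones sidecar: %v", err)
	}
	if err := s.Ready(); err != nil {
		log.Fatalf("could not mark the server Ready: %v", err)
	}

	started := time.Now()    // ideally reset this when the server becomes Allocated
	grace := 5 * time.Minute // illustrative value for "N minutes"

	for range time.Tick(2 * time.Second) {
		if activePlayers() == 0 && time.Since(started) > grace {
			// Either stop sending Health() pings here and let the health
			// controller mark the server Unhealthy, or shut down explicitly:
			if err := s.Shutdown(); err != nil {
				log.Printf("shutdown failed: %v", err)
			}
			return
		}
		if err := s.Health(); err != nil {
			log.Printf("health ping failed: %v", err)
		}
	}
}
```

Whether to simply stop the Health() pings or call Shutdown() directly is a policy choice; the sketch uses Shutdown() so the server is reaped immediately rather than waiting for the health timeout.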

But this probably really goes to a wider conversation about player tracking, and how we want to manage that.

@cyriltovena (Collaborator)

Adding a mandatory call from the SDK when the game starts seems like a good solution that avoids having to handle player tracking for now.

And I agree, we need opt-in or opt-out; I think opt-out is better, since we want our users not to have dangling GameServers by default.

@ilkercelikyilmaz (Contributor)

Maybe we should consider having another state called "InUse". After the GS state is set to Allocated, the game running on the GS should call Agones to change the state from Allocated to InUse. If the GS stays in the Allocated state for X minutes, the system can shut it down and release the resources.
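Agones has no InUse state today, but a rough approximation of the idea, assuming the Go SDK's SetLabel call, would be for the game to label itself once a match actually starts; an external sweeper (not shown, and not existing Agones behaviour) could then reap Allocated servers that never set the label:

```go
package main

import (
	"log"

	sdk "agones.dev/agones/sdks/go"
)

// markInUse approximates an "InUse" state with an SDK-set label. Agones
// prefixes SDK labels, so this would appear on the GameServer object as
// "agones.dev/sdk-in-use". The label key is an assumption for illustration.
func markInUse(s *sdk.SDK) error {
	return s.SetLabel("in-use", "true")
}

func main() {
	s, err := sdk.NewSDK()
	if err != nil {
		log.Fatalf("could not connect to the Agones sidecar: %v", err)
	}
	// Call this once players have actually joined the match.
	if err := markInUse(s); err != nil {
		log.Printf("could not set the in-use label: %v", err)
	}
}
```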

@markmandel (Collaborator)

At some point in the near future, we will need to track, at the very least, the maximum number of players a DGS can handle and the number of players currently playing, to make it easier to support backfill for matchmakers (or even just front-fill).

This all feels related to me.

@cyriltovena (Collaborator)

I'm wondering if the game server could handle this itself by shutting down after a given timeout.

@markmandel (Collaborator)

I'm wondering, if we (optionally) tracked player counts/IDs per game server, whether we could have some kind of timeout to shut the server down if it is allocated but no players connect after a certain period? (Also optional.)

@castaneai (Collaborator)

Hi. I found the same problem.

In games that match a small number of players, a server whose allocation fails is left dangling.
If the player retries the allocation, a new server is allocated and the dangling server remains. (The matchmaker has no way of knowing which server the failed allocation touched.)

Currently I am implementing PlayerCount management and @jkowalski's solution on the GameServer side: if PlayerCount stays at zero for a certain period while in the Allocated state, Shutdown is called automatically.
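A minimal sketch of that approach, assuming the Go SDK's WatchGameServer callback and the alpha player-tracking GetPlayerCount call, with an illustrative ten-minute idle timeout:

```go
package main

import (
	"log"
	"sync"
	"time"

	coresdk "agones.dev/agones/pkg/sdk"
	sdk "agones.dev/agones/sdks/go"
)

func main() {
	s, err := sdk.NewSDK()
	if err != nil {
		log.Fatalf("could not connect to the Agones sidecar: %v", err)
	}

	var (
		mu             sync.Mutex
		allocatedSince time.Time
	)
	// Record when the GameServer first transitions to Allocated.
	err = s.WatchGameServer(func(gs *coresdk.GameServer) {
		mu.Lock()
		defer mu.Unlock()
		if gs.Status != nil && gs.Status.State == "Allocated" && allocatedSince.IsZero() {
			allocatedSince = time.Now()
		}
	})
	if err != nil {
		log.Fatalf("could not watch the GameServer: %v", err)
	}

	const idleTimeout = 10 * time.Minute // illustrative value
	for range time.Tick(30 * time.Second) {
		mu.Lock()
		since := allocatedSince
		mu.Unlock()
		if since.IsZero() || time.Since(since) < idleTimeout {
			continue
		}
		// Alpha player tracking: if nobody ever connected, reap the server.
		count, err := s.Alpha().GetPlayerCount()
		if err != nil {
			log.Printf("player count unavailable: %v", err)
			continue
		}
		if count == 0 {
			if err := s.Shutdown(); err != nil {
				log.Printf("shutdown failed: %v", err)
			}
			return
		}
	}
}
```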

This issue can occur in a variety of games, so it would be nice if Agones could provide a solution!

@markmandel (Collaborator)

We are part of the way there, with Player Tracking being in alpha right now!

Not sure if we should wait until GA for this feature, but what would you like this to look like? Some sort of additional configuration option on a GameServer yaml?

@castaneai (Collaborator)

I agree. It would be a good idea to avoid dangling servers by configuring this in the GameServer yaml.

@markmandel (Collaborator)

This still seems like a good idea, but will get moved to #2716 once it's implemented.

@github-actions (bot)

This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale', please add the 'awaiting-maintainer' label or add a comment. Thank you for your contributions.

@github-actions bot added the stale label on Apr 15, 2023
@markmandel added the awaiting-maintainer label and removed the stale label on Apr 18, 2023
@markmandel (Collaborator)

Added awaiting-maintainer as I think we should definitely still do this once #2716 is complete.
