Slow game servers deletion #540

Closed
cyriltovena opened this issue Jan 31, 2019 · 15 comments
Labels
area/performance Anything to do with Agones being slow, or making it go faster.

@cyriltovena
Collaborator

cyriltovena commented Jan 31, 2019

If you create a fleet using the tutorial command:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/agones/release-0.7.0/examples/simple-udp/fleet.yaml

Then over-provision the fleet (say, 40 replicas) using kubectl edit, and a lot of pods will get stuck in scheduling.

However, if you delete the fleet, it disappears instantly and the pods are deleted within seconds, but the gameservers can hang around for up to 10 minutes before ultimately disappearing.

We should investigate the root cause of this.

cyriltovena added the area/performance label Jan 31, 2019
@markmandel
Collaborator

Oh that's fun. I'm almost willing to bet that what happens is:

  1. GameServer has a finaliser
  2. The finaliser only gets removed when the GameServer's backing pod gets deleted
  3. Since they are stuck in Scheduling there will be no pod until there is room
  4. So it takes time to create the Pod and then have it removed

So we should look at whether, when you delete a GameServer, we can remove the finaliser if it's at Scheduled or before. (Possible concern with race conditions, but that should be the general gist.)
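
A minimal sketch of the proposal above, in Go. It is not the Agones implementation: the `GameServer` struct, the state names and their ordering, the `Deleting` flag (standing in for a set deletion timestamp), and `finalizerName` are all hypothetical simplifications chosen for illustration. It also ignores the race-condition concern raised above.

```go
package main

// GameServerState is a simplified stand-in for the GameServer lifecycle.
// The state names and their ordering are assumptions made for this sketch.
type GameServerState int

const (
	StatePortAllocation GameServerState = iota
	StateCreating
	StateStarting
	StateScheduled
	StateReady
)

// finalizerName is a hypothetical finalizer key, not the one Agones uses.
const finalizerName = "example.dev/gameserver-cleanup"

// GameServer is a hypothetical, trimmed-down model of the resource.
type GameServer struct {
	Name       string
	State      GameServerState
	Deleting   bool // stands in for a non-nil deletion timestamp
	Finalizers []string
}

// canDropFinalizerEarly reports whether a GameServer that is being deleted
// is still at Scheduled or earlier, in which case the sketch drops the
// finalizer instead of waiting for a backing Pod to be created and removed.
func canDropFinalizerEarly(gs GameServer) bool {
	return gs.Deleting && gs.State <= StateScheduled
}

// withoutFinalizer returns a copy of gs with the named finalizer removed.
func withoutFinalizer(gs GameServer, name string) GameServer {
	kept := make([]string, 0, len(gs.Finalizers))
	for _, f := range gs.Finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	gs.Finalizers = kept
	return gs
}
```

A controller following this idea would call `canDropFinalizerEarly` when it sees a deleted GameServer and, if it returns true, update the resource with the result of `withoutFinalizer`.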

@jkowalski
Contributor

jkowalski commented Jan 31, 2019

I thought we have to release the port regardless? So the finalizer has to stay. Right?

@cyriltovena
Collaborator Author

Something that I'd appreciate: we delete the fleet only once its replica count is down to 0. WDYT?

@markmandel
Collaborator

@jkowalski

So the port only physically gets released once the Pod is gone -- hence the finaliser.

At PortAllocated - we assign a port from our in memory registry of available ports - but we don't do anything to lock a specific port at that stage physically on the network.

So if we delete the GameServer before the Pod goes up, the PortAllocator will just free up the port in the registry, and make it available again.
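
To illustrate the point about the in-memory registry, here is a small Go sketch. It is an assumption-laden toy, not the Agones PortAllocator: the `portRegistry` type, its methods, and the port range used in the usage note are hypothetical. It only shows that allocating and releasing a port here is pure bookkeeping, with nothing bound on the network.

```go
package main

import (
	"errors"
	"sync"
)

// errNoPorts is returned when every port in the range is already handed out.
var errNoPorts = errors.New("no free ports in range")

// portRegistry is a hypothetical in-memory record of which ports are in use.
// It never opens a socket; it only tracks bookkeeping state.
type portRegistry struct {
	mu     sync.Mutex
	lo, hi int
	inUse  map[int]bool
}

// newPortRegistry creates a registry covering the inclusive range [lo, hi].
func newPortRegistry(lo, hi int) *portRegistry {
	return &portRegistry{lo: lo, hi: hi, inUse: make(map[int]bool)}
}

// Allocate marks the lowest free port as used and returns it.
func (r *portRegistry) Allocate() (int, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for p := r.lo; p <= r.hi; p++ {
		if !r.inUse[p] {
			r.inUse[p] = true
			return p, nil
		}
	}
	return 0, errNoPorts
}

// Release makes a port available again, for example when a GameServer is
// deleted before its Pod was ever created.
func (r *portRegistry) Release(port int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.inUse, port)
}
```

With `newPortRegistry(7000, 7001)`, two calls to `Allocate` hand out both ports, and a `Release(7000)` makes 7000 available to the next `Allocate` without any Pod having existed.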

@Kuqd you can always do this yourself by doing a foregroundDeletion (details) - although not sure how to do that from kubectl.
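
For reference, foreground deletion is requested through the Kubernetes API by sending a `DeleteOptions` body with `propagationPolicy` set to `Foreground` on the DELETE request. The Go sketch below only builds that JSON body with the standard library; the `deleteOptions` struct and `foregroundDeleteBody` function are hypothetical names for this example, and sending the request (URL, authentication, client) is left out.

```go
package main

import "encoding/json"

// deleteOptions mirrors the few fields of the Kubernetes DeleteOptions
// object that are needed to ask for foreground cascading deletion.
type deleteOptions struct {
	Kind              string `json:"kind"`
	APIVersion        string `json:"apiVersion"`
	PropagationPolicy string `json:"propagationPolicy"`
}

// foregroundDeleteBody returns the JSON body to send with a DELETE request
// so that the owner is only removed after its dependents are gone.
func foregroundDeleteBody() ([]byte, error) {
	return json.Marshal(deleteOptions{
		Kind:              "DeleteOptions",
		APIVersion:        "v1",
		PropagationPolicy: "Foreground",
	})
}
```

The returned bytes would be sent as the body of the DELETE call for the Fleet, with a `Content-Type: application/json` header.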

@cyriltovena
Collaborator Author

@markmandel I'll give foregroundDeletion a try; it's for an API.

@cyriltovena
Collaborator Author

cyriltovena commented Jan 31, 2019

Some more hints: it seems that gs/pods are still getting created and then deleted.

@cyriltovena
Collaborator Author

cyriltovena commented Jan 31, 2019

I tried foregroundDeletion and it's worse: it's never-ending. It seems that there is a race between creating and deleting gs/pods.

I think I've pinned down the issue: we should stop enqueuing for the GameServerSet when it's being deleted.
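
A rough Go sketch of that guard, under stated assumptions: the `GameServerSet` struct, `shouldEnqueue`, `gameServersToCreate`, and `markDeleted` are hypothetical names, not Agones code. The only Kubernetes fact relied on is that an object being deleted has a non-nil deletion timestamp in its metadata.

```go
package main

import "time"

// GameServerSet is a hypothetical, trimmed-down model of the resource.
type GameServerSet struct {
	Name              string
	Replicas          int
	DeletionTimestamp *time.Time // non-nil once deletion has been requested
}

// markDeleted simulates the API server setting the deletion timestamp.
func (s *GameServerSet) markDeleted() {
	now := time.Now()
	s.DeletionTimestamp = &now
}

// shouldEnqueue reports whether the set should still be reconciled.
// A set that is being deleted is skipped so that no replacements are made.
func shouldEnqueue(s GameServerSet) bool {
	return s.DeletionTimestamp == nil
}

// gameServersToCreate returns how many GameServers are missing, or 0 when
// the set is being deleted or already has enough.
func gameServersToCreate(s GameServerSet, current int) int {
	if !shouldEnqueue(s) {
		return 0
	}
	if missing := s.Replicas - current; missing > 0 {
		return missing
	}
	return 0
}
```

Without the `shouldEnqueue` check, each GameServer removed during deletion would look like a missing replica and trigger a new creation, which matches the create/delete race described above.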

@pm7h
Contributor

pm7h commented Jan 31, 2019

I saw similar behavior when perf testing fleet scaling up and down. When I tested with GKE Autoscaling, at some point fleet autoscaling got stuck (a different problem), game servers got stuck in the Scheduled state, and later deletion took some time.

I saw errors like this:
[image: error-log]

@Kuqd, did you observe this error as well or is it a different issue?

@ilkercelikyilmaz
Contributor

I believe this issue is also resolved (it was similar to issue #543). I tested by creating 4000 GS and it seems fine. Can someone else confirm so we can close? I believe the recent GSS and Delete improvements resolved the issue.

@cyriltovena
Collaborator Author

I'll investigate.

@jkowalski
Contributor

@Kuqd can we close now?

@markmandel
Collaborator

Gentle bump - @Kuqd can we close this now?

@markmandel
Collaborator

I'm going to close this, please feel free to reopen if it rears its head again.

markmandel added this to the 0.9.0 milestone Mar 25, 2019
@tenevdev

tenevdev commented Jul 6, 2021

Hello, I'd like to reopen this, specifically the part about foreground deletion. I can reproduce it with Agones 1.14.0 and a fleet that has 2 or more replicas: it keeps deleting and starting new game servers in a loop. Maybe there is some work to do in order to stop creating replicas for a fleet when it has a finalizer / deletion timestamp?

EDIT: I've tested the same sort of thing with a Deployment and ReplicaSet, which in this sense are supposed to be equivalent to a Fleet and GameServerSet, and deletion works fine. So whatever is done for Deployments to handle this could apply as a solution for Fleets.

@markmandel
Collaborator

@tenevdev can you create a new issue, with reproducible steps that are run against the latest Agones?

I think this is hard to track exactly what your issue is otherwise, as this ticket wandered a little from its original intent.

6 participants