Fleet RollingUpdate with ImagePullBackoff and CPU requests up to capacity: scaling down the inactive GameServerSet is stuck #1636
Comments
Initially I hit this issue on EKS with a smaller cluster; there were only 100 GameServers, which was enough to trigger it.
That's a gnarly bug! 🐛
We need to verify that this bug still exists on current releases of Agones and, if so, it should be fixed.
I tried to reproduce the bug with the current Agones version but didn't succeed; everything worked fine.

Environment: my default node pool has 10 nodes, so I adjusted the Fleet's replica count accordingly; both the requested CPU and memory exceeded 99% of the node pool's capacity. I followed the same steps to try to reproduce the bug, but as we can see below, the old GameServerSet scaled down as expected and the new GameServerSet scaled up as expected as well. The steps were: create a Fleet with an invalid image, then update the Fleet with a valid image (sketched below).
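A minimal sketch of those verification steps, assuming the simple-udp example Fleet; the manifest file names and the idea of an intentionally bad tag are illustrative:

```
# 1. Create a Fleet whose container image tag does not exist, so every
#    GameServer Pod lands in ImagePullBackOff.
kubectl apply -f fleet-invalid-image.yaml

# 2. Fix the image tag in the manifest and apply it again to trigger
#    the RollingUpdate.
kubectl apply -f fleet-valid-image.yaml

# 3. Watch the old GameServerSet scale down and the new one scale up.
kubectl get gameserversets --watch
```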
Sounds like we can close it then?
Yes, I think we can close it.
Closing!
There is an issue when we use up to 99% of all nodes' capacity. If we use a Fleet with a smaller replica count, for example Replicas = 10, there is no such bug. Use the RollingUpdate scheduling strategy; a sketch of such a Fleet follows below.
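A minimal fleet.yaml sketch of the kind of configuration involved, assuming the simple-udp example; the replica count, rolling-update percentages, container port, and CPU request are illustrative and would be sized so that total requests approach the node pool's capacity:

```yaml
apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: simple-udp
spec:
  replicas: 1000
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
  template:
    spec:
      ports:
      - name: default
        containerPort: 7654
      template:
        spec:
          containers:
          - name: simple-udp
            image: gcr.io/agones-images/udp-server:0.21
            resources:
              requests:
                cpu: 500m  # sized so the Fleet's total requests approach node capacity
```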
If there is a problem in the initial Fleet.yaml configuration and the wrong image is used, then after fixing the image the Fleet never self-heals, in the sense that it never reaches the desired number of Replicas.

What happened:
I used a default node pool consisting of 14 nodes (the bug should be reproducible with a smaller cluster if Replicas is adjusted accordingly). Two GameServerSets were created:
- one with 0 Ready and 1000 Spec.Replicas;
- a second with 250 (25%) Ready and 250 in Spec.Replicas.
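The stuck state can be observed with kubectl; the output below is illustrative (the generated GameServerSet names are placeholders and the exact columns depend on the Agones version):

```
$ kubectl get gameserversets
NAME               SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY
simple-udp-aaaaa   Packed       1000      1000      0           0      <- new set, ImagePullBackOff
simple-udp-bbbbb   Packed       250       250       0           250    <- old set, stuck at 25%
```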
What you expected to happen:

The Fleet would have 1000 Ready replicas after the update, and the old GameServerSet would be scaled down gradually. I expect that the first, invalid (not active) GameServerSet should update its Spec.Replicas parameter even though no GameServers were ever created in it. You can take a look here:

agones/pkg/fleets/controller.go
Line 478 in 6475404
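For intuition only, a deliberately simplified and hypothetical Go sketch of how a rolling-update scale-down step that is derived from the inactive set's Ready count can deadlock. This is not the actual Agones controller code; the function and parameter names are invented:

```go
package main

import "fmt"

// min32 returns the smaller of two int32 values.
func min32(a, b int32) int32 {
	if a < b {
		return a
	}
	return b
}

// rollingUpdateScaleDown is a hypothetical model, NOT the real Agones
// logic: if the scale-down step is bounded by the inactive set's Ready
// count, a GameServerSet stuck in ImagePullBackOff (0 Ready) always gets
// a step of 0, so its Spec.Replicas never shrinks.
func rollingUpdateScaleDown(specReplicas, readyReplicas, maxUnavailable int32) int32 {
	step := min32(readyReplicas, maxUnavailable)
	return specReplicas - step
}

func main() {
	// Healthy old set: 250 Ready of 250 desired -> can scale all the way down.
	fmt.Println(rollingUpdateScaleDown(250, 250, 250)) // prints 0
	// Broken new set: 0 Ready of 1000 desired -> stays at 1000 forever.
	fmt.Println(rollingUpdateScaleDown(1000, 0, 250)) // prints 1000
}
```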
How to reproduce it (as minimally and precisely as possible):

Create the Fleet with an invalid image, wait for the new GameServers to enter ImagePullBackOff, then update the Fleet to the valid image:
image: gcr.io/agones-images/udp-server:0.21
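One way to apply that image fix in place (a sketch; it assumes the container is named simple-udp, as in the Fleet sketch above):

```
# Point the Fleet's container at the valid tag; this triggers a new
# RollingUpdate. A JSON merge patch replaces the whole containers list,
# so this form only suits a single-container sketch like this one.
kubectl patch fleet simple-udp --type merge -p '
{"spec":{"template":{"spec":{"template":{"spec":{"containers":[
  {"name":"simple-udp","image":"gcr.io/agones-images/udp-server:0.21"}
]}}}}}}'
```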
Note that kubectl delete fleet simple-udp would leave 250 dangling GameServers.

Anything else we need to know?:
Deleting and re-creating the Fleet could be used as a workaround (sketched below), but I expect that there should be a way to do this with an update of the Fleet.
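A sketch of that workaround, with the caveat from above that the delete can leave dangling GameServers behind:

```
# Destructive workaround: re-create the Fleet instead of updating it.
kubectl delete fleet simple-udp
kubectl apply -f fleet.yaml

# Any dangling GameServers left behind may need removing by hand.
kubectl get gameservers
```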
Environment:
- Kubernetes version (use kubectl version): 1.15
- fleet.yaml: