
Fleet RollingUpdate with ImagePullBackoff and CPU Requests up to capacity: scaling down not active GameServerSet stuck #1636

Closed
aLekSer opened this issue Jun 22, 2020 · 7 comments
Assignees
Labels
good first issue: These are great first issues. If you are looking for a place to start, start here!
help wanted: We would love help on these issues. Please come help us!
invalid: Sorry. We got this one wrong.
kind/bug: These are bugs.

Comments

@aLekSer
Collaborator

aLekSer commented Jun 22, 2020

There is an issue when we use up to 99% of all nodes' capacity.
If we use a Fleet with a smaller Replicas count, for example Replicas = 10, the bug does not occur.

Use the RollingUpdate scheduling strategy.
If there is a problem in the initial fleet.yaml configuration and the wrong image is used, then even after fixing the image the Fleet never self-heals: it never reaches the desired number of Replicas.

What happened:
Use the default node pool consisting of 14 nodes. The bug is also reproducible on a smaller cluster (adjust Replicas accordingly).

NAME    | STATUS | VERSION       | NODES | MACHINE_TYPE
default | OK     | 1.15.12-gke.2 | 14    | n1-standard-4

Two GameServerSets were created:
One with 0 Ready and 1000 Spec.Replicas,
And the second with 250 (25%) Ready and 250 Spec.Replicas.

What you expected to happen:
The Fleet would have 1000 Ready Replicas after the update.
The old GameServerSet would be scaled down gradually.

I expect that the first, invalid (not active) GameServerSet should update its Spec.Replicas parameter even though no GameServers were successfully created in it. You can take a look here:

if gsSet.Status.Replicas <= 0 {
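To illustrate why that condition matters, here is a minimal, hypothetical Go sketch of the kind of guard the line above suggests. The `GameServerSet` struct and `shouldSkipScaleDown` function are invented for illustration and are not Agones' actual controller code; the point is only that a check on `Status.Replicas <= 0` can cause a GameServerSet that never produced healthy GameServers to be skipped by the scale-down path.

```go
package main

import "fmt"

// GameServerSet is a pared-down, hypothetical stand-in for Agones'
// GameServerSet type, kept only for this illustration.
type GameServerSet struct {
	Name           string
	SpecReplicas   int32 // desired replicas (Spec.Replicas)
	StatusReplicas int32 // replicas the controller counts as created (Status.Replicas)
}

// shouldSkipScaleDown sketches the guard the issue points at: if the
// controller treats a GameServerSet with Status.Replicas <= 0 as
// "nothing to do", its Spec.Replicas may never be updated during a
// rolling update, leaving it stuck at the old value.
func shouldSkipScaleDown(gsSet GameServerSet) bool {
	return gsSet.StatusReplicas <= 0
}

func main() {
	stuck := GameServerSet{Name: "simple-udp-22wmj", SpecReplicas: 1000, StatusReplicas: 0}
	healthy := GameServerSet{Name: "simple-udp-zpnht", SpecReplicas: 250, StatusReplicas: 250}
	fmt.Println(shouldSkipScaleDown(stuck))   // true
	fmt.Println(shouldSkipScaleDown(healthy)) // false
}
```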

How to reproduce it (as minimally and precisely as possible):

  1. Create the Fleet with 1000 GameServers but a wrong image (see fleet.yaml below).
  2. Update the image to a working one: gcr.io/agones-images/udp-server:0.21
kubectl get gameserversets
NAME               SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp-22wmj   Packed       1000      1000      0           0       14m
simple-udp-zpnht   Packed       250       250       0           250     13m
  3. As a side effect, kubectl delete fleet simple-udp would leave 250 dangling GameServers.

Anything else we need to know?:

kubectl delete fleet simple-udp
kubectl apply -f ./fleet.yaml

This can be used as a workaround, but I expect there should be a way to achieve this with an update of the Fleet.

Environment:

  • Agones version: 1.6
  • Kubernetes version (use kubectl version): 1.15
  • Cloud provider or hardware configuration: GKE
  • Install method (yaml/helm): helm
  • Troubleshooting guide log(s):
  • Others:
    fleet.yaml:
apiVersion: "agones.dev/v1"
kind: Fleet
metadata:
  name: simple-udp
spec:
  replicas: 1000
  template:
    spec:
      ports:
      - name: default
        containerPort: 7654
      template:
        spec:
          containers:
          - name: simple-udp
            image: gcr.io/agones-images/udp-server:1
            resources:
              requests:
                memory: "64Mi"
                cpu: "20m"
              limits:
                memory: "64Mi"
                cpu: "20m"
Spec:
  Replicas:    1000
  Scheduling:  Packed
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:               RollingUpdate
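The RollingUpdate percentages above translate into absolute counts. A minimal Go sketch of that arithmetic follows; the rounding directions (surge rounds up, unavailable rounds down) are an assumption borrowed from how Kubernetes Deployments handle these fields, not verified against Agones' source.

```go
package main

import (
	"fmt"
	"math"
)

// surgeAndUnavailable converts percentage-style RollingUpdate
// parameters into absolute replica counts. Rounding follows the
// Kubernetes Deployment convention (surge up, unavailable down);
// this is an assumption for illustration, not Agones' exact code.
func surgeAndUnavailable(replicas int, surgePct, unavailPct float64) (int, int) {
	surge := int(math.Ceil(float64(replicas) * surgePct / 100.0))
	unavail := int(math.Floor(float64(replicas) * unavailPct / 100.0))
	return surge, unavail
}

func main() {
	surge, unavail := surgeAndUnavailable(1000, 25, 25)
	fmt.Println(surge)   // 250: the new GameServerSet may start with up to 250 replicas
	fmt.Println(unavail) // 250: the old GameServerSet may drop by 250 during the update
}
```

This matches the observed state in the reproduction: the new GameServerSet was created with 250 replicas (25% of 1000), while the old one stayed stuck at 1000.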
@aLekSer aLekSer added the kind/bug These are bugs. label Jun 22, 2020
@aLekSer aLekSer changed the title Fleet RollingUpdate with ImagePullBackoff: scaling not active GameServerSet stuck, when initially used wrong image Fleet RollingUpdate with ImagePullBackoff and CPU Requests up to capacity: scaling not active GameServerSet stuck, when initially used wrong image Jun 22, 2020
@aLekSer aLekSer changed the title Fleet RollingUpdate with ImagePullBackoff and CPU Requests up to capacity: scaling not active GameServerSet stuck, when initially used wrong image Fleet RollingUpdate with ImagePullBackoff and CPU Requests up to capacity: scaling down not active GameServerSet stuck Jun 22, 2020
@aLekSer
Collaborator Author

aLekSer commented Jun 22, 2020

I initially hit this issue on EKS with a smaller cluster; only 100 GameServers were enough to trigger it.
https://agones.dev/site/docs/installation/terraform/eks/
https://agones.dev/site/docs/installation/terraform/eks/

@markmandel
Collaborator

That's a gnarly bug! 🐛

@roberthbailey
Member

We need to verify that this bug still exists on current releases of Agones and if so it should be fixed.

@roberthbailey roberthbailey added help wanted We would love help on these issues. Please come help us! good first issue These are great first issues. If you are looking for a place to start, start here! labels Sep 22, 2022
@gongmax
Collaborator

gongmax commented Oct 14, 2022

I tried to reproduce the bug with the current Agones version but didn't succeed. Everything worked fine.

Environment

  • Agones version: 1.26

  • Kubernetes version (use kubectl version): 1.25

  • Cloud provider or hardware configuration: GKE

  • Others:

    • fleet.yaml:
    apiVersion: "agones.dev/v1"
    kind: Fleet
    metadata:
      name: simple-udp
    spec:
      replicas: 720
      scheduling: Packed
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
      template:
        spec:
          ports:
          - name: default
            containerPort: 7654
          template:
            spec:
              containers:
              - name: simple-udp
                image: gcr.io/agones-images/udp-server:1
                resources:
                  requests:
                    memory: "160Mi"
                    cpu: "20m"
                  limits:
                    memory: "160Mi"
                    cpu: "20m"
    
    • node pools info:
    > gcloud container node-pools list --cluster test-cluster --region us-west1-c
    NAME            MACHINE_TYPE   DISK_SIZE_GB  NODE_VERSION
    default         n1-standard-4  100           1.23.12-gke.100
    agones-system   e2-standard-4  100           1.23.12-gke.100
    agones-metrics  e2-standard-4  100           1.23.12-gke.100
    

My default node pool has 10 nodes, so I adjusted the fleet replicas number accordingly. Both the requested CPU and memory exceeded 99% of the node pool capacity. I followed the same steps to try to reproduce the bug, but as shown below, the old GameServerSet scaled down as expected and the new GameServerSet scaled up as expected.

Create a Fleet with an invalid image:

> kubectl get Fleet
NAME         SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp   Packed       720       720       0           0       4m7s

> kubectl get GameServerSet
NAME               SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp-p6xnk   Packed       720       720       0           0       4m17s

Update the Fleet with a valid image

> kubectl edit fleet simple-udp
fleet.agones.dev/simple-udp edited

> kubectl get Fleet
NAME         SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp   Packed       720       604       0           0       5m39s

> kubectl get GameServerSet
NAME               SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp-4bs2d   Packed       360       180       0           0       9s
simple-udp-p6xnk   Packed       540       540       0           0       5m44s

> kubectl get Fleet
NAME         SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp   Packed       720       899       0           337     6m54s

> kubectl get GameServerSet
NAME               SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp-4bs2d   Packed       720       720       0           442     95s
simple-udp-p6xnk   Packed       98        98        0           0       7m10s



> kubectl get Fleet
NAME         SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp   Packed       720       720       0           634     7m48s

> kubectl get GameServerSet
NAME               SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp-4bs2d   Packed       720       720       0           666     2m15s
simple-udp-p6xnk   Packed       0         0         0           0       7m50s

> kubectl get Fleet
NAME         SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp   Packed       720       720       0           720     8m10s

> kubectl get GameServerSet
NAME               SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp-4bs2d   Packed       720       720       0           720     2m36s
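The intermediate snapshots above are consistent with 25% rolling-update arithmetic on 720 replicas: the new GameServerSet first appears with CURRENT 180 (25% of 720), and the old one drops from 720 to 540 in the same step. A small Go check of those numbers follows; the round-up rounding is an assumption borrowed from Kubernetes Deployments, not verified for Agones.

```go
package main

import (
	"fmt"
	"math"
)

// pct computes p percent of replicas, rounding up (the Kubernetes
// Deployment convention for maxSurge; assumed here for illustration).
func pct(replicas int, p float64) int {
	return int(math.Ceil(float64(replicas) * p / 100.0))
}

func main() {
	surge := pct(720, 25)
	fmt.Println(surge)       // 180: matches the new GameServerSet's initial CURRENT count
	fmt.Println(720 - surge) // 540: matches the old GameServerSet's first scale-down step
}
```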

@markmandel
Collaborator

Sounds like we can close it then?

@gongmax
Collaborator

gongmax commented Oct 14, 2022

Yes, I think we can close it.

@markmandel
Collaborator

Closing!

@markmandel markmandel added the invalid Sorry. We got this one wrong. label Oct 14, 2022
@gongmax gongmax self-assigned this Dec 9, 2022