
Fleet RollingUpdate with ImagePullBackoff and CPU Requests up to capacity: scaling down not active GameServerSet stuck #1636

Closed
aLekSer opened this issue Jun 22, 2020 · 7 comments
Assignees
Labels
good first issue: These are great first issues. If you are looking for a place to start, start here!
help wanted: We would love help on these issues. Please come help us!
invalid: Sorry. We got this one wrong.
kind/bug: These are bugs.

Comments

@aLekSer
Collaborator

aLekSer commented Jun 22, 2020

There is an issue when we use up to 99% of all nodes' capacity.
If we use a Fleet with a smaller Replicas count, for example Replicas = 10, the bug does not occur.

Use the RollingUpdate scheduling strategy.
If there is a problem in the initial fleet.yaml configuration and the wrong image is used, then even after fixing the image the Fleet never self-heals: it never reaches the desired number of Replicas.

What happened:
Use the default node pool consisting of 14 nodes. The bug is also reproducible on a smaller cluster (adjust Replicas accordingly).

NAME    | STATUS | VERSION       | NODES | MACHINE_TYPE
default | OK     | 1.15.12-gke.2 | 14    | n1-standard-4

Two GameServerSets were created:
One with 0 Ready and 1000 Spec.Replicas,
And the second with 250 (25%) Ready and 250 Spec.Replicas.

What you expected to happen:
The Fleet would have 1000 Ready Replicas after the update.
The old GameServerSet would be scaled down gradually.

I expect that the first, invalid (not active) GameServerSet should update its Spec.Replicas parameter even though no GameServers were successfully created in it. You can take a look here:

if gsSet.Status.Replicas <= 0 {
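To illustrate why that condition matters, here is a minimal, hypothetical Go sketch of the kind of guard the line above suggests. The `GameServerSet` struct and `shouldSkipScaleDown` function are invented for illustration and are not Agones' actual controller code; the point is only that a check on `Status.Replicas <= 0` can cause a GameServerSet that never produced healthy GameServers to be skipped by the scale-down path.

```go
package main

import "fmt"

// GameServerSet is a pared-down, hypothetical stand-in for Agones'
// GameServerSet type, kept only for this illustration.
type GameServerSet struct {
	Name           string
	SpecReplicas   int32 // desired replicas (Spec.Replicas)
	StatusReplicas int32 // replicas the controller counts as created (Status.Replicas)
}

// shouldSkipScaleDown sketches the guard the issue points at: if the
// controller treats a GameServerSet with Status.Replicas <= 0 as
// "nothing to do", its Spec.Replicas may never be updated during a
// rolling update, leaving it stuck at the old value.
func shouldSkipScaleDown(gsSet GameServerSet) bool {
	return gsSet.StatusReplicas <= 0
}

func main() {
	stuck := GameServerSet{Name: "simple-udp-22wmj", SpecReplicas: 1000, StatusReplicas: 0}
	healthy := GameServerSet{Name: "simple-udp-zpnht", SpecReplicas: 250, StatusReplicas: 250}
	fmt.Println(shouldSkipScaleDown(stuck))   // true
	fmt.Println(shouldSkipScaleDown(healthy)) // false
}
```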

How to reproduce it (as minimally and precisely as possible):

  1. Create the Fleet with 1000 GameServers but a wrong image (see fleet.yaml below).
  2. Update the image to a working one: gcr.io/agones-images/udp-server:0.21
kubectl get gameserversets
NAME               SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp-22wmj   Packed       1000      1000      0           0       14m
simple-udp-zpnht   Packed       250       250       0           250     13m
  3. As a side effect, kubectl delete fleet simple-udp would leave 250 dangling GameServers.

Anything else we need to know?:

kubectl delete fleet simple-udp
kubectl apply -f ./fleet.yaml

This can be used as a workaround, but I expect there should be a way to achieve this with an update of the Fleet.

Environment:

  • Agones version: 1.6
  • Kubernetes version (use kubectl version): 1.15
  • Cloud provider or hardware configuration: GKE
  • Install method (yaml/helm): helm
  • Troubleshooting guide log(s):
  • Others:
    fleet.yaml:
apiVersion: "agones.dev/v1"
kind: Fleet
metadata:
  name: simple-udp
spec:
  replicas: 1000
  template:
    spec:
      ports:
      - name: default
        containerPort: 7654
      template:
        spec:
          containers:
          - name: simple-udp
            image: gcr.io/agones-images/udp-server:1
            resources:
              requests:
                memory: "64Mi"
                cpu: "20m"
              limits:
                memory: "64Mi"
                cpu: "20m"
Spec:
  Replicas:    1000
  Scheduling:  Packed
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:               RollingUpdate
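The RollingUpdate percentages above translate into absolute counts. A minimal Go sketch of that arithmetic follows; the rounding directions (surge rounds up, unavailable rounds down) are an assumption borrowed from how Kubernetes Deployments handle these fields, not verified against Agones' source.

```go
package main

import (
	"fmt"
	"math"
)

// surgeAndUnavailable converts percentage-style RollingUpdate
// parameters into absolute replica counts. Rounding follows the
// Kubernetes Deployment convention (surge up, unavailable down);
// this is an assumption for illustration, not Agones' exact code.
func surgeAndUnavailable(replicas int, surgePct, unavailPct float64) (int, int) {
	surge := int(math.Ceil(float64(replicas) * surgePct / 100.0))
	unavail := int(math.Floor(float64(replicas) * unavailPct / 100.0))
	return surge, unavail
}

func main() {
	surge, unavail := surgeAndUnavailable(1000, 25, 25)
	fmt.Println(surge)   // 250: the new GameServerSet may start with up to 250 replicas
	fmt.Println(unavail) // 250: the old GameServerSet may drop by 250 during the update
}
```

This matches the observed state in the reproduction: the new GameServerSet was created with 250 replicas (25% of 1000), while the old one stayed stuck at 1000.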
@aLekSer aLekSer added the kind/bug These are bugs. label Jun 22, 2020
@aLekSer aLekSer changed the title Fleet RollingUpdate with ImagePullBackoff: scaling not active GameServerSet stuck, when initially used wrong image Fleet RollingUpdate with ImagePullBackoff and CPU Requests up to capacity: scaling not active GameServerSet stuck, when initially used wrong image Jun 22, 2020
@aLekSer aLekSer changed the title Fleet RollingUpdate with ImagePullBackoff and CPU Requests up to capacity: scaling not active GameServerSet stuck, when initially used wrong image Fleet RollingUpdate with ImagePullBackoff and CPU Requests up to capacity: scaling down not active GameServerSet stuck Jun 22, 2020
@aLekSer
Collaborator Author

aLekSer commented Jun 22, 2020

I initially hit this issue on EKS with a smaller cluster; only 100 GameServers were enough to trigger it.
https://agones.dev/site/docs/installation/terraform/eks/
https://agones.dev/site/docs/installation/terraform/eks/

@markmandel
Collaborator

That's a gnarly bug! 🐛

@roberthbailey
Member

We need to verify that this bug still exists on current releases of Agones and if so it should be fixed.

@roberthbailey roberthbailey added help wanted We would love help on these issues. Please come help us! good first issue These are great first issues. If you are looking for a place to start, start here! labels Sep 22, 2022
@gongmax
Collaborator

gongmax commented Oct 14, 2022

I tried to reproduce the bug with the current Agones version but didn't succeed. Everything worked fine.

Environment

  • Agones version: 1.26

  • Kubernetes version (use kubectl version): 1.25

  • Cloud provider or hardware configuration: GKE

  • Others:

    • fleet.yaml:
    apiVersion: "agones.dev/v1"
    kind: Fleet
    metadata:
      name: simple-udp
    spec:
      replicas: 720
      scheduling: Packed
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
      template:
        spec:
          ports:
          - name: default
            containerPort: 7654
          template:
            spec:
              containers:
              - name: simple-udp
                image: gcr.io/agones-images/udp-server:1
                resources:
                  requests:
                    memory: "160Mi"
                    cpu: "20m"
                  limits:
                    memory: "160Mi"
                    cpu: "20m"
    
    • node pools info:
    > gcloud container node-pools list --cluster test-cluster --region us-west1-c
    NAME            MACHINE_TYPE   DISK_SIZE_GB  NODE_VERSION
    default         n1-standard-4  100           1.23.12-gke.100
    agones-system   e2-standard-4  100           1.23.12-gke.100
    agones-metrics  e2-standard-4  100           1.23.12-gke.100
    

My default node pool has 10 nodes, so I adjusted the fleet replicas number accordingly. Both the requested CPU and memory exceeded 99% of the node pool capacity. I followed the same steps to try to reproduce the bug, but as shown below, the old GameServerSet scaled down as expected and the new GameServerSet scaled up as expected.

Create a Fleet with an invalid image:

> kubectl get Fleet
NAME         SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp   Packed       720       720       0           0       4m7s

> kubectl get GameServerSet
NAME               SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp-p6xnk   Packed       720       720       0           0       4m17s

Update the Fleet with a valid image

> kubectl edit fleet simple-udp
fleet.agones.dev/simple-udp edited

> kubectl get Fleet
NAME         SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp   Packed       720       604       0           0       5m39s

> kubectl get GameServerSet
NAME               SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp-4bs2d   Packed       360       180       0           0       9s
simple-udp-p6xnk   Packed       540       540       0           0       5m44s

> kubectl get Fleet
NAME         SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp   Packed       720       899       0           337     6m54s

> kubectl get GameServerSet
NAME               SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp-4bs2d   Packed       720       720       0           442     95s
simple-udp-p6xnk   Packed       98        98        0           0       7m10s



> kubectl get Fleet
NAME         SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp   Packed       720       720       0           634     7m48s

> kubectl get GameServerSet
NAME               SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp-4bs2d   Packed       720       720       0           666     2m15s
simple-udp-p6xnk   Packed       0         0         0           0       7m50s

> kubectl get Fleet
NAME         SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp   Packed       720       720       0           720     8m10s

> kubectl get GameServerSet
NAME               SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp-4bs2d   Packed       720       720       0           720     2m36s
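The intermediate snapshots above are consistent with 25% rolling-update arithmetic on 720 replicas: the new GameServerSet first appears with CURRENT 180 (25% of 720), and the old one drops from 720 to 540 in the same step. A small Go check of those numbers follows; the round-up rounding is an assumption borrowed from Kubernetes Deployments, not verified for Agones.

```go
package main

import (
	"fmt"
	"math"
)

// pct computes p percent of replicas, rounding up (the Kubernetes
// Deployment convention for maxSurge; assumed here for illustration).
func pct(replicas int, p float64) int {
	return int(math.Ceil(float64(replicas) * p / 100.0))
}

func main() {
	surge := pct(720, 25)
	fmt.Println(surge)       // 180: matches the new GameServerSet's initial CURRENT count
	fmt.Println(720 - surge) // 540: matches the old GameServerSet's first scale-down step
}
```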

@markmandel
Collaborator

Sounds like we can close it then?

@gongmax
Collaborator

gongmax commented Oct 14, 2022

Yes, I think we can close it.

@markmandel
Collaborator

Closing!

@markmandel markmandel added the invalid Sorry. We got this one wrong. label Oct 14, 2022
@gongmax gongmax self-assigned this Dec 9, 2022