scaling issue when metric and threshold result in floating point, no recovery from that point onwards #3291

Closed
bjethwan opened this issue Jun 27, 2022 · 11 comments
Assignees
Labels
bug Something isn't working

bjethwan commented Jun 27, 2022

Report

Consider the ScaledObject below and a dummy Kubernetes deployment called "demo":

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: demo-scaledobject
  namespace: default
spec:
  scaleTargetRef:
    name: demo
  minReplicaCount: 0
  cooldownPeriod:  30
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://192.168.1.165:9090
      metricName: visit_counter_total
      threshold: '2'
      query: visit_counter_total
      disableScaleToZero: 'true'

Issue:
Prometheus reported visit_counter_total=38: all good, 19 pods were created.
Prometheus reported visit_counter_total=39: the pod count stayed at 19 (which is fine).
Prometheus reported visit_counter_total=40: ISSUE, the number of pods is stuck at 19.


% k get pods --no-headers | wc -l
      19

LOGS: WHEN I INCREASED visit_counter_total FROM 38 TO 39 (WAITED FOR 3 MINS) TO 40 (WAITED FOR 10 MINS)

1.6563435558217254e+09	INFO	controller.scaledobject	Reconciling ScaledObject	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "demo-scaledobject", "namespace": "default"}
1.6563435558218324e+09	DEBUG	controller.scaledobject	Parsed Group, Version, Kind, Resource	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "demo-scaledobject", "namespace": "default", "GVK": "apps/v1.Deployment", "Resource": "deployments"}
1.656343555832207e+09	DEBUG	controller.scaledobject	ScaledObject is defined correctly and is ready for scaling	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "demo-scaledobject", "namespace": "default"}
1.6563435708244784e+09	INFO	controller.scaledobject	Reconciling ScaledObject	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "demo-scaledobject", "namespace": "default"}
1.6563435708245707e+09	DEBUG	controller.scaledobject	Parsed Group, Version, Kind, Resource	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "demo-scaledobject", "namespace": "default", "GVK": "apps/v1.Deployment", "Resource": "deployments"}
1.6563435708350515e+09	DEBUG	controller.scaledobject	ScaledObject is defined correctly and is ready for scaling	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "demo-scaledobject", "namespace": "default"}
1.6563435791322286e+09	DEBUG	scalehandler	Scaler for scaledObject is active	{"scaledobject.Name": "demo-scaledobject", "scaledObject.Namespace": "default", "scaleTarget.Name": "demo", "Metrics Name": "s0-prometheus-visit_counter_total"}
1.6563436090996041e+09	DEBUG	scalehandler	Scaler for scaledObject is active	{"scaledobject.Name": "demo-scaledobject", "scaledObject.Namespace": "default", "scaleTarget.Name": "demo", "Metrics Name": "s0-prometheus-visit_counter_total"}
1.6563436390634325e+09	DEBUG	scalehandler	Scaler for scaledObject is active	{"scaledobject.Name": "demo-scaledobject", "scaledObject.Namespace": "default", "scaleTarget.Name": "demo", "Metrics Name": "s0-prometheus-visit_counter_total"}
1.65634366903128e+09	DEBUG	scalehandler	Scaler for scaledObject is active	{"scaledobject.Name": "demo-scaledobject", "scaledObject.Namespace": "default", "scaleTarget.Name": "demo", "Metrics Name": "s0-prometheus-visit_counter_total"}
1.6563436989966836e+09	DEBUG	scalehandler	Scaler for scaledObject is active	{"scaledobject.Name": "demo-scaledobject", "scaledObject.Namespace": "default", "scaleTarget.Name": "demo", "Metrics Name": "s0-prometheus-visit_counter_total"}
1.6563437289631693e+09	DEBUG	scalehandler	Scaler for scaledObject is active	{"scaledobject.Name": "demo-scaledobject", "scaledObject.Namespace": "default", "scaleTarget.Name": "demo", "Metrics Name": "s0-prometheus-visit_counter_total"}
1.6563437589283757e+09	DEBUG	scalehandler	Scaler for scaledObject is active	{"scaledobject.Name": "demo-scaledobject", "scaledObject.Namespace": "default", "scaleTarget.Name": "demo", "Metrics Name": "s0-prometheus-visit_counter_total"}
1.6563

Expected Behavior

POD COUNT SHOULD HAVE GONE UP TO 20.

Actual Behavior

POD COUNT STUCK AT 19

Steps to Reproduce the Problem

  1. INSTALL PROMETHEUS LOCALLY
  2. RUN THIS SAMPLE MICROSERVICE https://tanzu.vmware.com/developer/guides/spring-prometheus/
  3. INCREASE visit_counter_total FROM 38 TO 39 (WAITED FOR 3 MINS) TO 40 (WAITED FOR 10 MINS)

KEDA Version

2.7.1

Kubernetes Version

1.23

Platform

Other

Scaler Details

Prometheus

Anything else?

ALL THIS IN THE LOCAL ENVIRONMENT USING KIND

@bjethwan
Author

bjethwan commented Jun 28, 2022

I tried it again today and found that it may take a while to reproduce, but it does happen: after a certain point, scaling pauses.
In today's case, each time I increased visit_counter_total a new pod came up fine. Then I changed the timings for visit_counter_total, and after that the pod count didn't go up to 14.

% k get hpa 
NAME                         REFERENCE         TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
keda-hpa-demo-scaledobject   Deployment/demo   2154m/2 (avg)   1         100       13         33m

It was working perfectly fine until I hit visit_counter_total=28.

@bjethwan
Author

Can you give me a debug binary to get more informative logs?
--zap-log-level=debug is really not enough to nail this one down.

@JorTurFer
Member

Hey,
I'd say that with such a small difference between the target value and the current value, scaling could take much longer. I suggest taking a look at the HPA documentation and how to configure these advanced behaviours.
The docs also say: "The control plane skips any scaling action if the ratio is sufficiently close to 1.0 (within a globally-configurable tolerance, 0.1 by default)." Maybe this is your case.
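
To make the tolerance rule concrete, here is a minimal Python sketch of the ratio check quoted above, applied to the numbers from this issue. It is a simplification and not the actual controller code: it ignores pod readiness, stabilization windows, and min/max replica limits, and the function name `hpa_desired_replicas` is made up for illustration.

```python
import math


def hpa_desired_replicas(total_metric, target_per_pod, current_replicas, tolerance=0.1):
    """Approximate the HPA decision for an average-value target.

    Hypothetical helper: skips scaling while the usage ratio stays
    within `tolerance` of 1.0, otherwise returns ceil(total / target).
    """
    usage_ratio = total_metric / (target_per_pod * current_replicas)
    if abs(usage_ratio - 1.0) <= tolerance:
        # Ratio is "sufficiently close to 1.0": no scaling action.
        return current_replicas
    return math.ceil(total_metric / target_per_pod)


# Values from the report: threshold 2, 19 pods running.
for total in (38, 39, 40, 42):
    print(total, hpa_desired_replicas(total, 2, 19))

# Second report: 13 pods, metric at 28.
print(28, hpa_desired_replicas(28, 2, 13))
```

Under these assumptions, with 19 pods and a target of 2 the ratio only leaves the 0.1 band once the metric reaches 42. Likewise, 28 spread over 13 pods averages about 2.154, which matches the 2154m/2 shown by `k get hpa` earlier in the thread and is still inside the band.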

@JorTurFer
Member

JorTurFer commented Jun 28, 2022

BTW, where did you find the parameter disableScaleToZero? I think it's not used anywhere (at least, not in the code).

@JorTurFer
Member

> Can you give me a debug binary to get more informative logs?
> --zap-log-level=debug is really not enough to nail this one down.

What do you mean? You can use the source code and troubleshoot your case with it.

@bjethwan
Author

bjethwan commented Jul 2, 2022

> Hey, I'd say that with such a small difference between the target value and the current value, scaling could take much longer. I suggest taking a look at the HPA documentation and how to configure these advanced behaviours. The docs also say: "The control plane skips any scaling action if the ratio is sufficiently close to 1.0 (within a globally-configurable tolerance, 0.1 by default)." Maybe this is your case.

This was the issue. But now I am completely stuck, because I have no way to configure this tolerance in my AKS clusters. I checked with the Azure AKS team, and they are unable to help.

@JorTurFer
Do you know a workaround?

@JorTurFer
Member

The only workaround is setting the target value lower. The problem is that in v2.7.1 the target value and the metrics have to be integers, so you would need 1 instead of 2, and I guess that is too much overscaling.
In main there is a merged commit that supports float values for the target and the metrics; with it you could set the threshold to 1.8 or 1.9, but the next release will be in a month. You could use the main tag, but we discourage it because it's not a stable version.
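
As a rough illustration of why a fractional threshold helps, the Python sketch below finds the smallest whole-number metric value that pushes the usage ratio past the default 0.1 tolerance when 19 pods are running. It assumes the simplified ratio rule from the HPA docs and nothing else; the helper name `first_metric_that_scales_up` is hypothetical.

```python
def first_metric_that_scales_up(replicas, threshold, tolerance=0.1):
    """Smallest whole-number metric total whose usage ratio exceeds 1 + tolerance.

    Hypothetical helper based on the simplified ratio rule:
    ratio = total / (threshold * replicas).
    """
    total = 0
    while total / (threshold * replicas) <= 1.0 + tolerance:
        total += 1
    return total


# Compare the original integer threshold with the suggested float values.
for threshold in (2, 1.9, 1.8):
    print(threshold, first_metric_that_scales_up(19, threshold))
```

Under these assumptions, a threshold of 2 needs the metric to reach 42 before 19 pods scale up, while 1.9 already reacts at 40. The trade-off is that a lower threshold also asks for more pods overall.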

@JorTurFer JorTurFer self-assigned this Jul 5, 2022
@JorTurFer JorTurFer moved this from Proposed to Pending End-User Feedback in Roadmap - KEDA Core Jul 5, 2022
@JorTurFer JorTurFer moved this from Pending End-User Feedback to Ready To Ship in Roadmap - KEDA Core Jul 5, 2022
@JorTurFer JorTurFer moved this from Ready To Ship to Pending End-User Feedback in Roadmap - KEDA Core Jul 5, 2022
@bjethwan
Author

bjethwan commented Jul 7, 2022

FYI: Azure/AKS#3068

@bjethwan
Author

bjethwan commented Jul 7, 2022

@JorTurFer
For now I have tweaked the query
From:
query: ceil(no_of_agents/2)

To:
query: ceil(no_of_agents/1.8)

The downside is that some extra pods are left running, and since this is on AKS, it increases the cost.
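
To get a feel for the size of that overhead, here is a small Python sketch comparing the two query expressions for a few agent counts. It assumes the pod count ends up tracking the query result one-to-one, which the thread does not confirm, and `pods_for` is a made-up helper name.

```python
import math
from fractions import Fraction


def pods_for(agents, divisor):
    """Evaluate ceil(agents / divisor) exactly, avoiding float rounding."""
    return math.ceil(Fraction(agents) / Fraction(str(divisor)))


# Columns: agents, pods with /2, pods with /1.8, extra pods.
for agents in (28, 38, 40):
    before = pods_for(agents, 2)
    after = pods_for(agents, 1.8)
    print(agents, before, after, after - before)
```

Under that assumption, dividing by 1.8 instead of 2 costs two or three additional pods in this range.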

@bjethwan bjethwan closed this as completed Jul 7, 2022
Repository owner moved this from Pending End-User Feedback to Ready To Ship in Roadmap - KEDA Core Jul 7, 2022
@bjethwan
Author

bjethwan commented Jul 7, 2022

This is not a KEDA issue but an AKS one.

@tomkerkhove tomkerkhove moved this from Ready To Ship to Done in Roadmap - KEDA Core Aug 10, 2022
@jcQuartic

Hi @tomkerkhove, instead of the customized fix, do we have a solution for this issue now?
