scaling issue when metric and threshold result in floating point, no recovery from that point onwards #3291

bjethwan · 2022-06-27T15:38:35Z

Report

Consider the ScaledObject below and a dummy k8s Deployment called "demo":

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: demo-scaledobject
  namespace: default
spec:
  scaleTargetRef:
    name: demo
  minReplicaCount: 0
  cooldownPeriod:  30
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://192.168.1.165:9090
      metricName: visit_counter_total
      threshold: '2'
      query: visit_counter_total
      disableScaleToZero: 'true'

Issue:
Prometheus reported visit_counter_total=38: everything good, 19 pods were created.
Prometheus reported visit_counter_total=39: pods stayed at 19 (which is fine).
Prometheus reported visit_counter_total=40: ISSUE: NUMBER OF PODS STUCK AT 19


% k get pods --no-headers | wc -l
      19

LOGS: WHEN I INCREASED visit_counter_total FROM 38 TO 39 (WAITED FOR 3 MINS) TO 40 (WAITED FOR 10 MINS)

1.6563435558217254e+09	INFO	controller.scaledobject	Reconciling ScaledObject	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "demo-scaledobject", "namespace": "default"}
1.6563435558218324e+09	DEBUG	controller.scaledobject	Parsed Group, Version, Kind, Resource	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "demo-scaledobject", "namespace": "default", "GVK": "apps/v1.Deployment", "Resource": "deployments"}
1.656343555832207e+09	DEBUG	controller.scaledobject	ScaledObject is defined correctly and is ready for scaling	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "demo-scaledobject", "namespace": "default"}
1.6563435708244784e+09	INFO	controller.scaledobject	Reconciling ScaledObject	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "demo-scaledobject", "namespace": "default"}
1.6563435708245707e+09	DEBUG	controller.scaledobject	Parsed Group, Version, Kind, Resource	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "demo-scaledobject", "namespace": "default", "GVK": "apps/v1.Deployment", "Resource": "deployments"}
1.6563435708350515e+09	DEBUG	controller.scaledobject	ScaledObject is defined correctly and is ready for scaling	{"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "name": "demo-scaledobject", "namespace": "default"}
1.6563435791322286e+09	DEBUG	scalehandler	Scaler for scaledObject is active	{"scaledobject.Name": "demo-scaledobject", "scaledObject.Namespace": "default", "scaleTarget.Name": "demo", "Metrics Name": "s0-prometheus-visit_counter_total"}
1.6563436090996041e+09	DEBUG	scalehandler	Scaler for scaledObject is active	{"scaledobject.Name": "demo-scaledobject", "scaledObject.Namespace": "default", "scaleTarget.Name": "demo", "Metrics Name": "s0-prometheus-visit_counter_total"}
1.6563436390634325e+09	DEBUG	scalehandler	Scaler for scaledObject is active	{"scaledobject.Name": "demo-scaledobject", "scaledObject.Namespace": "default", "scaleTarget.Name": "demo", "Metrics Name": "s0-prometheus-visit_counter_total"}
1.65634366903128e+09	DEBUG	scalehandler	Scaler for scaledObject is active	{"scaledobject.Name": "demo-scaledobject", "scaledObject.Namespace": "default", "scaleTarget.Name": "demo", "Metrics Name": "s0-prometheus-visit_counter_total"}
1.6563436989966836e+09	DEBUG	scalehandler	Scaler for scaledObject is active	{"scaledobject.Name": "demo-scaledobject", "scaledObject.Namespace": "default", "scaleTarget.Name": "demo", "Metrics Name": "s0-prometheus-visit_counter_total"}
1.6563437289631693e+09	DEBUG	scalehandler	Scaler for scaledObject is active	{"scaledobject.Name": "demo-scaledobject", "scaledObject.Namespace": "default", "scaleTarget.Name": "demo", "Metrics Name": "s0-prometheus-visit_counter_total"}
1.6563437589283757e+09	DEBUG	scalehandler	Scaler for scaledObject is active	{"scaledobject.Name": "demo-scaledobject", "scaledObject.Namespace": "default", "scaleTarget.Name": "demo", "Metrics Name": "s0-prometheus-visit_counter_total"}
1.6563

Expected Behavior

POD COUNT SHOULD HAVE GONE UP TO 20.

Actual Behavior

POD COUNT STUCK AT 19

Steps to Reproduce the Problem

INSTALL PROMETHEUS LOCALLY
RUN THIS SAMPLE MICROSERVICE https://tanzu.vmware.com/developer/guides/spring-prometheus/
INCREASE visit_counter_total FROM 38 TO 39 (WAITED FOR 3 MINS) TO 40 (WAITED FOR 10 MINS)

Logs from KEDA operator

(Same log excerpt as in the Report section above.)

KEDA Version

2.7.1

Kubernetes Version

1.23

Platform

Other

Scaler Details

Prometheus

Anything else?

ALL THIS IN THE LOCAL ENVIRONMENT USING KIND


bjethwan · 2022-06-28T15:38:53Z

I tried it again today and found that it may take a while to reproduce, but it does happen: after a certain point, scaling pauses.
In today's case, each time I increased visit_counter_total a new pod came up fine. Then I changed the timings for visit_counter_total, and the pod count didn't go to 14.

% k get hpa 
NAME                         REFERENCE         TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
keda-hpa-demo-scaledobject   Deployment/demo   2154m/2 (avg)   1         100       13         33m

It was working perfectly fine until I hit visit_counter_total=28.

bjethwan · 2022-06-28T15:42:40Z

Can you give me a debug binary to get more informative logs?
--zap-log-level=debug is really not enough to nail this one down.

JorTurFer · 2022-06-28T18:40:48Z

Hey,
I'd say that with such a small difference between the target value and the current value, you could need a lot more time. I suggest taking a look at the HPA documentation and how to configure these advanced behaviours.
The docs also say:
"The control plane skips any scaling action if the ratio is sufficiently close to 1.0 (within a globally-configurable tolerance, 0.1 by default)." Maybe this is your case.
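
To make the tolerance check concrete, here is a minimal Python sketch of the replica calculation for an external metric with an average-value target, following the formula described in the HPA documentation. The function name `desired_replicas` is made up for illustration, and the sketch ignores pod readiness, missing metrics, min/max bounds and stabilization windows, so treat it as a rough model rather than the controller's actual code. Note that the `2154m/2 (avg)` target in the `k get hpa` output above is 28/13, which this model also keeps at 13 replicas.

```python
import math


def desired_replicas(current_replicas, metric_total, target_average, tolerance=0.1):
    """Simplified HPA decision for an external metric with an average-value target.

    Hypothetical helper for illustration; not the actual controller code.
    """
    # Ratio of the current per-pod average to the target per-pod average.
    ratio = metric_total / (target_average * current_replicas)
    # Inside the tolerance band the HPA skips scaling entirely.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Outside the band, size the workload so the per-pod average meets the target.
    return math.ceil(metric_total / target_average)


if __name__ == "__main__":
    print(desired_replicas(19, 40, 2))  # ratio ~1.05 -> stays at 19
    print(desired_replicas(13, 28, 2))  # ratio ~1.08 -> stays at 13
    print(desired_replicas(19, 42, 2))  # ratio ~1.11 -> scales to 21
```

Under these assumptions, 19 replicas with a threshold of 2 would not move until the metric reaches 42.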

JorTurFer · 2022-06-28T18:42:13Z

BTW, where did you find the disableScaleToZero parameter? I don't think it's used anywhere (at least, not in the code).

JorTurFer · 2022-06-28T18:43:13Z

> Can you give me a debug binary to get more informative logs?
> --zap-log-level=debug is really not enough to nail this one down.

What do you mean? You can use the source code and troubleshoot your case with it.

bjethwan · 2022-07-02T15:31:24Z

> Hey, I'd say that with such a small difference between the target value and the current value, you could need a lot more time. I suggest taking a look at the HPA documentation and how to configure these advanced behaviours. The docs also say: "The control plane skips any scaling action if the ratio is sufficiently close to 1.0 (within a globally-configurable tolerance, 0.1 by default)." Maybe this is your case.

This was the issue. But now I am completely stuck, because I have no way to configure this in my AKS clusters. I checked with the Azure AKS teams, and they are unable to help.

@JorTurFer
Do you know a workaround?

JorTurFer · 2022-07-02T16:08:29Z

The only workaround is setting a smaller targetValue. The problem is that in v2.7.1 the target value and the metrics have to be integers, so you would need 1 instead of 2, and I guess that's too much overscaling.
In main there is a merged commit that supports float values for targets and metrics. With it you could set the threshold to 1.8 or 1.9, but the next release is about a month away. You could use the main tag, but we discourage it because it's not a stable version.
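
To get a feel for this trade-off, the rough Python sketch below searches for the smallest integer metric total that would push the HPA past the default 0.1 tolerance at 19 replicas, for thresholds of 2 and 1.8. It uses the same simplified ratio check described in the HPA docs; `first_scale_up` is a hypothetical helper, and real behaviour also depends on pod readiness and stabilization settings.

```python
import math


def first_scale_up(current_replicas, target_average, tolerance=0.1):
    """Return (metric_total, new_replicas) for the first integer total that triggers a scale-up.

    Hypothetical helper for illustration; a simplified model of the HPA check.
    """
    # Start at the total where the per-pod average roughly equals the target.
    total = math.ceil(current_replicas * target_average)
    while True:
        ratio = total / (target_average * current_replicas)
        # Scale-up only happens once the ratio exceeds 1 + tolerance.
        if ratio - 1.0 > tolerance:
            return total, math.ceil(total / target_average)
        total += 1


if __name__ == "__main__":
    print(first_scale_up(19, 2))    # (42, 21)
    print(first_scale_up(19, 1.8))  # (38, 22)
```

In this model, a threshold of 2 leaves 19 pods untouched until the metric hits 42, while a threshold of 1.8 already jumps to 22 pods at 38, which is the overscaling cost of the workaround.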

bjethwan · 2022-07-07T06:16:04Z

FYI: Azure/AKS#3068

bjethwan · 2022-07-07T06:17:44Z

@JorTurFer
For now I have tweaked the query
From:
query: ceil(no_of_agents/2)

To:
query: ceil(no_of_agents/1.8)

The downside is that some extra pods are left hanging around, and since this runs on AKS, it's impacting the $.

bjethwan · 2022-07-07T06:18:38Z

This is not a KEDA issue but an AKS one.

jcQuartic · 2023-11-20T04:41:06Z

Hi @tomkerkhove, instead of the customized fix, do we have a solution for this issue now?

bjethwan added the bug Something isn't working label Jun 27, 2022

tomkerkhove added this to Roadmap - KEDA Core Jun 27, 2022

tomkerkhove moved this to Proposed in Roadmap - KEDA Core Jun 27, 2022

bjethwan mentioned this issue Jun 28, 2022

Cache metrics (values) in Metric Server and honor pollingInterval #2282

Closed

JorTurFer self-assigned this Jul 5, 2022

JorTurFer moved this from Proposed to Pending End-User Feedback in Roadmap - KEDA Core Jul 5, 2022

JorTurFer moved this from Pending End-User Feedback to Ready To Ship in Roadmap - KEDA Core Jul 5, 2022

JorTurFer moved this from Ready To Ship to Pending End-User Feedback in Roadmap - KEDA Core Jul 5, 2022

bjethwan mentioned this issue Jul 7, 2022

Allow configuring horizontal-pod-autoscaler-tolerance AKS control-plane Azure/AKS#3068

Open

bjethwan closed this as completed Jul 7, 2022

Repository owner moved this from Pending End-User Feedback to Ready To Ship in Roadmap - KEDA Core Jul 7, 2022

tomkerkhove moved this from Ready To Ship to Done in Roadmap - KEDA Core Aug 10, 2022
