KEDA HPA not scaling pods based on GPU metrics from prometheus #6141

Closed
Vijaygawate opened this issue Sep 6, 2024 · 2 comments
Labels
bug (Something isn't working), stale (All issues that are marked as stale due to inactivity)

Comments


Vijaygawate commented Sep 6, 2024

Report

I am trying to scale pods through the KEDA-managed HPA based on a GPU metric. Everything seems to be working, but when I query the metric with the command below, I get the output "Error from server (NotFound): the server could not find the requested resource."
image
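For reference, a minimal sketch of how the raw path for such a metric query can be assembled. The metric name `s0-prometheus`, the `scaledobject.keda.sh/name` label, and the `default` namespace are taken from the HPA condition and operator logs in this report; the helper name `external_metric_path` is hypothetical. A path that omits the namespace, the metric name, or the label selector is one possible cause of a NotFound response, though that is an assumption and not confirmed here.

```python
from urllib.parse import quote


def external_metric_path(namespace: str, metric_name: str, scaled_object: str) -> str:
    """Build the external metrics API path for one KEDA-exposed metric.

    Hypothetical helper: it only formats the URL path, it does not call the cluster.
    """
    # KEDA labels each external metric with the owning ScaledObject's name.
    selector = f"scaledobject.keda.sh/name={scaled_object}"
    return (
        "/apis/external.metrics.k8s.io/v1beta1"
        f"/namespaces/{namespace}/{metric_name}"
        # The selector must be URL-encoded ("/" -> %2F, "=" -> %3D).
        f"?labelSelector={quote(selector, safe='')}"
    )


path = external_metric_path(
    "default",
    "s0-prometheus",
    "gpu-dcgmproftester-deployment-scaledobject",
)
# Print the command to run against the cluster.
print(f'kubectl get --raw "{path}"')
```

If the printed command returns a metric value, the metrics API server is serving the metric and the problem is more likely in the query result or the threshold than in the API registration.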

Expected Behavior

The HPA should scale the pods when the condition is met.
I have checked the HPA conditions and they say a valid metric was found, but the pods are still not scaled up:
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from external metric s0-prometheus(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: gpu-dcgmproftester-deployment-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},})

Actual Behavior

No scaling events occur on the KEDA HPA, even when my GPU utilization goes above 10%. The GPU metric is available in Prometheus.
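As a rough guide to when a scale-up would be expected, the sketch below approximates the HPA replica calculation for an external metric with an `AverageValue` target, which is the default metric type for KEDA triggers. It is a simplification under stated assumptions: the threshold of 10 is inferred from the "above 10%" remark, the replica limits are hypothetical, and readiness handling, stabilization windows, and scaling behaviors are ignored. The 0.1 tolerance is the Kubernetes default.

```python
import math


def desired_replicas(
    metric_value: float,
    target_average_value: float,
    current_replicas: int,
    min_replicas: int = 1,   # hypothetical limit
    max_replicas: int = 10,  # hypothetical limit
    tolerance: float = 0.1,  # Kubernetes default HPA tolerance
) -> int:
    """Approximate the HPA decision for an external metric with an AverageValue target."""
    # Ratio of the per-replica metric value to the per-replica target.
    usage_ratio = metric_value / (target_average_value * current_replicas)
    # Inside the tolerance band the HPA leaves the replica count unchanged.
    if abs(usage_ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(metric_value / target_average_value)
    # Clamp to the configured replica limits.
    return max(min_replicas, min(max_replicas, desired))


# The query returns 25 against a threshold of 10 with 1 replica running.
print(desired_replicas(25, 10, 1))
# The query returns 10.5: within tolerance, so no scaling happens.
print(desired_replicas(10.5, 10, 1))
```

Under these assumptions, a value only slightly above the threshold does not trigger scaling, and a query that aggregates across GPUs (for example an average) can stay near the threshold even while a single GPU is busy.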

KEDA HPA file

image
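Because the manifest above is only available as a screenshot, here is a hedged sketch of what a ScaledObject with a Prometheus trigger generally looks like, built as a Python dict and printed as JSON (which `kubectl apply -f` accepts). The ScaledObject name, namespace, and target Deployment `gpu-api` come from the operator logs in this report. The server address, the query, the threshold, and the replica limits are assumptions for illustration, not the values from the original file.

```python
import json


def prometheus_scaled_object(name, namespace, deployment, server_address,
                             query, threshold, min_replicas=1, max_replicas=3):
    """Return a KEDA ScaledObject manifest with a single prometheus trigger."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": min_replicas,
            "maxReplicaCount": max_replicas,
            "triggers": [
                {
                    "type": "prometheus",
                    # Trigger metadata values are strings in KEDA.
                    "metadata": {
                        "serverAddress": server_address,
                        "query": query,
                        "threshold": str(threshold),
                    },
                }
            ],
        },
    }


manifest = prometheus_scaled_object(
    name="gpu-dcgmproftester-deployment-scaledobject",
    namespace="default",
    deployment="gpu-api",
    # Assumed values below: adjust to the actual Prometheus service and query.
    server_address="http://prometheus.example.svc:9090",
    query="avg(DCGM_FI_DEV_GPU_UTIL)",
    threshold=10,
)
print(json.dumps(manifest, indent=2))
```

Running the same query directly in Prometheus and comparing the returned value with the threshold is a quick way to check whether the trigger would ever exceed its target.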

Steps to Reproduce the Problem

I followed the article below:
https://gcore.com/docs/cloud/kubernetes/clusters/autoscaling/configure-gpu-autoscaling-for-kubernetes

Logs from KEDA operator

2024/09/06 05:25:58 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
2024-09-06T05:25:58Z INFO setup Starting manager
2024-09-06T05:25:58Z INFO setup KEDA Version: 2.15.1
2024-09-06T05:25:58Z INFO setup Git Commit: 09a4951478746ba0d95521b786439e58aeda179b
2024-09-06T05:25:58Z INFO setup Go Version: go1.22.5
2024-09-06T05:25:58Z INFO setup Go OS/Arch: linux/amd64
2024-09-06T05:25:58Z INFO setup Running on Kubernetes 1.30+ {"version": "v1.30.3-eks-a18cd3a"}
2024-09-06T05:25:59Z INFO starting server {"kind": "health probe", "addr": "[::]:8081"}
I0906 05:25:59.037583 1 leaderelection.go:250] attempting to acquire leader lease keda/operator.keda.sh...
I0906 05:26:16.489836 1 leaderelection.go:260] successfully acquired lease keda/operator.keda.sh
2024-09-06T05:26:16Z INFO Starting EventSource {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v1alpha1.ScaledObject"}
2024-09-06T05:26:16Z INFO Starting EventSource {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v2.HorizontalPodAutoscaler"}
2024-09-06T05:26:16Z INFO Starting Controller {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
2024-09-06T05:26:16Z INFO Starting EventSource {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "source": "kind source: *v1alpha1.TriggerAuthentication"}
2024-09-06T05:26:16Z INFO Starting Controller {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
2024-09-06T05:26:16Z INFO Starting EventSource {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "source": "kind source: *v1alpha1.ScaledJob"}
2024-09-06T05:26:16Z INFO Starting Controller {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
2024-09-06T05:26:16Z INFO Starting EventSource {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource", "source": "kind source: *v1alpha1.CloudEventSource"}
2024-09-06T05:26:16Z INFO Starting Controller {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource"}
2024-09-06T05:26:16Z INFO Starting EventSource {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "source": "kind source: *v1alpha1.ClusterTriggerAuthentication"}
2024-09-06T05:26:16Z INFO Starting Controller {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
2024-09-06T05:26:16Z INFO cert-rotation starting cert rotator controller
2024-09-06T05:26:16Z INFO Starting EventSource {"controller": "cert-rotator", "source": "kind source: *v1.Secret"}
2024-09-06T05:26:16Z INFO Starting EventSource {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-09-06T05:26:16Z INFO Starting EventSource {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-09-06T05:26:16Z INFO Starting Controller {"controller": "cert-rotator"}
2024-09-06T05:26:16Z INFO cert-rotation no cert refresh needed
2024-09-06T05:26:16Z INFO cert-rotation certs are ready in /certs
2024-09-06T05:26:16Z INFO Starting workers {"controller": "cert-rotator", "worker count": 1}
2024-09-06T05:26:16Z INFO cert-rotation no cert refresh needed
2024-09-06T05:26:16Z INFO cert-rotation Ensuring CA cert {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-09-06T05:26:16Z INFO cert-rotation Ensuring CA cert {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2024-09-06T05:26:16Z INFO cert-rotation no cert refresh needed
2024-09-06T05:26:16Z INFO cert-rotation Ensuring CA cert {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-09-06T05:26:16Z INFO cert-rotation Ensuring CA cert {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2024-09-06T05:26:16Z INFO Starting workers {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "worker count": 1}
2024-09-06T05:26:16Z INFO Starting workers {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "worker count": 5}
2024-09-06T05:26:16Z INFO Starting workers {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "worker count": 1}
2024-09-06T05:26:16Z INFO Starting workers {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource", "worker count": 1}
2024-09-06T05:26:16Z INFO Starting workers {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "worker count": 1}
2024-09-06T05:26:17Z INFO cert-rotation CA certs are injected to webhooks
2024-09-06T05:26:17Z INFO grpc_server Starting Metrics Service gRPC Server {"address": ":9666"}
2024-09-06T05:32:28Z INFO Reconciling ScaledObject {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"gpu-dcgmproftester-deployment-scaledobject","namespace":"default"}, "namespace": "default", "name": "gpu-dcgmproftester-deployment-scaledobject", "reconcileID": "3d73bd81-f844-4637-b4dd-04909f5a3c6b"}
2024-09-06T05:32:28Z INFO Adding Finalizer for the ScaledObject {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"gpu-dcgmproftester-deployment-scaledobject","namespace":"default"}, "namespace": "default", "name": "gpu-dcgmproftester-deployment-scaledobject", "reconcileID": "3d73bd81-f844-4637-b4dd-04909f5a3c6b"}
2024-09-06T05:32:28Z INFO KubeAPIWarningLogger metadata.finalizers: "finalizer.keda.sh": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers
2024-09-06T05:32:28Z INFO Detected resource targeted for scaling {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"gpu-dcgmproftester-deployment-scaledobject","namespace":"default"}, "namespace": "default", "name": "gpu-dcgmproftester-deployment-scaledobject", "reconcileID": "3d73bd81-f844-4637-b4dd-04909f5a3c6b", "resource": "apps/v1.Deployment", "name": "gpu-api"}
2024-09-06T05:32:28Z INFO Creating a new HPA {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"gpu-dcgmproftester-deployment-scaledobject","namespace":"default"}, "namespace": "default", "name": "gpu-dcgmproftester-deployment-scaledobject", "reconcileID": "3d73bd81-f844-4637-b4dd-04909f5a3c6b", "HPA.Namespace": "default", "HPA.Name": "keda-hpa-gpu-dcgmproftester-deployment-scaledobject"}
2024-09-06T05:32:28Z INFO Initializing Scaling logic according to ScaledObject Specification {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"gpu-dcgmproftester-deployment-scaledobject","namespace":"default"}, "namespace": "default", "name": "gpu-dcgmproftester-deployment-scaledobject", "reconcileID": "3d73bd81-f844-4637-b4dd-04909f5a3c6b"}
2024-09-06T05:32:28Z INFO Reconciling ScaledObject {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"gpu-dcgmproftester-deployment-scaledobject","namespace":"default"}, "namespace": "default", "name": "gpu-dcgmproftester-deployment-scaledobject", "reconcileID": "4e6ab4cb-a72c-42a1-badf-d4ff2b908d52"}
2024-09-06T05:32:28Z INFO Detected resource targeted for scaling {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"gpu-dcgmproftester-deployment-scaledobject","namespace":"default"}, "namespace": "default", "name": "gpu-dcgmproftester-deployment-scaledobject", "reconcileID": "4e6ab4cb-a72c-42a1-badf-d4ff2b908d52", "resource": "apps/v1.Deployment", "name": "gpu-api"}
2024-09-06T05:32:58Z INFO Reconciling ScaledObject {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"gpu-dcgmproftester-deployment-scaledobject","namespace":"default"}, "namespace": "default", "name": "gpu-dcgmproftester-deployment-scaledobject", "reconcileID": "9ee7f368-ac5d-43ae-a4b8-34bf7c82357c"}
2024-09-06T05:32:58Z INFO Detected resource targeted for scaling {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"gpu-dcgmproftester-deployment-scaledobject","namespace":"default"}, "namespace": "default", "name": "gpu-dcgmproftester-deployment-scaledobject", "reconcileID": "9ee7f368-ac5d-43ae-a4b8-34bf7c82357c", "resource": "apps/v1.Deployment", "name": "gpu-api"}

KEDA Version

2.15.1

Kubernetes Version

1.30

Platform

Amazon Web Services

Scaler Details

prometheus

Anything else?

Output of apiservice

kubectl describe apiservice v1beta1.external.metrics.k8s.io

Name:         v1beta1.external.metrics.k8s.io
Namespace:
Labels:       app.kubernetes.io/component=operator
              app.kubernetes.io/instance=keda
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=v1beta1.external.metrics.k8s.io
              app.kubernetes.io/part-of=keda-operator
              app.kubernetes.io/version=2.15.1
              helm.sh/chart=keda-2.15.1
Annotations:  meta.helm.sh/release-name: keda
              meta.helm.sh/release-namespace: keda
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2024-09-06T05:25:54Z
  Resource Version:    16395
  UID:                 25ebf5a5-3448-4732-bd2a-b9ee7d33851f
Spec:
  Ca Bundle:               LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURFRENDQWZpZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFoTVJBd0RnWURWUVFLRXdkTFJVUkIKVDFKSE1RMHdDd1lEVlFRREV3UkxSVVJCTUI0WERUSTBNRGt3TmpBME1qVTFOMW9YRFRNME1Ea3dOREExTWpVMQpOMW93SVRFUU1BNEdBMVVFQ2hNSFMwVkVRVTlTUnpFTk1Bc0dBMVVFQXhNRVMwVkVRVENDQVNJd0RRWUpLb1pJCmh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBS1E2MjNPZ3BGMnU0MXVHTjZsb01UMEJxQkJDUE51Q3NjbXUKNXoySDZYRjhBY04zcWNzRlMyS1J1TTV0aFYxRHI2OGNPaUR2UVB2a2Y1UFRnL0xRenRzMTE0Y3RuaGNsamliLwpBV2J4Q2poNlVud0Vocld4ZzBpbDlDWWYxcHBXbVhCQTE4SzJJMUxaQTh4YWppb0hGUjREa3VQc3ZwUUNTQ3d3CnNqVDdWVnZFTkEzYVNzbkhCMExDNXpYaDRwN3dyMzlVUmFNbktLRWV1czQ0K3U3NUtyWDFtM3Y4UVVjamRVbGwKbTRzazdSZXFCYlc4K0FoWFhiTXZuSkRpcHZlTUJUbzVoSnZnL1R0cmQ3NGpHQlZaak5QVkoxQ0o4NXNpTEZUdgpMZmg4dWdnYzNBZkdsTjhkYXZSSHpEWFZpbmNoQjFMR1JUS1JvY2lOQ1ZqMDNiUmxJODhDQXdFQUFhTlRNRkV3CkRnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCL3dRRk1BTUJBZjh3SFFZRFZSME9CQllFRkE3WkFkWEYKRFdSNy9XK2haam5ZRXF4Rm9kUVVNQThHQTFVZEVRUUlNQWFDQkV0RlJFRXdEUVlKS29aSWh2Y05BUUVMQlFBRApnZ0VCQUhHTWJOTVNaTHpuM09ZSnI2Rm5HcUxCUkY1RXU1M3NtVkV3T0t0Y3cxc2J5TWJqYnhzM1QyWWpIWE9EClZTc2k4OXlKWGhtekZrWDJ2OTIwcmVzYTFHWkhhUk5Dc1JVS01LZDZ2bVBrU2JBQzJ5RDRmVFlLaUUrcjgrU0cKWHlzT3BFYTJLUkw5ZnBjdS9scm0vQkwyOEo5Mk9tSy9KdkNHK1pZRVdGTnRWM3RrRmw5Nk9kQjVjNG56OFV3agpaSXFPUzg5Ujh0RjA2elpjaU9Lc1lsdTB1ZjF1c3Z6aVpNc3A3Um53STRvUTJPRWxmTWFOR2hCdCtWcjk1N0E3CjI3TXUvc0JtU2lFQU9ucHpaQ2loSXo1MzdMdzJnV0xkMS9xbDByMmNqRnhhK01IN01aaGx4YW50L28xdmVsSnMKWGtORit3V1UybHkwREZ5bFRmWUhGYURvcDZ3PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
  Group:                   external.metrics.k8s.io
  Group Priority Minimum:  100
  Service:
    Name:            keda-operator-metrics-apiserver
    Namespace:       keda
    Port:            443
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2024-09-06T05:26:04Z
    Message:               all checks passed
    Reason:                Passed
    Status:                True
    Type:                  Available
Events:                    <none>

HPA Logs:

image

Vijaygawate added the bug label Sep 6, 2024

stale bot commented Nov 7, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Nov 7, 2024

stale bot commented Nov 17, 2024

This issue has been automatically closed due to inactivity.

stale bot closed this as completed Nov 17, 2024
github-project-automation bot moved this from To Triage to Ready To Ship in Roadmap - KEDA Core Nov 17, 2024