Excessive calling of the DD API by KEDA #5521

Closed
Adityashar opened this issue Feb 19, 2024 · 7 comments
Labels
bug (Something isn't working), stale (All issues that are marked as stale due to inactivity)

Comments

@Adityashar

Hi Team,

We are observing a very high request rate from KEDA's Datadog scaler to the Datadog API: as high as 1,000 queries per minute, even though only 64 ScaledObjects are currently deployed on our platform.

Since other applications on our platform also use the Datadog /api/v1/query endpoint, we frequently hit the rate limit, KEDA logs the error below for all of our ScaledObjects, and those other applications are disrupted as well.
your Datadog account reached the 1600 queries per 60 seconds rate limit, next limit reset will happen in X seconds

I have read the Datadog scaler documentation on polling intervals and rate limiting (https://keda.sh/docs/2.11/scalers/datadog/#polling-intervals-and-datadog-rate-limiting), but I believe some of the KEDA code could also be improved to reduce this calling.

There are two things I observed in KEDA's codebase:

  1. The ScalerCache (and the metrics cache) is invalidated on every error returned from Datadog (https://github.com/kedacore/keda/blob/v2.11.2/pkg/scaling/scale_handler.go#L499).
  2. When an error occurs, two calls are made to fetch metrics, as implemented in this function of scaler_cache.go.

For both of the points above, we could avoid invalidating a ScaledObject's cache, and avoid hitting the DD API multiple times, when we receive:

  1. a 429 response code from Datadog (our API usage is being rate-limited)
  2. a "no Datadog metrics returned for the given time window" error (as computed in the datadog_scaler.go code). Many metrics, such as Kafka lag, stay null most of the time unless there is actual lag. A sketch of how these two cases could be distinguished follows this list.
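
For illustration, here is a minimal Go sketch of how these two cases could be told apart from genuine failures (all names here are hypothetical; this is not KEDA's actual code):

package main

import (
	"errors"
	"fmt"
	"net/http"
	"strings"
)

// errNoDatadogMetrics stands in for the "no Datadog metrics returned for the
// given time window" error built in datadog_scaler.go (name assumed).
var errNoDatadogMetrics = errors.New("no Datadog metrics returned for the given time window")

// isTransient reports whether an error is one of the two cases above, i.e.
// one where serving cached metrics beats invalidating the cache.
func isTransient(err error, statusCode int) bool {
	if statusCode == http.StatusTooManyRequests { // case 1: rate limited (429)
		return true
	}
	// case 2: the query succeeded but the time window was empty
	return err != nil && strings.Contains(err.Error(), "no Datadog metrics returned")
}

func main() {
	fmt.Println(isTransient(errNoDatadogMetrics, 200))     // true
	fmt.Println(isTransient(errors.New("forbidden"), 403)) // false
}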

I see a few options that we could implement going forward (a sketch of option 2 follows the list):

  1. Instead of hard-coding two attempts to fetch metrics, make the retry count configurable at the scaler level (or even the ScaledObject level).
  2. Add a scaler-level sleep or retryAfter config that kicks in when we get 429s from DD, AWS, etc.
  3. Only invalidate the cache in the two cases above once some threshold (time-based, perhaps) is exceeded.
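
As a sketch of option 2: the scaler could remember a "retry after" deadline, set whenever a 429 arrives, and check it before every outbound query. Datadog reports the seconds until the window resets in its X-RateLimit-Reset response header; everything else here (names, the fallback delay) is an assumption, not KEDA code:

package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// rateLimitGuard remembers when the upstream API said we may call again.
type rateLimitGuard struct {
	retryAt time.Time
}

// Allow reports whether an outbound query may be issued right now.
func (g *rateLimitGuard) Allow() bool {
	return time.Now().After(g.retryAt)
}

// Observe inspects a response and, on a 429, records the reset hint so the
// scaler can keep serving cached metrics until the window reopens.
func (g *rateLimitGuard) Observe(resp *http.Response) {
	if resp.StatusCode != http.StatusTooManyRequests {
		return
	}
	delay := 60 * time.Second // fallback: assume one full rate-limit window
	if s := resp.Header.Get("X-RateLimit-Reset"); s != "" {
		if secs, err := strconv.Atoi(s); err == nil {
			delay = time.Duration(secs) * time.Second
		}
	}
	g.retryAt = time.Now().Add(delay)
}

func main() {
	g := &rateLimitGuard{}
	fmt.Println("may query:", g.Allow()) // true until a 429 is observed
}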

I would really appreciate everyone's suggestions on these.

Expected Behavior

The number of calls made by KEDA to Datadog stays low even during errors such as 429s and "no Datadog metrics returned for the given time window".

useCachedMetrics is a helpful feature for serving the incoming metric requests from the HPA. However, once an error is received (especially one of the two above), the cache entry gets deleted, which could have been avoided. This can lead to as many as 10 Datadog calls per minute for a single scaler: 2 × 4 (https://github.com/kedacore/keda/blob/v2.11.2/pkg/scaling/scale_handler.go#L409) from HPA polling, plus 2 × 1 (https://github.com/kedacore/keda/blob/v2.11.2/pkg/scaling/scale_handler.go#L520) from the reconcile loop.

Actual Behavior

Excessive calling of the Datadog API by KEDA.

Steps to Reproduce the Problem

  1. Deploy a sample ScaledObject that references a sample application:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: datadog
  namespace: datadog-keda
spec:
  scaleTargetRef:
    name: datadog-keda
  minReplicaCount: 1  
  maxReplicaCount: 50 
  pollingInterval: 90
  triggers:
  - type: datadog
    metricType: AverageValue
    metadata:
      query: "sum:istio.mesh.request.count{kube_namespace:datadog-keda}.as_count()"
      queryValue: "100"
      activationQueryValue: "10"
      queryAggregator: "max"
    useCachedMetrics: true
    authenticationRef:
      name: keda-trigger-auth-datadog
      kind: TriggerAuthentication
  2. Hit the raw external metrics API and observe the number of requests made by KEDA to Datadog once the cache is lost:
#!/bin/bash
# Fire 10 parallel batches of 20 raw external-metrics requests each, forcing
# KEDA's metrics server to answer (and, once the cache is lost, to query Datadog).

function CallApi () {
    ITER=0
    END=20

    while [ $ITER -lt $END ]; do
        echo "BATCH: $1; Calling API: $ITER"
        kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/datadog-keda/s0-datadog-sum-istio_mesh_request_count?labelSelector=scaledobject.keda.sh%2Fname%3Ddatadog" > /dev/null
        ITER=$((ITER + 1))
    done
}

THREADS=10

for i in $(seq $THREADS); do
    CallApi "$i" &
done
wait    # let all batches finish before the script exits
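
While the script runs, you can watch how often KEDA actually reaches out to Datadog by tailing the operator logs (the deployment and namespace names below assume a default Helm install):

kubectl logs -n keda deploy/keda-operator -f | grep -i datadog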

Logs from KEDA operator

No response

KEDA Version

2.11.2

Kubernetes Version

1.25

Platform

Amazon Web Services

Scaler Details

Datadog

Anything else?

No response

@Adityashar added the bug label Feb 19, 2024
@JorTurFer
Member

Hello @Adityashar

Thanks for reporting the issue. We are currently working on another solution: integrating the DD Agent as a source for querying metrics: #5355

This will improve the behavior and delegate the rate-limit management to the agent.

I'm not totally sure about adding a retry/delay system, because it would probably be released at the same time as the DD Agent support, which will handle the situation much better; but I'm not against it.
@zroubalik @tomkerkhove ?

@Adityashar
Author

Thanks for reporting the issue. We are currently working on another solution: integrating the DD Agent as a source for querying metrics: #5355
This will improve the behavior and delegate the rate-limit management to the agent.

Thanks for this information @JorTurFer, looking forward to this feature!

@zroubalik
Member

I agree with @JorTurFer

@Adityashar
Author

@JorTurFer @zroubalik I was taking a look at @arapulido's draft code and saw this line:

return fmt.Sprintf("https://%s.%s.svc.cluster.local:%d/apis/external.metrics.k8s.io/v1beta1", datadogMetricsService, datadogNamespace, datadogMetricsServicePort)

Does this mean that we would need Datadog's APIService to use this feature? Also, IIRC there can only be one APIService registered per cluster for external.metrics.k8s.io, i.e. either KEDA's or Datadog's.

@JorTurFer
Member

Does this mean that we would need Datadog's APIService to use this feature? Also, IIRC there can only be one APIService registered per cluster for external.metrics.k8s.io, i.e. either KEDA's or Datadog's.

I don't think so. That path is the path exposed by the agent's server, so the idea is that you install the DD Agent without registering the APIService. You can then point KEDA at the DD service endpoint, and KEDA will query the DD Agent directly, with no APIService registration involved.
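
For illustration, querying the agent's endpoint without an APIService registration is just an ordinary HTTPS GET against that path. A minimal Go sketch follows; the service name, namespace, and port are hypothetical and depend on how the DD Agent is installed:

package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Hypothetical service coordinates; the real ones come from the DD
	// Agent installation and would be passed to KEDA as configuration.
	base := fmt.Sprintf("https://%s.%s.svc.cluster.local:%d/apis/external.metrics.k8s.io/v1beta1",
		"datadog-cluster-agent-metrics-api", "datadog", 8443)
	url := base + "/namespaces/datadog-keda/s0-datadog-sum-istio_mesh_request_count"

	// Many agent setups serve a self-signed certificate; a real client
	// should verify it instead of skipping verification like this sketch.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := client.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}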


stale bot commented May 10, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label May 10, 2024

stale bot commented May 18, 2024

This issue has been automatically closed due to inactivity.

The stale bot closed this as completed May 18, 2024
The github-project-automation bot moved this from To Triage to Ready To Ship in Roadmap - KEDA Core May 18, 2024