Excessive calling of the DD API by KEDA #5521

Closed
Adityashar opened this issue Feb 19, 2024 · 7 comments
Labels
bug (Something isn't working), stale (All issues that are marked as stale due to inactivity)

Comments

@Adityashar

Hi Team,

We are observing a very high request rate from KEDA's Datadog scaler to the Datadog API: as high as 1,000 queries per minute, even though only 64 ScaledObjects are currently deployed on our platform.

Since other applications on our platform also use the Datadog /api/v1/query endpoint, we frequently hit the rate limit, KEDA logs the error below for all of our ScaledObjects, and those other applications are disrupted as well.
your Datadog account reached the 1600 queries per 60 seconds rate limit, next limit reset will happen in X seconds

I have read the Datadog scaler documentation on polling intervals and rate limiting (https://keda.sh/docs/2.11/scalers/datadog/#polling-intervals-and-datadog-rate-limiting), but I believe some of the KEDA code could also be improved to reduce this calling.

There are two things I observed in KEDA's codebase:

  1. The ScalerCache (and the metrics cache) is invalidated on every error returned from Datadog (https://github.com/kedacore/keda/blob/v2.11.2/pkg/scaling/scale_handler.go#L499).
  2. When an error occurs, two calls are made to fetch metrics, as implemented in this function of scaler_cache.go.

For both of the points above, we could avoid invalidating a ScaledObject's cache, and avoid hitting the DD API multiple times, when we receive:

  1. a 429 response code from Datadog (our API usage is being rate-limited)
  2. a "no Datadog metrics returned for the given time window" error (as computed in the datadog_scaler.go code). Many metrics, such as Kafka lag, stay null most of the time unless there is actual lag. A sketch of how these two cases could be distinguished follows this list.
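
For illustration, here is a minimal Go sketch of how these two cases could be told apart from genuine failures (all names here are hypothetical; this is not KEDA's actual code):

package main

import (
	"errors"
	"fmt"
	"net/http"
	"strings"
)

// errNoDatadogMetrics stands in for the "no Datadog metrics returned for the
// given time window" error built in datadog_scaler.go (name assumed).
var errNoDatadogMetrics = errors.New("no Datadog metrics returned for the given time window")

// isTransient reports whether an error is one of the two cases above, i.e.
// one where serving cached metrics beats invalidating the cache.
func isTransient(err error, statusCode int) bool {
	if statusCode == http.StatusTooManyRequests { // case 1: rate limited (429)
		return true
	}
	// case 2: the query succeeded but the time window was empty
	return err != nil && strings.Contains(err.Error(), "no Datadog metrics returned")
}

func main() {
	fmt.Println(isTransient(errNoDatadogMetrics, 200))     // true
	fmt.Println(isTransient(errors.New("forbidden"), 403)) // false
}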

I see a few options that we could implement going forward (a sketch of option 2 follows the list):

  1. Instead of hard-coding two attempts to fetch metrics, make the retry count configurable at the scaler level (or even the ScaledObject level).
  2. Add a scaler-level sleep or retryAfter config that kicks in when we get 429s from DD, AWS, etc.
  3. Only invalidate the cache in the two cases above once some threshold (time-based, perhaps) is exceeded.
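
As a sketch of option 2: the scaler could remember a "retry after" deadline, set whenever a 429 arrives, and check it before every outbound query. Datadog reports the seconds until the window resets in its X-RateLimit-Reset response header; everything else here (names, the fallback delay) is an assumption, not KEDA code:

package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// rateLimitGuard remembers when the upstream API said we may call again.
type rateLimitGuard struct {
	retryAt time.Time
}

// Allow reports whether an outbound query may be issued right now.
func (g *rateLimitGuard) Allow() bool {
	return time.Now().After(g.retryAt)
}

// Observe inspects a response and, on a 429, records the reset hint so the
// scaler can keep serving cached metrics until the window reopens.
func (g *rateLimitGuard) Observe(resp *http.Response) {
	if resp.StatusCode != http.StatusTooManyRequests {
		return
	}
	delay := 60 * time.Second // fallback: assume one full rate-limit window
	if s := resp.Header.Get("X-RateLimit-Reset"); s != "" {
		if secs, err := strconv.Atoi(s); err == nil {
			delay = time.Duration(secs) * time.Second
		}
	}
	g.retryAt = time.Now().Add(delay)
}

func main() {
	g := &rateLimitGuard{}
	fmt.Println("may query:", g.Allow()) // true until a 429 is observed
}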

I would really appreciate everyone's suggestions on these.

Expected Behavior

The number of calls made by KEDA to Datadog stays low even during errors such as 429s and "no Datadog metrics returned for the given time window".

useCachedMetrics is a helpful feature for serving the incoming metric requests from the HPA. However, once an error is received (especially one of the two above), the cache entry gets deleted, which could have been avoided. This can lead to as many as 10 Datadog calls per minute for a single scaler: 2 × 4 (https://github.com/kedacore/keda/blob/v2.11.2/pkg/scaling/scale_handler.go#L409) from HPA polling, plus 2 × 1 (https://github.com/kedacore/keda/blob/v2.11.2/pkg/scaling/scale_handler.go#L520) from the reconcile loop.

Actual Behavior

Excessive calling of the Datadog API by KEDA.

Steps to Reproduce the Problem

  1. Deploy a sample ScaledObject that references a sample application:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: datadog
  namespace: datadog-keda
spec:
  scaleTargetRef:
    name: datadog-keda
  minReplicaCount: 1  
  maxReplicaCount: 50 
  pollingInterval: 90
  triggers:
  - type: datadog
    metricType: AverageValue
    metadata:
      query: "sum:istio.mesh.request.count{kube_namespace:datadog-keda}.as_count()"
      queryValue: "100"
      activationQueryValue: "10"
      queryAggregator: "max"
    useCachedMetrics: true
    authenticationRef:
      name: keda-trigger-auth-datadog
      kind: TriggerAuthentication
  2. Hit the raw external metrics API and observe the number of requests made by KEDA to Datadog once the cache is lost:
#!/bin/bash
# Fire 10 parallel batches of 20 raw external-metrics requests each, forcing
# KEDA's metrics server to answer (and, once the cache is lost, to query Datadog).

function CallApi () {
    ITER=0
    END=20

    while [ $ITER -lt $END ]; do
        echo "BATCH: $1; Calling API: $ITER"
        kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/datadog-keda/s0-datadog-sum-istio_mesh_request_count?labelSelector=scaledobject.keda.sh%2Fname%3Ddatadog" > /dev/null
        ITER=$((ITER + 1))
    done
}

THREADS=10

for i in $(seq $THREADS); do
    CallApi "$i" &
done
wait    # let all batches finish before the script exits
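
While the script runs, you can watch how often KEDA actually reaches out to Datadog by tailing the operator logs (the deployment and namespace names below assume a default Helm install):

kubectl logs -n keda deploy/keda-operator -f | grep -i datadog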

Logs from KEDA operator

No response

KEDA Version

2.11.2

Kubernetes Version

1.25

Platform

Amazon Web Services

Scaler Details

Datadog

Anything else?

No response

@Adityashar added the bug label Feb 19, 2024
@JorTurFer
Member

Hello @Adityashar

Thanks for reporting the issue. We are currently working on another solution: integrating the DD Agent as a source for querying metrics: #5355

This will improve the behavior and delegate the rate-limit management to the agent.

I'm not totally sure about adding a retry/delay system, because it would probably be released at the same time as the DD Agent support, which will handle the situation much better; but I'm not against it.
@zroubalik @tomkerkhove ?

@Adityashar
Author

Thanks for reporting the issue. We are currently working on another solution: integrating the DD Agent as a source for querying metrics: #5355
This will improve the behavior and delegate the rate-limit management to the agent.

Thanks for this information @JorTurFer, looking forward to this feature!

@zroubalik
Member

I agree with @JorTurFer

@Adityashar
Author

@JorTurFer @zroubalik I was taking a look at @arapulido's draft code and saw this line:

return fmt.Sprintf("https://%s.%s.svc.cluster.local:%d/apis/external.metrics.k8s.io/v1beta1", datadogMetricsService, datadogNamespace, datadogMetricsServicePort)

Does this mean that we would need Datadog's APIService to use this feature? Also, IIRC there can only be one APIService registered per cluster for external.metrics.k8s.io, i.e. either KEDA's or Datadog's.

@JorTurFer
Member

Does this mean that we would need Datadog's APIService to use this feature? Also, IIRC there can only be one APIService registered per cluster for external.metrics.k8s.io, i.e. either KEDA's or Datadog's.

I don't think so. That path is the path exposed by the agent's server, so the idea is that you install the DD Agent without registering the APIService. You can then point KEDA at the DD service endpoint, and KEDA will query the DD Agent directly, with no APIService registration involved.
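
For illustration, querying the agent's endpoint without an APIService registration is just an ordinary HTTPS GET against that path. A minimal Go sketch follows; the service name, namespace, and port are hypothetical and depend on how the DD Agent is installed:

package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Hypothetical service coordinates; the real ones come from the DD
	// Agent installation and would be passed to KEDA as configuration.
	base := fmt.Sprintf("https://%s.%s.svc.cluster.local:%d/apis/external.metrics.k8s.io/v1beta1",
		"datadog-cluster-agent-metrics-api", "datadog", 8443)
	url := base + "/namespaces/datadog-keda/s0-datadog-sum-istio_mesh_request_count"

	// Many agent setups serve a self-signed certificate; a real client
	// should verify it instead of skipping verification like this sketch.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := client.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}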


stale bot commented May 10, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label May 10, 2024

stale bot commented May 18, 2024

This issue has been automatically closed due to inactivity.

The stale bot closed this as completed May 18, 2024
The github-project-automation bot moved this from To Triage to Ready To Ship in Roadmap - KEDA Core May 18, 2024