Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scaling from 0->N doesn't get initiated on time/at all #1062

Closed
hila3000 opened this issue Aug 30, 2020 · 34 comments
Closed

Scaling from 0->N doesn't get initiated on time/at all #1062

hila3000 opened this issue Aug 30, 2020 · 34 comments
Labels
bug Something isn't working

Comments

@hila3000
Copy link

hila3000 commented Aug 30, 2020

We have a scaledobject based on GCP PubSub. The PubSub subscription has reached the number of messages to have the deployment scaled up, but it doesn't happen on time.
Also, yesterday it didn't happen at all on the same scaledobject, until we restarted the keda operator- It reached 27000 messages (Scaledobject is defined to get triggered on value of 50).

Expected Behavior

Once Pubsub subscription reaches the number of messages as defined in the scaledobject, the deployment needs to get scaled up from 0->N, based on the polling interval of course.

Actual Behavior

First issue- Doesn't scale on time-
Deployment got scaled to 0 on timestamp 1598771940.466385 (07:19:00.466 AM GMT).
Pubsub messages show an increase over 50 messages (the value that's defined in the scaledobject) on 07:21 AM GMT, however at 07:38 GMT it reached 972 messages and only then it got scaled up, a.k.a after 17 minutes of a high count of messages (relatively to the scaledobject configuration).

I would expect it to scale up back to N replicas after 7 minutes the latest - 2 min of polling interval + 5 minutes of the cooling period.

Second issue - Didn't scale at all-
Yesterday it didn't scale back up at all (after almost 4 hours of high count of messages) until we restarted the Keda operator.
Messages count on the subscription started increasing since 13:56 GMT.

Deployment got scaled to 0 on timestamp 1598709286.193081 (13:54:46.193 PM GMT).
Pubsub messages show an increase over 50 messages (the value that's defined in the scaledobject) on 13:56 PM GMT, however at 17:55 GMT it reached 27,063 messages and only after we restarted Keda operator on 17:53 PM GMT, it scaled the pods up to 1 replica.

I would expect it to scale up back to N replicas at 14:03 PM GMT the latest - 2 min of polling interval + 5 minutes of the cooling period. BUT it didn't scale at all.

LOGS-
First issue- Doesn't scale on time-
From Keda logs-
scaling to 0 on 7:19 AM GMT-
{"level":"info","ts":1598771940.466385,"logger":"scalehandler","msg":"Successfully scaled deployment to 0 replicas","ScaledObject.Namespace":"prod-us-log-forwarding","ScaledObject.Name":"lf-scaledobject-535176920","ScaledObject.ScaleType":"deployment","Deployment.Namespace":"prod-us-log-forwarding","Deployment.Name":"log-forwarding-535176920"}

Any other references to this scaledobject after that-
{"level":"info","ts":1598771965.0714204,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"prod-us-log-forwarding","Request.Name":"lf-scaledobject-535176920"} {"level":"info","ts":1598771965.071469,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"prod-us-log-forwarding","Request.Name":"lf-scaledobject-535176920"}

we have more of those reconciling...detected... and then at 07:38:50.164 AM-
{"level":"info","ts":1598773130.1646843,"logger":"scalehandler","msg":"Successfully updated deployment","ScaledObject.Namespace":"prod-us-log-forwarding","ScaledObject.Name":"lf-scaledobject-535176920","ScaledObject.ScaleType":"deployment","Deployment.Namespace":"prod-us-log-forwarding","Deployment.Name":"log-forwarding-535176920","Original Replicas Count":0,"New Replicas Count":1}

Nothing in between.

Second issue- Doesn't scale at all-
From Keda logs-
As for yesterday in which it didn't scale at all, only relevant errors I see in the logs are-
13:55:13 GMT- Worth mentioning this error is noticed on the logs also right before.. a.k.a right before it got successfully scaled to 0.
error: "rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post "https://oauth2.googleapis.com/token": dial tcp: i/o timeout" level: "error" logger: "gcp_pub_sub_scaler" msg: "error getting Active Status" stacktrace: "github.com/go-logr/zapr.(*zapLogger).Error /Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128 github.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive keda/pkg/scalers/gcp_pub_sub_scaler.go:88 github.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment keda/pkg/handler/scale_loop.go:119 github.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale keda/pkg/handler/scale_loop.go:45 github.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop keda/pkg/handler/scale_loop.go:28" ts: 1598720729.596498
14:10:04 GMT -
error: "rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post "https://oauth2.googleapis.com/token": unexpected EOF" level: "error" logger: "gcp_pub_sub_scaler" msg: "error getting Active Status" stacktrace: "github.com/go-logr/zapr.(*zapLogger).Error /Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128 github.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive keda/pkg/scalers/gcp_pub_sub_scaler.go:88 github.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment keda/pkg/handler/scale_loop.go:119 github.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale keda/pkg/handler/scale_loop.go:45 github.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop keda/pkg/handler/scale_loop.go:28" ts: 1598710204.4026008

These 2 errors types gets reported quite often.

Steps to Reproduce the Problem

I don't know how this can be reproduced, but I'll just share the configurations I have.

Scaledobject- (Don't mind the wrong indentation.. Happened when pasted here)
apiVersion: keda.k8s.io/v1alpha1
kind: ScaledObject
metadata:
creationTimestamp: "2020-08-23T10:26:34Z"
finalizers:
finalizer.keda.k8s.io
generation: 2
labels:
deploymentName: log-forwarding-535176920
name: lf-scaledobject-535176920
namespace: prod-us-log-forwarding
resourceVersion: "85343895"
selfLink: /apis/keda.k8s.io/v1alpha1/namespaces/prod-us-log-forwarding/scaledobjects/lf-scaledobject-535176920
uid: 1e802b1a-e52b-11ea-8145-42010ab50004
spec:
cooldownPeriod: 300
maxReplicaCount: 5
minReplicaCount: 0
pollingInterval: 120
scaleTargetRef:
deploymentName: log-forwarding-535176920
triggers:
metadata:
credentials: GENERIC_GOOGLE_CREDENTIALS
subscriptionName: lf-data-535176920-sub
subscriptionSize: "50"
type: gcp-pubsub

Specifications

  • KEDA Version: 1.5.0.
  • Platform & Version: GCP GKE, not sure which version is required here.
  • Kubernetes Version: 1.14.10-gke.42.
  • Scaler(s): Scaledobject.
@hila3000 hila3000 added the bug Something isn't working label Aug 30, 2020
@zroubalik
Copy link
Member

I curious to see what DEBUG logs would tell us, https://github.com/kedacore/keda/tree/master#setting-log-levels

@balchua
Copy link
Contributor

balchua commented Aug 31, 2020

Do you think there is a need to close the client? I don't see it being closed.

@hila3000
Copy link
Author

hila3000 commented Sep 2, 2020

@balchua Did you ask me that? not sure if closing the client is something that can be done on configuration, let me know.
@zroubalik I've enabled debug now, waiting for a new occurrence in the logs of the errors to see if it prints more information about that.

@balchua
Copy link
Contributor

balchua commented Sep 2, 2020

Sorry its more for @zroubalik 😊

@hila3000
Copy link
Author

hila3000 commented Sep 3, 2020

@zroubalik What say you on @balchua comment?
We are in production with Keda, and these issues happen often, the only "fix" is to restart Keda-operator pod.. to a point we thought to set low memory limits to have the pod restarted on its own... Please advise.

As for debug, I found an issue in dev of a message that got ingested only after 47 minutes.
on 7:00 GMT the message got into the sub, lf-scaledobject-1546020352 is its name.
on 7:47 GMT the message got ingested, a.k.a only then keda-operator scaled the pod back up from 0->1.

Relevant Logs- There were more in between, but these are the relevant ones, a.k.a about this specific sub or general errors

{"level":"debug","ts":"2020-09-03T06:54:14.985Z","logger":"scalehandler","msg":"Deployment no change","ScaledObject.Namespace":"dev-log-forwarding","ScaledObject.Name":"lf-scaledobject-1546020352","ScaledObject.ScaleType":"deployment","Deployment.Namespace":"dev-log-forwarding","Deployment.Name":"log-forwarding-1546020352"}
{"level":"debug","ts":"2020-09-03T07:09:20.484Z","logger":"scalehandler","msg":"Deployment no change","ScaledObject.Namespace":"dev-log-forwarding","ScaledObject.Name":"lf-scaledobject-1546020352","ScaledObject.ScaleType":"deployment","Deployment.Namespace":"dev-log-forwarding","Deployment.Name":"log-forwarding-1546020352"}

At this point, 7:09, it should have scaled the pod up, but it says "Deployment no change".

{"level":"error","ts":"2020-09-03T07:12:01.685Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp 199.36.153.8:443: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}
{"level":"error","ts":"2020-09-03T07:12:02.085Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": EOF","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}
{"level":"error","ts":"2020-09-03T07:12:24.284Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}

BTW, I'm expecting to see an error like that about the relevant sub, lf-scaledobject-1546020352 -
{"level":"debug","ts":"2020-09-03T07:12:21.684Z","logger":"scalehandler","msg":"Error getting scale decision","ScaledObject.Namespace":"dev-log-forwarding","ScaledObject.Name":"lf-scaledobject-884715244","ScaledObject.ScaleType":"deployment","Error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout"}
but there isn't.... (this one is example to another sub)

{"level":"error","ts":"2020-09-03T07:12:24.284Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}
{"level":"error","ts":"2020-09-03T07:13:12.483Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": write tcp 172.20.8.176:41702->199.36.153.8:443: write: broken pipe","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}
{"level":"error","ts":"2020-09-03T07:14:11.186Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}
{"level":"error","ts":"2020-09-03T07:14:11.186Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}
{"level":"error","ts":"2020-09-03T07:14:12.485Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}
{"level":"error","ts":"2020-09-03T07:14:43.087Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}
{"level":"error","ts":"2020-09-03T07:17:50.687Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}
{"level":"error","ts":"2020-09-03T07:18:01.884Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}
{"level":"error","ts":"2020-09-03T07:18:22.885Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}
{"level":"error","ts":"2020-09-03T07:18:44.481Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": write tcp 172.20.8.176:57870->199.36.153.8:443: write: broken pipe","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}
{"level":"error","ts":"2020-09-03T07:19:19.682Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}

I have 10 more occurrences of -
{"level":"error","ts":"2020-09-03T07:21:55.590Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}

and then one of-
{"level":"error","ts":"2020-09-03T07:22:15.284Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": write tcp 172.20.8.176:58202->199.36.153.9:443: write: broken pipe","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}

We defined cloud DNS so we can bypass FW when communicating with google APIs. the 199 address is of that googleapi, first time to see it in the log after we defined the FW bypass, but it doesn't get reflected when we do a dig/nslookup command, only via the error we see it seems to be working. or.. not, broken pipe.

and now an unexpected EOF as well, until now we already encountered timeouts-
{"level":"error","ts":"2020-09-03T07:22:37.382Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": unexpected EOF","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}

back to timeouts-
{"level":"error","ts":"2020-09-03T07:24:03.181Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}
{"level":"error","ts":"2020-09-03T07:24:03.384Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}
{"level":"error","ts":"2020-09-03T07:25:01.087Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:88\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}

eventually it says the error I expected to see earlier-
{"level":"debug","ts":"2020-09-03T07:25:01.087Z","logger":"scalehandler","msg":"Error getting scale decision","ScaledObject.Namespace":"dev-log-forwarding","ScaledObject.Name":"lf-scaledobject-1546020352","ScaledObject.ScaleType":"deployment","Error":"rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout"}

then-
{"level":"debug","ts":"2020-09-03T07:25:01.087Z","logger":"scalehandler","msg":"Deployment no change","ScaledObject.Namespace":"dev-log-forwarding","ScaledObject.Name":"lf-scaledobject-1546020352","ScaledObject.ScaleType":"deployment","Deployment.Namespace":"dev-log-forwarding","Deployment.Name":"log-forwarding-1546020352"}

Errors about timeouts or getting active status stopped being reported, and then eventually the scale started working-
{"level":"debug","ts":"2020-09-03T07:46:31.683Z","logger":"scalehandler","msg":"Scaler for scaledObject is active","ScaledObject.Namespace":"dev-log-forwarding","ScaledObject.Name":"lf-scaledobject-1546020352","ScaledObject.ScaleType":"deployment","Scaler":{}}
{"level":"info","ts":"2020-09-03T07:46:32.182Z","logger":"scalehandler","msg":"Successfully updated deployment","ScaledObject.Namespace":"dev-log-forwarding","ScaledObject.Name":"lf-scaledobject-1546020352","ScaledObject.ScaleType":"deployment","Deployment.Namespace":"dev-log-forwarding","Deployment.Name":"log-forwarding-1546020352","Original Replicas Count":0,"New Replicas Count":1}

I hope this helps.

@balchua
Copy link
Contributor

balchua commented Sep 3, 2020

@hila3000 if you have the chance to build the master branch of keda, you can have a look at the stackdriver client code here (https://github.com/kedacore/keda/blob/master/pkg/scalers/stackdriver_client.go).

In the method GetMetrics perhaps add a defer s.metricsClient.Close(). This is more a hunch as i don't know much about gcp pub sub.

Instructions to build keda is in the Readme. This may help speed up troubleshooting your issue.

Just my 2 cents. 😊

@zroubalik
Copy link
Member

I can see, that there are timeouts when communicating with the gcp pub sub:

rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout

This is the reason for non scaling.

@hila3000 btw, you can use ``` for multiline code blocks, its better for the bigger chunks of code or logs (you can specify syntax highlighting as well: https://docs.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks

And @balchua is probably right, I am not expert on gcp pub sub, but the client should be probably closed. This might cause the problems.

@zroubalik
Copy link
Member

@balchua haha, commented in the same time.

I think the close should be called for sure, as per the client doc:
https://github.com/googleapis/google-cloud-go/blob/4de835178d583bdbbe5c8ec10befda4f59373d32/monitoring/apiv3/metric_client.go#L196-L200

@zroubalik
Copy link
Member

@hila3000
Copy link
Author

hila3000 commented Sep 3, 2020

This should eliminate the timeouts?
rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout

@balchua
Copy link
Contributor

balchua commented Sep 3, 2020

@zroubalik yeah the Close() method. Totally forgot it. We need to have the stackdriver client part of the scaler. Right now its not stored.

@hila3000 if you have the chance to build keda master, you may want to try closing the client just to test it out.

@hila3000
Copy link
Author

hila3000 commented Sep 3, 2020

@balchua I tried - added the close in the file, ran docker build, but when I set the new image for the keda-operator deployment, it says-
Error: failed to start container "keda-operator": Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "exec: \"keda\": executable file not found in $PATH": unknown

Any idea why?

@balchua
Copy link
Contributor

balchua commented Sep 3, 2020

@hila3000 did you do a make build?

@hila3000
Copy link
Author

hila3000 commented Sep 3, 2020

I did now... anything else before I hit the "docker build" command?

@zroubalik
Copy link
Member

@hila3000 if you are building it against master branch, all you need to do is run make publish to build the images locally and push them to Docker hub. Or you can run make build and run docker push manually.

@hila3000
Copy link
Author

hila3000 commented Sep 3, 2020

@zroubalik @balchua So let's recap.
I did git clone to this repo, checkout to master
did the change in pkg/scalers/stackdriver_client.go (added the defer s.metricsClient.Close() on row 48), and then tried to run "make publish" but it fails-

hfish:keda (master)$ make publish
WARNING: GOROOT is not defined
GO111MODULE=on CGO_ENABLED=0 GOOS=linux GOARCH=amd64 operator-sdk generate k8s
/bin/sh: operator-sdk: command not found
make: *** [generate-api] Error 127

Any thoughts? how do I define the GOROOT? where? why is the command not found?

Thanks a LOT for your help :)

@hila3000
Copy link
Author

hila3000 commented Sep 3, 2020

also make build doesn't work-

hfish:keda (master)$ make build
WARNING: GOROOT is not defined
GO111MODULE=on CGO_ENABLED=0 GOOS=linux GOARCH=amd64 operator-sdk generate k8s
/bin/sh: operator-sdk: command not found
make: *** [generate-api] Error 127

it did work when I ran it on v2 branch.

@zroubalik
Copy link
Member

OK, I can see.

You will need to install operator-sdk v0.11.0
https://github.com/operator-framework/operator-sdk/releases/tag/v0.11.0

If you have troubles with the instalation etc, if I spare some time I can give it a shot a produce an image with the change for you.

@hila3000
Copy link
Author

hila3000 commented Sep 3, 2020

@zroubalik The files in https://github.com/operator-framework/operator-sdk/releases/tag/v0.11.0 are without files extensions.. so not sure how should it be installed. For example, I've downloaded "operator-sdk-v0.11.0-x86_64-apple-darwin" cause this looks suitable for mac, but it's not a dmg so I can't really install it.

If you can create an image for me with the relevant fix it'll be GREAT and MUCH APPRECIATED (If it can be before Saturday it'll be super helpful too).

@balchua
Copy link
Contributor

balchua commented Sep 3, 2020

Oh darwin is for MacOS.
For linux try this:

# Set the release version variable
$ RELEASE_VERSION=v0.11.0
# Linux
$ curl -LO https://github.com/operator-framework/operator-sdk/releases/download/${RELEASE_VERSION}/operator-sdk-${RELEASE_VERSION}-x86_64-linux-gnu
$ chmod +x operator-sdk-${RELEASE_VERSION}-x86_64-linux-gnu && sudo mkdir -p /usr/local/bin/ && sudo cp operator-sdk-${RELEASE_VERSION}-x86_64-linux-gnu /usr/local/bin/operator-sdk && rm operator-sdk-${RELEASE_VERSION}-x86_64-linux-gnu

I copied it from the operator-sdk installation guide guide

@zroubalik
Copy link
Member

@hila3000 you can try these images:
KEDA Operator: docker.io/zroubalik/keda:gcpclose
KEDA Metrics Adapter: docker.io/zroubalik/keda-metrics-adapter:gcpclose

it is custom v1 build, which is more likely won't be released. I plan to include this change into the upcoming v2.
I am not sure if it is gonna actually help with your issue, I'd recommend to switch to v2 anyway, as there were some performance improvements internally etc and those could help with your issue as well.

KEDA v2 Beta is scheduled for the beginning of the next week.

@hila3000
Copy link
Author

hila3000 commented Sep 4, 2020

@balchua Thanks for the explanation. I do need for mac though, so it's the same concept and commands you wrote, just with the darwin files?
@zroubalik Thanks! I'll deploy these versions in dev for now and see if it solves our issues. Will deploy regardless v2 beta on dev as well next week when it'll get released, and update on the results. Any estimation when v2 GA version will get released? (For Production deployment).

@hila3000
Copy link
Author

hila3000 commented Sep 5, 2020

@zroubalik The version seems to be ok. Will give it more time to verify it, also seem to stabilize the memory usage, which is great.
Question though... What does that error mean-

{"level":"error","ts":"2020-09-05T19:13:01.738Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"Could not find stackdriver metric with filter metric.type=\"pubsub.googleapis.com/subscription/num_undelivered_messages\" AND resource.labels.subscription_id=\"lf-data-1272148120-sub\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tkeda/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:89\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}

The sub does exist, messages count is 0 but still, it exists.

@balchua
Copy link
Contributor

balchua commented Sep 5, 2020

@hila3000 great news!

Regarding that error, it seems to be benign.

The error simply suggests that the stackdriver monitoring client did not return any timeseries data within that period and for that particular metric.

@hila3000
Copy link
Author

hila3000 commented Sep 6, 2020

@balchua @zroubalik I think it's safe to say that the fix worked. Thanks a lot!
We face a different issue now- Now that it works, we've set the pollinginterval back to 5 seconds, and it introduced a quota issue-

from keda-operator pod-
{"level":"error","ts":"2020-09-06T07:57:58.138Z","logger":"gcp_pub_sub_scaler","msg":"error getting Active Status","error":"rpc error: code = ResourceExhausted desc = Quota exceeded for quota metric 'Time series queries' and limit 'Time series queries per minute' of service 'monitoring.googleapis.com' for consumer 'project_number:xxxxxxxxxxx'.","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tkeda/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive\n\tkeda/pkg/scalers/gcp_pub_sub_scaler.go:89\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment\n\tkeda/pkg/handler/scale_loop.go:119\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale\n\tkeda/pkg/handler/scale_loop.go:45\ngithub.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop\n\tkeda/pkg/handler/scale_loop.go:28"}

from keda-metrics-apiserver pod-
E0906 09:13:37.646708 1 provider.go:94] keda_metrics_adapter/provider "msg"="error getting metric for scaler" "error"="rpc error: code = ResourceExhausted desc = Quota exceeded for quota metric 'Time series queries' and limit 'Time series queries per minute' of service 'monitoring.googleapis.com' for consumer 'project_number:xxxxxxxxxxx'." "ScaledObject.Name"="lf-scaledobject-1016483367" "ScaledObject.Namespace"="prod-us-log-forwarding" "Scaler"={}

I opened a ticket to GCP to see if we can increase the quota, but still thought to ask - Any way to handle it from Keda's side?

@hila3000
Copy link
Author

hila3000 commented Sep 7, 2020

@balchua Regarding what you wrote- Within that period of time - But I checked the metric on GCP side and it consistently reports 0 messages. So I'm not sure why it says that the stackdriver monitoring client did not return any timeseries data.

@balchua
Copy link
Contributor

balchua commented Sep 7, 2020

The keda gcp pubsub scaler performs a metrics query on stackdriver within a time range (now - 2mins).
So if stackdriver doesn't return record(s) the keda gcp scaler will log that "error". But i believe it is benign from what i can tell. Does it not behave as you expect?

@hila3000
Copy link
Author

hila3000 commented Sep 7, 2020

For now, it seems ok, no apparent issues.

@balchua
Copy link
Contributor

balchua commented Sep 7, 2020

Cool. Memory utilization also seems ok?

@hila3000
Copy link
Author

hila3000 commented Sep 7, 2020

Yes :)

@zroubalik
Copy link
Member

Glad it is solved now, closing.

@hila3000
Copy link
Author

@zroubalik are the fixes you did in -
KEDA Operator: docker.io/zroubalik/keda:gcpclose
KEDA Metrics Adapter: docker.io/zroubalik/keda-metrics-adapter:gcpclose

Also exist in keda 2.0.0 and forward?

@zroubalik
Copy link
Member

@hila3000 yeah, the fix is present in 2.0.0 and forward.

@hila3000
Copy link
Author

Great, thanks!

SpiritZhou pushed a commit to SpiritZhou/keda that referenced this issue Jul 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants