-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scaling from 0->N doesn't get initiated on time/at all #1062
Comments
I curious to see what DEBUG logs would tell us, https://github.com/kedacore/keda/tree/master#setting-log-levels |
Do you think there is a need to close the client? I don't see it being closed. |
@balchua Did you ask me that? not sure if closing the client is something that can be done on configuration, let me know. |
Sorry its more for @zroubalik 😊 |
@zroubalik What say you on @balchua comment? As for debug, I found an issue in dev of a message that got ingested only after 47 minutes. Relevant Logs- There were more in between, but these are the relevant ones, a.k.a about this specific sub or general errors
At this point, 7:09, it should have scaled the pod up, but it says "Deployment no change".
BTW, I'm expecting to see an error like that about the relevant sub, lf-scaledobject-1546020352 -
I have 10 more occurrences of - and then one of- We defined cloud DNS so we can bypass FW when communicating with google APIs. the 199 address is of that googleapi, first time to see it in the log after we defined the FW bypass, but it doesn't get reflected when we do a dig/nslookup command, only via the error we see it seems to be working. or.. not, broken pipe. and now an unexpected EOF as well, until now we already encountered timeouts- back to timeouts- eventually it says the error I expected to see earlier- then- Errors about timeouts or getting active status stopped being reported, and then eventually the scale started working- I hope this helps. |
@hila3000 if you have the chance to build the master branch of keda, you can have a look at the stackdriver client code here (https://github.com/kedacore/keda/blob/master/pkg/scalers/stackdriver_client.go). In the method Instructions to build keda is in the Readme. This may help speed up troubleshooting your issue. Just my 2 cents. 😊 |
I can see, that there are timeouts when communicating with the gcp pub sub:
This is the reason for non scaling. @hila3000 btw, you can use ``` for multiline code blocks, its better for the bigger chunks of code or logs (you can specify syntax highlighting as well: https://docs.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks And @balchua is probably right, I am not expert on gcp pub sub, but the client should be probably closed. This might cause the problems. |
@balchua haha, commented in the same time. I think the close should be called for sure, as per the client doc: |
I'd add the close call in here: https://github.com/kedacore/keda/blob/v2/pkg/scalers/gcp_pub_sub_scaler.go#L94 |
This should eliminate the timeouts? |
@zroubalik yeah the @hila3000 if you have the chance to build keda master, you may want to try closing the client just to test it out. |
@balchua I tried - added the close in the file, ran docker build, but when I set the new image for the keda-operator deployment, it says- Any idea why? |
@hila3000 did you do a |
I did now... anything else before I hit the "docker build" command? |
@hila3000 if you are building it against |
@zroubalik @balchua So let's recap. hfish:keda (master)$ make publish Any thoughts? how do I define the GOROOT? where? why is the command not found? Thanks a LOT for your help :) |
also make build doesn't work- hfish:keda (master)$ make build it did work when I ran it on v2 branch. |
OK, I can see. You will need to install operator-sdk v0.11.0 If you have troubles with the instalation etc, if I spare some time I can give it a shot a produce an image with the change for you. |
@zroubalik The files in https://github.com/operator-framework/operator-sdk/releases/tag/v0.11.0 are without files extensions.. so not sure how should it be installed. For example, I've downloaded "operator-sdk-v0.11.0-x86_64-apple-darwin" cause this looks suitable for mac, but it's not a dmg so I can't really install it. If you can create an image for me with the relevant fix it'll be GREAT and MUCH APPRECIATED (If it can be before Saturday it'll be super helpful too). |
Oh # Set the release version variable
$ RELEASE_VERSION=v0.11.0
# Linux
$ curl -LO https://github.com/operator-framework/operator-sdk/releases/download/${RELEASE_VERSION}/operator-sdk-${RELEASE_VERSION}-x86_64-linux-gnu
$ chmod +x operator-sdk-${RELEASE_VERSION}-x86_64-linux-gnu && sudo mkdir -p /usr/local/bin/ && sudo cp operator-sdk-${RELEASE_VERSION}-x86_64-linux-gnu /usr/local/bin/operator-sdk && rm operator-sdk-${RELEASE_VERSION}-x86_64-linux-gnu I copied it from the operator-sdk installation guide guide |
@hila3000 you can try these images: it is custom v1 build, which is more likely won't be released. I plan to include this change into the upcoming v2. KEDA v2 Beta is scheduled for the beginning of the next week. |
@balchua Thanks for the explanation. I do need for mac though, so it's the same concept and commands you wrote, just with the darwin files? |
@zroubalik The version seems to be ok. Will give it more time to verify it, also seem to stabilize the memory usage, which is great.
The sub does exist, messages count is 0 but still, it exists. |
@hila3000 great news! Regarding that error, it seems to be benign. The error simply suggests that the stackdriver monitoring client did not return any timeseries data within that period and for that particular metric. |
@balchua @zroubalik I think it's safe to say that the fix worked. Thanks a lot! from keda-operator pod- from keda-metrics-apiserver pod- I opened a ticket to GCP to see if we can increase the quota, but still thought to ask - Any way to handle it from Keda's side? |
@balchua Regarding what you wrote- Within that period of time - But I checked the metric on GCP side and it consistently reports 0 messages. So I'm not sure why it says that the stackdriver monitoring client did not return any timeseries data. |
The keda gcp pubsub scaler performs a metrics query on stackdriver within a time range (now - 2mins). |
For now, it seems ok, no apparent issues. |
Cool. Memory utilization also seems ok? |
Yes :) |
Glad it is solved now, closing. |
@zroubalik are the fixes you did in - Also exist in keda 2.0.0 and forward? |
@hila3000 yeah, the fix is present in 2.0.0 and forward. |
Great, thanks! |
…totals (kedacore#1062) Signed-off-by: Jorge Turrado <[email protected]>
We have a scaledobject based on GCP PubSub. The PubSub subscription has reached the number of messages to have the deployment scaled up, but it doesn't happen on time.
Also, yesterday it didn't happen at all on the same scaledobject, until we restarted the keda operator- It reached 27000 messages (Scaledobject is defined to get triggered on value of 50).
Expected Behavior
Once Pubsub subscription reaches the number of messages as defined in the scaledobject, the deployment needs to get scaled up from 0->N, based on the polling interval of course.
Actual Behavior
First issue- Doesn't scale on time-
Deployment got scaled to 0 on timestamp 1598771940.466385 (07:19:00.466 AM GMT).
Pubsub messages show an increase over 50 messages (the value that's defined in the scaledobject) on 07:21 AM GMT, however at 07:38 GMT it reached 972 messages and only then it got scaled up, a.k.a after 17 minutes of a high count of messages (relatively to the scaledobject configuration).
I would expect it to scale up back to N replicas after 7 minutes the latest - 2 min of polling interval + 5 minutes of the cooling period.
Second issue - Didn't scale at all-
Yesterday it didn't scale back up at all (after almost 4 hours of high count of messages) until we restarted the Keda operator.
Messages count on the subscription started increasing since 13:56 GMT.
Deployment got scaled to 0 on timestamp 1598709286.193081 (13:54:46.193 PM GMT).
Pubsub messages show an increase over 50 messages (the value that's defined in the scaledobject) on 13:56 PM GMT, however at 17:55 GMT it reached 27,063 messages and only after we restarted Keda operator on 17:53 PM GMT, it scaled the pods up to 1 replica.
I would expect it to scale up back to N replicas at 14:03 PM GMT the latest - 2 min of polling interval + 5 minutes of the cooling period. BUT it didn't scale at all.
LOGS-
First issue- Doesn't scale on time-
From Keda logs-
scaling to 0 on 7:19 AM GMT-
{"level":"info","ts":1598771940.466385,"logger":"scalehandler","msg":"Successfully scaled deployment to 0 replicas","ScaledObject.Namespace":"prod-us-log-forwarding","ScaledObject.Name":"lf-scaledobject-535176920","ScaledObject.ScaleType":"deployment","Deployment.Namespace":"prod-us-log-forwarding","Deployment.Name":"log-forwarding-535176920"}
Any other references to this scaledobject after that-
{"level":"info","ts":1598771965.0714204,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"prod-us-log-forwarding","Request.Name":"lf-scaledobject-535176920"} {"level":"info","ts":1598771965.071469,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"prod-us-log-forwarding","Request.Name":"lf-scaledobject-535176920"}
we have more of those reconciling...detected... and then at 07:38:50.164 AM-
{"level":"info","ts":1598773130.1646843,"logger":"scalehandler","msg":"Successfully updated deployment","ScaledObject.Namespace":"prod-us-log-forwarding","ScaledObject.Name":"lf-scaledobject-535176920","ScaledObject.ScaleType":"deployment","Deployment.Namespace":"prod-us-log-forwarding","Deployment.Name":"log-forwarding-535176920","Original Replicas Count":0,"New Replicas Count":1}
Nothing in between.
Second issue- Doesn't scale at all-
From Keda logs-
As for yesterday in which it didn't scale at all, only relevant errors I see in the logs are-
13:55:13 GMT- Worth mentioning this error is noticed on the logs also right before.. a.k.a right before it got successfully scaled to 0.
error: "rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post "https://oauth2.googleapis.com/token": dial tcp: i/o timeout" level: "error" logger: "gcp_pub_sub_scaler" msg: "error getting Active Status" stacktrace: "github.com/go-logr/zapr.(*zapLogger).Error /Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128 github.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive keda/pkg/scalers/gcp_pub_sub_scaler.go:88 github.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment keda/pkg/handler/scale_loop.go:119 github.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale keda/pkg/handler/scale_loop.go:45 github.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop keda/pkg/handler/scale_loop.go:28" ts: 1598720729.596498
14:10:04 GMT -
error: "rpc error: code = Unauthenticated desc = transport: oauth2: cannot fetch token: Post "https://oauth2.googleapis.com/token": unexpected EOF" level: "error" logger: "gcp_pub_sub_scaler" msg: "error getting Active Status" stacktrace: "github.com/go-logr/zapr.(*zapLogger).Error /Users/zroubali/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128 github.com/kedacore/keda/pkg/scalers.(*pubsubScaler).IsActive keda/pkg/scalers/gcp_pub_sub_scaler.go:88 github.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScaleDeployment keda/pkg/handler/scale_loop.go:119 github.com/kedacore/keda/pkg/handler.(*ScaleHandler).handleScale keda/pkg/handler/scale_loop.go:45 github.com/kedacore/keda/pkg/handler.(*ScaleHandler).HandleScaleLoop keda/pkg/handler/scale_loop.go:28" ts: 1598710204.4026008
These 2 errors types gets reported quite often.
Steps to Reproduce the Problem
I don't know how this can be reproduced, but I'll just share the configurations I have.
Scaledobject- (Don't mind the wrong indentation.. Happened when pasted here)
apiVersion: keda.k8s.io/v1alpha1
kind: ScaledObject
metadata:
creationTimestamp: "2020-08-23T10:26:34Z"
finalizers:
finalizer.keda.k8s.io
generation: 2
labels:
deploymentName: log-forwarding-535176920
name: lf-scaledobject-535176920
namespace: prod-us-log-forwarding
resourceVersion: "85343895"
selfLink: /apis/keda.k8s.io/v1alpha1/namespaces/prod-us-log-forwarding/scaledobjects/lf-scaledobject-535176920
uid: 1e802b1a-e52b-11ea-8145-42010ab50004
spec:
cooldownPeriod: 300
maxReplicaCount: 5
minReplicaCount: 0
pollingInterval: 120
scaleTargetRef:
deploymentName: log-forwarding-535176920
triggers:
metadata:
credentials: GENERIC_GOOGLE_CREDENTIALS
subscriptionName: lf-data-535176920-sub
subscriptionSize: "50"
type: gcp-pubsub
Specifications
The text was updated successfully, but these errors were encountered: