KEDA identity is not authenticating to Service Bus after a few hours #3977
Comments
Hum... weird.
@tshaiman Is this consistently reproducible? Did you set a custom token expiration period? If this error aligns with the time the token is about to expire, then maybe something is wrong with the way we're reading the token. If it's not, then there is some error in the updated token being injected into the pod. But I can't tell why this would happen, @JorTurFer.
@v-shenoy: this was not happening more than once per region we are deployed in (23 regions total), but the trigger was moving from Ubuntu VMs to Mariner-based images on AKS. It is also important to know (even if it may be irrelevant) that we ran with replicas=2 and one pod was always idle (an active-passive kind of setup), so they did not share the workload between them.
@v-shenoy: Karma is a bitch. I was just speaking about not reproducing this item, and then I saw it again in many of our regions after the pods had been up for around 10 hours. We have decided to increase the token expiration to the max value of 86400 and forcibly restart KEDA every hour.
I haven't talked to people on the workload identity team myself before, but I think @aramase can help.
On an unrelated but tertiary issue about workload identity: we need to refactor some parts there.
I think so. With an issue open we could get help from contributors; without it, we will be the ones who have to do it.
@JorTurFer: I want to become a contributor, but my Go has gotten a bit rusty over the last 2 years (I moved to other languages).
You'll never know if you don't try xD. We are here to help if you have any questions or doubts :)
@v-shenoy / @JorTurFer: I have some updates on a new experiment done on KEDA 2.9.1, running with Replicas: 2.
@v-shenoy this is inaccurate.
@v-shenoy: well, 2 different docs from the same provider, not surprising. Anyhow, I don't think that is the root cause of the "file not found" issue in this bug, as we are running without this label and it works for a few hours and then loses the file.
I think the docs on … Yup, I don't think these are related, just wanted to clarify.
The only thing I wonder here is: is this somehow specific to AKS and Mariner, or does this happen for all OSes?
@v-shenoy: only AKS on Mariner. It never happened when we ran on Ubuntu.
That's interesting. This makes it more of an AKS issue than KEDA, doesn't it?
Absolutely. I'm trying everything now 😔
Is there an issue created on AKS / Workload Identity for this? If so, can you link it here?
Sure. For the AKS team I've used the internal Microsoft bug reporting system. I have another concern related to the KEDA docs: looking at how to set up the TriggerAuthentication, it is not clear what the identityId field represents. The term "ID" is misleading, since in the case of workload identity it's not an ID but a client ID, and in the case of pod identity it's also not an ID.
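For context, the field in question sits under spec.podIdentity in a TriggerAuthentication. A minimal sketch for the workload identity case (the resource name and client ID placeholder are illustrative, not taken from this thread):

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: azure-servicebus-auth        # illustrative name
spec:
  podIdentity:
    provider: azure-workload
    # "identityId" is the client ID of the user-assigned managed identity
    # federated with the KEDA service account, which is what the comment
    # above argues the docs should state explicitly.
    identityId: <managed-identity-client-id>
```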
@v-shenoy
Closing this thread and opening a dedicated 2.9.1 issue.
Hello, we seem to be getting similar errors from AKS.
We're running … We have tried several different versions, including 2.9.1, and all exhibit the same error. TIA.
@jmos5156: actually we ran on …, so just making sure this is the same issue. I'm not sure 🤔
I think @jmos5156's issue isn't related to this one; it's related to this other issue.
I came to +1 this. I just updated my AKS cluster from 1.21 to 1.24.6, along with an AAD Pod Identity update. No ScaledJobs are kicking off, and here is the keda-operator log.
Are there any workarounds, by any chance? Thanks.
No, there isn't any workaround because it's related to a bug. This PR fixes it.
Are you sure that is the issue? It looks like there is another underlying issue as well w.r.t. the missing files.
To be clear, this is for the pod identity problem reported by @jmos5156 & @Dragonsong3k, which is tracked in #4026, not the issue @tshaiman is having.
@tomkerkhove thank you for the clarification.
@tomkerkhove / @JorTurFer: I have another important update. The good news: I have step-by-step instructions on how you can easily recreate the bug.
===> Now for the fun part. As I always suspected, something happens when KEDA pods need to move from one node to another due to system updates or node upgrades. This is the easiest way to recreate the issue, and it can also explain why it sometimes took days or more for the error to occur: we need to see the pods evicted from one node to another (see the drain sketch below).
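A minimal way to exercise that path on a multi-node cluster. The namespaces and the app=keda-operator label below are the defaults of the official KEDA install and the workload identity Helm chart (with the AKS add-on the webhook lives in kube-system); the node name is a placeholder:

```bash
# Find the node that hosts both the KEDA operator and the azure-wi-webhook pods.
kubectl get pods -n keda -o wide
kubectl get pods -n azure-workload-identity-system -o wide

# Drain that node so KEDA is rescheduled while the mutating webhook is unavailable.
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Check whether the rescheduled KEDA pod received the injected env vars and token volume.
kubectl describe pod -n keda -l app=keda-operator | grep -i azure
```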
WOW!! |
One thing, @tshaiman: how many instances of the workload identity webhook are you running?
I have reproduced the behaviour by draining a node in my 2-node cluster, with 1 instance of the workload identity webhook and no PDB. When I drain the node where KEDA and the WI webhook are running, the new KEDA pod is scheduled without the volume, because at the moment the KEDA pods are scheduled, the WI mutating webhook is down.
@JorTurFer: I deployed workload identity as an AKS add-on (using az aks update --enable-workload-identity), not with the Helm chart. This installs version 0.14, which isn't the latest (0.15 is the latest). Could that be the issue you've mentioned?
Very interesting discussion here. @JorTurFer, I had a thought regarding your point about the lack of PDBs for the workload identity webhook pods making them potentially unavailable during a node drain to mount the token volume onto the KEDA pods. The webhook is responsible for injecting the right environment variables as well as mounting the volume. How is it that the environment variables are present but not the volume?
Okay, let me share my theory (it could be crazy): the mutating webhook adds a projected service account token volume, which is mounted into the pod, and at that moment the pods also have some environment variables defined (maybe there are extra ones or they have different values); a rough sketch of all three is below. These 3 things are added by the mutating webhook, which means the deployment doesn't have them, only the pods, because it is the workload identity mutating webhook that adds them. As the lack of these things is the cause of KEDA's issues using managed identities (because the SDKs use them), if my theory is correct, this can be the root cause.
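A reconstruction of what the Azure Workload Identity webhook typically injects; the volume name, mount path, and variable names below come from the webhook's documented defaults and are assumptions, not copied from this cluster:

```yaml
# Sketch of the mutation applied to matching pods (IDs are placeholders).
spec:
  containers:
    - name: keda-operator
      env:
        - name: AZURE_CLIENT_ID
          value: <managed-identity-client-id>
        - name: AZURE_TENANT_ID
          value: <tenant-id>
        - name: AZURE_FEDERATED_TOKEN_FILE
          value: /var/run/secrets/azure/tokens/azure-identity-token
        - name: AZURE_AUTHORITY_HOST
          value: https://login.microsoftonline.com/
      volumeMounts:
        - name: azure-identity-token
          mountPath: /var/run/secrets/azure/tokens
          readOnly: true
  volumes:
    - name: azure-identity-token
      projected:
        sources:
          - serviceAccountToken:
              audience: api://AzureADTokenExchange
              expirationSeconds: 3600
              path: azure-identity-token
```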
@JorTurFer: Your theory is CONFIRMED! Here is the pod definition AFTER the eviction:
So as you can see, both the volume and the env variables are missing.
Yeah, I understood this. My question (which in hindsight is dumb) was how it was possible for only the volume mount to be missing while the env variables were present. Clearly the env variables are also absent, and the file path read by KEDA is "", which obviously has no mounted token. I think we should print the file path in the error message from our end when we are unable to read the token.
So, essentially, WI needs to have PDBs deployed as part of the AKS add-on, which is already done when it is installed via the Helm chart.
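For reference, a minimal PodDisruptionBudget of the kind the Helm chart ships might look like this; the namespace and label selector are the usual azure-workload-identity defaults and should be checked against the actual webhook pods:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: azure-wi-webhook
  namespace: azure-workload-identity-system   # kube-system when installed as the AKS add-on
spec:
  minAvailable: 1
  selector:
    matchLabels:
      azure-workload-identity.io/system: "true"   # assumed label; verify on the webhook pods
```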
So, now that we have pinpointed this issue to node drains / missing PDBs, do you have any ideas as to why it was initially only happening on Mariner? @tshaiman @JorTurFer
@v-shenoy: it wasn't only happening on Mariner; we came to the wrong conclusion because the trigger for the node drain was the switch to Mariner. My colleague @yehiam livneh has added the following suggestion for the KEDA team: "It's also a KEDA issue since they don't have any liveness probes. They don't know that they lost their identity; if they had liveness probes they would have restarted and everything would be okay." I agree with his statement.
Maybe we can somehow include a dynamic check for this in the health probes.
@JorTurFer, I see your point, but as a reminder: the pod does go healthy once it is restarted, since the volume mount works then. Since we already have the log plus Azure alerts on top of the logs, I think the real impact here would be to add a metric for such a scenario, so that Prometheus alerts can leverage it. Having said that, indeed this is not a KEDA error.
You are right, but let's say you have an error with workload identity (any persistent issue): KEDA won't start because it will be restarting all the time. If it's a transient error, nothing happens, but with other errors it could be a pain. If all your workloads use WI, this can make sense, but in the case where you have multiple triggers with and without WI, it doesn't. E.g. we have a product with 15-20 Prometheus triggers and only 1 case with WI integration for reading Azure EventHub topics; in our case this could be a problem, for example. The operator is the one that requests the metrics from the upstreams since 2.9, so we cannot restart it all the time unless we are 100% sure, because the metrics server won't respond to the HPA controller. Maybe a new parameter in the TriggerAuthentication for including the podIdentity in the health probes, but this is risky because one development team could impact other teams. In a multi-tenant scenario I see this use case better. @kedacore/keda-core-contributors WDYT?
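To make the proposal concrete, an opt-in could look something like the sketch below. The includeInHealthProbes field is hypothetical and does not exist in KEDA; it only illustrates the kind of per-TriggerAuthentication switch being discussed:

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: azure-servicebus-auth
spec:
  podIdentity:
    provider: azure-workload
    identityId: <managed-identity-client-id>
    # Hypothetical field (not part of KEDA): opt this identity into the
    # operator's health probes so a missing token file fails the probe.
    includeInHealthProbes: true
```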
I agree with this. This needs to be discussed thoroughly.
@v-shenoy @JorTurFer: another input that caused me to look at things differently, yet again.
So this makes me think that there is something wrong in the way the KEDA pod starts itself; it will happen in every scenario where the KEDA pod is evicted from a node, regardless of whether there is a PDB or not. I mean the fact that the KEDA pod does not have the env variables when it is evicted from one node pool to the other.
Since the workload identity team has shared the following information, I'm closing this ticket, as it's not a KEDA issue: "… the webhook uses the fail policy Ignore so as not to block pod admission. So if the KEDA pods run before the webhook is running, they will not be mutated. As mentioned in the thread before… we are adding an object selector and mandating a label on pods to enforce pod admission through the webhook." I will open another issue requesting a way to do health checks based on the existence of the token file; since the base image is distroless, we cannot use the "cat" command.
Report
While running KEDA 2.8.1 on AKS 1.24.6 in a few regions, we see that after some time (could be days / hours)
KEDA loses the managed identity and has many authentication errors.
Restarting the KEDA pod fixes the issue (see the sketch below).
We wonder whether this is a bug in workload identity, an Azure VM restart issue, or a KEDA issue.
We also want to see whether the KEDA pod health check can be integrated with those logs so it restarts itself in case those errors occur.
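For anyone needing the stop-gap, the restart is a plain rollout restart. The deployment name and namespace below are the defaults from the official KEDA install and may differ in your cluster:

```bash
# Assumes the default names from the KEDA Helm chart / YAML install.
kubectl rollout restart deployment keda-operator -n keda
```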
Expected Behavior
Actual Behavior
Steps to Reproduce the Problem
Logs from KEDA operator
KEDA Version
2.8.1
Kubernetes Version
1.24
Platform
Microsoft Azure
Scaler Details
Azure Service Bus
Anything else?