KEDA identity is not authenticating to Service Bus after a few hours #3977
Comments
Hum... weird.
@tshaiman Is this consistently reproducible? Did you set a custom token expiration period? If this error aligns with the time the token is about to expire, then maybe something is wrong with the way we're reading the token. If it's not, then there is some error in the updated token being injected into the pod. But I can't tell why this would happen, @JorTurFer.
@v-shenoy: this was not happening more than once per region we are deployed in (23 regions total), but the trigger was moving from Ubuntu VMs to Mariner-based images on AKS. It is also important to know (even if it may be irrelevant) that we ran with replicas=2 and one pod was always idle (an active-passive kind of setup), so they did not share the workload between them.
@v-shenoy: Karma is a bitch. I was just speaking about not reproducing this item, and then I saw it again in many of our regions after the pods had been up for around 10 hours. We have decided to increase the token expiration to the max value of 86400 and forcibly restart KEDA every hour.
I haven't talked to people on the workload identity team myself before, but I think @aramase can help.
On an unrelated but tertiary issue about workload identity: we need to refactor some parts there.
I think so. With an issue open we could get help from contributors; without it, we will be the ones who have to do it.
@JorTurFer: I want to become a contributor, but my Go has gotten a bit rusty over the last 2 years (I moved to other languages).
You'll never know if you don't try xD. We are here to help if you have any questions or doubts :)
@v-shenoy / @JorTurFer: I have some updates on a new experiment done on KEDA 2.9.1, running with Replicas: 2.
@v-shenoy this is inaccurate.
@v-shenoy: well, 2 different docs from the same provider, not surprising. Anyhow, I don't think that is the root cause of the "file not found" issue in this bug, as we are running without this label and it works for a few hours and then loses the file.
I think the docs on … Yup, I don't think these are related, just wanted to clarify.
The only thing I wonder here is: is this somehow specific to AKS and Mariner, or does this happen for all OSes?
@v-shenoy: only AKS on Mariner. It never happened when we ran on Ubuntu.
That's interesting. This makes it more of an AKS issue than KEDA, doesn't it?
Absolutely. I'm trying everything now 😔
Is there an issue created on AKS / Workload Identity for this? If so, can you link it here?
Sure. For the AKS team I've used the internal Microsoft bug reporting system. I have another concern related to the KEDA docs: looking at how to set up the TriggerAuthentication, it is not clear what the identityId field represents. The term "ID" is misleading, since in the case of workload identity it's not an ID but a client ID, and in the case of pod identity it's also not an ID.
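For context, the field in question sits under spec.podIdentity in a TriggerAuthentication. A minimal sketch for the workload identity case (the resource name and client ID placeholder are illustrative, not taken from this thread):

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: azure-servicebus-auth        # illustrative name
spec:
  podIdentity:
    provider: azure-workload
    # "identityId" is the client ID of the user-assigned managed identity
    # federated with the KEDA service account, which is what the comment
    # above argues the docs should state explicitly.
    identityId: <managed-identity-client-id>
```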
@v-shenoy
Closing this thread and opening a dedicated 2.9.1 issue.
Hello, we seem to be getting similar errors from AKS.
We're running … We have tried several different versions, including 2.9.1, and all exhibit the same error. TIA.
@jmos5156: actually we ran on …, so just making sure this is the same issue. I'm not sure 🤔
I think @jmos5156's issue isn't related to this one; it's related to this other issue.
I came to +1 this. I just updated my AKS cluster from 1.21 to 1.24.6, along with an AAD Pod Identity update. No ScaledJobs are kicking off, and here is the keda-operator log.
Are there any workarounds, by any chance? Thanks.
No, there isn't any workaround because it's related to a bug. This PR fixes it.
Are you sure that is the issue? It looks like there is another underlying issue as well w.r.t. the missing files.
To be clear, this is for the pod identity problem reported by @jmos5156 & @Dragonsong3k, which is tracked in #4026, not the issue @tshaiman is having.
@tomkerkhove thank you for the clarification.
@tomkerkhove / @JorTurFer: I have another important update. The good news: I have step-by-step instructions on how you can easily recreate the bug.
===> Now for the fun part. As I always suspected, something happens when KEDA pods need to move from one node to another due to system updates or node upgrades. This is the easiest way to recreate the issue, and it can also explain why it sometimes took days or more for the error to occur: we need to see the pods evicted from one node to another (see the drain sketch below).
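A minimal way to exercise that path on a multi-node cluster. The namespaces and the app=keda-operator label below are the defaults of the official KEDA install and the workload identity Helm chart (with the AKS add-on the webhook lives in kube-system); the node name is a placeholder:

```bash
# Find the node that hosts both the KEDA operator and the azure-wi-webhook pods.
kubectl get pods -n keda -o wide
kubectl get pods -n azure-workload-identity-system -o wide

# Drain that node so KEDA is rescheduled while the mutating webhook is unavailable.
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Check whether the rescheduled KEDA pod received the injected env vars and token volume.
kubectl describe pod -n keda -l app=keda-operator | grep -i azure
```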
WOW!! |
One thing, @tshaiman: how many instances of the workload identity webhook are you running?
I have reproduced the behaviour by draining a node in my 2-node cluster, with 1 instance of the workload identity webhook and no PDB. When I drain the node where KEDA and the WI webhook are running, the new KEDA pod is scheduled without the volume, because at the moment the KEDA pods are scheduled, the WI mutating webhook is down.
@JorTurFer: I deployed workload identity as an AKS add-on (using az aks update --enable-workload-identity), not with the Helm chart. This installs version 0.14, which isn't the latest (0.15 is the latest). Could that be the issue you've mentioned?
Very interesting discussion here. @JorTurFer, I had a thought regarding your point about the lack of PDBs for the workload identity webhook pods making them potentially unavailable during a node drain to mount the token volume onto the KEDA pods. The webhook is responsible for injecting the right environment variables as well as mounting the volume. How is it that the environment variables are present but not the volume?
Okay, let me share my theory (it could be crazy): the mutating webhook adds a projected service account token volume, which is mounted into the pod, and at that moment the pods also have some environment variables defined (maybe there are extra ones or they have different values); a rough sketch of all three is below. These 3 things are added by the mutating webhook, which means the deployment doesn't have them, only the pods, because it is the workload identity mutating webhook that adds them. As the lack of these things is the cause of KEDA's issues using managed identities (because the SDKs use them), if my theory is correct, this can be the root cause.
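A reconstruction of what the Azure Workload Identity webhook typically injects; the volume name, mount path, and variable names below come from the webhook's documented defaults and are assumptions, not copied from this cluster:

```yaml
# Sketch of the mutation applied to matching pods (IDs are placeholders).
spec:
  containers:
    - name: keda-operator
      env:
        - name: AZURE_CLIENT_ID
          value: <managed-identity-client-id>
        - name: AZURE_TENANT_ID
          value: <tenant-id>
        - name: AZURE_FEDERATED_TOKEN_FILE
          value: /var/run/secrets/azure/tokens/azure-identity-token
        - name: AZURE_AUTHORITY_HOST
          value: https://login.microsoftonline.com/
      volumeMounts:
        - name: azure-identity-token
          mountPath: /var/run/secrets/azure/tokens
          readOnly: true
  volumes:
    - name: azure-identity-token
      projected:
        sources:
          - serviceAccountToken:
              audience: api://AzureADTokenExchange
              expirationSeconds: 3600
              path: azure-identity-token
```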
@JorTurFer: Your theory is CONFIRMED! Here is the pod definition AFTER the eviction:
So as you can see, both the volume and the env variables are missing.
Yeah, I understood this. My question (which in hindsight is dumb) was how it was possible for only the volume mount to be missing while the env variables were present. Clearly the env variables are also absent, and the file path read by KEDA is "", which obviously has no mounted token. I think we should print the file path in the error message from our end when we are unable to read the token.
So, essentially, WI needs to have PDBs deployed as part of the AKS add-on, which is already done when it is installed via the Helm chart.
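For reference, a minimal PodDisruptionBudget of the kind the Helm chart ships might look like this; the namespace and label selector are the usual azure-workload-identity defaults and should be checked against the actual webhook pods:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: azure-wi-webhook
  namespace: azure-workload-identity-system   # kube-system when installed as the AKS add-on
spec:
  minAvailable: 1
  selector:
    matchLabels:
      azure-workload-identity.io/system: "true"   # assumed label; verify on the webhook pods
```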
So, now that we have pinpointed this issue to node drains / missing PDBs, do you have any ideas as to why it was initially only happening on Mariner? @tshaiman @JorTurFer
@v-shenoy: it wasn't only happening on Mariner; we came to the wrong conclusion because the trigger for the node drain was the switch to Mariner. My colleague @yehiam livneh has added the following suggestion for the KEDA team: "It's also a KEDA issue since they don't have any liveness probes. They don't know that they lost their identity; if they had liveness probes they would have restarted and everything would be okay." I agree with his statement.
Maybe we can somehow include a dynamic check for this in the health probes.
@JorTurFer, I see your point, but as a reminder: the pod does go healthy once it is restarted, since the volume mount works then. Since we already have the log plus Azure alerts on top of the logs, I think the real impact here would be to add a metric for such a scenario, so that Prometheus alerts can leverage it. Having said that, indeed this is not a KEDA error.
You are right, but let's say you have an error with workload identity (any persistent issue): KEDA won't start because it will be restarting all the time. If it's a transient error, nothing happens, but with other errors it could be a pain. If all your workloads use WI, this can make sense, but in the case where you have multiple triggers with and without WI, it doesn't. E.g. we have a product with 15-20 Prometheus triggers and only 1 case with WI integration for reading Azure EventHub topics; in our case this could be a problem, for example. The operator is the one that requests the metrics from the upstreams since 2.9, so we cannot restart it all the time unless we are 100% sure, because the metrics server won't respond to the HPA controller. Maybe a new parameter in the TriggerAuthentication for including the podIdentity in the health probes, but this is risky because one development team could impact other teams. In a multi-tenant scenario I see this use case better. @kedacore/keda-core-contributors WDYT?
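To make the proposal concrete, an opt-in could look something like the sketch below. The includeInHealthProbes field is hypothetical and does not exist in KEDA; it only illustrates the kind of per-TriggerAuthentication switch being discussed:

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: azure-servicebus-auth
spec:
  podIdentity:
    provider: azure-workload
    identityId: <managed-identity-client-id>
    # Hypothetical field (not part of KEDA): opt this identity into the
    # operator's health probes so a missing token file fails the probe.
    includeInHealthProbes: true
```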
I agree with this. This needs to be discussed thoroughly.
@v-shenoy @JorTurFer: another input that caused me to look at things differently, yet again.
So this makes me think that there is something wrong in the way the KEDA pod starts itself; it will happen in every scenario where the KEDA pod is evicted from a node, regardless of whether there is a PDB or not. I mean the fact that the KEDA pod does not have the env variables when it is evicted from one node pool to the other.
Since the workload identity team has shared the following information, I'm closing this ticket, as it's not a KEDA issue: "… the webhook uses the fail policy Ignore so as not to block pod admission. So if the KEDA pods run before the webhook is running, they will not be mutated. As mentioned in the thread before… we are adding an object selector and mandating a label on pods to enforce pod admission through the webhook." I will open another issue requesting a way to do health checks based on the existence of the token file; since the base image is distroless, we cannot use the "cat" command.
Report
While running KEDA 2.8.1 on AKS 1.24.6 in a few regions, we see that after some time (could be days / hours)
KEDA loses the managed identity and has many authentication errors.
Restarting the KEDA pod fixes the issue (see the sketch below).
We wonder whether this is a bug in workload identity, an Azure VM restart issue, or a KEDA issue.
We also want to see whether the KEDA pod health check can be integrated with those logs so it restarts itself in case those errors occur.
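For anyone needing the stop-gap, the restart is a plain rollout restart. The deployment name and namespace below are the defaults from the official KEDA install and may differ in your cluster:

```bash
# Assumes the default names from the KEDA Helm chart / YAML install.
kubectl rollout restart deployment keda-operator -n keda
```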
Expected Behavior
Actual Behavior
Steps to Reproduce the Problem
Logs from KEDA operator
KEDA Version
2.8.1
Kubernetes Version
1.24
Platform
Microsoft Azure
Scaler Details
Azure Service Bus
Anything else?