Crash during startup in reconciler #4413

Closed
aberres opened this issue Mar 30, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@aberres

aberres commented Mar 30, 2023

Report

Let me start with a heads-up: this bug report is best-effort. KEDA was in a crash loop, so getting things back up was more important than figuring out what exactly happened. I will try to add all the details I have. If it is not enough, feel free to close.

I cannot tell exactly what triggered it, but after a CI run Keda started to crash loop. Find the relevant part of the log below.
No new scaled objects were introduced. The very same config has been running for quite some time.

Looking at the scaled object, we can see that active and fallback are Unknown instead of True or False.

❯ kubectl get scaledobjects.keda.sh 
NAME                                         SCALETARGETKIND      SCALETARGETNAME                              MIN   MAX   TRIGGERS   AUTHENTICATION   READY   ACTIVE    FALLBACK   AGE
...
master-c96-planningsvc-worker-interactive    apps/v1.Deployment   master-c96-planningsvc-worker-interactive    0     10    cron                        True    Unknown   Unknown    2d2h
...
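
For anyone wanting to spot this state programmatically, here is a minimal Go sketch that flags conditions reported as Unknown. The `Condition` type and the `unknownConditions` helper are hypothetical simplifications for illustration, not KEDA's API types.

```go
package main

import "fmt"

// Condition is a hypothetical, simplified view of one status condition.
type Condition struct {
	Type   string // e.g. "Ready", "Active", "Fallback"
	Status string // "True", "False" or "Unknown"
}

// unknownConditions returns the types of all conditions whose status is "Unknown".
func unknownConditions(conds []Condition) []string {
	var out []string
	for _, c := range conds {
		if c.Status == "Unknown" {
			out = append(out, c.Type)
		}
	}
	return out
}

func main() {
	// Mirrors the READY / ACTIVE / FALLBACK columns shown above.
	conds := []Condition{
		{Type: "Ready", Status: "True"},
		{Type: "Active", Status: "Unknown"},
		{Type: "Fallback", Status: "Unknown"},
	}
	fmt.Println(unknownConditions(conds)) // prints [Active Fallback]
}
```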

As obviously something was wrong, I kubectl deleted the scaled object and recreated it. Keda happily started right away.

Expected Behavior

Keda does not crash when "something" (whatever something is) is wrong with a metric.

Actual Behavior

Keda crashes during startup.

Steps to Reproduce the Problem

I have no idea so far. It only happened once.

Logs from KEDA operator

Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"master-c96-planningsvc-worker-interactive","namespace":"mpp-backend-master-c96"}, "namespace": "mpp-backend-master-c96", "name": "master-c96-planningsvc-worker-interactive", "reconcileID": "23c24efb-96df-482e-acc7-ce3b2dd36986"}

panic: runtime error: invalid memory address or nil pointer dereference [recovered]

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2f62194]

goroutine 467 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119 +0x1fa
panic({0x34bf3a0, 0x63ea940})
	/usr/local/go/src/runtime/panic.go:884 +0x212
github.com/kedacore/keda/v2/pkg/scaling/resolver.ResolveScaleTargetPodSpec({0x434be30, 0xc000fa4510}, {0x4360950, 0xc0008deea0}, {0x3a94540?, 0xc000ec0a00})
	/workspace/pkg/scaling/resolver/scale_resolvers.go:73 +0xd4
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).performGetScalersCache(0xc000805110, {0x434be30, 0xc000fa4510}, {0xc0000551d0, 0x4d}, {0x3a94540, 0xc000ec0a00}, 0xc000e3cf00, {0x0, 0x0}, ...)
	/workspace/pkg/scaling/scale_handler.go:347 +0x6e5
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).GetScalersCache(0xc000ec0600?, {0x434be30, 0xc000fa4510}, {0x3a94540, 0xc000ec0a00})
	/workspace/pkg/scaling/scale_handler.go:273 +0xf6
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).getScaledObjectMetricSpecs(0xc000f226c0, {0x434be30, 0xc000fa4510}, {{0x4353fc0?, 0xc000fa4540?}, 0xc001260cc0?}, 0xc000ec0600)
	/workspace/controllers/keda/hpa.go:200 +0xda
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).newHPAForScaledObject(0xc000f226c0, {0x434be30?, 0xc000fa4510?}, {{0x4353fc0?, 0xc000fa4540?}, 0x3a221a0?}, 0xc000ec0600, 0xc0012395f0)
	/workspace/controllers/keda/hpa.go:74 +0x66
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).updateHPAIfNeeded(0xc000f226c0, {0x434be30, 0xc000fa4510}, {{0x4353fc0?, 0xc000fa4540?}, 0xc000fa4510?}, 0xc000ec0600, 0xc00087ba40, 0xc001580580?)
	/workspace/controllers/keda/hpa.go:152 +0x7b
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).ensureHPAForScaledObjectExists(0xc000f226c0, {0x434be30, 0xc000fa4510}, {{0x4353fc0?, 0xc000fa4540?}, 0x4353fc0?}, 0xc000ec0600, 0x0?)
	/workspace/controllers/keda/scaledobject_controller.go:431 +0x238
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).reconcileScaledObject(0xc000f226c0?, {0x434be30, 0xc000fa4510}, {{0x4353fc0?, 0xc000fa4540?}, 0xc0011bae10?}, 0xc000ec0600)
	/workspace/controllers/keda/scaledobject_controller.go:229 +0x1c9
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).Reconcile(0xc000f226c0, {0x434be30, 0xc000fa4510}, {{{0xc000a7b740?, 0x10?}, {0xc0011bae10?, 0x40da87?}}})
	/workspace/controllers/keda/scaledobject_controller.go:175 +0x526
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x434be30?, {0x434be30?, 0xc000fa4510?}, {{{0xc000a7b740?, 0x32ae080?}, {0xc0011bae10?, 0x0?}}})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:122 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00080b720, {0x434bd88, 0xc00109e180}, {0x3619020?, 0xc000e4a4e0?})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:323 +0x38f
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00080b720, {0x434bd88, 0xc00109e180})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:274 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:231 +0x333
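
The trace shows the panic being observed in the reconciler and then raised again, which takes the whole operator process down. As an illustration of the expected behavior, here is a minimal Go sketch that converts a panic in one object's reconcile step into an ordinary error. `reconcileFunc` and `safeReconcile` are hypothetical names, and this is not how KEDA or controller-runtime is implemented.

```go
package main

import "fmt"

// reconcileFunc is a hypothetical stand-in for a per-object reconcile step.
type reconcileFunc func(name string) error

// safeReconcile runs fn and turns a panic into a returned error, so that one
// broken object cannot crash the caller's loop.
func safeReconcile(name string, fn reconcileFunc) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("reconcile %q panicked: %v", name, r)
		}
	}()
	return fn(name)
}

func main() {
	type target struct{ kind string }
	// "broken" maps to a nil pointer to simulate the failure mode.
	objects := map[string]*target{"healthy": {kind: "Deployment"}, "broken": nil}

	for _, name := range []string{"healthy", "broken"} {
		err := safeReconcile(name, func(n string) error {
			kind := objects[n].kind // nil pointer dereference for "broken"
			fmt.Println("reconciled", n, "targeting", kind)
			return nil
		})
		if err != nil {
			fmt.Println("skipping object:", err)
		}
	}
}
```

With a wrapper like this, the healthy object is still processed and the broken one surfaces as an error that can be logged and retried.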

KEDA Version

2.10.0

Kubernetes Version

1.25

Platform

None

Scaler Details

cron

Anything else?

No response

@aberres aberres added the bug Something isn't working label Mar 30, 2023
@JorTurFer
Member

Hi,
KEDA definitely shouldn't panic. The panic is raised in github.com/kedacore/keda/v2/pkg/scaling/resolver.ResolveScaleTargetPodSpec, so it looks like something is wrong with the deployment and the scaledobject. Could you share both, please (deployment and scaledobject)?
You can of course anonymize them by replacing all sensitive data, but please don't remove any fields, because this looks related to the spec.

@JorTurFer JorTurFer mentioned this issue Mar 30, 2023
@erez-levi

Hi, I had the same issue in my environment. I also kubectl deleted the scaled object and recreated it. I'm running KEDA version 2.10.0. Was this issue fixed in 2.10.1? Did I understand correctly? @JorTurFer

@JorTurFer
Member

JorTurFer commented Apr 25, 2023

Hi @erez-levi ,
No, this issue is still active. The OP didn't share the required information, so we couldn't continue with it (that's why it's still open).

@aberres
Author

aberres commented Apr 25, 2023

Sorry, forgot about it. I can look into it later.

But just to make sure there are no misunderstandings: the resources have been unchanged for quite some time (I'd say a year at least), and after the delete/recreate they continue to work as expected. So I am not sure there is really an issue with the resources. If there were, I'd expect the situation to happen more often.

@AndreOrlovE

Hi there! On behalf of @erez-levi , here are deployment and scaled object configs (some data were removed):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment-name
  namespace: myns
spec:
  progressDeadlineSeconds: 600
  replicas: 34
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    spec:
      containers:
      - image: gcr.io/image_name:1.0.0
        imagePullPolicy: IfNotPresent
        name: container-name
        ports:
        - containerPort: 8888
          name: http
          protocol: TCP
        resources:
          limits:
            cpu: "6"
            memory: 24Gi
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /volume1
          name: nfs
          subPath: img
      restartPolicy: Always
      volumes:
      - name: nfs
        persistentVolumeClaim:
          claimName: nfs-server
status:
  availableReplicas: 34
  collisionCount: 1
  conditions:
  observedGeneration: 4250
  readyReplicas: 34
  replicas: 34
  updatedReplicas: 34

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  annotations:
  finalizers:
  - finalizer.keda.sh
  generation: 3
  labels:
    scaledobject.keda.sh/name: my-deployment-name
  name: my-deployment-name
  namespace: myns
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          policies:
          - periodSeconds: 120
            type: Pods
            value: 1
          selectPolicy: Max
          stabilizationWindowSeconds: 300
        scaleUp:
          policies:
          - periodSeconds: 120
            type: Pods
            value: 1
          - periodSeconds: 120
            type: Percent
            value: 10
          selectPolicy: Max
          stabilizationWindowSeconds: 300
      name: keda-my-deployment-name
  cooldownPeriod: 0
  maxReplicaCount: 100
  minReplicaCount: 3
  pollingInterval: 30
  scaleTargetRef:
    name: my-deployment-name
  triggers:
  - metadata:
      activationThreshold: "0"
      query: <some_metrics>
      serverAddress: http://prometheus-server:9090
      threshold: "0.01"
    type: prometheus
status:
  conditions:
  - message: ScaledObject is defined correctly and is ready for scaling
    reason: ScaledObjectReady
    status: "True"
    type: Ready
  - message: Scaling is performed because triggers are active
    reason: ScalerActive
    status: "True"
    type: Active
  - message: No fallbacks are active on this scaled object
    reason: NoFallbackFound
    status: "False"
    type: Fallback
  externalMetricNames:
  - s0-prometheus-prometheus
  health:
    s0-prometheus-prometheus:
      numberOfFailures: 0
      status: Happy
  hpaName: keda-my-deployment-name
  lastActiveTime: "2023-04-25T10:56:53Z"
  originalReplicaCount: 15
  scaleTargetGVKR:
    group: apps
    kind: Deployment
    resource: deployments
    version: v1
  scaleTargetKind: apps/v1.Deployment

the errors:

	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2f62194]

goroutine 476 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119 +0x1fa
panic({0x34bf3a0, 0x63ea940})
	/usr/local/go/src/runtime/panic.go:884 +0x212
github.com/kedacore/keda/v2/pkg/scaling/resolver.ResolveScaleTargetPodSpec({0x434be30, 0xc001725a40}, {0x4360950, 0xc0013a6d80}, {0x3a94540?, 0xc0013b8000})
	/workspace/pkg/scaling/resolver/scale_resolvers.go:73 +0xd4
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).performGetScalersCache(0xc000533c70, {0x434be30, 0xc001725a40}, {0xc001689ad0, 0x2c}, {0x3a94540, 0xc0013b8000}, 0xc0009baf00, {0x0, 0x0}, ...)
	/workspace/pkg/scaling/scale_handler.go:347 +0x6e5
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).GetScalersCache(0xc00172e200?, {0x434be30, 0xc001725a40}, {0x3a94540, 0xc0013b8000})
	/workspace/pkg/scaling/scale_handler.go:273 +0xf6
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).getScaledObjectMetricSpecs(0xc000e2bc80, {0x434be30, 0xc001725a40}, {{0x4353fc0?, 0xc001725a70?}, 0x17?}, 0xc00172e200)
	/workspace/controllers/keda/hpa.go:200 +0xda
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).newHPAForScaledObject(0xc000e2bc80, {0x434be30?, 0xc001725a40?}, {{0x4353fc0?, 0xc001725a70?}, 0x4374d98?}, 0xc00172e200, 0xc001e7b5f0)
	/workspace/controllers/keda/hpa.go:74 +0x66
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).updateHPAIfNeeded(0xc000e2bc80, {0x434be30, 0xc001725a40}, {{0x4353fc0?, 0xc001725a70?}, 0xc001725a40?}, 0xc00172e200, 0xc00157fa40, 0xc000922760?)
	/workspace/controllers/keda/hpa.go:152 +0x7b
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).ensureHPAForScaledObjectExists(0xc000e2bc80, {0x434be30, 0xc001725a40}, {{0x4353fc0?, 0xc001725a70?}, 0x4353fc0?}, 0xc00172e200, 0x0?)
	/workspace/controllers/keda/scaledobject_controller.go:431 +0x238
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).reconcileScaledObject(0xc000e2bc80?, {0x434be30, 0xc001725a40}, {{0x4353fc0?, 0xc001725a70?}, 0xc000922740?}, 0xc00172e200)
	/workspace/controllers/keda/scaledobject_controller.go:229 +0x1c9
github.com/kedacore/keda/v2/controllers/keda.(*ScaledObjectReconciler).Reconcile(0xc000e2bc80, {0x434be30, 0xc001725a40}, {{{0xc0017a1fe0?, 0x10?}, {0xc000922740?, 0x40da87?}}})
	/workspace/controllers/keda/scaledobject_controller.go:175 +0x526
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x434be30?, {0x434be30?, 0xc001725a40?}, {{{0xc0017a1fe0?, 0x32ae080?}, {0xc000922740?, 0x0?}}})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:122 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000da6140, {0x434bd88, 0xc000b36840}, {0x3619020?, 0xc0009de360?})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:323 +0x38f
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000da6140, {0x434bd88, 0xc000b36840})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:274 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:231 +0x333

--
2023-04-25T00:21:23Z	INFO	Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"w-panoptic-segmentation-t4","namespace":"pred"}, "namespace": "pred", "name": "w-panoptic-segmentation-t4", "reconcileID": "81b99eee-c87d-44d2-9e83-5b12deba73f7"}
--
"panic: runtime error: invalid memory address or nil pointer dereference [recovered]"

@tomkerkhove tomkerkhove moved this from To Triage to To Do in Roadmap - KEDA Core Apr 26, 2023
@zroubalik
Member

Thanks for reporting, this seems like a duplicate of: #4389

@djsly

djsly commented May 15, 2023

We just got hit by this again yesterday: one random ScaledObject started crash-looping KEDA, rendering ALL scaling operations non-operational.

The weird part is that it's never the same ScaledObject, and the affected ScaledObject has been in the cluster for a long time.

@zroubalik
Member

Closing this in favor of #4389; please continue there.

@github-project-automation github-project-automation bot moved this from To Do to Ready To Ship in Roadmap - KEDA Core Jun 8, 2023