If pendingPodConditions isn't set, KEDA never detects any pending jobs #6157

Closed
Makeshift opened this issue Sep 12, 2024 · 7 comments
Labels
bug Something isn't working stale All issues that are marked as stale due to inactivity

Comments

@Makeshift
Makeshift commented Sep 12, 2024

Report

If pendingPodConditions is not set on a ScaledJob with strategy accurate, KEDA appears to always report the number of pending jobs as 0. As a result, if a job takes longer to start up than the trigger's polling interval, you end up with duplicate jobs.

Expected Behavior

When using scalingStrategy.strategy: accurate, I'd expect KEDA to correctly count the number of pods that have been scheduled but are not yet running, and to calculate the quantity to scale up by as QueueLength - RunningJobs - PendingJobs.
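
The expected arithmetic can be sketched as follows. This is a minimal illustration of the formula in the paragraph above, not KEDA's actual implementation; the function name `expected_scale_up` and the clamp at zero are assumptions added for the example.

```python
def expected_scale_up(queue_length: int, running_jobs: int, pending_jobs: int) -> int:
    """Jobs to create under the expectation QueueLength - RunningJobs - PendingJobs.

    Hypothetical helper: the clamp at zero is an assumption so that the
    result is never negative.
    """
    return max(0, queue_length - running_jobs - pending_jobs)


# 3 queued items and 3 jobs already pending: nothing new should be created.
print(expected_scale_up(queue_length=3, running_jobs=0, pending_jobs=3))

# Same state, but with the pending jobs miscounted as 0: 3 duplicates are requested.
print(expected_scale_up(queue_length=3, running_jobs=0, pending_jobs=0))
```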

Actual Behavior

If pendingPodConditions is not set in your ScaledJob, KEDA will always calculate the number of pending jobs as being 0, resulting in duplicate jobs if your job has a long startup time.

Steps to Reproduce the Problem

  1. Define a ScaledJob with:
  • the following settings:
spec:
  pollingInterval: 5
  scalingStrategy:
    strategy: accurate
  • a pod spec (in jobTargetRef) that will spend at least 10 seconds in the pending state
  2. Add 3 items to the trigger so that KEDA scales up to 3 replicas. These 3 replicas should stay in the 'pending' state, but the KEDA operator will log "Number of pending Jobs": 0.

  3. After 5 seconds (the next poll), KEDA will attempt to launch 3 additional replicas (bringing the total to 6) because it does not see the pending jobs.

  4. Reset and modify your ScaledJob to add all possible pendingPodConditions:

spec:
  pollingInterval: 5
  scalingStrategy:
    strategy: accurate
    pendingPodConditions:
      - Ready
      - PodReadyToStartContainers
      - ContainersReady
      - Initialized
      - PodScheduled
  5. Add 3 items to the trigger so that KEDA scales up to 3 replicas. These 3 replicas should stay in the 'pending' state, and KEDA should log:
"Number of pending Jobs": 4
No need to create jobs - all requested jobs already exist
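
The effect of these steps can be reproduced with a toy simulation. It is only a sketch under stated assumptions: no job leaves the pending state during the simulated window, the queue length stays constant, and each poll creates `max(0, queue_length - running - pending_seen)` jobs. The function `simulate_polls` and its parameters are hypothetical and do not correspond to anything in KEDA.

```python
def simulate_polls(queue_length: int, polls: int, count_pending: bool) -> int:
    """Return the total number of jobs created across `polls` polling cycles.

    Assumptions (not taken from KEDA's code): jobs never leave the pending
    state during the window, and the queue length does not change.
    """
    running = 0
    pending_actual = 0
    created_total = 0
    for _ in range(polls):
        # When count_pending is False, the scaler always sees 0 pending jobs.
        pending_seen = pending_actual if count_pending else 0
        to_create = max(0, queue_length - running - pending_seen)
        pending_actual += to_create
        created_total += to_create
    return created_total


# Two polls, 3 queued items, pending jobs invisible: 3 + 3 jobs are created.
print(simulate_polls(queue_length=3, polls=2, count_pending=False))  # 6

# Two polls, 3 queued items, pending jobs counted: only the first 3 are created.
print(simulate_polls(queue_length=3, polls=2, count_pending=True))  # 3
```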

Logs from KEDA operator

15s poll time with pendingPodConditions unset, comments mine:

2024-09-12T14:48:13Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of running Jobs": 0}
2024-09-12T14:48:13Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of pending Jobs": 0}
# 3 items submitted to queue - 3 jobs created and pending
2024-09-12T14:48:13Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Effective number of max jobs": 3}
2024-09-12T14:48:13Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of jobs": 3}
2024-09-12T14:48:13Z	INFO	scaleexecutor	Created jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of jobs": 3}
# KEDA claims 0 pending jobs
2024-09-12T14:48:28Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of running Jobs": 0}
2024-09-12T14:48:28Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of pending Jobs": 0}
# No new items added, previous 3 jobs still pending, KEDA creates 3 more jobs
2024-09-12T14:48:28Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Effective number of max jobs": 6}
2024-09-12T14:48:28Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of jobs": 3}
2024-09-12T14:48:28Z	INFO	scaleexecutor	Created jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of jobs": 3}
# The first 3 jobs are finally now running, KEDA still can't see the most recent 3 jobs that are pending
2024-09-12T14:48:43Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of running Jobs": 3}
2024-09-12T14:48:43Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of pending Jobs": 0}
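
To compare polls at a glance, the excerpt above can be condensed into one record per timestamp. The following is a rough sketch that assumes the operator log lines are tab-separated with a trailing JSON payload, as they appear in this excerpt; `summarize` and `SAMPLE` are hypothetical names, and only three of the lines above are reproduced.

```python
import json

# Shared part of the JSON payload, copied from the log excerpt.
PREFIX = '{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", '
HEAD = "2024-09-12T14:48:28Z\tINFO\tscaleexecutor\t"

SAMPLE = "\n".join([
    HEAD + "Scaling Jobs\t" + PREFIX + '"Number of running Jobs": 0}',
    HEAD + "Scaling Jobs\t" + PREFIX + '"Number of pending Jobs": 0}',
    "# No new items added, previous 3 jobs still pending, KEDA creates 3 more jobs",
    HEAD + "Created jobs\t" + PREFIX + '"Number of jobs": 3}',
])


def summarize(log_text: str) -> dict:
    """Collect running, pending, and created job counts per poll timestamp."""
    summary = {}
    for line in log_text.splitlines():
        # Skip blank lines and the hand-written "#" annotations.
        if not line.strip() or line.startswith("#"):
            continue
        timestamp, _level, _logger, message, payload = line.split("\t", 4)
        fields = json.loads(payload)
        poll = summary.setdefault(timestamp, {})
        if "Number of running Jobs" in fields:
            poll["running"] = fields["Number of running Jobs"]
        if "Number of pending Jobs" in fields:
            poll["pending"] = fields["Number of pending Jobs"]
        if message == "Created jobs":
            poll["created"] = fields["Number of jobs"]
    return summary


print(summarize(SAMPLE))
```

A poll that reports 0 running and 0 pending jobs right after an earlier poll created jobs is the pattern described in this report.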

15s poll time with pendingPodConditions set to all conditions, only part of the log but shows it's working correctly:

2024-09-12T14:42:13Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "Number of pending Jobs": 4}
2024-09-12T14:42:13Z	INFO	scaleexecutor	No need to create jobs - all requested jobs already exist	{"scaledJob.Name": "qj-editionsendconsumer-staging", "scaledJob.Namespace": "staging", "jobs": 0}

KEDA Version

2.15.1

Kubernetes Version

1.28

Platform

Amazon Web Services

Scaler Details

SQS FIFO queue

Anything else?

I dug into this a tiny bit and I think the culprit is here. It looks like this function might be returning that a pod is running or complete when it's actually pending.

My conclusion that this is a bug is based on how the default behaviour is described here:

Default behavior - Job that have not finished yet and the underlying pod is either not running or has not been completed yet
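
One reading of that documented default can be written as a small predicate. This is a sketch of the sentence quoted above, not KEDA's code: the helper `is_pending_by_default` is hypothetical, and it assumes that "completed" maps to the Kubernetes pod phases Succeeded and Failed.

```python
def is_pending_by_default(job_finished: bool, pod_phase: str) -> bool:
    """Interpretation of the documented default for a pending job.

    A job counts as pending when it has not finished and its pod is neither
    running nor completed. Treating "Succeeded" and "Failed" as completed is
    an assumption made for this example.
    """
    if job_finished:
        return False
    return pod_phase not in {"Running", "Succeeded", "Failed"}


# An unfinished job whose pod is still in the Pending phase.
print(is_pending_by_default(job_finished=False, pod_phase="Pending"))

# An unfinished job whose pod is already running.
print(is_pending_by_default(job_finished=False, pod_phase="Running"))
```

Under this reading, the three jobs in the reproduction steps would be counted as pending even without pendingPodConditions, which is why the observed count of 0 looks like a bug.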
@Makeshift Makeshift added the bug Something isn't working label Sep 12, 2024
@quiqueg
quiqueg commented Oct 11, 2024

I noticed the same behavior and can confirm that adding pendingPodConditions fixed it for us.

@andretibolaintelipost
andretibolaintelipost commented Oct 12, 2024

Adding pendingPodConditions also fixed it for me, but it took me a while to get there. I resorted to a large polling interval at first, but that was affecting end users. Only after coming across this issue did I finally manage to scale without duplicates and with a short polling period.

stale bot commented Dec 12, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Dec 12, 2024
@Makeshift
Author

Bump.

stale bot commented Dec 28, 2024

This issue has been automatically closed due to inactivity.

@stale stale bot closed this as completed Dec 28, 2024
@github-project-automation github-project-automation bot moved this from To Triage to Ready To Ship in Roadmap - KEDA Core Dec 28, 2024
@Makeshift
Author

Makeshift commented Dec 28, 2024

That is literally not how inactivity works. This is still an issue and imo it is quite important that people are able to find this so it doesn't bite them and waste a bunch of their time.

@Makeshift
Author

I'm not sure who to ping for this, but @zroubalik could this be reopened and marked non-stale, please?

Projects
Status: Ready To Ship
Development

No branches or pull requests

3 participants