Operator deletes all running jobs whenever the relevant ScaledJob object gets updated, and recreates them based on the new ScaledJob spec #2098
Comments
@zroubalik suggested that this PR might be relevant to this change (2.2 -> 2.4).
Worth mentioning that on v2.2 the log looks the same: "Deleting jobs owned by the previous version of the scaledJob".
Hi @etamarw, it looks like it happens at this point: https://github.com/kedacore/keda/blame/61740daffac2194dac91d76bb0526b2660e0acdd/controllers/keda/scaledjob_controller.go#L137 It might be possible to add configuration for this part, however I need to understand the context first. @zroubalik, do you know the context for why it deletes the running jobs when we update the ScaledJob?
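For readers following along, here is a minimal sketch (not KEDA's actual code) of what the code path referenced above roughly does: list every Job carrying the ScaledJob's label and delete it, whether or not it is still running. The label key used here is an assumption for illustration only.

```go
package sketch

import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deletePreviousJobs illustrates the behaviour described above: every Job
// labelled with the ScaledJob's name is deleted, regardless of whether it is
// still running. The label key is assumed for this sketch.
func deletePreviousJobs(ctx context.Context, c client.Client, namespace, scaledJobName string) error {
	jobs := &batchv1.JobList{}
	if err := c.List(ctx, jobs,
		client.InNamespace(namespace),
		client.MatchingLabels{"scaledjob.keda.sh/name": scaledJobName}, // assumed label key
	); err != nil {
		return fmt.Errorf("listing jobs: %w", err)
	}

	for i := range jobs.Items {
		// Deleting unconditionally is what terminates in-flight work on every
		// ScaledJob update; a configuration switch would need to guard this call.
		if err := c.Delete(ctx, &jobs.Items[i],
			client.PropagationPolicy(metav1.DeletePropagationBackground),
		); err != nil {
			return fmt.Errorf("deleting job %s: %w", jobs.Items[i].Name, err)
		}
	}
	return nil
}
```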
I did some debugging and found a few issues with this function:
On version 2.4, the number seems to be weird (50363, 21953, 39908 - it doesn't correlate with the number of running jobs in any way), but when there are no jobs, running jobs.Size returns 8. It means that on version 2.2 it is not deleting the jobs because it is not fetching them correctly. Anyway, I think the main reason to use a ScaledJob and not a ScaledObject is to avoid terminating running pods in the middle of their work. Therefore, I'm not really sure why we need this function at all - "Deleting jobs owned by the previous version of the scaledJob" - but that is just my opinion :) Thanks a lot for helping with this one, guys!
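To illustrate the counting point above, this is one stricter way "running jobs" could be counted: only Jobs that still report active pods and have no Complete/Failed condition. This is a sketch of the idea, not KEDA's implementation.

```go
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// countRunningJobs counts only Jobs that are genuinely in flight: active pods
// present and no terminal (Complete/Failed) condition recorded yet.
func countRunningJobs(jobs *batchv1.JobList) int {
	running := 0
	for _, job := range jobs.Items {
		finished := false
		for _, cond := range job.Status.Conditions {
			if (cond.Type == batchv1.JobComplete || cond.Type == batchv1.JobFailed) &&
				cond.Status == corev1.ConditionTrue {
				finished = true
				break
			}
		}
		if !finished && job.Status.Active > 0 {
			running++
		}
	}
	return running
}
```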
Thanks for the information @etamarw! @TsuyoshiUshio I don't recall the context of that; it was implemented way too long ago and not by me :) I agree with the suggested approach to make this configurable. @etamarw, are you willing to give it a try, since you have already started with the analysis?
I'll give it a try.
Hi @zroubalik, I created a relevant PR with a suggestion for a fix. It's pretty raw, so I'll need some guidance in order to proceed :)
Fixed in #2164
Report
On version 2.2, whenever a ScaledJob object got updated, the operator used to create only new jobs with the latest changes, leaving running jobs that were created by the older version of the ScaledJob object to finish gracefully.
On version 2.4, the operator deletes all running jobs whenever the ScaledJob object gets updated.
This behaviour is very problematic since it causes termination of running jobs on each update.
Expected Behavior
The rollout mechanism should work just like it did on version 2.2.
One of the main reasons to use ScaledJobs and not Deployments is to avoid terminating long-running operations in the middle of a run.
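One possible shape of the configurable behaviour discussed in the thread is sketched below. The type and field names are hypothetical and invented for illustration; see #2164 for the change that was actually merged.

```go
package sketch

// ScaledJobRolloutConfig is a hypothetical, simplified stand-in for whatever
// knob the real ScaledJob API could expose; it exists only to illustrate the
// "make this configurable" idea discussed in the thread.
type ScaledJobRolloutConfig struct {
	// PreserveRunningJobs would tell the reconciler to skip deleting Jobs
	// created by the previous version of the ScaledJob.
	PreserveRunningJobs bool
}

// shouldDeletePreviousJobs keeps the current behaviour as the default and only
// skips the deletion when the user explicitly opts in to a gradual rollout.
func shouldDeletePreviousJobs(cfg ScaledJobRolloutConfig) bool {
	return !cfg.PreserveRunningJobs
}
```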
Actual Behavior
On version 2.4, the operator deletes all running jobs whenever the ScaledJob object gets updated.
Steps to Reproduce the Problem
Reproduced on a vanilla k8s cluster and on kind.
Logs from KEDA operator
KEDA Version
2.4.0
Kubernetes Version
1.19
Platform
Any
Scaler Details
prometheus
Anything else?
This behaviour seems pretty similar to this issue from KEDA v1:
#1021