Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

operator deletes all running jobs whenever the relevant scaledjob object gets updated, and recreates them based on the new scaledjob spec. #2098

Closed
etamarw opened this issue Sep 12, 2021 · 8 comments
Labels
bug Something isn't working

Comments

@etamarw
Copy link
Contributor

etamarw commented Sep 12, 2021

Report

on version 2.2, whenever a scaledjob object got updated, the operator used to create only new jobs with the latest changes - leaving running jobs that were created by the older version of the scaledjob object to finish gracefully.
on version 2.4, the operator deletes all running jobs whenever the scaledjob objects gets updated.
this behaviour is very problematic since its causing termination of running jobs on each update.

Expected Behavior

rollout mechanism should be just like it was on version 2.2.
one of the main reasons to use scale jobs and not deployments is to avoid terminating long running operations in the middle of the run.

Actual Behavior

on version 2.4, the operator deletes all running jobs whenever the scaledjob objects gets updated.

Steps to Reproduce the Problem

reproduced on vanilla k8s cluster on on kind.

  1. install kind
https://kind.sigs.k8s.io/docs/user/quick-start/#installation
  1. create new cluster
kind create cluster --name test-keda
  1. install keda on the new cluster
kubectl apply -f https://github.com/kedacore/keda/releases/download/v2.4.0/keda-2.4.0.yaml
  1. deploy prometheus
helm --namespace keda install prom prometheus-community/prometheus
  1. configure a scaled job to use the local prometheus
  triggers:
  - metadata:
      metricName: test
      query: vector(1)
      serverAddress: http://prom-prometheus-server
      threshold: "1"
    type: prometheus
  1. update the image tag the scaledjob is using.
  2. all jobs that were already running. will get terminated

Logs from KEDA operator

2021-09-12T14:41:55.057Z	INFO	controllers.ScaledJob	Reconciling ScaledJob	{"ScaledJob.Namespace": "default", "ScaledJob.Name": "sj-test"}
2021-09-12T14:41:55.057Z	INFO	controllers.ScaledJob	**Deleting jobs owned by the previous version of the scaledJob**	{"ScaledJob.Namespace": "default", "ScaledJob.Name": "sj-test", "Number of jobs to delete": 50461}
2021-09-12T14:41:55.262Z	INFO	controllers.ScaledJob	Initializing Scaling logic according to ScaledJob Specification	{"ScaledJob.Namespace": "default", "ScaledJob.Name": "sj-test"}

KEDA Version

2.4.0

Kubernetes Version

1.19

Platform

Any

Scaler Details

prometheus

Anything else?

this behaviour seems pretty similar to this issue from keda v1:
#1021

@etamarw etamarw added the bug Something isn't working label Sep 12, 2021
@etamarw
Copy link
Contributor Author

etamarw commented Sep 12, 2021

@zroubalik suggested that this pr might be relevant to this change 2.2-> 2.4
#1970
maybe you guys are familiar with this topic? @TsuyoshiUshio @tomkerkhove

@etamarw
Copy link
Contributor Author

etamarw commented Sep 13, 2021

worth mentioning that on v2.2 log looks the same "Deleting jobs owned by the previous version of the scaledJob"
but jobs are not being deleted

@TsuyoshiUshio
Copy link
Contributor

TsuyoshiUshio commented Sep 13, 2021

Hi @etamarw It looks happens in this point. https://github.com/kedacore/keda/blame/61740daffac2194dac91d76bb0526b2660e0acdd/controllers/keda/scaledjob_controller.go#L137 It might be possible to add configuration for this part, however, I need to understand the context. @zroubalik do you know the context why it deletes the running jobs if we update the scaled job?

@etamarw
Copy link
Contributor Author

etamarw commented Sep 14, 2021

i did some debugging and found a few issues with this function:

  • over here controller is trying to fetch the number of jobs it needs to delete. it looks like Size doesn't work that way. i noticed that on version 2.2 it always returns 8 no matter how many jobs exists (line 171):
logger.Info("Deleting jobs owned by the previous version of the scaledJob", "Number of jobs to delete", jobs.Size())

on version 2.4, number seems to be weird(50363,21953,39908 - doesn't correlates with the number of running jobs in any way) but when there are no jobs, running jobs.Size returns 8. it means that on version 2.2 it is not deleting the jobs cause it Is not fetching them correctly.
on this line condition is always true since it is not really checking the number of jobs as expected (when 0 jobs running jobs.Size is 8).
might be related to this line but im totally not sure.
reference to the Size implementation

anyways, i think the main reason to use scaled job and not scaled object is to avoid terminating running pods in the middle of their work. therefore, im not really sure why we need this function at all - "Deleting jobs owned by the previous version of the scaledJob" but it is just my opinion :)
to enjoy both worlds, making it configurable like @TsuyoshiUshio suggested will be great, or remove it if theres a consensus about it.

thanks a lot for helping with this one guys!

@zroubalik
Copy link
Member

Thanks for the information @etamarw ! @TsuyoshiUshio I don't recall the context of that, it has been implemented way too loong ago and not by me :)

I agree with suggested approach, to make this configurable. @etamarw are you willing to give it a try? Since you have already started with the analysis?

@etamarw
Copy link
Contributor Author

etamarw commented Sep 29, 2021

ill try to dive it a try.
i hope its a good first issue ;)

@etamarw
Copy link
Contributor Author

etamarw commented Oct 6, 2021

hi, @zroubalik i created a relevant pr with a suggestion for a fix. Its pretty raw so ill need some guidance in order to proceed :)

@zroubalik
Copy link
Member

Fixed in #2164

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants