Long Running Workloads #157

Closed · whilke opened this issue May 7, 2019 · 21 comments

Labels: docs:pending, enhancement, needs-discussion, stale

Comments

@whilke commented May 7, 2019

I've been browsing through the source code, so apologies if this is supported and I just missed it.

Does KEDA scaling support long-running workloads from queues?

If KEDA scales up a deployment to 1 pod because there is a queue with an item (assuming that's your scale config), and that pod pulls down the job to work on and the job is long running (say hours), will KEDA scale the deployment back down, causing Kubernetes to kill the pod, once the cooldown period is reached? From browsing the code, it looks like it would.

If it's not supported now, are there any plans on the roadmap? Or is this just not a use case KEDA is designed for?

@ahmelsayed (Contributor) commented:

> will KEDA scale the deployment back down, causing Kubernetes to kill the pod, once the cooldown period is reached? From browsing the code, it looks like it would.

Yes, today that will happen based on the cooldownPeriod, which defaults to 5 minutes. However, it's a static value per ScaledObject, so if the long-running workload is not consistent in how long it takes, you'd have to set the largest possible value.
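For reference, a minimal sketch of where that knob lives on a ScaledObject. Field names follow the current keda.sh/v1alpha1 schema (the 2019-era keda.k8s.io API differs slightly); the deployment name, queue name, and trigger metadata are illustrative, and trigger authentication is omitted:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: long-running-worker          # hypothetical name
spec:
  scaleTargetRef:
    name: long-running-worker        # the Deployment being scaled
  minReplicaCount: 0                 # allow scale to zero when the queue is empty
  maxReplicaCount: 10
  pollingInterval: 30                # seconds between event-source checks
  cooldownPeriod: 7200               # seconds; static per ScaledObject, so it must cover the longest job
  triggers:
    - type: azure-queue
      metadata:
        queueName: jobs              # illustrative queue
        queueLength: "1"             # target messages per replica
```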

Ideally, the scale-down logic would measure the cooldownPeriod from the last function execution or a function-in-progress heartbeat, not from the last check of the event source. This would require KEDA to have access to the metrics emitted by the container.

I think enabling KEDA to incorporate Prometheus queries into its scaling logic would allow for that, but we didn't want to have a hard dependency on Prometheus. #156 is tracking Prometheus integration.

@whilke (Author) commented May 7, 2019

I guess one gross hack today is to have the pod workers keep updating the ScaledObject update stamp while processing a long job.

That should prevent the cooldown from triggering, assuming you update faster than the cooldown time.

@yaron2 (Contributor) commented May 7, 2019

There are currently no plans to address this, but that doesn't mean the use case isn't valid.

As @ahmelsayed said, if you know your processing takes X, you can set the cooldownPeriod to X.

Having the pods report ongoing work to KEDA, or having KEDA probe pods for ongoing work, is a path one does not take lightly.

@jeffhollan added the needs-discussion and enhancement labels on May 8, 2019
@yaron2 (Contributor) commented May 14, 2019

@whilke The best solution is to remove the item from the queue after you finish processing your job, and not before.

@jeffhollan (Member) commented May 16, 2019

Circling back on this one, as it's an area we've discussed some and I wanted to get thoughts down. I think there are 2 scenarios here:

  1. Preventing KEDA from scaling a deployment to zero if there is work being executed
  2. Preventing the HPA from scaling down a deployment that may have active work happening on it

In the case of 1, I think the recommendations for now would be:

  • Don't remove / checkpoint / complete the event item until processing is completed. Ideally KEDA still sees at least 1 item alive on the queue. This isn't always possible, but where it is, it's the best way to handle it.
  • You may need to make your cooldown period as long as the potential execution time.
  • Potentially something else, like whatever comes out of scenario 2 below.

For the case of 2, that can happen when the rate of events decreases and the Kubernetes HPA decides to scale down the deployment. If you have 5 replicas and replica A is 20 minutes into a 40-minute execution, it's possible that the HPA will scale down replica A and terminate it. In this case we've discussed a few options. I'm not sure which would be best, but I plan to bring this up with the autoscaling SIG and see if there are any thoughts on what could be done to "postpone" or delay scaling down an instance if it somehow signals that it is in the middle of some work. @yaron2 also brought up that maybe the right Kubernetes paradigm for these types of long running jobs is a Kubernetes Job - created this issue to help track that potential approach as well.

@lee0c (Contributor) commented May 16, 2019

Related: kubernetes/kubernetes#45509 - discussion around scaling down via removing specific pods & queue driven long running workloads

@whilke (Author) commented May 17, 2019

@jeffhollan, Jobs as they are implemented today would have a lot of problems handling queue-based long-running workloads.

They work well for processing through a queue of items, and you can scale the jobs based on your own parallelism logic, but once the queue goes cold and all the jobs finish, you'd need some middleware watching the queue to spin up a new job when new items come back in.

Sounds ripe for race conditions, or for over-spinning jobs.

@whilke (Author) commented May 17, 2019

@yaron2 That doesn't solve the problem, and in fact wouldn't that result in spinning up to max pods? If you have a scale metric set for 1 queue item, and you start processing the only item but keep it in the queue, KEDA and the HPA are going to see that you still have items in your queue and will keep scaling up to try to handle them.

I also don't know of any queue implementation that will keep the item visible in the queue while you're processing it; otherwise it can get picked up by another listener. That goes for at-most-once and at-least-once implementations.

@yaron2 (Contributor) commented May 17, 2019

> and in fact wouldn't that result in spinning up to max pods?

No, that's not the case. The HPA will scale to meet the threshold; for good or for bad, it has no knowledge of whether or not you actually finished processing the work. It will bring up the number of pods that meets the threshold metrics.

> I also don't know of any queue implementation that will keep the item visible in the queue while you're processing it

Basically every queue system I know has peek-lock functionality or a means of achieving it (noAck for Rabbit).

A consumer can lock and get the message, and the message gets deleted from the queue when the job is done.
If the consumer crashes, there's usually a timeout that releases the lock after a certain period of time.

Queues (normally) support a single consumer per messageID per queue, meaning multiple consumers can't get the same message on a given queue.

@jeffhollan (Member) commented Jun 6, 2019

@lee0c is working on a proposal for the "job dispatch" pattern. Any questions, comments, upvotes, etc. should be directed to #199, to see if that's one viable option KEDA could provide to solve this. I provided some background and questions on that issue as well. Thanks @lee0c

@ahmelsayed (Contributor) commented:

I have a slightly alternative proposal for long-running jobs that would also work for functions and other "listener"-style applications rather than job-style applications.

Similar to this proposal, we would have a scaleType: field with options simple and hpa, where hpa is the current behavior. In simple scale mode there is no HPA; KEDA handles the scaling. This means you'd lose HPA scaling metrics like CPU and memory, but you should still be able to get KEDA metric scaling.

Not having HPA allows us to implement a simple "I'm busy" protocol with the application.

Possible options:

  1. A KEDA-injected endpoint that the application reaches out to.
  2. A file's modified time in the container filesystem.

For #1: KEDA could inject KEDA_STATE_URI=https://<keda-service>:<port>/api/... and KEDA_STATE_TIMEOUT=5m. The app is expected to curl the URL every 5 minutes if the trigger is not active but the app is busy. This would reset the last-active timer in KEDA for the deployment.

For #2: The app is expected to touch a file, KEDA_STATE_FILE=/run/application/is_busy, every timeout period if the trigger is not active but the app is busy. KEDA would exec stat on the file and reset the last-active timer if needed. This requires stat in the container.

This should allow most applications (from functions to bash scripts) to easily signal out to Keda that they are busy.
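To make options #1 and #2 concrete, here is roughly what the scaled pod would see. This is purely hypothetical wiring for the proposal above (KEDA does not inject these variables); the image is a placeholder and the variable names and values are taken from the comment:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: long-running-worker
spec:
  containers:
    - name: worker
      image: example.com/long-running-worker:latest   # placeholder image
      env:
        - name: KEDA_STATE_URI        # option 1: the app curls this while busy but not triggered
          value: "https://<keda-service>:<port>/api/..."
        - name: KEDA_STATE_TIMEOUT    # how often the app must report in
          value: "5m"
        - name: KEDA_STATE_FILE       # option 2: the app touches this file while busy
          value: "/run/application/is_busy"
```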

@tomkerkhove (Member) commented:

> For #1: KEDA could inject KEDA_STATE_URI=https://<keda-service>:<port>/api/... and KEDA_STATE_TIMEOUT=5m. The app is expected to curl the URL every 5 minutes if the trigger is not active but the app is busy. This would reset the last-active timer in KEDA for the deployment.

Does the app have to initiate it, or does KEDA initiate it? Personally I'd prefer not to have to change my app just because we chose to use KEDA. Having a "busy" endpoint next to (or even the same as) my health endpoint, which KEDA could consume, would be a better fit.

@whilke (Author) commented Jun 24, 2019

> For #1: KEDA could inject KEDA_STATE_URI=https://<keda-service>:<port>/api/... and KEDA_STATE_TIMEOUT=5m. The app is expected to curl the URL every 5 minutes if the trigger is not active but the app is busy. This would reset the last-active timer in KEDA for the deployment.

> Does the app have to initiate it, or does KEDA initiate it? Personally I'd prefer not to have to change my app just because we chose to use KEDA. Having a "busy" endpoint next to (or even the same as) my health endpoint, which KEDA could consume, would be a better fit.

This is the approach we went with (having the scaler check whether an instance is busy). It follows the normal health/readiness check pattern. It also reduces race-type conditions, since it's a live check at the moment the scaler is looking to scale down a set.

@ahmelsayed (Contributor) commented:

We could do that too. My only reservation is about expecting the pod to have an HTTP endpoint. Since all KEDA applications are meant to process polling-based events, I'm not sure most pods would otherwise need to expose one, and depending on the framework you're using, it might not be simple to add to a basic queue or topic consumer app.

Since KEDA doesn't require deployments to expose a service, we would need KEDA to port-forward to each pod to query the HTTP endpoint inside.

To me, option #1 above is if you want to push your state into KEDA and option #2 is if you want KEDA to poll your state, but a file is simpler than exposing an HTTP endpoint.

@whilke (Author) commented Jun 24, 2019

Options #1 and #2 smell, a lot.

#1 makes your data very stale and introduces easy race conditions between when KEDA last got the status and when it terminates pods (the pod might have picked up a new job by then).

#2 requires workers to implement the status file correctly or it's worse than #1 (same race-condition issue, plus all the overhead of requiring an exec stat). It still doesn't stop race conditions completely, since it's a disconnected status system.

KEDA could create a headless service for the deployment, which would give a DNS entry for every pod so it can connect directly to an endpoint instead of port-forwarding.

I still think it's the best option for an optional feature, even though it requires the worker to listen on a socket.
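For illustration, a headless Service is just a normal Service with clusterIP: None; resolving its DNS name then returns the individual pod IPs instead of a load-balanced virtual IP, so a scaler could reach each pod's busy endpoint directly. A minimal sketch with hypothetical names and port:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: long-running-worker-headless   # hypothetical name
spec:
  clusterIP: None                      # headless: DNS returns the pod IPs, no load-balanced VIP
  selector:
    app: long-running-worker           # must match the scaled deployment's pod labels
  ports:
    - name: busy-check
      port: 8080                       # port the worker's "is busy" endpoint would listen on
      targetPort: 8080
```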

@ahmelsayed (Contributor) commented:

I'm not sure how exposing an endpoint resolves the race condition though. KEDA could query the endpoint and get "not busy", then start terminating the pod while the pod has picked up a new work item in the meantime. To eliminate that race condition we'd have to re-implement the SIGTERM/SIGKILL approach of Kubernetes, where we inform a pod that we want to shut it down, the pod says yes or no, and if it says "yes, I'm ready to be terminated", then it's on your application not to accept any new tasks until it's terminated. This is quite a bit more involved than just an "is busy" endpoint.

What are the benefits of creating a service per pod versus port-forwarding to a pod while checking its status? And what are the benefits of requiring every queue consumer to implement an HTTP interface to signal busy state, as opposed to touching $KEDA_STATE_FILE or curling $KEDA_STATE_URI on a timer while processing a queue message you expect to take a long time, given that both mechanisms are susceptible to the same type of race condition?

If the problem we're trying to solve is KEDA terminating a 3 hour job that's 2 hours into processing, then those race conditions are not applicable.

If the problem is KEDA terminating a pod that has received a workitem between the time of KEDA checking the "busy" status and terminating the pod, then we need the pod to cooperate in the shutdown process, which is more involved.
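For context, Kubernetes' built-in version of that cooperative shutdown is a preStop hook plus a generous terminationGracePeriodSeconds: the kubelet runs the hook, then sends SIGTERM, and only escalates to SIGKILL once the grace period expires. A minimal sketch, assuming a hypothetical image and a busy-marker file that the worker itself creates and removes around each job:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: long-running-worker
spec:
  replicas: 1
  selector:
    matchLabels:
      app: long-running-worker
  template:
    metadata:
      labels:
        app: long-running-worker
    spec:
      terminationGracePeriodSeconds: 14400   # allow up to 4h for an in-flight job to finish
      containers:
        - name: worker
          image: example.com/long-running-worker:latest   # hypothetical image
          lifecycle:
            preStop:
              exec:
                # Block termination while the worker's own busy marker exists;
                # the worker deletes /tmp/busy once the current job completes.
                command: ["sh", "-c", "while [ -f /tmp/busy ]; do sleep 10; done"]
```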

@Aarthisk (Contributor) commented Jul 9, 2019

@ahmelsayed sorry for resurrecting an old thread. Isn't this exactly what liveness probes do? Are you suggesting using them or implementing something brand new? I am playing with pod disruption budgets to see if the HPA respects them. If that's the case, we can have the pod clear the label when it is not "busy" and let the HPA scale it down.
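For reference, the PodDisruptionBudget idea would look roughly like this (policy/v1 here; it was policy/v1beta1 at the time), with a hypothetical busy-label convention. Note that PDBs guard voluntary evictions such as node drains; whether HPA-driven scale-down honors them is exactly the open question above:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: protect-busy-workers
spec:
  minAvailable: "100%"        # never voluntarily evict a pod covered by this budget
  selector:
    matchLabels:
      busy: "true"            # hypothetical label the worker sets while processing and clears when idle
```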

@jeffhollan assigned jeffhollan and unassigned ahmelsayed on Sep 10, 2019
@agowdamsft commented:

Is this still open? I see #322 was merged, which gives external checkpoint functionality; I was wondering if these two issues are different.

@jeffhollan (Member) commented:

Yes, the only gap now is documentation. I think the two recommendations are:

  1. Leverage the termination pre-stop hook to delay scaling down a pod until your application signals that it's OK for Kubernetes to terminate it. The upside of this approach is that you can still use standard deployments / concurrent containers. The downside is that if a job delays scale-down for a long time (say hours), the Kubernetes API will show the pod status as "Terminating...", which may be confusing to operators.
  2. Use the new jobs functionality to consume long-running tasks as a job (see the sketch below). The upside is that there's a single atomic job per long-running operation, and the status will just be "Running". The downside is that it requires you to architect your app so the container only pulls a single message and then terminates.
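For recommendation 2, a minimal sketch of the jobs approach using the later keda.sh/v1alpha1 ScaledJob resource (the v1-era API around this thread exposed a similar job mode through ScaledObject). The image, queue name, and trigger metadata are illustrative, and trigger authentication is omitted:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: long-running-job
spec:
  pollingInterval: 30                # seconds between queue checks
  maxReplicaCount: 10                # cap on concurrently running Jobs
  jobTargetRef:
    backoffLimit: 2
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: worker
            image: example.com/long-running-worker:latest   # hypothetical image; pulls one message, processes it, exits
  triggers:
    - type: azure-queue
      metadata:
        queueName: jobs              # illustrative queue
        queueLength: "1"
```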

@tomkerkhove added the docs:pending label on Nov 12, 2019
stale bot commented Oct 13, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Oct 13, 2021.
stale bot commented Oct 20, 2021

This issue has been automatically closed due to inactivity.

The stale bot closed this as completed on Oct 20, 2021.
preflightsiren pushed a commit to preflightsiren/keda that referenced this issue Nov 7, 2021