Long Running Workloads #157

Closed · whilke opened this issue May 7, 2019 · 21 comments

Labels: docs:pending, enhancement, needs-discussion, stale

Comments

@whilke commented May 7, 2019

I've been browsing through the source code, so apologies if this is supported and I just missed it.

Does KEDA scaling support long-running workloads from queues?

If KEDA scales up a deployment to 1 pod because there is a queue with an item (assuming that's your scale config), and that pod pulls down the job to work on and the job is long running (say hours), will KEDA scale the deployment back down, causing Kubernetes to kill the pod, once the cooldown period is reached? From browsing the code, it looks like it would.

If it's not supported now, are there any plans on the roadmap? Or is this just not a use case KEDA is designed for?

@ahmelsayed (Contributor) commented:

> will KEDA scale the deployment back down, causing Kubernetes to kill the pod, once the cooldown period is reached? From browsing the code, it looks like it would.

Yes, today that will happen based on the cooldownPeriod, which defaults to 5 minutes. However, it's a static value per ScaledObject, so if the long-running workload is not consistent in how long it takes, you'd have to set the largest possible value.
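For reference, a minimal sketch of where that knob lives on a ScaledObject. Field names follow the current keda.sh/v1alpha1 schema (the 2019-era keda.k8s.io API differs slightly); the deployment name, queue name, and trigger metadata are illustrative, and trigger authentication is omitted:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: long-running-worker          # hypothetical name
spec:
  scaleTargetRef:
    name: long-running-worker        # the Deployment being scaled
  minReplicaCount: 0                 # allow scale to zero when the queue is empty
  maxReplicaCount: 10
  pollingInterval: 30                # seconds between event-source checks
  cooldownPeriod: 7200               # seconds; static per ScaledObject, so it must cover the longest job
  triggers:
    - type: azure-queue
      metadata:
        queueName: jobs              # illustrative queue
        queueLength: "1"             # target messages per replica
```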

Ideally, the scale-down logic would measure the cooldownPeriod from the last function execution or a function-in-progress heartbeat, not from the last check of the event source. This would require KEDA to have access to the metrics emitted by the container.

I think enabling KEDA to incorporate Prometheus queries into its scaling logic would allow for that, but we didn't want to have a hard dependency on Prometheus. #156 is tracking Prometheus integration.

@whilke (Author) commented May 7, 2019

I guess one gross hack today is to have the pod workers keep updating the ScaledObject update stamp while processing a long job.

That should prevent the cooldown from triggering, assuming you update faster than the cooldown time.

@yaron2 (Contributor) commented May 7, 2019

There are currently no plans to address this, but that doesn't mean the use case isn't valid.

As @ahmelsayed said, if you know your processing takes X, you can set the cooldownPeriod to X.

Having the pods report ongoing work to KEDA, or having KEDA probe pods for ongoing work, is a path one does not take lightly.

@jeffhollan added the needs-discussion and enhancement labels on May 8, 2019
@yaron2 (Contributor) commented May 14, 2019

@whilke The best solution is to remove the item from the queue after you finish processing your job, and not before.

@jeffhollan (Member) commented May 16, 2019

Circling back on this one, as it's an area we've discussed some and I wanted to get thoughts down. I think there are 2 scenarios here:

  1. Preventing KEDA from scaling a deployment to zero if there is work being executed
  2. Preventing the HPA from scaling down a deployment that may have active work happening on it

In the case of 1, I think the recommendations for now would be:

  • Don't remove / checkpoint / complete the event item until processing is completed. Ideally KEDA still sees at least 1 item alive on the queue. This isn't always possible, but where it is, it's the best way to handle it.
  • You may need to make your cooldown period as long as the potential execution time.
  • Potentially something else, like whatever comes out of scenario 2 below.

For the case of 2, that can happen when the rate of events decreases and the Kubernetes HPA decides to scale down the deployment. If you have 5 replicas and replica A is 20 minutes into a 40-minute execution, it's possible that the HPA will scale down replica A and terminate it. In this case we've discussed a few options. I'm not sure which would be best, but I plan to bring this up with the autoscaling SIG and see if there are any thoughts on what could be done to "postpone" or delay scaling down an instance if it somehow signals that it is in the middle of some work. @yaron2 also brought up that maybe the right Kubernetes paradigm for these types of long running jobs is a Kubernetes Job - created this issue to help track that potential approach as well.

@lee0c (Contributor) commented May 16, 2019

Related: kubernetes/kubernetes#45509 - discussion around scaling down via removing specific pods & queue driven long running workloads

@whilke (Author) commented May 17, 2019

@jeffhollan, Jobs as they are implemented today would have a lot of problems handling queue-based long-running workloads.

They work well for processing through a queue of items, and you can scale the jobs based on your own parallelism logic, but once the queue goes cold and all the jobs finish, you'd need some middleware watching the queue to spin up a new job when new items come back in.

Sounds ripe for race conditions, or for over-spinning jobs.

@whilke (Author) commented May 17, 2019

@yaron2 That doesn't solve the problem, and in fact wouldn't that result in spinning up to max pods? If you have a scale metric set for 1 queue item, and you start processing the only item but keep it in the queue, KEDA and the HPA are going to see that you still have items in your queue and will keep scaling up to try to handle them.

I also don't know of any queue implementation that will keep the item visible in the queue while you're processing it; otherwise it can get picked up by another listener. That goes for at-most-once and at-least-once implementations.

@yaron2 (Contributor) commented May 17, 2019

> and in fact wouldn't that result in spinning up to max pods?

No, that's not the case. The HPA will scale to meet the threshold; for good or for bad, it has no knowledge of whether or not you actually finished processing the work. It will bring up the number of pods that meets the threshold metrics.

> I also don't know of any queue implementation that will keep the item visible in the queue while you're processing it

Basically every queue system I know has peek-lock functionality or a means of achieving it (noAck for Rabbit).

A consumer can lock and get the message, and the message gets deleted from the queue when the job is done.
If the consumer crashes, there's usually a timeout that releases the lock after a certain period of time.

Queues (normally) support a single consumer per messageID per queue, meaning multiple consumers can't get the same message on a given queue.

@jeffhollan (Member) commented Jun 6, 2019

@lee0c is working on a proposal for the "job dispatch" pattern. Any questions, comments, upvotes, etc. should be directed to #199, to see if that's one viable option KEDA could provide to solve this. I provided some background and questions on that issue as well. Thanks @lee0c

@ahmelsayed (Contributor) commented:

I have a slightly alternative proposal for long-running jobs that would also work for functions and other "listener"-style applications rather than job-style applications.

Similar to this proposal, we would have a scaleType: field with options simple and hpa, where hpa is the current behavior. In simple scale mode there is no HPA; KEDA handles the scaling. This means you'd lose HPA scaling metrics like CPU and memory, but you should still be able to get KEDA metric scaling.

Not having HPA allows us to implement a simple "I'm busy" protocol with the application.

Possible options:

  1. A KEDA-injected endpoint that the application reaches out to.
  2. A file's modified time in the container filesystem.

For #1: KEDA could inject KEDA_STATE_URI=https://<keda-service>:<port>/api/... and KEDA_STATE_TIMEOUT=5m. The app is expected to curl the URL every 5 minutes if the trigger is not active but the app is busy. This would reset the last-active timer in KEDA for the deployment.

For #2: The app is expected to touch a file, KEDA_STATE_FILE=/run/application/is_busy, every timeout period if the trigger is not active but the app is busy. KEDA would exec stat on the file and reset the last-active timer if needed. This requires stat in the container.

This should allow most applications (from functions to bash scripts) to easily signal out to Keda that they are busy.
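To make options #1 and #2 concrete, here is roughly what the scaled pod would see. This is purely hypothetical wiring for the proposal above (KEDA does not inject these variables); the image is a placeholder and the variable names and values are taken from the comment:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: long-running-worker
spec:
  containers:
    - name: worker
      image: example.com/long-running-worker:latest   # placeholder image
      env:
        - name: KEDA_STATE_URI        # option 1: the app curls this while busy but not triggered
          value: "https://<keda-service>:<port>/api/..."
        - name: KEDA_STATE_TIMEOUT    # how often the app must report in
          value: "5m"
        - name: KEDA_STATE_FILE       # option 2: the app touches this file while busy
          value: "/run/application/is_busy"
```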

@tomkerkhove (Member) commented:

> For #1: KEDA could inject KEDA_STATE_URI=https://<keda-service>:<port>/api/... and KEDA_STATE_TIMEOUT=5m. The app is expected to curl the URL every 5 minutes if the trigger is not active but the app is busy. This would reset the last-active timer in KEDA for the deployment.

Does the app have to initiate it, or does KEDA initiate it? Personally I'd prefer not to have to change my app just because we chose to use KEDA. Having a "busy" endpoint next to (or even the same as) my health endpoint, which KEDA could consume, would be a better fit.

@whilke (Author) commented Jun 24, 2019

> For #1: KEDA could inject KEDA_STATE_URI=https://<keda-service>:<port>/api/... and KEDA_STATE_TIMEOUT=5m. The app is expected to curl the URL every 5 minutes if the trigger is not active but the app is busy. This would reset the last-active timer in KEDA for the deployment.

> Does the app have to initiate it, or does KEDA initiate it? Personally I'd prefer not to have to change my app just because we chose to use KEDA. Having a "busy" endpoint next to (or even the same as) my health endpoint, which KEDA could consume, would be a better fit.

This is the approach we went with (having the scaler check whether an instance is busy). It follows the normal health/readiness check pattern. It also reduces race-type conditions, since it's a live check at the moment the scaler is looking to scale down a set.

@ahmelsayed (Contributor) commented:

We could do that too. My only reservation is about expecting the pod to have an HTTP endpoint. Since all KEDA applications are meant to process polling-based events, I'm not sure most pods would otherwise need to expose one, and depending on the framework you're using, it might not be simple to add to a basic queue or topic consumer app.

Since KEDA doesn't require deployments to expose a service, we would need KEDA to port-forward to each pod to query the HTTP endpoint inside.

To me, option #1 above is if you want to push your state into KEDA and option #2 is if you want KEDA to poll your state, but a file is simpler than exposing an HTTP endpoint.

@whilke (Author) commented Jun 24, 2019

Options #1 and #2 smell, a lot.

#1 makes your data very stale and introduces easy race conditions between when KEDA last got the status and when it terminates pods (the pod might have picked up a new job by then).

#2 requires workers to implement the status file correctly or it's worse than #1 (same race-condition issue, plus all the overhead of requiring an exec stat). It still doesn't stop race conditions completely, since it's a disconnected status system.

KEDA could create a headless service for the deployment, which would give a DNS entry for every pod so it can connect directly to an endpoint instead of port-forwarding.

I still think it's the best option for an optional feature, even though it requires the worker to listen on a socket.
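For illustration, a headless Service is just a normal Service with clusterIP: None; resolving its DNS name then returns the individual pod IPs instead of a load-balanced virtual IP, so a scaler could reach each pod's busy endpoint directly. A minimal sketch with hypothetical names and port:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: long-running-worker-headless   # hypothetical name
spec:
  clusterIP: None                      # headless: DNS returns the pod IPs, no load-balanced VIP
  selector:
    app: long-running-worker           # must match the scaled deployment's pod labels
  ports:
    - name: busy-check
      port: 8080                       # port the worker's "is busy" endpoint would listen on
      targetPort: 8080
```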

@ahmelsayed (Contributor) commented:

I'm not sure how exposing an endpoint resolves the race condition though. KEDA could query the endpoint and get "not busy", then start terminating the pod while the pod has picked up a new work item in the meantime. To eliminate that race condition we'd have to re-implement the SIGTERM/SIGKILL approach of Kubernetes, where we inform a pod that we want to shut it down, the pod says yes or no, and if it says "yes, I'm ready to be terminated", then it's on your application not to accept any new tasks until it's terminated. This is quite a bit more involved than just an "is busy" endpoint.

What are the benefits of creating a service per pod versus port-forwarding to a pod while checking its status? And what are the benefits of requiring every queue consumer to implement an HTTP interface to signal busy state, as opposed to touching $KEDA_STATE_FILE or curling $KEDA_STATE_URI on a timer while processing a queue message you expect to take a long time, given that both mechanisms are susceptible to the same type of race condition?

If the problem we're trying to solve is KEDA terminating a 3 hour job that's 2 hours into processing, then those race conditions are not applicable.

If the problem is KEDA terminating a pod that has received a workitem between the time of KEDA checking the "busy" status and terminating the pod, then we need the pod to cooperate in the shutdown process, which is more involved.
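For context, Kubernetes' built-in version of that cooperative shutdown is a preStop hook plus a generous terminationGracePeriodSeconds: the kubelet runs the hook, then sends SIGTERM, and only escalates to SIGKILL once the grace period expires. A minimal sketch, assuming a hypothetical image and a busy-marker file that the worker itself creates and removes around each job:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: long-running-worker
spec:
  replicas: 1
  selector:
    matchLabels:
      app: long-running-worker
  template:
    metadata:
      labels:
        app: long-running-worker
    spec:
      terminationGracePeriodSeconds: 14400   # allow up to 4h for an in-flight job to finish
      containers:
        - name: worker
          image: example.com/long-running-worker:latest   # hypothetical image
          lifecycle:
            preStop:
              exec:
                # Block termination while the worker's own busy marker exists;
                # the worker deletes /tmp/busy once the current job completes.
                command: ["sh", "-c", "while [ -f /tmp/busy ]; do sleep 10; done"]
```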

@Aarthisk (Contributor) commented Jul 9, 2019

@ahmelsayed sorry for resurrecting an old thread. Isn't this exactly what liveness probes do? Are you suggesting using them or implementing something brand new? I am playing with pod disruption budgets to see if the HPA respects them. If that's the case, we can have the pod clear the label when it is not "busy" and let the HPA scale it down.
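For reference, the PodDisruptionBudget idea would look roughly like this (policy/v1 here; it was policy/v1beta1 at the time), with a hypothetical busy-label convention. Note that PDBs guard voluntary evictions such as node drains; whether HPA-driven scale-down honors them is exactly the open question above:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: protect-busy-workers
spec:
  minAvailable: "100%"        # never voluntarily evict a pod covered by this budget
  selector:
    matchLabels:
      busy: "true"            # hypothetical label the worker sets while processing and clears when idle
```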

@jeffhollan assigned jeffhollan and unassigned ahmelsayed on Sep 10, 2019
@agowdamsft commented:

Is this still open? I see #322 was merged, which gives external checkpoint functionality; I was wondering if these two issues are different.

@jeffhollan (Member) commented:

Yes, the only gap now is documentation. I think the two recommendations are:

  1. Leverage the termination pre-stop hook to delay scaling down a pod until your application signals that it's OK for Kubernetes to terminate it. The upside of this approach is that you can still use standard deployments / concurrent containers. The downside is that if a job delays scale-down for a long time (say hours), the Kubernetes API will show the pod status as "Terminating...", which may be confusing to operators.
  2. Use the new jobs functionality to consume long-running tasks as a job (see the sketch below). The upside is that there's a single atomic job per long-running operation, and the status will just be "Running". The downside is that it requires you to architect your app so the container only pulls a single message and then terminates.
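For recommendation 2, a minimal sketch of the jobs approach using the later keda.sh/v1alpha1 ScaledJob resource (the v1-era API around this thread exposed a similar job mode through ScaledObject). The image, queue name, and trigger metadata are illustrative, and trigger authentication is omitted:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: long-running-job
spec:
  pollingInterval: 30                # seconds between queue checks
  maxReplicaCount: 10                # cap on concurrently running Jobs
  jobTargetRef:
    backoffLimit: 2
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: worker
            image: example.com/long-running-worker:latest   # hypothetical image; pulls one message, processes it, exits
  triggers:
    - type: azure-queue
      metadata:
        queueName: jobs              # illustrative queue
        queueLength: "1"
```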

@tomkerkhove added the docs:pending label on Nov 12, 2019
stale bot commented Oct 13, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Oct 13, 2021.
stale bot commented Oct 20, 2021

This issue has been automatically closed due to inactivity.

The stale bot closed this as completed on Oct 20, 2021.
preflightsiren pushed a commit to preflightsiren/keda that referenced this issue Nov 7, 2021