Long Running Workloads #157
Comments
Yes, today that will happen, based on the cooldown period setting. Ideally the scale-down logic should measure whether work is still in progress. I think enabling KEDA to incorporate Prometheus queries into its scaling logic would allow for that, but we didn't want to have a hard dependency on Prometheus. #156 is tracking Prometheus integration.
I guess one gross hack today is to have pod workers continue to update the ScaledObject update stamp while processing a long job. That should prevent the cooldown from triggering? Assuming you update faster than the cooldown time.
Currently there are no plans to address this, but that's not to say the use case isn't valid. Like @ahmelsayed said, if you know your processing takes X, you can set the cooldown period accordingly. Having the pods report ongoing work to KEDA, or having KEDA probe pods for ongoing work, is a path one does not take lightly.
@whilke The best solution is to remove the item from the queue after you finish processing your job, and not before.
Circling back on this one as it's an area we've discussed some, and I wanted to get thoughts down. I think there are 2 scenarios here: KEDA scaling the deployment back down (potentially to zero) after the cooldown period while a replica is still working through a long message, and the HPA scaling in the deployment while replicas are mid-execution.
In the case of 1, the recommendations for now would be to set the cooldown period longer than the expected maximum processing time, and to remove messages from the queue only after processing completes so unfinished work isn't lost.
For the case of 2, that can happen when the rate of events decreases and the Kubernetes HPA decides to scale down the deployment. If you have 5 replicas and replica A is 20 minutes into a 40 minute execution, it's possible that HPA will scale down replica A and terminate it. In this case we've discussed a few options. I'm not sure what the best one would be, but I plan to bring this up with the autoscaling SIG and see if there are any thoughts on what could be done to "postpone" or delay scaling down an instance if it somehow signals that it is in the middle of some work. @yaron2 also brought up that maybe the right Kubernetes paradigm for these types of long running jobs is a Kubernetes Job; I created this issue to help track that potential approach as well.
Related: kubernetes/kubernetes#45509 - discussion around scaling down via removing specific pods & queue-driven long running workloads
@jeffhollan, Jobs as they are implemented today would have a lot of problems handling queue-based long running workloads. They work well for processing through a queue of items, and can scale the jobs based on your own parallelism logic, but once the queue goes cold and all the jobs finish you'd need some middleware watching the queue to spin up a new job when new items come back in. Sounds ripe for race conditions, or over-spinning jobs.
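For concreteness, a rough sketch of what that kind of "middleware watching the queue" could look like is below. This is not anything KEDA provides; the queue check (`pendingMessages`) and the worker image are hypothetical placeholders, and Job creation uses client-go. As written it has exactly the pitfall described above: nothing correlates waiting messages with Jobs that already exist, so it can over-spin.

```go
// A naive queue-to-Job dispatcher sketch. pendingMessages() and the worker
// image are placeholders; Job creation uses client-go.
package main

import (
	"context"
	"log"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	for {
		// One Job per waiting message. Nothing here remembers which messages
		// already have a Job, which is the "over spinning" risk noted above.
		for i := 0; i < pendingMessages(); i++ {
			job := &batchv1.Job{
				ObjectMeta: metav1.ObjectMeta{
					GenerateName: "queue-worker-",
				},
				Spec: batchv1.JobSpec{
					Template: corev1.PodTemplateSpec{
						Spec: corev1.PodSpec{
							RestartPolicy: corev1.RestartPolicyNever,
							Containers: []corev1.Container{{
								Name:  "worker",
								Image: "example.com/queue-worker:latest", // hypothetical image
							}},
						},
					},
				},
			}
			if _, err := cs.BatchV1().Jobs("default").Create(
				context.TODO(), job, metav1.CreateOptions{}); err != nil {
				log.Printf("create job failed: %v", err)
			}
		}
		time.Sleep(30 * time.Second)
	}
}

// pendingMessages stands in for a real queue-length check (Service Bus,
// Storage Queues, RabbitMQ, ...).
func pendingMessages() int { return 0 }
```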
@yaron2 That doesn't solve the problem, and in fact wouldn't that result in spinning up to max pods? If you have a scale metric set for 1 queue item, and you start processing the only item but keep it in the queue, KEDA & HPA are going to see you still have items in your queue and will keep scaling up to try to handle them. I also don't know of any queue implementations that will keep the item visible in the queue while you're processing it; otherwise it could get picked up by another listener. That goes for at-most-once and at-least-once implementations.
No, that's not the case. The HPA will scale to meet the threshold; it (for good or for bad) has no knowledge of whether or not you actually finished processing the work. It will bring up the number of pods that meets the threshold metrics.
Basically every queue system I know has Peek-Lock functionality or a means of achieving it: a consumer can lock and get the message, and the message gets deleted from the queue when the job is done. Queues (normally) support a single consumer per message ID per queue, meaning multiple consumers can't get the same message on a given queue.
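As an illustration of that peek-lock / delete-after-processing flow, here is a minimal sketch using a hypothetical queue client interface; `Receive` and `Delete` are stand-ins for whatever the real SDK calls them (e.g. receive and complete in Azure Service Bus).

```go
// The message stays locked (invisible to other consumers) while it is
// processed and is only deleted once processing succeeds; if the worker
// dies, the lock expires and the message becomes visible again.
package worker

import (
	"context"
	"log"
	"time"
)

type Message struct {
	ID   string
	Body []byte
}

// QueueClient is a stand-in for a real SDK; the method names are illustrative.
type QueueClient interface {
	// Receive locks the next message for lockDuration without removing it.
	// It returns (nil, nil) when the queue is empty.
	Receive(ctx context.Context, lockDuration time.Duration) (*Message, error)
	// Delete removes a previously locked message for good.
	Delete(ctx context.Context, m *Message) error
}

func Run(ctx context.Context, q QueueClient) {
	for {
		msg, err := q.Receive(ctx, 2*time.Hour) // lock long enough for the job
		if err != nil {
			log.Printf("receive failed: %v", err)
			time.Sleep(10 * time.Second)
			continue
		}
		if msg == nil {
			time.Sleep(5 * time.Second) // queue is empty
			continue
		}

		if err := process(ctx, msg); err != nil {
			// Do NOT delete: the lock expires and the message is redelivered.
			log.Printf("processing %s failed: %v", msg.ID, err)
			continue
		}

		// Only now is the item removed from the queue.
		if err := q.Delete(ctx, msg); err != nil {
			log.Printf("delete %s failed: %v", msg.ID, err)
		}
	}
}

func process(ctx context.Context, m *Message) error {
	// the long running work goes here
	return nil
}
```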
@lee0c is working on a proposal for the "job dispatch" pattern. Any questions, comments, upvotes, etc. should be focused on #199, to see if that's one viable option KEDA could provide to solve this. I provided some background and questions on that issue as well. Thanks @lee0c
I have a slightly alternative proposal for long running jobs that would also work for functions and other "listener"-style applications rather than job-style applications. Similar to this proposal, we would have KEDA scale the deployment directly instead of relying on the HPA. Not having HPA allows us to implement a simple "I'm busy" protocol with the application. There are a few possible options for how the application could signal that, any of which should allow most applications (from functions to bash scripts) to easily signal out to KEDA that they are busy.
Does the app have to initiate it, or does KEDA initiate it? Personally I'd prefer not to change my app because we chose to use KEDA. Having a "busy" endpoint next to (or even the same as) my health endpoint, which KEDA could consume, would be a better fit.
This is the approach we went with (having the scaler check an instance to see if it's busy). It follows the normal health/ready check pattern. It also reduces race-type conditions, as it's a live check when the scaler is looking to scale down a set.
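A minimal sketch of what that could look like on the worker side, assuming a plain HTTP endpoint; the `/busy` path, port, and status codes are just one possible convention, not anything KEDA defines, and the queue calls are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
)

var busy atomic.Bool // true while a message is being processed

// busyHandler lets an external prober ask whether the worker is mid-job.
func busyHandler(w http.ResponseWriter, r *http.Request) {
	if busy.Load() {
		w.WriteHeader(http.StatusConflict) // any agreed convention works
		fmt.Fprintln(w, "busy")
		return
	}
	fmt.Fprintln(w, "idle")
}

func processLoop() {
	for {
		msg := nextMessage() // hypothetical: blocks until a message arrives
		busy.Store(true)
		handle(msg) // the long running work
		busy.Store(false)
	}
}

func main() {
	go processLoop()
	http.HandleFunc("/busy", busyHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

// nextMessage and handle stand in for the real queue consumer.
func nextMessage() string { select {} }
func handle(msg string)   {}
```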
We could do that too. My only reservation is about expecting the pod to have an HTTP endpoint. Since all KEDA applications are meant to process polling-based events, I'm not sure most pods would need to expose an HTTP endpoint, and depending on the framework you're using it might not be simple to add one to a basic queue or topic consumer app. Since KEDA doesn't require deployments to expose a service, we would need KEDA to port-forward to each pod to query the HTTP endpoint inside.
KEDA could create a headless service for the deployment, which would give a DNS entry for every pod so KEDA can connect directly to an endpoint instead of port-forwarding. I still think it's the best option for an optional feature, even though it requires the worker to listen on a socket.
I'm not sure how exposing an endpoint resolves the race condition though. KEDA could query the endpoint and get "not busy", then start terminating the pod while the pod has picked up a new work item in the meantime. To eliminate that race condition the pod would have to cooperate in the shutdown process rather than just answer a status check.
What are the benefits of creating a service per pod vs port-forwarding to a pod while checking its status? Also, what are the benefits of requiring every queue consumer to implement an HTTP interface to signal its busy state?
If the problem we're trying to solve is KEDA terminating a 3 hour job that's 2 hours into processing, then those race conditions are not applicable. If the problem is KEDA terminating a pod that has received a work item between the time of KEDA checking the "busy" status and terminating the pod, then we need the pod to cooperate in the shutdown process, which is more involved.
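On the "cooperate in the shutdown process" point: one common shape for that cooperation is the worker handling SIGTERM so it stops taking new messages but finishes the in-flight one, paired with a terminationGracePeriodSeconds at least as long as the longest job. A rough sketch, with the queue calls (`tryReceive`, `process`, `ack`) as hypothetical stand-ins:

```go
package main

import (
	"context"
	"log"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// ctx is cancelled when Kubernetes sends SIGTERM to the container.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	for {
		select {
		case <-ctx.Done():
			log.Println("SIGTERM received; exiting without taking new work")
			return
		default:
		}

		msg, ok := tryReceive() // hypothetical non-blocking receive
		if !ok {
			time.Sleep(time.Second)
			continue
		}
		// Once picked up, the message is processed to completion even if
		// SIGTERM arrives meanwhile; only the outer loop checks ctx.
		process(msg)
		ack(msg)
	}
}

// tryReceive, process and ack stand in for a real queue SDK.
func tryReceive() (string, bool) { return "", false }
func process(msg string)         {}
func ack(msg string)             {}
```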
@ahmelsayed sorry for resurrecting an old thread. Isn't this exactly what liveness probes do? Are you suggesting using them or implementing something brand new? I am playing with pod disruption budgets to see if HPA respects them. If that's the case, we can have the pod clear the label when it is not "busy" and let HPA scale it down.
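To make the label part of that idea concrete: the worker could patch its own pod to add a "busy" label while processing and clear it when idle, so a selector-based mechanism such as a PodDisruptionBudget could tell busy pods apart. Whether HPA actually respects that on scale-in is exactly the open question above; this only sketches the labelling mechanics, and assumes POD_NAME/POD_NAMESPACE are injected via the downward API and the service account is allowed to patch pods.

```go
package main

import (
	"context"
	"log"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// setBusyLabel adds or removes a "busy" label on this worker's own pod.
func setBusyLabel(ctx context.Context, cs kubernetes.Interface, busy bool) error {
	patch := `{"metadata":{"labels":{"busy":"true"}}}`
	if !busy {
		// a null value in a JSON merge patch removes the label
		patch = `{"metadata":{"labels":{"busy":null}}}`
	}
	_, err := cs.CoreV1().Pods(os.Getenv("POD_NAMESPACE")).Patch(
		ctx, os.Getenv("POD_NAME"), types.MergePatchType,
		[]byte(patch), metav1.PatchOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	for {
		msg := nextMessage() // hypothetical blocking receive
		if err := setBusyLabel(ctx, cs, true); err != nil {
			log.Printf("failed to mark busy: %v", err)
		}
		handle(msg) // the long running work
		if err := setBusyLabel(ctx, cs, false); err != nil {
			log.Printf("failed to clear busy: %v", err)
		}
	}
}

// nextMessage and handle stand in for the real queue consumer.
func nextMessage() string { select {} }
func handle(msg string)   {}
```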
Is this still open? I see #322 merged, which gives external checkpoint functionality; I was wondering if these two issues are different.
Yes, the only gap now is documenting the two recommended approaches discussed above.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. |
I've been browsing through the source code, so apologies if this is supported and I just missed it.
Does KEDA scaling support long running workloads from queues?
If KEDA scales up a deployment to 1 pod because there is a queue with an item (assuming that's your scale config), and that pod pulls down the job to work on it and it's long running (say hours), will KEDA scale the deployment back down, causing Kubernetes to kill the pod, once the cooldown period is reached? From browsing the code, it looks like it would.
If it's not supported now, are there any plans on the roadmap? Or is this just not a use case KEDA is designed for?