Add cool off period to prevent (slowdown) scaling down #28
Comments
I don't fully understand your answer "a)"; maybe you can explain a bit more. I totally get "b)", and this could be a very simple "cool down" period, i.e. before scaling down, check first whether the last "scale up" event was at least T seconds ago. T could IMHO be something like 5-10 minutes (and configurable, of course).
@ichekrygin having scoring logic for scale-down (#4) could also help in the frequent scaling scenarios you described. BTW: we haven't seen such problems in our clusters at Zalando yet (we mostly have relatively stable webapp workloads right now which don't incur frequent up/down scaling).
@hjacobs "a)" is basically a counter + check/wait attempt. Currently, the autoscaler performs check every T seconds. If it detects that "scale-down" is needed, it can skip "scale-down" trigger after N successive attempts. For example on every check: In general, the autoscaler performs great for 95% of our use cases, however, we noticed this "peak & valley" behavior on some occasions. |
@ichekrygin ok, I understand. Do you have any suggestion on how to keep the "state" (counter or time)? I want to keep the autoscaler as stateless as possible (making it very simple to reason about and robust). I like the time-based cool-off as it could be solved by checking the AWS API (scaling activities), i.e. the autoscaler would still be completely stateless (the state is kept on the AWS side).
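(A minimal sketch of that stateless, time-based variant, assuming boto3 and a 10-minute cool-off; the function name and cool-off value are illustrative, and the Auto Scaling Group name would come from the autoscaler's existing configuration.)

```python
import datetime

import boto3

COOL_OFF = datetime.timedelta(minutes=10)  # "T" -- assumed value

def in_cool_off(asg_name: str, region: str) -> bool:
    """Return True if the ASG had any scaling activity within COOL_OFF.

    State lives entirely on the AWS side: we only look at the most recent
    scaling activities instead of keeping a local counter. A real
    implementation might additionally filter for scale-up (launch)
    activities only.
    """
    client = boto3.client('autoscaling', region_name=region)
    activities = client.describe_scaling_activities(
        AutoScalingGroupName=asg_name, MaxRecords=10)['Activities']
    now = datetime.datetime.now(datetime.timezone.utc)
    return any(now - a['StartTime'] < COOL_OFF for a in activities)
```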
@ichekrygin btw, thanks for your valuable input and feedback 😄 Could you describe your workload in some more detail to help me understand where you run into this "peak & valley" behavior?
This would be useful. With AWS we pay for the full hour of compute anyway, so it's wasteful to kill the nodes right away. A delay, just like the official autoscaler has, would make this autoscaler cover my use cases. At the moment, neither seems suitable to me.
I'm actually wrong. AWS now bills by the second. I must have missed the memo ;)
Overview
Currently, the resource check runs at equal intervals, with each run resulting in one of:
a) No Op
b) Scale Up
c) Scale Down
Resource utilization that alternates between these outcomes results in the same alternation in cluster scale (thrashing). We can possibly avoid this by introducing a "cool off" period, i.e. only triggering the scale-down event either (see the sketch after this list):
a) after N successive "scale-down" condition matches following the last "scale-up" event
b) after a timeout of T since the last "scale-up" event
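(A minimal sketch of where such a guard could sit in the check loop, assuming a hypothetical `check_resources()` that returns one of the three outcomes above and an `in_cool_off()` guard implementing either a) or b); all names and the interval value are illustrative, not the current implementation.)

```python
import time

CHECK_INTERVAL = 60  # seconds between resource checks (assumed)

def run_loop() -> None:
    while True:
        decision = check_resources()   # 'up', 'down' or 'noop' (assumed)
        if decision == 'up':
            scale_up()                 # scale-up stays immediate
        elif decision == 'down' and not in_cool_off():
            # in_cool_off() implements option a) or b) above,
            # so scale-down is skipped during the cool-off window
            scale_down()
        time.sleep(CHECK_INTERVAL)
```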