
Add cool off period to prevent (slowdown) scaling down #28

Open
ichekrygin opened this issue May 8, 2017 · 8 comments

@ichekrygin
Contributor

Overview

Currently, the resource check runs at equal intervals, with each run resulting in one of:
a) No-op
b) Scale up
c) Scale down

[image: resource utilization pattern]

Resource utilization that resembles the pattern above makes the cluster scale follow the same shape (thrashing). We can possibly avoid this by introducing a "cool off" period, i.e. only triggering a scale-down event either:
a) after N successive "scale-down" condition matches since the last "scale-up" event, or
b) after a timeout of T since the last "scale-up" event

@hjacobs
Owner

hjacobs commented May 9, 2017

I don't fully understand your option "a)"; maybe you can explain a bit more.

I totally get "b)", and this could be a very simple "cool down" period, i.e. before scaling down, first check whether the last "scale up" event was at least T seconds ago. T could IMHO be something like 5-10 minutes (and configurable, of course).
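A minimal sketch of that check, assuming the autoscaler keeps the timestamp of its last scale-up in memory (the names `last_scale_up` and `COOL_DOWN_SECONDS` are illustrative, not from the project):

```python
import time

COOL_DOWN_SECONDS = 600  # e.g. 10 minutes; would be made configurable

last_scale_up = None  # set to time.time() whenever the cluster scales up


def may_scale_down():
    """Allow scale-down only after the cool-down window has elapsed."""
    if last_scale_up is None:
        return True
    return time.time() - last_scale_up >= COOL_DOWN_SECONDS
```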

@hjacobs
Owner

hjacobs commented May 9, 2017

@ichekrygin having a scoring logic for scale down (#4) could also help in those frequent scaling scenarios you described. BTW: we haven't seen such problems in our clusters at Zalando yet (we mostly have relatively stable webapp workloads right now which don't incur frequent up/down scaling).

@ichekrygin
Contributor Author

@hjacobs "a)" is basically a counter + check/wait attempt. Currently, the autoscaler performs check every T seconds. If it detects that "scale-down" is needed, it can skip "scale-down" trigger after N successive attempts. For example on every check:
a) cluster scaled up
b) cluster scale down is needed (1) - skip
c) cluster scale down is needed (2) - skip
d) cluster scale down is needed (3) - go ahead.

In general, the autoscaler performs great for 95% of our use cases; however, we noticed this "peak & valley" behavior on some occasions.
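A rough sketch of that counter logic, assuming the check loop keeps a single in-memory counter (all names here are hypothetical):

```python
REQUIRED_MATCHES = 3  # N successive "scale-down needed" checks before acting

scale_down_matches = 0


def should_scale_down(scale_down_needed, just_scaled_up):
    """Called once per check; returns True only on the Nth successive match."""
    global scale_down_matches
    if just_scaled_up or not scale_down_needed:
        scale_down_matches = 0  # a scale-up or no-op resets the counter
        return False
    scale_down_matches += 1
    return scale_down_matches >= REQUIRED_MATCHES
```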

@hjacobs
Owner

hjacobs commented May 9, 2017

@ichekrygin ok, I understand. Do you have any suggestions on how to keep the "state" (counter or time)? I want to keep the autoscaler as stateless as possible (making it very simple to reason about and robust). I like the time-based cool-off, as it could be solved by checking the AWS API (scaling activities), i.e. the autoscaler would still be completely stateless (the state is kept on the AWS side).
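A sketch of that stateless variant using boto3's `describe_scaling_activities`; the ASG name, the 10-minute window, and matching on "Launching" in the activity description are assumptions for illustration, not the project's actual logic:

```python
from datetime import datetime, timedelta, timezone

import boto3

COOL_DOWN = timedelta(minutes=10)  # assumed cool-down window


def scaled_up_recently(asg_name):
    """Return True if the ASG launched an instance within the cool-down window."""
    client = boto3.client('autoscaling')
    response = client.describe_scaling_activities(AutoScalingGroupName=asg_name)
    cutoff = datetime.now(timezone.utc) - COOL_DOWN
    for activity in response['Activities']:
        # Launch activities indicate a scale-up; descriptions typically look
        # like "Launching a new EC2 instance: i-0abc..." (format may vary).
        if activity['StartTime'] >= cutoff and 'Launching' in activity['Description']:
            return True
    return False
```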

@hjacobs
Owner

hjacobs commented May 9, 2017

@ichekrygin btw, thanks for your valuable input and feedback 😄 Could you describe your workload in some more detail to help me understand where you run into this "peak & valley" behavior?

@hjacobs
Owner

hjacobs commented May 11, 2017

A cool-off period after scale-up probably makes sense; we see some short spikes too (scaling up from 9 to 10 nodes for less than 10 minutes):
[screenshot: cluster node count over time, 2017-05-11]

This is only one specific time frame, the cluster size looks more stable for other days.

@Vince-Cercury

This would be useful. With AWS we pay for the hour of compute anyway, so it's wasteful to kill the nodes right away.
I need an autoscaler for feature-branch testing, which happens mostly 9am-7pm. During that period I'd rather oversize my cluster and avoid delays. I don't mind keeping nodes longer than required, as they might get used again soon.

A delay, just like the official autoscaler has, would make this autoscaler cover my use case. At the moment, neither seems suitable to me.

@Vince-Cercury

I'm actually wrong: AWS now bills by the second. I must have missed the memo ;) That means this feature is less critical for me.
