
Add cool off period to prevent (slowdown) scaling down #28

Open
ichekrygin opened this issue May 8, 2017 · 8 comments

@ichekrygin
Contributor

Overview

Currently, the resource check runs at equal intervals, with each run resulting in one of:
a) No-op
b) Scale up
c) Scale down

[image: resource utilization pattern]

Resource utilization that resembles the pattern above makes the cluster scale follow the same shape (thrashing). We can possibly avoid this by introducing a "cool off" period, i.e. only triggering a scale-down event either:
a) after N successive "scale-down" condition matches since the last "scale-up" event, or
b) after a timeout of T since the last "scale-up" event

@hjacobs
Owner

hjacobs commented May 9, 2017

I don't fully understand your option "a)"; maybe you can explain a bit more.

I totally get "b)", and this could be a very simple "cool down" period, i.e. before scaling down, first check whether the last "scale up" event was at least T seconds ago. T could IMHO be something like 5-10 minutes (and configurable, of course).
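A minimal sketch of that check, assuming the autoscaler keeps the timestamp of its last scale-up in memory (the names `last_scale_up` and `COOL_DOWN_SECONDS` are illustrative, not from the project):

```python
import time

COOL_DOWN_SECONDS = 600  # e.g. 10 minutes; would be made configurable

last_scale_up = None  # set to time.time() whenever the cluster scales up


def may_scale_down():
    """Allow scale-down only after the cool-down window has elapsed."""
    if last_scale_up is None:
        return True
    return time.time() - last_scale_up >= COOL_DOWN_SECONDS
```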

@hjacobs
Owner

hjacobs commented May 9, 2017

@ichekrygin having a scoring logic for scale down (#4) could also help in those frequent scaling scenarios you described. BTW: we haven't seen such problems in our clusters at Zalando yet (we mostly have relatively stable webapp workloads right now which don't incur frequent up/down scaling).

@ichekrygin
Contributor Author

@hjacobs "a)" is basically a counter + check/wait attempt. Currently, the autoscaler performs check every T seconds. If it detects that "scale-down" is needed, it can skip "scale-down" trigger after N successive attempts. For example on every check:
a) cluster scaled up
b) cluster scale down is needed (1) - skip
c) cluster scale down is needed (2) - skip
d) cluster scale down is needed (3) - go ahead.

In general, the autoscaler performs great for 95% of our use cases; however, we noticed this "peak & valley" behavior on some occasions.
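A rough sketch of that counter logic, assuming the check loop keeps a single in-memory counter (all names here are hypothetical):

```python
REQUIRED_MATCHES = 3  # N successive "scale-down needed" checks before acting

scale_down_matches = 0


def should_scale_down(scale_down_needed, just_scaled_up):
    """Called once per check; returns True only on the Nth successive match."""
    global scale_down_matches
    if just_scaled_up or not scale_down_needed:
        scale_down_matches = 0  # a scale-up or no-op resets the counter
        return False
    scale_down_matches += 1
    return scale_down_matches >= REQUIRED_MATCHES
```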

@hjacobs
Owner

hjacobs commented May 9, 2017

@ichekrygin ok, I understand. Do you have any suggestions on how to keep the "state" (counter or time)? I want to keep the autoscaler as stateless as possible (making it very simple to reason about and robust). I like the time-based cool-off, as it could be solved by checking the AWS API (scaling activities), i.e. the autoscaler would still be completely stateless (the state is kept on the AWS side).
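A sketch of that stateless variant using boto3's `describe_scaling_activities`; the ASG name, the 10-minute window, and matching on "Launching" in the activity description are assumptions for illustration, not the project's actual logic:

```python
from datetime import datetime, timedelta, timezone

import boto3

COOL_DOWN = timedelta(minutes=10)  # assumed cool-down window


def scaled_up_recently(asg_name):
    """Return True if the ASG launched an instance within the cool-down window."""
    client = boto3.client('autoscaling')
    response = client.describe_scaling_activities(AutoScalingGroupName=asg_name)
    cutoff = datetime.now(timezone.utc) - COOL_DOWN
    for activity in response['Activities']:
        # Launch activities indicate a scale-up; descriptions typically look
        # like "Launching a new EC2 instance: i-0abc..." (format may vary).
        if activity['StartTime'] >= cutoff and 'Launching' in activity['Description']:
            return True
    return False
```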

@hjacobs
Owner

hjacobs commented May 9, 2017

@ichekrygin btw, thanks for your valuable input and feedback 😄 Could you describe your workload in some more detail to help me understand where you run into this "peak & valley" behavior?

@hjacobs
Owner

hjacobs commented May 11, 2017

A cool-off period after scale-up probably makes sense; we see some short spikes too (scaling up from 9 to 10 nodes for less than 10 minutes):
[screenshot: cluster node count over time, 2017-05-11]

This is only one specific time frame, the cluster size looks more stable for other days.

@Vince-Cercury

This would be useful. With AWS we pay for the hour of compute anyway, so it's wasteful to kill the nodes right away.
I need an autoscaler for feature-branch testing, which happens mostly 9am-7pm. During that period I'd rather oversize my cluster and avoid delays. I don't mind keeping nodes longer than required, as they might get used again soon.

A delay, just like the official autoscaler has, would make this autoscaler cover my use case. At the moment, neither seems suitable to me.

@Vince-Cercury

I'm actually wrong: AWS now bills by the second. I must have missed the memo ;) That means this feature is less critical for me.
