-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fallback wrapper scaler #1872
Comments
So to summarize, if it can't get metrics for Prometheus you want it to fallback to 8 instances, is that correct? |
Just to refine that; we would like this trigger/scaler to present as "8 instances" from the HPA. Other scalers would continue to work as normal and HPA would just take the maximum of all of them. |
In that case, have you considered to change the configuration the other way around? Something like this: triggers:
- type: prometheus
metadata:
# Required
serverAddress: http://<prometheus-host>:9090
metricName: http_requests_total
query: sum(rate(http_requests_total{deployment="my-deployment"}[2m])) # Note: query must return a vector/scalar single element response
threshold: '100'
fallback:
# Optional
replicas: 8 (forget about the implementation for now) |
I think that looks good, but maybe we should even flatten to Without forgetting about implementation, I think this also simplifies the approach. A new scaler that acts with a delegate probably still makes sense. |
The reason why I'm not keeping it flat is so that we can add more things alter on if we want to. For example, send a notification or something. (thinking out loud) |
That makes sense. Implementation-wise we've realised it might be a bit difficult to define the target ( |
This feels like a very overcomplicated solution? We would we not just let the current scale stay when things are broken? That seems like a safer behavior. Having a pinned fallback like this could get very flappy during partial failure scenarios. |
I think you would want more safety when your metrics sources are not available. If your application is scaled to its minimum replicas, and is suddenly hit with load while your metrics sources are down, it could easily be overloaded. If you have a safe fallback (it could be your maximum replicas to be extra safe), then you will always be able to handle that load. |
My worry is if Prom (or anything else) is overloaded and timing out on ~50% of requests, you could suddenly have a situation where your replica count is rapidly shifting between whatever it's supposed to be at and the fallback number. This would likely put a lot of stress on things. I guess that's fine as long as recommend people set their HPA cooldown parameters when using this feature? That way it would stay at the higher one until things stabilize (or until the cooldown expires). |
Yeah, that situation makes sense. But I think the fact that by default HPA scales up immediately when it needs to and has a default stabilisation window of 5 minutes would alleviate this. Yes, it might be worth documenting that aggressive scale-down configurations should be avoided when using this feature. |
I do get your concern but like @mishamo is saying I can see value in specifying a fallback. However, it should only kick in after x cycles of checking, not for every failure imo, and come with the required bells and whistles around it such as putting the SO in degraded state or so and emit kubernetes events. |
Yes. I was thinking about this myself and came to the same conclusion; we would only want this to happen after a few (configurable) consecutive failed attempts. This likely complicates the implementation somewhat but is probably the right move. |
Why not both? :) In the fallback section, users could specify either the exact number of replicas or an option to stay on the current replicas number. I agree with Tom's suggestion on keeping the configuration simple (proposed in this comment: #1872 (comment)) |
Isn't this the existing behaviour? If so, that would be the same as not including the fallback section at all. |
It will stay on the current replicas, but not one you can define |
If there's a problem in connecting the scaler |
why? |
There's two separate issues:
|
That is how it is today, no? You cannot define what to scale to on failure |
Ok. I think we have an approach that should work and feels acceptable to us. It is as follows:
|
I would propose this instead: (nit) ...
fallback:
failureThreshold: 3
- fallbackReplicas: 10
+ replicas: 10
triggers:
... In terms of the status, does it have to fall in the fallback category? Imagine I'm not using it, I would still be interested to see the # of failures in the status field. |
No issue with the change of
The only limitation with moving the map out of |
What if we can it What do you think @zroubalik ? |
Sorry for the long response time, I was overhelmed with other stuff. Pardon my ingorance, I am still trying to understand what is the specific ask, what is the scenario you'd like to cover? You want to define the fallback per ScaledObject or per Scaler(trigger)? Let's say we have these triggers in a ScaledObject with some metrics:
Possible use cases:
|
Hi @zroubalik, the original idea was to define the fallback per Scaler (trigger), but as this was more complicated to implement, we decided to shift the fallback definition to the ScaledObject (which is still a fallback per scaler, it's just that the same fallback would be used for all scalers on that object). The use case described in 1. looks correct to me. |
@mishamo okay. Sounds good |
How do you plan to count the number of failures? For example if you have 3 scalers and first one starts to fail and then another scaler starts to fail as well? |
The plan is to maintain a map of |
Okay, so you don't plan to sum the failures from all scalers, ack. Anything else you'd like to discuss? The proposal looks good for me 👍 |
Nope, I think we're all good for now. We have some code that should work on https://github.com/mishamo/keda/tree/fallback-scaler, but there's quite a lot of testing to be done before we create a PR. Just to clarify there's 2 pieces of work:
When we come to it, should we PR these as separate PRs or is it OK to PR as one? EDIT: As I wrote that, it made me think that maybe the |
I think that one PR is fine (if it is not huge, because reviewing is a bit harder in that case). Yeah, those are all valid concerns. Thanks a lot for the investigation and the work on this. Once you feel that you are somehow happy with the solution, open a PR and we can follow up the discussion over there. |
Re-opening since code wasn't merged yet |
Proposal
A metrics source can become unavailable. This causes it to be ignored if there are multiple scalers present. The proposal is to add a new scaler that would delegate to a wrapped instance of another scaler. If that delegate fails, it would provide a fallback metrics of
fallbackReplicas
/currentReplicas
(a little like the cron scaler).Scaler Source
A delegate scaler, falling back to a configured replica value
Scaling Mechanics
A proposed configuration could look something like the following:
The catch with this is the current type of
TriggerMetadata
ismap[string]string
. Because of this, either that type has to be changed to handle arbitrary object values, or the config has to be flattened. I think the former is preferred to avoid awkward config for users.Authentication Source
Delegate
Anything else?
This is a continuation of some thoughts from #965
We are happy to contribute this, but would like some discussion about the approach
The text was updated successfully, but these errors were encountered: