Added NifiNodeGroupAutoscaler with basic scaling strategies #89
Conversation
Force-pushed from 88ba4d3 to 6d5debc (commit: …ategies, added CreationTime to NifiCluster.Status.NodeState, various PR feedback; Co-authored-by: Alexandre Guitton <[email protected]>)
Okay, I've updated this PR with the suggested changes. I'll update the PR description to reflect this.
Force-pushed from 1f021bc to dc09428
Here are some example logs from upscale and downscale events:
I'm happy to squash all of these commits if that's desired.
Just following up on this PR. I believe it's functionally complete with one exception: when a node group is scaled up, the newly added node's pod ends up being restarted by the RollingUpgrade behavior described in the additional context. Is there a way we can tell the operator to run an upscale operation to begin with, to avoid this pod restarting? Once this is resolved, I will remove the draft status of this PR so that we can move to merge it.
I will merge this PR on a branch and create a PR to remove the pod restart, because with the latest version of NiFi it is no longer necessary.
Sounds good! I can help review & test these changes on that PR.
Added NifiNodeGroupAutoscaler with basic scaling strategies
What's in this PR?
This PR contains an implementation of the autoscaling design voted on here: https://konpytika.slack.com/archives/C035FHN1MNG/p1650031443231799
It adds:

- A new `NifiNodeGroupAutoscaler` resource and controller.
- `Nodes.Spec.Labels`, so that nodes in a `NifiCluster` can be tagged with arbitrary labels, and specifically so that the `NifiNodeGroupAutoscaler` can identify the nodes it should manage.
- Scaling logic: the autoscaler takes the `NifiCluster.Spec.Nodes` list, filters it down to the nodes it manages using the provided `NifiNodeGroupAutoscaler.Spec.NodeLabelsSelector`, compares that set against the `NifiNodeGroupAutoscaler.Spec.Replicas` setting, and determines whether scaling needs to happen: if # replicas > # managed nodes, scale up; if # replicas < # managed nodes, scale down; else do nothing.
- It then updates the `NifiCluster.Spec.Nodes` list with the added/removed nodes and updates the scale subresource status fields.
- A `NifiNodeGroupAutoscaler` can manage any subset of nodes in a `NifiCluster`, up to the entire cluster. With this, you can enable highly resourced (mem/cpu) node groups for volume bursts, or just autoscale entire clusters driven by metrics. A sketch of how these resources might fit together follows below.

I don't necessarily consider this PR complete, which is why it's in draft status. See the additional context below. However, I have tested this on a live system and it does work.
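For illustration, here is a minimal sketch of how a labeled node group and a `NifiNodeGroupAutoscaler` could be wired together. This is not taken from the PR itself: the apiVersion, resource names, label keys, and the `clusterRef` field are assumptions, while `replicas`, `nodeLabelsSelector`, and per-node `labels` follow the description above.

```yaml
# Hypothetical NifiCluster fragment: nodes carry labels the autoscaler can select on.
apiVersion: nifi.konpyutaika.com/v1alpha1    # assumed group/version
kind: NifiCluster
metadata:
  name: my-nifi                              # hypothetical cluster name
spec:
  nodes:
    - id: 0
      labels:
        nodeGroup: high-memory               # assumed label key/value
---
# Hypothetical autoscaler managing the "high-memory" node group of that cluster.
apiVersion: nifi.konpyutaika.com/v1alpha1    # assumed group/version
kind: NifiNodeGroupAutoscaler
metadata:
  name: high-memory-autoscaler
spec:
  clusterRef:                                # assumed field name for referencing the NifiCluster
    name: my-nifi
  replicas: 3                                # desired number of managed nodes
  nodeLabelsSelector:
    matchLabels:
      nodeGroup: high-memory                 # selects the labeled nodes above
```

With a layout like this, the autoscaler would compare `replicas` (3) against the number of nodes matching the selector (1) and add two nodes to `spec.nodes`; if `replicas` were lowered below the managed count, it would remove nodes instead.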
Why?
To enable horizontal scaling of NiFi clusters.
Additional context
There are a few scenarios that need to be addressed prior to merging this:

- On a scale-up event, when the `NifiNodeGroupAutoscaler` adds nodes to the `NifiCluster.Spec.Nodes` list, the NiFi cluster controller correctly adds a new pod to the deployment. However, when that node comes completely up and is `Running` in k8s, nifikop kills it, kicks off a `RollingUpgrade`, and basically restarts the new node. This occurs here, but I'm not exactly sure what causes it. Scaling down happens "gracefully".
- It's not possible to deploy a `NifiCluster` with only autoscaled node groups. The `NifiCluster` CRD requires that you specify at least one node in the `spec.nodes` list. Do we want to support this? If so, we may need to adjust the cluster initialization logic in the `NifiCluster` controller.
- I did all of my testing via ArgoCD. When the live state differs from the state in git, ArgoCD immediately reverts it, so I had to craft my applications carefully to avoid ArgoCD undoing the changes that the `HorizontalPodAutoscaler` and `NifiNodeGroupAutoscaler` controllers were making (one possible approach is sketched after this list).
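One common way to keep ArgoCD from reverting autoscaler-driven changes is to tell it to ignore the fields that the controllers mutate. The sketch below is illustrative only and not necessarily how the applications were crafted for this PR; the Application name, CRD group, and field paths are assumptions.

```yaml
# Hypothetical ArgoCD Application fragment that ignores controller-managed fields.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: nifi                                  # hypothetical Application name
spec:
  # source, destination, and project omitted for brevity
  ignoreDifferences:
    - group: nifi.konpyutaika.com             # assumed CRD group
      kind: NifiCluster
      jsonPointers:
        - /spec/nodes                         # mutated by the NifiNodeGroupAutoscaler
    - group: nifi.konpyutaika.com
      kind: NifiNodeGroupAutoscaler
      jsonPointers:
        - /spec/replicas                      # mutated by the HorizontalPodAutoscaler
```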
I've tested scaling up and down and successfully configured the `HorizontalPodAutoscaler` to pull NiFi metrics from Prometheus. Here's the autoscaler config that I used for that setup (sketched below):
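A minimal sketch of such an HPA, assuming the `NifiNodeGroupAutoscaler` exposes the scale subresource (as described above) and that prometheus-adapter serves `nifi_amount_items_queued_sum` as an external metric. The apiVersion, resource names, and threshold below are assumptions, not the exact values used for this PR.

```yaml
apiVersion: autoscaling/v2                     # assumed; older clusters may need autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: high-memory-hpa                        # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: nifi.konpyutaika.com/v1alpha1  # assumed group/version
    kind: NifiNodeGroupAutoscaler
    name: high-memory-autoscaler               # hypothetical autoscaler name
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: External
      external:
        metric:
          name: nifi_amount_items_queued_sum   # served by prometheus-adapter (see below)
        target:
          type: AverageValue
          averageValue: "5000"                 # assumed threshold
```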
And the `nifi_amount_items_queued_sum` metric is added to the prometheus-adapter deployment as follows (also sketched below):
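Again as a sketch rather than the exact rule: assuming the prometheus-community prometheus-adapter Helm chart values format, and assuming the underlying series name and labels, an external-metric rule could look like this.

```yaml
# Hypothetical prometheus-adapter Helm values fragment exposing the queued-items metric.
rules:
  external:
    - seriesQuery: 'nifi_amount_items_queued{namespace!="",pod!=""}'   # assumed source series
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        as: "nifi_amount_items_queued_sum"                             # name the HPA queries
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```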
Checklist