
Added NifiNodeGroupAutoscaler with basic scaling strategies #89

Merged: 29 commits merged into konpyutaika:auto-scaler on Jul 13, 2022

Conversation

mh013370 (Member) commented Apr 22, 2022

Q A
Bug fix? no
New feature? yes
API breaks? no
Deprecations? no
Related tickets
License Apache 2.0

What's in this PR?

This PR contains an implementation of the autoscaling design voted on here: https://konpytika.slack.com/archives/C035FHN1MNG/p1650031443231799

It adds:

  • A new custom resource called NifiNodeGroupAutoscaler
  • A new Nodes.Spec.Labels field so that nodes in a NifiCluster can be tagged with arbitrary labels, specifically so the NifiNodeGroupAutoscaler can identify which nodes to manage
  • A new controller for this custom resource. Its reconciliation loop is as follows (see the sketch after this list):
    • Fetch the current NifiCluster.Spec.Nodes list and filter it down to the managed nodes using the provided NifiNodeGroupAutoscaler.Spec.NodeLabelsSelector
    • Compare the managed node list with the current NifiNodeGroupAutoscaler.Spec.Replicas setting and determine whether scaling needs to happen: if the number of replicas is greater than the number of managed nodes, scale up; if it is smaller, scale down; otherwise do nothing
    • If a scale event happened, patch the NifiCluster.Spec.Nodes list with the added/removed nodes and update the scale subresource status fields
  • A NifiNodeGroupAutoscaler can manage any subset of nodes in a NifiCluster, up to the entire cluster. This makes it possible to run highly resourced (memory/CPU) node groups for volume bursts, or simply to autoscale entire clusters driven by metrics.
  • I significantly reduced the default verbosity of the logs, since it was difficult to track what the operator was actually doing. In general, I think this should be continued to the point where the operator only logs the actions it has taken or the changes it has made in k8s.
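
To make the decision rule above concrete, here is a minimal, self-contained Go sketch of the comparison step. The Node type and the surrounding main function are simplified stand-ins for illustration, not the actual nifikop controller code:

package main

import "fmt"

// Node is a simplified stand-in for an entry in NifiCluster.Spec.Nodes.
type Node struct {
	Id     int32
	Labels map[string]string
}

// scaleDelta compares the desired replica count against the nodes currently
// selected by the autoscaler's label selector: positive means scale up,
// negative means scale down, zero means do nothing.
func scaleDelta(replicas int32, managedNodes []Node) int32 {
	return replicas - int32(len(managedNodes))
}

func main() {
	managed := []Node{{Id: 0, Labels: map[string]string{"scale_me": "true"}}}
	switch delta := scaleDelta(3, managed); {
	case delta > 0:
		fmt.Printf("scale up: add %d node(s)\n", delta) // patch new nodes into NifiCluster.Spec.Nodes
	case delta < 0:
		fmt.Printf("scale down: remove %d node(s)\n", -delta) // apply the configured downscale strategy
	default:
		fmt.Println("no scaling needed")
	}
}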

I don't necessarily consider this PR complete, which is why it's in draft status. See the additional context below. However, I have tested this on a live system and it does work.

Why?

To enable horizontal scaling of NiFi clusters.

Additional context

There are a few scenarios that need to be addressed prior to merging this:

  • On a scale up event, when the NifiNodeGroupAutoscaler adds nodes to the NifiCluster.Spec.Nodes list, the nifi cluster controller correctly adds a new pod to the deployment. However, when that node comes completely up and is Running in k8s, nifikop kills it and kicks off a RollingUpgrade that basically restarts the new node. This occurs here, but I'm not exactly sure what causes it. Scaling down happens "gracefully".

  • It's not possible to deploy a NifiCluster with only autoscaled node groups. The NifiCluster CRD requires that you specify at least one node in the spec.nodes list. Do we want to support this? If so, we may need to adjust the cluster initialization logic in the NifiCluster controller.

I did all of my testing via ArgoCD. When the live state differs from the state in git, ArgoCD immediately reverts it, so I had to craft my applications carefully to avoid ArgoCD undoing the changes that the HorizontalPodAutoscaler and NifiNodeGroupAutoscaler controllers were making.

I've tested scaling up and down, and successfully configured the HorizontalPodAutoscaler to pull NiFi metrics from Prometheus. Here's the autoscaler config that I used for that setup:

apiVersion: nifi.konpyutaika.com/v1alpha1
kind: NifiNodeGroupAutoscaler
metadata:
  labels:
    argocd.argoproj.io/instance: as-nf
  name: as-nf-scale-group
  namespace: nifi
spec:
  clusterRef:
    name: as-nf
    namespace: nifi
  nodeConfigGroupId: scale-group
  nodeLabelsSelector:
    matchLabels:
      scale_me: 'true'
  upscaleStrategy: simple
  downscaleStrategy: lifo
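
For orientation, here is a rough Go sketch of what a spec type matching the YAML above could look like. The field names, JSON tags, and types below are inferred assumptions and may differ from the actual api/v1alpha1/nifinodegroupautoscaler_types.go:

// Rough sketch only: inferred from the YAML example above, not copied from
// the actual nifikop API types.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ClusterReference points the autoscaler at the NifiCluster whose nodes it
// manages (hypothetical shape for this sketch).
type ClusterReference struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace,omitempty"`
}

type NifiNodeGroupAutoscalerSpec struct {
	// Reference to the target NifiCluster.
	ClusterRef ClusterReference `json:"clusterRef"`
	// Node config group that the autoscaled nodes belong to.
	NodeConfigGroupId string `json:"nodeConfigGroupId"`
	// Label selector used to pick the managed subset of NifiCluster.Spec.Nodes.
	NodeLabelsSelector *metav1.LabelSelector `json:"nodeLabelsSelector"`
	// Desired node count, driven through the scale subresource (e.g. by an HPA).
	Replicas int32 `json:"replicas,omitempty"`
	// Scaling strategies, e.g. "simple" for upscale and "lifo" for downscale.
	UpscaleStrategy   string `json:"upscaleStrategy"`
	DownscaleStrategy string `json:"downscaleStrategy"`
}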

And the nifi_amount_items_queued_sum metric is added to the prometheus-adapter deployment as follows:

prometheus-adapter:
  rules:
    custom:
    - seriesQuery: 'nifi_amount_items_queued'
      resources:
        # skip specifying generic resource<->label mappings, and just
        # attach only pod and namespace resources by mapping label names to group-resources
        overrides:
          kubernetes_namespace:
            resource: "namespace"
          kubernetes_pod_name: 
            resource: "pod"
      name:
        matches: "^(.*)"
        as: "${1}_sum"
      metricsQuery: (sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>))

Checklist

  • Implementation tested
  • Error handling code meets the guideline
  • Logging code meets the guideline
  • User guide and development docs updated (if needed)
  • Append changelog with changes

mh013370 force-pushed the create-nodegroup-autoscaler branch from 88ba4d3 to 6d5debc on April 23, 2022 08:02
Review threads (resolved) on: api/v1alpha1/nifinodegroupautoscaler_types.go, controllers/nifinodegroupautoscaler_controller.go, pkg/util/autoscale/strategy.go, pkg/resources/nifi/pod.go
mh013370 and others added 3 commits May 13, 2022 18:34
mh013370 (Member, Author) commented May 16, 2022

Okay, I've updated this PR with the following suggested changes:

  • The operator no longer creates a HorizontalPodAutoscaler. This is expected to be created externally.
  • The way new nodeIds are assigned for autoscaling strategies is now centralized and should be shared by all future HorizontalUpscaleStrategy implementations.
  • Added an optional NifiNodeGroupAutoscaler.Spec.NodeConfig.
  • Added NifiCluster.Status.NodesState.CreationTime to capture when a node was added to the cluster. This is used by the LIFOHorizontalDownscaleStrategy to remove the most recently added nodes on a scale-down event (a sketch of the idea follows below).

I'll update the PR description to reflect this.
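
As a rough illustration of the LIFO idea only (the nodeState shape below is a simplified assumption, not the actual nifikop type), sorting by CreationTime and taking the newest entries picks the nodes to remove:

// Illustrative sketch of a LIFO downscale: remove the most recently created
// nodes first. nodeState is a simplified stand-in, not the actual nifikop type.
package main

import (
	"fmt"
	"sort"
	"time"
)

type nodeState struct {
	Id           int32
	CreationTime time.Time
}

// lifoNodesToRemove returns the n most recently created nodes.
func lifoNodesToRemove(nodes []nodeState, n int) []nodeState {
	sorted := append([]nodeState(nil), nodes...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].CreationTime.After(sorted[j].CreationTime) // newest first
	})
	if n > len(sorted) {
		n = len(sorted)
	}
	return sorted[:n]
}

func main() {
	now := time.Now()
	nodes := []nodeState{
		{Id: 0, CreationTime: now.Add(-3 * time.Hour)},
		{Id: 1, CreationTime: now.Add(-2 * time.Hour)},
		{Id: 2, CreationTime: now.Add(-1 * time.Hour)},
	}
	for _, n := range lifoNodesToRemove(nodes, 2) {
		fmt.Printf("remove node %d\n", n.Id) // prints nodes 2 and 1: last in, first out
	}
}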

mh013370 force-pushed the create-nodegroup-autoscaler branch from 1f021bc to dc09428 on May 16, 2022 11:02
mh013370 (Member, Author) commented:

Here are some example logs from upscale and downscale events:

Mon, May 16 2022 4:04:06 pm | {"level":"info","ts":1652713446.1393616,"logger":"controllers.NifiNodeGroupAutoscaler","msg":"Replicas changed from 3 to 1"}
Mon, May 16 2022 4:04:06 pm | {"level":"info","ts":1652713446.1590438,"logger":"controllers.NifiNodeGroupAutoscaler","msg":"Removing 2 nodes from cluster as-nf spec.nodes configuration for node group scale-group"}
Mon, May 16 2022 4:04:06 pm | {"level":"info","ts":1652713446.1590834,"logger":"controllers.NifiNodeGroupAutoscaler","msg":"Using LIFO downscale strategy for cluster as-nf node group scale-group"}
Mon, May 16 2022 4:05:53 pm | {"level":"info","ts":1652713553.9995086,"logger":"controllers.NifiNodeGroupAutoscaler","msg":"Replicas changed from 1 to 2"}
Mon, May 16 2022 4:05:54 pm | {"level":"info","ts":1652713554.0240824,"logger":"controllers.NifiNodeGroupAutoscaler","msg":"Adding 1 more nodes to cluster as-nf spec.nodes configuration for node group scale-group"}
Mon, May 16 2022 4:05:54 pm | {"level":"info","ts":1652713554.024129,"logger":"controllers.NifiNodeGroupAutoscaler","msg":"Using Simple upscale strategy for cluster as-nf node group scale-group"}

mh013370 (Member, Author) commented:

I'm happy to squash all of these commits if that's desired.

mh013370 (Member, Author) commented Jun 6, 2022

Just following up on this PR. I believe it's functionally complete with one exception.

When a node group is scaled up (e.g. the NifiCluster.spec.nodes list is patched), NiFiKop will correctly create the new pod. However, when the pod eventually becomes ready and running, NiFiKop kills the pod and runs a graceful upscale cluster task, effectively restarting the new pod.

Is there a way we can tell the operator to run an upscale operation to begin with to avoid this pod restarting?

Once this is resolved, I will remove the draft status of this PR so that we can move to merge it.

mh013370 marked this pull request as ready for review on June 16, 2022 07:15
mh013370 requested a review from erdrix on June 16, 2022 07:16
erdrix (Contributor) commented Jul 13, 2022

> Is there a way we can tell the operator to run an upscale operation to begin with to avoid this pod restarting?

I will merge this PR on a branch and create a PR to remove the pod restart, because with the latest version of NiFi it is no longer necessary.

erdrix changed the base branch from master to auto-scaler on July 13, 2022 13:34
erdrix merged commit 50aeb49 into konpyutaika:auto-scaler on Jul 13, 2022
mh013370 (Member, Author) commented:

> I will merge this PR on a branch and create a PR to remove the pod restart, because with the latest version of NiFi it is no longer necessary.

Sounds good! I can help review & test these changes on that PR.

erdrix added a commit that referenced this pull request Aug 3, 2022
Added NifiNodeGroupAutoscaler with basic scaling strategies