Feature Request: Enhance ClusterSet/Set with Custom Failover Support #981

Closed
kahirokunn opened this issue Jan 30, 2025 · 8 comments

@kahirokunn
Contributor

I'll help format this as a feature request.

Title: Enhance ClusterSet/Set with KeptnMetric-based Failover Support

Is your feature request related to a problem? Please describe.
Currently, while ClusterSet/Set provides failover functionality, it could be enhanced by leveraging existing CNCF capabilities for metrics-based failover decisions. There's an opportunity to integrate with established tools rather than implementing custom solutions.

Describe the solution you'd like
I propose enhancing ClusterSet/Set to support failover triggers based on KeptnMetric custom resources. The Keptn Lifecycle Toolkit already provides a metrics operator (https://github.com/keptn/lifecycle-toolkit/tree/main/metrics-operator) that we can leverage.

This integration would allow ClusterSet to make failover decisions based on various metrics like HTTP success rates, kubelet error rates, and kube-apiserver error rates. Here's a proof of concept example:

apiVersion: lib.projectsveltos.io/v1beta1
kind: ClusterSet
metadata:
  name: prod
spec:
  clusterSelector:
    matchLabels:
      env: prod
  maxReplicas: 1
  templateResourceRefs:
    - resource:
        apiVersion: metrics.keptn.sh/v1
        kind: KeptnMetric
        name: "{{ .Cluster.metadata.name }}-http-success-rate-metric"
        namespace: "{{ .Cluster.metadata.namespace }}"
      identifier: httpSuccessRate
    - resource:
        apiVersion: metrics.keptn.sh/v1
        kind: KeptnMetric
        name: "{{ .Cluster.metadata.name }}-kubelet-error-rate-metric"
        namespace: "{{ .Cluster.metadata.namespace }}"
      identifier: kubeletErrorRate
    - resource:
        apiVersion: metrics.keptn.sh/v1
        kind: KeptnMetric
        name: "{{ .Cluster.metadata.name }}-kubeApiserver-error-rate-metric"
        namespace: "{{ .Cluster.metadata.namespace }}"
      identifier: kubeApiserverErrorRate
  validateHealths:
    - script: |
        function evaluate()
          local hs = {
              healthy = true,
              message = "cluster is healthy"
          }

          local httpSuccessRate       = tonumber(ctx.httpSuccessRate.status.value) or 0
          local kubeletErrorRate      = tonumber(ctx.kubeletErrorRate.status.value) or 0
          local kubeApiserverErrorRate = tonumber(ctx.kubeApiserverErrorRate.status.value) or 0

          if httpSuccessRate <= 0.1 and (kubeletErrorRate + kubeApiserverErrorRate) <= 1.0 then
              hs.healthy = false
              hs.message = "too many errors"
          end

          return hs
        end
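
For context, each KeptnMetric referenced above is an ordinary custom resource served by the Keptn metrics-operator, and its status.value is exactly what the Lua script reads via ctx.<identifier>.status.value. A minimal sketch of one such metric, assuming a Prometheus-backed KeptnMetricsProvider named "prometheus" (the names and query below are placeholders, not part of this proposal):

apiVersion: metrics.keptn.sh/v1
kind: KeptnMetric
metadata:
  # placeholder name matching the templated reference above
  name: cluster1-http-success-rate-metric
  namespace: cluster1-namespace
spec:
  provider:
    name: prometheus   # assumes a KeptnMetricsProvider named "prometheus" exists
  query: 'sum(rate(http_requests_total{code=~"2.."}[5m])) / sum(rate(http_requests_total[5m]))'
  fetchIntervalSeconds: 30

The metrics-operator periodically runs the query and writes the result into status.value, which is the field the script above consumes.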

Describe alternatives you've considered
The main alternative would be implementing custom metrics collection and evaluation within ClusterSet/Set itself. However, this would:

  1. Increase implementation complexity and maintenance burden
  2. Duplicate functionality that already exists in the CNCF ecosystem
  3. Require additional testing and validation of metrics collection

Additional context
This approach offers several benefits:

  1. Leverages existing CNCF capabilities rather than reinventing the wheel
  2. Reduces implementation costs by utilizing established metrics collection
  3. Provides a flexible framework for defining various metrics-based failover scenarios
  4. Integrates smoothly with existing Kubernetes custom resources
@kahirokunn
Contributor Author

I was thinking of revising the proposal as follows.

Introduction of Independent Custom Resources for Health Checks and Metrics-based Failover

Is this feature request related to a problem? Please explain in detail.

With the previous proposal, ClusterSet/Set becomes quite feature-rich by defining "what states are considered unhealthy" within its own spec: sections such as variableResourceRefs and validateHealths enable metrics-based failover (for example, utilizing KeptnMetric). However, cramming all of this logic into the ClusterSet/Set resource increases complexity and risks making future feature extensions more difficult. Furthermore, cluster management and health check logic often have different lifecycles and belong to different roles, which suggests these concerns should be separated.

What solution would you like?

Instead of adding variableResourceRefs and validateHealths directly to ClusterSet/Set, introduce a new custom resource (e.g., "ClusterHealthPolicy" or "FailoverPolicy") for defining health check jobs, metric references, and validation logic in a clearly separated manner. ClusterSet/Set can reference (or be referenced by) these new custom resources via labels, selectors, or explicit references.

For example, the new CR would look like this:

apiVersion: lib.projectsveltos.io/v1beta1
kind: ClusterHealthPolicy
metadata:
  name: prod
spec:
  # Link with clusters via selectors, similar to ClusterSet/Set
  clusterSelector:
    matchLabels:
      env: prod

  variableResourceRefs:
    - resource:
        apiVersion: metrics.keptn.sh/v1
        kind: KeptnMetric
        name: "{{ .Cluster.metadata.name }}-http-success-rate-metric"
        namespace: "{{ .Cluster.metadata.namespace }}"
      identifier: httpSuccessRate
    - resource:
        apiVersion: metrics.keptn.sh/v1
        kind: KeptnMetric
        name: "{{ .Cluster.metadata.name }}-kubelet-error-rate-metric"
        namespace: "{{ .Cluster.metadata.namespace }}"
      identifier: kubeletErrorRate
    - resource:
        apiVersion: metrics.keptn.sh/v1
        kind: KeptnMetric
        name: "{{ .Cluster.metadata.name }}-kubeApiserver-error-rate-metric"
        namespace: "{{ .Cluster.metadata.namespace }}"
      identifier: kubeApiserverErrorRate

  validateHealths:
    - # Pod-based checks to verify cluster health
      pod:
        intervalSeconds: 60
        successRateSeconds: 300
        successRateThresholdPercent: 10.5
        consecutiveFailureThreshold: 3
        template:
          metadata:
            labels:
              network-policy: "allow-all"
          spec:
            restartPolicy: Never
            volumes:
              - name: data-volume
                persistentVolumeClaim:
                  claimName: ceph
            containers:
              - name: check-pvc-mount
                image: busybox
                command:
                - /bin/sh
                - -c
                - |
                  echo "Checking PVC mount..."
                  ls -la /mnt/pvc
                  if [ $? -ne 0 ]; then
                    echo "PVC mount check failed"
                    exit 1
                  fi
                volumeMounts:
                  - name: pvc-volume
                    mountPath: /mnt/pvc

              - name: internal-check
                image: curlimages/curl:latest
                command:
                - /bin/sh
                - -c
                - |
                  echo "Checking internal service..."
                  if [ "$(curl -s -o /dev/null -w '%{http_code}' http://internal-service:80/healthz)" != "200" ]; then
                    echo "Health check failed (non-200 status)"
                    exit 1
                  fi
                  nc -zv grpc-service 50051

              - name: aws-check
                image: amazon/aws-cli
                env:
                  - name: AWS_ACCESS_KEY_ID
                    valueFrom:
                      secretKeyRef:
                        name: aws-credentials
                        key: AWS_ACCESS_KEY_ID
                  - name: AWS_SECRET_ACCESS_KEY
                    valueFrom:
                      secretKeyRef:
                        name: aws-credentials
                        key: AWS_SECRET_ACCESS_KEY
                command:
                - /bin/sh
                - -c
                - |
                  echo "Checking AWS API..."
                  aws apigateway get-rest-apis --query 'items[0].name' --output text

              - name: external-check
                image: curlimages/curl
                command:
                - /bin/sh
                - -c
                - |
                  echo "Checking external service..."
                  if [ "$(curl -s -o /dev/null -w '%{http_code}' https://example.com)" != "200" ]; then
                    echo "External service check failed (non-200 status)"
                    exit 1
                  fi

              - name: internet-check
                image: busybox
                command:
                - /bin/sh
                - -c
                - |
                  echo "Checking internet connectivity..."
                  ping -c 3 8.8.8.8 || echo "Internet connection failed"

    - language: lua
      duration: 10m
      script: |
        function evaluate()
          local hs = {
            healthy = true,
            message = "Cluster is healthy"
          }

          local httpSuccessRate        = tonumber(ctx.httpSuccessRate.status.value) or 0
          local kubeletErrorRate       = tonumber(ctx.kubeletErrorRate.status.value) or 0
          local kubeApiserverErrorRate = tonumber(ctx.kubeApiserverErrorRate.status.value) or 0

          if httpSuccessRate <= 0.1 and (kubeletErrorRate + kubeApiserverErrorRate) <= 1.0 then
              hs.healthy = false
              hs.message = "Too many errors"
          end

          return hs
        end

Existing ClusterSet/Set would simply reference (or be discovered by) this new custom resource to handle metrics-based failover logic, keeping core cluster grouping and scheduling logic in one place while delegating health checks and variable references to a resource explicitly designed for that purpose.
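
A minimal sketch of how that reference could look from the ClusterSet side; the healthPolicyRefs field is purely hypothetical and only illustrates the intent of this proposal, it is not an existing API:

apiVersion: lib.projectsveltos.io/v1beta1
kind: ClusterSet
metadata:
  name: prod
spec:
  clusterSelector:
    matchLabels:
      env: prod
  maxReplicas: 1
  # hypothetical field: delegate health evaluation to the ClusterHealthPolicy named "prod" above
  healthPolicyRefs:
    - name: prod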

Considered Alternatives

  1. Continuing to embed metrics and health check logic directly in ClusterSet/Set would result in an increasingly large and complex spec, making future maintenance and feature enhancements difficult.
  2. Implementing custom metrics collection within ClusterSet/Set would duplicate existing CNCF functionality and add development and maintenance overhead.

Additional Context

• Moving the health check mechanism to a dedicated custom resource better maintains the single responsibility principle.
• Teams managing health check and failover logic lifecycles can work independently from teams managing and operating the actual clusters.
• Continuing to leverage KeptnMetric and other CNCF building blocks ensures flexibility for future extensions while avoiding reinventing the wheel.

@kahirokunn
Contributor Author

It seems this task can be consolidated and broken down into the following two issues.
@gianlucam76 What do you think? 👀

projectsveltos/sveltos#458
projectsveltos/sveltos#459

@gianlucam76
Member

Thanks @kahirokunn, I was thinking about these enhancements as well.

I believe the right place to define those is SveltosCluster.Spec. That way the SveltosCluster controller, on top of validating apiserver connectivity, can also validate the status of those extra resources. If the apiserver is down, a failover is needed anyhow (because Sveltos won't be able to deploy new applications or detect configuration drift or events). But if the apiserver is up, the state of those extra resources will indicate, as you suggested, whether the cluster is healthy or not.

The ClusterSet/Set controller will continue to simply use the SveltosCluster Status (updated by the SveltosCluster controller) to decide whether a failover is needed.

Introducing a new CRD has problems: the configuration ends up split across different CRDs and, more importantly, two different controllers would try to update the same SveltosCluster status.

The approach I am suggesting will only work with SveltosCluster (so it won't work for ClusterAPI clusters directly). But the event framework can be used to create a SveltosCluster automatically for every new ClusterAPI cluster (and delete it once the ClusterAPI cluster is deleted), so this solution will always work.
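
As a rough sketch of that automation, assuming the management cluster is itself registered with Sveltos so the ClusterAPI Cluster objects can be watched there, an EventSource/EventTrigger pair might look like the following (names are illustrative, and the exact wiring for generating the SveltosCluster in the management cluster may differ):

apiVersion: lib.projectsveltos.io/v1beta1
kind: EventSource
metadata:
  name: capi-cluster-created        # illustrative name
spec:
  collectResources: false
  resourceSelectors:
    - group: "cluster.x-k8s.io"
      version: "v1beta1"
      kind: "Cluster"
---
apiVersion: lib.projectsveltos.io/v1beta1
kind: EventTrigger
metadata:
  name: create-sveltoscluster       # illustrative name
spec:
  sourceClusterSelector:
    matchLabels:
      cluster-role: management      # illustrative label selecting the (registered) management cluster
  eventSourceName: capi-cluster-created
  oneForEvent: true
  policyRefs:
    # hypothetical ConfigMap holding a templated SveltosCluster manifest
    - name: sveltoscluster-template
      namespace: default
      kind: ConfigMap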

This consolidates both the configuration (SveltosCluster.Spec) and the status update (by the SveltosCluster controller) in a single place, and the scope of this work will be limited to the SveltosCluster controller repo.

What do you think?

@kahirokunn
Contributor Author

However, if the Spec of a custom resource that is automatically generated and managed by the event framework can also be modified by a GitOps tool such as ArgoCD, there is a concern that ownership of the resource becomes ambiguous, leading to potential conflicts and inconsistencies. From the perspective of Kubernetes and GitOps best practices (in particular the single responsibility principle and the idea of a single source of truth), that design is not ideal.

Furthermore, from an operational standpoint, there may be cases where you want to manage health checks without granting the authority to perform CRUD operations on the cluster itself. In such scenarios, instead of insisting on SveltosCluster, it may be preferable to represent the health check configuration as a separate custom resource. This facilitates more fine-grained RBAC and helps avoid operational confusion.

That said, there is certainly the concern you noted: introducing a new CRD splits the configuration across different CRDs and, more importantly, means two different controllers try to update the same SveltosCluster status. When multiple controllers attempt to update SveltosCluster, it can create conflicts and complicate management. Additionally, splitting CRDs could cause configuration details to become scattered, which is a disadvantage.

Nevertheless, by introducing a new CRD, you can gain the benefits of clearer single responsibility and more flexible RBAC as mentioned above. Hence, this approach should not be dismissed outright—there is value in considering it from a design perspective.

@gianlucam76
Member

Not sure I understand.

All I am saying is that if the SveltosCluster is created by the user (via GitOps, manually, a script, or whatever), the user can set the validateHealths field themselves. So in this case all good.

If the user is instead creating a ClusterAPI Cluster, then since ClusterAPI does not have a validateHealths field, we can use the event framework to create and delete the corresponding SveltosCluster automatically.

Once a SveltosCluster is created, the user can modify its validateHealths field. That won't interfere with the event framework, because the event framework will essentially just own and set the metadata (not the Spec).

@kahirokunn
Contributor Author

I see, that makes sense!
I feel confident that this approach will work well.

@gianlucam76
Member

Readiness checks and liveness checks can now be defined on SveltosCluster.

By default, Sveltos moves a cluster to ready as soon as it can connect to the managed cluster's apiserver, and by default it considers the cluster healthy as long as it can periodically connect to that apiserver.

Now users can define readiness checks instructing Sveltos to verify specified resources in the managed cluster. Only when Sveltos can connect to the apiserver and all the readiness checks pass is the cluster moved to Ready.

Likewise, users can define liveness checks instructing Sveltos to check specified resources in the managed cluster. Sveltos periodically connects to the managed cluster's apiserver and evaluates all liveness checks; the cluster is marked as healthy as long as they all pass.

Both liveness and readiness checks are generic: it is possible to define which resources to collect and a Lua function expressing whether the check passes or not.
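
A rough sketch of what such a definition might look like; the field names (livenessChecks, resourceSelectors, condition) and the Lua contract (a resources list and an hs.pass result) are approximated from the description above rather than taken from the actual API, so treat this as illustrative only and refer to the PR below for the authoritative example:

apiVersion: lib.projectsveltos.io/v1beta1
kind: SveltosCluster
metadata:
  name: prod-cluster                # illustrative name
  namespace: prod
spec:
  livenessChecks:                   # approximated field name
    - name: worker-nodes-ready
      resourceSelectors:
        - group: ""
          version: v1
          kind: Node
      condition: |
        function evaluate()
          local hs = { pass = true }
          -- fail the check if any collected Node is not Ready
          for _, node in ipairs(resources) do
            for _, cond in ipairs(node.status.conditions or {}) do
              if cond.type == "Ready" and cond.status ~= "True" then
                hs.pass = false
                hs.message = "node " .. node.metadata.name .. " is not Ready"
              end
            end
          end
          return hs
        end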

To learn more about this, including an example, see the PR.

So closing this one as implemented.

@kahirokunn
Contributor Author

It's wonderful!
I'm looking forward to making use of it!
Thank you!
