Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Proposal] Enhance Resource Health Monitoring within App CR #717

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
76 changes: 76 additions & 0 deletions proposals/kapp-controller/004-health-status-reporting/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
---
title: "Enhancing Resource Health Monitoring within App CR"
authors: [ "Varsha Prasad Narsing <[email protected]>" ]
status: "in review"
approvers: [ "Carvel Maintainers" ]
---

# Enhancing Resource Health Monitoring within App CR

## Problem Statement

After applying the package contents fetched from the source to the cluster through the App CR, it becomes challenging to ascertain the real-time health status of individual resources within the cluster. This lack of visibility poses a significant challenge for SRE/ops teams tasked with overseeing hundreds or even thousands of clusters. Consequently, having this information integrated into the Kapp controller can establish a centralized and reliable source of truth for monitoring purposes.

## Terminology / Concepts

## Proposal

We intend to extend the existing App API by adding a new status condition to expose the system's health. To do so, the following needs to be implemented:

1. The controller reconciling the App CR needs to dynamically set up watches for the resources being deployed by the package.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be helpful to know more and discuss about how we can enable watches in the App reconciler. Looks like currently, the kapp command is called, and based on its output the App status is popoulated:
https://github.com/carvel-dev/kapp-controller/blob/df87efdcf0c0c140ff644c8286257cd38a74fd42/pkg/app/app_deploy.go#L25

If we could return the list of resources which are being created (which currently is present in the cmd output) and dynamically set up watches, it would make it easier.

This is how we do in Rukpak for reference: https://github.com/operator-framework/rukpak/blob/6a8a84c9aff05efaba7b05992704ad38462a7ee8/internal/controllers/bundledeployment/bundledeployment.go#L389-L402.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not be ideal, but you can find all the resources by taking the involved GroupKinds from the ConfigMap associated with the kapp app, then listing/watching/informing using the appropriate app label selector.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another possible approach would be to have kapp write some information about specific resources to a file upon reconciliation and have this information copied over to the status.
This is how we have information about used group versions and namespaces to the App status, using output from the --app-metadata-file-output flag.

This however means this information will be reported whenever the App syncs.

2. Introduce a `Healthy` condition in App CR's `status` [field][app_cr_status].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believed the original suggestion was to introduce conditions per resource as well, is that not required for your use case?
To illustrate something like

- type: HealthCheck/someapi/someversion/somenamespace/resource
  status: False
  message: "Failed to meet condition: "some more information""

Which could live in a separate in a separate field such as status.resourceConditions.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not convinced we need to report a condition for every resource. Imagine the results of an App being hundreds of resources created on cluster. The primary need is to signal to a user "this app is degraded". Including a subset of the unhealthy resources would be useful. I'd like to ensure we don't get anywhere close to the 1.5MB size limit for data in etcd, and whenever there's an unbounded data set (# of resources in this case), I start to get concerned.

From there, the user can go investigate further.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to what Andy mentioned. A single condition, which consists of a consolidated list of unhealthy objects is sufficient. Something like this:

- lastTransitionTime: "2023-08-02T04:24:27Z"
  Message: "unhealthy resources: ["apiextensions.k8s.io/v1/CustomResourceDefinition/my.new.crd":"InvalidVersion", "deployments/test-ns/my-deploy":"MinimumReplicasUnavailable", "pods/test-ns/standalone-pod":"ImagePullBackoff"]


This field would be set to:
1. `True`: if all the installed resources are healthy.
2. `False`: if there is any error during the `fetch`, `install`, `deploy` step or if any of the resources are unhelathy.

The proposed set to criteria to determine if a resource is healthy or not is described below:

1. A Pod resource will be healthy if/when:
- `.status.Phase` is Running or Completed.
- If `.status.Phase` is Running, "Available" status condition is True.

2. A ReplicationController resource will be healthy if/when:
- `.spec.replicas` matches `.status.availableReplicas`
- `ReplicaFailure` type condition is not present in the status.

3. A ReplicaSet resource will be healthy if/when:
- `.spec.replicas` matches `.status.availableReplicas`.
- `ReplicaFailure` type condition is not present in the status.

3. A Deployment resource will be healthy if/when:
- `.spec.replicas` matches `.status.availableReplicas`.
- `Available` type condition is true.

4. A StatefulSet resource will be healthy if/when:
- `.spec.replicas` matches `.status.availableReplicas`.
- A DaemonSet resource will be healthy if/when:
`.status.desiredNumberScheduled` matches `.status.numberAvailable`

5. An APIService resource will be healthy if/when:
- `Available` type condition in status is true.

6. A CustomResourceDefinition resource will be healthy if/when:
- `StoredVersions` has the expected API version for the CRD.

7. All other unspecified resources will be considered healthy.
Comment on lines +50 to +56
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any case where 5,6,7 would not lead to a deployment failure? if so do we really need to report health in these resources types?


If any of the watched resource is unhealthy, the `Message` field of the healthy condition will have the statuses of the unhealthy resources ordered lexicographically.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be helpful to not cases where the ReconcileFailed condition is not present, but the Healthy condition is false.
If this is not a possibility, is the problem we are trying to solve: having more structured information about failed resources surfaced?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my previous comment re not wanting to list every unhealthy resource


Since the resources deployed by the App reconciler have informers created for them, any change in the resource state will trigger a reconcile that in turn will re-evaluate the health of all resources.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is value to be able to treat some resources as more critical! (carvel-dev/kapp-controller#1279 comes to mind)

Today, in case of failure we have a mechanism which leads to immediate reconciliation on failure. However, in case of repeated failure, the reconciler exponentially backs off. Meaning that it would take longer to reconcile the app again if it has already failed >3 times (for example).
Worth noting the longest the app waits will always be equal to it's syncPeriod.

This prevents an app or a set of apps that is doomed to fail from hogging the reconciliation queue. Would we want something similar here as well?


#### Use Case: Monitoring the state of resources

Kapp currently has the `inspect` command which lists the resources deployed and their current statuses. The output of the command is also printed out as a part of App's status if enabled through `rawOptions` while creating the CR.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To confirm - the output of inspect command in App's status is populated during deploy. After which it is not dynamically updated when the health of any resources change? Am I missing anything here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct, it would only report the health of resources when a reconciliation occurs.

One correction (though it is not very relevant to the context) would be that inspect is not a part of rawOptions, enabling it would look more like:

#.....other spec
deploy:
  - kapp:
      inspect: {}
#....


Though this command provides information about the resources created by the respective App CR, it does so by sending API requests during the reconciliation. Instead, using informers provides additional advantages of having real-time updates, efficient resource utilization and reduced load on API server.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we work with informers it might be interesting to see what's an optimal number of resources to be watched we would recommend keeping resource utilisation in mind.
(Just a note, not something this proposal should address)


### Other Approaches:

## Open Questions:

1. Can using informers to watch resources increase cache size, potentially impacting the performance?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will be the case since the same kapp-controller can be in charge of hundreds of apps, and if we do this for all the apps, we might end up getting informers for every resource in the cluster.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can have this as an optional feature to start, similar to how inspect is right now?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this proposal is implemented, health would ultimately be determined by evaluating only the following kinds:

  • Pod
  • ReplicationController
  • ReplicaSet
  • Deployment
  • StatefulSet
  • APIService
  • CustomResourceDefinition

Which would mean the additional overhead is at most 6 more informers with label selectors limiting the cache contents to just what kapp-controller is managing. We don't have to have informers for the App resources that do not contribute to the health condition, right?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it'd only need to be a subset of all APIs

Copy link
Author

@varshaprasad96 varshaprasad96 Jan 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the discussion in the community meeting today -

The two use-cases for setting up informers to watch resources are:

  1. Health monitoring and aggregating status.
  2. Triggering a reconcile if any resource is unhealthy.
  • From OLM's end, the use case we want to fulfil is (1).
  • (2) is something that can cause performance issues in terms of continuously reconciling for any unhealthy resources even if we have informers set up for limited number of GVK's (especially on clusters where Kapp-ctrl is managing large no of App CRs).

If (2) is not to be addressed, to maintain modularity in terms of kapp controller's functionality, @joaopapereira suggested we explore having a separate controller to monitor the health status of individual resources which can be optionally enabled.

2. Can the output in the `inspect` status field be combined with that of proposed `healthy` condition?
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is what needs to happen eventually. It doesn't make sense to have two status fields serve similar purpose. The healthy condition can just list all the resources (instead of just the failed/unhealthy) ones or vice-versa.

Before refactoring the proposal for the same, I would like to confirm the use case of inspect and make sure of the direction we would like to go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inspect section initially just aggregated the statuses of all resources after a finished reconciliation. It was essentially the output of the kapp inspect command.
We disabled it by default in favour of reducing the number of API calls we make.

Since it is a separate feature altogether, I think we can work towards having a separate section for the additional information we want to surface.

Copy link
Author

@varshaprasad96 varshaprasad96 Jan 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it is a separate feature altogether, I think we can work towards having a separate section for the additional information we want to surface.

If we are watching all the resources in the cluster anyway, and triggering a reconcile - wondering if calling an inspect command on top of it is necessary. If so, this may also end up loading the API server - since I can expect more no. of reconciliations due to dependent resources.

Additionally, the second aspect is inspect and healthy showing conflicting information at any point in time to the user (I haven't looked into the codebase of inspect yet, but assuming controller client and the one used with kapp are different?).

Given:
Inspect - would show a superset health status of all the resources.
Healthy - would show only the unhealthy resources.

If we decide to support both of them, then probably we should make them exclusive?


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another open question is if kapp-controller starts reacting to all changes in the cluster, what will happen to performance in general? At this point in time kapp-controller can become the major consumer of CPU of the full cluster.


[app_cr_status]: https://pkg.go.dev/github.com/vmware-tanzu/carvel-kapp-controller/pkg/apis/kappctrl/v1alpha1#AppStatus
Loading