
Added collector cluster watcher prometheus metrics #2913

Merged
merged 2 commits into main on Jul 3, 2023

Conversation

enekofb (Contributor) commented Jun 2, 2023

Closes #2872
Design Considerations here

What changed?

  • Added the collector metric cluster_watcher to monitor cluster watchers (a hedged sketch of such a metric is shown below).
  • Did not export the watcher type, to keep the right encapsulation: watchers should only be created through the NewWatcher function.
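
For reference, a minimal, hedged sketch of how such a gauge could be declared with client_golang; the metric name, labels, and helper function are assumptions based on this description and the review discussion, not the PR's actual code.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// clusterWatcher is a hypothetical gauge vector: one series per collector
// type, cluster, and status. Names and labels here are illustrative.
var clusterWatcher = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "collector_cluster_watcher",
	Help: "Cluster watchers by collector type, cluster, and status.",
}, []string{"collector", "cluster", "status"})

func init() {
	prometheus.MustRegister(clusterWatcher)
}

// SetClusterWatcherStatus marks the current status of a cluster's watcher.
// Real code would also clear the series for the previous status.
func SetClusterWatcherStatus(collectorType, cluster, status string) {
	clusterWatcher.WithLabelValues(collectorType, cluster, status).Set(1)
}
```

Summing over the status label would then give the number of watchers per collector type.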

Why was this change made?

We would like to monitor cluster watcher statuses to determine whether there are issues with the cluster-watching infrastructure. This PR addresses this part of the collector pipeline.

How was this change implemented?

How did you validate the change?

  • Added unit tests
  • Added integration test for collector
  • UAT screenshots added for Prometheus and Grafana

Screenshot 2023-06-23 at 17 57 02
Screenshot 2023-06-23 at 17 56 31

Release notes

Added explorer collector cluster watcher prometheus metrics

Documentation Changes

weaveworks/weave-gitops#3753

@enekofb enekofb force-pushed the wge-2872 branch 4 times, most recently from 3e98a13 to b63cada on June 2, 2023 14:50
@enekofb enekofb added the enhancement New feature or request label Jun 2, 2023
@enekofb enekofb changed the title Wge 2872 - collector metrics: add cluster watcher metrics Added collector cluster watcher prometheus metrics Jun 2, 2023
pkg/metrics/registry.go (outdated review thread, resolved)
@enekofb enekofb force-pushed the wge-2872 branch 2 times, most recently from 7d9c5cb to 0991133 on June 5, 2023 14:10
@enekofb enekofb requested a review from jpellizzari June 5, 2023 14:23
@enekofb enekofb marked this pull request as ready for review June 5, 2023 14:23
@enekofb enekofb requested review from MohamedMSaeed and squaremo June 5, 2023 14:23
squaremo (Contributor) commented Jun 5, 2023

Would you explain the change in the commit message, please?

enekofb (Contributor, author) commented Jun 5, 2023

Would you explain the change in the commit message, please?

Hey @squaremo, what kind of explanation would you like to see?

pkg/query/collector/metrics/recorder.go (outdated review thread, resolved)
}
}

func makeObjectsCollector(t *testing.T, cfg *rest.Config, clustersNamespace string, testLog logr.Logger) (*objectscollector.ObjectsCollector, store.Store, error) {
Contributor

You're returning an error here, but since you're passing in a *testing.T, you could just fail in this function?
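
As a hedged illustration of this suggestion (not the project's code), a helper that receives *testing.T can fail the test itself rather than returning an error:

```go
package collector_test

import (
	"errors"
	"testing"
)

// objCollector and newObjCollector stand in for the real types; the point is
// the pattern, not the project's actual constructor.
type objCollector struct{}

func newObjCollector() (*objCollector, error) { return nil, errors.New("not wired up") }

// makeObjCollector fails the test directly instead of returning an error,
// since it already receives *testing.T.
func makeObjCollector(t *testing.T) *objCollector {
	t.Helper()
	c, err := newObjCollector()
	if err != nil {
		t.Fatalf("creating objects collector: %v", err)
	}
	return c
}
```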

Contributor Author

Removed the integration test; it should come as part of #2606.

Thanks for the comment!

pkg/query/collector/metrics/recorder_test.go (outdated review thread, resolved)
squaremo (Contributor) commented Jun 5, 2023

Would you explain the change in the commit message, please?

Hey @squaremo, what kind of explanation would you like to see?

An explanation of the change that I wouldn't get from looking at the diff. For example: I can see that the DefaultWatcher struct is renamed in the diff, but there's nothing to explain why this was necessary as part of this change.

I have long sworn by this guide https://cbea.ms/git-commit/, which gives a great argument for taking care with commit messages and explains how to write them well.

pkg/query/collector/watchingcollector.go (outdated review thread, resolved)
pkg/query/collector/watchingcollector_test.go (outdated review thread, resolved)
pkg/query/objectscollector/suite_test.go (outdated review thread, resolved)
pkg/query/collector/watcher.go (outdated review thread, resolved)
@squaremo left a comment (Contributor)

I've made some suggestions, but I think this is a useful change as it is and I'd be happy for things to be followed up after merging.
The one thing I think it's really worth examining is the race in *watcher.Start, which means a cluster could be marked as Started when it has failed.

Status() (string, error)
}

type DefaultWatcher struct {
Contributor

I'm not against changing the name of this, but is it really part of this changeset?

Contributor Author

Do you mean that it is not mentioned in the PR description or in the commit message?

Contributor Author

Added to the PR description.

Contributor

I mean it doesn't help with the purpose of the commit, or with the PR. This seems harmless; an improvement is an improvement, right? But it has consequences:

  • Including incidental changes makes it more likely to conflict with other PRs (especially those that explicitly set out to refactor, rename and rewrite);
  • Anyone reviewing this PR will see it and have to consider whether it has some subtle relation to the other changes, and (if they are me, certainly) will be left with a lingering feeling they've missed something;
  • Someone examining git blame or git history to understand this code will have to spend longer doing so, to establish that this particular change is only what it looks like.

It's easy to find examples where I've done the same thing -- I find it difficult to resist improving things when I see them, too. Here's the habit I've tried to get into:

  • consider whether I really need to do the change now, or can I file an issue or put it on my todo list; or,
  • do a quick PR that does the renaming I want, and branch from that until it's merged (then rebase); but,
  • PRs are costly in turn-around time, so often I just do the rename in its own commit at the start of the branch, so at least it will be seen as its own change.

Contributor Author

Thanks for sharing your thoughts and concerns. I guess I take refactoring as a continuous activity that you do when you touch the codebase, the same as any other enhancement activity.

Anyone reviewing this PR will see it and have to consider whether it has some subtle relation to the other changes, and (if they are me, certainly) will be left with a lingering feeling they've missed something;

I can definitely see the scoping question this could raise. I will try to add it to my thinking process: provide either a single commit for the refactoring, or further context in either the commit or the PR.

Contributor

further context in either the commit or the PR.

In commit messages, please. When one is looking through git history, via the GitHub interface or otherwise, having to track explanations down in PRs is several extra steps (and distressingly often, fruitless).

pkg/query/collector/metrics/recorder.go (outdated review thread, resolved)
pkg/query/collector/metrics/recorder.go (outdated review thread, resolved)
pkg/query/collector/metrics/recorder_test.go (outdated review thread, resolved)
pkg/query/collector/watcher.go (outdated review thread, resolved)
pkg/query/collector/watchingcollector.go (outdated review thread, resolved)
pkg/query/collector/watchingcollector.go (outdated review thread, resolved)
@enekofb enekofb force-pushed the wge-2872 branch 2 times, most recently from a065d3c to f111af9 on June 9, 2023 07:14
@enekofb enekofb requested a review from bigkevmcd June 9, 2023 07:18
ClusterWatchingStarting ClusterWatchingStatus = "Starting"
ClusterWatchingStarted ClusterWatchingStatus = "Started"
ClusterWatchingStopping ClusterWatchingStatus = "Stopping"
ClusterWatchingStopped ClusterWatchingStatus = "Stopped"
Contributor

This value isn't new, I know, but ... if it's stopped, wouldn't it just be absent from the collector's records?

Contributor Author

Yes ... we don't keep the watcher, so once it is stopped its status won't be shown. It makes sense to emit the metric, but it does not seem to have value beyond that.

Do you think we should follow up in a particular way?

Contributor

It makes sense to emit the metric

What does metric{state="Stopped"} represent -- all watchers that ... don't exist? ... ever existed?
I reckon post a follow-up issue to consider whether this state is a useful one to represent as a status, and in metrics.

Contributor Author

What does metric{state="Stopped"} represent -- all watchers that ... don't exist? ... ever existed?

The way I think about it is that this metric has a double value: as an aggregation, but also as a resource state that I monitor as an ops team.

As ops, I would have an alert for

cluster_watcher{cluster="my-important-cluster",state="Stopped"} = 1

that would trigger some further action.

Contributor Author

I reckon post a follow-up issue to consider whether this state is a useful one to represent as a status, and in metrics.

Created #2931

@squaremo (Contributor) commented Jun 9, 2023

My model of how the prometheus client and querying work might be out of date, but I think this will have a couple of problems:

  1. It will accumulate active time series, since you'll remember every watcher that was ever running;
  2. If you want to sum the number of active watchers (which is probably the thing you care about), you have to exclude this state.

I would remove this as a state, make sure the time series is no longer active when a watcher is removed, and keep a count of removed watchers instead. If people care about watchers being stopped, they can alert on the counter, or on the "stopping" state (albeit they might not notice it).
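
A hedged sketch of what that alternative could look like (names are illustrative only): keep the gauge for live watcher states, delete a cluster's series when its watcher is removed, and count removals separately.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Registration is omitted; this sketch only shows the shape of the metrics,
// not how they would be wired into the collector.
var (
	// Gauge for watchers that currently exist: no "Stopped" state, so
	// sum(collector_cluster_watcher) is simply the number of live watchers.
	watcherStatus = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "collector_cluster_watcher",
		Help: "Live cluster watchers by cluster and status.",
	}, []string{"cluster", "status"})

	// Counter of watchers removed over time, for alerting on stop events.
	watchersStopped = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "collector_cluster_watcher_stopped_total",
		Help: "Total number of cluster watchers that have been stopped.",
	})
)

// removeClusterWatcher deletes the cluster's live series and bumps the
// stopped counter when a watcher is removed.
func removeClusterWatcher(cluster, lastStatus string) {
	watcherStatus.DeleteLabelValues(cluster, lastStatus)
	watchersStopped.Inc()
}
```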

Contributor Author

Let's follow up Monday to align on the expected usage ... I won't merge today.

@enekofb enekofb force-pushed the wge-2872 branch 4 times, most recently from 0c21927 to 4255d31 on July 3, 2023 13:23
enekofb added 2 commits July 3, 2023 14:56
removed recorder test as the logic is being tested within the collector lifecycle
@enekofb enekofb merged commit d4fb449 into main Jul 3, 2023
@enekofb enekofb deleted the wge-2872 branch July 3, 2023 14:10
AsmaaNabilBakr pushed a commit that referenced this pull request Jul 8, 2023
* Added prometheus metric to monitor cluster watchers collector_cluster_watcher

* added collector type so we could monitor them independently

removed recorder test as the logic is being tested within the collector lifecycle
AsmaaNabilBakr added a commit that referenced this pull request Jul 12, 2023
* fix card ui

* fix params title color in dark mode

* add some styled components

* fix table, and details page

* Fix explorer filter icon in dark mode (#3010)

* add set color to icon

* explorer snap

* fix: no manual promotion button in pipelines

It seems the issue was a misconfigured Notification object. I assume it
was created in a "pipelines dev" environment where it was a practice to
deploy the service and related resources in pipeline-system.

Fixes #3020

Signed-off-by: Balazs Nadasdi <[email protected]>

* Small styling fixes (#2962)

* Updates following the Core changes on reconciledObjects and flags

* Fix sorting of gitrepos

* Replace local layout with Core layout - WIP

* Replace local layout with Core layout - WIP2

* Update lib

* Solve conflicts - 2

* go mod tidy

* Replace Page Template with Page - WIP

* Replace Page Template with Page - updated and remove content wrapper

* Fix connect a cls modal display

* Clean up Content Wrapper to be a Notification Wrapper and bring it back - WIP

* Clean up Content Wrapper to be a Notification Wrapper and bring it back

* Fix height in logo

* Update with lib - stable

* Bring back footer

* Adjust notif display

* Update go mod

* Run go mod tidy

* Replace Page directly from Core with local and update lib

* Update lib

* Cleanup

* Replace remaining instances of old Page

* Update lib

* Update layout props and remove local Nav

* Update go mod and sum

* Redo snapshots

* Update test and remove unused Section Header

* Solve conflicts with main - 2

* Redo snapshots

* Update nav style

* Pass footer to Page

* Redo snapshots

* Update lib to latest

* Fix button border

* Fix select & input field style

* Remove unused import

* Update lib

* Redo snapshots & update lib

* Redo snapshots & update lib - 2

* Update lib

* Update styles for form inputs

* Uncomment commented code in makefile

* Impersonate as soon as possible

This commit moves the construction of the impersonating client config
to the Collector implementation, with the goal of having fewer points
where it could be forgotten.

In principle the SA could just be passed through to NewWatcher by the
caller of NewCollector; however, that would leave open the
possibilities of a caller of NewCollector forgetting to do this, or
the implementation of the watcher not bothering to use it. This way,
it's part of the protocol for the Collector, and no other component
gets a chance to muck it up.

Signed-off-by: Michael Bridgen <[email protected]>
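
A hedged sketch of the impersonation step this commit describes, using client-go's rest.ImpersonationConfig; the helper name and wiring are assumptions, not the commit's actual code.

```go
package collector

import (
	"fmt"

	"k8s.io/client-go/rest"
)

// impersonate returns a copy of cfg that impersonates the given service
// account, so every watcher built from it acts with that identity.
func impersonate(cfg *rest.Config, namespace, serviceAccount string) *rest.Config {
	out := rest.CopyConfig(cfg)
	out.Impersonate = rest.ImpersonationConfig{
		UserName: fmt.Sprintf("system:serviceaccount:%s:%s", namespace, serviceAccount),
	}
	return out
}
```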

* Use callback for processing object updates

Make the reconciler take a callback, rather than a channel. Why? A
callback is more general -- if you want backpressure or
synchronisation, you can always make the func push to a chan. This
also gives the opportunity to retry transactions, which was not
implemented before and anyway would be difficult using channels.

The "objects channel" is still passed down as far as the manager, in
this commit. It's fairly easy to remove it from the Options structs
and constructor funcs further up, but it would be better to simplify
these out entirely.

Signed-off-by: Michael Bridgen <[email protected]>
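
A hedged sketch of the point made above (types and names are illustrative): a callback is more general than a channel, because a channel-based consumer can always be expressed as a callback that pushes to a chan.

```go
package collector

import "context"

// objectTransaction stands in for the real update type.
type objectTransaction struct{}

// processFunc is the callback the reconciler would invoke for each update.
type processFunc func(ctx context.Context, tx objectTransaction) error

// toChannel adapts the callback style back to a channel, with backpressure
// coming from the unbuffered (or bounded) channel itself.
func toChannel(out chan<- objectTransaction) processFunc {
	return func(ctx context.Context, tx objectTransaction) error {
		select {
		case out <- tx:
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```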

* Make watcher interface look like manager.Manager

The watcher implementation is mostly a wrapper around the controller
manager, so it exists to

 - be something you can start and stop; and,
 - hold some values before passing them on to a NewManager callback

There's no particular need to do this dance -- it might be useful to
use a fake controller-runtime manager in tests, but all that those tests
really cover is whether the scaffolding supports itself.

This commit makes space for removing Watcher altogether, by changing
the signature returned by NewWatcherFunc to match
manager.Manager#Start(ctx).

The consequences are:

 - the watchingCollector starts a goroutine for each watcher

This is because manager.Manager#Start blocks until its context is
canceled, or it fails. There was a goroutine already -- this just
moves it up to the collector, and obviates some synchronisation that was
necessary to adapt between `Start()` and `Stop()` and `Start(ctx)`.

Since watcher failures are now visible to the collector, it can take
action on them, like restarting the watcher.

 - the watchingCollector keeps track of each watcher's status

This expands the statuses to "started"|"stopped"|"failed".

Signed-off-by: Michael Bridgen <[email protected]>
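
Sketching the shape this commit describes (the exact signature is an assumption): the watcher constructor returns something with manager.Manager's blocking Start(ctx), and the collector runs it in a goroutine.

```go
package collector

import (
	"context"

	"k8s.io/client-go/rest"
)

// Starter is the part of manager.Manager the collector needs: Start blocks
// until the context is cancelled or the watcher fails.
type Starter interface {
	Start(ctx context.Context) error
}

// NewWatcherFunc is an assumed constructor shape; the collector starts each
// returned Starter in its own goroutine and uses the returned error to
// drive watcher status (and, later in this series, retries).
type NewWatcherFunc func(clusterName string, cfg *rest.Config) (Starter, error)
```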

* Remove Watcher and WatcherOptions altogether

Originally, Watcher existed to intermediate between the
manager.Manager, which does all the work, and what watchingCollector
wanted to deal with (Start/Stop/Status). But, the previous commit
shows this isn't necessary. So: remove Watcher and just start up
manager.Manager directly.

Aside from swapping some arguments around, `NewWatcher` (previously
newManager) has the same signature as `NewDefaultWatcher` did.

Signed-off-by: Michael Bridgen <[email protected]>

* Reinstate hook for deleting cluster contents

Previously, when the watcher implementation was asked to stop, it sent
a "delete all" message to the object transaction channel. The effect
of this was to delete all items from the cluster in question, from the
store and index.

Now that manager.Manager is constructed directly, this finalisation
has to be done somewhere else. A reasonable place is the collector,
since that deals with the watcher lifecycle; e.g., it's also the
caller of `NewWatcherFunc`.

Signed-off-by: Michael Bridgen <[email protected]>

* Rewrite/remove tests after removing Watcher

This commit mainly removes tests that checked starting and stopping
DefaultWatcher, since that has been removed. It also fixes up a few
tests to account for changes in types.

Notably, the two test cases in TestWatcher_NewWatcher ended up being
the same after the service account argument was removed, so I've
inlined the single test case.

Signed-off-by: Michael Bridgen <[email protected]>

* Tighten interface of Collector

There's no need to have Watch and Unwatch exported, or in the
interface, since these are called only internally.

Signed-off-by: Michael Bridgen <[email protected]>

* Use Starter interface for collector

It's handy to be able to cancel a context to stop goroutines, rather
than having to remember to call stop on each one; and, there's less
synchronisation needed. So: use the `Start(context.Context) error`
method for Collector as well, rather than Start/Stop.

In this case, the loop in watchingCollector doesn't need to be its own
goroutine, which makes it a bit easier to think about.

The trade-off is that you need to explicitly construct a goroutine for
the (blocking) Start, if you do want it to run in another goroutine.

Signed-off-by: Michael Bridgen <[email protected]>
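
A small, self-contained sketch of the trade-off mentioned above: with a blocking Start(ctx), the caller explicitly chooses to run the collector in a goroutine and stops it by cancelling the context. The fakeCollector type is an assumption standing in for the real collector.

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// fakeCollector stands in for the real collector: Start blocks until the
// context is cancelled, like manager.Manager#Start.
type fakeCollector struct{}

func (fakeCollector) Start(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	errs := make(chan error, 1)
	go func() { errs <- fakeCollector{}.Start(ctx) }() // explicit goroutine, as the trade-off notes

	if err := <-errs; err != nil && !errors.Is(err, context.DeadlineExceeded) {
		log.Fatal(err)
	}
}
```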

* Adjust tests for asynchronous Start(ctx)

Previously, some tests in watching_test.go relied on the
watchingCollector having synchronously processed existing clusters
before it returns from Start(). This is a bit of a coincidence; it's
also no longer the case, because the collector is started in its own
goroutine. So: use `g.Eventually` to check processing has happened.

Signed-off-by: Michael Bridgen <[email protected]>
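
An illustrative example of the pattern (not the project's tests): poll with Gomega's Eventually instead of asserting right after Start returns.

```go
package collector_test

import (
	"sync/atomic"
	"testing"
	"time"

	. "github.com/onsi/gomega"
)

func TestAsyncProcessingEventually(t *testing.T) {
	g := NewWithT(t)

	var processed atomic.Int32
	go func() { // stands in for the collector's goroutine
		time.Sleep(10 * time.Millisecond)
		processed.Store(2)
	}()

	// Assert on the eventual state rather than assuming Start was synchronous.
	g.Eventually(func() int32 { return processed.Load() }, "1s", "10ms").Should(Equal(int32(2)))
}
```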

* Use a workqueue to process cluster watches

At present, when the collector fails to watch a cluster, that's it --
the cluster will never be watched. So as to be able to handle retries,
this commit interposes a client-go workqueue (same as that used in
controller-runtime) between the notification of a new cluster and it
being watched. This way, if the watch fails, it can be retried.

Since the watchers are now run in goroutines managed by the collector,
it can see when one fails and retry it.

There is a bit of delicate (brittle?) tip-toeing around clusters that
are invalid; e.g., are missing a client config. These return an error
before starting a watcher. I have chosen to mark them as failed, and
not requeue them. If they come in on the clustersmngr channel again,
they will be retried then.

This commit adds a watcher status "starting", for the period when a
cluster can be started, but hasn't been yet. This makes sure that if a
cluster is recorded at all, it has a non-empty status.

Signed-off-by: Michael Bridgen <[email protected]>
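
A hedged sketch of the retry loop described here, using the client-go workqueue; the type, fields, and wiring are illustrative, not the commit's actual code.

```go
package collector

import "k8s.io/client-go/util/workqueue"

// clusterQueue retries failed cluster watches using the workqueue's backoff.
type clusterQueue struct {
	queue workqueue.RateLimitingInterface
	watch func(clusterName string) error // starts a watcher; returns an error on failure
}

func (c *clusterQueue) run() {
	for {
		item, shutdown := c.queue.Get()
		if shutdown {
			return
		}
		name := item.(string)
		if err := c.watch(name); err != nil {
			c.queue.AddRateLimited(name) // retry later with backoff
		} else {
			c.queue.Forget(name) // clear the rate limiter's failure history
		}
		c.queue.Done(item)
	}
}
```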

* Make collector stop test non-flakey

I wanted to test when you stop the collector, all the goroutines it
starts will exit. But, counting the number of goroutines isn't
reliable for this, since there may be running goroutines from other
tests hanging around. (Quite often it was failing because the number
after stopping was _less_ than the number before starting ...).

To avoid this particular flake, this commit reduces the strength of
the test to just checking that the collector returns from
`Start(ctx)`.

Signed-off-by: Michael Bridgen <[email protected]>

* Test that an error-returning watcher is restarted

This puts in a check that if a watcher is started, and it returns an
error from Start(), it'll be restarted.

Signed-off-by: Michael Bridgen <[email protected]>

* More care with watcher status

 - distinguish in watcher status between "error" meaning "returned an
   error, will be retried", and "failed" meaning "could not be started
   (and won't be retried)".

 - guard access to the clusterWatchers map, and watcher status, with a
   mutex. This prevents races between e.g., unwatching a cluster and
   reporting status.

Signed-off-by: Michael Bridgen <[email protected]>
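
A hedged sketch of the guarded status map described above; the type and field names are assumptions.

```go
package collector

import "sync"

// statusMap guards watcher statuses with a mutex, so concurrent unwatching
// and status reporting don't race.
type statusMap struct {
	mu       sync.Mutex
	statuses map[string]string // cluster name -> status
}

func (s *statusMap) set(cluster, status string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.statuses == nil {
		s.statuses = map[string]string{}
	}
	s.statuses[cluster] = status
}

func (s *statusMap) get(cluster string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	status, ok := s.statuses[cluster]
	return status, ok
}
```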

* Fix race conditions in watching_tests with atomics

Signed-off-by: Michael Bridgen <[email protected]>

* Add Form Design Notes (#3023)

* form notes

* add neutral20ToPrimary color for input borders, update colors namespace

* test updates

* added query behaviour to be tolerant to inconsistencies (#3014)

* added query behaviour to be tolerant to inconsistencies between
indexer and datastore so we don't bring the full service down. We return
best-effort results and log the errors.

* removed logging assertion

* Added collector cluster watcher prometheus metrics (#2913)

* Added prometheus metric to monitor cluster watchers collector_cluster_watcher

* added collector type so we could monitor them independently

removed recorder test as the logic is being tested within the collector lifecycle

* add RED metrics for explorer datastore read operations (#3024)

* Add retention policy concept to query service.

This will allow us to retain things like events for longer than Kubernetes does by default. Adds a scheduled job to clean them up at a scheduled interval.
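
A hedged sketch of such a scheduled cleanup loop; the interval, retention period, and function names are assumptions rather than the PR's code.

```go
package retention

import (
	"context"
	"time"
)

// runCleanup deletes records older than the retention period on every tick,
// until the context is cancelled.
func runCleanup(ctx context.Context, interval, retention time.Duration, deleteOlderThan func(cutoff time.Time) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := deleteOlderThan(time.Now().Add(-retention)); err != nil {
				continue // in real code this error would be logged
			}
		}
	}
}
```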

* Do not log in WatcherRetry test.

Logging in a goroutine after a test has completed can cause a panic.

* added explorer indexer read metrics (#3028)

* add RED metrics for explorer indexer read operations

* using metrics assertion based on _count so it is predictable.

* Delete yarn.lock

* fix some conflicts

* update yarn file

* Update yarn.lock

* update style for policConfig

* fix test

* fix linting

* fix linting

* fix review comment

* fix font sizes

* replace list with flex

* add data-testid

* fix unit test

---------

Signed-off-by: Balazs Nadasdi <[email protected]>
Signed-off-by: Michael Bridgen <[email protected]>
Co-authored-by: Joshua Israel <[email protected]>
Co-authored-by: Balazs Nadasdi <[email protected]>
Co-authored-by: AlinaGoaga <[email protected]>
Co-authored-by: Michael Bridgen <[email protected]>
Co-authored-by: Eneko Fernández <[email protected]>
Co-authored-by: Jordan Pellizzari <[email protected]>