Added collector cluster watcher prometheus metrics #2913
Conversation
Force-pushed from 3e98a13 to b63cada.
pkg/query/objectscollector/objectscollector_integration_test.go (outdated; resolved)
Force-pushed from 7d9c5cb to 0991133.
Would you explain the change in the commit message, please?
Hey @squaremo, what kind of explanation would you like to see?
func makeObjectsCollector(t *testing.T, cfg *rest.Config, clustersNamespace string, testLog logr.Logger) (*objectscollector.ObjectsCollector, store.Store, error) {
You're returning an error here, but passing in a *testing.T; you could just fail in this function?
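For illustration, the suggested pattern looks roughly like this; the types and constructor below are hypothetical stand-ins (the real helper builds an ObjectsCollector and a store), so treat it as a sketch rather than the PR's code:

```go
// Illustrative only: a stand-in for the kind of test helper under review.
package objectscollector_test

import (
	"fmt"
	"testing"

	"k8s.io/client-go/rest"
)

// fakeCollector and newFakeCollector are hypothetical, just to keep the
// sketch self-contained.
type fakeCollector struct{ host string }

func newFakeCollector(cfg *rest.Config) (*fakeCollector, error) {
	if cfg == nil {
		return nil, fmt.Errorf("nil rest config")
	}
	return &fakeCollector{host: cfg.Host}, nil
}

// makeCollector follows the reviewer's suggestion: since the helper already
// receives *testing.T, it fails the test itself rather than returning an error.
func makeCollector(t *testing.T, cfg *rest.Config) *fakeCollector {
	t.Helper() // failures are reported at the caller's line

	c, err := newFakeCollector(cfg)
	if err != nil {
		t.Fatalf("creating collector under test: %v", err)
	}
	return c
}
```

Callers can then invoke the helper with no error handling in each test, and failures point at the calling line thanks to t.Helper().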
Removed the integration test; it should come as part of #2606. Thanks for the comment!
An explanation of the change that I wouldn't get from looking at the diff. For example: I can see that the DefaultWatcher struct is renamed in the diff, but there's nothing to explain why this was necessary as part of this change. I have long sworn by this guide https://cbea.ms/git-commit/, which gives a great argument for taking care with commit messages and explains how to write them well.
pkg/query/objectscollector/objectscollector_integration_test.go (outdated; resolved)
I've made some suggestions, but I think this is a useful change as it is and I'd be happy for things to be followed up after merging.
The one thing I think it's really worth examining is the race in *watcher.Start, which means a cluster could be marked as Started when it has failed.
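To make the shape of that race concrete, here is a minimal, hypothetical sketch (not the project's actual code) of a Start method that records Started before knowing whether the underlying manager came up:

```go
// Hypothetical sketch of the race described above; the constant names mirror
// the PR's status values, but the types and logic are simplified stand-ins.
package collector

import "context"

type ClusterWatchingStatus string

const (
	ClusterWatchingStarted ClusterWatchingStatus = "Started"
	ClusterWatchingStopped ClusterWatchingStatus = "Stopped"
)

type starter interface {
	Start(ctx context.Context) error
}

type watcher struct {
	mgr    starter
	status ClusterWatchingStatus // unguarded here; real code would need a lock
}

func (w *watcher) setStatus(s ClusterWatchingStatus) { w.status = s }

// Start launches the manager in a goroutine and immediately records Started.
// If the manager fails straight away, the goroutine's Stopped update can lose
// the race against the Started write below, so a failed cluster can still be
// reported as Started.
func (w *watcher) Start(ctx context.Context) error {
	go func() {
		if err := w.mgr.Start(ctx); err != nil {
			w.setStatus(ClusterWatchingStopped)
		}
	}()
	w.setStatus(ClusterWatchingStarted)
	return nil
}
```

Closing the gap means only recording Started once the manager has signalled that it is actually running, or surfacing the error from Start to whatever tracks the status.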
pkg/query/collector/watcher.go (outdated)
Status() (string, error)
}

type DefaultWatcher struct {
I'm not against changing the name of this, but is it really part of this changeset?
Do you mean that it's not mentioned in the PR description or the commit message?
Added it to the PR description.
I mean it doesn't help with the purpose of the commit, or with the PR. This seems harmless -- an improvement is an improvement, right? But it has consequences:
- Including incidental changes makes it more likely to conflict with other PRs (especially those that explicitly set out to refactor, rename and rewrite);
- Anyone reviewing this PR will see it and have to consider whether it has some subtle relation to the other changes, and (if they are me, certainly) will be left with a lingering feeling they've missed something;
- Someone examining git blame or git history to understand this code will have to spend longer doing so, to establish that this particular change is only what it looks like.
It's easy to find examples where I've done the same thing -- I find it difficult to resist improving things when I see them, too. Here's the habit I've tried to get into:
- consider whether I really need to do the change now, or can I file an issue or put it on my todo list; or,
- do a quick PR that does the renaming I want, and branch from that until it's merged (then rebase); but,
- PRs are costly in turn-around time, so often I just do the rename in its own commit at the start of the branch, so at least it will be seen as its own change.
Thanks for sharing your thoughts and concerns. I guess I take refactoring as a continuous activity that you do whenever you touch the codebase, the same as any other enhancement activity.
Anyone reviewing this PR will see it and have to consider whether it has some subtle relation to the other changes, and (if they are me, certainly) will be left with a lingering feeling they've missed something;
I can definitely see the scoping question that this could raise. I'll try to add it to my thinking process: provide either a single commit for the refactoring, or further context in the commit or the PR.
further context in either the commit or the PR.
In commit messages, please. When one is looking through git history, via the GitHub interface or otherwise, having to track explanations down in PRs is several extra steps (and distressingly often, fruitless).
Force-pushed from a065d3c to f111af9.
pkg/query/collector/watcher.go (outdated)
ClusterWatchingStarting ClusterWatchingStatus = "Starting"
ClusterWatchingStarted  ClusterWatchingStatus = "Started"
ClusterWatchingStopping ClusterWatchingStatus = "Stopping"
ClusterWatchingStopped  ClusterWatchingStatus = "Stopped"
This value isn't new, I know, but .. if it's stopped, wouldn't it just be absent from the collector's records?
Yes ... we don't keep the watcher, so once it's stopped its status won't be shown. It makes sense to emit the metric, but it doesn't seem to have value beyond that.
Do you think we should follow up in a particular way?
It makes sense to emit the metric
What does metric{state="Stopped"} represent -- all watchers that ... don't exist? ... ever existed?
I reckon post a follow-up issue to consider whether this state is a useful one to represent as a status, and in metrics.
What does metric{state="Stopped"} represent -- all watchers that ... don't exist? ... ever existed?
The way I think about it is that this metric has a double value: as an aggregation, but also as a resource state that I monitor as an ops team.
As ops, I would have an alert for
cluster_watcher{cluster="my-important-cluster",state="Stopped"} = 1
that would trigger some further action.
I reckon post a follow-up issue to consider whether this state is a useful one to represent as a status, and in metrics.
Created #2931
My model of how the prometheus client and querying work might be out of date, but I think this will have a couple of problems:
- It will accumulate active time series, since you'll remember every watcher that was ever running;
- If you want to sum the number of active watchers (which is probably the thing you care about), you have to exclude this state.
I would remove this as a state, make sure the time series is no longer active when a watcher is removed, and keep a count of removed watchers instead. If people care about watchers being stopped, they can alert on the counter, or on the "stopping" state (albeit they might not notice it).
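As a sketch of that suggestion using prometheus/client_golang (the metric and label names here are illustrative, not necessarily what the PR uses):

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

var (
	// One series per active watcher, labelled by its current status.
	clusterWatcher = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "collector_cluster_watcher",
		Help: "Current status of each active cluster watcher.",
	}, []string{"collector", "cluster", "status"})

	// Counter of watchers that have gone away, instead of a lingering
	// "Stopped" gauge series.
	watchersRemoved = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "collector_cluster_watcher_removed_total",
		Help: "Watchers that have been stopped and removed from the collector.",
	}, []string{"collector"})
)

func init() {
	prometheus.MustRegister(clusterWatcher, watchersRemoved)
}

// setWatcherStatus replaces the previous status series for a cluster with the
// new one, so each active watcher contributes exactly one series.
func setWatcherStatus(collector, cluster, oldStatus, newStatus string) {
	if oldStatus != "" {
		clusterWatcher.DeleteLabelValues(collector, cluster, oldStatus)
	}
	clusterWatcher.WithLabelValues(collector, cluster, newStatus).Set(1)
}

// removeWatcher drops the cluster's series entirely when the watcher is torn
// down, and counts the removal instead.
func removeWatcher(collector, cluster, lastStatus string) {
	clusterWatcher.DeleteLabelValues(collector, cluster, lastStatus)
	watchersRemoved.WithLabelValues(collector).Inc()
}
```

With this shape, sum by (collector) (collector_cluster_watcher) counts active watchers without having to exclude a Stopped state, and the removal counter is still available to alert on.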
Let's follow up Monday to align on the expected usage ... I won't merge today.
Force-pushed from bc52a6a to 6dfbfb1.
Force-pushed from 0c21927 to 4255d31.
- Added prometheus metric to monitor cluster watchers: collector_cluster_watcher
- Added collector type so we could monitor them independently
- Removed recorder test as the logic is being tested within the collector lifecycle
Closes #2872
Design Considerations here
What changed?
- Added prometheus metric cluster_watcher to monitor cluster watchers.
- Renamed DefaultWatcher; its constructor is now the NewWatcher function.
Why was this change made?
We would like to monitor cluster watcher statuses to determine whether there are issues with the cluster watching infrastructure. This PR addresses this part of the collector pipeline.
How was this change implemented?
How did you validate the change?
Release notes
Added explorer collector cluster watcher prometheus metrics
Documentation Changes
weaveworks/weave-gitops#3753