
Added collector cluster watcher prometheus metrics #2913

Merged
merged 2 commits into main on Jul 3, 2023

Conversation

enekofb (Contributor) commented Jun 2, 2023

Closes #2872
Design Considerations here

What changed?

  • Added the collector metric cluster_watcher to monitor cluster watchers (a hedged sketch of such a metric is shown below).
  • Did not export the watcher type, to keep the right encapsulation: watchers should only be created through the NewWatcher function.
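
For reference, a minimal, hedged sketch of how such a gauge could be declared with client_golang; the metric name, labels, and helper function are assumptions based on this description and the review discussion, not the PR's actual code.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// clusterWatcher is a hypothetical gauge vector: one series per collector
// type, cluster, and status. Names and labels here are illustrative.
var clusterWatcher = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "collector_cluster_watcher",
	Help: "Cluster watchers by collector type, cluster, and status.",
}, []string{"collector", "cluster", "status"})

func init() {
	prometheus.MustRegister(clusterWatcher)
}

// SetClusterWatcherStatus marks the current status of a cluster's watcher.
// Real code would also clear the series for the previous status.
func SetClusterWatcherStatus(collectorType, cluster, status string) {
	clusterWatcher.WithLabelValues(collectorType, cluster, status).Set(1)
}
```

Summing over the status label would then give the number of watchers per collector type.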

Why was this change made?

We would like to monitor cluster watcher statuses to determine whether there are issues with the cluster-watching infrastructure. This PR addresses this part of the collector pipeline.

How was this change implemented?

How did you validate the change?

  • Added unit tests
  • Added integration test for collector
  • UAT screenshots added for Prometheus and Grafana

Screenshot 2023-06-23 at 17 57 02
Screenshot 2023-06-23 at 17 56 31

Release notes

Added explorer collector cluster watcher prometheus metrics

Documentation Changes

weaveworks/weave-gitops#3753

@enekofb enekofb force-pushed the wge-2872 branch 4 times, most recently from 3e98a13 to b63cada on June 2, 2023 14:50
@enekofb enekofb added the enhancement New feature or request label Jun 2, 2023
@enekofb enekofb changed the title Wge 2872 - collector metrics: add cluster watcher metrics Added collector cluster watcher prometheus metrics Jun 2, 2023
pkg/metrics/registry.go (outdated review thread, resolved)
@enekofb enekofb force-pushed the wge-2872 branch 2 times, most recently from 7d9c5cb to 0991133 on June 5, 2023 14:10
@enekofb enekofb requested a review from jpellizzari June 5, 2023 14:23
@enekofb enekofb marked this pull request as ready for review June 5, 2023 14:23
@enekofb enekofb requested review from MohamedMSaeed and squaremo June 5, 2023 14:23
squaremo (Contributor) commented Jun 5, 2023

Would you explain the change in the commit message, please?

enekofb (Contributor, author) commented Jun 5, 2023

Would you explain the change in the commit message, please?

Hey @squaremo, what kind of explanation would you like to see?

pkg/query/collector/metrics/recorder.go (outdated review thread, resolved)
}
}

func makeObjectsCollector(t *testing.T, cfg *rest.Config, clustersNamespace string, testLog logr.Logger) (*objectscollector.ObjectsCollector, store.Store, error) {
Contributor

You're returning an error here, but since you're passing in a *testing.T, you could just fail in this function?
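
As a hedged illustration of this suggestion (not the project's code), a helper that receives *testing.T can fail the test itself rather than returning an error:

```go
package collector_test

import (
	"errors"
	"testing"
)

// objCollector and newObjCollector stand in for the real types; the point is
// the pattern, not the project's actual constructor.
type objCollector struct{}

func newObjCollector() (*objCollector, error) { return nil, errors.New("not wired up") }

// makeObjCollector fails the test directly instead of returning an error,
// since it already receives *testing.T.
func makeObjCollector(t *testing.T) *objCollector {
	t.Helper()
	c, err := newObjCollector()
	if err != nil {
		t.Fatalf("creating objects collector: %v", err)
	}
	return c
}
```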

Contributor Author

Removed the integration test; it should come as part of #2606.

Thanks for the comment!

pkg/query/collector/metrics/recorder_test.go (outdated review thread, resolved)
squaremo (Contributor) commented Jun 5, 2023

Would you explain the change in the commit message, please?

Hey @squaremo, what kind of explanation would you like to see?

An explanation of the change that I wouldn't get from looking at the diff. For example: I can see that the DefaultWatcher struct is renamed in the diff, but there's nothing to explain why this was necessary as part of this change.

I have long sworn by this guide https://cbea.ms/git-commit/, which gives a great argument for taking care with commit messages and explains how to write them well.

pkg/query/collector/watchingcollector.go (outdated review thread, resolved)
pkg/query/collector/watchingcollector_test.go (outdated review thread, resolved)
pkg/query/objectscollector/suite_test.go (outdated review thread, resolved)
pkg/query/collector/watcher.go (outdated review thread, resolved)
@squaremo left a comment (Contributor)

I've made some suggestions, but I think this is a useful change as it is and I'd be happy for things to be followed up after merging.
The one thing I think it's really worth examining is the race in *watcher.Start, which means a cluster could be marked as Started when it has failed.

Status() (string, error)
}

type DefaultWatcher struct {
Contributor

I'm not against changing the name of this, but is it really part of this changeset?

Contributor Author

Do you mean that it is not mentioned in the PR description or in the commit message?

Contributor Author

Added to the PR description.

Contributor

I mean it doesn't help with the purpose of the commit, or with the PR. This seems harmless; an improvement is an improvement, right? But it has consequences:

  • Including incidental changes makes it more likely to conflict with other PRs (especially those that explicitly set out to refactor, rename and rewrite);
  • Anyone reviewing this PR will see it and have to consider whether it has some subtle relation to the other changes, and (if they are me, certainly) will be left with a lingering feeling they've missed something;
  • Someone examining git blame or git history to understand this code will have to spend longer doing so, to establish that this particular change is only what it looks like.

It's easy to find examples where I've done the same thing -- I find it difficult to resist improving things when I see them, too. Here's the habit I've tried to get into:

  • consider whether I really need to do the change now, or can I file an issue or put it on my todo list; or,
  • do a quick PR that does the renaming I want, and branch from that until it's merged (then rebase); but,
  • PRs are costly in turn-around time, so often I just do the rename in its own commit at the start of the branch, so at least it will be seen as its own change.

Contributor Author

Thanks for sharing your thoughts and concerns. I guess I take refactoring as a continuous activity that you do when you touch the codebase, the same as any other enhancement activity.

Anyone reviewing this PR will see it and have to consider whether it has some subtle relation to the other changes, and (if they are me, certainly) will be left with a lingering feeling they've missed something;

I can definitely see the scoping question this could raise. I will try to add it to my thinking process: provide either a single commit for the refactoring, or further context in either the commit or the PR.

Contributor

further context in either the commit or the PR.

In commit messages, please. When one is looking through git history, via the GitHub interface or otherwise, having to track explanations down in PRs is several extra steps (and distressingly often, fruitless).

pkg/query/collector/metrics/recorder.go (outdated review thread, resolved)
pkg/query/collector/metrics/recorder.go (outdated review thread, resolved)
pkg/query/collector/metrics/recorder_test.go (outdated review thread, resolved)
pkg/query/collector/watcher.go (outdated review thread, resolved)
pkg/query/collector/watchingcollector.go (outdated review thread, resolved)
pkg/query/collector/watchingcollector.go (outdated review thread, resolved)
@enekofb enekofb force-pushed the wge-2872 branch 2 times, most recently from a065d3c to f111af9 on June 9, 2023 07:14
@enekofb enekofb requested a review from bigkevmcd June 9, 2023 07:18
ClusterWatchingStarting ClusterWatchingStatus = "Starting"
ClusterWatchingStarted ClusterWatchingStatus = "Started"
ClusterWatchingStopping ClusterWatchingStatus = "Stopping"
ClusterWatchingStopped ClusterWatchingStatus = "Stopped"
Contributor

This value isn't new, I know, but ... if it's stopped, wouldn't it just be absent from the collector's records?

Contributor Author

Yes ... we don't keep the watcher, so once it is stopped its status won't be shown. It makes sense to emit the metric, but it does not seem to have value beyond that.

Do you think we should follow up in a particular way?

Contributor

It makes sense to emit the metric

What does metric{state="Stopped"} represent -- all watchers that ... don't exist? ... ever existed?
I reckon post a follow-up issue to consider whether this state is a useful one to represent as a status, and in metrics.

Contributor Author

What does metric{state="Stopped"} represent -- all watchers that ... don't exist? ... ever existed?

The way I think about it is that this metric has a double value: as an aggregation, but also as a resource state that I monitor as an ops team.

As ops, I would have an alert for

cluster_watcher{cluster="my-important-cluster",state="Stopped"} = 1

that would trigger some further action.

Contributor Author

I reckon post a follow-up issue to consider whether this state is a useful one to represent as a status, and in metrics.

Created #2931

@squaremo (Contributor) commented Jun 9, 2023

My model of how the prometheus client and querying work might be out of date, but I think this will have a couple of problems:

  1. It will accumulate active time series, since you'll remember every watcher that was ever running;
  2. If you want to sum the number of active watchers (which is probably the thing you care about), you have to exclude this state.

I would remove this as a state, make sure the time series is no longer active when a watcher is removed, and keep a count of removed watchers instead. If people care about watchers being stopped, they can alert on the counter, or on the "stopping" state (albeit they might not notice it).
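
A hedged sketch of what that alternative could look like (names are illustrative only): keep the gauge for live watcher states, delete a cluster's series when its watcher is removed, and count removals separately.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Registration is omitted; this sketch only shows the shape of the metrics,
// not how they would be wired into the collector.
var (
	// Gauge for watchers that currently exist: no "Stopped" state, so
	// sum(collector_cluster_watcher) is simply the number of live watchers.
	watcherStatus = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "collector_cluster_watcher",
		Help: "Live cluster watchers by cluster and status.",
	}, []string{"cluster", "status"})

	// Counter of watchers removed over time, for alerting on stop events.
	watchersStopped = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "collector_cluster_watcher_stopped_total",
		Help: "Total number of cluster watchers that have been stopped.",
	})
)

// removeClusterWatcher deletes the cluster's live series and bumps the
// stopped counter when a watcher is removed.
func removeClusterWatcher(cluster, lastStatus string) {
	watcherStatus.DeleteLabelValues(cluster, lastStatus)
	watchersStopped.Inc()
}
```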

Contributor Author

Let's follow up Monday to align on the expected usage ... I won't merge today.

@enekofb enekofb force-pushed the wge-2872 branch 4 times, most recently from 0c21927 to 4255d31 on July 3, 2023 13:23
enekofb added 2 commits July 3, 2023 14:56
removed recorder test as the logic is being tested within the collector lifecycle
@enekofb enekofb merged commit d4fb449 into main Jul 3, 2023
@enekofb enekofb deleted the wge-2872 branch July 3, 2023 14:10
AsmaaNabilBakr pushed a commit that referenced this pull request Jul 8, 2023
* Added prometheus metric to monitor cluster watchers collector_cluster_watcher

* added collector type so we could monitor them independently

removed recorder test as the logic is being tested within the collector lifecycle
AsmaaNabilBakr added a commit that referenced this pull request Jul 12, 2023
* fix card ui

* fix params title color in dark mode

* add some styled components

* fix table, and details page

* Fix explorer filter icon in dark mode (#3010)

* add set color to icon

* explorer snap

* fix: no manual promotion button in pipelines

It seems the issue was a misconfigured Notification object. I assume it
was created in a "pipelines dev" environment where it was a practice to
deploy the service and related resources in pipeline-system.

Fixes #3020

Signed-off-by: Balazs Nadasdi <[email protected]>

* Small styling fixes (#2962)

* Updates following the Core changes on reconciledObjects and flags

* Fix sorting of gitrepos

* Replace local layout with Core layout - WIP

* Replace local layout with Core layout - WIP2

* Update lib

* Solve conflicts - 2

* go mod tidy

* Replace Page Template with Page - WIP

* Replace Page Template with Page - updated and remove content wrapper

* Fix connect a cls modal display

* Clean up Content Wrapper to be a Notification Wrapper and bring it back - WIP

* Clean up Content Wrapper to be a Notification Wrapper and bring it back

* Fix height in logo

* Update with lib - stable

* Bring back footer

* Adjust notif display

* Update go mod

* Run go mod tidy

* Replace Page directly from Core with local and update lib

* Update lib

* Cleanup

* Replace remaining instances of old Page

* Update lib

* Update layout props and remove local Nav

* Update go mod and sum

* Redo snapshots

* Update test and remove unused Section Header

* Solve conflicts with main - 2

* Redo snapshots

* Update nav style

* Pass footer to Page

* Redo snapshots

* Update lib to latest

* Fix button border

* Fix select & input field style

* Remove unused import

* Update lib

* Redo snapshots & update lib

* Redo snapshots & update lib - 2

* Update lib

* Update styles for form inputs

* Uncomment commented code in makefile

* Impersonate as soon as possible

This commit moves the construction of the impersonating client config
to the Collector implementation, with the goal of having fewer points
where it could be forgotten.

In principle the SA could just be passed through to NewWatcher by the
caller of NewCollector; however, that would leave open the
possibilities of a caller of NewCollector forgetting to do this, or
the implementation of the watcher not bothering to use it. This way,
it's part of the protocol for the Collector, and no other component
gets a chance to muck it up.

Signed-off-by: Michael Bridgen <[email protected]>
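
A hedged sketch of the impersonation step this commit describes, using client-go's rest.ImpersonationConfig; the helper name and wiring are assumptions, not the commit's actual code.

```go
package collector

import (
	"fmt"

	"k8s.io/client-go/rest"
)

// impersonate returns a copy of cfg that impersonates the given service
// account, so every watcher built from it acts with that identity.
func impersonate(cfg *rest.Config, namespace, serviceAccount string) *rest.Config {
	out := rest.CopyConfig(cfg)
	out.Impersonate = rest.ImpersonationConfig{
		UserName: fmt.Sprintf("system:serviceaccount:%s:%s", namespace, serviceAccount),
	}
	return out
}
```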

* Use callback for processing object updates

Make the reconciler take a callback, rather than a channel. Why? A
callback is more general -- if you want backpressure or
synchronisation, you can always make the func push to a chan. This
also gives the opportunity to retry transactions, which was not
implemented before and anyway would be difficult using channels.

The "objects channel" is still passed down as far as the manager, in
this commit. It's fairly easy to remove it from the Options structs
and constructor funcs further up, but it would be better to simplify
these out entirely.

Signed-off-by: Michael Bridgen <[email protected]>
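
A hedged sketch of the point made above (types and names are illustrative): a callback is more general than a channel, because a channel-based consumer can always be expressed as a callback that pushes to a chan.

```go
package collector

import "context"

// objectTransaction stands in for the real update type.
type objectTransaction struct{}

// processFunc is the callback the reconciler would invoke for each update.
type processFunc func(ctx context.Context, tx objectTransaction) error

// toChannel adapts the callback style back to a channel, with backpressure
// coming from the unbuffered (or bounded) channel itself.
func toChannel(out chan<- objectTransaction) processFunc {
	return func(ctx context.Context, tx objectTransaction) error {
		select {
		case out <- tx:
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```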

* Make watcher interface look like manager.Manager

The watcher implementation is mostly a wrapper around the controller
manager, so it exists to

 - be something you can start and stop; and,
 - hold some values before passing them on to a NewManager callback

There's no particular need to do this dance -- it might be useful to
use a fake controller-runtime manager in tests, but all that those tests
really cover is whether the scaffolding supports itself.

This commit makes space for removing Watcher altogether, by changing
the signature returned by NewWatcherFunc to match
manager.Manager#Start(ctx).

The consequences are:

 - the watchingCollector starts a goroutine for each watcher

This is because manager.Manager#Start blocks until its context is
canceled, or it fails. There was a goroutine already -- this just
moves it up to the collector, and obviates some synchronisation that was
necessary to adapt between `Start()` and `Stop()` and `Start(ctx)`.

Since watcher failures are now visible to the collector, it can take
action on them, like restarting the watcher.

 - the watchingCollector keeps track of each watcher's status

This expands the statuses to "started"|"stopped"|"failed".

Signed-off-by: Michael Bridgen <[email protected]>
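
Sketching the shape this commit describes (the exact signature is an assumption): the watcher constructor returns something with manager.Manager's blocking Start(ctx), and the collector runs it in a goroutine.

```go
package collector

import (
	"context"

	"k8s.io/client-go/rest"
)

// Starter is the part of manager.Manager the collector needs: Start blocks
// until the context is cancelled or the watcher fails.
type Starter interface {
	Start(ctx context.Context) error
}

// NewWatcherFunc is an assumed constructor shape; the collector starts each
// returned Starter in its own goroutine and uses the returned error to
// drive watcher status (and, later in this series, retries).
type NewWatcherFunc func(clusterName string, cfg *rest.Config) (Starter, error)
```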

* Remove Watcher and WatcherOptions altogether

Originally, Watcher existed to intermediate between the
manager.Manager, which does all the work, and what watchingCollector
wanted to deal with (Start/Stop/Status). But, the previous commit
shows this isn't necessary. So: remove Watcher and just start up
manager.Manager directly.

Aside from swapping some arguments around, `NewWatcher` (previously
newManager) has the same signature as `NewDefaultWatcher` did.

Signed-off-by: Michael Bridgen <[email protected]>

* Reinstate hook for deleting cluster contents

Previously, when the watcher implementation was asked to stop, it sent
a "delete all" message to the object transaction channel. The effect
of this was to delete all items from the cluster in question, from the
store and index.

Now that manager.Manager is constructed directly, this finalisation
has to be done somewhere else. A reasonable place is the collector,
since that deals with the watcher lifecycle; e.g., it's also the
caller of `NewWatcherFunc`.

Signed-off-by: Michael Bridgen <[email protected]>

* Rewrite/remove tests after removing Watcher

This commit mainly removes tests that checked starting and stopping
DefaultWatcher, since that has been removed. It also fixes up a few
tests to account for changes in types.

Notably, the two test cases in TestWatcher_NewWatcher ended up being
the same after the service account argument was removed, so I've
inlined the single test case.

Signed-off-by: Michael Bridgen <[email protected]>

* Tighten interface of Collector

There's no need to have Watch and Unwatch exported, or in the
interface, since these are called only internally.

Signed-off-by: Michael Bridgen <[email protected]>

* Use Starter interface for collector

It's handy to be able to cancel a context to stop goroutines, rather
than having to remember to call stop on each one; and, there's less
synchronisation needed. So: use the `Start(context.Context) error`
method for Collector as well, rather than Start/Stop.

In this case, the loop in watchingCollector doesn't need to be its own
goroutine, which makes it a bit easier to think about.

The trade-off is that you need to explicitly construct a goroutine for
the (blocking) Start, if you do want it to run in another goroutine.

Signed-off-by: Michael Bridgen <[email protected]>
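
A small, self-contained sketch of the trade-off mentioned above: with a blocking Start(ctx), the caller explicitly chooses to run the collector in a goroutine and stops it by cancelling the context. The fakeCollector type is an assumption standing in for the real collector.

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// fakeCollector stands in for the real collector: Start blocks until the
// context is cancelled, like manager.Manager#Start.
type fakeCollector struct{}

func (fakeCollector) Start(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	errs := make(chan error, 1)
	go func() { errs <- fakeCollector{}.Start(ctx) }() // explicit goroutine, as the trade-off notes

	if err := <-errs; err != nil && !errors.Is(err, context.DeadlineExceeded) {
		log.Fatal(err)
	}
}
```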

* Adjust tests for asynchronous Start(ctx)

Previously, some tests in watching_test.go relied on the
watchingCollector having synchronously processed existing clusters
before it returns from Start(). This is a bit of a coincidence; it's
also no longer the case, because the collector is started in its own
goroutine. So: use `g.Eventually` to check processing has happened.

Signed-off-by: Michael Bridgen <[email protected]>
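
An illustrative example of the pattern (not the project's tests): poll with Gomega's Eventually instead of asserting right after Start returns.

```go
package collector_test

import (
	"sync/atomic"
	"testing"
	"time"

	. "github.com/onsi/gomega"
)

func TestAsyncProcessingEventually(t *testing.T) {
	g := NewWithT(t)

	var processed atomic.Int32
	go func() { // stands in for the collector's goroutine
		time.Sleep(10 * time.Millisecond)
		processed.Store(2)
	}()

	// Assert on the eventual state rather than assuming Start was synchronous.
	g.Eventually(func() int32 { return processed.Load() }, "1s", "10ms").Should(Equal(int32(2)))
}
```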

* Use a workqueue to process cluster watches

At present, when the collector fails to watch a cluster, that's it --
the cluster will never be watched. So as to be able to handle retries,
this commit interposes a client-go workqueue (same as that used in
controller-runtime) between the notification of a new cluster and it
being watched. This way, if the watch fails, it can be retried.

Since the watchers are now run in goroutines managed by the collector,
it can see when one fails and retry it.

There is a bit of delicate (brittle?) tip-toeing around clusters that
are invalid; e.g., are missing a client config. These return an error
before starting a watcher. I have chosen to mark them as failed, and
not requeue them. If they come in on the clustersmngr channel again,
they will be retried then.

This commit adds a watcher status "starting", for the period when a
cluster can be started, but hasn't been yet. This makes sure that if a
cluster is recorded at all, it has a non-empty status.

Signed-off-by: Michael Bridgen <[email protected]>
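
A hedged sketch of the retry loop described here, using the client-go workqueue; the type, fields, and wiring are illustrative, not the commit's actual code.

```go
package collector

import "k8s.io/client-go/util/workqueue"

// clusterQueue retries failed cluster watches using the workqueue's backoff.
type clusterQueue struct {
	queue workqueue.RateLimitingInterface
	watch func(clusterName string) error // starts a watcher; returns an error on failure
}

func (c *clusterQueue) run() {
	for {
		item, shutdown := c.queue.Get()
		if shutdown {
			return
		}
		name := item.(string)
		if err := c.watch(name); err != nil {
			c.queue.AddRateLimited(name) // retry later with backoff
		} else {
			c.queue.Forget(name) // clear the rate limiter's failure history
		}
		c.queue.Done(item)
	}
}
```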

* Make collector stop test non-flakey

I wanted to test when you stop the collector, all the goroutines it
starts will exit. But, counting the number of goroutines isn't
reliable for this, since there may be running goroutines from other
tests hanging around. (Quite often it was failing because the number
after stopping was _less_ than the number before starting ...).

To avoid this particular flake, this commit reduces the strength of
the test to just checking that the collector returns from
`Start(ctx)`.

Signed-off-by: Michael Bridgen <[email protected]>

* Test that an error-returning watcher is restarted

This puts in a check that if a watcher is started, and it returns an
error from Start(), it'll be restarted.

Signed-off-by: Michael Bridgen <[email protected]>

* More care with watcher status

 - distinguish in watcher status between "error" meaning "returned an
   error, will be retried", and "failed" meaning "could not be started
   (and won't be retried)".

 - guard access to the clusterWatchers map, and watcher status, with a
   mutex. This prevents races between e.g., unwatching a cluster and
   reporting status.

Signed-off-by: Michael Bridgen <[email protected]>
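
A hedged sketch of the guarded status map described above; the type and field names are assumptions.

```go
package collector

import "sync"

// statusMap guards watcher statuses with a mutex, so concurrent unwatching
// and status reporting don't race.
type statusMap struct {
	mu       sync.Mutex
	statuses map[string]string // cluster name -> status
}

func (s *statusMap) set(cluster, status string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.statuses == nil {
		s.statuses = map[string]string{}
	}
	s.statuses[cluster] = status
}

func (s *statusMap) get(cluster string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	status, ok := s.statuses[cluster]
	return status, ok
}
```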

* Fix race conditions in watching_tests with atomics

Signed-off-by: Michael Bridgen <[email protected]>

* Add Form Design Notes (#3023)

* form notes

* add neutral20ToPrimary color for input borders, update colors namespace

* test updates

* added query behaviour to be tolerant to inconsistencies (#3014)

* added query behaviour to be tolerant to inconsistencies between
indexer and datastore so we don't bring the full service down. We return
best-effort results and log the errors.

* removed logging assertion

* Added collector cluster watcher prometheus metrics (#2913)

* Added prometheus metric to monitor cluster watchers collector_cluster_watcher

* added collector type so we could monitor them independently

removed recorder test as the logic is being tested within the collector lifecycle

* add RED metrics for explorer datastore read operations (#3024)

* Add retention policy concept to query service.

This will allow us to retain things like events for longer than Kubernetes does by default. Adds a scheduled job to clean them up at a scheduled interval.
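
A hedged sketch of such a scheduled cleanup loop; the interval, retention period, and function names are assumptions rather than the PR's code.

```go
package retention

import (
	"context"
	"time"
)

// runCleanup deletes records older than the retention period on every tick,
// until the context is cancelled.
func runCleanup(ctx context.Context, interval, retention time.Duration, deleteOlderThan func(cutoff time.Time) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := deleteOlderThan(time.Now().Add(-retention)); err != nil {
				continue // in real code this error would be logged
			}
		}
	}
}
```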

* Do not log in WatcherRetry test.

Logging in a goroutine after a test has completed can cause a panic.

* added explorer indexer read metrics (#3028)

* add RED metrics for explorer indexer read operations

* using metrics assertion based on _count so it is predictable.

* Delete yarn.lock

* fix some conflicts

* update yarn file

* Update yarn.lock

* update style for policConfig

* fix test

* fix linting

* fix linting

* fix review comment

* fix font sizes

* replace list with flex

* add data-testid

* fix unit test

---------

Signed-off-by: Balazs Nadasdi <[email protected]>
Signed-off-by: Michael Bridgen <[email protected]>
Co-authored-by: Joshua Israel <[email protected]>
Co-authored-by: Balazs Nadasdi <[email protected]>
Co-authored-by: AlinaGoaga <[email protected]>
Co-authored-by: Michael Bridgen <[email protected]>
Co-authored-by: Eneko Fernández <[email protected]>
Co-authored-by: Jordan Pellizzari <[email protected]>