Skip to content
This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

prometheus: bundle Alertmanager and siteConfig sync #11832

Merged
merged 58 commits into from
Jul 8, 2020

Conversation

bobheadxi
Copy link
Member

@bobheadxi bobheadxi commented Jul 1, 2020

Context

For onlookers, some recent product-level context may help: this change is aiding in getting us to a point where we can push all site admins to turn on monitoring for their Sourcegraph instance by having them configure alerting through the Sourcegraph site configuration and having our monitoring stack automatically respond to that. This comment goes into a little more detail.

Changes

This PR completely removes the grafana-wrapper and functionality for deploying observability.alerts as Grafana alerts (effectively reverting it to vanilla Grafana), and replaces it with a new prom-wrapper that bundles Prometheus + Alertmanager in sourcegraph/prometheus alongside the same observability.alerts features added in #10641

The rationale for this change is covered in the comments of https://github.com/sourcegraph/sourcegraph/issues/11452, but in a nutshell:

Other advantages:

  • Alertmanager configuration just involves writing to a file and hitting a reload endpoint, which simplifies the implementation (monitoring: grafana email notifier support #11554 required adding some Grafana process management to reload some configuration)
  • The frontend is already configured to talk to the Prometheus container for information about alerts as part of https://github.com/sourcegraph/sourcegraph/pull/10704, which makes responsibilities a bit clearer than the Grafana approach where it seemed like getting information about alerting would start to involve talking to different services
  • Alertmanager has a nice UI which I've reverse-proxied on prometheus:9090/alertmanager - might come in handy

Caveats:

  • This PR introduces some breaking changes in configuration to align it with Alertmanager's options - after discussion with @slimsag, we decided the number of admins taking advantage of the alerting introduced in 3.17 should be small enough that we should be fine just noting the breaking changes in the changelog. Going forward, I think it would be best if we don't align configuration directly with the options of our alerting provider like I did in monitoring: configure alert notifications from site config #11427, and just add minimal notifier fields on a kind of need-to-have basis
  • Logs of 3 services in one container are difficult to read, but I'm unsure how often we would look at Prometheus logs in practice. I considered leveraging a library that can add prefixes to lines written to an io.Writer, but only managed to find one simple but seemingly abandoned library, go-prefix-writer, so I've opted not to introduce the dependency for now
  • Somewhat duplicated definitions - to trigger Alertmanager, we need "real" Prometheus alerts, so now we have alert_count and Prometheus alerts

Best reviewed as a whole and ignoring all the Grafana changes

Issues

TODOs

  • manually test each individual notifier - the documentation indicates I'm setting the fields correctly and the Alertmanager UI indicates that changes are applying as expected, but I've yet to successfully receive a notification. I'm currently doing this by manually triggering alerts via:
     curl -H "Content-Type: application/json" -d '[{"labels":{"alertname":"robertrobert","level":"critical"}}]' localhost:9090/alerts/api/v1/alerts
    
    update: the config structs provided by alertmanager never marshal any secrets, so they are not being written to disk correctly, which is very unfortunate (Feature request: override secret censoring during marshaling to YAML prometheus/alertmanager#1985) - trying to figure out a way around this short of copy-pasting the definitions opened config: add MarshalSecrets flag prometheus/alertmanager#2316
  • find a way to trigger a Sourcegraph alert to make sure the whole thing works (because the above is manual) => set low thresholds temporarily for some metrics. See samples in #alerts
  • changelog item communicating breaking changes

Non-goals

  • Linking alerts back to Grafana panels for the corresponding metric - this requires some work on our generator, and might not be very straightforward (seems only incremental IDs are available, and not UIDs, for panels)
  • Amazing templates for each notification type - the most lacking one right now is the email template, which is just a plain-text thing at the moment, but I think those are best addressed in follow-ups. The goal for this iteration is to have them be consistent (link back to solutions docs, indicate basic metadata)

bobheadxi added 14 commits June 29, 2020 09:58
commit bf833fe
Merge: f32b042 f142c9c
Author: Robert Lin <[email protected]>
Date:   Tue Jun 23 10:59:00 2020 +0800

    Merge branch 'master' of github.com:sourcegraph/sourcegraph into distribution/grafana-smtp

commit f32b042
Merge: 2327cd3 34bd181
Author: Robert Lin <[email protected]>
Date:   Tue Jun 23 09:53:35 2020 +0800

    Merge branch 'master' of github.com:sourcegraph/sourcegraph into distribution/grafana-smtp

commit 2327cd3
Author: Robert Lin <[email protected]>
Date:   Tue Jun 23 09:53:27 2020 +0800

    remove todos

commit 3d48e18
Author: Robert Lin <[email protected]>
Date:   Mon Jun 22 13:58:14 2020 +0800

    make problem format more consistent

commit c5ba7db
Author: Robert Lin <[email protected]>
Date:   Mon Jun 22 13:54:49 2020 +0800

    report if grafana is down as part of config-subscriber problems

commit ef824fc
Author: Robert Lin <[email protected]>
Date:   Mon Jun 22 10:32:58 2020 +0800

    add default for alerts

commit 180fd43
Author: Robert Lin <[email protected]>
Date:   Mon Jun 22 10:32:50 2020 +0800

    disable external snapshots

commit 9626e47
Author: Robert Lin <[email protected]>
Date:   Mon Jun 22 09:51:48 2020 +0800

    add docstring

commit c1c1545
Merge: ac5749e 716ceb7
Author: Robert Lin <[email protected]>
Date:   Mon Jun 22 09:49:29 2020 +0800

    Merge branch 'master' of github.com:sourcegraph/sourcegraph into distribution/grafana-smtp

commit ac5749e
Merge: 9e819c4 c0c220b
Author: Robert Lin <[email protected]>
Date:   Fri Jun 19 08:21:19 2020 +0800

    Merge branch 'master' of github.com:sourcegraph/sourcegraph into distribution/grafana-smtp

commit 9e819c4
Author: Robert Lin <[email protected]>
Date:   Fri Jun 19 08:21:09 2020 +0800

    improve notifier change behaviour

commit 4c4a7d2
Author: Robert Lin <[email protected]>
Date:   Thu Jun 18 11:47:06 2020 +0800

    fix return discard

commit 9a9be6b
Author: Robert Lin <[email protected]>
Date:   Thu Jun 18 11:21:04 2020 +0800

    address linter warnings

commit 7242ff3
Merge: a2b07d5 ce911b0
Author: Robert Lin <[email protected]>
Date:   Thu Jun 18 10:37:56 2020 +0800

    Merge branch 'master' of github.com:sourcegraph/sourcegraph into distribution/grafana-smtp

commit a2b07d5
Author: Robert Lin <[email protected]>
Date:   Thu Jun 18 10:37:37 2020 +0800

    add tests for grafana smtp change

commit 7cc2fe0
Author: Robert Lin <[email protected]>
Date:   Wed Jun 17 14:38:32 2020 +0800

    refactor to remove integration points

commit 2f60a4d
Author: Robert Lin <[email protected]>
Date:   Wed Jun 17 13:23:01 2020 +0800

    improve docs

commit 2981a3d
Merge: 664ebe6 ed5a2ab
Author: Robert Lin <[email protected]>
Date:   Wed Jun 17 13:21:06 2020 +0800

    Merge branch 'master' of github.com:sourcegraph/sourcegraph into distribution/grafana-smtp

commit 664ebe6
Author: Robert Lin <[email protected]>
Date:   Wed Jun 17 13:20:59 2020 +0800

    implement smtp sync

commit 2c10270
Author: Robert Lin <[email protected]>
Date:   Wed Jun 17 13:12:37 2020 +0800

    regenerate stringdata

commit 03da223
Author: Robert Lin <[email protected]>
Date:   Wed Jun 17 13:03:42 2020 +0800

    dump grafana build output if build fails

commit 267bca0
Author: Robert Lin <[email protected]>
Date:   Wed Jun 17 09:44:28 2020 +0800

    improve docs

commit 1d81518
Merge: 3ca4892 254e676
Author: Robert Lin <[email protected]>
Date:   Wed Jun 17 09:23:34 2020 +0800

    Merge branch 'master' of github.com:sourcegraph/sourcegraph into distribution/grafana-smtp

commit 3ca4892
Author: Robert Lin <[email protected]>
Date:   Tue Jun 16 15:31:24 2020 +0800

    scaffold improvements to change application

commit 7162a4c
Author: Robert Lin <[email protected]>
Date:   Tue Jun 16 13:05:57 2020 +0800

    restore email schema support
The notice doesn't add a lot of information to anyone who might receive the alert. Moved the disclaimer to PossibleSolutions instead
@bobheadxi bobheadxi requested review from uwedeportivo, davejrt, emidoots and a team July 1, 2020 01:24
@emidoots
Copy link
Member

emidoots commented Jul 1, 2020

This is excellent, thanks for the write-up! I will take a look at the implementation and try this out tomorrow.

@bobheadxi
Copy link
Member Author

[...] but I've yet to successfully receive a notification. I'm going this by [???]

It looks like your message got cut off here

I've fixed the point now :)

@emidoots
Copy link
Member

emidoots commented Jul 1, 2020

find a way to trigger a Sourcegraph alert to make sure the whole thing works (because the above is manual)

It would be nice to have this happen automatically after saving changes to the alerting configuration - or even just a GraphQL query one can run in e.g. sourcegraph.com/api/console (but neither of this is needed for this PR and I don't want to increase its scope)

One easy way to do it would be to declare a a Prometheus metric like test_alert and define alerts on that through our regular monitoring stack, then simply increment that metric whenever you want it to fire

@uwedeportivo
Copy link
Contributor

this is very cool. i'll take a closer look tomorrow.

@unknwon
Copy link
Member

unknwon commented Jul 1, 2020

Do you think cloud teams should review this PR? (come here via code owner)

@emidoots emidoots removed the request for review from a team July 1, 2020 07:46
@emidoots
Copy link
Member

emidoots commented Jul 1, 2020

@unknwon I would say not necessary as Distribution owns this - but feel free to review if you like as usual of course. Will make sure CODEOWNERS file reflects this.

@tsenart
Copy link
Contributor

tsenart commented Jul 1, 2020

Can we dog-food this for Sourcegraph Cloud?

@bobheadxi
Copy link
Member Author

Can we dog-food this for Sourcegraph Cloud?

Yep! Once I finish this up, I'll update https://github.com/sourcegraph/deploy-sourcegraph-dot-com/pull/2839 for Cloud (Cloud == dot-com right?)

bobheadxi and others added 2 commits July 7, 2020 08:24
Co-authored-by: Stephen Gutekanst <[email protected]>
Co-authored-by: Keegan Carruthers-Smith <[email protected]>
@bobheadxi bobheadxi force-pushed the distribution/prom-alerting branch from 47c5a62 to 527eed6 Compare July 7, 2020 00:28
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
7 participants