Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alert/node-crashed #8683

Merged
merged 7 commits into from
Apr 29, 2021
Merged

Alert/node-crashed #8683

merged 7 commits into from
Apr 29, 2021

Conversation

ghost-not-in-the-shell
Copy link
Contributor

This PR address part 1 of #8511. The warning I am adding gets triggered when one of the nodes' uptime drop to less than 1 min.

@ghost-not-in-the-shell ghost-not-in-the-shell changed the base branch from develop to compatible April 19, 2021 18:57
expr: count by (testnet) (Coda_Runtime_process_uptime_ms_total{testnet=~"mainnet|devnet2|snappnet"} < 600000) > 2
labels:
testnet: "{{ $labels.testnet }}"
severity: critical
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: move this to Critical Alerts group?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrmr1993 did you mean to use snappnet in the testnet label here to generate pagerduty alerts? The variable rule_filter currently includes snappnet as well and you can see those in grafana alert notifications. If you wanted pagerduty alert for this we'll have to update pagerduty_alert_filter variable (but this will apply to all the alerts).

I made this PR to show mainnet|devnet warnings and other testnet alerts in a slack channel #8707. If that is not sufficient then I think we could add entries in testnet-alert-receivers.yml.tpl for matching specific alerts from specific testnets and sending them to specific receivers (pagerduty or slack channel)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghost-not-in-the-shell could you remove the hardcoded testnet names and use ${rule_filter}?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already did that.

Copy link
Member

@imeckler imeckler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me assuming the other reviewers approve

Copy link
Member

@deepthiskumar deepthiskumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you revert the ppx submodule update? Otherwise looks good

@ghost-not-in-the-shell ghost-not-in-the-shell added the ready-to-merge Adding this label will trigger mergify and trigger CI to run and merge the PR label Apr 29, 2021
@mrmr1993 mrmr1993 merged commit a309f3c into compatible Apr 29, 2021
@mrmr1993 mrmr1993 deleted the alert/node-crash-part1 branch April 29, 2021 23:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
alerting and monitoring ready-to-merge Adding this label will trigger mergify and trigger CI to run and merge the PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants