Alert/node-crashed #8683
automation/terraform/modules/testnet-alerts/templates/testnet-alert-rules.yml.tpl
expr: count by (testnet) (Coda_Runtime_process_uptime_ms_total{testnet=~"mainnet|devnet2|snappnet"} < 600000) > 2
labels:
  testnet: "{{ $labels.testnet }}"
  severity: critical
nit: move this to Critical Alerts group?
@mrmr1993 did you mean to use snappnet in the testnet label here to generate PagerDuty alerts? The variable rule_filter currently includes snappnet as well, and you can see those in the Grafana alert notifications. If you want a PagerDuty alert for this, we'll have to update the pagerduty_alert_filter variable (but this will apply to all the alerts).
I made PR #8707 to show mainnet|devnet warnings and other testnet alerts in a Slack channel. If that is not sufficient, then I think we could add entries in testnet-alert-receivers.yml.tpl for matching specific alerts from specific testnets and sending them to specific receivers (PagerDuty or a Slack channel).
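To make the receiver idea above concrete, here is a minimal Python sketch of first-match routing on alert labels. It is only a model of the behaviour being proposed, not the contents of testnet-alert-receivers.yml.tpl; the alert name `NodesCrashed`, the receiver names, and the route table are all hypothetical.

```python
import re

# Hypothetical route table: (label matchers as anchored regexes, receiver name).
# None of these names come from the actual receivers template.
ROUTES = [
    ({"alertname": "NodesCrashed", "testnet": "mainnet|devnet2"}, "pagerduty"),
    ({"testnet": ".+"}, "slack-testnet-alerts"),
]
DEFAULT_RECEIVER = "default"


def pick_receiver(labels, routes=ROUTES, default=DEFAULT_RECEIVER):
    """Return the receiver of the first route whose matchers all match."""
    for matchers, receiver in routes:
        # A missing label is treated as the empty string.
        if all(re.fullmatch(pattern, labels.get(name, ""))
               for name, pattern in matchers.items()):
            return receiver
    return default


print(pick_receiver({"alertname": "NodesCrashed", "testnet": "mainnet"}))
print(pick_receiver({"alertname": "NodesCrashed", "testnet": "snappnet"}))
```

With this table, only the hypothetical crash alert on mainnet or devnet2 would page, while the same alert on snappnet would go to the Slack receiver.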
@ghost-not-in-the-shell could you remove the hardcoded testnet names and use ${rule_filter}?
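As a rough illustration of the suggestion, the sketch below uses Python's `string.Template`, which happens to share the `${name}` placeholder syntax, to show how a single variable could replace the hardcoded selector. It assumes, purely for illustration, that rule_filter carries the whole label selector including braces; the real value and how the template consumes it are not shown in this thread.

```python
from string import Template

# The expr from the quoted rule, with the hardcoded selector replaced by a
# placeholder. Only the expr is templated here; the rest of the rule is omitted.
RULE_EXPR = Template(
    "count by (testnet) "
    "(Coda_Runtime_process_uptime_ms_total${rule_filter} < 600000) > 2"
)


def render_expr(rule_filter):
    """Substitute the (assumed) selector string into the expression."""
    return RULE_EXPR.substitute(rule_filter=rule_filter)


# Assumed value: a full label selector, braces included.
print(render_expr('{testnet=~"mainnet|devnet2|snappnet"}'))
```

Rendering with that assumed value reproduces the hardcoded expression from the diff, so the testnet list would then be maintained in one place.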
I already did that.
Looks good to me assuming the other reviewers approve
Could you revert the ppx submodule update? Otherwise looks good
This PR addresses part 1 of #8511. The warning I am adding is triggered when the uptime of one of the nodes drops below 1 min.
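For readers less familiar with PromQL, the following Python sketch models the condition in the rule expression quoted in the review: count, per testnet, the nodes whose uptime is below a threshold, and fire when that count exceeds a minimum. The threshold (600000 ms) and count (more than 2) are taken from that quoted expression and may differ from the values in the final rule; the sample data is made up.

```python
from collections import Counter

UPTIME_THRESHOLD_MS = 600_000  # from the quoted expr: uptime < 600000
MIN_RESTARTED_NODES = 2        # from the quoted expr: count > 2


def firing_testnets(samples,
                    threshold_ms=UPTIME_THRESHOLD_MS,
                    min_count=MIN_RESTARTED_NODES):
    """samples: iterable of (testnet, uptime_ms) pairs, one per node.

    Returns the sorted testnets with more than min_count nodes whose
    uptime is below threshold_ms.
    """
    low_uptime = Counter(
        testnet for testnet, uptime_ms in samples if uptime_ms < threshold_ms
    )
    return sorted(t for t, n in low_uptime.items() if n > min_count)


# Made-up samples: three recently restarted mainnet nodes, one on devnet2.
samples = [
    ("mainnet", 5_000), ("mainnet", 20_000), ("mainnet", 590_000),
    ("mainnet", 9_000_000),
    ("devnet2", 1_000), ("devnet2", 700_000),
]
print(firing_testnets(samples))  # ['mainnet']
```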