Monitoring/alerts for Rococo/Wococo relaying #1715

Closed
2 tasks
Tracked by #1706 ...
EmmanuellNorbertTulbure opened this issue Dec 13, 2022 · 5 comments · Fixed by #1966

EmmanuellNorbertTulbure commented Dec 13, 2022

  • Verify if some additional Grafana/Prometheus setup is needed
  • Set up some monitoring/alerts (it would be cool to see a list of what we are actually monitoring and alerting on)
svyatonik 
I've also made dashboards + alerts for previous R<>W and have some description here: https://www.notion.so/paritytechnologies/Rococo-Wococo-bridge-62ac2d783d51447aa899def3932c3128
it is obsolete
svyatonik 
then two things to investigate: (1) why the relay submits parachain heads at all (given that there are no messages in the queue) and (2) why transactions get lost (or at least that's what the relay thinks)

will be copied for Kusama/Polkadot

svyatonik commented Dec 29, 2022

Made some dashboards and alerts on our Grafana. Let's not close this issue until we've tested it with the new XCM messaging.

UPD: also, because of #1736, there are no panels for relay chain headers and the associated alerts. This must be completed after the relay update.

bkontur commented Jan 24, 2023

@svyatonik @serban300
yesterday, the BHR/BHW runtimes (9370) were redeployed,
and at some point the complex-relayer failed because of the version guard with spec_version 9302 (trying to figure out the exact log: https://github.com/paritytech/devops/issues/1935#issuecomment-1401606414),
and we can see that a restart does not work, which is OK.

I don't remember, maybe we talked about that, but shouldn't we add some alert for the version guard to the Alerts: R <-> W bridge channel?

2023-01-24 15:39:26 | 2023-01-24 07:39:26 +00 ERROR bridge-guard BridgeHubRococo runtime spec version has changed from 9302 to 9370. Aborting relay

the version guard was first triggered here:

  2023-01-23 13:42:21 2023-01-23 12:42:21 +00 ERROR bridge-guard BridgeHubWococo runtime spec version has changed from 9302 to 9370. Aborting relay
2023-01-23 13:42:20 [Wococo_to_BridgeHubRococo_Parachains] 2023-01-23 12:42:20 +00 WARN bridge Wococo client has failed to return its sync status: RpcError(RestartNeeded("Networking or low-level protocol error: WebSocket connection error: connection closed"))
2023-01-23 13:42:16 [BridgeHubRococo_to_BridgeHubWococo_MessageLane_00000001] 2023-01-23 12:42:16 +00 ERROR bridge Error retrieving state from BridgeHubWococo node: RpcError(ParseError(Error("invalid type: null, expected a (both 0x-prefixed or not) hex string or byte array containing 32 bytes", line: 0, column: 0))). Retrying in 23.91071968
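
A version guard alert could key off the log lines above. The snippet below is only a sketch of that idea: the regular expression is modelled on the two `bridge-guard` lines quoted in this comment, and the function name `parse_version_guard_abort` is hypothetical, not part of the relay or of any existing alert rule.

```python
import re
from typing import Optional, Tuple

# Pattern modelled on the bridge-guard log lines quoted in this thread.
# Assumption: the message format stays the same across relay versions.
VERSION_GUARD_RE = re.compile(
    r"ERROR bridge-guard (?P<chain>\w+) runtime spec version has changed "
    r"from (?P<old>\d+) to (?P<new>\d+)\. Aborting relay"
)


def parse_version_guard_abort(line: str) -> Optional[Tuple[str, int, int]]:
    """Return (chain, old_version, new_version) for a version guard abort line, else None."""
    match = VERSION_GUARD_RE.search(line)
    if match is None:
        return None
    return match.group("chain"), int(match.group("old")), int(match.group("new"))


line = (
    "2023-01-23 12:42:21 +00 ERROR bridge-guard BridgeHubWococo runtime "
    "spec version has changed from 9302 to 9370. Aborting relay"
)
print(parse_version_guard_abort(line))  # ('BridgeHubWococo', 9302, 9370)
```

Any non-`None` result within the alert's evaluation window would then be a reason to notify the channel.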

svyatonik commented Feb 1, 2023

What has been done so far (and what remains):

Dashboards:

  • Maintenance dashboard:
    • Relay build commit;
    • Relay build version;
    • Relay errors per minute. Alert: when relay restarts because of version guard;
    • Rococo headers mismatch. Alert: when Rococo header at WococoBridgeHub doesn't match the same-number header at Rococo;
    • Wococo headers mismatch. Alert: when Wococo header at RococoBridgeHub doesn't match the same-number header at Wococo;
    • RococoBridgeHub headers mismatch. Alert: when RococoBridgeHub header at WococoBridgeHub doesn't match the same-number header at RococoBridgeHub;
    • WococoBridgeHub headers mismatch. Alert: when WococoBridgeHub header at RococoBridgeHub doesn't match the same-number header at WococoBridgeHub;
    • Relay balance at RococoBridgeHub. Alert: when it is too low (the threshold is 10 at the moment);
    • Relay balance at WococoBridgeHub. Alert: when it is too low (the threshold is 10 at the moment);
    • Relay reward at RococoBridgeHub. Alert: when balance + reward decreases significantly;
    • Relay reward at WococoBridgeHub. Alert: when balance + reward decreases significantly;
  • RococoBridgeHub -> WococoBridgeHub messages:
    • Best finalized Rococo headers. Alert: when we have synced fewer than 500 Rococo headers to BridgeHubWococo in the last 70m;
    • Best finalized Wococo headers;
    • Best finalized RococoBridgeHub headers;
    • Best finalized WococoBridgeHub headers;
    • Delivery race;
    • Confirmation race;
    • Delivery lags. Alert: when RococoBridgeHub messages are not delivered to WococoBridgeHub or are delivered with large lags;
    • Confirmation lags. Alert: when RococoBridgeHub messages are not confirmed back to RococoBridgeHub or are confirmed with large lags;
    • Reward lags. Alert: when rewards for delivering RococoBridgeHub messages to WococoBridgeHub are not relayed or are relayed with lags;
  • WococoBridgeHub -> RococoBridgeHub messages:
    • Best finalized Wococo headers. Alert: when we have synced fewer than 500 Wococo headers to BridgeHubRococo in the last 70m;
    • Best finalized Rococo headers;
    • Best finalized WococoBridgeHub headers;
    • Best finalized RococoBridgeHub headers;
    • Delivery race;
    • Confirmation race;
    • Delivery lags. Alert: when WococoBridgeHub messages are not delivered to RococoBridgeHub or are delivered with large lags;
    • Confirmation lags. Alert: when WococoBridgeHub messages are not confirmed back to WococoBridgeHub or are confirmed with large lags;
    • Reward lags. Alert: when rewards for delivering WococoBridgeHub messages to RococoBridgeHub are not relayed or are relayed with lags.
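
To make the header sync condition from the lists above concrete ("fewer than 500 headers in the last 70m"), here is a rough sketch. It assumes the best finalized header number is sampled periodically as `(timestamp, number)` pairs; the names `headers_synced` and `sync_alert` are hypothetical, and the actual alerts are Grafana rules, not Python.

```python
from typing import Optional, Sequence, Tuple

# (unix timestamp in seconds, best finalized header number at the target chain)
Sample = Tuple[float, int]


def headers_synced(samples: Sequence[Sample], now: float, window_s: float) -> Optional[int]:
    """Increase of the best finalized header number within the last window_s seconds.

    Returns None when fewer than two samples fall into the window (no data).
    """
    in_window = [number for (ts, number) in samples if now - window_s <= ts <= now]
    if len(in_window) < 2:
        return None
    return max(in_window) - min(in_window)


def sync_alert(
    samples: Sequence[Sample],
    now: float,
    window_s: float = 70 * 60,  # 70 minutes, as in the alert description
    min_headers: int = 500,     # threshold from the alert description
) -> bool:
    """True when fewer than min_headers headers were synced in the window."""
    synced = headers_synced(samples, now, window_s)
    if synced is None:
        return False  # assumption: stay silent on missing data instead of firing
    return synced < min_headers


samples = [(0.0, 1000), (1800.0, 1200), (4200.0, 1400)]
print(sync_alert(samples, now=4200.0))  # True: only 400 headers in 70 minutes
```

Staying silent on missing data is a design choice in this sketch; firing on missing data would be the stricter alternative.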

Known issues (need to be fixed):

  • it looks like the "Version guard has aborted RococoBridgeHub <> WococoBridgeHub relay" alert looks at logs for the last 6h instead of the last minute (should be fixed now);
  • it looks like there are some connection issues between Grafana and Prometheus, and all alerts may start failing. We need (do we?) to avoid firing alerts if there's such an error (IIRC by changing some setting like On error) (changed all our alerts to ignore nodata and execution errors);
  • does the "Messages from {} to {} are either not delivered, or are delivered with lags" alert really work? We haven't been getting alerts when the relay wasn't working and there were undelivered messages.
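
For the last point, this is a sketch of the condition the delivery alert is expected to catch, which may help when checking the real rule. It assumes two counters are sampled together: the latest nonce generated at the source and the latest nonce received at the target. The function name and the `max_lag` threshold are hypothetical and are not taken from the existing alert.

```python
from typing import Sequence, Tuple

# (latest generated nonce at source, latest received nonce at target), oldest sample first
NonceSample = Tuple[int, int]


def delivery_alert(samples: Sequence[NonceSample], max_lag: int = 10) -> bool:
    """True when messages are not delivered, or are delivered with a large lag.

    max_lag is a made-up threshold for illustration only.
    """
    if len(samples) < 2:
        return False  # not enough data: the alert stays silent
    first_generated, first_received = samples[0]
    last_generated, last_received = samples[-1]
    # Messages were pending at the start and nothing was delivered since.
    stalled = first_generated > first_received and last_received <= first_received
    # Delivery is progressing but too far behind the source.
    lagging = last_generated - last_received > max_lag
    return stalled or lagging


print(delivery_alert([(5, 3), (6, 3)]))  # True: pending messages, no progress
print(delivery_alert([(5, 3), (6, 6)]))  # False: everything was delivered
```

Note that with this logic the alert stays silent whenever samples are missing, so a rule that ignores missing data has the same blind spot.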

Some alerts will be available after we implement #1840

@EmmanuellNorbertTulbure

To export to our github
