
VC: Add BN block events monitoring #4832

Closed
wants to merge 7 commits into from

Conversation

cheatfate
Contributor

This PR adds monitoring_service, which performs two actions (sketched below):

  1. Monitoring of the BN's server-sent event stream for incoming blocks. This is done for every supplied BN connection.
  2. Monitoring of the time offset between the VC and every supplied BN.
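
For context, a minimal sketch of what the per-BN block event monitoring loop could look like, written against chronos; connectEventStream and readEvent are hypothetical SSE helpers, not the PR's actual API:

```nim
import chronos

type
  BlockEventInfo = object
    slot: uint64
    blockRoot: string

proc monitorBlockEvents(nodeUrl: string,
                        onBlock: proc(ev: BlockEventInfo) {.gcsafe.}) {.async.} =
  ## Watch the BN's server-sent event stream and invoke `onBlock` for every
  ## incoming block event; resubscribe on any failure.
  while true:
    try:
      # `connectEventStream` and `readEvent` are hypothetical stand-ins for
      # an SSE client over the standard beacon API events endpoint.
      let stream = await connectEventStream(
        nodeUrl & "/eth/v1/events?topics=block")
      while true:
        onBlock(await stream.readEvent(BlockEventInfo))
    except CancelledError:
      raise
    except CatchableError:
      # A broken or stuck stream is non-fatal: back off and resubscribe;
      # meanwhile the VC falls back to waiting for attestation_deadline.
      await sleepAsync(1.seconds)
```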

…fter Genesis, but also before Genesis.

Change VC pre-genesis behavior, add runPreGenesisWaitingLoop() and runGenesisWaitingLoop().
Add checkedWaitForSlot() and checkedWaitForNextSlot() to strictly check the current time and print warnings (see the sketch after this list).
Fix VC main loop to use checkedWaitForNextSlot().
Fix attestation_service to run attestations processing only until the end of the duty slot.
Change attestation_service main loop to use checkedWaitForNextSlot().
Change block_service to properly cancel all the pending proposer tasks.
Use checkedWaitForSlot to wait for block proposal.
Fix block_service waitForBlockPublished() to be compatible with BN.
Fix sync_committee_service to avoid asyncSpawn.
Fix sync_committee_service to run only until the end of the duty slot.
Fix sync_committee_service to use checkedWaitForNextSlot().
Fix missing delay when publishing aggregated attestations.
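
A hedged sketch of what such a checked wait might look like; currentSlot and timeToNextSlot are illustrative stand-ins for the VC's actual beacon-clock helpers, not its real API:

```nim
import chronos, chronicles

proc checkedWaitForNextSlot() {.async.} =
  ## Sleep until the next slot boundary, then verify the clock landed where
  ## expected; warn if whole slots were skipped (e.g. the host was suspended
  ## or the event loop was blocked).
  # `currentSlot` and `timeToNextSlot` are illustrative stand-ins.
  let expected = currentSlot() + 1
  await sleepAsync(timeToNextSlot())
  let actual = currentSlot()
  if actual > expected:
    warn "Time moved forward unexpectedly, slots were skipped",
      expected, actual
```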
@github-actions

Unit Test Results

     9 files  +    3    1 074 suites  +  358   33m 50s ⏱️ +12m 16s
 3 675 tests  +  310    3 396 ✔️     +   52    279 💤 +258   0 ±0
15 668 runs   +5 220   15 363 ✔️     +4 957   305 💤 +263   0 ±0

Results for commit ba625a9. Comparison against base commit 8f2f40f.

@arnetheduck
Member

Per earlier feature discussion, relying on the event stream (vs polling) introduces a new kind of complexity which in turn increases risk - as implemented, we should probably start by making this feature optional and default off to evaluate its impact.

I note there is time monitoring in this PR as well - this is a separate feature / protocol and we should probably put it in a separate PR for discussion (and possible upstreaming of the protocol itself to the beacon api)

@arnetheduck
Member

the time monitoring uses polling - the block monitoring uses the event stream - it seems odd to place these in the same service.. if we go with the event stream design (which has its own problems), it seems reasonable to have a service that does only event monitoring and provides this information to others.

Time monitoring via polling is a different concern entirely - it is not related to "chain activity" but rather is a quality-of-service discovery mechanism not much different from the config interface - ie it serves to detect "compatible" beacon nodes rather than providing ongoing information about the chain for validator duties to be carried out efficiently - it would be better to have it separate, design-wise, and likely part of the machinery that monitors the health of beacon node connections.

@cheatfate
Contributor Author

Per earlier feature discussion, relying on the event stream (vs polling) introduces a new kind of complexity which in turn increases risk - as implemented, we should probably start by making this feature optional and default off to evaluate its impact.

Right now the BN has a significant advantage over the VC, because it is able to monitor blocks and start attesting before the attestation deadline, so the attestation performance of validators attached directly to a BN is much higher than that of validators running through VC+BN.

If the events API connection becomes broken, missing, or stuck, you will simply wait for the full attestation_deadline, like it was done before.

Now, why the polling method does not work here:

  • What time interval are you proposing? 1.second? Or 1.seconds div SLOTS_PER_SECOND?
  • What call are you going to use? I suspect you will propose /eth/v1/beacon/headers, which is actually pretty broken and returns only the head block header. In any case, to get even a minimal version of the advantage the BN has, you would need to request headers at least 5-10 times per second, and you only have the attestation_deadline period every slot.

So what PROs would you have?
I don't see any, except the illusion of safety.

What CONs would you have?

  1. An enormous number of block header requests sent to the BN.
  2. A chance to get blocked by rate-limiting rules (if you are going to use some public service).
  3. A chance to receive the block too late anyway.
  4. The number of BNs significantly increases the number of requests which must be sent and processed.

Right now, while investigating a pretty strange issue (which has made all our Jenkins runs fail - the VC just can't be killed at all, and I suspect it's the new while + exception handlers issue), I saw the minimal testnet fail to deliver the block event in time, and the VC still works perfectly.

I note there is time monitoring in this PR as well - this is a separate feature / protocol and we should probably put it in a separate PR for discussion (and possible upstreaming of the protocol itself to the beacon api)

I can put it in another PR, but I don't see any reason to create a special service for this thing.

@cheatfate
Contributor Author

cheatfate commented Apr 19, 2023

the time monitoring uses polling - the block monitoring uses the event stream - it seems odd to place these in the same service.. if we go with the event stream design (which has its own problems), it seems reasonable to have a service that does only event monitoring and provides this information to others.

Time monitoring via polling is a different concern entirely - it is not related to "chain activity" but rather is a quality-of-service discovery mechanism not much different from the config interface - ie it serves to detect "compatible" beacon nodes rather than providing ongoing information about the chain for validator duties to be carried out efficiently - it would be better to have it separate, design-wise, and likely part of the machinery that monitors the health of beacon node connections.

I think you missed the idea: it does not watch the health of the beacon node connection, and it doesn't care how long it takes to deliver a request and receive a response. It calculates the REAL time offset between the BN's wall clock and the VC's wall clock, because if this clock offset becomes high, the VC will always be late while thinking that it performs well (according to its metrics).

Also, this time monitoring is exactly that: time monitoring. We can't change the BN's clock; we can only print a warning in the logs about the time offset (which is what we are doing), and I don't see any reason to mark BNs as bad or do anything like that.
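
As an illustration of the measurement, the offset can be estimated NTP-style from a single request; getBeaconNodeTime below is a hypothetical call returning the BN's wall-clock reading, standing in for whatever protocol the PR actually uses:

```nim
import chronos
import std/times

proc estimateClockOffset(nodeUrl: string): Future[times.Duration] {.async.} =
  ## NTP-style single-shot estimate of the BN/VC wall-clock offset.
  # `getBeaconNodeTime` is a hypothetical stand-in for the PR's actual
  # time-offset protocol between VC and BN.
  let t0 = getTime()                       # VC clock, request sent
  let bnTime = await getBeaconNodeTime(nodeUrl)
  let t1 = getTime()                       # VC clock, response received
  # Assume the BN sampled its clock mid round-trip; the error is bounded by
  # half the RTT plus any asymmetry between request and response latency.
  let midpoint = t0 + (t1 - t0) div 2
  return bnTime - midpoint
```

As noted below, the half-RTT assumption is exactly where BN-side latency and delays leak into the estimate.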

@arnetheduck
Member

so the attestation performance of validators attached directly to a BN is much higher than that of validators running through VC+BN.

It's important to remember that the other clients all publish their VC-created attestations at +4s, so publishing at the deadline does not, on its own, come with significant attestation performance penalties.

That said, it's a valuable change in and of itself that makes the network as a whole more efficient, because we avoid a significant "burst" of data at +4s when everyone is broadcasting at the same time, instead spreading the work over a longer time period. It also increases spec compliance.

If the events API connection becomes broken, missing, or stuck, you will simply wait for the full attestation_deadline, like it was done before.

Ok - the important property to maintain here is that any "degradation" of the event stream isn't reflected in the "general" status of the BN - basically, if the event stream breaks or provides bogus data, we should not "remove" that BN from any list - it should only be "additional" help that gets used when possible.

Even so, I think we need an option to disable it entirely for setups that don't want to use it (cross-client setups, "private" setups that don't want to connect to a BN until it's time to publish blocks, etc.) - the event stream interface gets less testing in implementations because it doesn't see widespread use in general - we should not introduce a hard requirement for it, since all it provides is a small networking benefit.

What time interval are you proposing?

250 or 500 millis would work well - it doesn't greatly matter - the "window" for publishing early is from +2 to +4 seconds.

What call are you going to use?

headers looks fine indeed - it tells you what you need to know (the slot of the head block, essentially - you don't need any other information) - why is it "broken"?
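
For comparison, a hedged sketch of the polling alternative under discussion: poll the head block header a few times per second inside the publishing window and attest as soon as the expected slot shows up. fetchHeadSlot is a hypothetical helper wrapping GET /eth/v1/beacon/headers/head:

```nim
import chronos

proc waitForBlockByPolling(nodeUrl: string, slot: uint64,
                           deadline: Moment): Future[bool] {.async.} =
  ## Poll the head block header until either the expected block appears or
  ## the attestation deadline is reached.
  # `fetchHeadSlot` is a hypothetical REST helper wrapping
  # GET /eth/v1/beacon/headers/head.
  while Moment.now() < deadline:
    try:
      if (await fetchHeadSlot(nodeUrl)) >= slot:
        return true               # block seen, we can attest early
    except CatchableError:
      discard                     # transient failure, keep polling
    await sleepAsync(500.milliseconds)
  return false                    # fall back to attesting at the deadline
```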

So what PROs would you have?

We avoid the need to introduce event streams into the general pipeline - they receive less testing and add complexity, for an optional feature - making the nimbus vc more broadly compatible with BNs that may have issues or not fully implement the event interface.

However, as long as it's optional and treated as non-critical, the event approach is fine as this risk is mitigated.

@arnetheduck
Member

It calculates the REAL time offset between the BN's wall clock and the VC's wall clock, because if this clock offset becomes high, the VC will always be late while thinking that it performs well (according to its metrics).

Well, it's also influenced by latency and delays in the BN (for example, what happens if the BN is processing some other, heavy request?).

Time is a complex topic that I think deserves its own PR and discussion - we do indeed need to find a balance where it doesn't generate too many false positives while at the same time providing value.

For example, the spec contains MAXIMUM_GOSSIP_CLOCK_DISPARITY which dictates the "tolerance" of clock differences between nodes on the network - we should probably relate whatever warnings we produce to this constant (basically, if your clock in either of BN or VC is wrong, this is where the network starts rejecting your work).
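
Concretely, the warning threshold could be tied to that constant (500 ms in the spec); a rough illustration, with offset coming from whatever measurement the VC performs:

```nim
import std/times, chronicles

# The consensus spec's p2p interface sets MAXIMUM_GOSSIP_CLOCK_DISPARITY
# to 500 ms; beyond roughly this offset the network may reject messages.
const MAXIMUM_GOSSIP_CLOCK_DISPARITY = initDuration(milliseconds = 500)

proc checkClockOffset(offset: Duration) =
  if abs(offset.inMilliseconds) >=
      MAXIMUM_GOSSIP_CLOCK_DISPARITY.inMilliseconds:
    warn "Wall clock offset between VC and BN exceeds gossip tolerance",
      offsetMs = offset.inMilliseconds
```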

@cheatfate
Contributor Author

This PR consistently reproduces some weird bug: it freezes the application on macOS/Linux while running local-testnet-minimal. The freeze happens during the shutdown process, when the VC stops at a pre-defined epoch.

I suspect that the problem is in nim-metrics and has to do with {.gcsafe.}: everything becomes broken when the metrics thread exits.

cheatfate marked this pull request as draft May 2, 2023 09:17
cheatfate closed this Jul 5, 2023