
VC: Add BN block events monitoring #4832

Closed
wants to merge 7 commits into from

Conversation

cheatfate
Contributor

This PR adds monitoring_service, which performs two actions (sketched below):

  1. Monitoring of the BN's server-sent event stream for incoming blocks. This is done for every supplied BN connection.
  2. Monitoring of the time offset between the VC and every supplied BN.
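
For context, a minimal sketch of what the per-BN block event monitoring loop could look like, written against chronos; connectEventStream and readEvent are hypothetical SSE helpers, not the PR's actual API:

```nim
import chronos

type
  BlockEventInfo = object
    slot: uint64
    blockRoot: string

proc monitorBlockEvents(nodeUrl: string,
                        onBlock: proc(ev: BlockEventInfo) {.gcsafe.}) {.async.} =
  ## Watch the BN's server-sent event stream and invoke `onBlock` for every
  ## incoming block event; resubscribe on any failure.
  while true:
    try:
      # `connectEventStream` and `readEvent` are hypothetical stand-ins for
      # an SSE client over the standard beacon API events endpoint.
      let stream = await connectEventStream(
        nodeUrl & "/eth/v1/events?topics=block")
      while true:
        onBlock(await stream.readEvent(BlockEventInfo))
    except CancelledError:
      raise
    except CatchableError:
      # A broken or stuck stream is non-fatal: back off and resubscribe;
      # meanwhile the VC falls back to waiting for attestation_deadline.
      await sleepAsync(1.seconds)
```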

…fter Genesis, but also before Genesis.

Change VC pre-genesis behavior, add runPreGenesisWaitingLoop() and runGenesisWaitingLoop().
Add checkedWaitForSlot() and checkedWaitForNextSlot() to strictly check the current time and print warnings (see the sketch after this list).
Fix VC main loop to use checkedWaitForNextSlot().
Fix attestation_service to run attestations processing only until the end of the duty slot.
Change attestation_service main loop to use checkedWaitForNextSlot().
Change block_service to properly cancel all the pending proposer tasks.
Use checkedWaitForSlot to wait for block proposal.
Fix block_service waitForBlockPublished() to be compatible with BN.
Fix sync_committee_service to avoid asyncSpawn.
Fix sync_committee_service to run only until the end of the duty slot.
Fix sync_committee_service to use checkedWaitForNextSlot().
Fix missing delay when publishing aggregated attestations.
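
A hedged sketch of what such a checked wait might look like; currentSlot and timeToNextSlot are illustrative stand-ins for the VC's actual beacon-clock helpers, not its real API:

```nim
import chronos, chronicles

proc checkedWaitForNextSlot() {.async.} =
  ## Sleep until the next slot boundary, then verify the clock landed where
  ## expected; warn if whole slots were skipped (e.g. the host was suspended
  ## or the event loop was blocked).
  # `currentSlot` and `timeToNextSlot` are illustrative stand-ins.
  let expected = currentSlot() + 1
  await sleepAsync(timeToNextSlot())
  let actual = currentSlot()
  if actual > expected:
    warn "Time moved forward unexpectedly, slots were skipped",
      expected, actual
```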
@github-actions

Unit Test Results

     9 files  +    3    1 074 suites  +  358   33m 50s ⏱️ +12m 16s
 3 675 tests  +  310    3 396 ✔️     +   52    279 💤 +258   0 ±0
15 668 runs   +5 220   15 363 ✔️     +4 957   305 💤 +263   0 ±0

Results for commit ba625a9. Comparison against base commit 8f2f40f.

@arnetheduck
Member

Per earlier feature discussion, relying on the event stream (vs polling) introduces a new kind of complexity which in turn increases risk - as implemented, we should probably start by making this feature optional and default off to evaluate its impact.

I note there is time monitoring in this PR as well - this is a separate feature / protocol and we should probably put it in a separate PR for discussion (and possible upstreaming of the protocol itself to the beacon api)

@arnetheduck
Member

the time monitoring uses polling - the block monitoring uses the event stream - it seems odd to place these in the same service.. if we go with the event stream design (which has its own problems), it seems reasonable to have a service that does only event monitoring and provides this information to others.

Time monitoring via polling is a different concern entirely - it is not related to "chain activity" but rather is a quality-of-service discovery mechanism not much different from the config interface - ie it serves to detect "compatible" beacon nodes rather than providing ongoing information about the chain for validator duties to be carried out efficiently - it would be better to have it separate, design-wise, and likely part of the machinery that monitors the health of beacon node connections.

@cheatfate
Contributor Author

Per earlier feature discussion, relying on the event stream (vs polling) introduces a new kind of complexity which in turn increases risk - as implemented, we should probably start by making this feature optional and default off to evaluate its impact.

Right now the BN has a significant advantage over the VC, because it is able to monitor blocks and start attesting before the attestation deadline, so the attestation performance of validators attached directly to a BN is much higher than that of validators running through VC+BN.

If the events API connection becomes broken, missing, or stuck, you will simply wait for the full attestation_deadline, like it was done before.

Now, why the polling method does not work here:

  • What time interval are you proposing? 1.second? Or 1.seconds div SLOTS_PER_SECOND?
  • What call are you going to use? I suspect you will propose /eth/v1/beacon/headers, which is actually pretty broken and returns only the head block header. In any case, to get even a minimal version of the advantage the BN has, you would need to request headers at least 5-10 times per second, and you only have the attestation_deadline period every slot.

So what PROs would you have?
I don't see any, except the illusion of safety.

What CONs would you have?

  1. An enormous number of block header requests sent to the BN.
  2. A chance to get blocked by rate-limiting rules (if you are going to use some public service).
  3. A chance to receive the block too late anyway.
  4. The number of BNs significantly increases the number of requests which must be sent and processed.

Right now, while investigating a pretty strange issue (which has made all our Jenkins runs fail - the VC just can't be killed at all, and I suspect it's the new while + exception handlers issue), I saw the minimal testnet fail to deliver the block event in time, and the VC still works perfectly.

I note there is time monitoring in this PR as well - this is a separate feature / protocol and we should probably put it in a separate PR for discussion (and possible upstreaming of the protocol itself to the beacon api)

I can put it in another PR, but I don't see any reason to create a special service for this thing.

@cheatfate
Contributor Author

cheatfate commented Apr 19, 2023

the time monitoring uses polling - the block monitoring uses the event stream - it seems odd to place these in the same service.. if we go with the event stream design (which has its own problems), it seems reasonable to have a service that does only event monitoring and provides this information to others.

Time monitoring via polling is a different concern entirely - it is not related to "chain activity" but rather is a quality-of-service discovery mechanism not much different from the config interface - ie it serves to detect "compatible" beacon nodes rather than providing ongoing information about the chain for validator duties to be carried out efficiently - it would be better to have it separate, design-wise, and likely part of the machinery that monitors the health of beacon node connections.

I think you missed the idea: it does not watch the health of the beacon node connection, and it doesn't care how long it takes to deliver a request and receive a response. It calculates the REAL time offset between the BN's wall clock and the VC's wall clock, because if this clock offset becomes high, the VC will always be late while thinking that it performs well (according to its metrics).

Also, this time monitoring is exactly that: time monitoring. We can't change the BN's clock; we can only print a warning in the logs about the time offset (which is what we are doing), and I don't see any reason to mark BNs as bad or do anything like that.
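
As an illustration of the measurement, the offset can be estimated NTP-style from a single request; getBeaconNodeTime below is a hypothetical call returning the BN's wall-clock reading, standing in for whatever protocol the PR actually uses:

```nim
import chronos
import std/times

proc estimateClockOffset(nodeUrl: string): Future[times.Duration] {.async.} =
  ## NTP-style single-shot estimate of the BN/VC wall-clock offset.
  # `getBeaconNodeTime` is a hypothetical stand-in for the PR's actual
  # time-offset protocol between VC and BN.
  let t0 = getTime()                       # VC clock, request sent
  let bnTime = await getBeaconNodeTime(nodeUrl)
  let t1 = getTime()                       # VC clock, response received
  # Assume the BN sampled its clock mid round-trip; the error is bounded by
  # half the RTT plus any asymmetry between request and response latency.
  let midpoint = t0 + (t1 - t0) div 2
  return bnTime - midpoint
```

As noted below, the half-RTT assumption is exactly where BN-side latency and delays leak into the estimate.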

@arnetheduck
Member

so the attestation performance of validators attached directly to a BN is much higher than that of validators running through VC+BN.

It's important to remember that the other clients all publish their VC-created attestations at +4s, so publishing at the deadline does not, on its own, come with significant attestation performance penalties.

That said, it's a valuable change in and of itself that makes the network as a whole more efficient, because we avoid a significant "burst" of data at +4s when everyone is broadcasting at the same time, instead spreading the work over a longer time period. It also increases spec compliance.

If the events API connection becomes broken, missing, or stuck, you will simply wait for the full attestation_deadline, like it was done before.

Ok - the important property to maintain here is that any "degradation" of the event stream isn't reflected in the "general" status of the BN - basically, if the event stream breaks or provides bogus data, we should not "remove" that BN from any list - it should only be "additional" help that gets used when possible.

Even so, I think we need an option to disable it entirely for setups that don't want to use it (cross-client setups, "private" setups that don't want to connect to a BN until it's time to publish blocks, etc.) - the event stream interface gets less testing in implementations because it doesn't see widespread use in general - we should not introduce a hard requirement for it, since all it provides is a small networking benefit.

What time interval are you proposing?

250 or 500 millis would work well - it doesn't greatly matter - the "window" for publishing early is from +2 to +4 seconds.

What call are you going to use?

headers looks fine indeed - it tells you what you need to know (the slot of the head block, essentially - you don't need any other information) - why is it "broken"?
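
For comparison, a hedged sketch of the polling alternative under discussion: poll the head block header a few times per second inside the publishing window and attest as soon as the expected slot shows up. fetchHeadSlot is a hypothetical helper wrapping GET /eth/v1/beacon/headers/head:

```nim
import chronos

proc waitForBlockByPolling(nodeUrl: string, slot: uint64,
                           deadline: Moment): Future[bool] {.async.} =
  ## Poll the head block header until either the expected block appears or
  ## the attestation deadline is reached.
  # `fetchHeadSlot` is a hypothetical REST helper wrapping
  # GET /eth/v1/beacon/headers/head.
  while Moment.now() < deadline:
    try:
      if (await fetchHeadSlot(nodeUrl)) >= slot:
        return true               # block seen, we can attest early
    except CatchableError:
      discard                     # transient failure, keep polling
    await sleepAsync(500.milliseconds)
  return false                    # fall back to attesting at the deadline
```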

So what PROs would you have?

We avoid the need to introduce event streams into the general pipeline - they receive less testing and add complexity, for an optional feature - making the nimbus vc more broadly compatible with BNs that may have issues or not fully implement the event interface.

However, as long as it's optional and treated as non-critical, the event approach is fine as this risk is mitigated.

@arnetheduck
Member

It calculates the REAL time offset between the BN's wall clock and the VC's wall clock, because if this clock offset becomes high, the VC will always be late while thinking that it performs well (according to its metrics).

Well, it's also influenced by latency and delays in the BN (for example, what happens if the BN is processing some other, heavy request?).

Time is a complex topic that I think deserves its own PR and discussion - we do indeed need to find a balance where it doesn't generate too many false positives while at the same time providing value.

For example, the spec contains MAXIMUM_GOSSIP_CLOCK_DISPARITY which dictates the "tolerance" of clock differences between nodes on the network - we should probably relate whatever warnings we produce to this constant (basically, if your clock in either of BN or VC is wrong, this is where the network starts rejecting your work).
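
Concretely, the warning threshold could be tied to that constant (500 ms in the spec); a rough illustration, with offset coming from whatever measurement the VC performs:

```nim
import std/times, chronicles

# The consensus spec's p2p interface sets MAXIMUM_GOSSIP_CLOCK_DISPARITY
# to 500 ms; beyond roughly this offset the network may reject messages.
const MAXIMUM_GOSSIP_CLOCK_DISPARITY = initDuration(milliseconds = 500)

proc checkClockOffset(offset: Duration) =
  if abs(offset.inMilliseconds) >=
      MAXIMUM_GOSSIP_CLOCK_DISPARITY.inMilliseconds:
    warn "Wall clock offset between VC and BN exceeds gossip tolerance",
      offsetMs = offset.inMilliseconds
```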

@cheatfate
Contributor Author

This PR consistently reproduces some weird bug: it freezes the application on macOS/Linux while running local-testnet-minimal. The freeze happens during the shutdown process, when the VC stops at a pre-defined epoch.

I suspect that the problem is in nim-metrics and has to do with {.gcsafe.}: everything becomes broken when the metrics thread exits.

cheatfate marked this pull request as draft May 2, 2023 09:17
cheatfate closed this Jul 5, 2023