VC: Add BN block events monitoring #4832
Conversation
…fter Genesis, but also before Genesis.

- Change VC pre-genesis behavior: add runPreGenesisWaitingLoop() and runGenesisWaitingLoop().
- Add checkedWaitForSlot() and checkedWaitForNextSlot() to strictly check the current time and print warnings (see the sketch after this list).
- Fix the VC main loop to use checkedWaitForNextSlot().
- Fix attestation_service to run attestation processing only until the end of the duty slot.
- Change the attestation_service main loop to use checkedWaitForNextSlot().
- Change block_service to properly cancel all pending proposer tasks.
- Use checkedWaitForSlot() to wait for block proposal.
- Fix block_service waitForBlockPublished() to be compatible with BN.
- Fix sync_committee_service to avoid asyncSpawn.
- Fix sync_committee_service to run only until the end of the duty slot.
- Fix sync_committee_service to use checkedWaitForNextSlot().
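For illustration, a minimal sketch of the checkedWaitForNextSlot() idea, assuming mainnet 12-second slots. This is not the PR's actual implementation; the names, the 500 ms threshold, and the blocking sleep are illustrative only:

```nim
import std/[times, os]

const SecondsPerSlot = 12  # mainnet slot time (assumption for this sketch)

# Sketch: sleep until the next slot boundary, then re-check the wall clock
# and warn if we woke up noticeably late (suspended machine, clock jump, ...).
proc checkedWaitForNextSlot(genesis: Time): int64 =
  let elapsed = (getTime() - genesis).inSeconds
  let nextSlot = elapsed div SecondsPerSlot + 1
  let target = genesis + initDuration(seconds = nextSlot * SecondsPerSlot)
  sleep(max(0, int((target - getTime()).inMilliseconds)))
  let lateBy = (getTime() - target).inMilliseconds
  if lateBy > 500:  # illustrative tolerance
    echo "warning: woke up ", lateBy, " ms after the expected slot start"
  nextSlot
```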
Fix the missing delay in aggregated attestation publishing.
Per the earlier feature discussion, relying on the event stream (vs polling) introduces a new kind of complexity, which in turn increases risk. As implemented, we should probably start by making this feature optional and default-off to evaluate its impact. I note there is time monitoring in this PR as well - this is a separate feature/protocol and we should probably put it in a separate PR for discussion (and possible upstreaming of the protocol itself to the beacon API).
The time monitoring uses polling; the block monitoring uses the event stream. It seems odd to place these in the same service. If we go with the event stream design (which has its own problems), it seems reasonable to have a service that does only event monitoring and provides this information to others. Time monitoring via polling is a different concern entirely: it is not related to "chain activity" but rather is a quality-of-service discovery mechanism, not much different from the config interface - i.e. it serves to detect "compatible" beacon nodes rather than providing ongoing information about the chain so that validator duties can be carried out efficiently. It would be better to have it separate, design-wise, and likely part of the machinery that monitors the health of beacon node connections.
Right now BN has a significant advantage over VC because it is able to monitor blocks and start attesting before the block finish line, so the attestation performance of validators attached directly to BN is much higher than that of validators running through VC+BN. If the events API connection becomes broken, missing, or stuck, you are going to wait for the full … Now, why the polling method is not working here:
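To make the latency trade-off concrete, a back-of-the-envelope sketch; the poll period and round-trip numbers below are assumptions, not measurements:

```nim
# A block appearing at a random instant is seen by a poller, on average,
# half a poll period later plus one request round-trip; an event stream
# pushes it after roughly a one-way trip.
const
  pollIntervalMs = 1000  # hypothetical poll period
  roundTripMs = 50       # hypothetical BN request round-trip

let avgPollingLatencyMs = pollIntervalMs div 2 + roundTripMs  # ~550 ms
let avgEventLatencyMs = roundTripMs div 2                     # ~25 ms
echo "polling ~", avgPollingLatencyMs, " ms, events ~", avgEventLatencyMs, " ms"
```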
So what PROs will you have? And what CONs?
Right now, while investigating a pretty strange issue (which made all our Jenkins runs fail - VC just can't be killed at all), I suspect it's the new …
I can put it in another PR, but I don't see any reason to create a special service for this thing.
I think you missed the idea: it does not watch the health of the beacon node connection, and it doesn't care how long it takes to deliver a request and receive a response. It calculates the REAL offset in time between BN's wall clock and VC's wall clock, because if this clock offset becomes high, VC will always be late while thinking that it performs well (according to its metrics). Also, this time monitoring is actually just time monitoring: we can't change BN's clock, we can only print a warning in the logs about the time offset (which is what we are doing), and I don't see any reason to mark the BN as …
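A minimal sketch of such an offset estimate, assuming an NTP-style midpoint calculation over three timestamps; this is illustrative, not the PR's code:

```nim
import std/times

# t1 = VC send time, t2 = the BN's reported wall-clock time, t3 = VC receive
# time. A positive result means the BN clock is ahead of the VC clock;
# network delay cancels out when the two legs are symmetric.
proc estimateClockOffset(t1, t2, t3: Time): Duration =
  ((t2 - t1) + (t2 - t3)) div 2
```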
It's important to remember that the other clients are all publishing their VC-created attestations at +4s, so publishing earlier on its own does not come with significant attestation performance penalties. That said, it's a valuable change in and of itself that makes the network as a whole more efficient, because we avoid a significant "burst" of data at +4s when everyone is broadcasting at the same time, instead spreading the work out over a longer time period. It also increases spec compliance.
Ok - the important property to maintain here is that any "degradation" of the event stream isn't reflected in the "general" status of the BN. Basically, if the event stream breaks or provides bogus data, we should not "remove" that BN from any list - it should only be an "additional" help that gets used when possible. Even so, I think we need an option to disable it entirely for setups that don't want to use it (cross-client setups, "private" setups that don't want to connect to a BN until it's time to publish blocks, etc.). The event stream interface gets less testing in implementations because it doesn't see widespread use in general - we should not introduce a hard requirement for it since all it provides is a small networking benefit.
250 or 500 ms would work well - it doesn't greatly matter - the "window" for publishing early is from +2 to +4 seconds.
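As a sketch of that window, assuming mainnet timing where the spec attestation deadline falls at 1/3 of the slot (+4s); attestationTime and its parameters are hypothetical names, not the PR's API:

```nim
import std/[times, options]

const SecondsPerSlot = 12

# Attest as soon as the block event for the slot arrives, but not before
# +2s and never later than the spec deadline at 1/3 of the slot.
proc attestationTime(slotStart: Time, blockSeenAt: Option[Time]): Time =
  let earliest = slotStart + initDuration(seconds = 2)
  let deadline = slotStart + initDuration(seconds = SecondsPerSlot div 3)
  if blockSeenAt.isSome:
    min(max(blockSeenAt.get, earliest), deadline)
  else:
    deadline  # no block seen: fall back to the fixed deadline
```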
The headers look fine indeed - they tell you what you need to know (the slot of the …
We avoid the need to introduce event streams into the general pipeline for an optional feature - they receive less testing and add complexity - making the Nimbus VC more broadly compatible with BNs that may have issues or not fully implement the event interface. However, as long as it's optional and treated as non-critical, the event approach is fine, as this risk is mitigated.
Well, it's also influenced by latency and delays in the BN (for example, what happens if the BN is processing some other, heavy request?). Time is a complex topic that I think deserves its own PR and discussion - we do indeed need to find a balance where it doesn't generate too many false positives while at the same time providing value. For example, the spec contains …
This PR consistently reproduces some weird bug: it freezes the application on macOS/Linux while running … I suspect that the problem is in …
This PR adds monitoring_service, which performs two actions: block event monitoring (via the events API) and BN time monitoring (via polling).
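A hypothetical skeleton of such a service, assuming chronos-style async as used elsewhere in the codebase; all names and intervals below are invented, and the actual subscription/polling logic is elided:

```nim
import chronos

# Action 1: consume the BN event stream and fan block events out to duty services.
proc runBlockEventLoop() {.async.} =
  while true:
    # (event stream handling elided)
    await sleepAsync(1.seconds)

# Action 2: poll the BN wall clock and warn when the offset grows too large.
proc runTimeMonitorLoop() {.async.} =
  while true:
    # (offset polling elided)
    await sleepAsync(60.seconds)

proc runMonitoringService() {.async.} =
  # run both actions concurrently for the lifetime of the service
  await allFutures(runBlockEventLoop(), runTimeMonitorLoop())
```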