Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VC: Remote BN clock offset monitoring. #4846

Merged
merged 9 commits into from
Jun 28, 2023
Merged

VC: Remote BN clock offset monitoring. #4846

merged 9 commits into from
Jun 28, 2023

Conversation

cheatfate
Copy link
Contributor

No description provided.

@github-actions
Copy link

github-actions bot commented Apr 21, 2023

Unit Test Results

         9 files  ±0    1 071 suites  ±0   35m 19s ⏱️ +8s
  3 712 tests ±0    3 433 ✔️ ±0  279 💤 ±0  0 ±0 
15 832 runs  ±0  15 527 ✔️ ±0  305 💤 ±0  0 ±0 

Results for commit 63a9f37. ± Comparison against base commit 970f5df.

♻️ This comment has been updated with latest results.

@tersec
Copy link
Contributor

tersec commented Apr 21, 2023

jenkins/nimbus-eth2/macos/aarch64 failure is out of disk on macm1-01-ms

@@ -9,6 +9,9 @@ import common

const
ServiceName = "fallback_service"
WARNING_TIME_OFFSET = TimeOffset.init(2.seconds)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per comments elsewhere, the vc stops working if offset gets near MAXIMUM_GOSSIP_CLOCK_DISPARITY, so half or 2/3 of that is a reasonable warning level already

@cheatfate
Copy link
Contributor Author

I want to describe the problem in example, here is the case BN is running in local network on Linux device, while VC is running on Windows. I will show you 3 attestations timings. Please ignore Windows log timestamps because this timestamps in UTC+2 timezone with daylight saving, so right now it actually has UTC+3. Both PCs are synced with single NTP server which is also placed in local network. I dont know the reason but Windows system always has 300ms offset and all my attempts to fix it failed.

Windows running VC

INF 2023-04-26 04:28:00.001+03:00 Slot start                                 slot=5497040 epoch=171782 attestationIn=4s blockIn=<unknown> validators=2 node_status=synced time_offset=309ms623us692ns delay=1ms878us
NOT 2023-04-26 04:28:00.093+03:00 Time offset data                           timestamp1=1682472480001878000 timestamp2=1682472480356551065 timestamp3=1682472480356557355 timestamp4=1682472480093385000 delay14=7221600 delay23=849191 rtt=91ms507us
NOT 2023-04-26 04:28:04.172+03:00 Attestation published                      delay=172ms307us300ns service=attestation_service validator=819354e6@212553 attestation="(aggregation_bits: 0b00000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, data: (slot: 5497040, index: 36, beacon_block_root: \"4bea56a1\", source: \"171781:c85e170e\", target: \"171782:a7844ad8\"), signature: \"858cb8ca\")"

Linux running BN

INF 2023-04-26 01:28:00.000+00:00 Slot start                                 topics="beacnde" slot=5497040 epoch=171782 sync=synced peers=159 head=076777ae:5497039 finalized=171780:b7250fb0 delay=849us191ns
INF 2023-04-26 01:28:00.059+00:00 Slot end                                   topics="beacnde" slot=5497040 nextActionWait=2m47s940ms766us608ns nextAttestationSlot=5497054 nextProposalSlot=-1 syncCommitteeDuties=none head=076777ae:5497039
INF 2023-04-26 01:28:04.435+00:00 Attestation sent                           attestation="(aggregation_bits: 0b00000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, data: (slot: 5497040, index: 36, beacon_block_root: \"4bea56a1\", source: \"171781:c85e170e\", target: \"171782:a7844ad8\"), signature: \"858cb8ca\")" delay=435ms9us714ns subnet_id=36

First i want to explain some data in Windows logs:

  1. Start slot log statement has field time_offset=309ms623us692ns which indicates that there present offset in clocks between VC and BN. Also in this statement you can see that processing delay is insignificant and equal to delay=1ms878us.
  2. Time offset log statement is related to time monitoring request (it will not going to be seen in release). The most important value here is rtt = 91ms507us which means round trip time of single request.
  3. Attestation published statement shows us that it published attestation with delay delay=172ms307us300ns

Now we should check BN side :

  1. We should note that processing delay on BN side is also insignificant and shown in Slot start statement as delay=849us191ns.
  2. But Attestation sent log statement has pretty big delay of delay=435ms9us714ns.

So how we get such big delay on BN side while VC thinks that delay should be not bigger than 172ms307us300ns. This time also includes RTT because it get calculated after attestation has been published, so we could expect that BN will receive initial attestation publish request in 172ms - RTT / 2 ~ 127ms. Now we can add time_offset value and we will obtain BN delay value - 127ms + 309ms = 436ms ~ 435ms9us714ns.

So formula for BN delay would be
BN DELAY = (VC DELAY - RTT / 2) + TIME_OFFSET

To prove my calculations i will add one more attestation

Windows VC

INF 2023-04-26 04:32:36.003+03:00 Slot start                                 slot=5497063 epoch=171783 attestationIn=4s blockIn=<unknown> validators=2 node_status=synced time_offset=312ms142us853ns delay=3ms954us900ns
NOT 2023-04-26 04:32:36.094+03:00 Time offset data                           timestamp1=1682472756002438400 timestamp2=1682472756359660092 timestamp3=1682472756359665070 timestamp4=1682472756094869600 delay14=12816100 delay23=965073 rtt=92ms431us200ns
NOT 2023-04-26 04:32:40.142+03:00 Attestation published                      delay=142ms425us300ns service=attestation_service validator=8107ff6a@212539 attestation="(aggregation_bits: 0b00000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, data: (slot: 5497063, index: 22, beacon_block_root: \"4880500a\", source: \"171782:a7844ad8\", target: \"171783:3be258a7\"), signature: \"8cf849eb\")"

Linux BN

INF 2023-04-26 01:32:36.000+00:00 Slot start                                 topics="beacnde" slot=5497063 epoch=171783 sync=synced peers=159 head=74a99b47:5497062 finalized=171781:c85e170e delay=965us73ns
INF 2023-04-26 01:32:36.029+00:00 Slot end                                   topics="beacnde" slot=5497063 nextActionWait=1m11s970ms125us27ns nextAttestationSlot=5497069 nextProposalSlot=-1 syncCommitteeDuties=none head=74a99b47:5497062
INF 2023-04-26 01:32:40.410+00:00 Attestation sent                           attestation="(aggregation_bits: 0b00000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, data: (slot: 5497063, index: 22, beacon_block_root: \"4880500a\", source: \"171782:a7844ad8\", target: \"171783:3be258a7\"), signature: \"8cf849eb\")" delay=410ms398us844ns subnet_id=22

TIME_OFFSET = 312ms142us853ns
RTT = 92ms431us200ns
VC ATTESTATION DELAY = 142ms425us300ns
BN ATTESTATION DELAY = 410ms398us844ns
CALCULATED = 142 - 92 / 2 + 312 ~ 408ms (and you could not that we have VC processing delay delay=3ms954us900ns).

Windows VC

INF 2023-04-26 05:10:12.009+03:00 Slot start                                 slot=5497251 epoch=171789 attestationIn=4s blockIn=<unknown> validators=2 node_status=synced time_offset=325ms451us603ns delay=9ms471us700ns
NOT 2023-04-26 05:10:12.102+03:00 Time offset data                           timestamp1=1682475012012461400 timestamp2=1682475012400128122 timestamp3=1682475012400165377 timestamp4=1682475012102661400 delay14=9471700 delay23=719739 rtt=90ms200us
NOT 2023-04-26 05:10:16.158+03:00 Attestation published                      delay=158ms799us700ns service=attestation_service validator=8107ff6a@212539 attestation="(aggregation_bits: 0b00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, data: (slot: 5497251, index: 5, beacon_block_root: \"e8bda37d\", source: \"171788:bdd9aeb7\", target: \"171789:0bdfbb89\"), signature: \"a18318bf\")"

Linux BN

INF 2023-04-26 02:10:12.000+00:00 Slot start                                 topics="beacnde" slot=5497251 epoch=171789 sync=synced peers=159 head=76da0f77:5497250 finalized=171787:b4bf42ea delay=719us739ns
INF 2023-04-26 02:10:12.069+00:00 Slot end                                   topics="beacnde" slot=5497251 nextActionWait=7m35s930ms206us645ns nextAttestationSlot=5497289 nextProposalSlot=-1 syncCommitteeDuties=none head=76da0f77:5497250
INF 2023-04-26 02:10:16.452+00:00 Attestation sent                           attestation="(aggregation_bits: 0b00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, data: (slot: 5497251, index: 5, beacon_block_root: \"e8bda37d\", source: \"171788:bdd9aeb7\", target: \"171789:0bdfbb89\"), signature: \"a18318bf\")" delay=452ms425us474ns subnet_id=5

TIME_OFFSET = 325ms451us603ns
RTT = 90ms200us
VC ATTESTATION DELAY = 158ms799us700ns
BN ATTESTATION DELAY = 452ms425us474ns
CALCULATED = 158 - 90 / 2 + 325 ~ 438ms (and you could not that we have VC processing delay delay=9ms471us700ns)

There of course present some error ~ 10-20ms, but it because we measure RTT for time-monitoring call and not for attestation publish request call, which is also fast but could provide another RTT value.

The core of the problem here is of course Windows time handling or some weird Windows behavior which could not properly correct their clocks according to NTP server and i'm not sure why we should start warn users when offset in time is bigger than
MAXIMUM_GOSSIP_CLOCK_DISPARITY which is 500ms. Even with such big clock difference VC could continue working.

@arnetheduck
Copy link
Member

I dont know the reason but Windows system always has 300ms offset and all my attempts to fix it failed.

nagle maybe?

MAXIMUM_GOSSIP_CLOCK_DISPARITY which is 500ms. Even with such big clock difference VC could continue working.

What happens here is that if a clock is off by more than MAXIMUM_GOSSIP_CLOCK_DISPARITY, in particular when it's early and performing the slot transition earlier than everyone else on the network, the network will simply reject it per the gossip propagation rules because the slot has not yet happened.

There is nuance here:

  • this effect happens above all for blocks which are sent on the slot transition edge (ideally) - if we send a block more than 500ms before the slot started (according to your neighbour nodes above all because of gossip), it will simply be dropped - this is the main scenario that can break a user setup
  • it applies to being early - if you're late, there is more tolerance (up to 4s)
  • there is also more tolerance for attestations being early - when the VC is early on sending the attestation, the risk of voting for an old head increases. A late attestation might not be included if blocks are full.

So, we have several paths forward:

  • be more lenient in warnings - decreases detection rate of problematic setup, but maybe it's better than having lots of false positives
    • if we consider a clock that's been set "by hand", it might be accurate within a second which is not accurate enough
  • we can be more aggressive when the VC clock is earlier than the BN - this is the most problematic case - ie we can use different limits for late and early
  • we can get to the bottom of why there is clock difference - is it lag? granularity? something else?

@arnetheduck arnetheduck merged commit c2c5d80 into unstable Jun 28, 2023
@arnetheduck arnetheduck deleted the time-monitor branch June 28, 2023 13:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants