VC: Remote BN clock offset monitoring. #4846

cheatfate · 2023-04-21T03:19:30Z

No description provided.

github-actions · 2023-04-21T06:00:07Z

Unit Test Results

        9 files ±0   1 071 suites ±0 35m 19s ⏱️ +8s
  3 712 tests ±0   3 433 ✔️ ±0 279 💤 ±0 0 ❌ ±0
15 832 runs ±0 15 527 ✔️ ±0 305 💤 ±0 0 ❌ ±0

Results for commit 63a9f37. ± Comparison against base commit 970f5df.

♻️ This comment has been updated with latest results.

tersec · 2023-04-21T12:01:35Z

jenkins/nimbus-eth2/macos/aarch64 failure is out of disk on macm1-01-ms

beacon_chain/rpc/rest_nimbus_api.nim

beacon_chain/nimbus_validator_client.nim

arnetheduck · 2023-04-25T06:05:23Z

beacon_chain/validator_client/fallback_service.nim

@@ -9,6 +9,9 @@ import common

 const
  ServiceName = "fallback_service"
+  WARNING_TIME_OFFSET = TimeOffset.init(2.seconds)


per comments elsewhere, the vc stops working if offset gets near MAXIMUM_GOSSIP_CLOCK_DISPARITY, so half or 2/3 of that is a reasonable warning level already

beacon_chain/validator_client/fallback_service.nim

cheatfate · 2023-04-26T02:26:24Z

I want to describe the problem in example, here is the case BN is running in local network on Linux device, while VC is running on Windows. I will show you 3 attestations timings. Please ignore Windows log timestamps because this timestamps in UTC+2 timezone with daylight saving, so right now it actually has UTC+3. Both PCs are synced with single NTP server which is also placed in local network. I dont know the reason but Windows system always has 300ms offset and all my attempts to fix it failed.

Windows running VC

INF 2023-04-26 04:28:00.001+03:00 Slot start                                 slot=5497040 epoch=171782 attestationIn=4s blockIn=<unknown> validators=2 node_status=synced time_offset=309ms623us692ns delay=1ms878us
NOT 2023-04-26 04:28:00.093+03:00 Time offset data                           timestamp1=1682472480001878000 timestamp2=1682472480356551065 timestamp3=1682472480356557355 timestamp4=1682472480093385000 delay14=7221600 delay23=849191 rtt=91ms507us
NOT 2023-04-26 04:28:04.172+03:00 Attestation published                      delay=172ms307us300ns service=attestation_service validator=819354e6@212553 attestation="(aggregation_bits: 0b00000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, data: (slot: 5497040, index: 36, beacon_block_root: \"4bea56a1\", source: \"171781:c85e170e\", target: \"171782:a7844ad8\"), signature: \"858cb8ca\")"

Linux running BN

INF 2023-04-26 01:28:00.000+00:00 Slot start                                 topics="beacnde" slot=5497040 epoch=171782 sync=synced peers=159 head=076777ae:5497039 finalized=171780:b7250fb0 delay=849us191ns
INF 2023-04-26 01:28:00.059+00:00 Slot end                                   topics="beacnde" slot=5497040 nextActionWait=2m47s940ms766us608ns nextAttestationSlot=5497054 nextProposalSlot=-1 syncCommitteeDuties=none head=076777ae:5497039
INF 2023-04-26 01:28:04.435+00:00 Attestation sent                           attestation="(aggregation_bits: 0b00000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, data: (slot: 5497040, index: 36, beacon_block_root: \"4bea56a1\", source: \"171781:c85e170e\", target: \"171782:a7844ad8\"), signature: \"858cb8ca\")" delay=435ms9us714ns subnet_id=36

First i want to explain some data in Windows logs:

Start slot log statement has field time_offset=309ms623us692ns which indicates that there present offset in clocks between VC and BN. Also in this statement you can see that processing delay is insignificant and equal to delay=1ms878us.
Time offset log statement is related to time monitoring request (it will not going to be seen in release). The most important value here is rtt = 91ms507us which means round trip time of single request.
Attestation published statement shows us that it published attestation with delay delay=172ms307us300ns

Now we should check BN side :

We should note that processing delay on BN side is also insignificant and shown in Slot start statement as delay=849us191ns.
But Attestation sent log statement has pretty big delay of delay=435ms9us714ns.

So how we get such big delay on BN side while VC thinks that delay should be not bigger than 172ms307us300ns. This time also includes RTT because it get calculated after attestation has been published, so we could expect that BN will receive initial attestation publish request in 172ms - RTT / 2 ~ 127ms. Now we can add time_offset value and we will obtain BN delay value - 127ms + 309ms = 436ms ~ 435ms9us714ns.

So formula for BN delay would be
BN DELAY = (VC DELAY - RTT / 2) + TIME_OFFSET

To prove my calculations i will add one more attestation

Windows VC

INF 2023-04-26 04:32:36.003+03:00 Slot start                                 slot=5497063 epoch=171783 attestationIn=4s blockIn=<unknown> validators=2 node_status=synced time_offset=312ms142us853ns delay=3ms954us900ns
NOT 2023-04-26 04:32:36.094+03:00 Time offset data                           timestamp1=1682472756002438400 timestamp2=1682472756359660092 timestamp3=1682472756359665070 timestamp4=1682472756094869600 delay14=12816100 delay23=965073 rtt=92ms431us200ns
NOT 2023-04-26 04:32:40.142+03:00 Attestation published                      delay=142ms425us300ns service=attestation_service validator=8107ff6a@212539 attestation="(aggregation_bits: 0b00000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, data: (slot: 5497063, index: 22, beacon_block_root: \"4880500a\", source: \"171782:a7844ad8\", target: \"171783:3be258a7\"), signature: \"8cf849eb\")"

Linux BN

INF 2023-04-26 01:32:36.000+00:00 Slot start                                 topics="beacnde" slot=5497063 epoch=171783 sync=synced peers=159 head=74a99b47:5497062 finalized=171781:c85e170e delay=965us73ns
INF 2023-04-26 01:32:36.029+00:00 Slot end                                   topics="beacnde" slot=5497063 nextActionWait=1m11s970ms125us27ns nextAttestationSlot=5497069 nextProposalSlot=-1 syncCommitteeDuties=none head=74a99b47:5497062
INF 2023-04-26 01:32:40.410+00:00 Attestation sent                           attestation="(aggregation_bits: 0b00000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, data: (slot: 5497063, index: 22, beacon_block_root: \"4880500a\", source: \"171782:a7844ad8\", target: \"171783:3be258a7\"), signature: \"8cf849eb\")" delay=410ms398us844ns subnet_id=22

TIME_OFFSET = 312ms142us853ns
RTT = 92ms431us200ns
VC ATTESTATION DELAY = 142ms425us300ns
BN ATTESTATION DELAY = 410ms398us844ns
CALCULATED = 142 - 92 / 2 + 312 ~ 408ms (and you could not that we have VC processing delay delay=3ms954us900ns).

Windows VC

INF 2023-04-26 05:10:12.009+03:00 Slot start                                 slot=5497251 epoch=171789 attestationIn=4s blockIn=<unknown> validators=2 node_status=synced time_offset=325ms451us603ns delay=9ms471us700ns
NOT 2023-04-26 05:10:12.102+03:00 Time offset data                           timestamp1=1682475012012461400 timestamp2=1682475012400128122 timestamp3=1682475012400165377 timestamp4=1682475012102661400 delay14=9471700 delay23=719739 rtt=90ms200us
NOT 2023-04-26 05:10:16.158+03:00 Attestation published                      delay=158ms799us700ns service=attestation_service validator=8107ff6a@212539 attestation="(aggregation_bits: 0b00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, data: (slot: 5497251, index: 5, beacon_block_root: \"e8bda37d\", source: \"171788:bdd9aeb7\", target: \"171789:0bdfbb89\"), signature: \"a18318bf\")"

Linux BN

INF 2023-04-26 02:10:12.000+00:00 Slot start                                 topics="beacnde" slot=5497251 epoch=171789 sync=synced peers=159 head=76da0f77:5497250 finalized=171787:b4bf42ea delay=719us739ns
INF 2023-04-26 02:10:12.069+00:00 Slot end                                   topics="beacnde" slot=5497251 nextActionWait=7m35s930ms206us645ns nextAttestationSlot=5497289 nextProposalSlot=-1 syncCommitteeDuties=none head=76da0f77:5497250
INF 2023-04-26 02:10:16.452+00:00 Attestation sent                           attestation="(aggregation_bits: 0b00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, data: (slot: 5497251, index: 5, beacon_block_root: \"e8bda37d\", source: \"171788:bdd9aeb7\", target: \"171789:0bdfbb89\"), signature: \"a18318bf\")" delay=452ms425us474ns subnet_id=5

TIME_OFFSET = 325ms451us603ns
RTT = 90ms200us
VC ATTESTATION DELAY = 158ms799us700ns
BN ATTESTATION DELAY = 452ms425us474ns
CALCULATED = 158 - 90 / 2 + 325 ~ 438ms (and you could not that we have VC processing delay delay=9ms471us700ns)

There of course present some error ~ 10-20ms, but it because we measure RTT for time-monitoring call and not for attestation publish request call, which is also fast but could provide another RTT value.

The core of the problem here is of course Windows time handling or some weird Windows behavior which could not properly correct their clocks according to NTP server and i'm not sure why we should start warn users when offset in time is bigger than
MAXIMUM_GOSSIP_CLOCK_DISPARITY which is 500ms. Even with such big clock difference VC could continue working.

arnetheduck · 2023-04-26T06:14:13Z

I dont know the reason but Windows system always has 300ms offset and all my attempts to fix it failed.

nagle maybe?

MAXIMUM_GOSSIP_CLOCK_DISPARITY which is 500ms. Even with such big clock difference VC could continue working.

What happens here is that if a clock is off by more than MAXIMUM_GOSSIP_CLOCK_DISPARITY, in particular when it's early and performing the slot transition earlier than everyone else on the network, the network will simply reject it per the gossip propagation rules because the slot has not yet happened.

There is nuance here:

this effect happens above all for blocks which are sent on the slot transition edge (ideally) - if we send a block more than 500ms before the slot started (according to your neighbour nodes above all because of gossip), it will simply be dropped - this is the main scenario that can break a user setup
it applies to being early - if you're late, there is more tolerance (up to 4s)
there is also more tolerance for attestations being early - when the VC is early on sending the attestation, the risk of voting for an old head increases. A late attestation might not be included if blocks are full.

So, we have several paths forward:

be more lenient in warnings - decreases detection rate of problematic setup, but maybe it's better than having lots of false positives
- if we consider a clock that's been set "by hand", it might be accurate within a second which is not accurate enough
we can be more aggressive when the VC clock is earlier than the BN - this is the most problematic case - ie we can use different limits for late and early
we can get to the bottom of why there is clock difference - is it lag? granularity? something else?

beacon_chain/validator_client/fallback_service.nim

Remove delays. Fix logging statement issues. Change update from epoch to slot.

Fix updateStatus().

Fix issues when BN do not support Nimbus Extensions. Rename metric name and type change.

tersec mentioned this pull request Apr 21, 2023

automatic update of v1.3.0-rc.5 consensus spec URLs to v1.3.0 #4848

Merged

arnetheduck reviewed Apr 24, 2023

View reviewed changes

beacon_chain/rpc/rest_nimbus_api.nim Show resolved Hide resolved

arnetheduck reviewed Apr 25, 2023

View reviewed changes

beacon_chain/nimbus_validator_client.nim Outdated Show resolved Hide resolved

arnetheduck reviewed Apr 25, 2023

View reviewed changes

beacon_chain/validator_client/fallback_service.nim Outdated Show resolved Hide resolved

arnetheduck reviewed Apr 25, 2023

View reviewed changes

beacon_chain/validator_client/fallback_service.nim Outdated Show resolved Hide resolved

arnetheduck reviewed Apr 25, 2023

View reviewed changes

beacon_chain/validator_client/fallback_service.nim Show resolved Hide resolved

cheatfate force-pushed the time-monitor branch from c556b4c to 0162716 Compare May 19, 2023 00:48

cheatfate force-pushed the time-monitor branch from 0162716 to 57997df Compare June 4, 2023 07:43

arnetheduck reviewed Jun 22, 2023

View reviewed changes

beacon_chain/validator_client/fallback_service.nim Outdated Show resolved Hide resolved

cheatfate force-pushed the time-monitor branch from 4ca1fee to 7be7484 Compare June 23, 2023 01:53

arnetheduck approved these changes Jun 26, 2023

View reviewed changes

cheatfate added 9 commits June 27, 2023 16:58

Initial commit.

b7d85f5

Add algorithm in comment.

9185e62

Remove delays. Fix logging statement issues. Change update from epoch to slot.

Obtain timestamp earlier.

382149a

Add processing delays into algorithm.

f23a061

Fix time offset logging to produce integers instead of strings.

8149407

Address review comments.

6119638

Fix copyright year.

b3ed239

Fix updateStatus().

Remove fields from Slot start log statement.

8f1e299

Fix issues when BN do not support Nimbus Extensions. Rename metric name and type change.

Add beacon role to disable time offset check manually.

63a9f37

cheatfate force-pushed the time-monitor branch from 6f3893c to 63a9f37 Compare June 27, 2023 13:58

cheatfate added the merge ready label Jun 28, 2023

arnetheduck merged commit c2c5d80 into unstable Jun 28, 2023

arnetheduck deleted the time-monitor branch June 28, 2023 13:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

VC: Remote BN clock offset monitoring. #4846

VC: Remote BN clock offset monitoring. #4846

cheatfate commented Apr 21, 2023

github-actions bot commented Apr 21, 2023 •

edited

Loading

tersec commented Apr 21, 2023

arnetheduck Apr 25, 2023

cheatfate commented Apr 26, 2023

arnetheduck commented Apr 26, 2023

VC: Remote BN clock offset monitoring. #4846

VC: Remote BN clock offset monitoring. #4846

Conversation

cheatfate commented Apr 21, 2023

github-actions bot commented Apr 21, 2023 • edited Loading

Unit Test Results

tersec commented Apr 21, 2023

arnetheduck Apr 25, 2023

Choose a reason for hiding this comment

cheatfate commented Apr 26, 2023

Windows running VC

Linux running BN

Windows VC

Linux BN

Windows VC

Linux BN

arnetheduck commented Apr 26, 2023

github-actions bot commented Apr 21, 2023 •

edited

Loading