-
Notifications
You must be signed in to change notification settings - Fork 214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Merged by Bors] - Fix possible deadloop in beacon #6451
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## develop #6451 +/- ##
=======================================
Coverage 79.9% 79.9%
=======================================
Files 352 352
Lines 46099 46098 -1
=======================================
+ Hits 36850 36867 +17
+ Misses 7154 7143 -11
+ Partials 2095 2088 -7 ☔ View full report in Codecov by Sentry. |
What do you mean by "deadloop"? As far as I understand, it would spin on: case <-pd.clock.AwaitLayer(layer):
current := pd.clock.CurrentLayer()
if !current.FirstInEpoch() {
continue
} for a long time until the next epoch (when the |
Right, it will only loop for a full epoch. |
bors merge |
## Motivation I found a problem in the beacon protocol where if the node wakes up from sleep or syncs its clock via NTP at the wrong moment the go routine running the beacon protocol might end up in a deadloop without ever recovering.
Pull request successfully merged into develop. Build succeeded: |
Motivation
I found a problem in the beacon protocol where if the node wakes up from sleep or syncs its clock via NTP at the wrong moment the go routine running the beacon protocol might end up in a deadloop without ever recovering.
Description
listenEpochs
is supposed to wait until the start of an epoch to then start the beacon protocol for that epoch. When the new epoch starts the select case<-pd.clock.AwaitLayer(layer)
will unlock. This can happen any time after that layer was reached - usually ms, but if the runtime was very busy or the host was hibernating the time between reaching the layer and this case unlocking can extend much longer. The problem is the if a bit further below:if for any reason the signal from the select case was received after the 2nd layer of the epoch already started, this if statement will
continue
the for loop without updatingcurrent
orlayer
resulting in a deadloop.I fixed the issue by updating
layer
before checking, so that if the node is late it skips participating in the beacon protocol for this epoch and waits for the next.Test Plan
Existing tests pass
TODO