[Merged by Bors] - Fix possible deadloop in beacon #6451

Closed
wants to merge 2 commits into from


fasmat
Member

@fasmat fasmat commented Nov 13, 2024

Motivation

I found a problem in the beacon protocol: if the node wakes up from sleep or syncs its clock via NTP at the wrong moment, the goroutine running the beacon protocol might end up in a deadloop without ever recovering.

Description

listenEpochs is supposed to wait until the start of an epoch and then start the beacon protocol for that epoch. When the new epoch starts, the select case <-pd.clock.AwaitLayer(layer) unblocks. This can happen any time after that layer was reached: usually within milliseconds, but if the runtime was very busy or the host was hibernating, the delay between reaching the layer and this case unblocking can be much longer. The problem is the if statement a bit further below:

if !current.FirstInEpoch() {
	continue
}

If for any reason the signal from the select case is received after the second layer of the epoch has already started, this if statement continues the for loop without updating current or layer, resulting in a deadloop.

I fixed the issue by updating layer before checking, so that if the node is late it skips participating in the beacon protocol for this epoch and waits for the next.
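The idea of the fix can be sketched in isolation. This is a minimal illustration, not the code from this PR: LayerID, layersPerEpoch, firstLayerOf and onWake are hypothetical stand-ins for the real types, and the epoch length of 4 layers is an arbitrary assumption. The point it shows is that the next layer to await is computed before the FirstInEpoch check, so a late wake-up still moves the wait target forward.

```go
package main

import "fmt"

// layersPerEpoch is an assumed value for illustration only.
const layersPerEpoch = 4

// LayerID is a hypothetical stand-in for the node's layer type.
type LayerID uint32

// Epoch returns the epoch a layer belongs to.
func (l LayerID) Epoch() uint32 { return uint32(l) / layersPerEpoch }

// FirstInEpoch reports whether the layer is the first layer of its epoch.
func (l LayerID) FirstInEpoch() bool { return uint32(l)%layersPerEpoch == 0 }

// firstLayerOf returns the first layer of the given epoch.
func firstLayerOf(epoch uint32) LayerID { return LayerID(epoch * layersPerEpoch) }

// onWake models one pass through the loop body after the await fired.
// The next layer to await is always moved to the start of the next epoch,
// whether or not the protocol is started for the current epoch.
func onWake(current LayerID) (next LayerID, start bool) {
	next = firstLayerOf(current.Epoch() + 1)
	if !current.FirstInEpoch() {
		// Late wake-up: skip this epoch, but with an updated wait target.
		return next, false
	}
	return next, true
}

func main() {
	for _, current := range []LayerID{8, 9} {
		next, start := onWake(current)
		fmt.Printf("current=%d next=%d start=%t\n", current, next, start)
	}
	// Prints:
	// current=8 next=12 start=true
	// current=9 next=12 start=false
}
```

With this ordering, waking up in layer 9 instead of layer 8 costs participation in one epoch, and the loop then blocks until layer 12.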

Test Plan

Existing tests pass

TODO

  • Explain motivation or link existing issue(s)
  • Test changes and document test plan
  • Update documentation as needed
  • Update changelog as needed

@fasmat fasmat self-assigned this Nov 13, 2024

codecov bot commented Nov 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.9%. Comparing base (fbca87d) to head (7f8251a).
Report is 1 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff           @@
##           develop   #6451   +/-   ##
=======================================
  Coverage     79.9%   79.9%           
=======================================
  Files          352     352           
  Lines        46099   46098    -1     
=======================================
+ Hits         36850   36867   +17     
+ Misses        7154    7143   -11     
+ Partials      2095    2088    -7     


@poszu
Contributor

poszu commented Nov 13, 2024

What do you mean by "deadloop"? As far as I understand, it would spin on:

		case <-pd.clock.AwaitLayer(layer):
			current := pd.clock.CurrentLayer()
			if !current.FirstInEpoch() {
				continue
			}

for a long time until the next epoch (when the current becomes the first layer finally) because layer is in the past and not updated.

@fasmat
Member Author

fasmat commented Nov 13, 2024

> What do you mean by "deadloop"? As far as I understand, it would spin on:
>
> 		case <-pd.clock.AwaitLayer(layer):
> 			current := pd.clock.CurrentLayer()
> 			if !current.FirstInEpoch() {
> 				continue
> 			}
>
> for a long time until the next epoch (when the current becomes the first layer finally) because layer is in the past and not updated.

Right, it will only loop for a full epoch.
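To make the difference concrete, here is a small deterministic simulation of both loop variants. It is only a sketch under stated assumptions: fakeClock, ticksPerLayer, layersPerEpoch and wakeUps are hypothetical names, each loop iteration is assumed to cost one clock tick, and awaitLayer is assumed to return immediately for a layer that has already been reached.

```go
package main

import "fmt"

const (
	layersPerEpoch = 4    // assumed epoch length
	ticksPerLayer  = 1000 // assumed number of loop iterations that fit in one layer
)

// fakeClock is a hypothetical clock driven by a tick counter.
type fakeClock struct{ tick int }

func (c *fakeClock) currentLayer() int { return c.tick / ticksPerLayer }

// awaitLayer returns immediately if the layer was already reached,
// otherwise it jumps time forward to the start of that layer.
func (c *fakeClock) awaitLayer(layer int) {
	if start := layer * ticksPerLayer; c.tick < start {
		c.tick = start
	}
}

func firstInEpoch(layer int) bool  { return layer%layersPerEpoch == 0 }
func nextEpochStart(layer int) int { return (layer/layersPerEpoch + 1) * layersPerEpoch }

// wakeUps runs the loop until the protocol would start and returns how many
// times the loop body ran. updateBeforeCheck selects the fixed ordering.
func wakeUps(c *fakeClock, layer int, updateBeforeCheck bool) int {
	n := 0
	for {
		c.awaitLayer(layer)
		n++
		c.tick++ // each iteration costs one tick
		current := c.currentLayer()
		if updateBeforeCheck {
			layer = nextEpochStart(current)
		}
		if !firstInEpoch(current) {
			continue // without the update, layer stays in the past
		}
		return n
	}
}

func main() {
	// The loop awaits layer 4 but only wakes up in layer 5.
	spinning := wakeUps(&fakeClock{tick: 5 * ticksPerLayer}, 4, false)
	fixed := wakeUps(&fakeClock{tick: 5 * ticksPerLayer}, 4, true)
	fmt.Printf("without update: %d wake-ups, with update: %d wake-ups\n", spinning, fixed)
	// Prints: without update: 3000 wake-ups, with update: 2 wake-ups
}
```

In this model the unfixed loop keeps running for the remaining three layers of the epoch, while the fixed ordering wakes once, skips the epoch, and wakes again at the first layer of the next one.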

@fasmat
Member Author

fasmat commented Nov 13, 2024

bors merge

spacemesh-bors bot pushed a commit that referenced this pull request Nov 13, 2024
@spacemesh-bors

Pull request successfully merged into develop.

Build succeeded:

@spacemesh-bors spacemesh-bors bot changed the title Fix possible deadloop in beacon [Merged by Bors] - Fix possible deadloop in beacon Nov 13, 2024
@spacemesh-bors spacemesh-bors bot closed this Nov 13, 2024
@spacemesh-bors spacemesh-bors bot deleted the fix-possible-deadloop branch November 13, 2024 15:08