-
Notifications
You must be signed in to change notification settings - Fork 493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ledger: fix possible dbRound unsynchronization for trackers and registry #3910
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3910 +/- ##
==========================================
- Coverage 50.04% 50.04% -0.01%
==========================================
Files 394 394
Lines 68465 68465
==========================================
- Hits 34266 34264 -2
- Misses 30494 30498 +4
+ Partials 3705 3703 -2
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what is the bug exactly? I couldn't follow your explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense, though I know very little about our locking discipline for trackers. But I think I'm seeing that the registry has a dbround, and the intent is to pass down the dbround to all the registered trackers. Previously, the round could change between when it was taken from the registry and passed down to produceCommittingTask, now it can't. That sounds good, but I can't say much more than that. Does produceCommittingRound fail if given an out of date round? I would have guessed it just did less work.
Edit: I think I get it. The tracker's can't always commit an old round. So if they've gone too far ahead, they fail. With the code change, they can't get ahead. It does worry me that they were getting so far ahead - does that mean the new code is holding the lock for a very long time? Should we be looking to understand why that happens? If we don't, when the same condition occurs, this lock will be held, so everything stalls?
|
the dbRound from outside gets compared with |
Summary
There is an issue with
trackerRegistry.dbRound
value and cached dbRound values stored in trackers.Although they are updated under the same lock,
produceCommittingTask
might use non-updateddbRound
and give it to trackers with updated state.The fix is simple: have
dbRound
usage andproduceCommittingTask
invocation under the same lock.Test Plan
Added a new test that dances with locks and simulates the race. Unfortunately after the wider locking it looks like impossible to recreate it.