ledger: fix possible dbRound unsynchronization for trackers and registry #3910

algorandskiy · 2022-04-23T01:57:54Z

Summary

trackerRegistry.dbRound and the cached dbRound values stored in the individual trackers can fall out of sync.
Although they are updated under the same lock, produceCommittingTask might be called with a stale dbRound and pass it to trackers whose state has already been updated.
The fix is simple: read dbRound and invoke produceCommittingTask under the same lock.
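
A minimal sketch of the race and the fix, for illustration only. The `registry` and `tracker` types, the `advance`, `scheduleUnsafe` and `scheduleSafe` helpers, and the one-argument `produceCommittingTask` signature are assumptions made for this example and do not mirror the real code in ledger/tracker.go. The `between` callback stands in for a commit that completes on another goroutine in the gap between the two steps.

```go
package main

import (
	"fmt"
	"sync"
)

// tracker is a hypothetical stand-in for a ledger tracker that caches dbRound.
type tracker struct {
	cachedDbRound uint64
}

// produceCommittingTask (simplified, assumed signature) rejects a dbRound
// that disagrees with the tracker's cached value.
func (t *tracker) produceCommittingTask(dbRound uint64) error {
	if dbRound != t.cachedDbRound {
		return fmt.Errorf("stale dbRound %d, tracker is at %d", dbRound, t.cachedDbRound)
	}
	return nil
}

// registry is a hypothetical stand-in for trackerRegistry.
type registry struct {
	mu      sync.RWMutex
	dbRound uint64
	tr      *tracker
}

// advance models a finished commit: the registry and the tracker move
// forward together under the write lock.
func (r *registry) advance(to uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.dbRound = to
	r.tr.cachedDbRound = to
}

// scheduleUnsafe reads dbRound under the lock but calls the tracker after
// releasing it. between runs in that gap and models a concurrent commit.
func (r *registry) scheduleUnsafe(between func()) error {
	r.mu.RLock()
	dbRound := r.dbRound
	r.mu.RUnlock()
	between()
	return r.tr.produceCommittingTask(dbRound)
}

// scheduleSafe keeps the read and the call in one critical section, so no
// commit can slip in between them.
func (r *registry) scheduleSafe() error {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.tr.produceCommittingTask(r.dbRound)
}

func main() {
	r := &registry{dbRound: 10, tr: &tracker{cachedDbRound: 10}}
	fmt.Println("unsafe:", r.scheduleUnsafe(func() { r.advance(11) }))
	fmt.Println("safe:", r.scheduleSafe())
}
```

The unsafe variant hands the tracker a round it has already moved past, while the safe variant cannot observe a mismatch because `advance` needs the write lock.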

Test Plan

Added a new test that dances with locks and simulates the race. Unfortunately, with the wider locking in place, it no longer seems possible to reproduce it.

codecov-commenter · 2022-04-23T02:41:07Z

Codecov Report

Merging #3910 (09cc2c6) into master (163ad18) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #3910      +/-   ##
==========================================
- Coverage   50.04%   50.04%   -0.01%     
==========================================
  Files         394      394              
  Lines       68465    68465              
==========================================
- Hits        34266    34264       -2     
- Misses      30494    30498       +4     
+ Partials     3705     3703       -2

Impacted Files	Coverage Δ
ledger/tracker.go	`74.67% <100.00%> (ø)`
ledger/blockqueue.go	`82.18% <0.00%> (-2.88%)`	⬇️
cmd/tealdbg/debugger.go	`72.69% <0.00%> (-0.81%)`	⬇️
ledger/acctupdates.go	`68.51% <0.00%> (-0.66%)`	⬇️
network/wsNetwork.go	`62.99% <0.00%> (ø)`
catchup/service.go	`68.88% <0.00%> (+0.24%)`	⬆️
catchup/peerSelector.go	`100.00% <0.00%> (+1.04%)`	⬆️
crypto/merkletrie/node.go	`93.48% <0.00%> (+1.86%)`	⬆️
crypto/merkletrie/trie.go	`68.61% <0.00%> (+2.18%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

tolikzinovyev

So what is the bug exactly? I couldn't follow your explanation.

ledger/tracker.go

jannotti

I think this makes sense, though I know very little about our locking discipline for trackers. But I think I'm seeing that the registry has a dbRound, and the intent is to pass that dbRound down to all the registered trackers. Previously, the round could change between when it was taken from the registry and when it was passed down to produceCommittingTask; now it can't. That sounds good, but I can't say much more than that. Does produceCommittingTask fail if given an out-of-date round? I would have guessed it just did less work.

Edit: I think I get it. The trackers can't always commit an old round, so if they've gone too far ahead, they fail. With the code change, they can't get ahead. It does worry me that they were getting so far ahead: does that mean the new code is holding the lock for a very long time? Should we be looking to understand why that happens? If we don't, then when the same condition occurs this lock will be held, so everything stalls?

ledger/tracker.go

brianolson · 2022-04-26T17:01:12Z

catchpointtracker.go and acctupdates.go have non-trivial implementations of produceCommittingTask(), but I'm not seeing what in them cares if dbRound has moved on - or if anything refers to another moving piece of data

algorandskiy · 2022-04-26T17:05:05Z

> anything refers to another moving piece of data

The dbRound passed in from outside gets compared with au.deltas, which might already have been sliced to hold only data after au.cachedDbRound (which can get out of sync with the registry's dbRound).
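
To make the mismatch concrete, here is a rough sketch under stated assumptions. The `accountUpdates` type below is a heavily simplified, hypothetical model, and `commitOffset` and `postCommit` are names invented for this example, not functions from acctupdates.go. It assumes `deltas[i]` holds the changes for round `cachedDbRound+1+i`.

```go
package main

import "fmt"

// accountUpdates is a hypothetical model of the tracker state discussed
// above: deltas[i] holds the changes for round cachedDbRound+1+i.
type accountUpdates struct {
	cachedDbRound uint64
	deltas        []string
}

// commitOffset (hypothetical helper) computes how many deltas would be
// flushed to reach committedRound, measured from the dbRound handed in by
// the caller, and checks that count against the deltas still in memory.
func (au *accountUpdates) commitOffset(committedRound, dbRound uint64) (uint64, error) {
	if committedRound < dbRound {
		return 0, fmt.Errorf("committedRound %d is before dbRound %d", committedRound, dbRound)
	}
	offset := committedRound - dbRound
	if offset > uint64(len(au.deltas)) {
		return 0, fmt.Errorf("offset %d exceeds %d remaining deltas", offset, len(au.deltas))
	}
	return offset, nil
}

// postCommit (hypothetical helper) models a finished commit of n rounds:
// the flushed deltas are sliced away and cachedDbRound advances.
// The caller must ensure n <= len(au.deltas).
func (au *accountUpdates) postCommit(n uint64) {
	au.deltas = au.deltas[n:]
	au.cachedDbRound += n
}

func main() {
	au := &accountUpdates{cachedDbRound: 10, deltas: []string{"r11", "r12", "r13", "r14"}}

	stale := au.cachedDbRound // dbRound captured before another commit lands
	au.postCommit(3)          // deltas now hold only round 14; cachedDbRound is 13

	_, err := au.commitOffset(14, stale)
	fmt.Println("stale dbRound:", err)

	off, err := au.commitOffset(14, au.cachedDbRound)
	fmt.Println("fresh dbRound:", off, err)
}
```

With the stale value, the computed offset counts rounds that have already been sliced out of `deltas`, so the bounds check fails. With the current value, the offset matches what is left.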

algorandskiy added 3 commits April 22, 2022 20:24

Deadlock test

1737bcb

Data race demo test

b468207

Fix data race

09cc2c6

algorandskiy added Team Carbon-11 Bug-Fix labels Apr 23, 2022

algorandskiy requested review from cce, jannotti and tolikzinovyev April 23, 2022 01:57

algorandskiy self-assigned this Apr 23, 2022

algorandskiy requested review from winder and brianolson April 25, 2022 15:38

tolikzinovyev reviewed Apr 26, 2022

jannotti reviewed Apr 26, 2022

CR fix: extend lock

1e30d30

jannotti approved these changes Apr 26, 2022

algorandskiy merged commit bd1129e into algorand:master Apr 26, 2022

cce approved these changes Apr 26, 2022

This was referenced Apr 27, 2022

go-algorand 3.6.0-beta Release PR #3923

Merged

go-algorand 3.6.0-beta Release PR #3934

Merged

This was referenced May 2, 2022

go-algorand 3.6.1-beta Release PR #3943

Merged

go-algorand 3.6.2-beta Release PR #3947

Merged

Algo-devops-service mentioned this pull request May 7, 2022

go-algorand 3.6.2-stable Release PR #3960

Merged
