-
Notifications
You must be signed in to change notification settings - Fork 20.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
trie: Fix concurrent map access on trie.dirties #23674
Conversation
The dirties map of trie.Database was not read protected inside the commit function, since the trie.Database is called from many go-routines this sometimes lead to concurrent map read and write errors. The solution is to read lock over trie.dirties in commit
The |
Hi @holiman, I do have a stack trace, it's from https://github.com/celo-org/celo-blockchain (a fork) but I think it is still representative of the go-ethereum behaviour. I've added it below in case that helps. So I'm not sure whether the comment or the implementation of Lines 695 to 720 in b5a3d8e
|
Indeed, the read of dirties during |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is correct. @karalabe PTAL
I also think it's correct, but gotta say the locking looks a bit messed-up in trie.Database. The method comment of IMHO we should just take the lock once in |
db.lock.RLock() | ||
node, ok := db.dirties[hash] | ||
if !ok { | ||
db.lock.RUnlock() | ||
return nil | ||
} | ||
db.lock.RUnlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
db.lock.RLock() | |
node, ok := db.dirties[hash] | |
if !ok { | |
db.lock.RUnlock() | |
return nil | |
} | |
db.lock.RUnlock() | |
db.lock.RLock() | |
node, ok := db.dirties[hash] | |
db.lock.RUnlock() | |
if !ok { | |
return nil | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this @gballet 👍
Btw, the dirty read access in |
This PR doesn't seem right to me. The
If your code crashes, it means that this requirement is violated somewhere. Your crash dump seems to be only half of the dump, it should ideally contain the other goroutine which accessed it at the same time? Perhaps there's something wrong with the shutdown and it doesn't wait for block imports to finish? Is this custom code or stock Geth? |
Hi @karalabe, the stack trace was from https://github.com/celo-org/celo-blockchain (a geth fork). Unfortunately golang doesn't guarantee showing both sides of a concurrent map access error in the stack trace (see here) there was only one side in the stack I previously posted. It may be that the changes in celo-blockchain are violating the use of For example here Lines 527 to 555 in 633e7ef
And then here in go-ethereum/core/blockchain.go Lines 1012 to 1065 in 633e7ef
And Lines 69 to 100 in 633e7ef
So it seems that these 2 go routines could be concurrently attempting to modify the Also I just noticed this ticket:
|
Regarding BlockChain.Stop, we have already merged another PR that should provide the proper exclusion between updates in Stop and other chain mutations. @piersy if you found this race during your testing with celo, could you retry with the changes in go-ethereum master? This issue may already be fixed due to the improved synchronization in BlockChain. |
Hi @fjl I'm assuming you meant #22853, I merged that change into https://github.com/celo-org/celo-blockchain and it does seem to have solved this issue, thank you :) However it seems that this problem could arise again, all it would take is for some code to try and access some combination methods of |
The only requirement is that you can't execute transactions and commit the state at the same time concurrently. I think that's a legit requirement: one mutates the state while another is saving it. There's no possible scenario where this can meaningfully happen outside of a programming error; so IMHO let it crash, it's a bug that needs fixing. You also can't just call commit in between two inserts as the ref counts would be invalid (trie node insertions goes from bottom upwards, so you can end up with dangling references until all the trie nodes are inserted). Calling commit in between - even if lock protected - would corrupt the garbage collector's internal state and will lead to losing trie nodes from memory and eventually from disk. |
If this issue happens in Geth, sure, we need to fix it, but the fix isn't in the trie database, it's at the callsite. |
Hi @karalabe thank you for the clarification. I'm also in favour of crashing for programming errors, but if possible I try to ensure that programming errors are very easy to detect. In this case I ran hundreds of iterations of a test before I encountered the concurrent map modification error. In the case of geth it seems that the programming error responsible for the concurrent map access existed since at least the 18th of May when #22892 was raised. So I'm wondering is this code safe when it is executed concurrently but the concurrent hash map error is not hit? |
While it would be nice to detect misuse of trie.Database, it isn't really possible. We don't want to add additional locking because the database is really not meant to be mutated concurrently. I am closing this PR now because it will not be merged in this form. We are very thankful that you pushed us toward fixing the shutdown races with your end-to-end test. Please do let us know if you find similar issues in the future! |
The dirties map of trie.Database was not read protected inside the
commit function, since the trie.Database is called from many go-routines
this sometimes lead to concurrent map read and write errors.
The solution is to read lock over trie.dirties in commit