Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(baseapp): fix race condition in state #11102

Closed
wants to merge 2 commits into from
Closed

Conversation

troian
Copy link
Contributor

@troian troian commented Feb 2, 2022

as the state is not threadsafe when rpc and grpc are both active
race condition occurs

WARNING: DATA RACE
Write at 0x00c002a6cb50 by goroutine 125:
  github.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).BeginBlock()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/cosmos/cosmos-sdk/baseapp/abci.go:187 +0x9b4
  github.com/ovrclk/akash/app.(*AkashApp).BeginBlock()
      <autogenerated>:1 +0x90
  github.com/tendermint/tendermint/abci/client.(*localClient).BeginBlockSync()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/abci/client/local_client.go:280 +0x120
  github.com/tendermint/tendermint/proxy.(*appConnConsensus).BeginBlockSync()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/proxy/app_conn.go:81 +0x8c
  github.com/tendermint/tendermint/state.execBlockOnProxyApp()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/state/execution.go:307 +0x480
  github.com/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/state/execution.go:140 +0x180
  github.com/tendermint/tendermint/consensus.(*State).finalizeCommit()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/consensus/state.go:1635 +0xda8
  github.com/tendermint/tendermint/consensus.(*State).tryFinalizeCommit()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/consensus/state.go:1546 +0x468
  github.com/tendermint/tendermint/consensus.(*State).enterCommit.func1()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/consensus/state.go:1481 +0x11c
  github.com/tendermint/tendermint/consensus.(*State).enterCommit()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/consensus/state.go:1519 +0x1264
  github.com/tendermint/tendermint/consensus.(*State).addVote()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/consensus/state.go:2132 +0x11fc
  github.com/tendermint/tendermint/consensus.(*State).tryAddVote()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/consensus/state.go:1930 +0x48
  github.com/tendermint/tendermint/consensus.(*State).handleMsg()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/consensus/state.go:838 +0x51c
  github.com/tendermint/tendermint/consensus.(*State).receiveRoutine()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/tendermint/tendermint/consensus/state.go:782 +0x5a0

Previous read at 0x00c002a6cb50 by goroutine 145:
  github.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).createQueryContext()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/cosmos/cosmos-sdk/baseapp/abci.go:648 +0x3c4
  github.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).RegisterGRPCServer.func1()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/cosmos/cosmos-sdk/baseapp/grpcserver.go:50 +0x2c0
  github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x78
  github.com/grpc-ecosystem/go-grpc-middleware/recovery.UnaryServerInterceptor.func1()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/grpc-ecosystem/go-grpc-middleware/recovery/interceptors.go:33 +0xb8
  github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x78
  github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34 +0xf4
  github.com/cosmos/cosmos-sdk/x/bank/types._Query_AllBalances_Handler()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/cosmos/cosmos-sdk/x/bank/types/query.pb.go:943 +0x1bc
  github.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).RegisterGRPCServer.func2()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/github.com/cosmos/cosmos-sdk/baseapp/grpcserver.go:80 +0x124
  google.golang.org/grpc.(*Server).processUnaryRPC()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/google.golang.org/grpc/server.go:1210 +0x11b4
  google.golang.org/grpc.(*Server).handleStream()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/google.golang.org/grpc/server.go:1533 +0xed8
  google.golang.org/grpc.(*Server).serveStreams.func1.2()
      /Users/amr/go/src/github.com/ovrclk/akash/vendor/google.golang.org/grpc/server.go:871 +0xa4

Signed-off-by: Artur Troian [email protected]

as state is not threadsafe when rpc and grpc are both active
race condition occurs

Signed-off-by: Artur Troian <[email protected]>
@troian
Copy link
Contributor Author

troian commented Feb 2, 2022

@AmauryM in continue to the discussion in #10997

Copy link
Contributor

@amaury1093 amaury1093 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this LTGM, did you try concurrent grpc queries along with the other PR?

@tac0turtle
Copy link
Member

Is this issue present on master? I think its present in the other pr because we added grpc routing to the client.

Copy link
Contributor

@amaury1093 amaury1093 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't replicate this data race on 0.44.5, 0.45 or master.

Here's the simple script I used (run a node in another terminal):

#!/usr/bin/env bash
set -e

# Set variables
CFG_DIR=~/.simapp
BUILD_CMD=./build/simd
ALICE=alice
CHAIN_ID=my-chain

ALICE_ADDRESS=$($BUILD_CMD keys show $ALICE -a --keyring-backend test)

for i in {1..10000}
do
   echo $i
   grpcurl -plaintext -d "{\"address\":\"$ALICE_ADDRESS\"}" localhost:9090 cosmos.bank.v1beta1.Query/AllBalances &
done

And with some additional logging in the app, I can see that BeginBlock and grpc queries are handled concurrently.

Let's investigate a bit more before making this R4R. @troian Are you seeing data races on v0.44.5, or only after your PR?

@troian
Copy link
Contributor Author

troian commented Feb 3, 2022

@AmauryM yes. we see it on 0.44.5.
sorry, I don't have much time for now to detail the issue on master

this PR with #10997 fixes the race issue for us.

I'm not sure this runs in parallels, just sequentially 10k requests. isn't it?

for i in {1..10000}
do
   echo $i
   grpcurl -plaintext -d "{\"address\":\"$ALICE_ADDRESS\"}" localhost:9090 cosmos.bank.v1beta1.Query/AllBalances &
done

@amaury1093
Copy link
Contributor

yes. we see it on 0.44.5.

OK. Is there a way you can make a repro? e.g. a script or some code to make the data race happen

I'm not sure this runs in parallels,

It's in parallel, because of the & at the end.

@troian
Copy link
Contributor Author

troian commented Feb 3, 2022

It's in parallel, because of the & at the end.
it just detaches from the parent process. it may finish before the next test starts.

also, I'm not sure this test will reproduce the issue. take a look at my comment with stack trace in the #10997. the RPC and gRPC are trying to access non-thread-safe object

@amaury1093
Copy link
Contributor

OK. Could you explain/paste a script on how you reproduced the data race?

What I'm trying to understand is how we didn't find this data race in previous 0.44.x testing phases.

@amaury1093
Copy link
Contributor

amaury1093 commented Feb 4, 2022

Note: It seems that there's maybe even another data race in v0.44.5: #11114 :(

Copy link
Contributor

@amaury1093 amaury1093 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After #11117 I got convinced there are data race issues in v0.44+, so overall I am okay to merge this PR 👍, seems Query is indeed potentially accessing checkState the same time when BeginBlock is modifying it.

I never personally reproduced the data race though. @troian Do you think you can provide a small test like https://github.com/cosmos/cosmos-sdk/pull/11117/files#diff-9ab8b6b1ae348e450f51d4a110e504d9aee67d848a997128a39f914a0acfa7f7R807?

// WithContext update context of the state
func (st *state) WithContext(ctx sdk.Context) {
defer st.lock.Unlock()
st.lock.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also add a changelog entry

baseapp/state.go Outdated Show resolved Hide resolved
Comment on lines +23 to +24
defer st.lock.RUnlock()
st.lock.RLock()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
defer st.lock.RUnlock()
st.lock.RLock()
st.lock.RLock()
defer st.lock.RUnlock()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment why?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to lock the mutex before you can unlock it :)

@@ -17,5 +20,15 @@ func (st *state) CacheMultiStore() sdk.CacheMultiStore {

// Context returns the Context of the state.
func (st *state) Context() sdk.Context {
defer st.lock.RUnlock()
st.lock.RLock()

return st.ctx

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not actually safe, as the Context type has fields which have reference semantics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ouch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peterbourgon it seems the Context type needs some sort of deep copy.
thoughts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tho more I dig into it - the more I am convinced this is a dirty hack and eventually it will blow
BeginBlock for example
if there are two servers listening (socket and grpc) they may (and will) call BeginBlock simultaneously.
correct if I'm getting it wrong

Copy link

@peterbourgon peterbourgon Mar 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! I think you're right.

As far as I can see, BaseApp's exported methods — including but not limited to BeginBlock — can absolutely be called by concurrent goroutines. This means that they must ensure that anything they read or write is synchronized. But that's not happening. BaseApp —and many, many, many other components in the SDK — permit unsynchronized reads and writes on their encapsulated values, and consequently violate Go's memory model. Many of these soundness errors remain undetected or overlooked because the current, specific execution paths happen to not trigger them most of the time.

There are a lot of pathological issues in the Cosmos SDK, as well as the sdk.Context type specifically, which make this kind of bug hard to fix.

In this case, at a high level: contexts are supposed to be request-scoped, but here — and in many other places, too — the context value is long-lived. I might be missing something, but that seems to be a clear design error. Neither a state nor a BaseApp nor anything else with a lifetime beyond an individual request should maintain a context value.

More concretely: as with most types in the SDK, methods on the sdk.Context are — probably incorrectly — defined on a value receiver. That means every method call creates and operates on a (shallow) copy of the original value. This thrashes the GC, but more importantly it makes synchronization of any field with reference semantics more or less impossible. Even if you have a mutex in the type — which the context doesn't — those mutexes would get copied, and so wouldn't actually provide mutual exclusion on the values they'd be meant to protect.

And then there's all of the basically unsolvable problems created by the SDK's endemic misuse of panic as an error handling mechanism. But that's an entirely different discussion.

I don't see how to fix the data race without addressing these problems. Probably at least a few more, too.

--

I certainly haven't done a deep-dive on this code, and so I'm not speaking from an informed place. But based on what I do understand, it seems that the right approach to fixing this problem is to eliminate the state type altogether, including the deliverState and checkState fields in the BaseApp, in order to eliminate the long-lived context value. Then, look at whatever was writing-to and reading-from those state values to figure out what stuff they actually needed from the contexts. Capture that information specifically, in a separate and synchronized type, in the BaseApp struct.

Or, I dunno. Maybe I got it all wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @peterbourgon for looking over and confirming
I'll give it some thinking

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the SDK exists today, it's not meant to or designed to be executed concurrently -- we assume Tendermint places the relevant locks in it's reactors prior to executing ABCI calls, which it does. The issue arrises here, at least as far as I can tell, from direct client gRPC queries being executed while the state machine is executing ABCI call(s) that can contain various writes, which is really outside the scope or domain of Tendermint.

So while I see the idea proposed here with Context(), I don't think it's the correct approach, although I do appreciate the efforts @troian!

We need to take a step back and think of a different approach to allowing direct gRPC queries while the state machine is executing ABCI calls. For simplicity's sake, forget Tendermint even exists at this point. I think there are two ways we can protect reads and writes:

  1. Either by taking a revised approach to the use of the state context as you attempted (maybe it just needs a bit more thought), OR
  2. Using a RW mutex on BaseApp itself, where most ABCI calls use a write lock and we only obtain a read lock upon Query.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Construct a standalone app instance for quering which only shares the low-level db handler?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That also might be an option, but im not sure it'll be an app. Rather We might have to refactor Baseapp#Query.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexanderbez is this issue still a thing on later cosmos sdk 0.47+? or someone had opportunity to address it?

@troian
Copy link
Contributor Author

troian commented Feb 9, 2022

After #11117 I got convinced there are data race issues in v0.44+, so overall I am okay to merge this PR 👍, seems Query is indeed potentially accessing checkState the same time when BeginBlock is modifying it.

I never personally reproduced the data race though. @troian Do you think you can provide a small test like https://github.com/cosmos/cosmos-sdk/pull/11117/files#diff-9ab8b6b1ae348e450f51d4a110e504d9aee67d848a997128a39f914a0acfa7f7R807?

sure, i'll try to make one, cannot promise it very soon tho, quite a busy schedule :(

@robert-zaremba
Copy link
Collaborator

Is this PR still in draft? We want to make a release today (if possible) and we need this PR merged

Co-authored-by: Peter Bourgon <[email protected]>
@troian
Copy link
Contributor Author

troian commented Feb 9, 2022

@robert-zaremba, as @peterbourgon mentioned the mutex inContext() does not guard sdk.Context as it has quite a few values provided by reference. So the PR needs more work

without digging in to much, deep copy seems to be an option, tho there might be logic relying on references

@github-actions
Copy link
Contributor

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Mar 27, 2022
@troian troian removed the stale label Mar 27, 2022
return st.ctx
}

// WithContext update context of the state
func (st *state) WithContext(ctx sdk.Context) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WithX methods typically leave the receiver unmodified and return a copy with the requested changes. Should this be e.g. SetContext?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

@robert-zaremba
Copy link
Collaborator

@troian any hope you can continue working on this PR?

@troian
Copy link
Contributor Author

troian commented Mar 30, 2022

@robert-zaremba i don't this PR will be merged as there are some serious things about thread safety revealed.
checkout my conversation with @peterbourgon above

@troian troian mentioned this pull request Apr 5, 2022
3 tasks
@tac0turtle
Copy link
Member

@AmauryM could you turn this PR into an issue then we close it for the time being?

@github-actions
Copy link
Contributor

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jul 24, 2022
@github-actions github-actions bot closed this Jul 31, 2022
@tac0turtle tac0turtle deleted the fix-race-queries branch February 16, 2023 00:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants