-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mvcc: fix panic by allowing future revision watcher from restore operation #9775
Conversation
Codecov Report
@@ Coverage Diff @@
## master #9775 +/- ##
=========================================
Coverage ? 69.64%
=========================================
Files ? 375
Lines ? 35209
Branches ? 0
=========================================
Hits ? 24523
Misses ? 8925
Partials ? 1761
Continue to review full report at Codecov.
|
why would we restore to a store with old revision? restore should always end up with a newer revision, no? |
@xiang90 It happens:
Thus, node A ends up choosing the watcher with watch revision 5, but current revision 16. In this case, restore ends up with an old revision. |
ok. makes sense to me now. |
but then when that happened, we might not be able to send the updates between 5-16 if the newer version of the store has say revision 8 already compacted. |
@xiang90 That is handled via
|
mvcc/watcher_group.go
Outdated
if !w.restore { | ||
panic(fmt.Errorf("watcher minimum revision %d should not exceed current revision %d", w.minRev, curRev)) | ||
} | ||
if lg != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably just remove the logging and add some comments on why this can happen on the restore case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure!
can you add a test case to test it more directly? |
33c0b02
to
764362a
Compare
Added a test case that fails 100% without the patch. Also this fixes the failure |
Signed-off-by: Gyuho Lee <[email protected]>
764362a
to
ebfd4c5
Compare
…ation This also happens without gRPC proxy. Fix panic when gRPC proxy leader watcher is restored: ``` go test -v -tags cluster_proxy -cpu 4 -race -run TestV3WatchRestoreSnapshotUnsync === RUN TestV3WatchRestoreSnapshotUnsync panic: watcher minimum revision 9223372036854775805 should not exceed current revision 16 goroutine 156 [running]: github.com/coreos/etcd/mvcc.(*watcherGroup).chooseAll(0xc4202b8720, 0x10, 0xffffffffffffffff, 0x1) /home/gyuho/go/src/github.com/coreos/etcd/mvcc/watcher_group.go:242 +0x3b5 github.com/coreos/etcd/mvcc.(*watcherGroup).choose(0xc4202b8720, 0x200, 0x10, 0xffffffffffffffff, 0xc420253378, 0xc420253378) /home/gyuho/go/src/github.com/coreos/etcd/mvcc/watcher_group.go:225 +0x289 github.com/coreos/etcd/mvcc.(*watchableStore).syncWatchers(0xc4202b86e0, 0x0) /home/gyuho/go/src/github.com/coreos/etcd/mvcc/watchable_store.go:340 +0x237 github.com/coreos/etcd/mvcc.(*watchableStore).syncWatchersLoop(0xc4202b86e0) /home/gyuho/go/src/github.com/coreos/etcd/mvcc/watchable_store.go:214 +0x280 created by github.com/coreos/etcd/mvcc.newWatchableStore /home/gyuho/go/src/github.com/coreos/etcd/mvcc/watchable_store.go:90 +0x477 exit status 2 FAIL github.com/coreos/etcd/integration 2.551s ``` gRPC proxy spawns a watcher with a key "proxy-namespace__lostleader" and watch revision "int64(math.MaxInt64 - 2)" to detect leader loss. But, when the partitioned node restores, this watcher triggers panic with "watcher minimum revision ... should not exceed current ...". This check was added a long time ago, by my PR, when there was no gRPC proxy: etcd-io#4043 (comment) > we can remove this checking actually. it is impossible for a unsynced watching to have a future rev. or we should just panic here. However, now it's possible that a unsynced watcher has a future revision, when it was moved from a synced watcher group through restore operation. This PR adds "restore" flag to indicate that a watcher was moved from the synced watcher group with restore operation. Otherwise, the watcher with future revision in an unsynced watcher group would still panic. Example logs with future revision watcher from restore operation: ``` {"level":"info","ts":1527196358.9057755,"caller":"mvcc/watcher_group.go:261","msg":"choosing future revision watcher from restore operation","watch-key":"proxy-namespace__lostleader","watch-revision":9223372036854775805,"current-revision":16} {"level":"info","ts":1527196358.910349,"caller":"mvcc/watcher_group.go:261","msg":"choosing future revision watcher from restore operation","watch-key":"proxy-namespace__lostleader","watch-revision":9223372036854775805,"current-revision":16} ``` Signed-off-by: Gyuho Lee <[email protected]>
ebfd4c5
to
0398ec7
Compare
lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Thanks!
It can happen without gRPC proxy as well.
Fix panic when gRPC proxy leader watcher is restored:
gRPC proxy spawns a watcher with a key "proxy-namespace__lostleader"
and watch revision "int64(math.MaxInt64 - 2)" to detect leader loss.
But, when the partitioned node restores, this watcher triggers
panic with "watcher minimum revision ... should not exceed current ...".
This check was added a long time ago, by my PR, when there was no gRPC proxy!
#4043 (comment)
However, now it's possible that a unsynced watcher has a future
revision, when it was moved from a synced watcher group through
restore operation.
This PR adds "restore" flag to indicate that a watcher was moved
from the synced watcher group with restore operation. Otherwise,
the watcher with future revision in an unsynced watcher group
would still panic.
Example logs with future revision watcher from restore operation:
This is reproducible via
TestV3WatchRestoreSnapshotUnsync
#9745 for #9281./cc @xiang90 @heyitsanthony