v3.3.0-rc.0 snapshot restore panic on raft #9096
Comments
@lyddragon Did you upgrade from a previous version, or did this happen on a fresh v3.3.0-rc.0 cluster? I haven't been able to reproduce it yet.
No upgrade, but I used an old snapshot to restore.
The snapshot is from a v3.3.0-rc.0 server, and you downloaded it using v3.3.0-rc.0 etcdctl, right?
Finally reproduced. I will try to fix shortly. @lyddragon basically, we are doing
Just curious, what is your use case for this? If you want to destroy member A completely, I would remove it and add it back to the cluster.
The add failed; see my other issue:
"v3.3.0-rc.0 endpoint health --cluster with auth requires password input twice #9094"
@xiang90 I double checked our code and don't think it's a bug. Reproducible e2e test case:

// TestCtlV3SnapshotRestoreMultiCluster ensures that restoring one member from snapshot
// does not panic when rejoining the cluster (fix https://github.com/coreos/etcd/issues/9096).
func TestCtlV3SnapshotRestoreMultiCluster(t *testing.T) {
	testCtl(t, snapshotRestoreMultiClusterTest, withCfg(configNoTLS), withQuorum())
}

func snapshotRestoreMultiClusterTest(cx ctlCtx) {
	if err := ctlV3Put(cx, "foo", "bar", ""); err != nil {
		cx.t.Fatalf("ctlV3Put error (%v)", err)
	}

	fpath := filepath.Join(os.TempDir(), "test.snapshot")
	defer os.RemoveAll(fpath)

	if err := ctlV3SnapshotSave(cx, fpath); err != nil {
		cx.t.Fatalf("snapshotTest ctlV3SnapshotSave error (%v)", err)
	}

	// shut down first member, restore, restart from snapshot
	if err := cx.epc.procs[0].Close(); err != nil {
		cx.t.Fatalf("failed to close (%v)", err)
	}

	ep, ok := cx.epc.procs[0].(*etcdServerProcess)
	if !ok {
		cx.t.Fatalf("expected *etcdServerProcess")
	}

	newDataDir := filepath.Join(os.TempDir(), "snap.etcd")
	os.RemoveAll(newDataDir)
	defer os.RemoveAll(newDataDir)
	ep.cfg.dataDirPath = newDataDir

	if err := spawnWithExpect(append(
		cx.PrefixArgs(),
		"snapshot", "restore", fpath,
		"--name", ep.cfg.name,
		"--initial-cluster", ep.cfg.initialCluster,
		"--initial-cluster-token", ep.cfg.initialToken,
		"--initial-advertise-peer-urls", ep.cfg.purl.String(),
		"--data-dir", newDataDir),
		"membership: added member"); err != nil {
		cx.t.Fatalf("failed to restore (%v)", err)
	}

	for i := range ep.cfg.args {
		if ep.cfg.args[i] == "--data-dir" {
			ep.cfg.args[i+1] = newDataDir
			break
		}
	}
	ep.cfg.args = append(ep.cfg.args, "--initial-cluster-state", "existing")

	var err error
	ep.proc, err = spawnCmd(append([]string{ep.cfg.execPath}, ep.cfg.args...))
	if err != nil {
		cx.t.Fatalf("failed to spawn etcd (%v)", err)
	}

	// will error "read /dev/ptmx: input/output error" if process panicked
	if err = ep.waitReady(); err != nil {
		cx.t.Fatalf("failed to start from snapshot restore (%v)", err)
	}
}

../bin/etcd-4461: 2018-01-07 09:19:50.258428 I | raft: ca50e9357181d758 became follower at term 2
../bin/etcd-4461: 2018-01-07 09:19:50.258446 C | raft: tocommit(9) is out of range [lastIndex(3)]. Was the raft log corrupted, truncated, or lost?
../bin/etcd-4402: 2018-01-07 09:19:50.258503 I | rafthttp: established a TCP streaming connection with peer ca50e9357181d758 (stream MsgApp v2 reader)
../bin/etcd-4461: panic: tocommit(9) is out of range [lastIndex(3)]. Was the raft log corrupted, truncated, or lost?
../bin/etcd-4461:
../bin/etcd-4461: goroutine 116 [running]:
../bin/etcd-4461: github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc420161fa0, 0x102290f, 0x5d, 0xc421070140, 0x2, 0x2)
../bin/etcd-4461: /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x16d
../bin/etcd-4461: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raftLog).commitTo(0xc4200e00e0, 0x9)
../bin/etcd-4461: /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/log.go:191 +0x15c
../bin/etcd-4461: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raft).handleHeartbeat(0xc4201fe200, 0x8, 0xca50e9357181d758, 0x5ac8aa22f1eb4c8f, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
../bin/etcd-4461: /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:1195 +0x54
../bin/etcd-4461: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.stepFollower(0xc4201fe200, 0x8, 0xca50e9357181d758, 0x5ac8aa22f1eb4c8f, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
../bin/etcd-4461: /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:1141 +0x439
../bin/etcd-4461: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raft).Step(0xc4201fe200, 0x8, 0xca50e9357181d758, 0x5ac8aa22f1eb4c8f, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
../bin/etcd-4461: /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:869 +0x1465
../bin/etcd-4461: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*node).run(0xc4201e2540, 0xc4201fe200)
../bin/etcd-4461: /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/node.go:323 +0x113e
../bin/etcd-4461: created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.RestartNode
../bin/etcd-4461: /home/gyuho/go/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/node.go:223 +0x321

3.2, 3.3, and master all panic, and this is expected. The restored member joins the cluster with commit index 3 (equal to the number of nodes in the cluster) because the snapshot file carries no information about the revision or any other raft fields from the previous cluster. So if the other peers have already advanced their indexes and the newly joined member does not become the leader, it is told to commit an index beyond the end of its own log (tocommit(9) against lastIndex(3) in the log above) and panics. I think we just need to document this clearly: snapshot restore only supports a fresh cluster, not adding a new member to an existing cluster.
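To make the failing check concrete, here is a minimal toy model of it. This is a sketch, not the actual etcd raft source: `toyLog` and its fields are hypothetical names, and it returns an error where the real `commitTo` in the stack trace panics. The indexes 3 and 9 are taken from the log output above.

```go
package main

import "fmt"

// toyLog is a hypothetical, stripped-down stand-in for a follower's raft log.
type toyLog struct {
	lastIndex uint64 // index of the last entry the follower actually holds
	committed uint64 // highest index the follower knows to be committed
}

// commitTo advances the commit index but refuses to move past the end of the log.
// The real implementation panics at this point; the toy returns an error instead.
func (l *toyLog) commitTo(tocommit uint64) error {
	if tocommit <= l.committed {
		return nil // never decrease the commit index
	}
	if l.lastIndex < tocommit {
		return fmt.Errorf("tocommit(%d) is out of range [lastIndex(%d)]", tocommit, l.lastIndex)
	}
	l.committed = tocommit
	return nil
}

func main() {
	// One member restored alone: its log ends at index 3, while the
	// still-running cluster tells it to commit index 9.
	restored := &toyLog{lastIndex: 3, committed: 3}
	fmt.Println(restored.commitTo(9))

	// Every member restored from the same snapshot: nobody is ahead,
	// so the requested commit index stays within the log.
	fresh := &toyLog{lastIndex: 3, committed: 3}
	fmt.Println(fresh.commitTo(3))
}
```

The first call reports the out-of-range error and the second succeeds, which matches the conclusion in this thread: a member restored on its own cannot follow peers whose indexes have already moved on.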
@lyddragon Closing because this is expected.
Snapshot restore creates a fresh cluster, so a restored member cannot join the existing cluster unless all other members are restored from the same snapshot file. See https://github.com/coreos/etcd/blob/master/Documentation/op-guide/recovery.md#restoring-a-cluster