etcd panic #9704

Closed
ysicing opened this issue May 5, 2018 · 10 comments

@ysicing

ysicing commented May 5, 2018

ETCD Version:

etcd Version: 3.2.13
Git SHA: 95a726a
Go Version: go1.8.5
Go OS/Arch: linux/amd64

Hardware configuration:

cpu: 4
memory: 16GB
os: ubuntu 14.04

Here is the log message of etcd:

2018-05-05 18:35:52.832972 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2018-05-05 18:35:52.833048 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2018-05-05 18:35:52.833154 I | embed: listening for peers on http://10.10.10.12:2380
2018-05-05 18:35:52.833261 I | embed: listening for client requests on 10.10.10.12:2379
2018-05-05 18:35:52.833339 I | embed: listening for client requests on 10.10.10.12:4001
2018-05-05 18:35:52.833418 I | embed: listening for client requests on 127.0.0.1:2379
2018-05-05 18:35:52.833501 I | embed: listening for client requests on 127.0.0.1:4001
port is on listening
2018-05-05 18:36:00.050125 I | etcdserver: recovered store from snapshot at index 343670601
2018-05-05 18:36:00.056959 I | mvcc: restore compact to 23756961
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Bucket).pageNode(0xc43bbf9cc0, 0xcdea8b35b591b3f2, 0x7f32ff2cb000, 0x0)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/bucket.go:724 +0x1bf
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Cursor).first(0xc420201a30)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/cursor.go:187 +0x93
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Cursor).next(0xc420201a30, 0xf73d7a, 0x9, 0x0, 0x0, 0x0, 0x3d50, 0xc400000000)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/cursor.go:240 +0x97
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Cursor).Next(0xc420201a30, 0x11, 0x11, 0xc42f210040, 0x11, 0x12, 0xffffffffffffffff)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/cursor.go:75 +0x7d
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.unsafeRange(0xc4201401c0, 0x14a703c, 0x3, 0x3, 0xc42f210020, 0x11, 0x12, 0xc42f210040, 0x11, 0x12, ...)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/batch_tx.go:106 +0x34f
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.(*batchTx).UnsafeRange(0xc42018f620, 0x14a703c, 0x3, 0x3, 0xc42f210020, 0x11, 0x12, 0xc42f210040, 0x11, 0x12, ...)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/batch_tx.go:84 +0xc6
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.(*store).restore(0xc4340e7380, 0x14a7080, 0x4)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore.go:278 +0x6c0
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.NewStore(0x14bc760, 0xc420358c00, 0x14bddc0, 0x1525990, 0x14a9c80, 0xc42f924e50, 0xc3cd)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/kvstore.go:129 +0x3d8
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.newWatchableStore(0x14bc760, 0xc420358c00, 0x14bddc0, 0x1525990, 0x14a9c80, 0xc42f924e50, 0x1525990)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/watchable_store.go:75 +0x81
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc.New(0x14bc760, 0xc420358c00, 0x14bddc0, 0x1525990, 0x14a9c80, 0xc42f924e50, 0x1, 0xc420202090)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/watchable_store.go:70 +0x5d
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.recoverSnapshotBackend(0xc420266000, 0x14bc760, 0xc420358c00, 0xc42035e000, 0xb107f48, 0xb108000, 0xc4201d83c0, 0x5, 0x8, 0x0, ...)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/backend.go:74 +0xe0
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0xc420266000, 0x0, 0x0, 0x0)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:378 +0x2d96
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed.StartEtcd(0xc420186e00, 0xc420264000, 0x0, 0x0)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed/etcd.go:157 +0x782
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcd(0xc420186e00, 0x6, 0xf70713, 0x6, 0x1)
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:186 +0x58
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:103 +0x15ba
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/main.go:39 +0x61
main.main()
	/home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/etcd/main.go:28 +0x20

I removed the node and rejoined it to the cluster, and that resolved the problem. But why did it happen? Can you tell me how to avoid it?
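
In the first bbolt frame above, the second argument to `pageNode` appears to be the page id, `0xcdea8b35b591b3f2`, which is far larger than any plausible page number. Below is a minimal Go sketch of why such an id would end in `index out of range`, assuming bbolt locates a page by multiplying the id by the page size and indexing into the memory-mapped file. The 4096-byte page size and the 16 GiB mapping are assumptions for illustration, and `pageInBounds` is a hypothetical helper, not part of bbolt.

```go
package main

import "fmt"

// pageInBounds reports whether page `id` starts inside a mapping of
// mappedSize bytes. Hypothetical helper: it mirrors the bounds that an
// unchecked data[id*pageSize] lookup relies on.
func pageInBounds(id, pageSize, mappedSize uint64) bool {
	if pageSize == 0 {
		return false
	}
	// Compare against the page count instead of multiplying,
	// so a huge id cannot overflow uint64.
	return id < mappedSize/pageSize
}

func main() {
	const pageSize = 4096       // assumed page size
	const mappedSize = 16 << 30 // assumed mmap size (16 GiB)

	// 42 is an arbitrary small id; the second id is taken from the trace.
	for _, id := range []uint64{42, 0xcdea8b35b591b3f2} {
		fmt.Printf("page id %#x in bounds: %v\n",
			id, pageInBounds(id, pageSize, mappedSize))
	}
}
```

An id like this suggests that the cursor followed a corrupted or misread page reference in the db file, not a valid tree pointer.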

@hexfusion
Contributor

Hi @ysicing, thank you for the report. Is this issue reproducible? If so, can you please provide basic steps? With these steps we can isolate the issues around bolt during restore.

@ysicing
Author

ysicing commented May 5, 2018

@hexfusion I can't reproduce this problem in the production environment. Yesterday I was forced to restart this machine because the load was too high. Afterwards, checking the cluster and the current node with etcdctl showed everything as normal. Judging from the monitoring data and logs, the panic happened very suddenly.

@hexfusion
Contributor

hexfusion commented May 5, 2018

OK, just curious: is the etcdctl version the same as the etcd binary version?

ref: #8632

@ysicing
Author

ysicing commented May 5, 2018

@hexfusion

Of course

root@alish-etcd02:~# etcdctl --version
etcdctl version: 3.2.13
API version: 2
root@alish-etcd02:~# etcd --version
etcd Version: 3.2.13
Git SHA: 95a726a
Go Version: go1.8.5
Go OS/Arch: linux/amd64

@disksing
Contributor

disksing commented May 8, 2018

Hi, I ran into a similar problem yesterday. We are using embedded etcd v3.2.18. After a restart, the server keeps panicking:

github.com/pingcap/pd/vendor/github.com/coreos/bbolt.(*DB).page(...)
        /home/jenkins/workspace/build_pd_2.0/go/src/github.com/pingcap/pd/vendor/github.com/coreos/bbolt/db.go:793
github.com/pingcap/pd/vendor/github.com/coreos/bbolt.Open(0xc420210c00, 0x28, 0x180, 0xc420479e10, 0x4e734e, 0xc420210c00, 0x28)
        /home/jenkins/workspace/build_pd_2.0/go/src/github.com/pingcap/pd/vendor/github.com/coreos/bbolt/db.go:237 +0x655
github.com/pingcap/pd/vendor/github.com/coreos/etcd/mvcc/backend.newBackend(0xc420210c00, 0x28, 0x5f5e100, 0x2710, 0x280000000, 0x2)
        /home/jenkins/workspace/build_pd_2.0/go/src/github.com/pingcap/pd/vendor/github.com/coreos/etcd/mvcc/backend/backend.go:129 +0x9d
github.com/pingcap/pd/vendor/github.com/coreos/etcd/mvcc/backend.New(0xc420210c00, 0x28, 0x5f5e100, 0x2710, 0x280000000, 0xc42003c570, 0xc42003c500)
        /home/jenkins/workspace/build_pd_2.0/go/src/github.com/pingcap/pd/vendor/github.com/coreos/etcd/mvcc/backend/backend.go:113 +0x48
github.com/pingcap/pd/vendor/github.com/coreos/etcd/etcdserver.newBackend(0xc420484000, 0x4334a8, 0xfc6668)
        /home/jenkins/workspace/build_pd_2.0/go/src/github.com/pingcap/pd/vendor/github.com/coreos/etcd/etcdserver/backend.go:36 +0x17d
github.com/pingcap/pd/vendor/github.com/coreos/etcd/etcdserver.openBackend.func1(0xc4200b2780, 0xc420484000)
        /home/jenkins/workspace/build_pd_2.0/go/src/github.com/pingcap/pd/vendor/github.com/coreos/etcd/etcdserver/backend.go:56 +0x2b
created by github.com/pingcap/pd/vendor/github.com/coreos/etcd/etcdserver.openBackend
        /home/jenkins/workspace/build_pd_2.0/go/src/github.com/pingcap/pd/vendor/github.com/coreos/etcd/etcdserver/backend.go:55 +0x9f
panic: runtime error: index out of range

After adding some logs, I found that the freelist field of meta looks abnormal:

Open: meta=&bolt.meta{magic:0xed0cdaed, version:0x2, pageSize:0x1000, flags:0x0, root:bolt.bucket{root:0xd, sequence:0x0}, freelist:0xffffffffffffffff, pgid:0x7a, txid:0x1c5f1, checksum:0x9e6a9150b5d97d27}

Since the meta passes checksum verification, I guess it could be a bug in boltdb.
The snap/db file: db.tar.gz
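
The `freelist:0xffffffffffffffff` value is consistent with a sentinel meaning that no freelist page was written. A bbolt version unaware of such a sentinel would treat it as a real page id. That reading is an assumption at this point in the thread. Below is a minimal Go sketch of decoding the logged fields from a raw meta page and checking for that value. It assumes a 16-byte page header followed by the meta fields in the order printed above, stored little-endian. `decodeMeta`, `hasFreelistPage`, and `sampleMetaPage` are hypothetical helpers, not bbolt API.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	pageHeaderSize = 16                 // assumed: id(8) + flags(2) + count(2) + overflow(4)
	metaSize       = 64                 // assumed: size of the meta fields shown in the log
	boltMagic      = 0xED0CDAED         // magic value from the logged meta
	noFreelist     = 0xFFFFFFFFFFFFFFFF // freelist value from the logged meta
)

// metaInfo holds the subset of meta fields this sketch inspects.
type metaInfo struct {
	Magic, Version, PageSize   uint32
	Root, Freelist, Pgid, Txid uint64
}

// decodeMeta parses a raw meta page under the layout assumed above.
func decodeMeta(page []byte) (metaInfo, error) {
	if len(page) < pageHeaderSize+metaSize {
		return metaInfo{}, errors.New("page too short for a meta record")
	}
	m, le := page[pageHeaderSize:], binary.LittleEndian
	info := metaInfo{
		Magic:    le.Uint32(m[0:4]),
		Version:  le.Uint32(m[4:8]),
		PageSize: le.Uint32(m[8:12]),
		Root:     le.Uint64(m[16:24]),
		Freelist: le.Uint64(m[32:40]),
		Pgid:     le.Uint64(m[40:48]),
		Txid:     le.Uint64(m[48:56]),
	}
	if info.Magic != boltMagic {
		return info, errors.New("magic mismatch: not a bolt meta page")
	}
	return info, nil
}

// hasFreelistPage reports whether the freelist field can be a real page id.
func hasFreelistPage(m metaInfo) bool {
	return m.Freelist != noFreelist && m.Freelist < m.Pgid
}

// sampleMetaPage rebuilds the logged meta values in memory for the demo.
func sampleMetaPage() []byte {
	page := make([]byte, 4096)
	m, le := page[pageHeaderSize:], binary.LittleEndian
	le.PutUint32(m[0:4], 0xed0cdaed)
	le.PutUint32(m[4:8], 2)
	le.PutUint32(m[8:12], 0x1000)
	le.PutUint64(m[16:24], 0xd)
	le.PutUint64(m[32:40], 0xffffffffffffffff)
	le.PutUint64(m[40:48], 0x7a)
	le.PutUint64(m[48:56], 0x1c5f1)
	return page
}

func main() {
	info, err := decodeMeta(sampleMetaPage())
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("freelist=%#x pgid=%#x usable freelist page: %v\n",
		info.Freelist, info.Pgid, hasFreelistPage(info))
}
```

A reader that performs a check like `hasFreelistPage` before dereferencing the freelist page id would report an error here instead of indexing past the end of the mapped file.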

@gyuho
Contributor

gyuho commented May 8, 2018

@disksing Thanks for the db file! We will look into this!

@ysicing If you can, could you also share your data file? Or email us at [email protected]

@gyuho
Contributor

gyuho commented May 9, 2018

@disksing Hmm, I cannot reproduce it with the attached db file. Can you share more detailed steps to reproduce?

@disksing
Contributor

Oops, I found a clue.

Because we didn't pay attention to dependency versions when we switched to dep, we are actually using etcd-v3.2.18 with bbolt-1.3.0. As I can see from here, etcd-v3.2.18 is using bbolt-v1.3.1-coreos.5.

As for the db file, it can be opened by bbolt-v1.3.1-coreos.5 but not by bbolt-v1.3.0. That should be why you were not able to reproduce it. Maybe the db file was somehow created by bbolt-v1.3.1-coreos.5. I will keep tracing the problem and share any new clues with you. Thanks for your help!

@disksing
Contributor

I have confirmed that a db file created by bbolt-v1.3.1 cannot be opened by bbolt-v1.3.0, so this is a compatibility issue.

gyuho closed this as completed May 11, 2018

@you10906

you10906 commented Mar 21, 2019

Hi @ysicing, I encountered the same problem. Can you tell me how you solved it?

2019-03-21 07:36:21.468453 I | etcdmain: etcd Version: 3.2.24
2019-03-21 07:36:21.468538 I | etcdmain: Git SHA: 420a45226
2019-03-21 07:36:21.468544 I | etcdmain: Go Version: go1.8.7
2019-03-21 07:36:21.468549 I | etcdmain: Go OS/Arch: linux/amd64
2019-03-21 07:36:21.468554 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2019-03-21 07:36:21.468592 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2019-03-21 07:36:21.468613 I | embed: peerTLS: cert = /etc/etcd/ssl/etcd.pem, key = /etc/etcd/ssl/etcd-key.pem, ca = , trusted-ca = /etc/kubernetes/ssl/ca.pem, client-cert-auth = false
2019-03-21 07:36:21.469379 I | embed: listening for peers on https://192.168.20.140:2380
2019-03-21 07:36:21.469406 W | embed: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
2019-03-21 07:36:21.469471 I | embed: listening for client requests on 127.0.0.1:2379
2019-03-21 07:36:21.469554 I | embed: listening for client requests on 192.168.20.140:2379
2019-03-21 07:36:21.522873 I | etcdserver: recovered store from snapshot at index 15200152
2019-03-21 07:36:21.536599 I | mvcc: restore compact to 12313654
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Bucket).pageNode(0xc4204a1100, 0x237c000002d0, 0x7f2d2cb28000, 0x0)
	/tmp/etcd-release-3.2.24/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/bucket.go:724 +0x1bf
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Cursor).first(0xc4201dd7c0)
	/tmp/etcd-release-3.2.24/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/cursor.go:187 +0x93
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Cursor).next(0xc4201dd7c0, 0xf7d101, 0x9, 0x0, 0x0, 0x0, 0x86, 0xc400000000)
	/tmp/etcd-release-3.2.24/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/cursor.go:240 +0x97
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Cursor).Next(0xc4201dd7c0, 0x11, 0x11, 0xc4201a4f60, 0x11, 0x12, 0xffffffffffffffff)
	/tmp/etcd-release-3.2.24/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/cursor.go:75 +0x7d
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.unsafeRange(0xc420212000, 0x14b803c, 0x3, 0x3, 0xc4201a4f40, 0x11, 0x12, 0xc4201a4f60, 0x11, 0x12, ...)
	/tmp/etcd-release-3.2.24/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/batch_tx.go:106 +0x34f
