Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TiDB panics when calling RegionStore.accessStore #20181

Closed
aylei opened this issue Sep 23, 2020 · 5 comments · Fixed by #20204
Closed

TiDB panics when calling RegionStore.accessStore #20181

aylei opened this issue Sep 23, 2020 · 5 comments · Fixed by #20204
Assignees
Labels
component/tikv severity/critical type/bug The issue is confirmed as a bug.

Comments

@aylei
Copy link

aylei commented Sep 23, 2020

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

I'm not able to reproduce this issue.

2. What did you expect to see? (Required)

TiDB runs normally

3. What did you see instead (Required)

TiDB panics when calling RegionStore.accessStore

[2020/09/23 09:15:43.977 +00:00] [INFO] [region_cache.go:829] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=47144] [currIdx=0] [leaderStoreID=66]
[2020/09/23 09:15:43.977 +00:00] [INFO] [region_cache.go:609] ["switch region peer to next due to send request fail"] [current="region ID: 50568, meta: id:50568 start_key:\"t\\200\\000\\000\\000\\000\\000\\0001_i\\200\\000\\000\\000\\000\\000\\000\\001\\00186864293\\377-f4f6-59\\377f6-9d07-\\37731903362\\3778caf\\000\\000\\000\\000\\373\" end_key:\"t\\200\\000\\000\\000\\000\\000\\0001_i\\200\\000\\000\\000\\000\\000\\000\\001\\001869cf186\\377-162e-5c\\377bd-a0f0-\\3777549f6c7\\3779f13\\000\\000\\000\\000\\373\" region_epoch:<conf_ver:5 version:36 > peers:<id:50570 store_id:4 > peers:<id:50571 store_id:66 > , peer: id:50571 store_id:66 , addr: db-tikv-2.db-tikv-peer.tidb1306611994220040192.svc:20160, idx: 1, reqStoreType: TiKvOnly, runStoreType: tikv"] [needReload=false] [error="wait recvLoop: context deadline exceeded"]
[2020/09/23 09:15:43.978 +00:00] [INFO] [region_cache.go:414] ["invalidate current region, because others failed on same store"] [region=5600] [store=db-tikv-2.db-tikv-peer.tidb1306611994220040192.svc:20160]
[2020/09/23 09:15:43.980 +00:00] [INFO] [region_cache.go:829] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=11412] [currIdx=0] [leaderStoreID=66]
[2020/09/23 09:15:43.981 +00:00] [INFO] [region_cache.go:1630] ["[liveness] request kv status fail"] [store=0.0.0.0:20180] [error="Get https://0.0.0.0:20180/status: dial tcp 0.0.0.0:20180: connect: connection refused"]
panic: runtime error: index out of range [2] with length 2

goroutine 156282637 [running]:
github.com/pingcap/tidb/store/tikv.(*RegionStore).accessStore(...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/region_cache.go:117
github.com/pingcap/tidb/store/tikv.(*RegionCache).OnSendFail(0xc0001ba2a0, 0xc18dc16b80, 0xc13ca4ed70, 0xc17619a000, 0x36cf480, 0xc1f42b9e40)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/region_cache.go:578 +0x93a
github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).onSendFail(0xc0b05ae050, 0xc18dc16b80, 0xc13ca4ed70, 0x36cf480, 0xc1f42b9e40, 0xc13c675220, 0x4a817c800)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:455 +0x196
github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).sendReqToRegion(0xc0b05ae050, 0xc18dc16b80, 0xc13ca4ed70, 0xc13c675220, 0x4a817c800, 0x0, 0x0, 0x36cf480, 0xc1f42b9e40)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:404 +0x261
github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReqCtx(0xc0b05ae050, 0xc18dc16b80, 0xc13c675220, 0x1878, 0x5, 0x23, 0x4a817c800, 0x15c3800, 0xc13c904f70, 0xc1a67ca410, ...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:293 +0x2d0
github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReq(...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:202
github.com/pingcap/tidb/store/tikv.actionCommit.handleSingleBatch(0x1, 0xc1afd88420, 0xc18dc16b80, 0x1878, 0x5, 0x23, 0xc1b1184254, 0x1, 0x353, 0xc1b1262df8, ...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:1076 +0x285
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnBatches(0xc1afd88420, 0xc18dc16b80, 0x37180a0, 0xc13c761e88, 0xc1bce2d580, 0x1, 0x1, 0xc1b1184254, 0x1)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:628 +0x110
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnGroupMutations(0xc1afd88420, 0xc18dc16b80, 0x37180a0, 0xc13c761e88, 0xc1bce2d500, 0x1, 0x1, 0x353, 0xc1b128adf8)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:609 +0x661
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnMutations(0xc1afd88420, 0xc18dc16b80, 0x37180a0, 0xc13c761e88, 0xc1b1184254, 0x1, 0x353, 0xc1b1262df8, 0x1, 0x353, ...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:478 +0x1c2
github.com/pingcap/tidb/store/tikv.actionCommit.handleSingleBatch(0x1, 0xc1afd88420, 0xc18dc16b80, 0x1878, 0x5, 0x23, 0xc1b1184254, 0x1, 0x353, 0xc1b1262df8, ...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:1100 +0x19aa
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnBatches(0xc1afd88420, 0xc18dc16b80, 0x37180a0, 0xc16d12d2a8, 0xc1346efa00, 0x1, 0x1, 0xc1b1184254, 0x1)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:628 +0x110
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnGroupMutations(0xc1afd88420, 0xc18dc16b80, 0x37180a0, 0xc16d12d2a8, 0xc1346ef980, 0x1, 0x1, 0x353, 0xc1b128adf8)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:609 +0x661
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnMutations(0xc1afd88420, 0xc18dc16b80, 0x37180a0, 0xc16d12d2a8, 0xc1b1184254, 0x1, 0x353, 0xc1b1262df8, 0x1, 0x353, ...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:478 +0x1c2
github.com/pingcap/tidb/store/tikv.actionCommit.handleSingleBatch(0x1, 0xc1afd88420, 0xc18dc16b80, 0x1878, 0x5, 0x23, 0xc1b1184254, 0x1, 0x353, 0xc1b1262df8, ...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:1100 +0x19aa
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnBatches(0xc1afd88420, 0xc18dc16b80, 0x37180a0, 0xc0de7f6648, 0xc0941a5700, 0x1, 0x1, 0xc1b1184254, 0x1)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:628 +0x110
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnGroupMutations(0xc1afd88420, 0xc18dc16b80, 0x37180a0, 0xc0de7f6648, 0xc0941a5680, 0x1, 0x1, 0x353, 0xc1b128adf8)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:609 +0x661
github.com/pingcap/tidb/store/tikv.(*twoPhaseCommitter).doActionOnMutations(0xc1afd88420, 0xc18dc16b80, 0x37180a0, 0xc0de7f6648, 0xc1b1184254, 0x1, 0x353, 0xc1b1262df8, 0x1, 0x353, ...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:478 +0x1c2
github.com/pingcap/tidb/store/tikv.actionCommit.handleSingleBatch(0x309b200, 0xc1afd88420, 0xc18dc16b80, 0x1878, 0x5, 0x23, 0xc1b1184254, 0x1, 0x353, 0xc1b1262df8, ...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:1100 +0x19aa
github.com/pingcap/tidb/store/tikv.(*batchExecutor).startWorker.func1(0xc2631814c0, 0xc016cb3080, 0x1878, 0x5, 0x23, 0xc1b1184254, 0x1, 0x353, 0xc1b1262df8, 0x1, ...)
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:1547 +0x17c
created by github.com/pingcap/tidb/store/tikv.(*batchExecutor).startWorker
	/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/store/tikv/2pc.go:1529 +0x1b7

4. What is your TiDB version? (Required)

v4.0.6

@aylei aylei added the type/bug The issue is confirmed as a bug. label Sep 23, 2020
@lysu lysu self-assigned this Sep 23, 2020
@lysu
Copy link
Contributor

lysu commented Sep 23, 2020

@aylei does this cluster has TiFlash node or using the follow-read feature?

@aylei
Copy link
Author

aylei commented Sep 23, 2020

Related logs:

tidb.log.tar.gz
pd2.log.tar.gz
pd1.log.tar.gz
pd0.log.tar.gz

@lysu
Copy link
Contributor

lysu commented Sep 24, 2020

after split tidb log like this

[2020/09/23 09:15:43.977 +00:00] [INFO] [region_cache.go:609] ["switch region peer to next due to send request fail"] [current="region ID: 50568, meta: id:50568 start_key:"t\200\000\000\000\000\000\0001_i\200\000\000\000\000\000\000\001\00186864293\377-f4f6-59\377f6-9d07-\37731903362\3778caf\000\000\000\000\373" end_key:"t\200\000\000\000\000\000\0001_i\200\000\000\000\000\000\000\001\001869cf186\377-162e-5c\377bd-a0f0-\3777549f6c7\3779f13\000\000\000\000\373" region_epoch:<conf_ver:5 version:36 > peers:<id:50570 store_id:4 > peers:<id:50571 store_id:66 > , peer: id:50571 store_id:66 , addr: db-tikv-2.db-tikv-peer.tidb1306611994220040192.svc:20160, idx: 1, reqStoreType: TiKvOnly, runStoreType: tikv"] [needReload=false] [error="wait recvLoop: context deadline exceeded"]

grep TiKvOnly tidb.log | awk -F'peers' '{print NF}' | sort -n | uniq -c
  84351 3
  15104 4

we can see many region only have 2 peers (so it has 3 in awk result)

this question is introduced by #17337, after it, TiDB will filter unavailable peers after load region from PD and insert the new region with filtered peers into region cache.

but the DownPeers changed didn't take effect on the region's Epoch, (kv collect down_peers before heartbeat PD, as the peer long time didn't send a heartbeat, https://github.com/tikv/tikv/blob/b7b0105d6e1bf889f35f67419c21dbbcdc041f07/components/raftstore/src/store/peer.rs#L818)

then TiDB has chance to fetch different result with same regionID(id+confver+ver) from PD as different time https://github.com/pingcap/pd/blob/bcfa77a7a593d519575f6e9be8cf3b6793c65a40/client/client.go#L35.

so peers that used to build RpcCtx can be changed before OnSendFail(e.g. before sending request region1 has 3 alive peers, after send fail region1 has other 1 down peer) without region epoch change, so OnSendFail handle will get a new region with different length that not match to the RpcCtx's world.

and this cause panic

@sre-bot
Copy link
Contributor

sre-bot commented Sep 27, 2020

Integrity check:
component RCA symptom trigger_condition affect_version fix_version fields are empty
@lysu
Please comment /info to get template

@ti-srebot
Copy link
Contributor

ti-srebot commented Sep 27, 2020

Please edit this comment to complete the following information

Bug

1. Root Cause Analysis (RCA)

this question is introduced by #17337, after it, TiDB will filter unavailable peers after load region from PD and insert the new region with filtered peers into region cache.

but the DownPeers changed didn't take effect on the region's Epoch, (kv collect down_peers before heartbeat PD, as the peer long time didn't send a heartbeat, https://github.com/tikv/tikv/blob/b7b0105d6e1bf889f35f67419c21dbbcdc041f07/components/raftstore/src/store/peer.rs#L818)

then TiDB has chance to fetch different result with same regionID(id+confver+ver) from PD as different time https://github.com/pingcap/pd/blob/bcfa77a7a593d519575f6e9be8cf3b6793c65a40/client/client.go#L35.

so peers that used to build RpcCtx can be changed before OnSendFail(e.g. before sending request region1 has 3 alive peers, after send fail region1 has other 1 down peer) without region epoch change, so OnSendFail handle will get a new region with different length that not match to the RpcCtx's world.

and this cause panic

2. Symptom

user will see panic with panic: runtime error: index out of range + RegionStore).accessStore in tidb-stderr.log

3. All Trigger Conditions

after kv node became down_peers, before pd remove peers remove from peer list, TiDB meet send fail error to tikv as last item in peer list, and another SQL trigger load new region from pd between send request and receive send fail result

4. Workaround (optional)

reboot

5. Affected versions

[v4.0.0:v4.0.7]

6. Fixed versions

v4.0.8

@pingcap pingcap deleted a comment from ti-srebot Sep 27, 2020
@pingcap pingcap deleted a comment from ti-srebot Sep 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/tikv severity/critical type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants