TiDB panics when calling RegionStore.accessStore #20181
Comments
@aylei does this cluster have a TiFlash node, or does it use the follower read feature?
Related logs: tidb.log.tar.gz
After splitting the TiDB log like this,
we can see that many regions have only 2 peers (so the awk result shows 3). This problem was introduced by #17337: since that change, TiDB filters out unavailable peers after loading a region from PD and inserts the new region, with the filtered peers, into the region cache. However, TiDB can then fetch different results from PD for the same region version (id+confver+ver) at different times: https://github.com/pingcap/pd/blob/bcfa77a7a593d519575f6e9be8cf3b6793c65a40/client/client.go#L35. So the peers that were used to build the RpcCtx can change, and this causes the panic.
Integrity check:
Please edit this comment to complete the following information
Bug
1. Root Cause Analysis (RCA)
This problem was introduced by #17337. Since that change, TiDB filters out unavailable peers after loading a region from PD and inserts the new region, with the filtered peers, into the region cache. However, a change in DownPeers does not affect the region's epoch. (TiKV collects down_peers, that is, peers that have not sent a heartbeat for a long time, before it sends its heartbeat to PD: https://github.com/tikv/tikv/blob/b7b0105d6e1bf889f35f67419c21dbbcdc041f07/components/raftstore/src/store/peer.rs#L818.) TiDB can therefore fetch different results from PD for the same region version (id+confver+ver) at different times: https://github.com/pingcap/pd/blob/bcfa77a7a593d519575f6e9be8cf3b6793c65a40/client/client.go#L35.
As a result, the peers used to build the RpcCtx can change before OnSendFail runs, without any region epoch change. For example, region1 has 3 alive peers before the request is sent, and one of them has become a down peer by the time the send fails. The OnSendFail handler then gets a new region whose peer list has a different length and no longer matches the RpcCtx's view, and this causes the panic.
2. Symptom
The user sees TiDB panic.
3. All Trigger Conditions
After a TiKV node's peers are reported as down_peers and before PD removes them from the peer list, TiDB hits a send-fail error on a request to the TiKV store that is the last item in the peer list, while another SQL statement triggers loading the new region from PD between sending that request and receiving the send-fail result.
4. Workaround (optional)
Reboot.
5. Affected versions
[v4.0.0:v4.0.7]
6. Fixed versions
v4.0.8
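To make the race in the RCA concrete, here is a minimal Go sketch of a stale store index meeting a shrunken peer list. All type and function names (`regionEpoch`, `cachedRegion`, `rpcCtx`, `storeForFailure`) are hypothetical stand-ins invented for illustration, not TiDB's real region cache types, and the bounds check shown is only one possible guard, not necessarily the fix that shipped in v4.0.8.

```go
package main

import "fmt"

// regionEpoch is a hypothetical stand-in for the (confVer, version) pair.
type regionEpoch struct{ confVer, version uint64 }

// cachedRegion is a hypothetical region-cache entry. stores holds only the
// peers that survived the down-peer filtering described in the RCA.
type cachedRegion struct {
	id     uint64
	epoch  regionEpoch
	stores []string
}

// rpcCtx is a hypothetical snapshot taken before the request is sent.
// storeIdx indexes into the stores slice as it looked at that moment.
type rpcCtx struct {
	regionID uint64
	epoch    regionEpoch
	storeIdx int
}

// storeForFailure looks up the store that a failed request targeted.
// An equal epoch does not guarantee an equal peer list, so the index is
// bounds-checked instead of being used directly.
func storeForFailure(cur cachedRegion, ctx rpcCtx) (string, bool) {
	if cur.id != ctx.regionID || cur.epoch != ctx.epoch {
		return "", false // the region really changed; the context is stale
	}
	if ctx.storeIdx < 0 || ctx.storeIdx >= len(cur.stores) {
		return "", false // same epoch, but the filtered peer list shrank
	}
	return cur.stores[ctx.storeIdx], true
}

func main() {
	epoch := regionEpoch{confVer: 5, version: 12}

	// Before sending: three alive peers, request goes to the last one.
	before := cachedRegion{id: 1, epoch: epoch, stores: []string{"tikv-0", "tikv-1", "tikv-2"}}
	ctx := rpcCtx{regionID: 1, epoch: epoch, storeIdx: 2}

	// Meanwhile another query reloads the region: same epoch, one peer
	// filtered out as down. Indexing after.stores[2] directly would panic.
	after := cachedRegion{id: 1, epoch: epoch, stores: []string{"tikv-0", "tikv-1"}}

	store, ok := storeForFailure(before, ctx)
	fmt.Println("before reload:", store, ok)

	store, ok = storeForFailure(after, ctx)
	fmt.Println("after reload:", store, ok)
}
```

The point of the sketch is that the epoch comparison alone passes in both cases, which is why the mismatch only surfaces when the stale index is used against the new, shorter list.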
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
I'm not able to reproduce this issue.
2. What did you expect to see? (Required)
TiDB runs normally
3. What did you see instead (Required)
TiDB panics when calling RegionStore.accessStore
4. What is your TiDB version? (Required)
v4.0.6