-
Notifications
You must be signed in to change notification settings - Fork 726
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TestPrepareCheckerWithTransferLeader is unstable #8351
Comments
meet again https://github.com/tikv/pd/actions/runs/9762010374/job/26944878762 2024-07-02T13:54:15.0492409Z --- FAIL: TestPrepareCheckerWithTransferLeader (76.09s)
2024-07-02T13:54:15.0492734Z region_syncer_test.go:289:
2024-07-02T13:54:15.0493797Z Error Trace: /home/runner/work/pd/pd/tests/server/region_syncer/region_syncer_test.go:289
2024-07-02T13:54:15.0494204Z Error: Not equal:
2024-07-02T13:54:15.0494642Z expected: "pd2"
2024-07-02T13:54:15.0495194Z actual : ""
2024-07-02T13:54:15.0495495Z
2024-07-02T13:54:15.0496104Z Diff:
2024-07-02T13:54:15.0496743Z --- Expected
2024-07-02T13:54:15.0497390Z +++ Actual
2024-07-02T13:54:15.0498020Z @@ -1 +1 @@
2024-07-02T13:54:15.0498413Z -pd2
2024-07-02T13:54:15.0498676Z +
2024-07-02T13:54:15.0499001Z Test: TestPrepareCheckerWithTransferLeader
2024-07-02T13:54:15.0499178Z FAIL |
I found this error is in an extreme situation. And I created an issue to focus on the Waitleader failure situation :) #8359 pd/tests/server/region_syncer/region_syncer_test.go Lines 280 to 289 in 7e18a69
After resign leader to pd-2, which is happened in
Due to the slow disk, the leader changed 8 times in the 40s! and no leader check is in 40s which result error 2024-07-02T13:54:14.8776867Z [2024/07/02 13:53:21.728 +00:00] [WARN] [lease.go:175] ["lease keep alive failed"] [purpose="leader election"] [start=2024/07/02 13:53:21.726 +00:00] [error="context canceled"]
2024-07-02T13:54:14.8778148Z [2024/07/02 13:53:21.734 +00:00] [INFO] [lease.go:192] ["stop lease keep alive worker"] [purpose="leader election"] For this extreme situation, I think we can failed directly. The detailed content is 2024-07-02T13:54:14.7951174Z [2024/07/02 13:53:01.139 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd1]
2024-07-02T13:54:14.8302504Z [2024/07/02 13:53:07.537 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd1]
!!!resign triggered by API!!!
2024-07-02T13:54:14.8419029Z [2024/07/02 13:53:11.565 +00:00] [INFO] [member.go:358] ["try to resign etcd leader to next pd-server"] [from=pd1] [to=]
2024-07-02T13:54:14.8523786Z [2024/07/02 13:53:12.651 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd2]
2024-07-02T13:54:14.8715386Z [2024/07/02 13:53:18.726 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd2]
2024-07-02T13:54:14.8900582Z [2024/07/02 13:53:24.777 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd2]
!!!resign pd-2 ->0!!!
2024-07-02T13:54:14.9121126Z [2024/07/02 13:53:30.827 +00:00] [INFO] [member.go:358] ["try to resign etcd leader to next pd-server"] [from=pd2] [to=]
2024-07-02T13:54:14.9167935Z [2024/07/02 13:53:31.329 +00:00] [ERROR] [server.go:1718] ["campaign PD leader meets error due to etcd error"] [campaign-leader-name=pd2] [error="[PD:server:ErrLeaderFrequentlyChange]leader pd2 frequently changed, leader-key is [/pd/7387036149936305935/leader]"]
2024-07-02T13:54:14.9189703Z [2024/07/02 13:53:31.833 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd1]
!!!resign pd-1 -> 0!!!
2024-07-02T13:54:14.9374977Z [2024/07/02 13:53:37.894 +00:00] [INFO] [member.go:358] ["try to resign etcd leader to next pd-server"] [from=pd1] [to=]
2024-07-02T13:54:14.9440845Z [2024/07/02 13:53:38.395 +00:00] [ERROR] [server.go:1718] ["campaign PD leader meets error due to etcd error"] [campaign-leader-name=pd1] [error="[PD:server:ErrLeaderFrequentlyChange]leader pd1 frequently changed, leader-key is [/pd/7387036149936305935/leader]"]
2024-07-02T13:54:14.9468215Z [2024/07/02 13:53:38.897 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd2]
2024-07-02T13:54:14.9755943Z [2024/07/02 13:53:44.934 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd2]
2024-07-02T13:54:15.0006334Z [2024/07/02 13:53:50.978 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd2]
!!!resign pd-2 -> 0!!!
2024-07-02T13:54:15.0132961Z [2024/07/02 13:53:57.018 +00:00] [INFO] [member.go:358] ["try to resign etcd leader to next pd-server"] [from=pd2] [to=]
2024-07-02T13:54:15.0152409Z [2024/07/02 13:53:57.520 +00:00] [ERROR] [server.go:1718] ["campaign PD leader meets error due to etcd error"] [campaign-leader-name=pd2] [error="[PD:server:ErrLeaderFrequentlyChange]leader pd2 frequently changed, leader-key is [/pd/7387036149936305935/leader]"]
2024-07-02T13:54:15.0162553Z [2024/07/02 13:53:58.021 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd1] And The WaitLeader times out because all pd's do not all recognize the leader. When Checkleader's etcd get is too slow(> leader lease)
The above event loop occurred for 50 seconds, causing the WaitLeader to time out. You can check !!! pd2 was elected leader at 13:53:12.651 !!!
[2024/07/02 13:53:12.651 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd2]
!!! pd1 skip campaigning stops at 13:53:12.479, at which point it goes to getPersistentLeader in CheckLeader, and then it takes 6s to get the leader and start watching !!!
[2024/07/02 13:53:12.278 +00:00] [INFO] [server.go:1700] ["skip campaigning of pd leader and check later"] [server-name=pd1] [etcd-leader-id=13430743067140231612] [member-id=15634440769998151696]
[2024/07/02 13:53:12.479 +00:00] [INFO] [server.go:1700] ["skip campaigning of pd leader and check later"] [server-name=pd1] [etcd-leader-id=13430743067140231612] [member-id=15634440769998151696]
!!! CheckLeader !!!
[2024/07/02 13:53:18.684 +00:00] [WARN] [etcdutil.go:153] ["kv gets too slow"] [request-key=/pd/7387036149936305935/leader] [cost=6.004986363s] []
[2024/07/02 13:53:18.725 +00:00] [INFO] [server.go:1669] ["start to watch pd leader"] [pd-leader="name:\"pd2\" member_id:13430743067140231612 peer_urls:\"http://127.0.0.1:38109\" client_urls:\"http://127.0.0.1:40999\" "]
!!! pd1 watch found the leader changed after only 150ms, resulting in the WaitLeader not being successful !!!
[2024/07/02 13:53:18.862 +00:00] [INFO] [leadership.go:388] ["current leadership is deleted"] [revision=57] [leader-key=/pd/7387036149936305935/leader] [purpose="leader election"]
[2024/07/02 13:53:19.728 +00:00] [INFO] [server.go:1673] ["pd leader has changed, try to re-campaign a pd leader"] |
https://github.com/tikv/pd/actions/runs/9969178086/job/27545694334 |
Does this issue need reopen? @HuSharp |
Flaky Test
Which jobs are failing
CI link
https://github.com/tikv/pd/actions/runs/9737924850/job/26924893445
Reason for failure (if possible)
Anything else
The text was updated successfully, but these errors were encountered: