Fix some raft bugs that can cause raft to be unstable and even cause some partitions to get stuck after becoming fake leaders #2419
Conversation
Thanks for the contribution. Good catch, this PR generally LGTM. I have some questions:
Codecov Report
@@ Coverage Diff @@
## master #2419 +/- ##
==========================================
- Coverage 86.46% 86.31% -0.16%
==========================================
Files 649 648 -1
Lines 64374 66480 +2106
==========================================
+ Hits 55662 57381 +1719
- Misses 8712 9099 +387
Continue to review full report at Codecov.
LGTM
I only ran CTest in the CentOS environment, not in Ubuntu with clang-9. I now see why the error occurred.
The failure to start an election may be due to one of the following reasons:
Have you modified the
Yes, adjusted.
I know this issue. But I remember that for a single part the election is not frequent, although from the cluster's point of view it is frequent. The root cause is that our heartbeat is implemented by sending an "empty" log, so it needs to write the WAL, RPC to the peers, and then commit. It is vulnerable when the write pressure is high; for example, when RocksDB is writing and it takes much more time than expected (while holding the lock), the heartbeat is blocked.
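To illustrate why such a heartbeat is sensitive to disk pressure, here is a minimal, self-contained C++ sketch with hypothetical names (it is not the actual RaftPart code): the heartbeat is just an empty log entry, so it takes the same WAL-write, RPC, and commit path as a normal write, and a stalled WAL write delays the heartbeat as well.

```cpp
// Hypothetical sketch: a heartbeat implemented as an "empty" log entry shares
// the normal replication path, so slow disk I/O blocks the heartbeat too.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

struct LogEntry {
  uint64_t term;
  uint64_t logId;
  std::string payload;  // empty payload == heartbeat
};

class SimplifiedRaftPart {
 public:
  // Heartbeat goes through the same appendLogs path as a normal write.
  void sendHeartbeat(uint64_t term) {
    appendLogs({LogEntry{term, nextLogId_++, ""}});
  }

  void appendLogs(const std::vector<LogEntry>& entries) {
    std::lock_guard<std::mutex> guard(raftLock_);  // held across the whole path
    writeWal(entries);                             // 1. local WAL write (disk I/O)
    replicateToPeers(entries);                     // 2. RPC to followers
    commit(entries);                               // 3. advance the commit index
  }

 private:
  void writeWal(const std::vector<LogEntry>&) {
    // Under heavy write pressure (e.g. RocksDB writing on the same disk),
    // this call can take far longer than the heartbeat interval.
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
  }
  void replicateToPeers(const std::vector<LogEntry>&) { /* send AppendLog RPCs */ }
  void commit(const std::vector<LogEntry>&) { /* apply committed entries */ }

  std::mutex raftLock_;
  uint64_t nextLogId_ = 1;
};

int main() {
  SimplifiedRaftPart part;
  part.sendHeartbeat(/*term=*/7);  // blocked if writeWal stalls
  std::cout << "heartbeat sent\n";
}
```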
Yes, I did try to increase the heartbeat sending frequency, but because the disk load was too high, the heartbeat messages were all discarded with many logs saying "Another AppendLogs request is ongoing, just return". So I made a slight adjustment to the reset of lastMsgRecvDur_, which will be submitted in another PR.
OK, thanks again. I will merge this PR later.
What changes were proposed in this pull request?
Fix two bugs in RaftPart.
The first bug: a partition rejects a vote request from a candidate it has already voted for when that candidate asks again with the same term. This can happen when the RPC times out.
The second bug: when the leader changes its role from leader to follower after voting, it fails to call wal_->rollbackToLog(..), which leaves RaftPart's lastLogId_ and the WAL's lastLogId_ out of sync and causes the node to get stuck once it becomes a fake leader.
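For illustration, here is a minimal, hedged C++ sketch of the vote-granting rule implied by the first fix. All names are hypothetical and the log up-to-date checks are omitted; the key point is that a repeated request from the same candidate with the same term is granted again instead of being rejected.

```cpp
// Hypothetical sketch of the vote-granting rule; not the actual RaftPart code.
#include <cstdint>
#include <iostream>
#include <string>

struct VoteRequest {
  uint64_t term;
  std::string candidateId;
};

struct VoterState {
  uint64_t votedTerm = 0;
  std::string votedFor;  // empty means "not voted in votedTerm"
};

bool shouldGrantVote(const VoteRequest& req, VoterState& state) {
  if (req.term < state.votedTerm) {
    return false;  // stale term, reject
  }
  if (req.term == state.votedTerm && !state.votedFor.empty() &&
      state.votedFor != req.candidateId) {
    return false;  // already voted for a different candidate in this term
  }
  // Either a newer term, or the same candidate asking again in the same term
  // (for example because its previous RPC timed out): grant the vote (again).
  state.votedTerm = req.term;
  state.votedFor = req.candidateId;
  return true;
}

int main() {
  VoterState state;
  VoteRequest req{/*term=*/5, "candidate-A"};
  bool first = shouldGrantVote(req, state);    // granted
  bool retry = shouldGrantVote(req, state);    // granted again, not rejected
  std::cout << first << " " << retry << "\n";  // prints "1 1"
}
```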
Why are the changes needed?
1. Machine1 is the leader at term N and sends a message with logId(N) to machine2 and machine3.
2. Due to high load and other reasons, machine3 fails to receive the message from machine1 in time and starts an election with term N+1.
3. Machine1 votes for machine3 and changes its role to follower.
4. Machine1 receives the response for logId(N), but because it is already a follower, it no longer processes the response. Note that at this point machine1's RaftPart lastLogId_ and WAL lastLogId_ are out of sync.
5. Machine3 also fails to send messages to machine1 in time because of high load and other reasons, so machine1 starts an election with term N+2.
6. Machine2 votes for machine1, and machine1 becomes the leader again at term N+2.
7. Machine1 cannot send any messages out at this point, because RaftPart's lastLogId_ and the WAL's lastLogId_ are not synchronized.
8. The message sent by machine3 to machine1 is rejected by machine1 because machine3's term is smaller than machine1's, so machine1 stays stuck as a fake leader.
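As an illustration of the second fix, here is a minimal, hedged C++ sketch; the names are hypothetical, and only wal_->rollbackToLog(..) and lastLogId_ come from the description above. When the leader steps down after voting, it rolls the WAL back to the last committed entry so its in-memory lastLogId_ and the WAL cannot drift apart.

```cpp
// Hypothetical sketch of rolling back uncommitted WAL entries when a leader
// steps down to follower; not the actual RaftPart implementation.
#include <cstdint>

class Wal {
 public:
  uint64_t lastLogId() const { return lastLogId_; }
  void appendLog(uint64_t logId) { lastLogId_ = logId; }
  // Drops every entry with an id greater than logId.
  void rollbackToLog(uint64_t logId) { lastLogId_ = logId; }

 private:
  uint64_t lastLogId_ = 0;
};

class SimplifiedRaftPart {
 public:
  // The leader appends an entry to the WAL; it is not committed until a quorum replies.
  void appendUncommitted(uint64_t logId) {
    wal_.appendLog(logId);
    lastLogId_ = logId;
  }

  // Called when the leader votes for another candidate and becomes a follower.
  void stepDownToFollower() {
    role_ = Role::FOLLOWER;
    // The fix: roll the WAL back to the last committed entry so that lastLogId_
    // and the WAL's last log id stay consistent. Without this rollback the node
    // keeps an uncommitted entry in the WAL and gets stuck as a fake leader if
    // it later wins another election.
    wal_.rollbackToLog(committedLogId_);
    lastLogId_ = committedLogId_;
  }

 private:
  enum class Role { LEADER, FOLLOWER, CANDIDATE };
  Role role_ = Role::LEADER;
  Wal wal_;
  uint64_t lastLogId_ = 0;
  uint64_t committedLogId_ = 0;
};

int main() {
  SimplifiedRaftPart part;
  part.appendUncommitted(42);  // entry sent to peers, response never processed
  part.stepDownToFollower();   // rollback keeps RaftPart and WAL views in sync
}
```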


Will it break compatibility? If so, how?
No, this just fixes bugs.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Simulate a harsh environment:
Then observe whether the fake leader appears before and after merging this code. The fake leader is expected to appear before the fix is merged, and not to appear after it is merged.
Checklist