
Fix some raft bugs that can cause raft to be unstable and even some partitions become stuck after becoming fake leaders #2419

Merged 1 commit into vesoft-inc:master on Dec 21, 2020

Conversation

@Crazyinging (Contributor) commented Dec 18, 2020

What changes were proposed in this pull request?

Fix two bugs in RaftPart.

  • The first bug: a partition rejects a repeated vote request from a candidate it has already voted for in the same term. This can happen when the vote RPC times out and the candidate retries. (A rough sketch of both fixes follows this list.)

  • The second bug: when the leader changes its role from leader to follower after voting, it fails to call wal_->rollbackToLog(..), leaving RaftPart's lastLogId_ and the WAL's lastLogId_ out of sync. This is what makes a fake leader get stuck once one appears.

    • Fake leader: a node that becomes leader but cannot send out messages and rejects the messages it receives.
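For readers who want a concrete picture, below is a minimal, self-contained C++ sketch of what the two fixes amount to. It is a hypothetical illustration, not the actual RaftPart implementation: the class, the WAL stub, and fields such as votedTerm_ and votedAddr_ are invented stand-ins, while wal_->rollbackToLog(..) and lastLogId_ mirror the identifiers discussed in this PR.

```cpp
// Hypothetical sketch only -- not the actual nebula RaftPart code.
#include <cstdint>
#include <string>

enum class Role { LEADER, FOLLOWER, CANDIDATE };

// Minimal stand-in for the WAL interface referenced in this PR.
struct WalStub {
  int64_t lastLogId = 0;
  void rollbackToLog(int64_t id) { lastLogId = id; }
};

struct AskForVoteRequest {
  int64_t term = 0;
  std::string candidate;
};

class RaftPartSketch {
 public:
  bool processAskForVote(const AskForVoteRequest& req) {
    // Fix 1: a retried request from the candidate we already voted for,
    // using the same term, is granted again instead of being rejected,
    // so a vote RPC that timed out can safely be retried.
    if (req.term == votedTerm_ && req.candidate == votedAddr_) {
      return true;
    }
    if (req.term <= term_) {
      return false;  // stale term, or same term from a different candidate
    }
    // ... the usual checks (candidate's log is up to date, etc.) ...
    term_ = req.term;
    votedTerm_ = req.term;
    votedAddr_ = req.candidate;
    if (role_ == Role::LEADER) {
      // Fix 2: when stepping down from leader after voting, roll the WAL
      // back to RaftPart's lastLogId_ so the two ids stay in sync;
      // otherwise a stale WAL tail can leave a future fake leader stuck.
      wal_.rollbackToLog(lastLogId_);
    }
    role_ = Role::FOLLOWER;
    return true;
  }

 private:
  Role role_ = Role::FOLLOWER;
  int64_t term_ = 0;
  int64_t votedTerm_ = -1;
  std::string votedAddr_;
  int64_t lastLogId_ = 0;
  WalStub wal_;
};
```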

Why are the changes needed?

  • The first bug causes abnormal communication between partitions when Raft is under high load.
  • The second bug produces fake leaders, and the cluster does not heal itself for a long time.
    • How a fake leader is generated, illustrated with a Raft group of 3 machines:
  1. machine1 is the leader at term N and is sending a message with logid(N) to machine2 and machine3.

  2. Due to pressure and other reasons, machine3 fails to receive the messages from machine1 in time and initiates an election with term N+1.

  3. machine1 votes for machine3 and changes its role to follower.

  4. machine1 receives the response for logid(N), but because it is already a follower, it no longer processes the response. Note that at this point machine1's RaftPart lastLogId_ and WAL lastLogId_ are out of sync.

  5. machine3 also fails to send messages to machine1 in time due to high pressure and other reasons, so machine1 initiates an election with term N+2.

  6. machine2 votes for machine1, and machine1 becomes the leader again with term N+2.

  7. machine1 cannot send messages out at this time, because RaftPart's lastLogId_ and the WAL's lastLogId_ are not synchronized.

  8. The messages sent by machine3 to machine1 are rejected by machine1 because machine3's term is smaller than machine1's, so machine1 enters the stuck phase.

Will it break compatibility? If so, how?

No, this only fixes bugs.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Simulate a harsh environment:

  1. Adjust the RPC timeout down (<= 500 ms; the lower, the better).
  2. Raise the message write load so that disk IO stays at a high water level. In my case, utilization was always >= 90% and often reached 100%.

Then observe whether a fake leader appears before and after merging this code. The expectation is that a fake leader appears before the fix and does not appear after it.

Checklist

  • [ yes ] I've run the tests to see all new and existing tests pass
  • [ - ] If this Pull Request resolves an issue, I linked to the issue in the text above
  • [ - ] I've informed the technical writer about the documentation change if necessary

@Crazyinging changed the title from "Fix some raft bugs that can cause raft to be unstable and even some partitions become stuck after becoming false leaders" to "Fix some raft bugs that can cause raft to be unstable and even some partitions become stuck after becoming fake leaders" Dec 18, 2020
@critical27 added the ready-for-testing (PR: ready for the CI test) label Dec 19, 2020
@critical27 (Contributor):

Thanks for the contribution, good catch. This PR generally LGTM. I have some questions:

  1. After machine1 enters the stuck phase, what happens next? IMO, after machine 2 or 3 times out and starts an election, machine 1 will eventually become a follower, right? And everything will become normal again, although the partition didn't work during the stuck phase.
  2. The line numbers are not consistent with master. Which version did you use? Or did you modify some other logic?

@codecov-io:

Codecov Report

Merging #2419 (8f46b8c) into master (2249c03) will decrease coverage by 0.15%.
The diff coverage is 82.56%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2419      +/-   ##
==========================================
- Coverage   86.46%   86.31%   -0.16%     
==========================================
  Files         649      648       -1     
  Lines       64374    66480    +2106     
==========================================
+ Hits        55662    57381    +1719     
- Misses       8712     9099     +387     
Impacted Files Coverage Δ
src/common/filter/geo/GeoParams.h 100.00% <ø> (ø)
src/common/hdfs/HdfsCommandHelper.cpp 0.00% <0.00%> (-75.00%) ⬇️
src/common/stats/StatsManager.h 100.00% <ø> (ø)
src/common/time/Duration.cpp 75.86% <ø> (ø)
src/common/time/Duration.h 100.00% <ø> (ø)
src/common/time/detail/TscHelper.cpp 100.00% <ø> (ø)
src/graph/DownloadExecutor.cpp 0.00% <0.00%> (ø)
src/graph/DropSpaceExecutor.h 0.00% <ø> (ø)
src/graph/FetchEdgesExecutor.cpp 79.32% <0.00%> (-0.27%) ⬇️
src/graph/FindPathExecutor.h 50.00% <ø> (ø)
... and 145 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a1e9ff4...8f46b8c. Read the comment docs.

@liuyu85cn self-requested a review December 19, 2020 04:02

@liuyu85cn (Contributor) left a comment:

LGTM

@Crazyinging (Contributor, Author):

> Thanks for the contribution, good catch. This PR generally LGTM. I have some questions:
>
>   1. After machine1 enters the stuck phase, what happens next? IMO, after machine 2 or 3 times out and starts an election, machine 1 will eventually become a follower, right? And everything will become normal again, although the partition didn't work during the stuck phase.
>   2. The line numbers are not consistent with master. Which version did you use? Or did you modify some other logic?

  1. Yeah, according to the code, once machine 2 and 3 start elections again and their term catches up with machine 1, the fake leader should disappear. But in my simulated case, the total number of leaders in the cluster did not return to normal while the cluster was under high load. In addition, machine 2 and machine 3 did not start elections after the write pressure stopped.
  2. No other changes; the code is forked from the latest master.

@Crazyinging (Contributor, Author):

I only ran ctest in the CentOS environment, not in Ubuntu clang-9. I now see why the error occurred.

@critical27 (Contributor) commented Dec 21, 2020

> Thanks for the contribution, good catch. This PR generally LGTM. I have some questions:
>
>   1. After machine1 enters the stuck phase, what happens next? IMO, after machine 2 or 3 times out and starts an election, machine 1 will eventually become a follower, right? And everything will become normal again, although the partition didn't work during the stuck phase.
>   2. The line numbers are not consistent with master. Which version did you use? Or did you modify some other logic?
>
>   1. Yeah, according to the code, once machine 2 and 3 start elections again and their term catches up with machine 1, the fake leader should disappear. But in my simulated case, the total number of leaders in the cluster did not return to normal while the cluster was under high load. In addition, machine 2 and machine 3 did not start elections after the write pressure stopped.
>   2. No other changes; the code is forked from the latest master.

The reason elections do not start may be one of the following:

  1. Lock contention under high pressure, especially since the lock is not fine-grained.
  2. We only allow one election at a time in the current code, and the RPC timeout of a vote is quite long (raft_heartbeat_interval_secs).

Have you modified the raft_heartbeat_interval_secs parameter?

@Crazyinging (Contributor, Author) commented Dec 21, 2020

> Thanks for the contribution, good catch. This PR generally LGTM. I have some questions:
>
>   1. After machine1 enters the stuck phase, what happens next? IMO, after machine 2 or 3 times out and starts an election, machine 1 will eventually become a follower, right? And everything will become normal again, although the partition didn't work during the stuck phase.
>   2. The line numbers are not consistent with master. Which version did you use? Or did you modify some other logic?
>
>   1. Yeah, according to the code, once machine 2 and 3 start elections again and their term catches up with machine 1, the fake leader should disappear. But in my simulated case, the total number of leaders in the cluster did not return to normal while the cluster was under high load. In addition, machine 2 and machine 3 did not start elections after the write pressure stopped.
>   2. No other changes; the code is forked from the latest master.
>
> The reason elections do not start may be one of the following:
>
>   1. Lock contention under high pressure, especially since the lock is not fine-grained.
>   2. We only allow one election at a time in the current code, and the RPC timeout of a vote is quite long (raft_heartbeat_interval_secs).
>
> Have you modified the raft_heartbeat_interval_secs parameter?


Yes, I adjusted it.
My last reply may not have been clear.
In fact, I ran into two problems at the same time under high write pressure (SSD disk IO util >= 90%):

  1. One is the fake leader problem; this PR fixes the fake leader.
  2. The other is that cluster elections are very frequent. The reasons are:
    2.1. raft_rpc_timeout_ms is too small (the default value of 500 ms is used) because the disk is too busy.
    2.2. The threads allocated to raft are not enough.
    Others: I tried adjusting only raft_heartbeat_interval_secs from 5s to 30s, but the frequent-election problem did not improve significantly.
    The main optimization is to increase raft_rpc_timeout_ms and to allocate more threads to raft (an illustrative flag snippet follows).
    The lock is also involved; I made a slight adjustment to the reset of lastMsgRecvDur_ and will submit it in another PR.
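For concreteness, the flag adjustments described above might look like the following in a storaged flags file. The values are illustrative only, not the exact ones used in this test; raft_rpc_timeout_ms (default 500 ms) and raft_heartbeat_interval_secs (default 5 s) are the two flags named in this thread, and the thread-pool flags are left out because their names are not given here.

```
# Illustrative values only -- tune for your own environment.
--raft_rpc_timeout_ms=2000          # raised from the 500 ms default because the disk is busy
--raft_heartbeat_interval_secs=5    # raising this alone did not help much in this test
# (flags for giving raft more worker threads are not named in this thread and are omitted)
```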

@critical27 (Contributor):

> 2. The other is that cluster elections are very frequent.

I know this issue. But I remember that, for a single part, the election is not frequent; it is only frequent from the cluster's point of view. The root cause is that our heartbeat is implemented by sending an "empty" log, so it needs to write the WAL, RPC to the peers, and then commit. This is vulnerable when the write pressure is high: for example, if rocksdb is writing and takes much longer than expected (while holding the lock), the heartbeat is blocked.
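To illustrate the flow described above, here is a small hypothetical C++ sketch of a heartbeat that reuses the normal append path (write the WAL, replicate, commit). It is not nebula's actual implementation; all names are invented for this example.

```cpp
// Hypothetical sketch only -- not nebula's actual heartbeat code.
#include <cstdint>
#include <string>

struct LogEntry {
  int64_t term = 0;
  std::string payload;  // an empty payload plays the role of a heartbeat
};

class LeaderSketch {
 public:
  // The heartbeat goes through the same path as a normal log entry, so if
  // the WAL write or the commit stalls (e.g. rocksdb is busy and holds a
  // lock), the heartbeat stalls with it and followers may start elections.
  void sendHeartbeat() {
    LogEntry hb;
    hb.term = term_;
    appendLog(hb);
  }

  void appendLog(const LogEntry& entry) {
    writeWal(entry);          // blocks on disk IO
    replicateToPeers(entry);  // RPC to the followers
    commit(entry);            // blocks on the state-machine write
  }

 private:
  void writeWal(const LogEntry&) { /* disk write elided */ }
  void replicateToPeers(const LogEntry&) { /* RPC elided */ }
  void commit(const LogEntry&) { /* apply elided */ }
  int64_t term_ = 0;
};
```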

@Crazyinging (Contributor, Author) commented Dec 21, 2020

> 2. The other is that cluster elections are very frequent.
>
> I know this issue. But I remember that, for a single part, the election is not frequent; it is only frequent from the cluster's point of view. The root cause is that our heartbeat is implemented by sending an "empty" log, so it needs to write the WAL, RPC to the peers, and then commit. This is vulnerable when the write pressure is high: for example, if rocksdb is writing and takes much longer than expected (while holding the lock), the heartbeat is blocked.

Yes, I did try to increase the heartbeat sending frequency, but because the disk load was too high, the heartbeat messages were all discarded, with many logs saying "Another AppendLogs request is ongoing, just return". So I made a slight adjustment to the reset of lastMsgRecvDur_, which will be submitted in another PR.

@critical27 (Contributor):

> 2. The other is that cluster elections are very frequent.
>
> I know this issue. But I remember that, for a single part, the election is not frequent; it is only frequent from the cluster's point of view. The root cause is that our heartbeat is implemented by sending an "empty" log, so it needs to write the WAL, RPC to the peers, and then commit. This is vulnerable when the write pressure is high: for example, if rocksdb is writing and takes much longer than expected (while holding the lock), the heartbeat is blocked.
>
> Yes, I did try to increase the heartbeat sending frequency, but because the disk load was too high, the heartbeat messages were all discarded, with many logs saying "Another AppendLogs request is ongoing, just return". So I made a slight adjustment to the reset of lastMsgRecvDur_, which will be submitted in another PR.

OK, thanks again. I will merge this PR later.

@critical27 merged commit a27827a into vesoft-inc:master Dec 21, 2020
tong-hao pushed a commit to tong-hao/nebula that referenced this pull request Jun 1, 2021
* The first bug: a partition rejects a repeated vote request from a candidate it has already voted for in the same term. This can happen when the vote RPC times out and the candidate retries.

* The second bug: when the leader changes its role from leader to follower after voting, it fails to call wal_->rollbackToLog(..), leaving RaftPart's lastLogId_ and the WAL's lastLogId_ out of sync, so it will not accept further logs from the leader.
codesigner pushed a commit to codesigner/nebula that referenced this pull request Mar 30, 2023