
Fix some raft bugs that can cause raft to be unstable and even some partitions become stuck after becoming fake leaders #2419

Merged 1 commit into vesoft-inc:master on Dec 21, 2020

Conversation

@Crazyinging (Contributor) commented Dec 18, 2020

What changes were proposed in this pull request?

Fix two bugs in RaftPart.

  • The first bug: a partition rejects a repeated vote request from a candidate it has already voted for in the same term. This can happen when the vote RPC times out and the candidate retries. (A rough sketch of both fixes follows this list.)

  • The second bug: when the leader changes its role from leader to follower after voting, it fails to call wal_->rollbackToLog(..), leaving RaftPart's lastLogId_ and the WAL's lastLogId_ out of sync. This is what makes a fake leader get stuck once one appears.

    • Fake leader: a node that becomes leader but cannot send out messages and rejects the messages it receives.
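For readers who want a concrete picture, below is a minimal, self-contained C++ sketch of what the two fixes amount to. It is a hypothetical illustration, not the actual RaftPart implementation: the class, the WAL stub, and fields such as votedTerm_ and votedAddr_ are invented stand-ins, while wal_->rollbackToLog(..) and lastLogId_ mirror the identifiers discussed in this PR.

```cpp
// Hypothetical sketch only -- not the actual nebula RaftPart code.
#include <cstdint>
#include <string>

enum class Role { LEADER, FOLLOWER, CANDIDATE };

// Minimal stand-in for the WAL interface referenced in this PR.
struct WalStub {
  int64_t lastLogId = 0;
  void rollbackToLog(int64_t id) { lastLogId = id; }
};

struct AskForVoteRequest {
  int64_t term = 0;
  std::string candidate;
};

class RaftPartSketch {
 public:
  bool processAskForVote(const AskForVoteRequest& req) {
    // Fix 1: a retried request from the candidate we already voted for,
    // using the same term, is granted again instead of being rejected,
    // so a vote RPC that timed out can safely be retried.
    if (req.term == votedTerm_ && req.candidate == votedAddr_) {
      return true;
    }
    if (req.term <= term_) {
      return false;  // stale term, or same term from a different candidate
    }
    // ... the usual checks (candidate's log is up to date, etc.) ...
    term_ = req.term;
    votedTerm_ = req.term;
    votedAddr_ = req.candidate;
    if (role_ == Role::LEADER) {
      // Fix 2: when stepping down from leader after voting, roll the WAL
      // back to RaftPart's lastLogId_ so the two ids stay in sync;
      // otherwise a stale WAL tail can leave a future fake leader stuck.
      wal_.rollbackToLog(lastLogId_);
    }
    role_ = Role::FOLLOWER;
    return true;
  }

 private:
  Role role_ = Role::FOLLOWER;
  int64_t term_ = 0;
  int64_t votedTerm_ = -1;
  std::string votedAddr_;
  int64_t lastLogId_ = 0;
  WalStub wal_;
};
```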

Why are the changes needed?

  • The first bug causes abnormal communication between partitions when Raft is under high load.
  • The second bug produces fake leaders, and the cluster does not heal itself for a long time.
    • How a fake leader is generated, illustrated with a Raft group of 3 machines:
  1. machine1 is the leader at term N and is sending a message with logid(N) to machine2 and machine3.

  2. Due to pressure and other reasons, machine3 fails to receive the messages from machine1 in time and initiates an election with term N+1.

  3. machine1 votes for machine3 and changes its role to follower.

  4. machine1 receives the response for logid(N), but because it is already a follower, it no longer processes the response. Note that at this point machine1's RaftPart lastLogId_ and WAL lastLogId_ are out of sync.

  5. machine3 also fails to send messages to machine1 in time due to high pressure and other reasons, so machine1 initiates an election with term N+2.

  6. machine2 votes for machine1, and machine1 becomes the leader again with term N+2.

  7. machine1 cannot send messages out at this time, because RaftPart's lastLogId_ and the WAL's lastLogId_ are not synchronized.

  8. The messages sent by machine3 to machine1 are rejected by machine1 because machine3's term is smaller than machine1's, so machine1 enters the stuck phase.

Will it break compatibility? If so, how?

No, this only fixes bugs.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Simulate a harsh environment:

  1. Adjust the RPC timeout down (<= 500 ms; the lower, the better).
  2. Raise the message write load so that disk IO stays at a high water level. In my case, utilization was always >= 90% and often reached 100%.

Then observe whether a fake leader appears before and after merging this code. The expectation is that a fake leader appears before the fix and does not appear after it.

Checklist

  • [ yes ] I've run the tests to see all new and existing tests pass
  • [ - ] If this Pull Request resolves an issue, I linked to the issue in the text above
  • [ - ] I've informed the technical writer about the documentation change if necessary

@Crazyinging changed the title from "Fix some raft bugs that can cause raft to be unstable and even some partitions become stuck after becoming false leaders" to "Fix some raft bugs that can cause raft to be unstable and even some partitions become stuck after becoming fake leaders" Dec 18, 2020
@critical27 added the ready-for-testing (PR: ready for the CI test) label Dec 19, 2020
@critical27 (Contributor):

Thanks for the contribution, good catch. This PR generally LGTM. I have some questions:

  1. After machine1 enters the stuck phase, what happens next? IMO, after machine 2 or 3 times out and starts an election, machine 1 will eventually become a follower, right? And everything will become normal again, although the partition didn't work during the stuck phase.
  2. The line numbers are not consistent with master. Which version did you use? Or did you modify some other logic?

@codecov-io:

Codecov Report

Merging #2419 (8f46b8c) into master (2249c03) will decrease coverage by 0.15%.
The diff coverage is 82.56%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2419      +/-   ##
==========================================
- Coverage   86.46%   86.31%   -0.16%     
==========================================
  Files         649      648       -1     
  Lines       64374    66480    +2106     
==========================================
+ Hits        55662    57381    +1719     
- Misses       8712     9099     +387     
Impacted Files Coverage Δ
src/common/filter/geo/GeoParams.h 100.00% <ø> (ø)
src/common/hdfs/HdfsCommandHelper.cpp 0.00% <0.00%> (-75.00%) ⬇️
src/common/stats/StatsManager.h 100.00% <ø> (ø)
src/common/time/Duration.cpp 75.86% <ø> (ø)
src/common/time/Duration.h 100.00% <ø> (ø)
src/common/time/detail/TscHelper.cpp 100.00% <ø> (ø)
src/graph/DownloadExecutor.cpp 0.00% <0.00%> (ø)
src/graph/DropSpaceExecutor.h 0.00% <ø> (ø)
src/graph/FetchEdgesExecutor.cpp 79.32% <0.00%> (-0.27%) ⬇️
src/graph/FindPathExecutor.h 50.00% <ø> (ø)
... and 145 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a1e9ff4...8f46b8c. Read the comment docs.

@liuyu85cn self-requested a review December 19, 2020 04:02

@liuyu85cn (Contributor) left a comment:

LGTM

@Crazyinging (Contributor, Author):

> Thanks for the contribution, good catch. This PR generally LGTM. I have some questions:
>
>   1. After machine1 enters the stuck phase, what happens next? IMO, after machine 2 or 3 times out and starts an election, machine 1 will eventually become a follower, right? And everything will become normal again, although the partition didn't work during the stuck phase.
>   2. The line numbers are not consistent with master. Which version did you use? Or did you modify some other logic?

  1. Yeah, according to the code, once machine 2 and 3 start elections again and their term catches up with machine 1, the fake leader should disappear. But in my simulated case, the total number of leaders in the cluster did not return to normal while the cluster was under high load. In addition, machine 2 and machine 3 did not start elections after the write pressure stopped.
  2. No other changes; the code is forked from the latest master.

@Crazyinging (Contributor, Author):

I only ran ctest in the CentOS environment, not in Ubuntu clang-9. I now see why the error occurred.

@critical27 (Contributor) commented Dec 21, 2020

> Thanks for the contribution, good catch. This PR generally LGTM. I have some questions:
>
>   1. After machine1 enters the stuck phase, what happens next? IMO, after machine 2 or 3 times out and starts an election, machine 1 will eventually become a follower, right? And everything will become normal again, although the partition didn't work during the stuck phase.
>   2. The line numbers are not consistent with master. Which version did you use? Or did you modify some other logic?
>
>   1. Yeah, according to the code, once machine 2 and 3 start elections again and their term catches up with machine 1, the fake leader should disappear. But in my simulated case, the total number of leaders in the cluster did not return to normal while the cluster was under high load. In addition, machine 2 and machine 3 did not start elections after the write pressure stopped.
>   2. No other changes; the code is forked from the latest master.

The reason elections do not start may be one of the following:

  1. Lock contention under high pressure, especially since the lock is not fine-grained.
  2. We only allow one election at a time in the current code, and the RPC timeout of a vote is quite long (raft_heartbeat_interval_secs).

Have you modified the raft_heartbeat_interval_secs parameter?

@Crazyinging (Contributor, Author) commented Dec 21, 2020

> Thanks for the contribution, good catch. This PR generally LGTM. I have some questions:
>
>   1. After machine1 enters the stuck phase, what happens next? IMO, after machine 2 or 3 times out and starts an election, machine 1 will eventually become a follower, right? And everything will become normal again, although the partition didn't work during the stuck phase.
>   2. The line numbers are not consistent with master. Which version did you use? Or did you modify some other logic?
>
>   1. Yeah, according to the code, once machine 2 and 3 start elections again and their term catches up with machine 1, the fake leader should disappear. But in my simulated case, the total number of leaders in the cluster did not return to normal while the cluster was under high load. In addition, machine 2 and machine 3 did not start elections after the write pressure stopped.
>   2. No other changes; the code is forked from the latest master.
>
> The reason elections do not start may be one of the following:
>
>   1. Lock contention under high pressure, especially since the lock is not fine-grained.
>   2. We only allow one election at a time in the current code, and the RPC timeout of a vote is quite long (raft_heartbeat_interval_secs).
>
> Have you modified the raft_heartbeat_interval_secs parameter?


Yes, I adjusted it.
My last reply may not have been clear.
In fact, I ran into two problems at the same time under high write pressure (SSD disk IO util >= 90%):

  1. One is the fake leader problem; this PR fixes the fake leader.
  2. The other is that cluster elections are very frequent. The reasons are:
    2.1. raft_rpc_timeout_ms is too small (the default value of 500 ms is used) because the disk is too busy.
    2.2. The threads allocated to raft are not enough.
    Others: I tried adjusting only raft_heartbeat_interval_secs from 5s to 30s, but the frequent-election problem did not improve significantly.
    The main optimization is to increase raft_rpc_timeout_ms and to allocate more threads to raft (an illustrative flag snippet follows).
    The lock is also involved; I made a slight adjustment to the reset of lastMsgRecvDur_ and will submit it in another PR.
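For concreteness, the flag adjustments described above might look like the following in a storaged flags file. The values are illustrative only, not the exact ones used in this test; raft_rpc_timeout_ms (default 500 ms) and raft_heartbeat_interval_secs (default 5 s) are the two flags named in this thread, and the thread-pool flags are left out because their names are not given here.

```
# Illustrative values only -- tune for your own environment.
--raft_rpc_timeout_ms=2000          # raised from the 500 ms default because the disk is busy
--raft_heartbeat_interval_secs=5    # raising this alone did not help much in this test
# (flags for giving raft more worker threads are not named in this thread and are omitted)
```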

@critical27 (Contributor):

> 2. The other is that cluster elections are very frequent.

I know this issue. But I remember that, for a single part, the election is not frequent; it is only frequent from the cluster's point of view. The root cause is that our heartbeat is implemented by sending an "empty" log, so it needs to write the WAL, RPC to the peers, and then commit. This is vulnerable when the write pressure is high: for example, if rocksdb is writing and takes much longer than expected (while holding the lock), the heartbeat is blocked.
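To illustrate the flow described above, here is a small hypothetical C++ sketch of a heartbeat that reuses the normal append path (write the WAL, replicate, commit). It is not nebula's actual implementation; all names are invented for this example.

```cpp
// Hypothetical sketch only -- not nebula's actual heartbeat code.
#include <cstdint>
#include <string>

struct LogEntry {
  int64_t term = 0;
  std::string payload;  // an empty payload plays the role of a heartbeat
};

class LeaderSketch {
 public:
  // The heartbeat goes through the same path as a normal log entry, so if
  // the WAL write or the commit stalls (e.g. rocksdb is busy and holds a
  // lock), the heartbeat stalls with it and followers may start elections.
  void sendHeartbeat() {
    LogEntry hb;
    hb.term = term_;
    appendLog(hb);
  }

  void appendLog(const LogEntry& entry) {
    writeWal(entry);          // blocks on disk IO
    replicateToPeers(entry);  // RPC to the followers
    commit(entry);            // blocks on the state-machine write
  }

 private:
  void writeWal(const LogEntry&) { /* disk write elided */ }
  void replicateToPeers(const LogEntry&) { /* RPC elided */ }
  void commit(const LogEntry&) { /* apply elided */ }
  int64_t term_ = 0;
};
```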

@Crazyinging (Contributor, Author) commented Dec 21, 2020

> 2. The other is that cluster elections are very frequent.
>
> I know this issue. But I remember that, for a single part, the election is not frequent; it is only frequent from the cluster's point of view. The root cause is that our heartbeat is implemented by sending an "empty" log, so it needs to write the WAL, RPC to the peers, and then commit. This is vulnerable when the write pressure is high: for example, if rocksdb is writing and takes much longer than expected (while holding the lock), the heartbeat is blocked.

Yes, I did try to increase the heartbeat sending frequency, but because the disk load was too high, the heartbeat messages were all discarded, with many logs saying "Another AppendLogs request is ongoing, just return". So I made a slight adjustment to the reset of lastMsgRecvDur_, which will be submitted in another PR.

@critical27 (Contributor):

> 2. The other is that cluster elections are very frequent.
>
> I know this issue. But I remember that, for a single part, the election is not frequent; it is only frequent from the cluster's point of view. The root cause is that our heartbeat is implemented by sending an "empty" log, so it needs to write the WAL, RPC to the peers, and then commit. This is vulnerable when the write pressure is high: for example, if rocksdb is writing and takes much longer than expected (while holding the lock), the heartbeat is blocked.
>
> Yes, I did try to increase the heartbeat sending frequency, but because the disk load was too high, the heartbeat messages were all discarded, with many logs saying "Another AppendLogs request is ongoing, just return". So I made a slight adjustment to the reset of lastMsgRecvDur_, which will be submitted in another PR.

OK, thanks again. I will merge this PR later.

@critical27 merged commit a27827a into vesoft-inc:master Dec 21, 2020
tong-hao pushed a commit to tong-hao/nebula that referenced this pull request Jun 1, 2021
* The first bug: a partition rejects a repeated vote request from a candidate it has already voted for in the same term. This can happen when the vote RPC times out and the candidate retries.

* The second bug: when the leader changes its role from leader to follower after voting, it fails to call wal_->rollbackToLog(..), leaving RaftPart's lastLogId_ and the WAL's lastLogId_ out of sync, so it will not accept further logs from the leader.
codesigner pushed a commit to codesigner/nebula that referenced this pull request Mar 30, 2023