Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent log #293

Closed
JackyWoo opened this issue Mar 1, 2022 · 8 comments
Closed

Inconsistent log #293

JackyWoo opened this issue Mar 1, 2022 · 8 comments

Comments

@JackyWoo
Copy link
Contributor

JackyWoo commented Mar 1, 2022

Case Description

NuRaft commit 0073f55 2021.11.19

10.199.141.8 ID: 13
10.199.141.7 ID: 14
10.199.141.6 ID: 15

To reproduce the issue make short hb interval and election timeout.

heart_beat_interval_ms 40
election_timeout_lower_bound_ms 50
election_timeout_upper_bound_ms 90

About 2022.02.25 12:26 node 13's data are different with 14 and 15. We make a tool checked the log and found that log of 13 were different with the others. The different log is [989286-989299].

Analysis

Near 2022.02.25 12:26 there were 3 leader switches : 13(Term=41, ), 14(Term=42, idx=989286), 13(Term=43,idx=989300), 14 (Term=44,idx =989303).

The following is the server log of 13: When 14 (Term=44, idx=989303) is leader, 13 receives the append log request from 14, and it accepts the request from index 989299.

2022.02.25 12:26:21.470207 [ 26429 ] {} <Debug> RaftInstance: Response back a append_entries_response message to 14 with Accepted=0, Term=44, NextIndex=989314
2022.02.25 12:26:21.470416 [ 26436 ] {} <Debug> RaftInstance: Receive a append_entries_request message from 14 with LastLogIndex=989301, LastLogTerm=42, EntriesLength=2, CommitIndex=989279 and Term=44
2022.02.25 12:26:21.470450 [ 26436 ] {} <Debug> RaftInstance: Response back a append_entries_response message to 14 with Accepted=0, Term=44, NextIndex=989314
2022.02.25 12:26:21.470668 [ 26438 ] {} <Debug> RaftInstance: Receive a append_entries_request message from 14 with LastLogIndex=989300, LastLogTerm=42, EntriesLength=3, CommitIndex=989279 and Term=44
2022.02.25 12:26:21.470704 [ 26438 ] {} <Debug> RaftInstance: Response back a append_entries_response message to 14 with Accepted=0, Term=44, NextIndex=989314
2022.02.25 12:26:21.470935 [ 26439 ] {} <Debug> RaftInstance: Receive a append_entries_request message from 14 with LastLogIndex=989299, LastLogTerm=42, EntriesLength=4, CommitIndex=989279 and Term=44
2022.02.25 12:26:21.470976 [ 26439 ] {} <Debug> RaftInstance: [INIT] log_idx: 989300, count: 0, log_store_->next_slot(): 989314, req.log_entries().size(): 4
2022.02.25 12:26:21.471001 [ 26439 ] {} <Debug> RaftInstance: [after SKIP] log_idx: 989300, count: 0
2022.02.25 12:26:21.471013 [ 26439 ] {} <Information> RaftInstance: rollback logs: 989300 - 989313, commit idx req 989279, quick 989285, sm 989285, num log entries 4, current count 0
2022.02.25 12:26:21.471034 [ 26439 ] {} <Information> RaftInstance: rollback log 989313, term 44
2022.02.25 12:26:21.471043 [ 26439 ] {} <Information> RaftInstance: rollback log 989312, term 44
2022.02.25 12:26:21.471051 [ 26439 ] {} <Information> RaftInstance: rollback log 989311, term 43
2022.02.25 12:26:21.471060 [ 26439 ] {} <Information> RaftInstance: rollback log 989310, term 43
2022.02.25 12:26:21.471069 [ 26439 ] {} <Information> RaftInstance: rollback log 989309, term 43
2022.02.25 12:26:21.471077 [ 26439 ] {} <Information> RaftInstance: rollback log 989308, term 43
2022.02.25 12:26:21.471086 [ 26439 ] {} <Information> RaftInstance: rollback log 989307, term 43
2022.02.25 12:26:21.471094 [ 26439 ] {} <Information> RaftInstance: rollback log 989306, term 43
2022.02.25 12:26:21.471102 [ 26439 ] {} <Information> RaftInstance: rollback log 989305, term 43
2022.02.25 12:26:21.471111 [ 26439 ] {} <Information> RaftInstance: rollback log 989304, term 43
2022.02.25 12:26:21.471119 [ 26439 ] {} <Information> RaftInstance: rollback log 989303, term 43
2022.02.25 12:26:21.471128 [ 26439 ] {} <Information> RaftInstance: rollback log 989302, term 43
2022.02.25 12:26:21.471141 [ 26439 ] {} <Information> RaftInstance: rollback log 989301, term 43
2022.02.25 12:26:21.471153 [ 26439 ] {} <Information> RaftInstance: revert from a prev config change to config at 934352
2022.02.25 12:26:21.471161 [ 26439 ] {} <Information> RaftInstance: overwrite at 989300, term 42
2022.02.25 12:26:21.471173 [ 26439 ] {} <Information> LogSegment: Truncating log_1_open_20220225113536, offset 989299, first_index 1, last_index from 989313 to 989299, truncate_size to 449073886

13 accepted the request of 14, indicating that 13 thinks that the logs of the two are the same with idx=989299. They have same term 42, but have different content.

Let's see the 3 leader switches.

image

13 reject all append log request of term 42, so 989299's term should be 41, but actually is 42.

Below is 13 writting log 989299

2022.02.25 12:26:21.216310 [ 26428 ] {} <Debug> RaftInstance: Receive a client_request message from 0 with LastLogIndex=0, LastLogTerm=0, EntriesLength=1, CommitIndex=0 and Term=0
2022.02.25 12:26:21.216473 [ 26426 ] {} <Debug> RaftInstance: Receive a append_entries_response message from peer 15 with Result=0, Term=42, NextIndex=989285
2022.02.25 12:26:21.216723 [ 26426 ] {} <Information> RaftStateManager: save srv_state with term 42 and vote_for -1
2022.02.25 12:26:21.216721 [ 26428 ] {} <Debug> RaftInstance: append at log_idx 989299
2022.02.25 12:26:21.216790 [ 26428 ] {} <Debug> RaftInstance: commit_ret_cv 989299 0x7f476af04d20 sleep

We see 13 updated its term to 42 by append_entries_response and it should become follower. And then it use 42 as 989299's term which leading wrong term of log 989299.

image

So update_term method should put state->set_term() under cli_lick_.

@greensky00
Copy link
Contributor

I'm trying to understand the issue:

  1. Server 13 was a leader at term 41.
  2. Server 14 became a leader at term 42, but Server 13 was thinking it is still the leader.
  3. Server 14 sent some logs to Server 13.
  4. update_term was invoked in Server 13, and called state_->set_term(42).
  5. However, before calling become_follower(), Server 13 received some write requests.
  6. Server 13 wrote logs with the term from state_->get_term(), which was 42.
  7. So log 989299 was written with term 42 in Server 13, whose term should have been 41 actually.

Am I understanding it correctly?

@JackyWoo
Copy link
Contributor Author

JackyWoo commented Mar 1, 2022

  1. Server 13 received append_entry_resp from Server 15, then update_term was invoked
2022.02.25 12:26:21.216473 [ 26426 ] {} <Debug> RaftInstance: Receive a append_entries_response message from peer 15 with Result=0, Term=42, NextIndex=989285
2022.02.25 12:26:21.216723 [ 26426 ] {} <Information> RaftStateManager: save srv_state with term 42 and vote_for -1

@greensky00
Copy link
Contributor

Ok, the outcome is the same in either way. This is really a great catch, thanks!
I think #294 will fix the issue, what do you think?

@JackyWoo
Copy link
Contributor Author

JackyWoo commented Mar 1, 2022

I think we should make set_term and changing role_ atomic both under lock_ and cli_lock_. What do you think of this commit JackyWoo@588513e

@greensky00
Copy link
Contributor

make set_term and changing role_ atomic both under lock_ and cli_lock_

#294 is already doing that.

Note that become_follower can be called without term change, for example,

if (req.get_term() == state_->get_term() && !catching_up_) {
if (role_ == srv_role::candidate) {
become_follower();

so setting a term should not be a part of become_follower.

@JackyWoo
Copy link
Contributor Author

JackyWoo commented Mar 1, 2022

Think about the second diagram. Maybe #294 does not resolve the case. We should put both set_term and changing role_ under cli_lock_

#294 is already doing that.

I mean they should be under a single lock scope.

@greensky00
Copy link
Contributor

https://github.com/greensky00/NuRaft/blob/3afe11976e6a5fa6cd21cd6575857f438a61eef8/src/raft_server.cxx#L1309-L1311

bool raft_server::update_term(ulong term) {
    if (term > state_->get_term()) {
        {
            ...
            std::lock_guard<std::mutex> ll(cli_lock_);
            role_ = srv_role::follower;
            state_->set_term(term);
        }

update_term is always called under lock_, so the above code is always executed under both lock_ and cli_lock_. Can you explain what exactly the problem is?

@JackyWoo
Copy link
Contributor Author

JackyWoo commented Mar 1, 2022

Sorry I missed role_ = srv_role::follower;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants