Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

commit_ret_elem leak when client timeout #295

Closed
JackyWoo opened this issue Mar 2, 2022 · 5 comments · Fixed by #296
Closed

commit_ret_elem leak when client timeout #295

JackyWoo opened this issue Mar 2, 2022 · 5 comments · Fixed by #296

Comments

@JackyWoo
Copy link
Contributor

JackyWoo commented Mar 2, 2022

Descripe the issue

NuRaft commit 0073f55 commited at 2021.11.19

When node become follower it will clear commit_ret_elem. Below log shows there 28 elements but there were only 15 clients responses. Why ?

2022.02.25 12:26:21.454944 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989271, waited 234473 us
2022.02.25 12:26:21.454971 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989272, waited 234156 us
2022.02.25 12:26:21.454994 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989273, waited 234117 us
2022.02.25 12:26:21.455003 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989274, waited 234114 us
2022.02.25 12:26:21.455023 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989275, waited 234090 us
2022.02.25 12:26:21.455047 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989276, waited 233895 us
2022.02.25 12:26:21.455065 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989277, waited 233886 us
2022.02.25 12:26:21.455073 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989278, waited 233881 us
2022.02.25 12:26:21.455082 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989279, waited 233871 us
2022.02.25 12:26:21.455090 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989280, waited 233819 us
2022.02.25 12:26:21.455098 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989281, waited 233787 us
2022.02.25 12:26:21.455108 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989282, waited 233513 us
2022.02.25 12:26:21.455117 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989283, waited 233486 us
2022.02.25 12:26:21.455125 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989284, waited 233445 us
2022.02.25 12:26:21.455140 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989285, waited 233340 us
2022.02.25 12:26:21.455154 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989301, waited 1262 us
2022.02.25 12:26:21.455165 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989302, waited 1182 us
2022.02.25 12:26:21.455187 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989303, waited 1116 us
2022.02.25 12:26:21.455209 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989304, waited 1058 us
2022.02.25 12:26:21.455220 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989305, waited 997 us
2022.02.25 12:26:21.455246 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989306, waited 966 us
2022.02.25 12:26:21.455269 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989307, waited 921 us
2022.02.25 12:26:21.455292 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989308, waited 876 us
2022.02.25 12:26:21.455313 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989309, waited 800 us
2022.02.25 12:26:21.455323 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989310, waited 703 us
2022.02.25 12:26:21.455333 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989311, waited 617 us
2022.02.25 12:26:21.455344 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989312, waited 527 us
2022.02.25 12:26:21.455366 [ 26417 ] {} <Debug> RaftInstance: cancelled blocking client request 989313, waited 455 us
2022.02.25 12:26:21.455386 [ 26417 ] {} <Warning> RaftInstance: cancelled 28 blocking client requests from 989271 to 989313.
2022.02.25 12:26:21.455424 [ 26740 ] {} <Debug> RaftInstance: remaining elems in waiting queue: 0


2022.02.25 12:26:21.455467 [ 26740 ] {} <Warning> RaftInstance: [NOT OK] commit_ret_cv 989301 wake up (1530 us), return value (nil), result code -1
2022.02.25 12:26:21.455546 [ 26738 ] {} <Warning> RaftInstance: [NOT OK] commit_ret_cv 989302 wake up (1536 us), return value (nil), result code -1
2022.02.25 12:26:21.455625 [ 26733 ] {} <Warning> RaftInstance: [NOT OK] commit_ret_cv 989304 wake up (1464 us), return value (nil), result code -1
2022.02.25 12:26:21.455612 [ 26741 ] {} <Warning> RaftInstance: [NOT OK] commit_ret_cv 989303 wake up (1498 us), return value (nil), result code -1
2022.02.25 12:26:21.455768 [ 26407 ] {} <Warning> RaftInstance: [NOT OK] commit_ret_cv 989305 wake up (1447 us), return value (nil), result code -1
2022.02.25 12:26:21.455851 [ 26737 ] {} <Warning> RaftInstance: [NOT OK] commit_ret_cv 989306 wake up (1514 us), return value (nil), result code -1
2022.02.25 12:26:21.455920 [ 26409 ] {} <Warning> RaftInstance: [NOT OK] commit_ret_cv 989307 wake up (1532 us), return value (nil), result code -1
2022.02.25 12:26:21.455962 [ 26736 ] {} <Warning> RaftInstance: [NOT OK] commit_ret_cv 989308 wake up (1526 us), return value (nil), result code -1
2022.02.25 12:26:21.456067 [ 26406 ] {} <Warning> RaftInstance: [NOT OK] commit_ret_cv 989309 wake up (1507 us), return value (nil), result code -1
2022.02.25 12:26:21.456099 [ 26408 ] {} <Warning> RaftInstance: [NOT OK] commit_ret_cv 989310 wake up (1460 us), return value (nil), result code -1
2022.02.25 12:26:21.456231 [ 26732 ] {} <Warning> RaftInstance: [NOT OK] commit_ret_cv 989313 wake up (1276 us), return value (nil), result code -1
2022.02.25 12:26:21.456186 [ 26734 ] {} <Warning> RaftInstance: [NOT OK] commit_ret_cv 989312 wake up (1286 us), return value (nil), result code -1
2022.02.25 12:26:21.456301 [ 26739 ] {} <Warning> RaftInstance: [NOT OK] commit_ret_cv 989311 wake up (1539 us), return value (nil), result code -1

Analysis

When commit log leader will add unfound entry to commit_ret_elems in case of commit faster. But what if NuRaft is slow and client timeout first, NuRaft will add anymore which may lead commit_ret_elem leak.

Part of my log lost, so the analysis is more about guess.

How to fix

If the guess is right, maybe 2 resolution

  1. if client timeout, do not remove commit_ret_elem may resove the issue.
  2. Adding a timeout commit_ret_elem log index set in memory to avoid re-add.
@greensky00
Copy link
Contributor

greensky00 commented Mar 3, 2022

Yes, it seems requests timed out due to slow NuRaft, but even though the server couldn't commit logs in time, it didn't lose the leadership and eventually succeeded in committing them.

To me, option 1 is the right solution, as commit_ret_elem will be eventually removed

  • by the commit thread if the server didn't lose the leadership, or
  • when the server loses the leadership.

By the way, just curious, what is your heartbeat interval, election timer lower/upper bound, and client timeout limit?

@JackyWoo
Copy link
Contributor Author

JackyWoo commented Mar 3, 2022

Thanks. I will try to fix the issue today.

I noticed the case when I researched the previous issue #293 . I reduced the timeout and interval value to reproduce it.

heart_beat_interval_ms 40
election_timeout_lower_bound_ms 50
election_timeout_upper_bound_ms 90
client_req_timeout_ 3000

@JackyWoo
Copy link
Contributor Author

JackyWoo commented Mar 3, 2022

@greensky00 May I ask when you are available for NuRaft in a day?

@greensky00
Copy link
Contributor

I'm working on this repo on-demand basis when I have extra time. There is no fixed time slot.

@JackyWoo
Copy link
Contributor Author

JackyWoo commented Mar 3, 2022

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants