Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix nodes panic during synchronization #5081

Merged
merged 1 commit into from
Dec 18, 2024
Merged

Conversation

pfi79
Copy link
Contributor

@pfi79 pfi79 commented Dec 18, 2024

#5076

What's going on?

During synchronisation, the synchroniser reads blocks from another node and writes them to itself.
Block writes are started asynchronously, and we have no control over when this goroutine will actually be executed.
Without waiting for the block to be written, the synchroniser proceeds to the next step.
When finished, the synchroniser exits and the code immediately detects that the node has not reached altitude and enters a new synchronisation cycle.

The startHeight := s.Support.Height() method is executed as the start of the range to synchronise. But it happens that the previous block has not been written yet (but will be soon) and we choose the number of the block that is about to be written as the start. Thus it is possible to write two identical blocks in a row.

Proposal
Set block recording to synchronous mode

@pfi79 pfi79 requested a review from a team as a code owner December 18, 2024 17:15
@yacovm
Copy link
Contributor

yacovm commented Dec 18, 2024

Thanks for looking into this.

  1. What about a unit test that reproduces the problem?
  2. I don't understand why we need such a complex solution, when we can just write the blocks synchronously?

@pfi79
Copy link
Contributor Author

pfi79 commented Dec 18, 2024

2. I don't understand why we need such a complex solution, when we can just write the blocks synchronously?

I was thinking about synchronous writing, but as I think the idea here was to parallele getting the next block from puller and writing the current one to state. That is a time saver.
But if other maintainers agree to convert to synchronous recording, I will.

@pfi79
Copy link
Contributor Author

pfi79 commented Dec 18, 2024

  1. What about a unit test that reproduces the problem?

I'm at work on a test.
Was the first to throw a pr with changes to get corrected right away.

@yacovm
Copy link
Contributor

yacovm commented Dec 18, 2024

  1. I don't understand why we need such a complex solution, when we can just write the blocks synchronously?

I was thinking about synchronous writing, but as I think the idea here was to parallele getting the next block from puller and writing the current one to state. That is a time saver. But if other maintainers agree to convert to synchronous recording, I will.

we do get the next block from the remote node though, whether you are doing a sync write or an async one.

Because when you retrieve a block from the sync buffer, a new block is fed into it.

@pfi79
Copy link
Contributor Author

pfi79 commented Dec 18, 2024

we do get the next block from the remote node though, whether you are doing a sync write or an async one.

Because when you retrieve a block from the sync buffer, a new block is fed into it.

That's right. I'll redo it.

@pfi79 pfi79 force-pushed the fix-sync-bft branch 2 times, most recently from fdb6cf8 to 315657f Compare December 18, 2024 18:33
@yacovm yacovm merged commit 131f9bc into hyperledger:main Dec 18, 2024
15 checks passed
@yacovm
Copy link
Contributor

yacovm commented Dec 18, 2024

I know that you didn't provide a test, but I am perfectly fine with using the sync block writing when we synchronize.

@pfi79 pfi79 deleted the fix-sync-bft branch December 18, 2024 19:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants