Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[1.0] Normally process blocks from the forkdb on startup #572

Merged
merged 5 commits into from
Aug 19, 2024

Conversation

heifner
Copy link
Member

@heifner heifner commented Aug 17, 2024

In Spring 1.0.0, unlike Leap, we process blocks into the fork database immediately. This can cause the fork database to grow very large when syncing and shutdown due to hitting the new max-reversible-blocks limit (before #545). When a node is shutdown, in Leap, it was assumed that when restarting a node if that node did not receive any blocks from the network it would be at the same height as when it was shutdown. The existing tests/nodeos_read_terminate_at_block_test.py integration test verifies this behavior.

However, if the node shutdown because of max-reversible-blocks you would like on restart for the node to process blocks out of the fork database potentially allowing LIB to progress and reversible blocks to be consumed and shrink in size. This is at odds with the expected behavior of a node starting up after a terminate-at-block. A user might find it very odd to terminate a node at a block, but find on restart that the node is actually beyond that block.

To fix the issue reported in #565, we would like to process blocks out of the fork database on restart. However, we want to maintain the existing expectations of starting a node after a shutdown via terminate-at-block being at the same block height on restart.

Therefore, this PR modifies nodeos to normally process blocks out of the fork database on startup unless the node has no configured peers. The idea being that if a user has terminated a node with terminate-at-block and wants the node to remain at that block, they will restart the node without any p2p-peer-address configured. This is a bit of a hack until #570 can be added. Since #570 is a new feature, it will not come until a future release and will not be part of Spring 1.0.0.

Resolves #565

…a node that was terminated with many blocks in the forkdb
… attempt to request the next range of block instead of ignoring the request.
…s from the fork database. This allows a block stuck in a condition where it has too many blocks in the forkdb to process new blocks to attempt to apply those blocks on startup.
// terminate-at-block, the current expectation is that the node can be restarted to examine the state at
// which it was shutdown. For now, we will only process these blocks if there are peers configured. This
// is a bit of a hack for Spring 1.0.0 until we can add a proper pause-at-block (issue #570) which could
// be used to explicitly request a node to not process beyond a specified block.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This compromise came after a long conversation with @arhag on potential alternatives.

@@ -2224,6 +2224,8 @@ namespace eosio {
set_state( lib_catchup );
sync_last_requested_num = 0;
sync_next_expected_num = chain_info.lib_num + 1;
} else if (sync_next_expected_num >= sync_last_requested_num) {
// break
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change causes the net_plugin to request more blocks if possible instead of remaining in a mode where it thinks it is already syncing.

if success:
for nodeId, nodeArgs in {**regularNodeosArgs, **replayNodeosArgs}.items():
assert cluster.getNode(nodeId).relaunch(), f"Unable to relaunch {nodeId}"
assert cluster.getNode(nodeId).waitForLibToAdvance(), f"LIB did not advance for {nodeId}"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This extra section on restarting nodes failed before the fixes in this PR.

@heifner heifner requested review from linh2931 and greg7mdp August 17, 2024 19:34
@heifner heifner added the OCI Work exclusive to OCI team label Aug 17, 2024
@heifner heifner linked an issue Aug 17, 2024 that may be closed by this pull request
1 : "--read-mode irreversible --terminate-at-block 100",
2 : "--read-mode head --terminate-at-block 125",
3 : "--read-mode speculative --terminate-at-block 150",
4 : "--read-mode irreversible --terminate-at-block 180"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These block values don't really matter, I could revert to old values if desired. They were changed when I was doing some initial testing to try and reproduce the issue in this test. The changes below are what is needed.

Base automatically changed from GH-528-limit-sync to release/1.0 August 18, 2024 02:34
Copy link
Contributor

@greg7mdp greg7mdp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this PR modifies nodeos to normally process blocks out of the fork database on startup unless the node has no configured peers

Only in irreversible mode, right?

@heifner
Copy link
Member Author

heifner commented Aug 19, 2024

Only in irreversible mode, right?

Also non-irreversible mode:

https://github.com/AntelopeIO/spring/pull/572/files#diff-42e9f97eed543dd784b1b77a536c50ff8e2403d5bbd3b1fb7f5dd90201391061R1988

@ericpassmore
Copy link
Contributor

Note:start
group: STABILITY
category: INTERNALS
summary: Until pause at block height is available this fix enables use of forkdb to advance LIB on startup when connected to peers, but does not read from forkdb on startup when no peers are connected.
Note:end

@heifner heifner requested a review from spoonincode August 19, 2024 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OCI Work exclusive to OCI team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

P2P: Sync mode stuck after hitting max-reversible-blocks, after restart
4 participants