IF: Add the beginning of a savanna disaster recovery test #54

Merged: 15 commits, Apr 25, 2024


@heifner (Member) commented Apr 19, 2024

Integration test with 4 finalizers (A, B, C, and D).

  • The 4 nodes are cleanly shut down in the following state:
    • A has LIB N. A has a finalizer safety information file that locks on a block after N.
    • B, C, and D have LIB less than N. They have finalizer safety information files that lock on N.

All nodes but A lose their reversible blocks and restart from an earlier snapshot.

A is restarted and replays up to block N after restarting from snapshot. Block N is sent to the other
nodes B, C, and D after they are also started up again.

Verify that LIB advances and that A, B, C, and D are eventually voting strong on new blocks.
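The shutdown state described above can be written down as a precondition check. This is a hypothetical model, not code from the PR: `ShutdownState`, `check_preconditions`, and the field names are illustrative. A real test would have to derive LIB and the lock block from the running nodes and their finalizer safety information files, and the numbers in the usage lines are made up.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ShutdownState:
    """Hypothetical snapshot of one finalizer node at clean shutdown."""
    name: str
    lib: int       # last irreversible block number at shutdown
    fsi_lock: int  # block the finalizer safety information file locks on


def check_preconditions(a: ShutdownState,
                        others: Sequence[ShutdownState],
                        n: int) -> None:
    """Assert the A/B/C/D shutdown state described in the PR."""
    # A has LIB N, and its safety file locks on a block after N.
    assert a.lib == n, f"{a.name}: expected LIB {n}, got {a.lib}"
    assert a.fsi_lock > n, f"{a.name}: lock {a.fsi_lock} is not after {n}"
    # B, C, and D have LIB below N, and their safety files lock on N.
    for node in others:
        assert node.lib < n, f"{node.name}: LIB {node.lib} is not below {n}"
        assert node.fsi_lock == n, f"{node.name}: lock {node.fsi_lock} != {n}"


# Illustrative values only.
n = 100
a = ShutdownState("A", lib=n, fsi_lock=n + 2)
others = [ShutdownState(x, lib=n - 1, fsi_lock=n) for x in "BCD"]
check_preconditions(a, others, n)  # raises AssertionError if the state is wrong
```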

@heifner linked an issue Apr 19, 2024 that may be closed by this pull request
@heifner added the OCI Work exclusive to OCI team label Apr 19, 2024
@heifner requested review from linh2931 and greg7mdp April 19, 2024 12:50
assert not node2.verifyAlive(), "Node2 did not shutdown"
assert not node3.verifyAlive(), "Node3 did not shutdown"

# node0 will have higher lib than 1,2,3 since it can incorporate QCs in blocks
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yet we can't use waitForLibToAdvance() at line 87?

Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would require capturing the LIB above and waiting for LIB+1, but there are inherent race conditions between the get_info calls and where we are in the test. Waiting for head to advance should be sufficient.
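A minimal sketch of the "capture, then wait for +1" approach under discussion, with a hypothetical `get_value` callable standing in for a get_info-backed query of LIB or head. It shows where the race sits: the baseline is only as fresh as the call that produced it, so an advance that happened just before the capture is already folded in, and the helper then needs one more advance beyond it. `wait_for_advance` is an illustrative name, not the harness's `waitForLibToAdvance()`.

```python
import time
from typing import Callable


def wait_for_advance(get_value: Callable[[], int],
                     timeout: float = 5.0,
                     poll_interval: float = 0.01) -> bool:
    """Capture a baseline, then poll until the value exceeds it.

    Race note: the baseline is read when this function is called. If the
    value advanced just before that read, the advance is already part of
    the baseline and this helper waits for a *further* advance.
    """
    baseline = get_value()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_value() > baseline:
            return True
        time.sleep(poll_interval)
    return False


# Usage with a fake source whose value advances on the third read.
reads = iter([10, 10, 11])
print(wait_for_advance(lambda: next(reads)))  # True
```

The same helper works for head or LIB; the thread's point is that the condition chosen must be one that is still guaranteed to become true at that step of the test.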

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if "Node::kill" returned the LIB from right before the node is killed, so we could verify the assertion that node0 has a higher LIB than nodes 1, 2, and 3 when it is killed.

Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would require reporting it in a log statement at exit, or using leap-util to inspect the block log.

Contributor

@greg7mdp commented Apr 23, 2024

Can't we do this in Node.py?
currentLib = self.getIrreversibleBlockNum()

It looks like that does a get_info call.

Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but LIB can change immediately after that call or right before that call.

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but still it is the best indication of what lib was right before the node is killed.
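A sketch of the suggestion in this thread: read LIB immediately before killing a node and return it, so the test can at least log or loosely compare the last observed values. `FakeNode`, `kill_and_report_lib`, and the `kill(sig)` signature are hypothetical stand-ins, not the harness's `Node` API; only the method name `getIrreversibleBlockNum` comes from the thread. As the author notes, the value is best effort, because LIB can change between the read and the kill.

```python
import signal
from dataclasses import dataclass


@dataclass
class FakeNode:
    """Hypothetical stand-in for a test-harness node."""
    lib: int
    alive: bool = True

    def getIrreversibleBlockNum(self) -> int:
        # Per the thread, the real method is backed by a get_info call.
        return self.lib

    def kill(self, sig: int) -> None:
        # A real node would be sent `sig`; the fake just marks itself dead.
        self.alive = False


def kill_and_report_lib(node: FakeNode, sig: int = signal.SIGTERM) -> int:
    """Return the last LIB observed before killing the node (best effort)."""
    last_lib = node.getIrreversibleBlockNum()  # may already be stale
    node.kill(sig)
    return last_lib


# Illustrative values only.
node0, others = FakeNode(lib=105), [FakeNode(lib=103) for _ in range(3)]
lib0 = kill_and_report_lib(node0)
other_libs = [kill_and_report_lib(n) for n in others]
print(lib0 > max(other_libs))  # True
```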

@heifner merged commit dcda785 into savanna Apr 25, 2024
10 checks passed
@heifner deleted the GH-13-disaster-test branch April 25, 2024 11:20
@ericpassmore (Contributor) commented Apr 30, 2024

Note:start
group: IF
category: TEST
summary: Disaster recovery test with four finalizers. Ensure block N on one node can be recovered after losing reversible blocks and starting from a snapshot.
Note:end

Labels
OCI Work exclusive to OCI team
Development

Successfully merging this pull request may close these issues.

Disaster recovery integration test