
Fix: correct burn view for miner block broadcast #5515

Open · wants to merge 54 commits into base: develop
Conversation

jcnelson (Member)

This fixes the burn view calculation for the Nakamoto miner when accepting and broadcasting a block.

It builds atop #5508, so don't bother merging until #5508 is in develop.

jcnelson requested a review from a team as a code owner November 27, 2024 20:15
aldur (Contributor) commented Dec 3, 2024

  1. Merge hotfix "remove chain stall race condition" (#5508) to master.
  2. After the release, merge master into develop.
  3. Merge this PR to develop.

jferrant previously approved these changes Dec 4, 2024
jcnelson (Member, Author) commented Dec 5, 2024

This now contains the hotfix to the miner @obycode @jferrant

jferrant (Collaborator) left a comment

Just a few questions.

aldur added this to the 3.1.0.0.2 milestone Dec 11, 2024
Also correct the name of an existing test case.
obycode (Contributor) commented Jan 10, 2025

The miner_forking flakiness in develop has been resolved, and I have verified that the failure here is definitely related to these new changes.

jcnelson (Member, Author) commented

Working on fixing tenure_extend_after_failed_miner

jcnelson (Member, Author) commented

Alright folks, looks like everything is passing now except for Clippy. Thanks a bunch @obycode @jferrant!

obycode previously approved these changes Jan 14, 2025
obycode (Contributor) left a comment

🎉

Comment on lines +374 to +377
if BlockMinerThread::check_burn_view_changed(sortdb, chain_state, burn_block).is_err() {
    // can't continue mining -- burn view changed, or a DB error occurred
    return true;
}
Member commented

Why is this check required?

Doesn't the line below (i.e., cur_burn_chain_tip.consensus_hash != burn_block.consensus_hash) already check this?

jcnelson (Member, Author) commented Jan 14, 2025

No, it doesn't.

BlockMinerThread::check_burn_view_changed() checks the ongoing Stacks tenure's consensus_hash against burn_block.consensus_hash, which may be equal, and also be different from cur_burn_chain_tip.consensus_hash. The burn view can be unchanged, but the burnchain tip may change. For example, if miner A wins sortition N and mines at least one Stacks block, then the ongoing Stacks tenure's consensus_hash would be N.consensus_hash. If sortition N+1 is empty, then (N+1).consensus_hash != N.consensus_hash (the burnchain tip changed) but the ongoing tenure's consensus hash would (still) be N.consensus_hash (no burn view change). So, this extra check is necessary.
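
For illustration, here is a minimal sketch of why the two checks are independent (the function and variable names are hypothetical, not the actual code):

fn checks_are_independent(
    ongoing_tenure_ch: &ConsensusHash, // consensus hash of the ongoing tenure (N above)
    burn_view_ch: &ConsensusHash,      // the miner's burn view (burn_block.consensus_hash)
    burn_tip_ch: &ConsensusHash,       // the canonical burnchain tip (N+1 if it was empty)
) -> (bool, bool) {
    // What check_burn_view_changed() tests: did the ongoing tenure's
    // consensus hash diverge from the miner's burn view?
    let burn_view_changed = ongoing_tenure_ch != burn_view_ch;
    // What the tip comparison tests: did the burnchain tip diverge from
    // the miner's burn view?
    let burn_tip_changed = burn_tip_ch != burn_view_ch;
    // After an empty sortition N+1, burn_tip_changed is true while
    // burn_view_changed is false -- neither check subsumes the other.
    (burn_view_changed, burn_tip_changed)
}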

kantai (Member) commented Jan 14, 2025

burn_block is the sortition that started the miner thread: if the miner thread is an extension, it is equal to the burn view; if the miner thread is a blockfound tenure, it is equal to the election burn block.

In what case could check_burn_view_changed return an error, but cur_burn_chain_tip.consensus_hash == burn_block.consensus_hash be true?

jcnelson (Member, Author) commented

I deleted the previous comment I had here, since it was wrong.

Instead, I spent some more time looking at this function and why it was written the way it was.

The intent of this code path is to get the miner to recognize when the Stacks chain tip's ongoing tenure advances past its own burn view. If this happens, the miner should shut down.

However, I think you're right -- this code appears to be a no-op due to the way self.burn_block and self.burn_election_block are initialized. I'm running CI right now with this function disabled to see if it affects anything.

Comment on lines 63 to +70
 #[cfg(test)]
-pub static TEST_MINE_STALL: std::sync::Mutex<Option<bool>> = std::sync::Mutex::new(None);
+pub static TEST_MINE_STALL: LazyLock<TestFlag<bool>> = LazyLock::new(TestFlag::default);
 #[cfg(test)]
-pub static TEST_BROADCAST_STALL: std::sync::Mutex<Option<bool>> = std::sync::Mutex::new(None);
+pub static TEST_BROADCAST_STALL: LazyLock<TestFlag<bool>> = LazyLock::new(TestFlag::default);
 #[cfg(test)]
-pub static TEST_BLOCK_ANNOUNCE_STALL: std::sync::Mutex<Option<bool>> = std::sync::Mutex::new(None);
+pub static TEST_BLOCK_ANNOUNCE_STALL: LazyLock<TestFlag<bool>> = LazyLock::new(TestFlag::default);
 #[cfg(test)]
-pub static TEST_SKIP_P2P_BROADCAST: std::sync::Mutex<Option<bool>> = std::sync::Mutex::new(None);
+pub static TEST_SKIP_P2P_BROADCAST: LazyLock<TestFlag<bool>> = LazyLock::new(TestFlag::default);
Member commented

Is TestFlag actually nicer here? I'm usually all for using a common type, but using TestFlag introduces the necessity of LazyLock and Arc to a Mutex that would otherwise just be a plain Mutex (i.e., the type is LazyLock<Arc<Mutex<bool>>> instead of just Mutex<bool>). Perhaps, given the prevalence of TestFlag<bool> and TestFlag<Option<bool>> usage, we should just add a Mutex<bool> type as TestBool (because there's no need for LazyLock or Arc for types which have const constructors)?
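
For context, a sketch of the two static shapes being weighed (illustrative only; TestFlag is the project's test helper type, and only one variant would exist at a time):

use std::sync::{LazyLock, Mutex};

// Plain mutex: Mutex::new is const, so no LazyLock (or Arc) is needed.
pub static TEST_MINE_STALL_PLAIN: Mutex<bool> = Mutex::new(false);

// TestFlag as adopted in this PR: effectively a LazyLock<Arc<Mutex<...>>>,
// which needs lazy initialization because Arc::new is not const.
pub static TEST_MINE_STALL_FLAG: LazyLock<TestFlag<bool>> = LazyLock::new(TestFlag::default);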

jcnelson (Member, Author) commented Jan 14, 2025

I think this is a question for @jferrant. As far as I know, the overwhelmingly common use-case for TestFlag is to store either bool or Option<bool>. But considering this code compiles and runs only for tests, I'm not seeing what the gain would be to changing it? No external dependencies are used either way, and serialization would happen on .set() and .get() either way as well.

Member commented

> But considering this code compiles and runs only for tests, I'm not seeing what the gain would be to changing it?

This PR introduced this particular change.

jcnelson (Member, Author) commented

Yes; @jferrant asked me to make this change as part of her review.

pub enum MinerDirective {
    /// The miner won sortition so they should begin a new tenure
    BeginTenure {
        parent_tenure_start: StacksBlockId,
        burnchain_tip: BlockSnapshot,
        late: bool,
Member commented

Can you add a line to the rustdoc for BeginTenure that describes what late is used for and what it indicates?

jcnelson (Member, Author) commented

Sure
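
A possible wording for the requested doc comment (a sketch only; the description of late is an assumption based on the flash-block discussion in this thread):

pub enum MinerDirective {
    /// The miner won sortition so they should begin a new tenure
    BeginTenure {
        parent_tenure_start: StacksBlockId,
        burnchain_tip: BlockSnapshot,
        /// `true` if the miner is starting this tenure "late", i.e., it
        /// learned of its sortition win only after a subsequent burnchain
        /// block (such as a flash block) had already arrived.
        late: bool,
    },
}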

@@ -102,28 +110,27 @@ struct ParentStacksBlockInfo {
 #[derive(PartialEq, Clone, Debug)]
 pub enum MinerReason {
     /// The miner thread was spawned to begin a new tenure
-    BlockFound,
+    BlockFound { late: bool },
Member commented

Same as above (describe late in the rustdoc).

jcnelson (Member, Author) commented

Will do.

Comment on lines +698 to +703
let parent_block_info =
    NakamotoChainState::get_block_header(chain_state.db(), &block.header.parent_block_id)?
        .ok_or_else(|| ChainstateError::NoSuchBlockError)?;
let burn_view_ch =
    NakamotoChainState::get_block_burn_view(sort_db, &block, &parent_block_info)?;
let mut sortition_handle = sort_db.index_handle_at_ch(&burn_view_ch)?;
Member commented

Does this mean that NakamotoChainState::accept_block expects the sortition_handle to point at the burn view block? Or does accept_block want its sortition_handle to point at the canonical sortition tip, and the burn view block is just closer to that than block.header.consensus_hash is?

If accept_block() would take just the canonical sortition tip, I think it'd be better to just pass that instead -- that way we don't need to recalculate the burn view block here.

jcnelson (Member, Author) commented

> Does this mean that NakamotoChainState::accept_block expects the sortition_handle to point at the burn view block?

Yes. This is because this is the BurnStateDB that ultimately gets passed to the Clarity VM. The Clarity VM expects that this BurnStateDB is "opened" to the sortition tip identified by the Stacks block being processed, since it queries the highest sortition relative from this BurnStateDB in certain places (such as in ClarityDB::get_current_burnchain_block_height()). So, we must open the sortition_handle to the burn view of the block, not the canonical tip.

Member commented

> This is because this is the BurnStateDB that ultimately gets passed to the Clarity VM.

I don't think that's the case. accept_block just performs the acceptance checks (like when downloading the block, and the nakamoto block downloader sets this to the canonical burn tip, not the burn view). It's not until the chains coordinator thread attempts to process the block that it sets the sortition handle for the Clarity VM.

jcnelson (Member, Author) commented

Ah, yes, you're right. I suppose it doesn't matter either way, then.

testnet/stacks-node/src/nakamoto_node/miner.rs (outdated; resolved)
"nakamoto_burn_view" => %ongoing_tenure_id.burn_view_consensus_hash,
"miner_burn_view" => %burn_view.consensus_hash);

return Err(NakamotoNodeError::BurnchainTipChanged);
Member commented

Is this line covered in the integration tests?

jcnelson (Member, Author) commented

Yes, indirectly. If I comment it out, then partial_tenure_fork fails.

jcnelson (Member, Author) commented

Actually, I think partial_tenure_fork is just flaky. It passes even if this entire function is commented out.

@@ -94,6 +94,8 @@ const DEFAULT_FIRST_REJECTION_PAUSE_MS: u64 = 5_000;
const DEFAULT_SUBSEQUENT_REJECTION_PAUSE_MS: u64 = 10_000;
const DEFAULT_BLOCK_COMMIT_DELAY_MS: u64 = 20_000;
const DEFAULT_TENURE_COST_LIMIT_PER_BLOCK_PERCENTAGE: u8 = 25;
const DEFAULT_TENURE_EXTEND_WAIT_SECS: u64 = 30;
Member commented

This seems pretty aggressive to me. I think what this means is that miners will attempt to extend even if the next sortition has a valid winner after 30 seconds. What percentage of tenures mine their first block within this time period (and have the block processed by a follower node)?

Member commented

The relevant value in the signer set defaults to 600 seconds: https://github.com/stacks-network/stacks-core/blob/master/stacks-signer/src/config.rs#L37

Contributor commented

Ah right, sorry for the confusion earlier. I was thinking first_proposal_burn_block_timing_secs was the relevant config here, which defaults to 60s, but you're right.

jcnelson (Member, Author) commented

> I think what this means is that miners will attempt to extend even if the next sortition has a valid winner after 30 seconds.

I'm happy to change the timeout, but it doesn't prevent dueling miners from arising. It's possible that in $TIMEOUT + 1 seconds, the winning miner comes online and signers reject it.

We all good with 600 seconds?

Member commented

My preference would be to remove this behavior from this PR; then we should open an issue describing this behavior, and we can go from there. I have concerns about the particular timeout chosen, and maybe other concerns as well, and I think it's worth separating this behavior from the fix that this PR is trying to address.

jcnelson (Member, Author) commented Jan 14, 2025

To clarify, which behavior do you want to remove from the PR?

Comment on lines +466 to +469
// we can continue our ongoing tenure, but we should give the new winning miner
// a chance to send their BlockFound first.
debug!("Relayer: Did not win sortition, but am mining the ongoing tenure. Allowing the new miner some time to come online before trying to continue.");
self.tenure_extend_timeout = Some(Instant::now());
Member commented

Is this behavior desirable as part of this PR?

My understanding of this PR (from the description at least) is that it is attempting to address the specific case:

  1. BTC block 1 occurs with miner A winning
  2. BTC block 2 occurs before miner A gets a proposal out (i.e., a flash block) -- there's no winner of block 2.
  3. Miner A should wake their thread, produce a tenure in BTC block 1 with just a coinbase block
  4. Miner A should then create an extension thread.

But the behavior that I'm commenting on is something else -- it introduces contention between the old miner and the new miner and relies on the signer set to resolve that contention. I think that the signer set can handle this, but it seems unwise to make a default miner behavior which would cause them to produce conflicting proposals in many cases.

Member commented

Also, should this be Instant::now() or Instant::now() + config.tenure_extend_timeout?

jcnelson (Member, Author) commented Jan 14, 2025

> My understanding of this PR (from the description at least) is that it is attempting to address the specific case:

That is one thing addressed by this PR, but not the only thing. In general, the new code implements heuristics for the miner to start a tenure-extend. For example, the winner in BTC block 2 may also fail to produce a tenure-change block, in which case miner A would only need to issue a tenure-extend.

> But the behavior that I'm commenting on is something else -- it introduces contention between the old miner and the new miner and relies on the signer set to resolve that contention. I think that the signer set can handle this, but it seems unwise to make a default miner behavior which would cause them to produce conflicting proposals in many cases.

The code does attempt to cause the miner to shut down if it detects that signers have signed off on blocks produced by the winner of BTC block 2 that this miner simply hasn't seen (this is handled in check_burn_view_changed). However, there's no real way to stop dueling miners from arising -- the system's safety ultimately depends on signers' ability to coalesce around one winning miner.

> Also, should this be Instant::now() or Instant::now() + config.tenure_extend_timeout?

No, because this is how self.tenure_extend_timeout is used (in try_continue_tenure()). Note the use of elapsed().

let deadline_passed = self
    .tenure_extend_timeout
    .map(|tenure_extend_timeout| {
        let deadline_passed =
            tenure_extend_timeout.elapsed() > self.config.miner.tenure_extend_wait_secs;
        if !deadline_passed {
            test_debug!(
                "Relayer: will not try to tenure-extend yet ({} <= {})",
                tenure_extend_timeout.elapsed().as_secs(),
                self.config.miner.tenure_extend_wait_secs.as_secs()
            );
        }
        deadline_passed
    })
    .unwrap_or(false);

if !deadline_passed {
    return;
}

Member commented

> No, because this is how self.tenure_extend_timeout is used (in try_continue_tenure()). Note the use of elapsed().

Does that mean that the tenure extend timeout also controls the behavior for empty sortitions?

jcnelson (Member, Author) commented

> Does that mean that the tenure extend timeout also controls the behavior for empty sortitions?

Yes, because the remedy for an empty sortition and for a crashed miner that fails to produce blocks is the same -- the last active miner tries to issue a tenure extension after a timeout.

The timeout for recovering from an empty sortition could be set to 0, but that's a refinement of the above behavior.

obycode (Contributor) commented Jan 14, 2025

But that would be a regression from the existing behavior. Currently, the miner will immediately extend when there is no sortition or if the winning miner has committed to the wrong tenure (and is thus unable to mine a valid block).

jcnelson (Member, Author) commented

I can make it so that the node will immediately mine an extension in these cases (btw, there's no test coverage for that).
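
A hypothetical sketch of that special case (the sortition_is_empty flag and surrounding names are assumptions), in which an empty sortition bypasses the wait so the immediate-extend behavior is preserved:

// Hypothetical: extend immediately on an empty sortition, matching the
// pre-PR behavior; otherwise honor the configured tenure-extend wait.
let wait = if sortition_is_empty {
    std::time::Duration::ZERO
} else {
    self.config.miner.tenure_extend_wait_secs
};
if tenure_extend_timeout.elapsed() <= wait {
    // deadline not yet passed -- don't try to tenure-extend yet
    return;
}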

Labels: none yet
Projects: 💻 In Progress
5 participants