
Fix: correct burn view for miner block broadcast #5515

Open · wants to merge 54 commits into base: develop
Conversation

jcnelson (Member)

This fixes the burn view calculation for the Nakamoto miner when accepting and broadcasting a block.

It builds atop #5508, so don't bother merging until #5508 is in develop.

jcnelson requested a review from a team as a code owner November 27, 2024 20:15
aldur (Contributor) commented Dec 3, 2024

  1. Merge hotfix "remove chain stall race condition" (#5508) to master.
  2. After the release, merge master into develop.
  3. Merge this PR to develop.

jferrant previously approved these changes Dec 4, 2024
jcnelson (Member, Author) commented Dec 5, 2024

This now contains the hotfix to the miner @obycode @jferrant

jferrant (Collaborator) left a comment

Just a few questions.

aldur added this to the 3.1.0.0.2 milestone Dec 11, 2024
Also correct the name of an existing test case.
obycode (Contributor) commented Jan 10, 2025

The miner_forking flakiness in develop has been resolved, and I have verified that the failure here is definitely related to these new changes.

jcnelson (Member, Author) commented

Working on fixing tenure_extend_after_failed_miner

jcnelson (Member, Author) commented

Alright folks, looks like everything is passing now except for Clippy. Thanks a bunch @obycode @jferrant!

obycode previously approved these changes Jan 14, 2025
obycode (Contributor) left a comment

🎉

Comment on lines +374 to +377
if BlockMinerThread::check_burn_view_changed(sortdb, chain_state, burn_block).is_err() {
    // can't continue mining -- burn view changed, or a DB error occurred
    return true;
}
Member commented

Why is this check required?

Doesn't the line below (i.e., cur_burn_chain_tip.consensus_hash != burn_block.consensus_hash) already check this?

jcnelson (Member, Author) commented Jan 14, 2025

No, it doesn't.

BlockMinerThread::check_burn_view_changed() checks the ongoing Stacks tenure's consensus_hash against burn_block.consensus_hash, which may be equal, and also be different from cur_burn_chain_tip.consensus_hash. The burn view can be unchanged, but the burnchain tip may change. For example, if miner A wins sortition N and mines at least one Stacks block, then the ongoing Stacks tenure's consensus_hash would be N.consensus_hash. If sortition N+1 is empty, then (N+1).consensus_hash != N.consensus_hash (the burnchain tip changed) but the ongoing tenure's consensus hash would (still) be N.consensus_hash (no burn view change). So, this extra check is necessary.
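
For illustration, here is a minimal sketch of why the two checks are independent (the function and variable names are hypothetical, not the actual code):

fn checks_are_independent(
    ongoing_tenure_ch: &ConsensusHash, // consensus hash of the ongoing tenure (N above)
    burn_view_ch: &ConsensusHash,      // the miner's burn view (burn_block.consensus_hash)
    burn_tip_ch: &ConsensusHash,       // the canonical burnchain tip (N+1 if it was empty)
) -> (bool, bool) {
    // What check_burn_view_changed() tests: did the ongoing tenure's
    // consensus hash diverge from the miner's burn view?
    let burn_view_changed = ongoing_tenure_ch != burn_view_ch;
    // What the tip comparison tests: did the burnchain tip diverge from
    // the miner's burn view?
    let burn_tip_changed = burn_tip_ch != burn_view_ch;
    // After an empty sortition N+1, burn_tip_changed is true while
    // burn_view_changed is false -- neither check subsumes the other.
    (burn_view_changed, burn_tip_changed)
}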

kantai (Member) commented Jan 14, 2025

burn_block is the sortition that started the miner thread: if the miner thread is an extension, it is equal to the burn view; if the miner thread is a blockfound tenure, it is equal to the election burn block.

In what case could check_burn_view_changed return an error, but cur_burn_chain_tip.consensus_hash == burn_block.consensus_hash be true?

jcnelson (Member, Author) commented

I deleted the previous comment I had here, since it was wrong.

Instead, I spent some more time looking at this function and why it was written the way it was.

The intent of this code path is to get the miner to recognize when the Stacks chain tip's ongoing tenure advances past its own burn view. If this happens, the miner should shut down.

However, I think you're right -- this code appears to be a no-op due to the way self.burn_block and self.burn_election_block are initialized. I'm running CI right now with this function disabled to see if it affects anything.

Comment on lines 63 to +70
 #[cfg(test)]
-pub static TEST_MINE_STALL: std::sync::Mutex<Option<bool>> = std::sync::Mutex::new(None);
+pub static TEST_MINE_STALL: LazyLock<TestFlag<bool>> = LazyLock::new(TestFlag::default);
 #[cfg(test)]
-pub static TEST_BROADCAST_STALL: std::sync::Mutex<Option<bool>> = std::sync::Mutex::new(None);
+pub static TEST_BROADCAST_STALL: LazyLock<TestFlag<bool>> = LazyLock::new(TestFlag::default);
 #[cfg(test)]
-pub static TEST_BLOCK_ANNOUNCE_STALL: std::sync::Mutex<Option<bool>> = std::sync::Mutex::new(None);
+pub static TEST_BLOCK_ANNOUNCE_STALL: LazyLock<TestFlag<bool>> = LazyLock::new(TestFlag::default);
 #[cfg(test)]
-pub static TEST_SKIP_P2P_BROADCAST: std::sync::Mutex<Option<bool>> = std::sync::Mutex::new(None);
+pub static TEST_SKIP_P2P_BROADCAST: LazyLock<TestFlag<bool>> = LazyLock::new(TestFlag::default);
Member commented

Is TestFlag actually nicer here? I'm usually all for using a common type, but using TestFlag introduces the necessity of LazyLock and Arc to a Mutex that would otherwise just be a plain Mutex (i.e., the type is LazyLock<Arc<Mutex<bool>>> instead of just Mutex<bool>). Perhaps, given the prevalence of TestFlag<bool> and TestFlag<Option<bool>> usage, we should just add a Mutex<bool> type as TestBool (because there's no need for LazyLock or Arc for types which have const constructors)?
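
For context, a sketch of the two static shapes being weighed (illustrative only; TestFlag is the project's test helper type, and only one variant would exist at a time):

use std::sync::{LazyLock, Mutex};

// Plain mutex: Mutex::new is const, so no LazyLock (or Arc) is needed.
pub static TEST_MINE_STALL_PLAIN: Mutex<bool> = Mutex::new(false);

// TestFlag as adopted in this PR: effectively a LazyLock<Arc<Mutex<...>>>,
// which needs lazy initialization because Arc::new is not const.
pub static TEST_MINE_STALL_FLAG: LazyLock<TestFlag<bool>> = LazyLock::new(TestFlag::default);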

jcnelson (Member, Author) commented Jan 14, 2025

I think this is a question for @jferrant. As far as I know, the overwhelmingly common use-case for TestFlag is to store either bool or Option<bool>. But considering this code compiles and runs only for tests, I'm not seeing what the gain would be to changing it? No external dependencies are used either way, and serialization would happen on .set() and .get() either way as well.

Member commented

> But considering this code compiles and runs only for tests, I'm not seeing what the gain would be to changing it?

This PR introduced this particular change.

jcnelson (Member, Author) commented

Yes; @jferrant asked me to make this change as part of her review.

pub enum MinerDirective {
    /// The miner won sortition so they should begin a new tenure
    BeginTenure {
        parent_tenure_start: StacksBlockId,
        burnchain_tip: BlockSnapshot,
        late: bool,
Member commented

Can you add a line to the rustdoc for BeginTenure that describes what late is used for and what it indicates?

jcnelson (Member, Author) commented

Sure
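
A possible wording for the requested doc comment (a sketch only; the description of late is an assumption based on the flash-block discussion in this thread):

pub enum MinerDirective {
    /// The miner won sortition so they should begin a new tenure
    BeginTenure {
        parent_tenure_start: StacksBlockId,
        burnchain_tip: BlockSnapshot,
        /// `true` if the miner is starting this tenure "late", i.e., it
        /// learned of its sortition win only after a subsequent burnchain
        /// block (such as a flash block) had already arrived.
        late: bool,
    },
}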

@@ -102,28 +110,27 @@ struct ParentStacksBlockInfo {
 #[derive(PartialEq, Clone, Debug)]
 pub enum MinerReason {
     /// The miner thread was spawned to begin a new tenure
-    BlockFound,
+    BlockFound { late: bool },
Member commented

Same as above (describe late in the rustdoc).

jcnelson (Member, Author) commented

Will do.

Comment on lines +698 to +703
let parent_block_info =
    NakamotoChainState::get_block_header(chain_state.db(), &block.header.parent_block_id)?
        .ok_or_else(|| ChainstateError::NoSuchBlockError)?;
let burn_view_ch =
    NakamotoChainState::get_block_burn_view(sort_db, &block, &parent_block_info)?;
let mut sortition_handle = sort_db.index_handle_at_ch(&burn_view_ch)?;
Member commented

Does this mean that NakamotoChainState::accept_block expects the sortition_handle to point at the burn view block? Or does accept_block want its sortition_handle to point at the canonical sortition tip, and the burn view block is just closer to that than block.header.consensus_hash is?

If accept_block() would take just the canonical sortition tip, I think it'd be better to just pass that instead -- that way we don't need to recalculate the burn view block here.

jcnelson (Member, Author) commented

> Does this mean that NakamotoChainState::accept_block expects the sortition_handle to point at the burn view block?

Yes. This is because this is the BurnStateDB that ultimately gets passed to the Clarity VM. The Clarity VM expects that this BurnStateDB is "opened" to the sortition tip identified by the Stacks block being processed, since it queries the highest sortition relative from this BurnStateDB in certain places (such as in ClarityDB::get_current_burnchain_block_height()). So, we must open the sortition_handle to the burn view of the block, not the canonical tip.

Member commented

> This is because this is the BurnStateDB that ultimately gets passed to the Clarity VM.

I don't think that's the case. accept_block just performs the acceptance checks (like when downloading the block, and the nakamoto block downloader sets this to the canonical burn tip, not the burn view). It's not until the chains coordinator thread attempts to process the block that it sets the sortition handle for the Clarity VM.

jcnelson (Member, Author) commented

Ah, yes, you're right. I suppose it doesn't matter either way, then.

testnet/stacks-node/src/nakamoto_node/miner.rs (outdated; resolved)
"nakamoto_burn_view" => %ongoing_tenure_id.burn_view_consensus_hash,
"miner_burn_view" => %burn_view.consensus_hash);

return Err(NakamotoNodeError::BurnchainTipChanged);
Member commented

Is this line covered in the integration tests?

jcnelson (Member, Author) commented

Yes, indirectly. If I comment it out, then partial_tenure_fork fails.

jcnelson (Member, Author) commented

Actually, I think partial_tenure_fork is just flaky. It passes even if this entire function is commented out.

@@ -94,6 +94,8 @@ const DEFAULT_FIRST_REJECTION_PAUSE_MS: u64 = 5_000;
const DEFAULT_SUBSEQUENT_REJECTION_PAUSE_MS: u64 = 10_000;
const DEFAULT_BLOCK_COMMIT_DELAY_MS: u64 = 20_000;
const DEFAULT_TENURE_COST_LIMIT_PER_BLOCK_PERCENTAGE: u8 = 25;
const DEFAULT_TENURE_EXTEND_WAIT_SECS: u64 = 30;
Member commented

This seems pretty aggressive to me. I think what this means is that miners will attempt to extend even if the next sortition has a valid winner after 30 seconds. What percentage of tenures mine their first block within this time period (and have the block processed by a follower node)?

Member commented

The relevant value in the signer set defaults to 600 seconds: https://github.com/stacks-network/stacks-core/blob/master/stacks-signer/src/config.rs#L37

Contributor commented

Ah right, sorry for the confusion earlier. I was thinking first_proposal_burn_block_timing_secs was the relevant config here, which defaults to 60s, but you're right.

jcnelson (Member, Author) commented

> I think what this means is that miners will attempt to extend even if the next sortition has a valid winner after 30 seconds.

I'm happy to change the timeout, but it doesn't prevent dueling miners from arising. It's possible that in $TIMEOUT + 1 seconds, the winning miner comes online and signers reject it.

We all good with 600 seconds?

Member commented

My preference would be to remove this behavior from this PR; then we should open an issue describing this behavior, and we can go from there. I have concerns about the particular timeout chosen, and maybe other concerns as well, and I think it's worth separating this behavior from the fix that this PR is trying to address.

jcnelson (Member, Author) commented Jan 14, 2025

To clarify, which behavior do you want to remove from the PR?

Comment on lines +466 to +469
// we can continue our ongoing tenure, but we should give the new winning miner
// a chance to send their BlockFound first.
debug!("Relayer: Did not win sortition, but am mining the ongoing tenure. Allowing the new miner some time to come online before trying to continue.");
self.tenure_extend_timeout = Some(Instant::now());
Member commented

Is this behavior desirable as part of this PR?

My understanding of this PR (from the description at least) is that it is attempting to address the specific case:

  1. BTC block 1 occurs with miner A winning
  2. BTC block 2 occurs before miner A gets a proposal out (i.e., a flash block) -- there's no winner of block 2.
  3. Miner A should wake their thread, produce a tenure in BTC block 1 with just a coinbase block
  4. Miner A should then create an extension thread.

But the behavior that I'm commenting on is something else -- it introduces contention between the old miner and the new miner and relies on the signer set to resolve that contention. I think that the signer set can handle this, but it seems unwise to make a default miner behavior which would cause them to produce conflicting proposals in many cases.

Member commented

Also, should this be Instant::now() or Instant::now() + config.tenure_extend_timeout?

jcnelson (Member, Author) commented Jan 14, 2025

> My understanding of this PR (from the description at least) is that it is attempting to address the specific case:

That is one thing addressed by this PR, but not the only thing. In general, the new code implements heuristics for the miner to start a tenure-extend. For example, the winner in BTC block 2 may also fail to produce a tenure-change block, in which case miner A would only need to issue a tenure-extend.

> But the behavior that I'm commenting on is something else -- it introduces contention between the old miner and the new miner and relies on the signer set to resolve that contention. I think that the signer set can handle this, but it seems unwise to make a default miner behavior which would cause them to produce conflicting proposals in many cases.

The code does attempt to cause the miner to shut down if it detects that signers have signed off on blocks produced by the winner of BTC block 2 that this miner simply hasn't seen (this is handled in check_burn_view_changed). However, there's no real way to stop dueling miners from arising -- the system's safety ultimately depends on signers' ability to coalesce around one winning miner.

> Also, should this be Instant::now() or Instant::now() + config.tenure_extend_timeout?

No, because this is how self.tenure_extend_timeout is used (in try_continue_tenure()). Note the use of elapsed().

let deadline_passed = self
    .tenure_extend_timeout
    .map(|tenure_extend_timeout| {
        let deadline_passed =
            tenure_extend_timeout.elapsed() > self.config.miner.tenure_extend_wait_secs;
        if !deadline_passed {
            test_debug!(
                "Relayer: will not try to tenure-extend yet ({} <= {})",
                tenure_extend_timeout.elapsed().as_secs(),
                self.config.miner.tenure_extend_wait_secs.as_secs()
            );
        }
        deadline_passed
    })
    .unwrap_or(false);

if !deadline_passed {
    return;
}

Member commented

> No, because this is how self.tenure_extend_timeout is used (in try_continue_tenure()). Note the use of elapsed().

Does that mean that the tenure extend timeout also controls the behavior for empty sortitions?

jcnelson (Member, Author) commented

> Does that mean that the tenure extend timeout also controls the behavior for empty sortitions?

Yes, because the remedy for an empty sortition and for a crashed miner that fails to produce blocks is the same -- the last active miner tries to issue a tenure extension after a timeout.

The timeout for recovering from an empty sortition could be set to 0, but that's a refinement of the above behavior.

obycode (Contributor) commented Jan 14, 2025

But that would be a regression from the existing behavior. Currently, the miner will immediately extend when there is no sortition or if the winning miner has committed to the wrong tenure (and is thus unable to mine a valid block).

jcnelson (Member, Author) commented

I can make it so that the node will immediately mine an extension in these cases (btw, there's no test coverage for that).
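
A hypothetical sketch of that special case (the sortition_is_empty flag and surrounding names are assumptions), in which an empty sortition bypasses the wait so the immediate-extend behavior is preserved:

// Hypothetical: extend immediately on an empty sortition, matching the
// pre-PR behavior; otherwise honor the configured tenure-extend wait.
let wait = if sortition_is_empty {
    std::time::Duration::ZERO
} else {
    self.config.miner.tenure_extend_wait_secs
};
if tenure_extend_timeout.elapsed() <= wait {
    // deadline not yet passed -- don't try to tenure-extend yet
    return;
}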

Labels: none yet
Projects: 💻 In Progress
5 participants