Don't generate genesis proof at startup #8764

Merged
merged 5 commits into from
May 11, 2021


mrmr1993
Member

This PR

  • removes genesis proof generation from the startup code
  • adds support for generating genesis proofs in the prover process
  • adds logic to generate a genesis proof when a block producer will need one to be able to produce their block
    • if the next block for the producer is n slots in the future, the genesis proof will be produced at slot n-1, if it is still needed
    • if it's needed immediately, the genesis proof will be produced as soon as possible and then used to generate the block
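
The timing rule in the bullets above can be sketched as a small decision function. This is only an illustration under stated assumptions: slots are modelled as plain integers, and the type `action`, the function `genesis_proof_action`, and its labelled arguments are hypothetical names, not identifiers from this PR or the codebase.

```ocaml
(* Hypothetical model of the scheduling rule; slots are plain integers
   and none of these names are taken from the actual code. *)
type action =
  | No_proof_needed (* best tip has moved past genesis *)
  | Prove_now (* block is due in this slot or the next one *)
  | Prove_at_slot of int (* wait, then re-check at this slot *)

let genesis_proof_action ~best_tip_is_genesis ~current_slot ~next_block_slot =
  if not best_tip_is_genesis then No_proof_needed
  else if next_block_slot - current_slot <= 1 then Prove_now
  else Prove_at_slot (next_block_slot - 1)

let describe = function
  | No_proof_needed -> "no genesis proof needed"
  | Prove_now -> "generate the genesis proof now"
  | Prove_at_slot s -> Printf.sprintf "re-check and prove at slot %d" s

let () =
  genesis_proof_action ~best_tip_is_genesis:true ~current_slot:10
    ~next_block_slot:15
  |> describe |> print_endline
```

Calling the function again when the scheduled slot arrives covers the "if it is still needed" case: if the best tip has progressed in the meantime, the result is `No_proof_needed` and no proving work is done.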

Checklist:

  • Document code purpose, how to use it
    • Mention expected invariants, implicit constraints
  • Tests were added for the new behavior
    • Document test purpose, significance of failures
    • Test names should reflect their purpose
  • All tests pass (CI will check this if you didn't)
  • Serialized types are in stable-versioned modules
  • Does this close issues? List them:

@mrmr1993 mrmr1993 added the ci-build-me label ("Add this label to trigger a circle+buildkite build for this branch") Apr 29, 2021
@mrmr1993 mrmr1993 changed the title from "Don't generate genesis proofs at startup" to "Don't generate genesis proof at startup" Apr 29, 2021
@mrmr1993 mrmr1993 marked this pull request as ready for review April 29, 2021 18:18
@mrmr1993 mrmr1993 requested review from a team as code owners April 29, 2021 18:18
@lk86
Contributor

lk86 commented Apr 29, 2021

how does this interact with the --generate-genesis-proof true flag?

@mrmr1993
Member Author

how does this interact with the --generate-genesis-proof true flag?

That flag will still crash the daemon if it's passed as false, but it should already be defaulting to true. Should we disable it completely?

@psteckler
Member

Should we disable it completely?

I vote "yes".

@mrmr1993
Member Author

I vote "yes".

Done :)

@psteckler
Member

Done :)

The instructions for running a node will need to reflect this change, when it's released.

@mrmr1993
Member Author

The instructions for running a node will need to reflect this change, when it's released.

I've deprecated the flag rather than removing it, so it's now a no-op, in the interest of not breaking existing setups on upgrade. We should still update the docs, but I don't think it's a necessity.
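
As a rough illustration of what "deprecated and now a no-op" can look like, here is a minimal sketch using the OCaml standard library's `Arg` module as a stand-in. The daemon's real command-line handling is not shown in this conversation, so `saw_deprecated_flag`, `specs`, `parse`, the usage string, and the warning text are all hypothetical.

```ocaml
(* Sketch only: Stdlib's Arg stands in for the daemon's real CLI parser. *)

(* Set when the deprecated flag is seen, so the caller can log a warning. *)
let saw_deprecated_flag = ref false

let specs =
  [ ( "--generate-genesis-proof"
    , (* The boolean is still parsed, so old invocations keep working,
         but its value is ignored. *)
      Arg.Bool (fun _ignored -> saw_deprecated_flag := true)
    , " Deprecated: accepted for compatibility, has no effect" ) ]

let parse argv =
  Arg.parse_argv ~current:(ref 0) argv specs
    (fun anon -> raise (Arg.Bad ("unexpected argument: " ^ anon)))
    "usage: daemon [options]"

let () =
  parse [| "daemon"; "--generate-genesis-proof"; "false" |];
  if !saw_deprecated_flag then
    prerr_endline "warning: --generate-genesis-proof is deprecated and ignored"
```

Accepting the flag and discarding its value means an existing setup that passes it, with either `true` or `false`, keeps starting after an upgrade.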

Member

@deepthiskumar deepthiskumar left a comment

Nice!
Asked a question but otherwise looks great.

in if Consensus.Data.Consensus_state.is_genesis_state consensus_state
Member

@deepthiskumar deepthiskumar Apr 30, 2021

Maybe we also want to check whether the network is in the genesis epoch, and ignore this otherwise? A node restarted with a clean config directory will have the genesis state as its best tip. It might be better not to produce a block at all in this case.

Member Author

@mrmr1993 mrmr1993 Apr 30, 2021

A node restarted with a clean config directory will have the genesis state as its best tip. It might be better not to produce a block at all in this case.

I agree, but I think this falls outside the scope of this PR. We should probably avoid checking VRFs or running the start function at all until we have an epoch ledger from a recent block that's within the current epoch, which would fix a lot of the reported bp weirdness during startup.

Currently, this code will only trigger if they've won a slot that's imminent (this or the next) but haven't been running for long enough to see their best tip progress. If it's this slot, they would produce a useless block on top of the genesis block anyway (waiting to produce it may even allow them time to get a plausible best tip); if it's the next, they'll waste some compute generating a genesis proof that currently every node is wasting at startup regardless.

@bkase
Member

bkase commented Apr 30, 2021

@mrmr1993 is there any risk that this could crash folks' nodes right before they make their block? What are the failure scenarios of generating the genesis proof?

@mrmr1993
Member Author

@mrmr1993 is there any risk that this could crash folks' nodes right before they make their block? What are the failure scenarios of generating the genesis proof?

@bkase the risk is very low:

  • the genesis proof is only created for the first block, so every other bp will never even try to generate a proof
  • I haven't heard any reports of nodes crashing while generating the genesis proof on startup, which every active node currently does
  • I can't find any open issues due to prover process crashes, so it seems that the block production code is stable (or, at least, more so than the daemon itself). Producing the genesis block is just calling this code with subtly different inputs.

The failure scenarios are, to my mind,

  • corrupt genesis ledger
    • we used it to start up; if the node is ready to produce the first block, then the ledger was known good at that point and is unlikely to be broken now
  • incorrect genesis state
    • catastrophic anyway, the subsequent produced block would be invalid on the main chain, and received blocks will be rejected as invalid
  • memory corruption
    • we haven't seen this take down the prover process (that I know of)

This also reduces memory usage in the daemon process by not loading block proving keys, so we'll see less memory pressure and thus fewer memory corruption crashes, which appear to trigger at GC time. This also reduces cold-start time from 100s to 8s (based on CI logs), so crash recovery may be significantly faster.

@lk86
Contributor

lk86 commented Apr 30, 2021

Just to clarify some notes here: only Debian users are regenerating the proof (and downloading keys) on startup; the baked Docker images currently cache these values.

More importantly though, I'm concerned that users will not have the keys ready in under ~6 minutes when they need them. Depending on network conditions, the s3 download can be really flaky/problematic. I would like to at least keep shipping the keys in Docker via some mechanism, but it doesn't have to be this --generate-genesis-proof flag or the baking setup. Is the standalone key-downloader tool in a usable state @mrmr1993 ?

@mrmr1993
Member Author

mrmr1993 commented Apr 30, 2021

Is the standalone key-downloader tool in a usable state @mrmr1993 ?

@lk86 It's usable, but not very pretty or user friendly. Should we be thinking about a post-inst script to pull the keys automatically when the deb is installed? Or is it worth improving the tool and making that a required step for slow connections?

@aneesharaines aneesharaines linked an issue May 4, 2021 that may be closed by this pull request
@lk86
Contributor

lk86 commented May 4, 2021

Is the standalone key-downloader tool in a usable state @mrmr1993 ?

@lk86 It's usable, but not very pretty or user friendly. Should we be thinking about a post-inst script to pull the keys automatically when the deb is installed? Or is it worth improving the tool and making that a required step for slow connections?

It doesn't have to be pretty or user friendly to replace the functionality in the baked Dockerfile and in CI, so that's good enough for me. We can integrate it nicely with the Debian package later, given that Debian users are already dealing with the s3 download logic.

@mrmr1993 mrmr1993 merged commit 6869b2c into compatible May 11, 2021
@mrmr1993 mrmr1993 deleted the feature/no-genesis-proof-at-start branch May 11, 2021 16:43
Successfully merging this pull request may close these issues.

Don't generate genesis proofs at startup
6 participants