blippy: Blueprint Checker Proposal #6987

Open
smklein opened this issue Nov 4, 2024 · 2 comments
Labels
  • development: Bugs, paper cuts, feature requests, or other thoughts on making omicron development better
  • Update System: Replacing old bits with newer, cooler bits

Comments

@smklein
Collaborator

smklein commented Nov 4, 2024

See #6973 for some background context, as well as the ad-hoc meeting from 11/4 with @jgallagher, @davepacheco, and @andrewjstone on this topic.

Background

  • The planner and reconfigurator-cli both use the blueprint builder to construct blueprints.
  • The planner would be used by Nexus, and likely has a more conservative bias towards constructing valid blueprints.
  • The reconfigurator-cli acts as something of a "system override", and wants to construct blueprints that are "valid enough", but which may deviate from the constructions that the planner might create.
  • Defining what abnormalities are valid / not valid is somewhat subtle. For example:
    • Assigning the same underlay address to distinct services is probably always invalid. This could be categorized as a hard error.
    • Deploying multiple services which have incompatible versions is invalid, but should it be prohibited from ever being constructed by the reconfigurator-cli?
    • Deploying a blueprint with "no Nexuses" - this could be viewed as a deviation from policy, but on production systems, it'll create an inoperable system. How are we categorizing the validity of a blueprint with this configuration?
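As a minimal sketch of the "hard error" case above, a check for duplicate underlay addresses could look like the following. `ServicePlacement` and `duplicate_underlay_addrs` are hypothetical names invented for illustration; they are not the blueprint types used in omicron.

```rust
use std::collections::BTreeMap;
use std::net::Ipv6Addr;

/// Hypothetical, simplified view of one service placed by a blueprint.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ServicePlacement {
    pub name: String,
    pub underlay_addr: Ipv6Addr,
}

/// Returns every underlay address claimed by more than one service,
/// paired with the names of the services claiming it.
/// Results are ordered by address because a BTreeMap is used.
pub fn duplicate_underlay_addrs(
    services: &[ServicePlacement],
) -> Vec<(Ipv6Addr, Vec<String>)> {
    let mut by_addr: BTreeMap<Ipv6Addr, Vec<String>> = BTreeMap::new();
    for svc in services {
        by_addr
            .entry(svc.underlay_addr)
            .or_default()
            .push(svc.name.clone());
    }
    by_addr
        .into_iter()
        .filter(|(_, names)| names.len() > 1)
        .collect()
}
```

An empty result means no address is shared; any entry would be a candidate for the hard-error category.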

Categorizing Validity

It will be important for us to define some of these error cases (that is, what counts as a deviation from an "okay" blueprint, and what's acceptable) as we define:

  • What is valid for the blueprint builder API to produce?
  • What is valid for the reconfigurator-cli to emit?

We've discussed using at least the following categories, though there may be more:

  • Blueprint OK, matches policy: The blueprint is valid, and we cannot find any ways in which it deviates from the policy the planner would use.
  • Blueprint OK, but deviates from policy: The blueprint could be deployed, but does not match our policy. For example, if our policy is to deploy three Nexus zones, a blueprint in this category might be attempting to deploy two or four Nexus zones.
  • Blueprint Erroneous: There are many flavors here, but this category includes:
    • The blueprint cannot be deployed (we know ahead of time that a sled agent could or should reject it)
    • The blueprint would render the system inoperable (e.g. delete all Nexus zones)
    • The blueprint contains an internal inconsistency (data modified without changing generation number, etc)
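One way to picture these categories is as an ordered enum, where the overall result for a blueprint is the most severe category among its findings. This is only a sketch: `Validity`, `Finding`, `overall`, and `check_nexus_count` are hypothetical names, and treating "zero Nexus zones" as erroneous follows the example in this issue rather than any settled rule.

```rust
/// Hypothetical validity categories, declared from least to most severe
/// so that the derived ordering can be used to pick the worst one.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Validity {
    MatchesPolicy,
    DeviatesFromPolicy,
    Erroneous,
}

/// Hypothetical single finding produced by one check.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Finding {
    pub validity: Validity,
    pub message: String,
}

/// The overall category is the most severe finding; no findings at all
/// means nothing deviates from policy.
pub fn overall(findings: &[Finding]) -> Validity {
    findings
        .iter()
        .map(|f| f.validity)
        .max()
        .unwrap_or(Validity::MatchesPolicy)
}

/// Example check: compare the number of Nexus zones to the policy target.
/// Assumption: zero Nexus zones is erroneous (inoperable system), and any
/// other mismatch is only a deviation from policy.
pub fn check_nexus_count(actual: usize, policy_target: usize) -> Finding {
    let validity = if actual == 0 {
        Validity::Erroneous
    } else if actual != policy_target {
        Validity::DeviatesFromPolicy
    } else {
        Validity::MatchesPolicy
    };
    Finding {
        validity,
        message: format!(
            "blueprint has {actual} Nexus zones, policy wants {policy_target}"
        ),
    }
}
```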

Identifying Validity

This issue proposes a blueprint checker (perhaps called blippy) which can inspect a blueprint and identify "how valid" the blueprint appears, with categorization of how far the blueprint deviates from the norm.

We could use blippy in the following spots:

  • As a standalone tool for inspecting blueprints
  • As a part of the blueprint builder, to help the planner validate it has not created a "known erroneous" blueprint
  • As a part of the reconfigurator-cli, to help users identify that their changes only deviate from a policy, and are not a violation of correctness guarantees (or, perhaps, we let people do this anyway, but with many warnings)
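The call sites listed above could tolerate different categories. The sketch below shows one possible mapping; all names are hypothetical. It assumes the planner treats a policy deviation as a bug in itself (since it is meant to implement the policy), while the reconfigurator-cli accepts deviations with a warning. Whether the CLI should also let erroneous blueprints through "with many warnings" is left open in this proposal; the sketch rejects them.

```rust
/// Hypothetical validity categories (see "Categorizing Validity").
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Validity {
    MatchesPolicy,
    DeviatesFromPolicy,
    Erroneous,
}

/// Hypothetical identification of who is asking for the check.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Caller {
    Planner,
    ReconfiguratorCli,
}

/// What the caller should do with the blueprint.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Accept,
    AcceptWithWarning,
    Reject,
}

/// One possible tolerance table; the specific choices are assumptions.
pub fn outcome_for(caller: Caller, validity: Validity) -> Outcome {
    match (caller, validity) {
        (_, Validity::MatchesPolicy) => Outcome::Accept,
        // Known-erroneous blueprints are rejected for every caller here.
        (_, Validity::Erroneous) => Outcome::Reject,
        // Assumption: the planner should never deviate from its own policy.
        (Caller::Planner, Validity::DeviatesFromPolicy) => Outcome::Reject,
        // The CLI acts as a "system override", so deviations only warn.
        (Caller::ReconfiguratorCli, Validity::DeviatesFromPolicy) => {
            Outcome::AcceptWithWarning
        }
    }
}
```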

smklein added the Update System and development labels Nov 4, 2024
@davepacheco
Collaborator

We also talked about using blippy in Nexus before setting a target blueprint to avoid setting one with serious problems. This could be overridden in the API call that's setting the target blueprint.
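A rough sketch of that gate, assuming a hypothetical `override_checks` flag on the request and a list of serious problems already produced by the checker (neither is an existing Nexus API):

```rust
/// Hypothetical error returned when a target blueprint is refused.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SetTargetError {
    /// The checker found serious problems and no override was requested.
    SeriousProblems(Vec<String>),
}

/// Decides whether a blueprint may become the target.
/// `serious_problems` stands in for whatever the checker reports;
/// `override_checks` stands in for an opt-out carried by the API call.
pub fn check_before_set_target(
    serious_problems: Vec<String>,
    override_checks: bool,
) -> Result<(), SetTargetError> {
    if serious_problems.is_empty() || override_checks {
        Ok(())
    } else {
        Err(SetTargetError::SeriousProblems(serious_problems))
    }
}
```

Returning the problems inside the error lets the caller show them to the operator before they decide whether to retry with the override.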

@karencfv
Contributor

karencfv commented Nov 4, 2024

Just passing by to say that whoever came up with "blippy" is a genius

jgallagher added a commit that referenced this issue Dec 20, 2024
This PR introduces Blippy for linting blueprints (see #6987). It
initially only reports `FATAL` errors associated with specific sleds,
but hopefully provides enough structure to see how that can expand to
include other severities and other components. (At a minimum, there will
be some blueprint- or policy-level component for things like "there
aren't enough Nexus zones" that aren't associated with any particular
sled.)

As of this PR, the only user of Blippy is the builder test's
`verify_blueprint`, from which I imported most of the checks that it
currently performs. I made a few of these checks slightly more strict, and
from that I had to patch up a handful of tests that were doing weird
things (e.g., manually expunging a zone without expunging its datasets)
and also found one legitimate planner bug (I'll note in a separate
comment below).
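
To illustrate the shape described in this commit message (notes with a severity, attached to a component such as a sled), here is a rough sketch. The type and function names are hypothetical and are not taken from the PR; the single-variant enums only mirror the statement that the first version reports `FATAL` errors for specific sleds, with room for other severities and components later.

```rust
use std::collections::BTreeMap;

/// Hypothetical severity; only one level exists in this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Fatal,
}

/// Hypothetical component a note is attached to. A blueprint- or
/// policy-level variant could be added later.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Component {
    /// Identified here by a plain string sled ID for simplicity.
    Sled(String),
}

/// Hypothetical lint note.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Note {
    pub severity: Severity,
    pub component: Component,
    pub message: String,
}

/// Groups notes by the sled they refer to, ordered by sled ID.
pub fn notes_by_sled(notes: &[Note]) -> BTreeMap<String, Vec<&Note>> {
    let mut grouped: BTreeMap<String, Vec<&Note>> = BTreeMap::new();
    for note in notes {
        match &note.component {
            Component::Sled(id) => {
                grouped.entry(id.clone()).or_default().push(note)
            }
        }
    }
    grouped
}
```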