-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Reconfigurator] Blippy #7276
[Reconfigurator] Blippy #7276
Conversation
skip_zpools.insert(zpool); | ||
} | ||
if let Some(zpool) = &zone_config.filesystem_pool { | ||
skip_zpools.insert(zpool); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the legit planner bug I mentioned in the PR description; without this fix, we have several tests that fail with a blippy report that looks something like:
blippy report for blueprint 30f46d73-5dd5-4125-800f-6c910bb24638: 2 notes
sled 77370f38-8ef9-4dc1-a455-5b2f72149eed: FATAL note: zpool oxp_03a64b5d-c9c6-4edd-ace0-7d4bbc3293e2 has two zone filesystems of the same kind (CruciblePantry 2220bf46-72e3-4d88-baae-6929f1891219 and CruciblePantry 319edb3b-f8a8-4881-8f28-e2d80ce861bc)
sled 77370f38-8ef9-4dc1-a455-5b2f72149eed: FATAL note: zpool oxp_03a64b5d-c9c6-4edd-ace0-7d4bbc3293e2 has two zone filesystems of the same kind (CruciblePantry 319edb3b-f8a8-4881-8f28-e2d80ce861bc and CruciblePantry 6f6f16f9-3b33-408a-8508-2192249e3b0f)
This fix is also obvious in the blueprint diffs below in the PR, where we can see that before this change there were multiple zones of the same kind placed on the same zpool. I believe this bug does not affect production due to #7229.
let mut slf = Self { blueprint, notes: Vec::new() }; | ||
checks::perform_all_blueprint_only_checks(&mut slf); | ||
slf | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intent here is to add at least one more method here that would let us perform additional checks that need the PlanningInput
(or at least the Policy
), but make that optional so it can still be used to check just a standalone blueprint. So something like:
let just_blueprint = Blippy::new(blueprint).into_report(); // today
let full_context = Blippy::new(blueprint).with_policy(policy).into_report(); // future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks for taking this on!
}, | ||
/// Two sleds are using the same sled subnet. | ||
ConflictingSledSubnets { | ||
other_sled: SledUuid, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have both sled UUIDs here? Unclear which one is the "other"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I went back and forth on this. Every Note
has a Component
; this kind is assuming its note's component is a sled, and then this value is the "other" sled. I could put both sled IDs in here and just duplicate one with the Component? Or maybe this is pointing to a more serious structural problem (e.g., some Kind
s are only sensible with specific Component
s, so maybe breaking them up the way I have is not right)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahhhh I didn't put two and two together to see the higher-level usage of Component
here.
I'm looking at the structure of:
pub struct Note<'a> {
pub component: Component,
pub severity: Severity,
pub kind: Kind<'a>,
}
I think I agree with your comment, this Kind
is actually a SledKind
- I wouldn't want to lump non-sled stuff into this Kind
enum.
Maybe we should do something like:
pub struct Note<'a> {
pub severity: Severity,
pub component_note: ComponentNote<'a>,
}
pub enum ComponentNote<'a> {
Sled {
id: SledUuid,
kind: SledKind<'a>,
},
OtherComponentKind {
id: OtherUuid,
kind: OtherKind<'a>,
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that seems better, thanks 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworked in 15f04e8
/// Two zones with the same durable dataset kind are on the same zpool. | ||
ZoneDurableDatasetCollision { | ||
zone1: &'a BlueprintZoneConfig, | ||
zone2: &'a BlueprintZoneConfig, | ||
zpool: ZpoolName, | ||
}, | ||
/// Two zones with the same filesystem dataset kind are on the same zpool. | ||
ZpoolFilesystemDatasetCollision { | ||
zone1: &'a BlueprintZoneConfig, | ||
zone2: &'a BlueprintZoneConfig, | ||
zpool: ZpoolName, | ||
}, | ||
/// One zpool has two datasets of the same kind. | ||
ZpoolWithDuplicateDatasetKinds { | ||
dataset1: &'a BlueprintDatasetConfig, | ||
dataset2: &'a BlueprintDatasetConfig, | ||
zpool: ZpoolUuid, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not opposed to more specific error kinds, but ZpoolWithDuplicateDatasetKinds
will cover all three of these cases, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so:
Zone*DatasetCollision
come from scanningblueprint_zones
, from which we don't have aBlueprintDatasetConfig
needed to construct aZpoolWithDuplicateDatasetKinds
ZpoolWithDuplicateDatasetKinds
comes from scanningblueprint_datasets
(so the inner fields areBlueprintDatasetConfig
s, but these don't necessarily have a correspondingBlueprintZoneConfig
anywhere)
We could squish all of these down to one if we dropped the zone/dataset specifics to something like DatasetKind
? That seems like it's dropping a lot of information, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, gotcha, nevermind -- happy to keep all the cases distinct. I do think it's useful to give the most specific errors possible here.
I was (perhaps incorrectly?) thinking:
- If we check
blueprint_datasets
for per-zpool uniqueness (checked byZpoolWithDuplicateDatasetKinds
) - ... and all
blueprint_zones
have associated datasets (checked byZoneMissingFilesystemDataset
+ZoneMissingDurableDataset
) - ... then we can infer: all datasets that should exist do exist, and within the set of datasets, there are no collisions.
Granted - I don't have a beef with also attributing the dataset collisions to specific zones, so this is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're right, and I debated collapsing these, but I decided to keep the separate cases for explicitness. It's okay if we get some "double notes", and I think it's easier to reason about each of the checks locally if they're checking their side of the condition (e.g., zones check that their corresponding dataset exists and datasets check that their corresponding zone exists, even though strictly speaking we could drop one of those and not miss anything).
// TODO-john Add a Severity::BackwardsCompatibility and note the | ||
// missing filesystem pool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Thanks for taking this on @jgallagher!
I agree with all @smklein's comments and appreciate the thorough review.
There are a few TODOs around expunged and decommissioned disks in blueprints. I'll take that on when my fix for #7098 goes out for review.
|
||
// Any given sled should have IPs within at most one subnet. | ||
// | ||
// The blueprint doesn't store sled subnets explicitly, so we can't |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think the rack subnet is something we want to move into the blueprint eventually?
I think we do, mainly based on the "make everything in the blueprint explicit" POV.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, yeah. I don't think it's particularly urgent.
972e820
to
09a46f5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests look great!
fn test_bad_internnal_dns_subnet() { | ||
static TEST_NAME: &str = "test_bad_internnal_dns_subnet"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn test_bad_internnal_dns_subnet() { | |
static TEST_NAME: &str = "test_bad_internnal_dns_subnet"; | |
fn test_bad_internal_dns_subnet() { | |
static TEST_NAME: &str = "test_bad_internal_dns_subnet"; |
} | ||
} | ||
} | ||
assert!(found_sled_missing_note, "found sled missing datasets note"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick; this isn't an expect -- we'll see this message if the note isn't found, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be super clear, I think the error message should be Did not find sled with missing datasets note
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hah, yes, thanks.
} | ||
} | ||
} | ||
assert!(found_sled_missing_note, "found sled missing disks note"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here re: expect vs assert messages
Thanks, and thanks for the super fast review. I didn't even have a chance to ping with a "tests are here, feel free to look them over"! The tests are pretty repetitive; I was tempted to abstract the "mutate the blueprint, check for expected notes" pattern, but (a) I worry about abstracting tests too much and (b) in many of the tests the details of how the mutation / note creation happens are subtly different from each other in ways that might be hard to abstract. So I ended up with "just repeat a lot of similar things", which I think is okay for tests 🤷♂️. |
I'm totally on board with the pattern of "repeat a lot at first to make things explicit, de-dup only later once everyone is crystal clear on what's happening under the hood". |
This PR introduces Blippy for linting blueprints (see #6987). It initially only reports
FATAL
errors associated with specific sleds, but hopefully provides enough structure to see how that can expand to include other severities and other components. (At a minimum, there will be some blueprint- or policy-level component for things like "there aren't enough Nexus zones" that aren't associated with any particular sled.)As of this PR, the only user of Blippy is the builder test's
verify_blueprint
, from which I imported most of the checks that it current performs. I made a few of these checks slightly more strict, and from that I had to patch up a handful of tests that were doing weird things (e.g., manually expunging a zone without expunging its datasets) and also found one legitimate planner bug (I'll note in a separate comment below).There aren't many doc comments (yet?) - the interface should be pretty straightforward, but I'm happy to add docs if folks are okay with the general structure of this. I'm not at all wedded to the
Note
/Severity
/Kind
/Component
breakdown, so any feedback there is welcome.