
"datasets ensure" API in sled agent should remove unmatched datasets (eventually) #6177

Open
smklein opened this issue Jul 29, 2024 · 4 comments
Labels
Sled Agent (Related to the Per-Sled Configuration and Management), storage (Related to storage)

Comments

@smklein (Collaborator) commented Jul 29, 2024

#6144 will introduce an API to manage datasets, called dataset_ensure.

Sled agent exposes a few endpoints like this, to "ensure the set of _____ exactly matches a provided configuration". Usually, these endpoints delete entries which don't match the input set.

However, because many of our datasets are implicitly created by the sled agent, deleting unmatched datasets would be a dangerous operation.

This issue tracks the follow-up work: once it is safe to do so, we should remove datasets that don't match the expected configuration.
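Roughly, the intended "ensure" semantics look like the sketch below (hypothetical types and names, not the actual sled-agent API); the commented-out `retain` is the removal step this issue tracks:

```rust
// Hypothetical sketch only; types and names do not match the actual sled-agent API.
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq)]
struct DatasetConfig {
    name: String,       // e.g. "oxp_<uuid>/crucible"
    quota: Option<u64>, // bytes
}

/// Reconcile the datasets that exist against the requested set. Today this
/// only creates/updates; removing unmatched datasets is what this issue tracks.
fn datasets_ensure(existing: &mut BTreeMap<String, DatasetConfig>, requested: &[DatasetConfig]) {
    for want in requested {
        match existing.get(&want.name) {
            Some(have) if have == want => {} // already in the desired state
            _ => {
                // Create or update the dataset to match the request.
                existing.insert(want.name.clone(), want.clone());
            }
        }
    }

    // The eventual, currently-unsafe step: delete anything not in the request.
    // Skipped today because many datasets are implicitly created by the sled
    // agent and would be wrongly deleted.
    //
    // existing.retain(|name, _| requested.iter().any(|want| &want.name == name));
}

fn main() {
    let mut existing = BTreeMap::new();
    let requested = vec![DatasetConfig {
        name: "oxp_example/crucible".to_string(),
        quota: None,
    }];
    datasets_ensure(&mut existing, &requested);
    assert!(existing.contains_key("oxp_example/crucible"));
}
```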

@smklein smklein added storage Related to storage. Sled Agent Related to the Per-Sled Configuration and Management labels Jul 29, 2024
@davepacheco (Collaborator) commented

If I understand correctly, there's a pretty significant, non-obvious consequence of not having this: once a zone of a particular type is deployed with its persistent root filesystem on a particular disk, we can never put another instance of that zone type on the same disk -- even if we expunge the first one. That's because today, persistent zone datasets are identified by their "name", which is a tuple of (zpool, dataset/zone type). Without the cleanup this issue tracks, if we expunge the zone, the dataset remains behind. That means we cannot create another one for a second zone, right?

A follow-on consequence is that there's an upper bound to the number of automatic upgrades that can be done on any system because eventually we will run out of disks on which we've never placed, say, a CockroachDB zone (or a DNS zone or any other zone with a persistent dataset).

I definitely could be misunderstanding!
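
To make the collision concrete, a toy illustration with made-up identifiers (they don't match omicron's real dataset/zone types):

```rust
// Toy illustration only; identifiers are made up and do not match omicron's types.
use std::collections::BTreeSet;

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum DatasetKind {
    Cockroach,
    ExternalDns,
}

/// A persistent zone dataset is identified by (zpool, kind), not by zone id,
/// so a leftover dataset from an expunged zone occupies the only "slot" for
/// that kind on that disk. (A u32 stands in for a zpool UUID here.)
type DatasetName = (u32, DatasetKind);

fn can_place_zone(existing: &BTreeSet<DatasetName>, zpool: u32, kind: DatasetKind) -> bool {
    !existing.contains(&(zpool, kind))
}

fn main() {
    let mut datasets = BTreeSet::new();
    // A CockroachDB zone lands on zpool 0 and gets a persistent dataset.
    datasets.insert((0, DatasetKind::Cockroach));
    // The zone is later expunged, but the dataset is never removed, so a new
    // CockroachDB zone can never be placed on zpool 0 again.
    assert!(!can_place_zone(&datasets, 0, DatasetKind::Cockroach));
    // Other kinds (or other disks) are unaffected.
    assert!(can_place_zone(&datasets, 0, DatasetKind::ExternalDns));
}
```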

@smklein (Collaborator, Author) commented Dec 19, 2024

Is the current plan-of-record with automated upgrades to delete the old datasets? Or is it to delete the zone, and later re-create it?

If we are intending to actually fully delete the durable datasets of e.g. CockroachDB during the upgrade process, I agree, this issue should be a blocker for automated upgrade.

@davepacheco (Collaborator) commented

Sorry, CockroachDB might be a bad example because we might do those as an in-place upgrade and keep the dataset.

Internal/External DNS are probably a better example because with those we definitely plan to deploy new zones (with associated datasets) and then remove the old zones (and datasets). We also plan to do that with most other zones, but most of them don't have persistent datasets, so this isn't a problem there.


Another note: without extra care, I think fixing this might break zone expungement for these zones. Zone expungement removes both the dataset and the zone (in one blueprint step). During execution, this would attempt to delete the dataset before attempting to delete the zone. I assume that would fail, since the dataset is still in use. And that currently causes the rest of execution to fail (#6999).

The idea that's been floated of combining PUT /datasets with PUT /omicron-zones could address this.
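
A toy sketch of that ordering hazard, with hypothetical types rather than the real blueprint execution code:

```rust
// Hypothetical sketch; not the real reconfigurator execution code.
use std::collections::BTreeSet;

enum Step {
    DeleteDataset(&'static str),
    DeleteZone(&'static str),
}

/// `running_zones` tracks zones that still hold their dataset open; for
/// simplicity a zone and its dataset share a name here.
fn execute(steps: &[Step], running_zones: &mut BTreeSet<&'static str>) -> Result<(), String> {
    for step in steps {
        match step {
            Step::DeleteDataset(name) => {
                if running_zones.contains(name) {
                    // The dataset is still in use by the running zone, so the
                    // delete fails -- and today that failure aborts the rest
                    // of execution (see #6999).
                    return Err(format!("dataset for {name} is busy"));
                }
            }
            Step::DeleteZone(name) => {
                running_zones.remove(name);
            }
        }
    }
    Ok(())
}

fn main() {
    let zones = BTreeSet::from(["external_dns"]);

    // Dataset delete ordered before zone delete: fails and halts execution.
    let bad = [
        Step::DeleteDataset("external_dns"),
        Step::DeleteZone("external_dns"),
    ];
    assert!(execute(&bad, &mut zones.clone()).is_err());

    // Zone delete first -- the ordering a combined
    // PUT /datasets + PUT /omicron-zones request could guarantee -- succeeds.
    let good = [
        Step::DeleteZone("external_dns"),
        Step::DeleteDataset("external_dns"),
    ];
    assert!(execute(&good, &mut zones.clone()).is_ok());
}
```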

@davepacheco (Collaborator) commented

See also #7304.
