
A VMM in state 'starting' can receive requests #7398

Status: Open (wants to merge 1 commit into main)
Conversation

@jmpesp (Contributor) commented Jan 24, 2025:

If a VMM is in state 'starting' and cannot activate a given disk, it will hang. Region replacement or region snapshot replacement may alter the VCR of that disk to one that *could* be activated, but the current drive saga will not attempt to send requests to a VMM in 'starting'. Fix this: any time a Propolis is expected to be there, it should be OK for it to receive these requests. Otherwise a repair can get stuck, and the VMM could also fail to stop if it gets stuck deactivating!

@gjcolombo (Contributor) commented:
Have we verified empirically that region replacement and region snapshot replacement work as intended on a Propolis that hasn't yet reached Running? I'm not certain they will (and would need to go look carefully at the relevant Propolis code to check).

@hawkw (Member) left a comment:

I agree with @gjcolombo --- it's worth making sure that propolis-server can actually handle region replacement requests in that state.

| VmmState::Rebooting
| VmmState::Starting => {
// Propolis server is expected to be there
// (eventually, in the case of "Starting"), and
@hawkw (Member) replied:

I believe that a VMM isn't in the Starting state unless the Propolis server process does exist, FWIW --- if memory serves, that's the distinction between Starting and Creating. It may not yet be incarnating an instance, though.
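The change under discussion hinges on which VMM states imply that a Propolis server exists to receive requests. As a rough sketch of that decision (using a simplified stand-in enum, not the actual Omicron types, and with the exact set of "Propolis expected" states being an assumption based on this thread), the match might look like:

```rust
// Illustrative only: a simplified stand-in for Omicron's VmmState enum,
// not the real omicron type. Which variants count as "Propolis expected"
// is an assumption drawn from the discussion above.
#[derive(Debug, PartialEq)]
enum VmmState {
    Starting,
    Running,
    Rebooting,
    Migrating,
    Stopping,
    Stopped,
    Failed,
    Destroyed,
}

/// Returns true when a Propolis server is expected to exist for this VMM
/// and so can (eventually, in the case of Starting) receive requests such
/// as volume-replacement calls.
fn propolis_expected(state: &VmmState) -> bool {
    match state {
        // A Propolis server should be there, so requests are OK to send.
        VmmState::Starting
        | VmmState::Running
        | VmmState::Rebooting
        | VmmState::Migrating => true,
        // No Propolis to talk to in these states.
        VmmState::Stopping
        | VmmState::Stopped
        | VmmState::Failed
        | VmmState::Destroyed => false,
    }
}

fn main() {
    assert!(propolis_expected(&VmmState::Starting));
    assert!(!propolis_expected(&VmmState::Destroyed));
}
```

Note that, per the comment above, Starting (unlike Creating) already implies the Propolis server process exists, even if it isn't yet incarnating an instance.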

@gjcolombo (Contributor) commented:

I looked at this a bit. In Propolis there are at least two things we need to consider:

  • Sled agent maps Propolis's "Creating" state to Omicron's "Starting" state, so it's possible that the Propolis API call will return a "no instance" error (it hasn't gotten far enough along to believe it has an active VMM yet). This is probably fine; if you can detect this error you can just retry.
  • Probably more significantly: Propolis queues VCR change requests to the worker task that's otherwise responsible for initializing VMs and changing their state. (This is so that we don't have to reason about what happens if you try to mutate an instance's configuration while it's migrating.) If a VM is stuck in Starting because it can't activate a Crucible disk, the region replacement that might fix the situation will queue up behind the "start the VM" task and won't actually resolve.

It might be possible to fix the latter issue, but it's probably going to take a fair amount of effort on the Propolis side of the house. (We'd also need to be sure that if we have a Crucible upstairs that's stuck in a Volume::activate, and we call Volume::target_replace on it, then that succeeds and allows the activate call to resolve.)
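The first issue above (a "no instance" error from a Propolis still in its internal Creating state) suggests a simple retry on the caller's side. A hedged sketch of that shape, with an entirely made-up error type and request closure standing in for the real Propolis client call:

```rust
// Hypothetical sketch of the retry suggested above. The error enum and the
// request closure are invented for illustration; they are not the actual
// Propolis client API.
#[derive(Debug, PartialEq)]
enum ReplaceError {
    NoInstance,    // Propolis has no active VMM yet; safe to retry
    Other(String), // anything else is terminal for this attempt
}

/// Retry a VCR-replace request while Propolis reports "no instance",
/// up to `max_attempts` times.
fn replace_vcr_with_retry(
    mut send_request: impl FnMut() -> Result<(), ReplaceError>,
    max_attempts: usize,
) -> Result<(), ReplaceError> {
    for _ in 0..max_attempts {
        match send_request() {
            Ok(()) => return Ok(()),
            // Not far enough along to have an active VMM yet: retry.
            Err(ReplaceError::NoInstance) => continue,
            // Any other error is returned to the caller as-is.
            Err(e) => return Err(e),
        }
    }
    Err(ReplaceError::NoInstance)
}

fn main() {
    // Simulate a Propolis that reports "no instance" twice before accepting.
    let mut calls = 0;
    let result = replace_vcr_with_retry(
        || {
            calls += 1;
            if calls < 3 { Err(ReplaceError::NoInstance) } else { Ok(()) }
        },
        5,
    );
    assert_eq!(result, Ok(()));
    assert_eq!(calls, 3);
}
```

This only addresses the first bullet; the second (the request queuing behind the stuck "start the VM" task) is the harder problem tracked in the Propolis issue below.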

@gjcolombo (Contributor) commented:

Filed oxidecomputer/propolis#841 to track the relevant Propolis enhancement.
