
A VMM in state 'starting' can receive requests #7398

Status: Open (wants to merge 1 commit into main)
Conversation

@jmpesp (Contributor) commented Jan 24, 2025:

If a VMM is in state 'starting' and cannot activate a given disk, it will hang. Region replacement or region snapshot replacement may alter the VCR of that disk to one that *could* be activated, but the current drive saga will not attempt to send requests to a VMM in 'starting'. Fix this: any time a Propolis is expected to be there, it should be OK for it to receive these requests. Otherwise a repair can get stuck, and the VMM could also fail to stop if it gets stuck deactivating!

@gjcolombo (Contributor) commented:
Have we verified empirically that region replacement and region snapshot replacement work as intended on a Propolis that hasn't yet reached Running? I'm not certain they will (and would need to go look carefully at the relevant Propolis code to check).

@hawkw (Member) left a comment:

I agree with @gjcolombo --- it's worth making sure that propolis-server can actually handle region replacement requests in that state.

| VmmState::Rebooting
| VmmState::Starting => {
// Propolis server is expected to be there
// (eventually, in the case of "Starting"), and
@hawkw (Member) replied:

I believe that a VMM isn't in the Starting state unless the Propolis server process does exist, FWIW --- if memory serves, that's the distinction between Starting and Creating. It may not yet be incarnating an instance, though.
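The change under discussion hinges on which VMM states imply that a Propolis server exists to receive requests. As a rough sketch of that decision (using a simplified stand-in enum, not the actual Omicron types, and with the exact set of "Propolis expected" states being an assumption based on this thread), the match might look like:

```rust
// Illustrative only: a simplified stand-in for Omicron's VmmState enum,
// not the real omicron type. Which variants count as "Propolis expected"
// is an assumption drawn from the discussion above.
#[derive(Debug, PartialEq)]
enum VmmState {
    Starting,
    Running,
    Rebooting,
    Migrating,
    Stopping,
    Stopped,
    Failed,
    Destroyed,
}

/// Returns true when a Propolis server is expected to exist for this VMM
/// and so can (eventually, in the case of Starting) receive requests such
/// as volume-replacement calls.
fn propolis_expected(state: &VmmState) -> bool {
    match state {
        // A Propolis server should be there, so requests are OK to send.
        VmmState::Starting
        | VmmState::Running
        | VmmState::Rebooting
        | VmmState::Migrating => true,
        // No Propolis to talk to in these states.
        VmmState::Stopping
        | VmmState::Stopped
        | VmmState::Failed
        | VmmState::Destroyed => false,
    }
}

fn main() {
    assert!(propolis_expected(&VmmState::Starting));
    assert!(!propolis_expected(&VmmState::Destroyed));
}
```

Note that, per the comment above, Starting (unlike Creating) already implies the Propolis server process exists, even if it isn't yet incarnating an instance.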

@gjcolombo (Contributor) commented:

I looked at this a bit. In Propolis there are at least two things we need to consider:

  • Sled agent maps Propolis's "Creating" state to Omicron's "Starting" state, so it's possible that the Propolis API call will return a "no instance" error (it hasn't gotten far enough along to believe it has an active VMM yet). This is probably fine; if you can detect this error you can just retry.
  • Probably more significantly: Propolis queues VCR change requests to the worker task that's otherwise responsible for initializing VMs and changing their state. (This is so that we don't have to reason about what happens if you try to mutate an instance's configuration while it's migrating.) If a VM is stuck in Starting because it can't activate a Crucible disk, the region replacement that might fix the situation will queue up behind the "start the VM" task and won't actually resolve.

It might be possible to fix the latter issue, but it's probably going to take a fair amount of effort on the Propolis side of the house. (We'd also need to be sure that if we have a Crucible upstairs that's stuck in a Volume::activate, and we call Volume::target_replace on it, then that succeeds and allows the activate call to resolve.)
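The first issue above (a "no instance" error from a Propolis still in its internal Creating state) suggests a simple retry on the caller's side. A hedged sketch of that shape, with an entirely made-up error type and request closure standing in for the real Propolis client call:

```rust
// Hypothetical sketch of the retry suggested above. The error enum and the
// request closure are invented for illustration; they are not the actual
// Propolis client API.
#[derive(Debug, PartialEq)]
enum ReplaceError {
    NoInstance,    // Propolis has no active VMM yet; safe to retry
    Other(String), // anything else is terminal for this attempt
}

/// Retry a VCR-replace request while Propolis reports "no instance",
/// up to `max_attempts` times.
fn replace_vcr_with_retry(
    mut send_request: impl FnMut() -> Result<(), ReplaceError>,
    max_attempts: usize,
) -> Result<(), ReplaceError> {
    for _ in 0..max_attempts {
        match send_request() {
            Ok(()) => return Ok(()),
            // Not far enough along to have an active VMM yet: retry.
            Err(ReplaceError::NoInstance) => continue,
            // Any other error is returned to the caller as-is.
            Err(e) => return Err(e),
        }
    }
    Err(ReplaceError::NoInstance)
}

fn main() {
    // Simulate a Propolis that reports "no instance" twice before accepting.
    let mut calls = 0;
    let result = replace_vcr_with_retry(
        || {
            calls += 1;
            if calls < 3 { Err(ReplaceError::NoInstance) } else { Ok(()) }
        },
        5,
    );
    assert_eq!(result, Ok(()));
    assert_eq!(calls, 3);
}
```

This only addresses the first bullet; the second (the request queuing behind the stuck "start the VM" task) is the harder problem tracked in the Propolis issue below.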

@gjcolombo (Contributor) commented:

Filed oxidecomputer/propolis#841 to track the relevant Propolis enhancement.
