
Fix infinite wait loop in OpenHCL error handling path #677

Merged
merged 8 commits into microsoft:main on Jan 16, 2025

Conversation

bhargavshah1988
Contributor

During servicing, the host does not send a response to failure notifications sent by OpenHCL, yet OpenHCL waits indefinitely for one, resulting in an infinite wait loop. This change updates the error handling path so that it no longer expects or waits for a host response.
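A minimal sketch of the behavioral change described above, using hypothetical names (`HostPipe`, `report_save_failure`, etc.) rather than the real OpenHCL types: the old path reports the save failure and then blocks on a response that never arrives; the new path reports it and returns immediately.

```rust
// Illustrative only; not the actual OpenHCL / underhill_core code.
struct HostPipe;

impl HostPipe {
    fn send(&self, msg: &str) {
        println!("-> host: {msg}");
    }

    // Old path: after reporting the save failure, block until the host
    // replies. The host never replies to failure notifications, so this
    // never terminates and VTL0 appears hung.
    #[allow(dead_code)]
    fn report_save_failure_and_wait(&self) -> ! {
        self.send("SaveGuestVtl2StateFailed");
        loop {} // stand-in for waiting on a response that never arrives
    }

    // New path: report the failure and return immediately, letting
    // OpenHCL recover on its own.
    fn report_save_failure(&self) {
        self.send("SaveGuestVtl2StateFailed");
        // No receive here: the host does not acknowledge failure reports.
    }
}

fn main() {
    let pipe = HostPipe;
    pipe.report_save_failure();
}
```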
@bhargavshah1988 bhargavshah1988 requested a review from a team as a code owner January 16, 2025 00:35
@benhillis
Member

The existing logic here seems very old. How is this now coming up?

@bhargavshah1988
Contributor Author

How is this now coming up?

After reporting the save state failure to the host, OpenHCL waits indefinitely for a response. The host, however, only sends a response on save success and does not respond to save failure notifications. Consequently, VTL0 becomes unresponsive, even though the save failure is non-fatal and correctly reported to the host.
We recently introduced logic to enable the shared pool (#540).
However, saving is not supported when the shared pool is enabled (https://github.com/microsoft/openvmm/blob/release/2411/openhcl/underhill_core/src/dispatch/mod.rs#L621), which causes save operations to fail.
During servicing operations, any failure after the save phase is treated as a point of no return, prompting the host to reset the VM. Failures during the save phase itself, however, should be recoverable: OpenHCL should recover autonomously rather than requiring a host-initiated VM reset.
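For illustration, a hedged sketch of the save guard referenced above (the real check lives in the linked `dispatch/mod.rs`); the function and error names below are made up, not the actual openvmm code.

```rust
// Made-up types to illustrate why the save phase now fails.
struct SavedState;

#[derive(Debug)]
enum SaveError {
    SharedPoolInUse,
}

fn save_guest_vtl2_state(shared_pool_enabled: bool) -> Result<SavedState, SaveError> {
    if shared_pool_enabled {
        // Saving is not supported while the shared pool is enabled, so the
        // servicing save phase fails here and the failure is reported to
        // the host.
        return Err(SaveError::SharedPoolInUse);
    }
    Ok(SavedState)
}

fn main() {
    // With the shared pool enabled (#540), the save path now fails at runtime.
    assert!(save_guest_vtl2_state(true).is_err());
}
```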

@@ -628,6 +628,16 @@ impl HostRequestPipeAccess {
get_protocol::HeaderHostRequest::read_from_prefix(data.as_bytes()).unwrap();
self.recv_response_fixed_size(req_header.message_id).await
}

/// Sends a request to the host.
///
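Roughly the shape the reviewers are asking for below: a doc comment that steers people away from the fire-and-forget variant except in the one case it exists for. This is a self-contained sketch with a guessed method name and stand-in body, not the merged code.

```rust
// Sketch only; the real type is HostRequestPipeAccess in underhill_core.
struct HostRequestPipeAccess {
    buf: Vec<u8>,
}

impl HostRequestPipeAccess {
    /// Sends a request to the host without waiting for a response.
    ///
    /// NOTE: you almost never want this. Requests normally have a matching
    /// response, and skipping the receive can desynchronize the response
    /// queue. This exists only for the servicing save-failure report, which
    /// the host never answers.
    fn send_request_no_response(&mut self, data: &[u8]) {
        // Stand-in for writing the message to the host pipe.
        self.buf.extend_from_slice(data);
    }
}

fn main() {
    let mut pipe = HostRequestPipeAccess { buf: Vec::new() };
    pipe.send_request_no_response(b"SaveGuestVtl2StateFailed");
    assert!(!pipe.buf.is_empty());
}
```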
Member

remove the extra line here, and I'd add a note that you probably don't want this, as it does not wait for a response.

Contributor

Yeah, this needs more documentation on why you don't want to use this 99% of the time.

Member

Maybe we should just name it based on the single case that it should be used for.

Contributor Author

I have documented that in comments.

@chris-oo
Member

To clarify, this save failure path where OHCL is expected to recover, isn't well tested as we didn't have any paths that exercised this before.

@smalis-msft
Contributor

This isn't correctly following the GET semantics. A request should always wait for a response. A notification is what should be used when no response is expected. Of course introducing a new notification would require host support. But this feels really weird.
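A small sketch of the protocol-level distinction being described, using made-up types rather than the real GET protocol definitions: a Request is matched with a host response by message id, while a Notification is one-way and never waited on.

```rust
// Illustrative types only; not the real GET protocol definitions.
#[allow(dead_code)]
enum Message {
    // A Request is expected to be matched by exactly one Response from the
    // host, keyed by message id.
    Request { message_id: u64, payload: Vec<u8> },
    // A Notification is one-way: the sender never waits for a reply.
    Notification { payload: Vec<u8> },
}

fn expects_response(msg: &Message) -> bool {
    matches!(msg, Message::Request { .. })
}

fn main() {
    let save_failed = Message::Request { message_id: 42, payload: vec![] };
    // The awkwardness: OpenHCL sends the save failure as a Request (so it
    // would normally wait), but the host never replies on the failure path.
    assert!(expects_response(&save_failed));
}
```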

@mattkur
Contributor

mattkur commented Jan 16, 2025

To clarify, this save failure path where OHCL is expected to recover, isn't well tested as we didn't have any paths that exercised this before.

Sorry to pile on: do we have the infrastructure in place to add tests for this as part of this PR? We have some servicing tests in petri, but I don't know enough to know if that helps.

I'm grateful for the fix (thanks Bhargav!), but this seems like an important test gap to fill.

Also note that the failure path (fail to save) no longer exists in the active branch.

@chris-oo
Member

This isn't correctly following the GET semantics. A request should always wait for a response. A notification is what should be used when no response is expected. Of course introducing a new notification would require host support. But this feels really weird.

The whole failure path is weird - given we have to support the hosts that already exist, I'm unfortunately inclined to mark this as yet-another-protocol-wart that we have to live with until we can rev the protocol. The weirdness is that this uses the same type as a servicing request with saved state, which has a response, but it seems like the original implementation never sent a response back for this message.

@mebersol
Collaborator

How is this now coming up?

After reporting the save state failure to the host, OpenHCL waits indefinitely for a response. The host, however, only sends a response on save success and does not respond to save failure notifications. Consequently, VTL0 becomes unresponsive, even though the save failure is non-fatal and correctly reported to the host. We recently introduced logic to enable the shared pool (#540). However, saving is not supported when the shared pool is enabled (https://github.com/microsoft/openvmm/blob/release/2411/openhcl/underhill_core/src/dispatch/mod.rs#L621), which causes save operations to fail. During servicing operations, any failure after the save phase is treated as a point of no return, prompting the host to reset the VM. Failures during the save phase itself, however, should be recoverable: OpenHCL should recover autonomously rather than requiring a host-initiated VM reset.

This is really a host-side bug, based on the current protocol construction. Are there plans to address the host?

@bhargavshah1988
Contributor Author

Are there plans to address the host?

Do we expect the host to acknowledge receipt of a notification packet? Why?

@bhargavshah1988
Contributor Author

This isn't correctly following the GET semantics. A request should always wait for a response. A notification is what should be used when no response is expected. Of course introducing a new notification would require host support. But this feels really weird.

I think I can change the name of this function to send_notification. Thoughts?
Also, on failure paths the host always treats the failure report from OHCL as a notification only, so I believe host support is not required.

@smalis-msft
Contributor

Notification vs Request is a type and protocol-level concept. SaveGuestVtl2StateRequest is a Request.

@smalis-msft
Contributor

smalis-msft commented Jan 16, 2025

Bhargav and I chatted on teams and I understand this a lot more now. Basically the problem is we've defined a Request, but the host doesn't send a reply on failure paths here, only success paths. This is a bug on the host, and should be fixed eventually. I agree that this is the right code change to work around the problem from OpenHCL's point of view, so long as we have sufficient comments on it documenting why we're doing this weirdness and breaking protocol. I think the actual code change is fine as is.

Also I am slightly concerned about the possibility of some future host change (or just a different code path) sending a reply to this request at some point. That would cause our recv queue to get confused, since it now has this response that was never waited for. This would cause whatever the next Request is to fail, since it would see a mismatched response. We might need some way to say "a response is optional, if we happen to get one throw it away" instead of "we will never get a response"?
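One possible shape for that "response is optional" idea, sketched with simplified, hypothetical types (not the real recv queue): mark the message id as discardable so a late reply is dropped instead of being mismatched against the next request.

```rust
// Simplified stand-in for the response queue; not OpenHCL code.
use std::collections::{HashSet, VecDeque};

struct Response {
    message_id: u64,
}

struct RecvQueue {
    pending: VecDeque<Response>,
    // Message ids whose responses, if they ever arrive, should be discarded.
    discard: HashSet<u64>,
}

impl RecvQueue {
    fn mark_response_optional(&mut self, message_id: u64) {
        self.discard.insert(message_id);
    }

    fn next_response(&mut self) -> Option<Response> {
        while let Some(resp) = self.pending.pop_front() {
            if self.discard.remove(&resp.message_id) {
                // Stray reply to a fire-and-forget request: throw it away.
                continue;
            }
            return Some(resp);
        }
        None
    }
}

fn main() {
    let mut q = RecvQueue { pending: VecDeque::new(), discard: HashSet::new() };
    q.mark_response_optional(7);
    q.pending.push_back(Response { message_id: 7 });
    q.pending.push_back(Response { message_id: 8 });
    // The stale response for id 7 is silently dropped; id 8 still matches.
    assert_eq!(q.next_response().unwrap().message_id, 8);
}
```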

@bhargavshah1988
Contributor Author

To clarify, this save failure path where OHCL is expected to recover, isn't well tested as we didn't have any paths that exercised this before.

Sorry to pile on: do we have the infrastructure in place to add tests for this as part of this PR? We have some servicing tests in petri, but I don't know enough to know if that helps.

I'm grateful for the fix (thanks Bhargav!), but this seems like an important test gap to fill.

Also note that the failure path (fail to save) no longer exists in the active branch.

Yes, we are considering adding this test.

@bhargavshah1988 bhargavshah1988 merged commit e300a41 into microsoft:main Jan 16, 2025
25 checks passed
bhargavshah1988 added a commit to bhargavshah1988/openvmm that referenced this pull request Jan 16, 2025
During servicing, the host does not send a response for failure
notifications sent by OpenHCL. However, OpenHCL
waits indefinitely for a host response, causing an infinite wait loop.
Updated the error handling path to avoid expecting and waiting for a
host response.
chris-oo pushed a commit that referenced this pull request Jan 16, 2025
During servicing, the host does not send a response for failure
notifications sent by OpenHCL. However, OpenHCL
waits indefinitely for a host response, causing an infinite wait loop.
Updated the error handling path to avoid expecting and waiting for a
host response.

Cherry pick from #677
@benhillis benhillis added the backport_2411 Change should be backported to the release/2411 branch label Feb 5, 2025
@jstarks
Member

jstarks commented Feb 6, 2025

Backported in #684

@jstarks jstarks removed the backport_2411 Change should be backported to the release/2411 branch label Feb 6, 2025
@jstarks jstarks added the backported_2411 PR that has been backported to release/2411 label Feb 6, 2025