
Fix infinite wait loop in OpenHCL error handling path #677

Merged
merged 8 commits into microsoft:main on Jan 16, 2025

Conversation

bhargavshah1988
Contributor

During servicing, the host does not send a response to failure notifications sent by OpenHCL, yet OpenHCL waits indefinitely for one, resulting in an infinite wait loop. This change updates the error handling path so that it no longer expects or waits for a host response.
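A minimal sketch of the behavioral change described above, using hypothetical names (`HostPipe`, `report_save_failure`, etc.) rather than the real OpenHCL types: the old path reports the save failure and then blocks on a response that never arrives; the new path reports it and returns immediately.

```rust
// Illustrative only; not the actual OpenHCL / underhill_core code.
struct HostPipe;

impl HostPipe {
    fn send(&self, msg: &str) {
        println!("-> host: {msg}");
    }

    // Old path: after reporting the save failure, block until the host
    // replies. The host never replies to failure notifications, so this
    // never terminates and VTL0 appears hung.
    #[allow(dead_code)]
    fn report_save_failure_and_wait(&self) -> ! {
        self.send("SaveGuestVtl2StateFailed");
        loop {} // stand-in for waiting on a response that never arrives
    }

    // New path: report the failure and return immediately, letting
    // OpenHCL recover on its own.
    fn report_save_failure(&self) {
        self.send("SaveGuestVtl2StateFailed");
        // No receive here: the host does not acknowledge failure reports.
    }
}

fn main() {
    let pipe = HostPipe;
    pipe.report_save_failure();
}
```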
@bhargavshah1988 bhargavshah1988 requested a review from a team as a code owner January 16, 2025 00:35
@benhillis
Member

The existing logic here seems very old. How is this now coming up?

@bhargavshah1988
Contributor Author

How is this now coming up?

After reporting the save state failure to the host, OpenHCL waits indefinitely for a response. The host, however, only sends a response on save success and does not respond to save failure notifications. Consequently, VTL0 becomes unresponsive, even though the save failure is non-fatal and correctly reported to the host.
We recently introduced logic to enable the shared pool (#540).
However, saving is not supported when the shared pool is enabled (https://github.com/microsoft/openvmm/blob/release/2411/openhcl/underhill_core/src/dispatch/mod.rs#L621), which causes save operations to fail.
During servicing operations, any failure after the save phase is treated as a point of no return, prompting the host to reset the VM. Failures during the save phase itself, however, should be recoverable: OpenHCL should recover autonomously rather than requiring a host-initiated VM reset.
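For illustration, a hedged sketch of the save guard referenced above (the real check lives in the linked `dispatch/mod.rs`); the function and error names below are made up, not the actual openvmm code.

```rust
// Made-up types to illustrate why the save phase now fails.
struct SavedState;

#[derive(Debug)]
enum SaveError {
    SharedPoolInUse,
}

fn save_guest_vtl2_state(shared_pool_enabled: bool) -> Result<SavedState, SaveError> {
    if shared_pool_enabled {
        // Saving is not supported while the shared pool is enabled, so the
        // servicing save phase fails here and the failure is reported to
        // the host.
        return Err(SaveError::SharedPoolInUse);
    }
    Ok(SavedState)
}

fn main() {
    // With the shared pool enabled (#540), the save path now fails at runtime.
    assert!(save_guest_vtl2_state(true).is_err());
}
```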

@@ -628,6 +628,16 @@ impl HostRequestPipeAccess {
get_protocol::HeaderHostRequest::read_from_prefix(data.as_bytes()).unwrap();
self.recv_response_fixed_size(req_header.message_id).await
}

/// Sends a request to the host.
///
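Roughly the shape the reviewers are asking for below: a doc comment that steers people away from the fire-and-forget variant except in the one case it exists for. This is a self-contained sketch with a guessed method name and stand-in body, not the merged code.

```rust
// Sketch only; the real type is HostRequestPipeAccess in underhill_core.
struct HostRequestPipeAccess {
    buf: Vec<u8>,
}

impl HostRequestPipeAccess {
    /// Sends a request to the host without waiting for a response.
    ///
    /// NOTE: you almost never want this. Requests normally have a matching
    /// response, and skipping the receive can desynchronize the response
    /// queue. This exists only for the servicing save-failure report, which
    /// the host never answers.
    fn send_request_no_response(&mut self, data: &[u8]) {
        // Stand-in for writing the message to the host pipe.
        self.buf.extend_from_slice(data);
    }
}

fn main() {
    let mut pipe = HostRequestPipeAccess { buf: Vec::new() };
    pipe.send_request_no_response(b"SaveGuestVtl2StateFailed");
    assert!(!pipe.buf.is_empty());
}
```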
Member

remove the extra line here, and I'd add a note that you probably don't want this, as it does not wait for a response.

Contributor

Yeah, this needs more documentation on why you don't want to use this 99% of the time.

Member

Maybe we should just name it based on the single case that it should be used for.

Contributor Author

I have documented that in comments.

@chris-oo
Member

To clarify, this save failure path where OHCL is expected to recover, isn't well tested as we didn't have any paths that exercised this before.

@smalis-msft
Contributor

This isn't correctly following the GET semantics. A request should always wait for a response. A notification is what should be used when no response is expected. Of course introducing a new notification would require host support. But this feels really weird.
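A small sketch of the protocol-level distinction being described, using made-up types rather than the real GET protocol definitions: a Request is matched with a host response by message id, while a Notification is one-way and never waited on.

```rust
// Illustrative types only; not the real GET protocol definitions.
#[allow(dead_code)]
enum Message {
    // A Request is expected to be matched by exactly one Response from the
    // host, keyed by message id.
    Request { message_id: u64, payload: Vec<u8> },
    // A Notification is one-way: the sender never waits for a reply.
    Notification { payload: Vec<u8> },
}

fn expects_response(msg: &Message) -> bool {
    matches!(msg, Message::Request { .. })
}

fn main() {
    let save_failed = Message::Request { message_id: 42, payload: vec![] };
    // The awkwardness: OpenHCL sends the save failure as a Request (so it
    // would normally wait), but the host never replies on the failure path.
    assert!(expects_response(&save_failed));
}
```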

@mattkur
Contributor

mattkur commented Jan 16, 2025

To clarify, this save failure path where OHCL is expected to recover, isn't well tested as we didn't have any paths that exercised this before.

Sorry to pile on: do we have the infrastructure in place to add tests for this as part of this PR? We have some servicing tests in petri, but I don't know enough to know if that helps.

I'm grateful for the fix (thanks Bhargav!), but this seems like an important test gap to fill.

Also note that the failure path (fail to save) no longer exists in the active branch.

@chris-oo
Member

This isn't correctly following the GET semantics. A request should always wait for a response. A notification is what should be used when no response is expected. Of course introducing a new notification would require host support. But this feels really weird.

The whole failure path is weird - given we have to support the hosts that already exist, I'm unfortunately inclined to mark this as yet-another-protocol-wart that we have to live with until we can rev the protocol. The weirdness is that this uses the same type as a servicing request with saved state, which has a response, but it seems like the original implementation never sent a response back for this message.

@mebersol
Collaborator

How is this now coming up?

After reporting the save state failure to the host, OpenHCL waits indefinitely for a response. The host, however, only sends a response on save success and does not respond to save failure notifications. Consequently, VTL0 becomes unresponsive, even though the save failure is non-fatal and correctly reported to the host. We recently introduced logic to enable the shared pool (#540). However, saving is not supported when the shared pool is enabled (https://github.com/microsoft/openvmm/blob/release/2411/openhcl/underhill_core/src/dispatch/mod.rs#L621), which causes save operations to fail. During servicing operations, any failure after the save phase is treated as a point of no return, prompting the host to reset the VM. Failures during the save phase itself, however, should be recoverable: OpenHCL should recover autonomously rather than requiring a host-initiated VM reset.

This is really a host-side bug, based on the current protocol construction. Are there plans to address the host?

@bhargavshah1988
Contributor Author

Are there plans to address the host?

Do we expect the host to acknowledge receipt of a notification packet? Why?

@bhargavshah1988
Contributor Author

This isn't correctly following the GET semantics. A request should always wait for a response. A notification is what should be used when no response is expected. Of course introducing a new notification would require host support. But this feels really weird.

I think I can change the name of this function to send_notification. Thoughts?
Also, on failure paths the host always treats the failure report from OHCL as a notification only, so I believe host support is not required.

@smalis-msft
Contributor

Notification vs Request is a type and protocol-level concept. SaveGuestVtl2StateRequest is a Request.

@smalis-msft
Contributor

smalis-msft commented Jan 16, 2025

Bhargav and I chatted on teams and I understand this a lot more now. Basically the problem is we've defined a Request, but the host doesn't send a reply on failure paths here, only success paths. This is a bug on the host, and should be fixed eventually. I agree that this is the right code change to work around the problem from OpenHCL's point of view, so long as we have sufficient comments on it documenting why we're doing this weirdness and breaking protocol. I think the actual code change is fine as is.

Also I am slightly concerned about the possibility of some future host change (or just a different code path) sending a reply to this request at some point. That would cause our recv queue to get confused, since it now has this response that was never waited for. This would cause whatever the next Request is to fail, since it would see a mismatched response. We might need some way to say "a response is optional, if we happen to get one throw it away" instead of "we will never get a response"?
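One possible shape for that "response is optional" idea, sketched with simplified, hypothetical types (not the real recv queue): mark the message id as discardable so a late reply is dropped instead of being mismatched against the next request.

```rust
// Simplified stand-in for the response queue; not OpenHCL code.
use std::collections::{HashSet, VecDeque};

struct Response {
    message_id: u64,
}

struct RecvQueue {
    pending: VecDeque<Response>,
    // Message ids whose responses, if they ever arrive, should be discarded.
    discard: HashSet<u64>,
}

impl RecvQueue {
    fn mark_response_optional(&mut self, message_id: u64) {
        self.discard.insert(message_id);
    }

    fn next_response(&mut self) -> Option<Response> {
        while let Some(resp) = self.pending.pop_front() {
            if self.discard.remove(&resp.message_id) {
                // Stray reply to a fire-and-forget request: throw it away.
                continue;
            }
            return Some(resp);
        }
        None
    }
}

fn main() {
    let mut q = RecvQueue { pending: VecDeque::new(), discard: HashSet::new() };
    q.mark_response_optional(7);
    q.pending.push_back(Response { message_id: 7 });
    q.pending.push_back(Response { message_id: 8 });
    // The stale response for id 7 is silently dropped; id 8 still matches.
    assert_eq!(q.next_response().unwrap().message_id, 8);
}
```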

@bhargavshah1988
Contributor Author

To clarify, this save failure path where OHCL is expected to recover, isn't well tested as we didn't have any paths that exercised this before.

Sorry to pile on: do we have the infrastructure in place to add tests for this as part of this PR? We have some servicing tests in petri, but I don't know enough to know if that helps.

I'm grateful for the fix (thanks Bhargav!), but this seems like an important test gap to fill.

Also note that the failure path (fail to save) no longer exists in the active branch.

Yes, we are considering adding this test.

@bhargavshah1988 bhargavshah1988 merged commit e300a41 into microsoft:main Jan 16, 2025
25 checks passed
bhargavshah1988 added a commit to bhargavshah1988/openvmm that referenced this pull request Jan 16, 2025
During servicing, the host does not send a response for failure
notifications sent by OpenHCL. However, OpenHCL
waits indefinitely for a host response, causing an infinite wait loop.
Updated the error handling path to avoid expecting and waiting for a
host response.
chris-oo pushed a commit that referenced this pull request Jan 16, 2025
During servicing, the host does not send a response for failure
notifications sent by OpenHCL. However, OpenHCL
waits indefinitely for a host response, causing an infinite wait loop.
Updated the error handling path to avoid expecting and waiting for a
host response.

Cherry pick from #677
@benhillis benhillis added the backport_2411 Change should be backported to the release/2411 branch label Feb 5, 2025
@jstarks
Member

jstarks commented Feb 6, 2025

Backported in #684

@jstarks jstarks removed the backport_2411 Change should be backported to the release/2411 branch label Feb 6, 2025
@jstarks jstarks added the backported_2411 PR that has been backported to release/2411 label Feb 6, 2025