-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix infinite wait loop in OpenHCL error handling path #677
Fix infinite wait loop in OpenHCL error handling path #677
Conversation
During servicing, the host does not send a response for failure notifications sent by OpenHCL. However, OpenHCL waits indefinitely for a host response, causing an infinite wait loop. Updated the error handling path to avoid expecting and waiting for a host response.
The existing logic here seems very old. How is this now coming up? |
After reporting the save state failure to the host, OpenHCL waits indefinitely for a response. The host, however, only sends a response on save success and does not respond to save failures notifications. Consequently, VTL0 becomes unresponsive, even though the save failure is non-fatal and correctly reported to the host. |
@@ -628,6 +628,16 @@ impl HostRequestPipeAccess { | |||
get_protocol::HeaderHostRequest::read_from_prefix(data.as_bytes()).unwrap(); | |||
self.recv_response_fixed_size(req_header.message_id).await | |||
} | |||
|
|||
/// Sends a request to the host. | |||
/// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove the extra line here, and i'd add a note that you probably don't want want this, as it does not wait for a response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this needs more documentation on why you don't want to use this 99% of the time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should just name it based on the single case that it should be used for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have documented that in comments.
To clarify, this save failure path where OHCL is expected to recover, isn't well tested as we didn't have any paths that exercised this before. |
This isn't correctly following the GET semantics. A request should always wait for a response. A notification is what should be used when no response is expected. Of course introducing a new notification would require host support. But this feels really weird. |
Sorry to pile on: do we have the infrastructure in place to add tests for this as part of this PR? We have some servicing tests in petri, but I don't know enough to know if that helps. I'm grateful for the fix (thanks Bhargav!), but this seems like an important test gap to fill. Also note that the failure path (fail to save) no longer exists in the active branch. |
The whole failure path is weird - given we have to support the hosts that already exists, I'm unfortunately inclined to mark this as yet-another-protocol-wart that we have to live with until we can rev the protocol. The weirdness is that this uses the same type as a servicing request with saved state which has a response, but it seems like the original implementation never sent a response back for this message. |
This is really a host-side bug, based on the current protocol construction. Are there plans to address the host? |
Do we expect host to acknowledge reception of notification packet ? Why ? |
I think i can change the name of this function to send_notification. Thoughts? |
Notification vs Request is a type and protocol-level concept. SaveGuestVtl2StateRequest is a Request. |
Bhargav and I chatted on teams and I understand this a lot more now. Basically the problem is we've defined a Request, but the host doesn't send a reply on failure paths here, only success paths. This is a bug on the host, and should be fixed eventually. I agree that this is the right code change to work around the problem from OpenHCL's point of view, so long as we have sufficient comments on it documenting why we're doing this weirdness and breaking protocol. I think the actual code change is fine as is. Also I am slightly concerned about the possibility of some future host change (or just a different code path) sending a reply to this request at some point. That would cause our recv queue to get confused, since it now has this response that was never waited for. This would cause whatever the next Request is to fail, since it would see a mismatched response. We might need some way to say "a response is optional, if we happen to get one throw it away" instead of "we will never get a response"? |
Yes, we are considering to add this test. |
During servicing, the host does not send a response for failure notifications sent by OpenHCL. However, OpenHCL waits indefinitely for a host response, causing an infinite wait loop. Updated the error handling path to avoid expecting and waiting for a host response.
During servicing, the host does not send a response for failure notifications sent by OpenHCL. However, OpenHCL waits indefinitely for a host response, causing an infinite wait loop. Updated the error handling path to avoid expecting and waiting for a host response. Cherry pick from #677
Backported in #684 |
During servicing, the host does not send a response for failure notifications sent by OpenHCL. However, OpenHCL
waits indefinitely for a host response, causing an infinite wait loop. Updated the error handling path to avoid expecting and waiting for a host response.