-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle nomad exec
termination events in order
#10657
Conversation
In this loop, we ought to close the websocket connection gracefully when the StreamErrWrapper reaches EOF. Previously, it's possible that that we drop the last few events or skip sending the websocket closure. If `handler(handlerPipe)` returns and `cancel` is called, before the loop here completes processing streaming events, the loop exits prematurely without propagating the last few events. Instead here, the loop continues until we hit `httpPipe` EOF (through `decoder.Decode`), to ensure we process the events to completion.
refactor the api handling of `nomad exec`, and ensure that we process all received events before handling websocket closing. The exit code should be the last message received, and we ought to ignore any websocket close error we receive afterwards. Previously, we used two channels: one for websocket frames and another for handling errors. This raised the possibility that we processed the error before processing the frames, resulting into an "unexpected EOF" error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM; just the one question / suggestion
api/allocations_exec.go
Outdated
} | ||
|
||
func (s *execSession) startConnection() (*websocket.Conn, error) { | ||
nodeClient, _ := s.client.GetNodeClientWithTimeout(s.alloc.NodeID, ClientConnTimeout, s.q) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be helpful to have a comment explaining why nodeClient
being nil here is acceptable - also, would it make sense to switch on the error here? E.g. we could get back NodeDownErr
which seems like a case to bail early?
https://github.com/hashicorp/nomad/blob/main/api/api.go#L492
Handle `nomad exec` termination events in order
Hi Team, I raise this discussion a few days ago https://discuss.hashicorp.com/t/nomad-exec-error-failed-to-exec-into-task-unexpected-eof/26539 Does this change requires access to NodeID which a non admin ACL token does not have access to? Which it will return 403 ?
Because the exec seems to be fine admin token. Thanks |
Hi @surajthakur ! I'm sorry I missed this comment earlier. I've investigated a related issue in #10922 . Can you try out the suggestions in #10922 (comment) and see if the change addresses the issue you are seeing. If not, I'd encourage you to file a new GitHub issue and I will follow up closer. I just tried reproducing the issue with permissions but without much luck. |
Hey @notnoop, The above mentioned issue is be a different one. My issue was related to ACL when performing exec. If I provided an admin ACL, then the exec was fine, as I suppose it would have easily fetched the nodeID. Where as given the below ACL, doesn't have nodeID fetch access, so when i tried exec with the below policy nomad server would return 403. and exec returning EOF exec error
There is special case here: If I try to make exec request bypassing load balancer, using http://nc-server-1:4646 the exec request is success. But in case of https it goes through load balancer(haproxy), it returns 403. I got solved it giving read permission for nodeID. So this is the new ACL permission
I checked in the earlier version which nomad 1.0.4, there was an 403 in /v1/node/* but exec probably just ignored it where was in nomad 1.1.2 it is kind of enforced and gives error if nodeID is returned 403 If you want I can share logs. if you still want I can create a github issue on this |
I suspect there are multiple issues at play - mind if you try the binaries in the issue I linked? The bit about how exec succeeds when bypassing the load balancer makes me suspect it's the issue there and not permission. I suspect that when you are using the root token, the CLI attempts and succeeds at connecting to the node directly bypassing the ELB for actual exec call. The CLI attempts to connect directly to the node to avoid overloading servers with unnecessary traffic. When the token lacks node permissions, it will fallback to execing through the ELB and because of the http/2 issue, nomad 1.1.2 fails. For record, here is my attempt with nomad 1.1.2:
On the server side, I see the
|
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
Ensure that websocket channels are closed gracefully and that exit codes are properly captured, to avoid the spurious "unexpected EOF" failures even when command completes successfully.
Examining the paths, there are two bugs contributing to the cause.
First, the http handler that bridge the internal streaming RPC to websocket connection had a goroutine that could returned early before emitting the
websocket.CloseNormalClosure
message.Second, on the api consumer side, if we got an "unexpected EOF" error, the api may prioritize handling it before emitting the last output or exit code. In theory, we can safely ignore any error occurring after receiving the exit code message, as the exit code message is the last message.
In this PR, we fix the goroutine early termination, and ensure that all receive messages and errors are processed in order.
Fixes #9892