Transient network errors are not handled properly for plugin API calls
Pre-requisites
I have tested with the `:latest` image tag and the issue still exists.
What happened/what you expected to happen?
1. Transient `*url.Error` is not handled properly.

While every transient network error surfaces as a `*url.Error` when using `net/http`'s `http.Client.Do()` (https://github.com/golang/go/blob/master/src/net/http/client.go#L579-L580), the only error treated as transient is `ECONNCLOSED` (source code). #9017 deals with the `ECONNRESET` error, but that pull request is not effective for HTTP requests because `ECONNRESET` actually arrives wrapped in a `*url.Error`.
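As a minimal, hypothetical sketch (not argo-workflows' actual helper), the snippet below shows both halves of this claim: `errors.As` confirms the `*url.Error` wrapper that `Client.Do` applies, and `errors.Is` still finds the underlying errno through the wrapping chain. It triggers `ECONNREFUSED` for convenience (a closed local port), but `ECONNRESET` is matched the same way:

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"syscall"
)

func main() {
	// Port 9 on localhost is normally closed, so the transport fails and
	// Client.Do (called by http.Get) wraps the failure in a *url.Error.
	resp, err := http.Get("http://127.0.0.1:9")
	if err != nil {
		var urlErr *url.Error
		fmt.Println("wrapped in *url.Error:", errors.As(err, &urlErr))
		// errors.Is unwraps url.Error -> net.OpError -> os.SyscallError,
		// so the syscall errno is detectable despite the wrapping; replace
		// ECONNREFUSED with syscall.ECONNRESET for the case in this issue.
		fmt.Println("is ECONNREFUSED:", errors.Is(err, syscall.ECONNREFUSED))
	}
	if resp != nil {
		resp.Body.Close()
	}
}
```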
2. argoexec does not retry ExecutorPlugin API calls on transient network errors.

The documentation describes any connection/socket error that occurs during a plugin API call as transient, so it should trigger the retry logic. Temporary network errors (the term is arguable, see golang/go#45729) are handled as of now, but that does not cover every transient network error. `ECONNRESET` is one example: it is generally not a temporary network error (golang/go#24808), but in the context of argo-workflows we can consider it transient, i.e. retrying may get a successful reply from the plugin.
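For illustration, a sketch of the retry behaviour this point argues for, with hypothetical names (`isTransient`, `callWithRetry`), an arbitrary port, and an arbitrary backoff; it is not argoexec's real retry code:

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"syscall"
	"time"
)

// isTransient reports whether err is one of the errors this issue argues
// should be retried; errors.Is sees through the *url.Error wrapping.
func isTransient(err error) bool {
	return errors.Is(err, syscall.ECONNRESET) || errors.Is(err, io.EOF)
}

// callWithRetry retries fn on transient errors with a simple linear backoff.
func callWithRetry(fn func() error) error {
	const maxAttempts = 3
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if !isTransient(err) {
			return err // permanent failure: surface immediately
		}
		time.Sleep(time.Duration(attempt) * 100 * time.Millisecond)
	}
	return err
}

func main() {
	err := callWithRetry(func() error {
		resp, err := http.Get("http://127.0.0.1:7984/") // hypothetical plugin address
		if err != nil {
			return err
		}
		return resp.Body.Close()
	})
	fmt.Println("final result:", err)
}
```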
3. An `EOF` error during an HTTP request should also be considered a transient network error.

I accidentally set the requeueing period to the same value as the ExecutorPlugin's keep-alive timeout and found that `EOF` can occur as well as `ECONNRESET`. When the `read()` syscall is invoked inside the HTTP request machinery, an `EOF` wrapped in a `*url.Error` can be returned if the peer has disconnected.

How to reproduce:
Implement a simple ExecutorPlugin with nodejs (e.g. with expressjs), without tuning the keep-alive timeout. The server should respond with `requeue: 5s` and `node.phase: Running`.
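The original reproduction uses nodejs; below is a hedged, self-contained equivalent in Go (swapping expressjs for `net/http`), where `http.Server`'s `IdleTimeout` plays the role of Node's default keep-alive timeout. Setting it to the same 5s as the requeue period recreates the collision described above. The port, path, and reply shape are assumptions for illustration, not verified against argoexec:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	// Path and reply shape follow the executor plugin docs; treat both as
	// assumptions for this sketch.
	mux.HandleFunc("/api/v1/template.execute", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		// Keep the node Running and ask argoexec to poll again in 5s.
		w.Write([]byte(`{"node": {"phase": "Running"}, "requeue": "5s"}`))
	})
	srv := &http.Server{
		Addr:    ":7984", // arbitrary port for the sketch
		Handler: mux,
		// Close idle keep-alive connections after 5s, the same period as the
		// requeue above, so the server may drop the connection at the exact
		// moment argoexec reuses it, surfacing EOF/ECONNRESET on the client.
		IdleTimeout: 5 * time.Second,
	}
	log.Fatal(srv.ListenAndServe())
}
```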
Version
latest
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
See the description above.
Logs from the workflow controller
Nothing special.
Logs from your workflow's wait container