-
-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TarWriter.WriteFile stuck on select when writing to io.pipe #8957
Comments
@EnchanterIO : thanks for reporting this. What is the impact of this issue on your service? Next steps that can be taken:
Maintainer @ajnavarro will attempt #2. |
Hi @BigLep, the impact is a crashing node due to constantly growing heap allocation due to stuck go-routines and IO pipes. Reagrding 1) we already have an explicit Thanks for looking into it! |
(Maybe related:) We never plumb the command context to the |
I think the Context is set to the One thing that is strange for me is switch err {
case nil:
case context.DeadlineExceeded, context.Canceled:
if ctx.Err() != nil {
return nil, ctx.Err()
}
// In this case, the context used to *preload* the node (in a previous
// `FetchChild` call) has been canceled. We need to retry the load with
// the current context and we might as well preload some extra nodes
// while we're at it.
nn.preload(ctx, childIndex)
child, err = nn.getPromiseValue(ctx, childIndex)
if err != nil {
return nil, err
}
default:
return nil, err
} I think that can be a problem. |
2022-05-12 conversation: |
The context passed is used only during fetching in our API, but the pipe created is not closed on it and the write gets stuck. |
I tried these two approaches to reproduce the error:
All get operations were canceled correctly.
I tried with CIDs pointing to files and folders too. |
Would it make a difference if the |
When following the original stack trace, one spot I found quite suspicious is I wasn't able to really pinpoint how that could get stuck, but generally this code and how the tear down is executed on error is quite confusing. I wouldn't be surprised if something is incorrect in there. This could be combined with the following gotcha of
So, if somehow:
... then that can block the whole thing? |
Fixes ipfs#8957 The context was only checked while reading data. Not while writing data to the http connection. So since the data flow through an io.Pipe the closing didn't flowed through and left the writer open hanging. Co-authored-by: Antonio Navarro Perez <[email protected]>
Fixes ipfs#8957 The context was only checked while reading data. Not while writing data to the http connection. So since the data flow through an io.Pipe the closing didn't flowed through and left the writer open hanging. Co-authored-by: Antonio Navarro Perez <[email protected]>
Fixes #8957 The context was only checked while reading data. Not while writing data to the http connection. So since the data flow through an io.Pipe the closing didn't flowed through and left the writer open hanging. Co-authored-by: Antonio Navarro Perez <[email protected]>
Great job @Jorropo and @ajnavarro in finding/fixing the issue. A few things:
|
@BigLep
You could add: var b [1]byte
rand.Read(b[:])
if b[0] & 1 == 0 {
panic("failed")
} and it would likely be about as reliable. |
Some thoughts (nothing important, feel free to ignore):
|
This looks promising. Thanks for the effort. I applied the patch to our go-ipfs nodes and rolled them our to production this morning. Will post results by the end of the week if the memory problem and hanging go-routines are solved or not. |
Thanks a ton for looking into this @BigLep, @schomatis , @ajnavarro , @Jorropo. Over a longer period I will be more sure, but after a week it seems like we no longer have any stuck |
Fixes #8957 The context was only checked while reading data. Not while writing data to the http connection. So since the data flow through an io.Pipe the closing didn't flowed through and left the writer open hanging. Co-authored-by: Antonio Navarro Perez <[email protected]>
Fixes #8957 The context was only checked while reading data. Not while writing data to the http connection. So since the data flow through an io.Pipe the closing didn't flowed through and left the writer open hanging. Co-authored-by: Antonio Navarro Perez <[email protected]> (cherry picked from commit 7892cc9)
Fixes #8957 The context was only checked while reading data. Not while writing data to the http connection. So since the data flow through an io.Pipe the closing didn't flowed through and left the writer open hanging. Co-authored-by: Antonio Navarro Perez <[email protected]> (cherry picked from commit 7892cc9)
Fixes #8957 The context was only checked while reading data. Not while writing data to the http connection. So since the data flow through an io.Pipe the closing didn't flowed through and left the writer open hanging. Co-authored-by: Antonio Navarro Perez <[email protected]> (cherry picked from commit 7892cc9)
Checklist
Installation method
built from source
Version
Config
-
Description
When debugging a memory leak on Infura nodes we noticed there are hundreds of stuck goroutines trying to process
get
commands.Stack trace example:
Seems like if the original reader is gone, the writer can't proceed? Maybe a bug when handling a context timeout / writing error / pipe closing?
Interestingly, the stuck
get
cmds aren't visible in theipfs diag cmds -v
list, as if they would be partially cleaned-up already.PS: our
/get
cmds are running with an explicittimeout=
option configured by a Reverse Proxy making HTTP requests to the go-ipfs node.The text was updated successfully, but these errors were encountered: