Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TarWriter.WriteFile stuck on select when writing to io.pipe #8957

Closed
3 tasks done
EnchanterIO opened this issue May 9, 2022 · 14 comments · Fixed by #8973
Closed
3 tasks done

TarWriter.WriteFile stuck on select when writing to io.pipe #8957

EnchanterIO opened this issue May 9, 2022 · 14 comments · Fixed by #8973
Assignees
Labels
kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization

Comments

@EnchanterIO
Copy link
Contributor

Checklist

Installation method

built from source

Version

0.11

Config

-

Description

When debugging a memory leak on Infura nodes we noticed there are hundreds of stuck goroutines trying to process get commands.

Stack trace example:

goroutine 448012757 [select, 3993 minutes]:
io.(*pipe).Write(0xc0f60ecd80, {0xc1ef3d4000, 0x100000, 0x100000})
	/usr/local/go/src/io/pipe.go:94 +0x21a
io.(*PipeWriter).Write(0xc10fc49b72, {0xc1ef3d4000, 0x5, 0x4})
	/usr/local/go/src/io/pipe.go:163 +0x25
bufio.(*Writer).Flush(0xc0f68f88c0)
	/usr/local/go/src/bufio/bufio.go:607 +0x62
bufio.(*Writer).Write(0xc0f68f88c0, {0xc01ccf8000, 0xc061b905a0, 0xc10fc49af0})
	/usr/local/go/src/bufio/bufio.go:643 +0xd8
github.com/ipfs/go-ipfs/core/commands.(*identityWriteCloser).Write(0x50, {0xc01ccf8000, 0xc10fc49b70, 0x14d6dcc})
	go-ipfs/core/commands/get.go:258 +0x26
archive/tar.(*regFileWriter).Write(0xc0e3b9d380, {0xc01ccf8000, 0x100000, 0xc0499b4400})
	/usr/local/go/src/archive/tar/writer.go:497 +0x69
archive/tar.(*Writer).Write(0xc0657ddc00, {0xc01ccf8000, 0xc10fc49b01, 0xc10fc49bb0})
	/usr/local/go/src/archive/tar/writer.go:435 +0x4a
bytes.(*Reader).WriteTo(0xc0fa4714a0, {0x32dcc80, 0xc0657ddc00})
	/usr/local/go/src/bytes/reader.go:144 +0x87
github.com/ipfs/go-unixfs/io.(*dagReader).writeNodeDataBuffer(0xc0f60ecd20, {0x32dcc80, 0xc0657ddc00})
	pkg/mod/github.com/ipfs/[email protected]/io/dagreader.go:257 +0x33
github.com/ipfs/go-unixfs/io.(*dagReader).WriteTo.func1({0x32f27f0, 0xc102a2fa40})
	pkg/mod/github.com/ipfs/[email protected]/io/dagreader.go:314 +0xd9
github.com/ipfs/go-ipld-format.(*Walker).visitActiveNode(0xc044be0b40, 0x14d6f02)
	pkg/mod/github.com/ipfs/[email protected]/walker.go:359 +0x4a
github.com/ipfs/go-ipld-format.(*Walker).down(0xc044be0b40, 0x0)
	pkg/mod/github.com/ipfs/[email protected]/walker.go:306 +0x20f
github.com/ipfs/go-ipld-format.(*Walker).Iterate(0xc044be0b40, 0xc0657ddc00)
	pkg/mod/github.com/ipfs/[email protected]/walker.go:195 +0x3f
github.com/ipfs/go-unixfs/io.(*dagReader).WriteTo(0x2a67fc0, {0x32dcc80, 0xc0657ddc00})
	pkg/mod/github.com/ipfs/[email protected]/io/dagreader.go:297 +0xf8
io.copyBuffer({0x32dcc80, 0xc0657ddc00}, {0x7f85b43f3d48, 0xc132b8f7c0}, {0x0, 0x0, 0x0})
	/usr/local/go/src/io/io.go:405 +0x16e
io.Copy(...)
	/usr/local/go/src/io/io.go:382
github.com/ipfs/go-ipfs-files.(*TarWriter).writeFile(0xc19bb81248, {0x7f85b43f3d10, 0xc132b8f7c0}, {0xc0cf864a06, 0x3b})
	pkg/mod/github.com/ipfs/[email protected]/tarwriter.go:46 +0xad
github.com/ipfs/go-ipfs-files.(*TarWriter).WriteFile(0xc19bb81248, {0x32f2bd8, 0xc132b8f7c0}, {0xc0cf864a06, 0x3b})
	pkg/mod/github.com/ipfs/[email protected]/tarwriter.go:59 +0x192
github.com/ipfs/go-ipfs/core/commands.fileArchive.func4()
	go-ipfs/core/commands/get.go:322 +0x45
created by github.com/ipfs/go-ipfs/core/commands.fileArchive
	go-ipfs/core/commands/get.go:320 +0x5da

Seems like if the original reader is gone, the writer can't proceed? Maybe a bug when handling a context timeout / writing error / pipe closing?

Interestingly, the stuck get cmds aren't visible in the ipfs diag cmds -v list, as if they would be partially cleaned-up already.

PS: our /get cmds are running with an explicit timeout= option configured by a Reverse Proxy making HTTP requests to the go-ipfs node.

@EnchanterIO EnchanterIO added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels May 9, 2022
@BigLep BigLep moved this to 🥞 Todo in IPFS Shipyard Team May 9, 2022
@BigLep
Copy link
Contributor

BigLep commented May 10, 2022

@EnchanterIO : thanks for reporting this. What is the impact of this issue on your service?

Next steps that can be taken:

  1. (Infura) explicitly add timeouts at the go-ipfs level to confirm the issue (rather than relying on nginx/lb timeouts)
  2. (Either side) see if can reproduce this issue locally by making ipfs get requests for various content that does exist and doesn't.
  3. (Maintainers) inspect the code to see if can reason about this from the stack trace

Maintainer @ajnavarro will attempt #2.

@EnchanterIO
Copy link
Contributor Author

EnchanterIO commented May 10, 2022

Hi @BigLep, the impact is a crashing node due to constantly growing heap allocation due to stuck go-routines and IO pipes.

Reagrding 1) we already have an explicit ?timeout= flag sent to go-ipfs nodes for every API call as well as Proxy timeout so the go-routine seems to be definitely stuck due to some resource mishandling.

Thanks for looking into it!

@schomatis
Copy link
Contributor

(Maybe related:) We never plumb the command context to the fileArchive that started the stack.

@ajnavarro
Copy link
Member

(Maybe related:) We never plumb the command context to the fileArchive that started the stack.

I think the Context is set to the file when calling api.Unixfs().Get. After that, the context is also set inside the dagReader and used by the dagWalker when calling WriteTo. dagWalker is using the context when calling NavigableIPLDNode.FetchChild and there, the context can be canceled correctly.

One thing that is strange for me is NavigableIPLDNode is trying to preload using an already canceled context:

switch err {
	case nil:
	case context.DeadlineExceeded, context.Canceled:
		if ctx.Err() != nil {
			return nil, ctx.Err()
		}

		// In this case, the context used to *preload* the node (in a previous
		// `FetchChild` call) has been canceled. We need to retry the load with
		// the current context and we might as well preload some extra nodes
		// while we're at it.
		nn.preload(ctx, childIndex)
		child, err = nn.getPromiseValue(ctx, childIndex)
		if err != nil {
			return nil, err
		}
	default:
		return nil, err
	}

I think that can be a problem.

@BigLep BigLep moved this from 🥞 Todo to 🏃‍♀️ In Progress in IPFS Shipyard Team May 12, 2022
@ipfs ipfs deleted a comment from welcome bot May 12, 2022
@BigLep
Copy link
Contributor

BigLep commented May 12, 2022

2022-05-12 conversation:
@ajnavarro hasn't been able to reproduce the error locally. He will post the approaches he took and what he found. Hopefully others will spot what we might be doing wrong.
Basically, we need to get a repro case so we can further debug.

@schomatis
Copy link
Contributor

The context passed is used only during fetching in our API, but the pipe created is not closed on it and the write gets stuck.

@ajnavarro
Copy link
Member

ajnavarro commented May 13, 2022

I tried these two approaches to reproduce the error:

  1. Add a file, remove some random block from it, and try to fetch it. Steps:
ipfs daemon
ipfs add --pin=false Downloads/go1.17.9.linux-amd64.tar.gz # add go.17 file
ipfs ls QmUhYYjETKjwFCGVuU16e6TVftBJfry9WSMyXm5GHSGogC # find a block to delete
ipfs block rm QmVwPxYFpbu96gi4pqk3f4xsHnjag79MdFbjKP7RnPHHTT # remove one random block
ipfs repo gc
ipfs get QmUhYYjETKjwFCGVuU16e6TVftBJfry9WSMyXm5GHSGogC # execute and cancel command
ipfs diag profile

All get operations were canceled correctly.

  1. Force a bunch of get operations and see if some of them get stuck:
ipfs daemon
ipfs add --pin=false Downloads/go1.17.9.linux-amd64.tar.gz # add go.17 file
ipfs ls QmUhYYjETKjwFCGVuU16e6TVftBJfry9WSMyXm5GHSGogC # find a block to delete
ipfs block rm QmVwPxYFpbu96gi4pqk3f4xsHnjag79MdFbjKP7RnPHHTT # remove one random block
ipfs repo gc
for i in $(seq 1 1000); do (ipfs get QmUhYYjETKjwFCGVuU16e6TVftBJfry9WSMyXm5GHSGogC --timeout=1m &); done
ipfs diag profile

I tried with CIDs pointing to files and folders too.

@EnchanterIO
Copy link
Contributor Author

Would it make a difference if the get reqs would be over HTTP? And if the original client would terminate before a get operation under timeout finishes?

@MichaelMure
Copy link
Contributor

When following the original stack trace, one spot I found quite suspicious is fileArchive(). In that stack trace, the flow goes through that function to setup some writers and a io.Pipe, go down the DAG fetching/handling, then come back in that io code where it get stuck regardless of the context status.

I wasn't able to really pinpoint how that could get stuck, but generally this code and how the tear down is executed on error is quite confusing. I wouldn't be surprised if something is incorrect in there. This could be combined with the following gotcha of io.Pipe:

// Reads and Writes on the pipe are matched one to one
// except when multiple Reads are needed to consume a single Write.
// That is, each Write to the PipeWriter --> blocks <-- until it has satisfied
// one or more Reads from the PipeReader that fully consume
// the written data.

So, if somehow:

  • a request get cancelled
  • ... while data was already read and piped into that buffer
  • ... but for some reason the write was big enough to require more than one read
  • ... and the request get cancelled just between those reads
  • ... killing the command output (that is, no more read), while data is still in the io.Pipe, including I suppose the final error

... then that can block the whole thing?

Jorropo added a commit to Jorropo/go-ipfs that referenced this issue May 13, 2022
Fixes ipfs#8957

The context was only checked while reading data.
Not while writing data to the http connection.
So since the data flow through an io.Pipe the closing didn't flowed through and left the writer open hanging.

Co-authored-by: Antonio Navarro Perez <[email protected]>
Jorropo added a commit to Jorropo/go-ipfs that referenced this issue May 13, 2022
Fixes ipfs#8957

The context was only checked while reading data.
Not while writing data to the http connection.
So since the data flow through an io.Pipe the closing didn't flowed through and left the writer open hanging.

Co-authored-by: Antonio Navarro Perez <[email protected]>
Repository owner moved this from 🏃‍♀️ In Progress to 🎉 Done in IPFS Shipyard Team May 13, 2022
Jorropo added a commit that referenced this issue May 13, 2022
Fixes #8957

The context was only checked while reading data.
Not while writing data to the http connection.
So since the data flow through an io.Pipe the closing didn't flowed through and left the writer open hanging.

Co-authored-by: Antonio Navarro Perez <[email protected]>
@BigLep
Copy link
Contributor

BigLep commented May 13, 2022

Great job @Jorropo and @ajnavarro in finding/fixing the issue. A few things:

  1. I see fix: hanging goroutine in get fileArchive handler #8973 doesn't have any test. What would it take to add one?
  2. @EnchanterIO : can you confirm this addresses your issue?
  3. Assuming we have confirmation this fixes the issue, I assume we'll include this in the 0.13 release (Release v0.13 #8640 ).

@BigLep BigLep mentioned this issue May 13, 2022
65 tasks
@Jorropo
Copy link
Contributor

Jorropo commented May 13, 2022

doesn't have any test. What would it take to add one?

@BigLep
No the test would be very VERY flaky:

  • It require the http API to be really slow.
    (the cancel must happen on an http write which is unlikely to begin with)
  • It require scanning the goroutines stacks to not find a certain function (which has billions of way to fail (or not)).

You could add:

var b [1]byte
rand.Read(b[:])
if b[0] & 1 == 0 {
  panic("failed")
}

and it would likely be about as reliable.

@schomatis
Copy link
Contributor

Some thoughts (nothing important, feel free to ignore):

  • Testing a leak is not that painful (brittle), we can try tools like https://github.com/fortytw2/leaktest or https://github.com/uber-go/goleak.

  • Triggering the leak is rather straightforward, it's not about the HTTP API; we just don't consume the returned reader.

  • The tricky part is that our commands library as it stands doesn't allow us to easily abstract and test a command on its own (that's why we rely so much on bash/sharness tests, not because we care that much about integration testing but because we can't easily test our commands in native Go).

  • On this particular issue testing costs do seem to outweigh its gains. Eventually we should have a good framework to test altogether any leaks in a running daemon after a battery of commands, but a regression on this particular issue seems unlikely (given the OP confirms the issue as resolved; we did close this one too early).

@EnchanterIO
Copy link
Contributor Author

This looks promising. Thanks for the effort. I applied the patch to our go-ipfs nodes and rolled them our to production this morning.

Will post results by the end of the week if the memory problem and hanging go-routines are solved or not.

@EnchanterIO
Copy link
Contributor Author

Thanks a ton for looking into this @BigLep, @schomatis , @ajnavarro , @Jorropo.

Over a longer period I will be more sure, but after a week it seems like we no longer have any stuck get cmds at any instance and the memory heap seems healthy again. 🥳

guseggert pushed a commit that referenced this issue Jun 8, 2022
Fixes #8957

The context was only checked while reading data.
Not while writing data to the http connection.
So since the data flow through an io.Pipe the closing didn't flowed through and left the writer open hanging.

Co-authored-by: Antonio Navarro Perez <[email protected]>
guseggert pushed a commit that referenced this issue Jun 8, 2022
Fixes #8957

The context was only checked while reading data.
Not while writing data to the http connection.
So since the data flow through an io.Pipe the closing didn't flowed through and left the writer open hanging.

Co-authored-by: Antonio Navarro Perez <[email protected]>
(cherry picked from commit 7892cc9)
guseggert pushed a commit that referenced this issue Jun 8, 2022
Fixes #8957

The context was only checked while reading data.
Not while writing data to the http connection.
So since the data flow through an io.Pipe the closing didn't flowed through and left the writer open hanging.

Co-authored-by: Antonio Navarro Perez <[email protected]>
(cherry picked from commit 7892cc9)
guseggert pushed a commit that referenced this issue Jun 8, 2022
Fixes #8957

The context was only checked while reading data.
Not while writing data to the http connection.
So since the data flow through an io.Pipe the closing didn't flowed through and left the writer open hanging.

Co-authored-by: Antonio Navarro Perez <[email protected]>
(cherry picked from commit 7892cc9)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

6 participants