stalled writer goroutines #51

Closed · raulk opened this issue May 7, 2019 · 3 comments · Fixed by #52
@raulk (Member) commented May 7, 2019

Grepping through goroutine traces of our relay infrastructure, I found stalled goroutines:

goroutine 2826624695 [select, 3 minutes]:
github.com/libp2p/go-mplex.(*Multiplex).sendMsg(0xc1f430fc00, 0xf77f00, 0xc0000cc010, 0x18350, 0xc3aefa9990, 0x5, 0x20, 0x0, 0x0)
	/home/ubuntu/go/pkg/mod/github.com/vyzo/[email protected]/multiplex.go:143 +0xf5
github.com/libp2p/go-mplex.(*Multiplex).NewNamedStream(0xc1f430fc00, 0xc044180008, 0x5, 0xc3aefa9a20, 0xe8ec18, 0xc1720a67a0)
	/home/ubuntu/go/pkg/mod/github.com/vyzo/[email protected]/multiplex.go:217 +0x1a0
github.com/libp2p/go-mplex.(*Multiplex).NewStream(...)
[...]

When we write control messages, we transfer the deadline from the context to the connection, but the sendMsg method is always called with the background context, which never carries a deadline. So if one write stalls, it'll stall forever and it'll also block any subsequent sending of control messages on that multiplexed connection.

See: https://github.com/libp2p/go-mplex/blob/master/multiplex.go#L152
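
For illustration, here is a minimal sketch of the deadline-transfer pattern described above. This is not the actual mplex source; the type is reduced to just the net.Conn it writes to, and the field and parameter names are assumptions:

    package mplex

    import (
    	"context"
    	"net"
    )

    // Multiplex is reduced here to the underlying connection; sketch only.
    type Multiplex struct {
    	con net.Conn
    }

    func (mp *Multiplex) sendMsg(ctx context.Context, header, data []byte) error {
    	// Transfer the context's deadline (if any) onto the connection,
    	// so the writes below cannot outlive the caller's deadline.
    	if deadline, ok := ctx.Deadline(); ok {
    		if err := mp.con.SetWriteDeadline(deadline); err != nil {
    			return err
    		}
    	}
    	// context.Background() carries no deadline, so in that case the
    	// writes are unbounded and can block forever if the peer stalls.
    	if _, err := mp.con.Write(header); err != nil {
    		return err
    	}
    	_, err := mp.con.Write(data)
    	return err
    }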

@vyzo (Contributor) commented May 7, 2019

We do it on resets and on the header too; a grep turns up 3 occurrences of sending messages with context.Background():

    217:	err := mp.sendMsg(context.Background(), header, []byte(name))
    373:				go mp.sendMsg(context.Background(), ch.header(resetTag), nil)
    385:				go mp.sendMsg(context.Background(), msch.id.header(resetTag), nil)
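
An obvious mitigation at these call sites (a sketch only, not necessarily what #52 does; the 5-second timeout is an arbitrary illustrative value) is to pass a deadline-carrying context instead of context.Background():

    // Sketch: bound the control-message send with a timeout. The 5s
    // value is an arbitrary assumption for illustration.
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    if err := mp.sendMsg(ctx, ch.header(resetTag), nil); err != nil {
    	log.Warningf("failed to send reset: %s", err)
    }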

@vyzo (Contributor) commented May 8, 2019

I think the main culprit is the reset message we send when data arrives on a closed stream:

			if remoteClosed {
				// closed stream, return b
				pool.Put(b)

				log.Warningf("Received data from remote after stream was closed by them. (len = %d)", len(b))
				go mp.sendMsg(context.Background(), msch.id.header(resetTag), nil)
				continue
			}

This happens so often that we had to downgrade this error log in #46, because it was filling the disk with logs on the mplex relay.

@vyzo (Contributor) commented May 8, 2019

The other place where we send timeout-less messages is when we do an actual Close or Reset ourselves.
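
Sketched below is one shape that could take for our own Reset, assuming a package-level ResetStreamTimeout constant; the constant, s.mp, and s.id are invented names for illustration, not the actual mplex API:

    // Hypothetical sketch; ResetStreamTimeout, s.mp, and s.id are
    // assumed names. Bound our own reset send by a fixed timeout
    // instead of relying on context.Background().
    const ResetStreamTimeout = 1 * time.Minute

    func (s *Stream) Reset() error {
    	ctx, cancel := context.WithTimeout(context.Background(), ResetStreamTimeout)
    	defer cancel()
    	return s.mp.sendMsg(ctx, s.id.header(resetTag), nil)
    }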

@vyzo closed this as completed in #52 on May 10, 2019
@ghost removed the status/in-progress label on May 10, 2019