Occasional stream hang #3651

Closed
whyrusleeping opened this issue Feb 3, 2017 · 20 comments · Fixed by #4069
Labels
kind/bug, topic/bitswap

Comments

@whyrusleeping
Member

I've been observing that every so often, under a certain workflow, I get a stuck stream. The workflow I've been following when this occurs is as follows:

  • mars is running and hosts the gexpin service.
  • My laptop's ipfs node adds content (gx packages), and then I pin those packages on mars through the gexpin site.
  • The process works most of the time, but occasionally hangs.

Things I've noticed:

  • Mars' wantlist contains the hash I asked it to pin, but my node's view of mars' wantlist does not contain it (ipfs bitswap wantlist -p QmaCpd... shows nothing). This implies that mars is hanging while sending me its wantlist, not that my node is hanging while sending the blocks, which further implies that something is wrong with the wantmanager stream we use to send wantlist updates.
  • Mars OOM-crashes and reboots roughly once a day (bootstrap node + only 1GB RAM == pain).
  • I think this issue is correlated with my laptop either losing its wifi connection or going to sleep, though I have also slept my laptop, opened it back up, and had things work just fine.
  • I first noticed the issue with yamux, and tried out multiplex to see if it would fix things, but I still managed to get a repro.
    • The more interesting thing here is that the failure behaviour was slightly different under yamux vs. multiplex.

yamux vs multiplex:

Under yamux, when this bug occurred, I could still ping mars from my node (ping opens a new stream, so this tells us that just one stream is hung, not the whole multiplexer routine). Under multiplex this was not possible: no new stream could be opened and ipfs ping MARS just hung, meaning we couldn't open a new stream at all. That implies our multiplex writer routine was blocked somewhere, and that somewhere on mars' side a multiplex stream was full of data with nothing reading it.

My next steps will be to dig into the wantmanager wantlist stream sender code to make sure we aren't accidentally dropping a stream without closing it, and to look for any other place where such a thing might happen.

@whyrusleeping added the topic/bitswap and kind/bug labels on Feb 3, 2017
@matthewrobertbell

An aside: Why is the bootstrap node only a 1GB server? Is it a budget limitation?

@whyrusleeping
Member Author

Most of our bootstrap nodes have more, but mars was the first one we made, and I was being very cost-conscious at the time. I've been keeping it at 1GB because it puts pressure on me to get the memory usage down (and it's also a great testbed for memory improvements).

@whyrusleeping
Member Author

More observations from a recent repro:
ipfs swarm peers -v on mars shows two bitswap streams to my laptop, but ipfs swarm peers -v on my laptop shows only one bitswap stream open to mars. This is weird. It suggests that the stream was closed on my laptop, but mars didn't get the memo.

@whyrusleeping
Member Author

ipfs swarm peers -v gets its information about streams from go-peerstream. It could be that the stream was somehow removed from go-peerstream on my laptop, but the close didn't happen for some reason.

@whyrusleeping
Member Author

AH! I THINK I'VE GOT IT!

If a bitswap send fails because of a timeout (here: https://github.com/ipfs/go-ipfs/blob/master/exchange/bitswap/network/ipfs_impl.go#L63), it returns the error, SKIPPING the deadline reset. The caller then sees that the write errored and closes the stream. Everything's good, right? Wrong.
Closing a stream requires sending a message to the other side, and if sending that message also times out, the other side never receives the stream-closed message (the Close call just errors out). If the connection resumes after this, we think the stream is closed (it isn't, really) and the other side will send messages over it that we never receive.
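To make the failure mode concrete, here is a minimal sketch of the pattern described above. This is not the actual go-ipfs code; the interface and names are made up for illustration.

```go
package sketch

import (
	"io"
	"log"
	"time"
)

// deadlineStream is a hypothetical stand-in for a libp2p stream with
// write-deadline support; the real interface differs.
type deadlineStream interface {
	io.Writer
	io.Closer
	SetWriteDeadline(t time.Time) error
}

// sendMessage mirrors the pattern described above: the write deadline is
// only cleared on the success path, so a timed-out write leaves it set.
func sendMessage(s deadlineStream, payload []byte) error {
	_ = s.SetWriteDeadline(time.Now().Add(10 * time.Second))
	if _, err := s.Write(payload); err != nil {
		// Early return on timeout: the deadline reset below is skipped.
		return err
	}
	return s.SetWriteDeadline(time.Time{})
}

// sendAndCloseOnError shows the second half of the problem: Close itself
// must send a frame to the remote peer, and if that send also times out,
// the remote never learns the stream is gone and keeps writing to it.
func sendAndCloseOnError(s deadlineStream, payload []byte) {
	if err := sendMessage(s, payload); err != nil {
		if cerr := s.Close(); cerr != nil {
			log.Printf("stream close errored: %v", cerr)
		}
	}
}
```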

Now, how the hell do I test this?

@whyrusleeping
Member Author

Beginning to doubt my previous confidence. If the write and the close both error out, we would just close the stream and open a new one, and the stream we open to the other peer doesn't get written on by the remote anyway.

For this to make sense, the stream would have to hit the 'close failure' case on the accepting side (my laptop). So now the question is: what about that situation could cause this loop to fail? https://github.com/ipfs/go-ipfs/blob/master/exchange/bitswap/network/ipfs_impl.go#L179

@whyrusleeping
Member Author

Some good news: my theory that this problem is caused by a 'Close' message failing to send is supported by this bit of the yamux codebase: https://github.com/hashicorp/yamux/blob/master/session.go#L469

If our close errored, the stream would still have been cleared from our local stream set, meaning the code linked above would just ignore any messages received for that stream. The other side has no idea we're tossing its messages away, so it keeps sending them.
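For reference, here is a heavily simplified sketch of the dispatch behaviour that code implements (illustrative only; the types below are made up and much smaller than the real yamux session):

```go
package sketch

import (
	"log"
	"sync"
)

// stream and session are tiny stand-ins for yamux's types.
type stream struct{ in chan []byte }

func (st *stream) deliver(frame []byte) { st.in <- frame }

type session struct {
	mu      sync.Mutex
	streams map[uint32]*stream
}

// handleStreamData mirrors the linked logic: data frames addressed to a
// stream ID that is no longer in the local stream set are silently
// dropped, while the remote side keeps sending because it never heard
// that the stream went away.
func (s *session) handleStreamData(id uint32, frame []byte) {
	s.mu.Lock()
	st, ok := s.streams[id]
	s.mu.Unlock()

	if !ok {
		log.Printf("yamux: discarding data for stream: %d", id)
		return
	}
	st.deliver(frame)
}
```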

@whyrusleeping
Member Author

I've added code that closes the whole connection if a stream close fails, and I was still able to reproduce the issue with that code running. Nothing makes sense anymore. (I'm rerunning with that code to verify that I was actually running the right code.)

@whyrusleeping
Member Author

Encountered the issue again, this time between my desktop and mars. The interesting bit is that I had just restarted the node on my desktop. Perhaps this has to do with reconnecting? Maybe what's happening is that both sides are trying to connect to each other and that's doing weird things with the Connect/Disconnect notifications... Will need to find time to investigate further.

@whyrusleeping
Member Author

Useful debugging tool to have: connection lifetimes
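Something along these lines would do as a rough standalone sketch (names are made up, and wiring it into the host's connect/disconnect notifications is left out):

```go
package sketch

import (
	"log"
	"sync"
	"time"
)

// connTracker records when a connection to each peer was opened and logs
// how long it lived when it goes away.
type connTracker struct {
	mu     sync.Mutex
	opened map[string]time.Time // peer ID -> time the connection opened
}

func newConnTracker() *connTracker {
	return &connTracker{opened: make(map[string]time.Time)}
}

func (t *connTracker) Connected(peer string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.opened[peer] = time.Now()
}

func (t *connTracker) Disconnected(peer string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if start, ok := t.opened[peer]; ok {
		log.Printf("connection to %s lived for %s", peer, time.Since(start))
		delete(t.opened, peer)
	}
}
```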

@whyrusleeping
Member Author

One more update. I noticed the hang again after:

  • starting my node
  • having mars pin a thing from it
  • killing my node
  • starting my node up again a short time later
  • trying to pin something on mars from my node

@jbenet
Member

jbenet commented Feb 23, 2017

Would be amazing to get this figured out. Can anyone in our community help with this?

@whyrusleeping
Member Author

yamux logs from a recent reproduction of the issue:

2017/03/01 15:30:38 [ERR] yamux: keepalive failed: session shutdown
2017/03/01 15:33:10 [ERR] yamux: established stream without inflight SYN (didn't have semaphore)
2017/03/01 15:33:10 [ERR] yamux: established stream without inflight SYN (didn't have semaphore)
2017/03/01 15:34:48 [ERR] yamux: keepalive failed: i/o deadline reached
2017/03/01 15:34:48 Stream close errored: session shutdown
2017/03/01 15:34:53 [ERR] yamux: Failed to read stream data: MAC verification failed
2017/03/01 15:35:10 [ERR] yamux: keepalive failed: i/o deadline reached
2017/03/01 15:35:10 Stream close errored: session shutdown
2017/03/01 15:36:20 [ERR] yamux: keepalive failed: i/o deadline reached
2017/03/01 15:36:20 Stream close errored: session shutdown
2017/03/01 15:37:41 [WARN] yamux: Discarding data for stream: 37

Not sure which of these are actually related, but I figured I'd leave them here as a reference.

@whyrusleeping
Member Author

Encountered this again with multiplex. I was still able to ping mars during the bug (this goes against what I was saying earlier, when pinging mars hung while the bug was in effect under multiplex).

@whyrusleeping
Member Author

Wow, this issue crops up a lot on shitty Comcast internet. There's definitely a correlation with TCP connection timeouts here.

@whyrusleeping
Member Author

@Stebalien this is the issue I was talking about earlier

@Kubuxu
Member

Kubuxu commented Jun 21, 2017

RR (http://rr-project.org/) might be a useful tool for debugging this, but 1) the bug would have to happen in a single-threaded context (RR limits threading) and 2) we would have to capture it.

If we could capture it under RR, that would be great, as we could analyse the cause much more easily.

Do we have any reproduction scenarios with some chance of triggering it?

@whyrusleeping
Member Author

Today is a good day.

@Stebalien ran into this bug earlier today, and we decided to sit down and debug it. Four hours later, we know what causes the bug, have one patch that fixes this particular instance of it, and know what to do to prevent other protocols from making the same mistake.

The reproduction goes as follows:

  • Peer A connects to peer B
  • Peer B opens a new bitswap message queue for peer A. This opens a stream back to peer A and, since the two have never talked before, negotiates the protocol to use and populates the 'remembered protocol' cache.
  • Peer A disconnects from B
  • Peer A reconnects to B
  • Peer B again opens a bitswap message queue; however, this time around we remember the protocol and create a 'lazy' connection (to save ourselves a round trip or two).
  • Now, since peer B does not have anything in its wantlist, it doesn't send a message. Because no message is ever sent, the stream never gets negotiated, and this timeout trips: https://github.com/libp2p/go-libp2p/blob/master/p2p/host/basic/basic_host.go#L205

At this point, the bug has been reproduced. A very simple workaround is to not open a bitswap stream in the wantmanager unless we actually have a message to send (see the sketch below). However, what's not known is why the other side doesn't recover from this. When the negotiation timeout happens, it closes the stream on its side, at which point the other side should fail to write on the next try and reopen a new stream that successfully negotiates in time.
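A rough sketch of that workaround, with hypothetical names standing in for the real wantmanager/network types:

```go
package sketch

import "context"

// streamOpener and bitswapStream are made-up stand-ins for the real
// network interfaces; only the lazy-open pattern matters here.
type streamOpener interface {
	NewStream(ctx context.Context) (bitswapStream, error)
}

type bitswapStream interface {
	WriteMsg(msg []byte) error
	Close() error
}

// messageQueue only opens its outgoing bitswap stream at the moment it
// actually has a message to send, so a freshly (re)connected peer never
// sits on an idle, unnegotiated stream.
type messageQueue struct {
	network streamOpener
	s       bitswapStream // nil until the first real message
}

func (mq *messageQueue) send(ctx context.Context, msg []byte) error {
	if mq.s == nil {
		s, err := mq.network.NewStream(ctx)
		if err != nil {
			return err
		}
		mq.s = s
	}
	return mq.s.WriteMsg(msg)
}
```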

@whyrusleeping
Member Author

However, what's not known is why the other side doesn't recover from this.

Figured it out. When you close one side of a yamux stream, the other side can continue to write without erroring. I am firmly of the opinion that this is a bug.
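A minimal harness to poke at this, using the public hashicorp/yamux API over an in-memory pipe. Whether the second write errors depends on the yamux version; on the version discussed here it was observed to succeed silently.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

func main() {
	a, b := net.Pipe()

	serverClosed := make(chan struct{})
	go func() {
		sess, err := yamux.Server(b, nil)
		if err != nil {
			log.Fatal(err)
		}
		st, err := sess.AcceptStream()
		if err != nil {
			log.Fatal(err)
		}
		buf := make([]byte, 64)
		if _, err := st.Read(buf); err != nil {
			log.Fatal(err)
		}
		// Close this side of the stream after the first message.
		if err := st.Close(); err != nil {
			log.Fatal(err)
		}
		close(serverClosed)
	}()

	sess, err := yamux.Client(a, nil)
	if err != nil {
		log.Fatal(err)
	}
	st, err := sess.OpenStream()
	if err != nil {
		log.Fatal(err)
	}
	if _, err := st.Write([]byte("hello")); err != nil {
		log.Fatal(err)
	}

	<-serverClosed
	time.Sleep(100 * time.Millisecond) // give the close frame time to arrive

	// The write below is the behaviour in question: a nil error means the
	// writer has no idea the other side closed the stream.
	_, err = st.Write([]byte("written after remote close"))
	fmt.Println("second write error:", err)
}
```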

@whyrusleeping
Member Author

Okay, so moving forward here: the plan is to add a Reset method to our stream interfaces, explicitly for use when one side wants to tear the stream down completely (example: a handshake timeout, https://github.com/libp2p/go-libp2p/blob/master/p2p/host/basic/basic_host.go#L205).

Shorter term, we will fix the bitswap issue by not opening a stream until we actually have data to write.
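A sketch of roughly what that interface addition could look like (the final libp2p interface may well differ):

```go
package sketch

import "io"

// Stream is an illustrative version of the muxed stream interface with
// the proposed Reset method added.
type Stream interface {
	io.ReadWriteCloser

	// Reset tears the stream down completely in both directions and tells
	// the remote side to do the same, for cases like a failed protocol
	// negotiation where a polite half-close is not enough.
	Reset() error
}
```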
