Occasional stream hang #3651

Closed
whyrusleeping opened this issue Feb 3, 2017 · 20 comments · Fixed by #4069
Labels
kind/bug, topic/bitswap

Comments

@whyrusleeping
Member

I've been observing that every so often, under a certain workflow, I get a stuck stream. The workflow I've been following when this occurs is as follows:

  • mars is running and hosts the gexpin service.
  • My laptop's ipfs node adds content (gx packages), and then I pin those packages on mars through the gexpin site.
  • The process works most of the time, but occasionally hangs.

Things I've noticed:

  • Mars' wantlist contains the hash I asked it to pin, but my node's view of mars' wantlist does not contain it (ipfs bitswap wantlist -p QmaCpd... shows nothing). This implies that mars is hanging while sending me its wantlist, not that my node is hanging while sending the blocks, which further implies that something is wrong with the wantmanager stream we use to send wantlist updates.
  • Mars OOM-crashes and reboots roughly once a day (bootstrap node + only 1GB RAM == pain).
  • I think this issue is correlated with my laptop either losing its wifi connection or going to sleep, though I have also slept my laptop, opened it back up, and had things work just fine.
  • I first noticed the issue with yamux, and tried out multiplex to see if it would fix things, but I still managed to get a repro.
    • The more interesting thing here is that the failure behaviour was slightly different under yamux vs. multiplex.

yamux vs multiplex:

Under yamux, when this bug occurred, I could still ping mars from my node (ping opens a new stream, so this tells us that just one stream is hung, not the whole multiplexer routine). Under multiplex this was not possible: no new stream could be opened and ipfs ping MARS just hung, meaning we couldn't open a new stream at all. That implies our multiplex writer routine was blocked somewhere, and that somewhere on mars' side a multiplex stream was full of data with nothing reading it.

My next steps will be to dig into the wantmanager wantlist stream sender code to make sure we aren't accidentally dropping a stream without closing it, and to look for any other place where such a thing might happen.

@whyrusleeping added the topic/bitswap and kind/bug labels on Feb 3, 2017
@matthewrobertbell

An aside: Why is the bootstrap node only a 1GB server? Is it a budget limitation?

@whyrusleeping
Member Author

Most of our bootstrap nodes have more, but mars was the first one we made, and I was being very cost-conscious at the time. I've been keeping it at 1GB because it puts pressure on me to get the memory usage down (and it's also a great testbed for memory improvements).

@whyrusleeping
Member Author

More observations from a recent repro:
ipfs swarm peers -v on mars shows two bitswap streams to my laptop, but ipfs swarm peers -v on my laptop shows only one bitswap stream open to mars. This is weird. It suggests that the stream was closed on my laptop, but mars didn't get the memo.

@whyrusleeping
Member Author

ipfs swarm peers -v gets its information about streams from go-peerstream. It could be that the stream was somehow removed from go-peerstream on my laptop, but the close didn't happen for some reason.

@whyrusleeping
Member Author

AH! I THINK I'VE GOT IT!

If a bitswap send fails because of a timeout (here: https://github.com/ipfs/go-ipfs/blob/master/exchange/bitswap/network/ipfs_impl.go#L63), it returns the error, SKIPPING the deadline reset. The caller then sees that the write errored and closes the stream. Everything's good, right? Wrong.
Closing a stream requires sending a message to the other side, and if sending that message also times out, the other side never receives the stream-closed message (the Close call just errors out). If the connection resumes after this, we think the stream is closed (it isn't, really) and the other side will send messages over it that we never receive.
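To make the failure mode concrete, here is a minimal sketch of the pattern described above. This is not the actual go-ipfs code; the interface and names are made up for illustration.

```go
package sketch

import (
	"io"
	"log"
	"time"
)

// deadlineStream is a hypothetical stand-in for a libp2p stream with
// write-deadline support; the real interface differs.
type deadlineStream interface {
	io.Writer
	io.Closer
	SetWriteDeadline(t time.Time) error
}

// sendMessage mirrors the pattern described above: the write deadline is
// only cleared on the success path, so a timed-out write leaves it set.
func sendMessage(s deadlineStream, payload []byte) error {
	_ = s.SetWriteDeadline(time.Now().Add(10 * time.Second))
	if _, err := s.Write(payload); err != nil {
		// Early return on timeout: the deadline reset below is skipped.
		return err
	}
	return s.SetWriteDeadline(time.Time{})
}

// sendAndCloseOnError shows the second half of the problem: Close itself
// must send a frame to the remote peer, and if that send also times out,
// the remote never learns the stream is gone and keeps writing to it.
func sendAndCloseOnError(s deadlineStream, payload []byte) {
	if err := sendMessage(s, payload); err != nil {
		if cerr := s.Close(); cerr != nil {
			log.Printf("stream close errored: %v", cerr)
		}
	}
}
```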

Now, how the hell do I test this?

@whyrusleeping
Member Author

Beginning to doubt my previous confidence. If the write and the close both error out, we would just close the stream and open a new one, and the stream we open to the other peer doesn't get written on by the remote anyway.

For this to make sense, the stream would have to hit the 'close failure' case on the accepting side (my laptop). So now the question is: what about that situation could cause this loop to fail? https://github.com/ipfs/go-ipfs/blob/master/exchange/bitswap/network/ipfs_impl.go#L179

@whyrusleeping
Member Author

Some good news: my theory that this problem is caused by a 'Close' message failing to send is supported by this bit of the yamux codebase: https://github.com/hashicorp/yamux/blob/master/session.go#L469

If our close errored, the stream would still have been cleared from our local stream set, meaning the code linked above would just ignore any messages received for that stream. The other side has no idea we're tossing its messages away, so it keeps sending them.
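For reference, here is a heavily simplified sketch of the dispatch behaviour that code implements (illustrative only; the types below are made up and much smaller than the real yamux session):

```go
package sketch

import (
	"log"
	"sync"
)

// stream and session are tiny stand-ins for yamux's types.
type stream struct{ in chan []byte }

func (st *stream) deliver(frame []byte) { st.in <- frame }

type session struct {
	mu      sync.Mutex
	streams map[uint32]*stream
}

// handleStreamData mirrors the linked logic: data frames addressed to a
// stream ID that is no longer in the local stream set are silently
// dropped, while the remote side keeps sending because it never heard
// that the stream went away.
func (s *session) handleStreamData(id uint32, frame []byte) {
	s.mu.Lock()
	st, ok := s.streams[id]
	s.mu.Unlock()

	if !ok {
		log.Printf("yamux: discarding data for stream: %d", id)
		return
	}
	st.deliver(frame)
}
```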

@whyrusleeping
Member Author

I've added code that closes the whole connection if a stream close fails, and I was still able to reproduce the issue with that code running. Nothing makes sense anymore. (I'm rerunning with that code to verify that I was actually running the right code.)

@whyrusleeping
Member Author

Encountered the issue again, this time between my desktop and mars. The interesting bit is that I had just restarted the node on my desktop. Perhaps this has to do with reconnecting? Maybe what's happening is that both sides are trying to connect to each other and that's doing weird things with the Connect/Disconnect notifications... Will need to find time to investigate further.

@whyrusleeping
Member Author

Useful debugging tool to have: connection lifetimes
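Something along these lines would do as a rough standalone sketch (names are made up, and wiring it into the host's connect/disconnect notifications is left out):

```go
package sketch

import (
	"log"
	"sync"
	"time"
)

// connTracker records when a connection to each peer was opened and logs
// how long it lived when it goes away.
type connTracker struct {
	mu     sync.Mutex
	opened map[string]time.Time // peer ID -> time the connection opened
}

func newConnTracker() *connTracker {
	return &connTracker{opened: make(map[string]time.Time)}
}

func (t *connTracker) Connected(peer string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.opened[peer] = time.Now()
}

func (t *connTracker) Disconnected(peer string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if start, ok := t.opened[peer]; ok {
		log.Printf("connection to %s lived for %s", peer, time.Since(start))
		delete(t.opened, peer)
	}
}
```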

@whyrusleeping
Member Author

One more update. I noticed the hang again after:

  • starting my node
  • having mars pin a thing from it
  • killing my node
  • starting my node up again a short time later
  • trying to pin something on mars from my node

@jbenet
Member

jbenet commented Feb 23, 2017

Would be amazing to get this figured out. Can anyone in our community help with this?

@whyrusleeping
Member Author

yamux logs from a recent reproduction of the issue:

2017/03/01 15:30:38 [ERR] yamux: keepalive failed: session shutdown
2017/03/01 15:33:10 [ERR] yamux: established stream without inflight SYN (didn't have semaphore)
2017/03/01 15:33:10 [ERR] yamux: established stream without inflight SYN (didn't have semaphore)
2017/03/01 15:34:48 [ERR] yamux: keepalive failed: i/o deadline reached
2017/03/01 15:34:48 Stream close errored: session shutdown
2017/03/01 15:34:53 [ERR] yamux: Failed to read stream data: MAC verification failed
2017/03/01 15:35:10 [ERR] yamux: keepalive failed: i/o deadline reached
2017/03/01 15:35:10 Stream close errored: session shutdown
2017/03/01 15:36:20 [ERR] yamux: keepalive failed: i/o deadline reached
2017/03/01 15:36:20 Stream close errored: session shutdown
2017/03/01 15:37:41 [WARN] yamux: Discarding data for stream: 37

Not sure which of these are actually related, but I figured I'd leave them here as a reference.

@whyrusleeping
Member Author

Encountered this again with multiplex. I was still able to ping mars during the bug (this goes against what I was saying earlier, when pinging mars hung while the bug was in effect under multiplex).

@whyrusleeping
Member Author

Wow, this issue crops up a lot on shitty Comcast internet. There's definitely a correlation with TCP connection timeouts here.

@whyrusleeping
Member Author

@Stebalien this is the issue I was talking about earlier

@Kubuxu
Member

Kubuxu commented Jun 21, 2017

RR (http://rr-project.org/) might be a useful tool for debugging this, but 1) the bug would have to happen in a single-threaded context (RR limits threading) and 2) we would have to capture it.

If we could capture it under RR, that would be great, as we could analyse the cause much more easily.

Do we have any reproduction scenarios with some chance of triggering it?

@whyrusleeping
Member Author

Today is a good day.

@Stebalien ran into this bug earlier today, and we decided to sit down and debug it. Four hours later, we know what causes the bug, have one patch that fixes this particular instance of it, and know what to do to prevent other protocols from making the same mistake.

The reproduction goes as follows:

  • Peer A connects to peer B
  • Peer B opens a new bitswap message queue for peer A. This opens a stream back to peer A and, since the two have never talked before, negotiates the protocol to use and populates the 'remembered protocol' cache.
  • Peer A disconnects from B
  • Peer A reconnects to B
  • Peer B again opens a bitswap message queue; however, this time around we remember the protocol and create a 'lazy' connection (to save ourselves a round trip or two).
  • Now, since peer B does not have anything in its wantlist, it doesn't send a message. Because no message is ever sent, the stream never gets negotiated, and this timeout trips: https://github.com/libp2p/go-libp2p/blob/master/p2p/host/basic/basic_host.go#L205

At this point, the bug has been reproduced. A very simple workaround is to not open a bitswap stream in the wantmanager unless we actually have a message to send (see the sketch below). However, what's not known is why the other side doesn't recover from this. When the negotiation timeout happens, it closes the stream on its side, at which point the other side should fail to write on the next try and reopen a new stream that successfully negotiates in time.
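A rough sketch of that workaround, with hypothetical names standing in for the real wantmanager/network types:

```go
package sketch

import "context"

// streamOpener and bitswapStream are made-up stand-ins for the real
// network interfaces; only the lazy-open pattern matters here.
type streamOpener interface {
	NewStream(ctx context.Context) (bitswapStream, error)
}

type bitswapStream interface {
	WriteMsg(msg []byte) error
	Close() error
}

// messageQueue only opens its outgoing bitswap stream at the moment it
// actually has a message to send, so a freshly (re)connected peer never
// sits on an idle, unnegotiated stream.
type messageQueue struct {
	network streamOpener
	s       bitswapStream // nil until the first real message
}

func (mq *messageQueue) send(ctx context.Context, msg []byte) error {
	if mq.s == nil {
		s, err := mq.network.NewStream(ctx)
		if err != nil {
			return err
		}
		mq.s = s
	}
	return mq.s.WriteMsg(msg)
}
```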

@whyrusleeping
Member Author

However, what's not known is why the other side doesn't recover from this.

Figured it out. When you close one side of a yamux stream, the other side can continue to write without erroring. I am firmly of the opinion that this is a bug.
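A minimal harness to poke at this, using the public hashicorp/yamux API over an in-memory pipe. Whether the second write errors depends on the yamux version; on the version discussed here it was observed to succeed silently.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

func main() {
	a, b := net.Pipe()

	serverClosed := make(chan struct{})
	go func() {
		sess, err := yamux.Server(b, nil)
		if err != nil {
			log.Fatal(err)
		}
		st, err := sess.AcceptStream()
		if err != nil {
			log.Fatal(err)
		}
		buf := make([]byte, 64)
		if _, err := st.Read(buf); err != nil {
			log.Fatal(err)
		}
		// Close this side of the stream after the first message.
		if err := st.Close(); err != nil {
			log.Fatal(err)
		}
		close(serverClosed)
	}()

	sess, err := yamux.Client(a, nil)
	if err != nil {
		log.Fatal(err)
	}
	st, err := sess.OpenStream()
	if err != nil {
		log.Fatal(err)
	}
	if _, err := st.Write([]byte("hello")); err != nil {
		log.Fatal(err)
	}

	<-serverClosed
	time.Sleep(100 * time.Millisecond) // give the close frame time to arrive

	// The write below is the behaviour in question: a nil error means the
	// writer has no idea the other side closed the stream.
	_, err = st.Write([]byte("written after remote close"))
	fmt.Println("second write error:", err)
}
```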

@whyrusleeping
Member Author

Okay, so moving forward here: the plan is to add a Reset method to our stream interfaces, explicitly for use when one side wants to tear the stream down completely (example: a handshake timeout, https://github.com/libp2p/go-libp2p/blob/master/p2p/host/basic/basic_host.go#L205).

Shorter term, we will fix the bitswap issue by not opening a stream until we actually have data to write.
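A sketch of roughly what that interface addition could look like (the final libp2p interface may well differ):

```go
package sketch

import "io"

// Stream is an illustrative version of the muxed stream interface with
// the proposed Reset method added.
type Stream interface {
	io.ReadWriteCloser

	// Reset tears the stream down completely in both directions and tells
	// the remote side to do the same, for cases like a failed protocol
	// negotiation where a polite half-close is not enough.
	Reset() error
}
```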
