Occasional stream hang #3651
An aside: Why is the bootstrap node only a 1GB server? Is it a budget limitation?
Most of our bootstrap nodes have more, but mars was the first one we made, and I was being very cost-conscious at the time. I've been keeping it at 1GB because it puts pressure on me to get the memory usage down (and it is also a great testbed for testing out memory improvements).
More observations from a recent repro:
AH! I THINK I'VE GOT IT! If a bitswap send fails because of a timeout (here: https://github.com/ipfs/go-ipfs/blob/master/exchange/bitswap/network/ipfs_impl.go#L63), it returns the error, SKIPPING the timeout reset. The caller then sees that the write errored and closes the stream. Everything's good, right? Wrong. Now, how the hell do I test this?
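To make the suspected control flow concrete, here is a minimal Go sketch (a paraphrase, not the actual go-ipfs code; `msgToStream`, the timeout value, and the `net.Conn` receiver are illustrative):

```go
package bitswapsketch

import (
	"net"
	"time"
)

// msgToStream paraphrases the suspected control flow: on a write error we
// return immediately, skipping the deadline reset, so the stream is left
// with an already-expired write deadline.
func msgToStream(s net.Conn, msg []byte, timeout time.Duration) error {
	if err := s.SetWriteDeadline(time.Now().Add(timeout)); err != nil {
		return err
	}
	if _, err := s.Write(msg); err != nil {
		// Early return: the SetWriteDeadline(time.Time{}) below never runs.
		// The theory above is that later operations on this stream inherit
		// the stale deadline, so the subsequent Close also fails.
		return err
	}
	// Only reached on success: clear the deadline again.
	return s.SetWriteDeadline(time.Time{})
}
```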
Beginning to doubt my previous confidence. If the write and the close both error out, we would just close the stream and open a new one. The stream we open to the other peer does not get written on by the remote. For this to make sense, the stream would have to experience the 'close failure' thing on the acceptor side (my laptop). So now the question is: what about that situation could cause this loop to fail? https://github.com/ipfs/go-ipfs/blob/master/exchange/bitswap/network/ipfs_impl.go#L179
Some good news: my theory that this problem is caused by a 'Close' message failing to send is supported by this bit of the yamux codebase: https://github.com/hashicorp/yamux/blob/master/session.go#L469. If our close errored, it would clear the stream from our local stream set, meaning the code above would just ignore any messages received for that stream. The other side has no idea we're just tossing messages away, so it keeps sending them.
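To illustrate the pattern being described (a paraphrase of the linked yamux behaviour, with made-up minimal types, not the actual yamux source):

```go
package muxsketch

import "sync"

// Minimal stand-in types; only the lookup logic below matters.
type stream struct{ recv chan []byte }

func (st *stream) deliver(p []byte) { st.recv <- p }

type session struct {
	mu      sync.Mutex
	streams map[uint32]*stream
}

// handleFrame paraphrases the behaviour described above: once a stream ID has
// been removed from the local stream map (e.g. after a failed Close), incoming
// frames for that ID are silently discarded, and the remote peer is never
// told, so it keeps sending into the void.
func (s *session) handleFrame(id uint32, payload []byte) {
	s.mu.Lock()
	st, ok := s.streams[id]
	s.mu.Unlock()
	if !ok {
		return // data dropped on the floor
	}
	st.deliver(payload)
}
```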
I've added code that closes the connection if a stream close fails, and I was able to reproduce the issue with that code running. Nothing makes sense anymore. (I'm rerunning with that code to verify that I was actually running the right code.)
Encountered the issue again, between my desktop and mars. The interesting bit this time is that I had just restarted the node on my desktop. Perhaps this has to do with reconnecting? Maybe what's happening is that both sides are trying to connect to each other and it's doing weird things with the Connect/Disconnect notifications... Will need to find time to investigate further.
Useful debugging tool to have: connection lifetimes.
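A rough sketch of what such a tool could look like, using go-libp2p's `NotifyBundle` (note these import paths postdate this issue, and `TrackConnLifetimes`/`connTracker` are made-up names):

```go
package connlifesketch

import (
	"log"
	"sync"
	"time"

	"github.com/libp2p/go-libp2p/core/network"
)

// connTracker records when each connection opened so its lifetime can be
// logged on close. It is a hypothetical debugging helper.
type connTracker struct {
	mu     sync.Mutex
	opened map[network.Conn]time.Time
}

// TrackConnLifetimes registers notifiee callbacks that record when each
// connection opens and log its lifetime when it closes.
func TrackConnLifetimes(n network.Network) {
	t := &connTracker{opened: make(map[network.Conn]time.Time)}
	n.Notify(&network.NotifyBundle{
		ConnectedF: func(_ network.Network, c network.Conn) {
			t.mu.Lock()
			t.opened[c] = time.Now()
			t.mu.Unlock()
		},
		DisconnectedF: func(_ network.Network, c network.Conn) {
			t.mu.Lock()
			start, ok := t.opened[c]
			delete(t.opened, c)
			t.mu.Unlock()
			if ok {
				log.Printf("conn to %s lived %s", c.RemotePeer(), time.Since(start))
			}
		},
	})
}
```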
One more update, noticed again after:
Would be amazing to get this figured out. Can anyone in our community help with this?
yamux logs from a recent reproduction of the issue:
Not sure which are actually related, but I figured I'd leave it here as a reference.
Encountered this again with multiplex. I was still able to ping mars during the bug (this goes against what I was saying earlier, where pinging mars hung while the bug was in effect under multiplex).
Wow, this issue crops up a lot on shitty Comcast internet. There's definitely a correlation with TCP connection timeouts here.
@Stebalien this is the issue I was talking about earlier.
RR (http://rr-project.org/) might be a useful tool to debug this, but: 1. it would have to happen in a single-threaded context (RR limits threading), and 2. we would have to capture it. If we captured it under RR, that would be great, as we could analyse the cause much more easily. Do we have any reproduction scenarios that might have some chance of reproducing it?
Today is a good day. @Stebalien ran into this bug earlier today, and we decided to sit down and debug it. Four hours later, we know what causes the bug, have one patch that fixes this particular instance of it, and know what to do to prevent this from happening in other protocols that might make the same mistake. The reproduction goes as follows:
At this point, the bug has been reproduced. A very simple workaround is to not open a bitswap stream in the wantmanager unless we actually have a message to send. However, what's not known is why the other side doesn't recover from this. When the negotiate timeout happens, it closes the stream on its side, at which point the other side should fail to write on the next try and reopen a new stream that should successfully negotiate in time.
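A simplified sketch of that workaround (hypothetical types, not the actual wantmanager code): the sender only opens a stream once it actually has a message queued, instead of opening one eagerly.

```go
package wantsketch

import "io"

// msgSender is a stand-in for the wantmanager's per-peer sender.
type msgSender struct {
	open   func() (io.WriteCloser, error) // opens a new bitswap stream (hypothetical)
	stream io.WriteCloser
}

// send lazily opens the stream only when there is data to write, so an idle
// peer never sits on a half-negotiated stream that can wedge.
func (m *msgSender) send(msg []byte) error {
	if m.stream == nil {
		s, err := m.open()
		if err != nil {
			return err
		}
		m.stream = s
	}
	if _, err := m.stream.Write(msg); err != nil {
		// On error, close and drop the stream; the next send reopens one.
		m.stream.Close()
		m.stream = nil
		return err
	}
	return nil
}
```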
Figured it out. When you close one side of a yamux stream, the other side can continue to write without erroring. I am firmly of the opinion that this is a bug.
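A small standalone check of that behaviour (assuming hashicorp/yamux over an in-memory net.Pipe; it is timing-dependent, so treat it as a sketch rather than a reliable test):

```go
package main

import (
	"fmt"
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

func main() {
	// Two yamux sessions over an in-memory pipe.
	c1, c2 := net.Pipe()
	client, err := yamux.Client(c1, nil)
	if err != nil {
		panic(err)
	}
	server, err := yamux.Server(c2, nil)
	if err != nil {
		panic(err)
	}

	// The accepting side closes its end of the stream immediately.
	go func() {
		st, err := server.AcceptStream()
		if err != nil {
			panic(err)
		}
		st.Close()
	}()

	st, err := client.OpenStream()
	if err != nil {
		panic(err)
	}
	time.Sleep(100 * time.Millisecond) // give the remote time to close

	// The surprising part: writes on this side may keep reporting success
	// even though the remote has already closed the stream.
	for i := 0; i < 3; i++ {
		_, werr := st.Write([]byte("hello"))
		fmt.Println("write err:", werr)
	}
}
```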
Okay, so moving forward here. The plan is to add a … Shorter term, we will fix the bitswap issue by not opening a stream until we actually have data to write.
I've been observing that every so often, under a certain workflow, I get a stuck stream. The workflow that I've been doing when this occurs is as follows:

Things I've noticed:
- `ipfs bitswap wantlist -p QmaCpd...` shows nothing. This implies that mars is hanging while sending me their wantlist, and not that my node is hanging while sending the blocks. Which then further implies that the wantmanager stream that we use to send wantlist updates has something wrong with it.
- yamux vs multiplex: under yamux, when this bug occurred, I could still ping mars from my node (ping opens a new stream, so this tells us that just one stream is hung and not the multiplexer routine). Under multiplex, this was not possible: no new stream could be opened, and `ipfs ping MARS` just hung, meaning that we couldn't open a new stream, implying that our multiplex writer routine was blocked somewhere and that somewhere on mars' side, a multiplex stream was full of data with nothing to read it.

My next steps will be to deeply investigate the wantmanager wantlist stream sender code to make sure we aren't accidentally dropping a stream without closing it, and to look for any other place where such a thing might happen.
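As an illustration of the bug class being hunted for here (hypothetical code, not the actual wantlist sender): replacing a stream reference without closing the old one leaves the remote side attached to a stream that nobody will ever read or write again.

```go
package sendersketch

import "io"

// streamSender is a hypothetical stand-in for the wantlist sender under audit.
type streamSender struct {
	stream io.WriteCloser
}

// resetStream swaps in a fresh stream. The important detail is closing the
// old one: silently dropping it (just overwriting the field) is exactly the
// kind of leak described above, since the peer never learns the stream is dead.
func (m *streamSender) resetStream(open func() (io.WriteCloser, error)) error {
	fresh, err := open()
	if err != nil {
		return err
	}
	old := m.stream
	m.stream = fresh
	if old != nil {
		old.Close() // without this, the remote side can block forever on the old stream
	}
	return nil
}
```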