Random upgrade timeouts between peers #1629
Hi,

To give some insight: the "upgrade", in the case of Kademlia, also includes transferring the Kademlia request and response, so the word "upgrade" is a bit misleading. The substream doesn't leave […]. Could you maybe look at the […]?
Thank you for the explanation and your fast response. If I understood correctly, it may be that Kademlia on node B doesn't look up the record then?
When node B first starts to add the provider records, it logs a bit more, like: […]

But that's about it, no apparent error or anything else. I'll double-check those (that's about 9k lines of log). Maybe this part of the code doesn't have many logs? I'll try to add some logs and see what happens then! If it can help, here is my Kademlia configuration:
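(The snippet itself did not survive in this copy of the thread. As a stand-in, here is a hedged sketch of what a Kademlia setup looks like in the 0.20-era API, with hypothetical values rather than the author's actual ones:)

```rust
use std::time::Duration;

use libp2p::{
    identity,
    kad::{record::store::MemoryStore, Kademlia, KademliaConfig},
    PeerId,
};

// Hypothetical values: the configuration from the original comment was not
// preserved, so this only illustrates the shape of the API.
fn build_kademlia(local_key: &identity::Keypair) -> Kademlia<MemoryStore> {
    let local_peer_id = PeerId::from(local_key.public());
    let mut cfg = KademliaConfig::default();
    // Overall query timeout; unrelated to the 10-second substream upgrade
    // timeout discussed in this issue.
    cfg.set_query_timeout(Duration::from_secs(60));
    let store = MemoryStore::new(local_peer_id.clone());
    Kademlia::with_config(local_peer_id, store, cfg)
}
```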
My intuition is that you're not properly polling the `Swarm`.
Thanks for the hint. I'm using the exact same pattern as this example https://github.com/libp2p/rust-libp2p/blob/master/examples/chat.rs#L150 to poll the swarm (a sketch of that pattern follows this comment). I also have the stdin pattern to read debug commands I might need at runtime. When I'm reproducing the timeout, hitting […]:

Node B logs […]

Node C logs […]

The log line […]. I haven't had the time since last Thursday to get back to debugging this, so I don't have more information yet. Also, if you have some ideas as to where I could add logs to try to figure out what's happening, I will gladly add them. As a side note, my DHT requests are not concurrent: I only have one DHT request at a time, although I may have multiple requests per second. Once one is finished, another might happen.
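(For reference, the chat.rs polling pattern mentioned above looks roughly like the following; this is a hedged sketch assuming async-std and the libp2p 0.20-era API, with a bare Kademlia behaviour standing in for the author's actual composite one:)

```rust
use std::task::{Context, Poll};

use async_std::task;
use futures::{future, prelude::*};
use libp2p::{
    identity,
    kad::{record::store::MemoryStore, Kademlia},
    PeerId, Swarm,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let keys = identity::Keypair::generate_ed25519();
    let peer_id = PeerId::from(keys.public());
    let transport = libp2p::build_development_transport(keys)?;

    // A stand-in behaviour; the issue uses a custom composite behaviour.
    let store = MemoryStore::new(peer_id.clone());
    let behaviour = Kademlia::new(peer_id.clone(), store);

    let mut swarm = Swarm::new(transport, behaviour, peer_id);
    Swarm::listen_on(&mut swarm, "/ip4/127.0.0.1/tcp/0".parse()?)?;

    // Drain the swarm until it returns Pending, as in chat.rs; only then
    // yield back to the executor so the task gets woken up again.
    task::block_on(future::poll_fn(move |cx: &mut Context<'_>| {
        loop {
            match swarm.poll_next_unpin(cx) {
                Poll::Ready(Some(event)) => println!("swarm event: {:?}", event),
                Poll::Ready(None) => return Poll::Ready(()),
                Poll::Pending => return Poll::Pending,
            }
        }
    }));
    Ok(())
}
```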
Hey,

So I've debugged this a bit more but still haven't found the root cause yet. Here's what I noticed: it seems like the problem comes from node C instead of node B after all. Node B successfully (at least now) calls […]. Node C is the node that initiates the DHT requests. On node C it never reaches the […]. Anyway, I always have at least one of the two requests that gets a response from one of the nodes. From here, here's what happens on the request that fails: […]

The second request sent will go through the same process, but I guess we already had both messages in the network buffer so they get polled only once. But on previous requests, I've noticed the same behaviour: the last stream to be opened will be the first one to be polled, and the first message is already there but not the second (yet). BUT in that case, when the second one is received, the connection handler gets polled again, which is not the case here. Here are some logs. They're quite verbose, but I didn't want to clean them up so you could have the exact same view as I have; they are annotated, though. I've deleted the timestamps, but they all happen in the same 800 milliseconds except for the few at the end once the timeout is reached. Annotated lines are prefixed with […].

Detailed logs […]
And from here, I'm not sure where to look or what to do :) Again, thanks for your time on this!
It really doesn't have to be minimal or reliably reproduce the problem all the time, but if we had a way to run your scenario ourselves with some reasonable probability of reproducing the problem, I think that would help us greatly. In any case, thanks for your detailed write-ups.
I am also wondering if there is some relation to #1504 which, albeit closed because the panic was resolved, also seemed to be primarily about problems with provider lookups, as this excerpt shows: […]
I can try to generally take a close look at the substream management in […].
Once the issue here is resolved, I would love to hear about the problems you had with yamux. Maybe you could open an issue at https://github.com/paritytech/yamux or write me an e-mail?
Hello,

So I've come up with a reproducer. Since I knew more about the issue, I knew what to do to trigger it again. You can find the repository here: https://github.com/BlackYoup/libp2p-handler-timeout-reproducer

It's still a handler upgrade timeout, but during multiple PUTs of providers into the DHT, not during GET PROVIDERS requests. The PUTs follow the same pattern, one at a time. I unexpectedly ran into it when iterating on the reproducer and, on the contrary, couldn't reproduce my issue when using GET PROVIDERS requests. Anyway, hopefully the same thing happens. The readme contains most of the steps to reproduce and how I start it on my machine, so ask me any questions about it if needed. I'm using Rust 1.44.1.
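(For readers following along: the provider PUTs and lookups discussed in this thread boil down to two Kademlia calls. This is a hedged sketch with hypothetical helper names, not the reproducer's actual code:)

```rust
use libp2p::kad::{record::store::MemoryStore, record::Key, Kademlia};

// Hypothetical helpers illustrating the two operations: announcing a
// provider record (the PUT) and looking providers up (GET PROVIDERS).
fn announce_provider(kademlia: &mut Kademlia<MemoryStore>, name: &str) {
    let key = Key::new(&name);
    // Stores the provider record locally and starts a query that
    // replicates it to the peers closest to the key.
    let _ = kademlia.start_providing(key);
}

fn lookup_providers(kademlia: &mut Kademlia<MemoryStore>, name: &str) {
    let key = Key::new(&name);
    // The result arrives later as a KademliaEvent emitted by the swarm,
    // which is also where the Upgrade(Timeout) failure shows up.
    let _ = kademlia.get_providers(key);
}
```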
Sure. It's on my todo list to report this issue. I hit it 2–3 months ago and my knowledge was more limited, so I would have to retry. If I remember correctly, it was about the fact that I wanted to send big chunks over the network to another peer and yamux would not handle it correctly, leading to data not being sent. When I switched to mplex, I used […], which solved my problem. I believe this is the same issue as explained in #1257 (comment), after which I moved to mplex.
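(The exact mplex setting is not preserved in this copy of the thread. In the pre-#1769 mplex API, the buffer-related knobs look roughly like this; the values are hypothetical:)

```rust
use libp2p::mplex::{MaxBufferBehaviour, MplexConfig};

// Hypothetical values: the author's actual settings were not captured.
fn mplex_config() -> MplexConfig {
    let mut cfg = MplexConfig::new();
    cfg.max_buffer_len(4096)
        // Block the whole connection instead of resetting substreams when
        // a substream's receive buffer is full.
        .max_buffer_len_behaviour(MaxBufferBehaviour::Block);
    cfg
}
```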
I do not want to hijack this issue thread, so it would be best to discuss this elsewhere, but I would be keen to know what exactly was not handled correctly.
I'm sorry to say that so far I've been unable to reproduce a problem. All 1500 provider records are always successfully stored when I run the provided code, i.e. it always counts down to […].
I could reproduce the problem using https://github.com/BlackYoup/libp2p-handler-timeout-reproducer. It looks like a missing task wakeup to me. Usually a stalling output resumes later, e.g. because identify kicks in:
Sometimes, there are shorter pauses:
It may be related to mplex, but I am not sure. Maybe it just has different timing characteristics that trigger the bug. Anyway, I tried the test with yamux and could no longer reproduce it. @BlackYoup, could you try with yamux and see if the problem persists for you? If not, we can focus on mplex; otherwise we may need to look elsewhere.

Edit: I should add that I was using libp2p 0.21.1 from crates.io and async-std 1.6.2.
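(Switching the multiplexer amounts to swapping the `.multiplex(...)` step of the transport builder. A hedged sketch assuming the libp2p 0.21-era API, where the yamux configuration type was `libp2p::yamux::Config`; later releases renamed it:)

```rust
use libp2p::{
    core::{muxing::StreamMuxerBox, upgrade},
    identity, plaintext, tcp, yamux, PeerId, Transport,
};

// The same plaintext-over-TCP transport as elsewhere in this thread, with
// yamux substituted for mplex on the `.multiplex(...)` line.
fn yamux_transport(keys: identity::Keypair) -> impl Transport<Output = (PeerId, StreamMuxerBox)> {
    tcp::TcpConfig::new()
        .upgrade(upgrade::Version::V1)
        .authenticate(plaintext::PlainText2Config { local_public_key: keys.public() })
        .multiplex(yamux::Config::default())
        .map(|(peer, muxer), _| (peer, StreamMuxerBox::new(muxer)))
}
```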
Hello! I was on holiday last week, so I couldn't update this thread.

Oh, alright then. Did you use my fork of libp2p or did you switch to the official repository? If you did not use my fork, then it's expected that all the records are created correctly, because no errors are emitted: the pauses will happen and the process will resume normally after them. With my fork, a panic occurs that points directly at the issue. Other than that, I don't have many ideas. Maybe it comes down to processor speed, I don't really know.

Yeah, I've seen the longer pauses too. On my side, I mostly hit the 10-second pause. The longer ones happened here and there but were very rare, so rare that I forgot about them, to be honest. And if I recall correctly, I did not have them using this reproducer. Sure, I'll try with yamux and I'll keep you posted. I'll try to find time today or at some point during the week.
Hello everyone,

I tested my reproducer with yamux instead of mplex. It doesn't seem to trigger the bug in the ~10 tests I ran. I quickly tried it on my internal project, but unfortunately I'm having another issue, probably due to how I send messages between peers. I'm not sure if it is due to yamux or my debug code, but I won't have much time to dig into it over the next few days. Sorry about that.
Since #1769 addressed some problems with mplex, it would be great if you could check whether the issues you ran into with mplex still occur with your test setup when using […].
@romanb It seems like I'm still hitting the 10-second timeout with my reproducer. I used commit f66f40b and hit it every time I tried (5 tests). Another weird thing is that, in my reproducer, using both yamux and mplex on the current master, the swarm is sometimes not working properly. When I start my two instances, both peers seem to connect to each other fine, but some things are off:
I pushed this branch with my up-to-date setup (completely deleted Cargo.lock and target folder): https://github.com/BlackYoup/libp2p-handler-timeout-reproducer/tree/mplex-fix |
Thanks for the feedback, I will take another look at your setup. |
So the good news is that I think I found the reasons for the sporadic upgrade timeouts (see #1781). Regarding the other two new issues you describe, I'm still looking into those, but I've been confused so far by the fact that there was a period where I could trigger them rather easily, but only with […].
My current status here is that I can no longer reproduce any of the issues mentioned with […].
I've probably stumbled upon the cause of these new early connection negotiation stalls, which seem related to the plaintext protocol. See #1831. I hope that this was the last problem to resolve here 🙏.
I think all issues here have been resolved. Feel free to re-open if you still encounter the problems discussed here so far, or open new issues. Thanks again for your detailed reports and sample code.
Hello everyone,
I'm having a weird issue in which, sometimes, a substream will time out during its upgrade. I have little knowledge of Rust futures and of this project's internals, so I'm sorry if I misinterpret some things. You can find a commit with my debug logs here.
Context
I'm running 3 services on my local machine (let's call them nodes A, B and C), all of them using rust-libp2p. The swarm uses a custom NetworkBehaviour composed of 3 behaviours:
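(The list of the three behaviours did not survive in this copy of the issue. As a hedged stand-in, here is how such a composite behaviour is typically declared with the 0.20-era derive; only Kademlia and Identify are mentioned elsewhere in the thread, and the struct and field names are hypothetical:)

```rust
use libp2p::{
    identify::{Identify, IdentifyEvent},
    kad::{record::store::MemoryStore, Kademlia, KademliaEvent},
    swarm::NetworkBehaviourEventProcess,
    NetworkBehaviour,
};

// Hypothetical composite behaviour; the third behaviour from the original
// issue is not known and is therefore omitted here.
#[derive(NetworkBehaviour)]
struct MyBehaviour {
    kademlia: Kademlia<MemoryStore>,
    identify: Identify,
}

impl NetworkBehaviourEventProcess<KademliaEvent> for MyBehaviour {
    fn inject_event(&mut self, event: KademliaEvent) {
        println!("kademlia event: {:?}", event);
    }
}

impl NetworkBehaviourEventProcess<IdentifyEvent> for MyBehaviour {
    fn inject_event(&mut self, event: IdentifyEvent) {
        println!("identify event: {:?}", event);
    }
}
```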
The transport is plaintext TCP with the mplex stream multiplexer. I have some other issues with yamux, which made me use mplex.
My mplex configuration:
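(The configuration snippet itself was not captured. Below is a hedged sketch of a plaintext TCP transport with mplex in the 0.20-era builder style; the mplex values are placeholders, not the author's:)

```rust
use libp2p::{
    core::{muxing::StreamMuxerBox, upgrade},
    identity, mplex, plaintext, tcp, PeerId, Transport,
};

// Plaintext TCP with mplex, as described above; the mplex settings here are
// placeholders since the original values are unknown.
fn build_transport(keys: identity::Keypair) -> impl Transport<Output = (PeerId, StreamMuxerBox)> {
    let mut mplex_config = mplex::MplexConfig::new();
    mplex_config.max_buffer_len(4096);

    tcp::TcpConfig::new()
        .upgrade(upgrade::Version::V1)
        .authenticate(plaintext::PlainText2Config { local_public_key: keys.public() })
        .multiplex(mplex_config)
        .map(|(peer, muxer), _| (peer, StreamMuxerBox::new(muxer)))
}
```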
I'm using rust-libp2p 0.20.1 but had the same issue on 0.19.1. The issue may be related to timing: it might not occur in debug mode with enough logs to "slow" the services, but release mode almost always triggers it. Also, I started seeing it after a refactor of my code which seems to have brought a performance boost as well.
The 3 nodes connect to each other on my local machine using 127.0.0.1:port. Node B then advertises 1–2k provider records into the DHT. Node C then receives requests from another external service and does provider lookups in the DHT. The timeout randomly appears during those lookups. From time to time I get something like:

Request to PeerId("12D3KooWAZ1gXB3ZokKzEnkXvWDK6RWj38rNkkeVTsqyevL5AYqV") in query QueryId(67) failed with Upgrade(Timeout)

The timeout is the default one, 10 seconds.

Upgrade timeout
I started using RUST_LOG=trace to get all the logs I could and started tracing the issue. It seems that node C successfully creates the substream and starts sending data to node B. On node B, I can see the created substream:
Substream 68 logs […]
I added a few logs in node_handler and mplex; the whole commit is my debug attempt.
From what I see here, the newly created substream never leaves the negotiating_in "state" and I'm not sure why. For example, here's another substream successfully created right before substream 68:

Substream 67 logs […]
As you can see, the difference between substreams 67 and 68 is that the "filtering" function (logs starting with the Filtering. log) successfully returns when called on substream 67, but for substream 68 it tries to match it to a different substream and never tries again later. I've seen substreams where the filter function "fails" and then "succeeds" in matching the substream at some point.

I can't know for sure if that's the issue, but that's where I am right now and I don't have many more ideas. I'm not even sure that my issue comes from here, and that's why I opened this issue.
Unfortunately I can't share a reproducer, as I'm not sure how to reproduce it exactly. Like I said at the start, I first add ~1–2k provider records to the DHT, so it creates that many substreams too, and that works fine.
Would you happen to have any ideas as to where I should look next to continue debugging this?
Thanks!