Connections between peers on a private testnet drop for no apparent reason #2668
Updates from Gitter: upgraded Parity indeed shows peer info. Now again with two instances I have 2 peers connected in each instead of 1, both with the same id (the enode of the other instance, unsurprisingly) and differing only in localAddress and remoteAddress; see https://gist.github.com/lgpawel/638b21babf3cf61047b1b118b12311fa
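(For anyone trying to reproduce this, one way to inspect such duplicate connections directly is Parity's parity_netPeers RPC. A minimal sketch, assuming the HTTP JSON-RPC server is running on the default port 8545 and the parity API set is enabled:)

```sh
# Query detailed peer info, including localAddress/remoteAddress per connection.
# Port and enabled API set are assumptions; adjust to your own setup.
curl -s -X POST -H 'Content-Type: application/json' \
     --data '{"jsonrpc":"2.0","method":"parity_netPeers","params":[],"id":1}' \
     http://127.0.0.1:8545
```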
Now with
Should be fixed in latest master. @lgpawel could you confirm?
Not really, unfortunately. Pulled in the latest updates, cargo-built the release; the version now reports as

Then I ran miners for two of them. (The third sits on a quite long fork that it cannot reorg by itself. The fork is the result of one of these disconnects happening some time ago; I was not around to force the reconnect before it got that long. I don't care about this, though.) All connections between instances 1 and 3 dropped pretty much instantly; within minutes one of the two connections between instances 1 and 2 followed. After some time, one of the connections between 2 and 3 died, soon followed by the rest of them, and finally the remaining connection between 1 and 2 was also lost. All this happened over the course of some 30 minutes.

I don't care about 3 disconnecting from 1 or 2; I guess this may even be intended, when 3 e.g. receives a block freshly mined by 1 or 2, recognizes it as incompatible with the fork it's sitting on, and concludes that these peers are not going to send sensible blocks. But obviously 1 and 2 disconnecting is an issue.

I didn't grab the full logs, but some of them are here: https://gist.github.com/lgpawel/0171fe09498b1c95cffa277ea4d02eb3
Could not reproduce this on latest master with the given chain spec. @lgpawel could you check again? Please run with
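(The exact flags requested above were lost from this extract. As a rough illustration only, verbose logging in Parity is switched on with the -l/--logging option; the specific targets and levels below are assumptions, not necessarily what was asked for here.)

```sh
# Hypothetical invocation: trace-level logging for the sync and network
# subsystems, with stderr (where Parity writes its log) captured to a file.
parity --chain spec.json -l sync=trace,network=trace 2> node1.log
```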
Just to be on the safe side, I've (re)installed Cargo from http://rustup.rs, re-cloned the Parity repo, did

I ran three nodes as

Now, the redundant connections between 1 and 3, 1 and 2, and 2 and 3 dropped some 2.5, 3.5, and 10 hours after I switched mining on, respectively. As of now, 15 hours after switching mining on, single connections between all nodes are kept. So it's not 100% conclusive to me. I kind of expected to find all the connections, including the redundant ones, still live in the morning, and then I'd be more than happy to have this issue closed. In this situation, however, I can still entertain the possibility that the connection dropping I've observed may not be limited to redundant connections.

I'm not posting any logs yet, because after a run that long with

I'm keeping the setup running as of now, and I certainly can let it run over the weekend, unless I accidentally kill the processes or something.

Update: a few minutes ago (some 17.5 hours after switching mining on) a second connection between 2 and 3 (re)appeared. This is something I've never seen in my local testnet experiments before, and it's nice to see connections appear rather than drop for a change, but for a redundant connection it seems somewhat dodgy.

Moar update: back to a single connection between these nodes 20 min later.

(Disclaimer: I'm not looking into specific peer info from Parity; instead I infer the number of connections between specific nodes from the number of peers each of them reports. E.g. if these go from 2-2-2 to 2-3-3, I conclude that a connection appeared between 2 and 3.)
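(The actual launch commands are not preserved in this extract. For orientation, a local three-node setup of this kind would look roughly like the sketch below; the paths, ports and enode IDs are placeholders, and the flags are the standard Parity ones, not necessarily the reporter's exact invocation.)

```sh
# Illustrative only: three Parity instances on one machine, all using the same
# private chain spec, with separate data dirs and ports, and bootnodes pointing
# at the other two. <id1>/<id2>/<id3> stand for the nodes' enode IDs.
parity --chain spec.json -d /tmp/node1 --port 30301 --jsonrpc-port 8541 \
    --bootnodes enode://<id2>@127.0.0.1:30302,enode://<id3>@127.0.0.1:30303
parity --chain spec.json -d /tmp/node2 --port 30302 --jsonrpc-port 8542 \
    --bootnodes enode://<id1>@127.0.0.1:30301,enode://<id3>@127.0.0.1:30303
parity --chain spec.json -d /tmp/node3 --port 30303 --jsonrpc-port 8543 \
    --bootnodes enode://<id1>@127.0.0.1:30301,enode://<id2>@127.0.0.1:30302
```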
So each node reports 4 connections initially? As in
As for the status of the test: the remaining connection between 1 and 2 dropped some three hours ago, and just a while ago the same happened to the last connection between 1 and 3. This left 1 isolated from the rest of the network for some five minutes, until the 1-3 connection reappeared. (Also, the additional 2-3 connection reappeared and disappeared four times in the meantime.)

I don't know whether, even on a small LAN testnet, being disconnected for five minutes is enough for a miner to develop a critically long fork, but if five minutes is needed to reconnect even to nodes on the same computer, then I guess it may take even longer, especially in other circumstances. That is, if it isn't by design. Also note that the connection between 1 and 2 had not been reestablished for 3 hours. (Update: it has just been reestablished, after some 3.5 hours.)

I guess I should upload just the relevant parts of the logs, which I suppose would mean from some time before a connection is lost until the loss itself (or a bit after). Please let me know how long the sections I extract should be, and for which disconnects (or reconnects).
@lgpawel a few seconds' worth of logs around any disconnect event, plus the start, should do it, thank you
I had left it running over the weekend and this morning I discovered that they were all disconnected from one another: node 1 from the others for 10 hours already, and nodes 2 and 3 from one another for some two hours, with some short reconnects happening after that. I've stopped the processes; the logs are 6.5-7.7 GB each.

As for the recommendation above, I've decided to look at the messages printed every 30 seconds that contain the peer count, and whenever this count changed between consecutive messages, to print all the messages logged in that 30-second window. However, this is still a lot: across all three nodes there have been 220 occurrences of this kind, and for each of them there is anything between 146 and 193,527 lines logged (although both of these are outliers). The whole set has 2.5M lines and 200 MB.

In https://gist.github.com/lgpawel/66c5453ada316f199ad8311d28b5a60e I've pasted a list of these log snippets with their respective line counts and sizes. The file names contain the node number, the times of the beginning and end of the window logged, and the change in the number of peers that happened within it. I can upload them somewhere (what filesharing service is easy to use these days? or should I set up a GitHub repo?), but if only some of them are needed, please let me know. If I should crop them even further, please let me know too, but in that case I'd like to be able to script it.
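(For reference, a filtering step of the kind described above can be scripted roughly as follows. This is a sketch, not the reporter's actual script, and it assumes the periodic status line contains a token like "3/25 peers".)

```sh
# Buffer everything between consecutive periodic status lines; when the peer
# count reported in the status line changes, dump the buffered window.
# The "N/M peers" pattern is an assumption about the log format.
awk '
  match($0, /[0-9]+\/[0-9]+ peers/) {
    count = substr($0, RSTART, RLENGTH)
    sub(/\/.*/, "", count)                  # keep only the connected-peer count
    if (seen && count != prev) {
      for (i = 0; i < n; i++) print buf[i]  # the window before this status line
      print $0
      print "--------"
    }
    prev = count; seen = 1; n = 0
    next
  }
  { buf[n++] = $0 }
' node1.log > node1_peer_changes.log
```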
A couple of these files would do for a start. I'm interested in the disconnect reason; it should be in the log along with the disconnect event.
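(If it helps to narrow things down further, a quick pass over the logs for the disconnect messages themselves could look like this; the exact wording of the message may differ between versions, so the pattern is a guess.)

```sh
# Pull out lines mentioning disconnects, with line numbers for cross-reference.
grep -n -i "disconnect" node1.log | head -n 50
```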
Here https://gist.github.com/lgpawel/c4d0f827630a42a100559ff7e262dbcb are two pairs of disconnect logs, chosen by virtue of relative brevity. One pair is
So it looks like the node was disconnected because it failed to respond to a ping. Is there a log from the other node for the same time? If you are mining on the same machine, consider allocating fewer CPU cores to mining. Also use

EDIT: nm, I've found the second log in the same gist.
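(One generic way to keep the miner from starving the nodes, independent of any Parity flag, is to pin it to specific cores at the OS level. A sketch; the miner command is left as a placeholder since it isn't specified in this thread.)

```sh
# Pin the miner to core 0 only, leaving the remaining cores free so the
# Parity nodes can answer pings in time. <miner command> is a placeholder.
taskset -c 0 <miner command>
```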
As stated above, I've passed flags
Any movement on this, @arkpar?
A potential fix has been merged to master. @lgpawel could you give it another run with the latest build?
I'm now testing it in a basically identical setup to the one in https://github.com/ethcore/parity/issues/2668#issuecomment-267560784. Parity is now in

So far the behaviour is similar to the report above. All the nodes initially reported four connections each, from which I infer that they've established double connections to one another. This remained consistent for some 18 hours, until I switched mining on (with

https://gist.github.com/lgpawel/4ab5b97a2e062f07b02b1db2f687e371 contains the log fragments for the connection drops mentioned above. Also, for node 1 there was a brief moment in which one of the two remaining connections showed up as pending, and this log fragment is also included.

As an aside, in https://gist.github.com/lgpawel/7168834dc25ce7d8ef0a932fb854e791 I've posted a few hundred lines from the top of each of the logs, up to the point where they already report the duplicate connections; maybe this is of some interest. I guess that even if the connection dropping is fixed, this warrants an issue of its own, even if it's less important?
Putting aside a handful of seconds-long disconnects, over the weekend there were three occasions on which the nodes all disconnected from one another, leaving the network completely fragmented, and then reestablished full connectivity (in all cases after about 90 minutes). Luckily this seems not to have broken consensus in this case, as the nodes are now connected and report importing mined blocks from one another.

In https://gist.github.com/lgpawel/c19f39fb13e00a7c478b52725496ef76 there's a minute's worth of logs from each of the nodes building up to the first of the three longer disconnects. At the end there's a list of snippets that I've generated with a script, and the filenames serve as a transcript of the changes to the reported connection numbers. The file sizes/line counts are not directly comparable to those from the run a month ago, as the log verbosity seems to have been altered internally and my script is also different now.

I'd say that connection stability is much, much better, as there are far fewer disconnects and the connections seem to get picked up after at most ca. 90 minutes (compared to >10 hours without reconnecting in one case previously). Yet these periods of complete network fragmentation are worrying, and I'm not going to go ahead and close this issue yet. Please ask away for more logs if needed.
Looks like the remaining disconnects might be caused by your LAN going down for a short duration. Try using the localhost address instead of the LAN address for bootnodes, i.e. replace
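(Concretely, the suggestion amounts to swapping the IP part of the enode URL passed to --bootnodes; the node ID and port below are placeholders.)

```sh
# Before: bootnode given by its LAN address
parity --chain spec.json --bootnodes enode://<node-id>@192.168.1.10:30302
# After: same node, addressed via loopback since it runs on the same machine
parity --chain spec.json --bootnodes enode://<node-id>@127.0.0.1:30302
```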
I've now restarted the run accordingly, after updating to

Would it be worth trying this out with nodes running on separate machines, with actual LAN connections between them? Or maybe it should be considered solved altogether, if LAN problems were responsible for the last disconnects?
If this is caused by LAN problems, it is indeed not an issue, as long as peers reconnect when the network is up again.
OK, I ran it with the new version for 17 hours before switching mining on and 53 hours afterwards, and all in all I've seen no redundant connections and only a handful of seconds-long disconnects. I guess this concludes it all. Thanks!
Thank you for the detailed reports and testing.
With Parity version

v1.4.0-unstable-271bcf4-20161009/x86_64-linux-gnu/rustc1.12.0

and a private testnet (spec file: https://gist.github.com/lgpawel/a6ba7660b778dc9775b00849abfc8be0) I experience unexpected disconnects. In both of the following examples, there are just two instances of Parity running in parallel on the same machine, launched with --bootnodes pointing the instances to one another. For some reason (which may or may not be a separate issue), the number of connected peers reported is bigger than 1, as if there were multiple parallel connections between the instances.

In these logs the instances sit idle reporting two connections between them, then one of them drops, and after some time the other does too: https://gist.github.com/lgpawel/711033cfc0f4e0bd180d2e5397d3e91b

In these, the instances sit idle, apparently stable with three connections between them; then I launch mining for both of them (explicitly limiting it to 1 and 2 cores, respectively, of a four-core processor) and the connections are immediately dropped, this time with much more verbose output: https://gist.github.com/lgpawel/c7c093325a533388d961e9e294849290