Libp2p does not recover from interface being temporally down #374

Closed
hsanjuan opened this issue Jul 16, 2018 · 7 comments
Labels
exp/wizard Extensive knowledge (implications, ramifications) required kind/bug A bug in existing code (including security flaws) P1 High: Likely tackled by core team if no one steps up

Comments

@hsanjuan
Contributor

From time to time, libp2p hosts in our storage cluster get stuck in dial backoff forever after a problem with the listen interface. It starts with:

Jul 16 04:02:21 cluster2.fsn kernel: igb 0000:03:00.0 eth0: igb: eth0 NIC Link is Down
Jul 16 04:20:29 cluster2.fsn kernel: igb 0000:03:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX

Libp2p does not recover from this event. It shows:

i/o deadline reached (a few times after iface is down)
connection reset (5 secs later)
dial attempt failed: context deadline exceeded (+1min later)
dial backoff (+1 min later)

From there, "dial attempt failed" and "dial backoff" errors keep recurring until the peers are restarted, well after the network interface is back up. After restarting the peers, everything works again.

These are standard libp2p basic hosts, from 5.0.17. The services they run continuously attempt to re-open streams after such failures.

Ideally, libp2p should be able to recover from such events automatically, but I'm not sure how difficult this is or why exactly it stays erroring.
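
For reference, the kind of retry loop the services run can be sketched as follows. This is a minimal sketch using only the Go standard library, not the actual service code or any libp2p API: `retryOpen`, `nextDelay`, `attemptsAfterOutage`, and the `open` callback are hypothetical names, and the delays are arbitrary. It only shows that the application side keeps retrying with a capped delay, so it should recover on its own once dials start succeeding again.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// nextDelay doubles the current delay and caps it at max.
func nextDelay(cur, max time.Duration) time.Duration {
	next := cur * 2
	if next <= 0 || next > max { // next <= 0 also guards against overflow
		return max
	}
	return next
}

// retryOpen calls open until it succeeds or ctx is done, waiting with capped
// exponential backoff between attempts. It returns the number of attempts made.
// open is a hypothetical stand-in for whatever re-opens the stream.
func retryOpen(ctx context.Context, open func(context.Context) error, base, max time.Duration) (int, error) {
	delay := base
	for attempt := 1; ; attempt++ {
		err := open(ctx)
		if err == nil {
			return attempt, nil
		}
		timer := time.NewTimer(delay)
		select {
		case <-ctx.Done():
			timer.Stop()
			return attempt, fmt.Errorf("gave up after %d attempts: %w (last error: %v)", attempt, ctx.Err(), err)
		case <-timer.C:
		}
		delay = nextDelay(delay, max)
	}
}

// attemptsAfterOutage simulates a link that fails the first n dials and
// reports how many attempts retryOpen needed before succeeding.
func attemptsAfterOutage(n int) int {
	open := func(context.Context) error {
		if n > 0 {
			n--
			return fmt.Errorf("dial attempt failed")
		}
		return nil
	}
	attempts, _ := retryOpen(context.Background(), open, time.Millisecond, 8*time.Millisecond)
	return attempts
}
```

With a simulated outage of three failed dials, `attemptsAfterOutage(3)` succeeds on the fourth attempt. In our case the equivalent loop never succeeds, even after the interface is back up.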

@Stebalien
Member

I don't think this is due to the backoff logic, probably something else.

  • Are the peers changing their IP addresses?
  • Can they dial peers outside the cluster?
  • What listen addresses are you using in your config?

@daviddias daviddias added kind/bug A bug in existing code (including security flaws) P1 High: Likely tackled by core team if no one steps up exp/wizard Extensive knowledge (implications, ramifications) required labels Jul 17, 2018
@arzahs

arzahs commented Feb 10, 2021

Hello. Is there a solution for this issue? I'm hitting the same problem with:

github.com/libp2p/go-libp2p v0.11.0
github.com/libp2p/go-libp2p-core v0.6.1

@jkassis

jkassis commented Aug 5, 2021

Peer A creates a stream, which initializes a connection to peer B. Peer A's network interface then connects to a new wifi network (it goes out of range of the one it was in and hops to another). The connection / stream stops working, which is not completely unexpected. But when it hops back to the previous wifi network and gets the same assigned IP address, the connection / stream doesn't resume.

I'm not entirely sure about the design objectives here, but this is kind of a show-stopper for P2P IoT.

@jkassis

jkassis commented Aug 5, 2021

How can I at least check the status of the connection? Conn.Stat doesn't seem to do it.

@Stebalien
Member

but when it hops back to the previous wifi network and gets the same assigned IP address, the connection / stream doesn't resume.

That's not something we can fix (without stream migration, see libp2p/specs#328). When peer B switches networks, all of its connections get cut by the OS. The problem is that peer A won't see this till it hits a timeout.
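
To notice a dead connection sooner than the transport timeout, an application can probe it with a deadline. The following is a rough sketch using only the Go standard library and an in-memory `net.Pipe`, not libp2p's API: `probe`, `echoOnce`, and `checkPeers` are hypothetical names, and the one-byte ping format is an assumption made for illustration. go-libp2p ships a ping protocol (`p2p/protocol/ping`) that serves a similar purpose over a real stream.

```go
package main

import (
	"errors"
	"io"
	"net"
	"os"
	"time"
)

// probe writes a one-byte ping and waits up to timeout for a one-byte reply.
// A deadline error means the peer did not answer in time.
func probe(conn net.Conn, timeout time.Duration) error {
	if err := conn.SetDeadline(time.Now().Add(timeout)); err != nil {
		return err
	}
	defer conn.SetDeadline(time.Time{}) // clear the deadline for later use
	if _, err := conn.Write([]byte{0x1}); err != nil {
		return err
	}
	var reply [1]byte
	_, err := io.ReadFull(conn, reply[:])
	return err
}

// echoOnce answers a single ping; it stands in for a healthy remote peer.
func echoOnce(conn net.Conn) {
	var b [1]byte
	if _, err := io.ReadFull(conn, b[:]); err == nil {
		conn.Write(b[:])
	}
}

// checkPeers probes one responsive and one silent in-memory peer. It reports
// whether the healthy probe succeeded and whether the silent one timed out.
func checkPeers(timeout time.Duration) (healthyOK, silentTimedOut bool) {
	a, b := net.Pipe()
	defer a.Close()
	defer b.Close()
	go echoOnce(b)
	healthyOK = probe(a, timeout) == nil

	// d never reads or writes, like a peer whose link silently vanished.
	c, d := net.Pipe()
	defer c.Close()
	defer d.Close()
	err := probe(c, timeout)
	silentTimedOut = errors.Is(err, os.ErrDeadlineExceeded)
	return healthyOK, silentTimedOut
}
```

The probe does not revive the old connection. It only surfaces the failure quickly so the application can close the connection and dial again.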

@Stebalien
Member

Stebalien commented Aug 6, 2021 via email

@MarcoPolo
Collaborator

I know this is an old issue, but I just verified that this no longer happens.
