-
Notifications
You must be signed in to change notification settings - Fork 673
restarting a peer can lead to permanent global non-convergence and connection churn #1554
Comments
I have a somewhat vague theory... For a restarted peer to be permanently "accepted" into a network, all peers in that network must no longer hold a reference to the old incarnation. That requires a) all peers that were connected to that peer to detect the loss of connectivity to that peer, and b) for that information to propagate to all other peers. Furthermore, all this needs to happen before any peer accepts the new incarnation of a peer. Otherwise, in effect, the "old incarnation of peer has gone away" and "here is a (new incarnation of a) peer" facts race around the network in the topology gossip and whenever they collide end up dropping the gossiping connection. I haven't totally convinced myself that convergence is impossible in some such circumstances, but it's certainly non-trivial to proof that it is. |
Here is a proposed fix:
The underlying idea is that knowledge in the network of an old incarnation of a peer will be replaced by information about the incarnation without requiring an intermediate step where peers are able to 'gc' the old incarnation. |
Hmm. I think this means we can get rid of the UID completely. |
There is an edge case here that prevents convergence in the absence of change: if the network has knowledge of an old incarnation at version V, and the new incarnation is also at version V, then the latter information will not supplant the former. That only happens when there is a further change to the peer, bumping the version to V+1. I can think of two ways of dealing with that:
The first prevents us from getting rid of UIDs. The second is potentially expensive, especially considering that we will nearly always hit the comparison case while the local peer is not changing. We could eliminate the cost by adding a hash of the content, but arguably that is no better than just keeping UIDs. And I reckon we also need to introduce the following tie-breaker:
That way there is convergence in the "same version but different UID" case, albeit possibly to the "wrong" state, i.e. using information from an old incarnation, but this will only be a temporary aberration. |
Backward compatibility is an issue here, even if we keep UIDs. The problem is that, as explained, current routers very much depend on the entire network forgetting about a peer first before accept a new incarnation of a peer. And the disconnects on UID mismatch rather help in creating such a situation. New routers wouldn't do that. One possibility here is to keep the disconnecting behaviour for direct connections, i.e. the We could alternatively/additionally advise users to "wait a while" when restarting a peer. |
I cannot manage to come up with a scenario where the old peer is not forgotten. For example, suppose we have peers A, B, C, D, connected in a ring, i.e. A<->B<->C<->D<->E and A<->E. At the start of this analysis, all peers know about all other peers.
A does not garbage-collect C as it still sees a connection via E.
E does not garbage-collect C as it still sees a connection via A. However, A will now gossip to E to say that its connection to B is gone, and once that happens, E will garbage-collect C and accept the new C2. Now A will drop the connection to E because E's knowledge of C2 clashes, and so A has no connections to anywhere and will garbage-collect everything. We see peer version numbers in the millions in logs, so connections were being added and removed very frequently. #1496 should slow this down, and perhaps allow us to understand the real problem underlying this report. |
Not if it since has been re-established.
#1496 will help. It may even make the problem go away. However, the scheme we have for dealing with re-starting peers is not obviously convergent, since it in effect relies on the transmission of events rather than states. And it couples the topology gossip in very intricate ways to connection lifecycle, namely closing connections on certain error conditions is required for convergence. My proposal fix addresses that head on. |
How can this happen? As long as B is telling A about C2, A will disconnect from B. |
The gossip is received some time after connection establishment. |
OK, but some time after that A will tell E that A->B is gone, or A will try to tell E about C2 and E will drop E->A. Either way, E will lose all connectivity to C and garbage-collect it. |
Not if the connection has been re-established. If you have two peers who continuously attempt to connect each other and succeed, only for the connection to be dropped due to some gossip, it is possible that the "disconnected" state is never propagated to other peers. |
My broader point though is that gossip convergence should not depend on careful orchestration of the underlay connectivity. That's just asking for trouble. |
OK, looking again at the logs I see that it is typical for the collection to last 10-40ms before being torn down, while the delay before reconnection is perhaps 2-10ms. With many peers, the chance that at least one of them claims to have a path to the old peer is increased. FWIW, I couldn't get this from the description:
What about: Otherwise, in effect, the "this connection to an old peer has gone away" and "I have a new connection which I still think leads to the old peer" facts race around the network in the topology gossip ? And
is disproved by my first attempt at a counter-example. |
The two facts I mentioned are the new facts. They both race around the network, together with gossip about the old state, i.e. the fact you mention.
It isn't. |
re backward compatibility
@bboreham and I are not convinced this really helps, and it delays acceptance of new peers. There is an interesting question here though... What should we do when accepting such a connection? In particular, should we update the UID? That would be oddly non-monotonic. We could have existing connections to the old incarnation that we haven't noticed dropping yet, the peer might be bouncing rapidly, etc. If we update the UID we hold for the peer, it would be down to luck to end up with the most recent one. Also, the version would be bogus. So instead I propose we simply leave the UID alone and let gossip update it eventually. |
I have finally managed to reproduce this. Sort of. First apply the following patch... diff --git a/router/gossip_channel.go b/router/gossip_channel.go
index 11e8e8a..7bb72d2 100644
--- a/router/gossip_channel.go
+++ b/router/gossip_channel.go
@@ -4,7 +4,10 @@ import (
"bytes"
"encoding/gob"
"fmt"
+ "strings"
"sync"
+
+ "github.com/weaveworks/weave/common/debug"
)
type GossipChannel struct {
@@ -156,6 +159,10 @@ func (c *GossipChannel) sendDown(conn Connection, data GossipData) {
sender, found := c.senders[conn]
if !found {
sender = NewGossipSender(func(pending GossipData) {
+ if c.name == "topology" &&
+ (strings.Contains(c.ourself.NickName, "host1") || strings.Contains(conn.Remote().NickName, "host1")) {
+ debug.Break("[gossip %s]: sending to %s", c.name, conn.Remote())
+ }
for _, msg := range pending.Encode() {
if len(msg) > maxFeasibleMessageLen {
panic(fmt.Sprintf("Gossip message too large: len=%d bytes; on channel '%s' from %+v", len(msg), c.name, pending))
@@ -231,6 +238,10 @@ func (c *GossipChannel) sendBroadcast(srcName PeerName, update GossipData) {
protocolMsg := ProtocolMsg{ProtocolGossipBroadcast, GobEncode(c.name, srcName, msg)}
// FIXME a single blocked connection can stall us
for _, conn := range connections {
+ if c.name == "topology" && strings.Contains(conn.Remote().NickName, "host1") {
+ c.log("DROPPING broadcast to", conn.Remote().String(), "from", srcName)
+ continue
+ }
conn.(ProtocolSender).SendProtocolMsg(protocolMsg)
}
} This also requires weave/common/debug/debug.go: package debug
import (
"fmt"
"os"
"os/signal"
"syscall"
)
type breakpoint struct {
format string
args []interface{}
resume chan<- struct{}
}
type debugger struct {
enabled bool
breakpoints []breakpoint
suspend chan<- breakpoint
}
var d debugger
func init() {
suspend := make(chan breakpoint, 0)
d = debugger{false, nil, suspend}
sigs := make(chan os.Signal, 1)
signal.Notify(sigs, syscall.SIGUSR1, syscall.SIGUSR2)
go func() {
for {
select {
case sig := <-sigs:
switch sig {
case syscall.SIGUSR1:
d.enabled = !d.enabled
if !d.enabled {
fmt.Fprintln(os.Stderr, "DEBUG: breakpoints disabled")
resume()
} else {
fmt.Fprintln(os.Stderr, "DEBUG: breakpoints enabled")
}
case syscall.SIGUSR2:
resume()
}
case bp := <-suspend:
if d.enabled {
fmt.Fprintf(os.Stderr, "BREAK "+bp.format+"\n", bp.args...)
d.breakpoints = append(d.breakpoints, bp)
} else {
bp.resume <- struct{}{}
}
}
}
}()
}
func resume() {
for _, bp := range d.breakpoints {
fmt.Fprintf(os.Stderr, "RESUME "+bp.format+"\n", bp.args...)
bp.resume <- struct{}{}
}
d.breakpoints = nil
}
func Break(format string, args ...interface{}) {
resume := make(chan struct{}, 0)
d.suspend <- breakpoint{format, args, resume}
<-resume
} Then establish the following topology using your desktop machine and our three vagrant VMs: host2
/ | \
desktop | host1
\ | /
host3 I do this by
The patch above
Once everything is settled (check with
This pauses topology gossip between host2/3 and host1. Next we wait for the "breakpoints" to hit - there will be messages like
in the host1/2/3 weave docker container logs. Once that has happened we know that any subsequent gossip will be merged into the existing gossip data held by the We now restart weave on our desktop:
Wait for a while for things to settle. Then re-enable gossip from host2 to host1:
Watch the host1 and host2 logs. Eventually - it takes a while because of the continued dropping of topology gossip from host1 and topology broadcast to host1 - we see the connection between the two getting dropped with something like
Eventually host2 will reconnect to host1, and, after a while, the same error occurs. And so on, indefinitely. |
The test demonstrates that some extreme scheduling can result in indefinite non-convergence. Essentially host1 never forgets about the old incarnation of the desktop peer because it thinks it has a route to it via host3. The situation resolves itself when we re-enable the gossip from host3 to host1. That too will result in the connection getting dropped with the above error. But crucially at that point host1 no longer has any connection to either host2 or host3, hence no route to the desktop peer, and hence it garbage-collects it. Next time host2 or host3 connect to it, no conflict is detected. Note though that because the conflict is detected some time after the connection is established - on receipt of some gossip - it is possible, I think, for the connections from host2 and host3 to host1 to always overlap, i.e. there always being a connection to at least one of them, which in turn prevents garbage-collection of the desktop peer from host1. This situation would have been much more likely pre 1.2.0, due to the lack of backoff on late connection failures. The test also shows that just network disruption can at least cause prolonged, but temporary, non-convergence. The dropping and pausing of topology gossip/broadcast we introduce could equally just be indefinite queueing due to TCP buffers filling up. However, any such disruption would cause the connections to be terminated after at most heartbeat-interval * 2. And this in turn would cause host1 to garbage-collect the stale information about the desktop peer. |
Previously when a peer restarted, information about the new incarnation (i.e. with a different UID) was not accepted by other peers (and connections would be dropped) unless all knowledge of the previous incarnation had been purged. This could result in a lot of connection churn and hence connectivity disruption, and, in some pathological cases, very slow convergence and hence acceptance of the new incarnation into the network. We now no longer drop connections when encountering different incarnations of a peer. There are two situations when that can happen: 1) on connection establishment we simply proceed 2) on receipt of gossip to ensure convergence we a) treat the UID as an additional discriminator when deciding whether we should update our information about a peer with that which was gossiped. Specifically, we update the information we hold when a) the gossiped version is greater, or b) is the same and the UID is greater. b) include the UID in the information we update c) move our own version number beyond any we receive for ourselves, if the received UID differs from ours. With (a) we establishes a total order of peer information across several incarnations of the same peer. i.e. we consider information to be fresher if it has a higher version, or the same version and higher UID. This may seem somewhat counter intutive, since it will generally treat information about new incarnations as older than old incarnations, since incarnations always start life with version 1. But to do better we'd need to establish a total order of incarnations that matches their temporal occurrence. Which requires some sort of durable state. So instead we have (c). Through that we learn the highest version number of any old incarnation of ourselves that other peers still hold, and then make sure that our version is greater than that. Essentially we continue where the old incarnations left off. It's as if instead of restarting we had simply changed UIDs. And due to (a) and (b) the information about the new incarnation of ourselves, now with a higher version, will supersede that of the old incarnations. Fixes #1554.
Previously when a peer restarted, information about the new incarnation (i.e. with a different UID) was not accepted by other peers (and connections would be dropped) unless all knowledge of the previous incarnation had been purged. This could result in a lot of connection churn and hence connectivity disruption, and, in some pathological cases, very slow convergence and hence acceptance of the new incarnation into the network. We now no longer drop connections when encountering different incarnations of a peer. There are two situations when that can happen: 1) on connection establishment we simply proceed 2) on receipt of gossip to ensure convergence we a) treat the UID as an additional discriminator when deciding whether we should update our information about a peer with that which was gossiped. Specifically, we update the information we hold when a) the gossiped version is greater, or b) is the same and the UID is greater. b) include the UID in the information we update c) move our own version number beyond any we receive for ourselves, if the received UID differs from ours. With (a) we establishes a total order of peer information across several incarnations of the same peer. i.e. we consider information to be fresher if it has a higher version, or the same version and higher UID. This may seem somewhat counter intutive, since it will generally treat information about new incarnations as older than old incarnations, since incarnations always start life with version 1. But to do better we'd need to establish a total order of incarnations that matches their temporal occurrence. Which requires some sort of durable state. So instead we have (c). Through that we learn the highest version number of any old incarnation of ourselves that other peers still hold, and then make sure that our version is greater than that. Essentially we continue where the old incarnations left off. It's as if instead of restarting we had simply changed UIDs. And due to (a) and (b) the information about the new incarnation of ourselves, now with a higher version, will supersede that of the old incarnations. Fixes #1554.
Previously when a peer restarted, information about the new incarnation (i.e. with a different UID) was not accepted by other peers (and connections would be dropped) unless all knowledge of the previous incarnation had been purged. This could result in a lot of connection churn and hence connectivity disruption, and, in some pathological cases, very slow convergence and hence acceptance of the new incarnation into the network. We now no longer drop connections when encountering different incarnations of a peer. There are two situations when that can happen: 1) on connection establishment we simply proceed 2) on receipt of gossip to ensure convergence we a) treat the UID as an additional discriminator when deciding whether we should update our information about a peer with that which was gossiped. Specifically, we update the information we hold when a) the gossiped version is greater, or b) is the same and the UID is greater. b) include the UID in the information we update c) move our own version number beyond any we receive for ourselves, if the received UID differs from ours. With (a) we establishes a total order of peer information across several incarnations of the same peer. i.e. we consider information to be fresher if it has a higher version, or the same version and higher UID. This may seem somewhat counter intutive, since it will generally treat information about new incarnations as older than old incarnations, since incarnations always start life with version 1. But to do better we'd need to establish a total order of incarnations that matches their temporal occurrence. Which requires some sort of durable state. So instead we have (c). Through that we learn the highest version number of any old incarnation of ourselves that other peers still hold, and then make sure that our version is greater than that. Essentially we continue where the old incarnations left off. It's as if instead of restarting we had simply changed UIDs. And due to (a) and (b) the information about the new incarnation of ourselves, now with a higher version, will supersede that of the old incarnations. Fixes #1554.
…ange better convergence when peers restart; fixes #1554.
On a number of occasions we have seen log files that show connections shutting down with
Multiple peers found with same name: ...
andConnection appears to be with different version of a peer we already know of
with associated significant levels of connection churn.To date we considered these to be transitory conditions, caused by delays in peers detecting that one of their connections is stale, and thus still believing they are connected to the old incarnation of a restarted peer. Heartbeat timeouts should resolve that in less than ~1 minute.
However, users report that they have seen these errors persist for far longer than 1 minute.
The text was updated successfully, but these errors were encountered: