backoff late connection failures #1496

rade · 2015-10-01T18:33:44Z

If a connection gets through all the handshake and subsequent checks - i.e. it ends up getting added to the LocalPeer's connection list - but gets terminated within one minute (of the start of the successful connection attempt), then we apply the existing backoff logic to schedule a retry, instead of retrying immediately.

In order to do this we retain the 'Target' data structure for such connections, marking it as 'connected' instead of deleting it (as we did previously).

Fixes #953.

rade · 2015-10-01T18:55:11Z

Items for discussion:

the ResetAfter interval, currently set to 1 minute, which determines how long a connection must stay alive for it to be retried immediately on subsequent failure instead of being subject to backoff. Note that this period starts at the time the successful connection attempt was scheduled for and hence includes a) any delays in CM to attempt the connection, b) the time it takes perform the handshake and other checks. We could exclude the latter two by adding a target.tryAfter = time.Now() to ConnectionMaker.ConnectionedCreated(), but this would "overshoot" in the other direction since any delay in CM processing the event would be excluded too.
the three refactoring commits; they are in a dependency chain but I'd be happy to lop some off the end
smoke test: I could add one, but it would involve some sleeps and poking the weave status, which is rather distasteful. Note that code coverage of the new code is already excellent, so is it worth for an extra three lines of coverage?

If a connection gets through all the handshake and subsequent checks - i.e. it ends up getting added to the LocalPeer's connection list - but gets terminated within one minute (of the start of the successful connection attempt), then we apply the existing backoff logic to schedule a retry, instead of retrying immediately. In order to do this we retain the 'Target' data structure for such connections, marking it as 'connected' instead of deleting it (as we did previously). Fixes #953.

rade · 2015-10-01T20:18:47Z

an extra three lines of coverage?

Actually the behaviour we want to test for - that connections which fail shortly after establishment don't trigger an immediate reconnect - doesn't expand code coverage; the three uncovered lines are for the case that the connection failed later. And we probably don't want to wait that long, depending on what ResetAfter interval we settle on.

And even writing a test for the former is tricky; the difference in connection status is that previously we'd cycle through retrying, pending, established, whereas now it's failed, retying, pending, established. The duration of the failed state will be quite short to start with, and expand as the backoff kicks in. I cannot think of a simple way to write a test here that wouldn't be prone to all sorts of timing-related failures.

…failures LGTM. The timing choices are all very arbitrary, so we should wait for more experience in the wild before trying to tune them further. Fixes #953

rade assigned bboreham Oct 1, 2015

rade added 4 commits October 1, 2015 20:05

refactor: better representation of 3-valued target state

b8b3e6f

refactor: more obviously handle all target states in reporting

e318262

refactor: simplify target status reporting

4169195

rade force-pushed the 953-backoff-late-connection-failures branch from 8622e24 to 4169195 Compare October 1, 2015 19:05

refactor: better 'try' method names

19c44b8

bboreham merged commit 5b7599e into master Oct 2, 2015

rade added this to the 1.2.0 milestone Oct 5, 2015

bboreham mentioned this pull request Oct 19, 2015

restarting a peer can lead to permanent global non-convergence and connection churn #1554

Closed

awh deleted the 953-backoff-late-connection-failures branch November 9, 2015 16:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

backoff late connection failures #1496

backoff late connection failures #1496

rade commented Oct 1, 2015

rade commented Oct 1, 2015

rade commented Oct 1, 2015

backoff late connection failures #1496

backoff late connection failures #1496

Conversation

rade commented Oct 1, 2015

rade commented Oct 1, 2015

rade commented Oct 1, 2015