Skip to content
This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

backoff late connection failures #1496

Merged
merged 5 commits into from
Oct 2, 2015

Conversation

rade
Copy link
Member

@rade rade commented Oct 1, 2015

If a connection gets through all the handshake and subsequent checks - i.e. it ends up getting added to the LocalPeer's connection list - but gets terminated within one minute (of the start of the successful connection attempt), then we apply the existing backoff logic to schedule a retry, instead of retrying immediately.

In order to do this we retain the 'Target' data structure for such connections, marking it as 'connected' instead of deleting it (as we did previously).

Fixes #953.

@rade
Copy link
Member Author

rade commented Oct 1, 2015

Items for discussion:

  • the ResetAfter interval, currently set to 1 minute, which determines how long a connection must stay alive for it to be retried immediately on subsequent failure instead of being subject to backoff. Note that this period starts at the time the successful connection attempt was scheduled for and hence includes a) any delays in CM to attempt the connection, b) the time it takes perform the handshake and other checks. We could exclude the latter two by adding a target.tryAfter = time.Now() to ConnectionMaker.ConnectionedCreated(), but this would "overshoot" in the other direction since any delay in CM processing the event would be excluded too.
  • the three refactoring commits; they are in a dependency chain but I'd be happy to lop some off the end
  • smoke test: I could add one, but it would involve some sleeps and poking the weave status, which is rather distasteful. Note that code coverage of the new code is already excellent, so is it worth for an extra three lines of coverage?

rade added 4 commits October 1, 2015 20:05
If a connection gets through all the handshake and subsequent checks -
i.e. it ends up getting added to the LocalPeer's connection list - but
gets terminated within one minute (of the start of the successful
connection attempt), then we apply the existing backoff logic to
schedule a retry, instead of retrying immediately.

In order to do this we retain the 'Target' data structure for such
connections, marking it as 'connected' instead of deleting it (as we
did previously).

Fixes #953.
@rade rade force-pushed the 953-backoff-late-connection-failures branch from 8622e24 to 4169195 Compare October 1, 2015 19:05
@rade
Copy link
Member Author

rade commented Oct 1, 2015

an extra three lines of coverage?

Actually the behaviour we want to test for - that connections which fail shortly after establishment don't trigger an immediate reconnect - doesn't expand code coverage; the three uncovered lines are for the case that the connection failed later. And we probably don't want to wait that long, depending on what ResetAfter interval we settle on.

And even writing a test for the former is tricky; the difference in connection status is that previously we'd cycle through retrying, pending, established, whereas now it's failed, retying, pending, established. The duration of the failed state will be quite short to start with, and expand as the backoff kicks in. I cannot think of a simple way to write a test here that wouldn't be prone to all sorts of timing-related failures.

bboreham added a commit that referenced this pull request Oct 2, 2015
…failures

LGTM.  The timing choices are all very arbitrary, so we should wait for more experience in the wild before trying to tune them further.

Fixes #953
@bboreham bboreham merged commit 5b7599e into master Oct 2, 2015
@rade rade added this to the 1.2.0 milestone Oct 5, 2015
@awh awh deleted the 953-backoff-late-connection-failures branch November 9, 2015 16:41
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants