Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Outbound federation breaks to a particular server until restart #3842

Closed
erikjohnston opened this issue Sep 11, 2018 · 9 comments
Closed

Outbound federation breaks to a particular server until restart #3842

erikjohnston opened this issue Sep 11, 2018 · 9 comments
Assignees

Comments

@erikjohnston
Copy link
Member

erikjohnston commented Sep 11, 2018

No events were sent from jki.re to matrix.org for ~24h. Other requests were successfully sent.

The last log line in the federation transmission loop for matrix.org is:

2018-09-10 10:15:54,901 - synapse.http.outbound - 182 - INFO - federation_transaction_transmission_loop-26525 - {PUT-O-13438} [matrix.org] Sending request: PUT matrix://matrix.org/_matrix/federation/v1/send/1536569581947/

suggesting that the deferred to actually send the transaction got wedged entirely.

Later we have an unhandled error, which may or may not be related.

2018-09-10 10:15:56,438 - twisted - 131 - CRITICAL -  - Unhandled error in Deferred:
2018-09-10 10:15:56,438 - twisted - 131 - CRITICAL -  - 
ResponseNeverReceived: [<twisted.python.failure.Failure twisted.web._newclient.ConnectionAborted: >]

Edit: Version: Synapse/0.33.4 (b=develop,9a5ea511b)

@erikjohnston erikjohnston self-assigned this Sep 11, 2018
@erikjohnston
Copy link
Member Author

One recent change that may be relevant is that we've moved to use treq.

We also don't seem to time out reading the body from the response:

response = yield self._request(
destination,
"PUT",
path,
json_callback=json_data_callback,
query=args,
long_retries=long_retries,
timeout=timeout,
ignore_backoff=ignore_backoff,
backoff_on_404=backoff_on_404,
)
if 200 <= response.code < 300:
# We need to update the transactions table to say it was sent?
check_content_type_is_json(response.headers)
with logcontext.PreserveLoggingContext():
body = yield treq.json_content(response)
defer.returnValue(body)

@erikjohnston
Copy link
Member Author

erikjohnston commented Sep 12, 2018

Right, so what probably isn't helping is that our retry schedule for "long retries" is insane:

if retries_left and not timeout:
if long_retries:
delay = 4 ** (MAX_LONG_RETRIES + 1 - retries_left)
delay = min(delay, 60)
delay *= random.uniform(0.8, 1.4)
else:
delay = 0.5 * 2 ** (MAX_SHORT_RETRIES - retries_left)
delay = min(delay, 2)
delay *= random.uniform(0.8, 1.4)
yield self.clock.sleep(delay)
retries_left -= 1

Edit: Which gives a potential max sleep duration as over 16 days

@erikjohnston
Copy link
Member Author

In what will shock everyone it turns out I can't read, the maximum sleep time is actually only 84 seconds

@erikjohnston
Copy link
Member Author

After the logging change its clear that things are getting wedged in the actual request deferred, which suggests that the timeouts are broken.

erikjohnston added a commit that referenced this issue Sep 14, 2018
This is an attempt to mitigate #3842 by adding yet-another-timeout
@erikjohnston
Copy link
Member Author

The awful hack didn't work. But I found a stack trace:

2018-09-14 19:14:17,035 - twisted - 243 - CRITICAL -  - Unhandled Error
Traceback (most recent call last):
  File "synapse/app/_base.py", line 104, in run
    
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1261, in run
    self.mainLoop()
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1270, in mainLoop
    self.runUntilCurrent()
  File "synapse/metrics/__init__.py", line 349, in f
    
--- <exception caught here> ---
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/base.py", line 896, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "synapse/util/async_helpers.py", line 459, in time_it_out
    
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 548, in cancel
    self.result.cancel()
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 537, in cancel
    canceller(self)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/web/_newclient.py", line 1481, in cancelRequest
    self.transport.abortConnection()
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/protocols/tls.py", line 435, in abortConnection
    self._shutdownTLS()
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/protocols/tls.py", line 338, in _shutdownTLS
    shutdownSuccess = self._tlsConnection.shutdown()
exceptions.AttributeError: 'NoneType' object has no attribute 'shutdown'

so I hacked the hack to hack around that and moved the cancellation until after resolving the new deferred

@erikjohnston
Copy link
Member Author

The hack is:

def timeout_no_seriously(deferred, timeout, reactor):
"""The in build twisted deferred addTimeout (and the method above)
completely fail to time things out under some unknown circumstances.
Lets try a different way of timing things out and maybe that will make
things work?!
TODO: Kill this with fire.
"""
new_d = defer.Deferred()
timed_out = [False]
def time_it_out():
timed_out[0] = True
if not new_d.called:
new_d.errback(DeferredTimeoutError(timeout, "Deferred"))
deferred.cancel()
delayed_call = reactor.callLater(timeout, time_it_out)
def convert_cancelled(value):
if timed_out[0]:
return _cancelled_to_timed_out_error(value, timeout)
return value
deferred.addBoth(convert_cancelled)
def cancel_timeout(result):
# stop the pending call to cancel the deferred if it's been fired
if delayed_call.active():
delayed_call.cancel()
return result
deferred.addBoth(cancel_timeout)
def success_cb(val):
if not new_d.called:
new_d.callback(val)
def failure_cb(val):
if not new_d.called:
new_d.errback(val)
deferred.addCallbacks(success_cb, failure_cb)
return new_d

@erikjohnston
Copy link
Member Author

So conclusion is that this is a bug in twisted, where the cancel function doesn't handle the canceller throwing an exception, which stops the deferred from being resolved. https://github.com/twisted/twisted/blob/twisted-18.4.0/src/twisted/internet/defer.py#L536

@erikjohnston
Copy link
Member Author

erikjohnston added a commit that referenced this issue Sep 19, 2018
The existing deferred timeout helper function (and the one into twisted)
suffer from a bug when a deferred's canceller throws an exception, #3842.

The new helper function doesn't suffer from this problem.
@erikjohnston
Copy link
Member Author

We've updated all timeouts to use the new function

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants