Rolling restarts for matrix.org #11136

squahtx · 2021-10-20T14:51:54Z

See #7968 also

Currently, matrix.org upgrades are done by restarting every Synapse process at once. This puts a lot of load on the database while workers reload their caches, and Synapse as a whole is slow to resume serving clients. This has two disadvantages:

- The minutes of downtime are very noticeable to users and do not leave the best impression
- Any bugs that surface will impact all users on matrix.org at once

Metrics

@richvdh suggests measuring some aspects of /sync requests since that is what users will observe:

Request duration isn't a good metric because /sync requests take 30 seconds when there is no data.

One potential metric could be the rate of 500 errors for /sync requests returned by haproxy during a restart:
https://grafana.matrix.org/d/000000018/haproxy-backend-metrics?viewPanel=8&orgId=1&from=1634118580444&to=1634120389397&var-duration=$__auto_interval_duration&var-backend=synchrotron
(you have to select an appropriate instance)
These would be because the haproxy was unable to connect to any synchrotrons (the initial spike), or because the haproxy timed out the request (later spikes).

Our aim would be to get these error rates to 0 during a deployment.
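As a rough illustration of the metric, the sketch below buckets responses into time windows and reports the fraction of server errors in each. It assumes "500 errors" means any 5xx status and that the input is a list of (timestamp, status) pairs taken from something like the haproxy logs. The function name `error_rate_by_window` and the input shape are hypothetical, not part of the existing tooling.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def error_rate_by_window(
    responses: Iterable[Tuple[float, int]],
    window_seconds: float = 60.0,
) -> Dict[int, float]:
    """Return the fraction of 5xx responses in each time window.

    `responses` is an iterable of (unix_timestamp, http_status) pairs.
    This input shape is an assumption made for illustration.
    """
    totals: Counter = Counter()
    errors: Counter = Counter()
    for timestamp, status in responses:
        bucket = int(timestamp // window_seconds)
        totals[bucket] += 1
        if 500 <= status <= 599:
            errors[bucket] += 1
    # Only windows that saw at least one request appear in the result.
    return {bucket: errors[bucket] / totals[bucket] for bucket in sorted(totals)}


def deployment_was_clean(rates: Dict[int, float]) -> bool:
    """The stated aim: an error rate of 0 in every window of the deployment."""
    return all(rate == 0.0 for rate in rates.values())
```

A deployment would count as clean by this measure only if every window covering the restart reports a rate of 0.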

What needs changing

Workers will refuse to start if the database schema is old. We already have a script to run schema upgrades and background upgrades (#10793).

The deployment process needs to be reworked to do something like this:

1. Run scripts/update_synapse_database
2. Restart processes one by one over some period of time
   - Check that processes are healthy before continuing too far with the deployment
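The restart loop described above might look roughly like the following sketch. It assumes the schema update has already been run, and it takes the restart and health-check actions as caller-supplied callables because this issue does not say how matrix.org restarts or probes its workers. `rolling_restart` and all of its parameters are hypothetical names.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    workers: Iterable[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    pause_seconds: float = 30.0,
    health_timeout_seconds: float = 120.0,
    poll_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart workers one at a time, stopping at the first unhealthy one.

    Returns the names of the workers that were restarted successfully.
    Raises RuntimeError if a worker does not become healthy in time, so
    the remaining workers keep running the old version.
    """
    restarted: List[str] = []
    for name in workers:
        restart(name)
        deadline = time.monotonic() + health_timeout_seconds
        while not is_healthy(name):
            if time.monotonic() >= deadline:
                raise RuntimeError(
                    f"{name} did not become healthy; restarted so far: {restarted}"
                )
            sleep(poll_seconds)
        restarted.append(name)
        # Spread the cache-reload load on the database over time.
        sleep(pause_seconds)
    return restarted
```

Aborting on the first unhealthy worker limits the impact of a bad release to the processes restarted so far, which addresses the second disadvantage listed at the top of this issue.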


richvdh · 2022-02-16T10:28:01Z

can we consider this done?

clokep · 2022-08-25T20:08:52Z

I'm pretty sure we've been doing this successfully for a while now. I'm going to close it. @squahtx please shout if that's not true.

squahtx added the T-Enhancement (new features, changes in functionality, improvements in performance, or user-facing enhancements) and S-Minor (blocks non-critical functionality, workarounds exist) labels Oct 20, 2021

clokep closed this as completed Aug 25, 2022
