This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Rolling restarts for matrix.org #11136

Closed
squahtx opened this issue Oct 20, 2021 · 2 comments
Labels
S-Minor Blocks non-critical functionality, workarounds exist. T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.

Comments

@squahtx
Contributor

squahtx commented Oct 20, 2021

See #7968 also

Currently, matrix.org upgrades are done by restarting every Synapse process at once. This puts a lot of load on the database while workers load everything back into their caches, and Synapse as a whole is slow to resume serving clients. This has two disadvantages:

  • The minutes of downtime are very noticeable to users and do not leave the best impression
  • Any bugs that surface will impact all users on matrix.org at once

Metrics

@richvdh suggests measuring some aspects of /sync requests since that is what users will observe:

Request duration isn't a good metric because /sync requests take 30 seconds when there is no data.

One potential metric could be the rate of 500 errors for /sync requests returned by haproxy during a restart:
https://grafana.matrix.org/d/000000018/haproxy-backend-metrics?viewPanel=8&orgId=1&from=1634118580444&to=1634120389397&var-duration=$__auto_interval_duration&var-backend=synchrotron
(you have to select an appropriate instance)
These errors would occur either because haproxy was unable to connect to any synchrotrons (the initial spike) or because haproxy timed out the request (later spikes).

Our aim would be to get these error rates to 0 during a deployment.
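
To make the target concrete, here is a minimal sketch of how such an error rate could be computed per time bucket. It is illustrative only: the real numbers come from the haproxy metrics in Grafana, and the function name `sync_error_rate`, the `(timestamp, status)` sample format, and the choice to count every 5xx status as an error are all assumptions, not part of the existing tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sync_error_rate(
    samples: Iterable[Tuple[float, int]],
    bucket_seconds: float = 60.0,
) -> Dict[int, float]:
    """Return the fraction of /sync responses that were 5xx, per time bucket.

    `samples` is a hypothetical stream of (unix_timestamp, http_status) pairs,
    e.g. parsed from proxy logs. Buckets are indexed by timestamp // bucket_seconds.
    """
    totals: Dict[int, int] = defaultdict(int)
    errors: Dict[int, int] = defaultdict(int)
    for timestamp, status in samples:
        bucket = int(timestamp // bucket_seconds)
        totals[bucket] += 1
        # Assumption: any 5xx response counts as a failed /sync.
        if 500 <= status <= 599:
            errors[bucket] += 1
    return {bucket: errors[bucket] / totals[bucket] for bucket in sorted(totals)}


def deployment_was_clean(rates: Dict[int, float]) -> bool:
    """True if no bucket saw any 5xx responses, i.e. the stated aim was met."""
    return all(rate == 0.0 for rate in rates.values())
```

A deployment would meet the aim above if `deployment_was_clean` returns `True` for the samples collected over the restart window.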

What needs changing

Workers will refuse to start if the database schema is old. We already have a script to run schema upgrades and background updates (#10793).

The deployment process needs to be reworked to do something like this:

  1. Run scripts/update_synapse_database
  2. Restart processes one by one over some period of time
    • Check that processes are healthy before continuing too far with the deployment
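
The steps above could be sketched roughly as follows. This is not the actual matrix.org deployment tooling: `rolling_restart`, the `restart` and `is_healthy` callables, and the timeout values are hypothetical placeholders, and the sketch assumes step 1 (the schema update) has already completed before it is called.

```python
import time
from typing import Callable, List, Sequence


def rolling_restart(
    workers: Sequence[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    health_timeout: float = 60.0,
    poll_interval: float = 1.0,
    pause_between: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart workers one at a time, aborting if one fails to become healthy.

    `restart` and `is_healthy` are supplied by the caller (hypothetical hooks,
    e.g. wrapping a service manager and a health endpoint). Returns the names
    of the workers that were restarted successfully, in order.
    """
    restarted: List[str] = []
    for name in workers:
        restart(name)
        deadline = time.monotonic() + health_timeout
        # Poll until the worker reports healthy or the timeout expires.
        while not is_healthy(name):
            if time.monotonic() >= deadline:
                # Stop here so the remaining workers keep running the old version.
                raise RuntimeError(
                    f"{name} did not become healthy; restarted so far: {restarted}"
                )
            sleep(poll_interval)
        restarted.append(name)
        # Spread the cache-warming load on the database over time.
        sleep(pause_between)
    return restarted
```

Aborting on the first unhealthy worker limits the impact of a bad release to a single process, which addresses the second disadvantage listed at the top of this issue.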
@richvdh
Member

richvdh commented Feb 16, 2022

Can we consider this done?

@clokep
Member

clokep commented Aug 25, 2022

I'm pretty sure we've been doing this successfully for a while now. I'm going to close it. @squahtx please shout if that's not true.

@clokep clokep closed this as completed Aug 25, 2022