Rolling restarts for matrix.org #11136
Labels
- S-Minor: Blocks non-critical functionality, workarounds exist.
- T-Enhancement: New features, changes in functionality, improvements in performance, or user-facing enhancements.
See also #7968.
Currently, matrix.org upgrades are done by restarting every Synapse process at once. This has two disadvantages: it puts a lot of load on the database while workers load everything back into their caches, and Synapse as a whole is slow to resume serving clients.
Metrics
@richvdh suggests measuring some aspects of `/sync` requests, since that is what users will observe. Request duration isn't a good metric, because `/sync` requests take 30 seconds when there is no data. One potential metric could be the rate of 500 errors for `/sync` requests returned by haproxy during a restart:
https://grafana.matrix.org/d/000000018/haproxy-backend-metrics?viewPanel=8&orgId=1&from=1634118580444&to=1634120389397&var-duration=$__auto_interval_duration&var-backend=synchrotron
(you have to select an appropriate instance)
These errors would be caused either by haproxy being unable to connect to any synchrotrons (the initial spike) or by haproxy timing out the request (the later spikes).
Our aim would be to get these error rates to 0 during a deployment.
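One rough way to verify that after a deployment would be to query Prometheus for the haproxy 5xx rate on the synchrotron backend over the deployment window. The sketch below is illustrative only: it assumes the standard haproxy_exporter metric name (`haproxy_backend_http_responses_total`), a backend labelled `synchrotron`, and a hypothetical Prometheus address; none of these is confirmed by this issue.

```python
"""Check the peak /sync 5xx rate over a deployment window (illustrative sketch).

Assumptions (not from this issue): Prometheus scrapes the standard
haproxy_exporter, which exposes haproxy_backend_http_responses_total with
`backend` and `code` labels, and the relevant backend is named "synchrotron".
"""
import time

import requests

PROM_URL = "http://prometheus.example.com:9090"  # hypothetical Prometheus address

# Per-second rate of 5xx responses from the synchrotron backend.
QUERY = (
    'sum(rate(haproxy_backend_http_responses_total'
    '{backend="synchrotron", code="5xx"}[1m]))'
)


def peak_error_rate(start: float, end: float, step: str = "15s") -> float:
    """Return the worst 5xx rate seen between `start` and `end` (unix timestamps)."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": QUERY, "start": start, "end": end, "step": step},
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    if not results:
        return 0.0
    # Each series is a list of [timestamp, value] pairs; take the worst point.
    return max(float(value) for _, value in results[0]["values"])


if __name__ == "__main__":
    now = time.time()
    peak = peak_error_rate(now - 1800, now)  # last 30 minutes
    print(f"peak /sync 5xx rate during window: {peak:.2f}/s")
```

A deployment would count as clean if this reports 0.0 for the window covering the restart.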
What needs changing
Workers will refuse to start if the database schema is old. We already have a script to run schema upgrades and background upgrades (#10793).
The deployment process needs to be reworked to do something like this:
- run `scripts/update_synapse_database` to bring the schema and background updates up to date before any processes are restarted (see the sketch below)
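As a rough illustration of what the reworked deployment might look like, here is a minimal rolling-restart sketch. Everything specific in it is an assumption rather than a detail from this issue: the systemd unit names, the per-worker ports, the `homeserver.yaml` path, the flags passed to `scripts/update_synapse_database`, and the use of Synapse's `/health` endpoint as a readiness check.

```python
"""Rolling-restart sketch: update the schema, then restart workers one at a time.

All unit names, ports, paths, and flags below are hypothetical placeholders,
not the real matrix.org deployment.
"""
import subprocess
import time

import requests

# Hypothetical worker units and the local ports their HTTP listeners use.
WORKERS = {
    "matrix-synapse-worker@synchrotron1": 8083,
    "matrix-synapse-worker@synchrotron2": 8084,
    "matrix-synapse-worker@federation_reader1": 8085,
}


def wait_healthy(port: int, timeout: float = 120.0) -> None:
    """Poll the worker's /health endpoint until it responds, or give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(f"http://127.0.0.1:{port}/health", timeout=2).ok:
                return
        except requests.RequestException:
            pass
        time.sleep(1)
    raise RuntimeError(f"worker on port {port} did not become healthy in time")


def main() -> None:
    # 1. Bring the database schema (and background updates) up to date first,
    #    so restarted workers don't refuse to start against an old schema.
    #    (Flags here are illustrative.)
    subprocess.run(
        ["scripts/update_synapse_database", "--database-config", "homeserver.yaml"],
        check=True,
    )

    # 2. Restart workers one at a time, waiting for each to become healthy
    #    before moving on, so some capacity is always serving /sync.
    for unit, port in WORKERS.items():
        subprocess.run(["systemctl", "restart", unit], check=True)
        wait_healthy(port)


if __name__ == "__main__":
    main()
```

The key design point is the per-worker health wait: it keeps at least part of the synchrotron pool serving requests at all times, which is what should drive the haproxy 5xx rate for `/sync` to zero during a deployment.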