This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Unable to join some large rooms due to high RAM consumption #7339

Closed
rihardsk opened this issue Apr 23, 2020 · 18 comments
Labels
A-Performance Performance, both client-facing and admin-facing z-p2 (Deprecated Label)

Comments

@rihardsk

Description

Joining some large rooms, such as #freenode_#haskel:matrix.org (1.4k members), fails because synapse eats up all the available memory and ends up being forcefully stopped. I've configured my system to limit synapse to 3.5 GB of RAM. Upon joining, synapse first spends some time processing (high CPU usage, RAM usage close to the 500 MB baseline), and after a while the RAM consumption starts to climb steadily until it reaches the 3.5 GB mark, at which point it gets killed.

Here are the logs from the moment of joining the room up until synapse gets killed:
ram-crash.redacted.log
The request for joining the room comes in at line 28. On line 564 synapse stopped printing anything to the logs and just maintained high CPU usage with steadily growing RAM consumption for ~1 minute until being killed.

Joining other large rooms, e.g., #matrix:matrix.org (3.2k members), #synapse:matrix.org (1.2k members), works fine (haven't monitored RAM consumption when joining but i had the same limits set). Some other person on #synapse:matrix.org reported joining a room with ~20k people with RAM consumption going up to ~1.1 GB, which leads me to suspect that i might be seeing something abnormal in my case. Am i?

Other than this issue, synapse seems to be working fine. I'm willing to repeat this and do some profiling if necessary.

Steps to reproduce

  • try to join #freenode_#haskel:matrix.org
  • watch as synapse's RAM consumption grows to > 3.5 GB

Version information

  • Homeserver: my private homeserver

  • Version: 1.12.1

  • Install method: NixOS

  • Platform: latest NixOS master branch (commit 01c8795673ecff6272895a085e9fe1ffa3b33620) running on a rockpro64 sbc (with a custom patched kernel).
@babolivier babolivier added z-p2 (Deprecated Label) A-Performance Performance, both client-facing and admin-facing labels Apr 27, 2020
@babolivier
Contributor

babolivier commented Apr 27, 2020

The "size" of a Matrix room isn't described by its number of users but by the number of state events (e.g. joins, leaves, kicks, bans, changes of name, topic, power level rules, join rules, etc.) in its history. To summarise, there is a component of Matrix called the state resolution algorithm that's in charge of resolving clashes between two servers that got out of sync regarding what state a given room is currently in. This algorithm works through the whole state of the room, and needs to load most (if not all) state events in that room into memory to work. This is what makes Synapse so hungry for RAM when trying to join a large room: it needs to retrieve and authenticate every state event, which can be expensive for old rooms. If you're interested, how exactly this algorithm works was explained recently on the matrix.org website: https://matrix.org/docs/guides/implementing-stateres

IIRC this is also the reason why some rooms can't be joined from small homeservers on modular.im.

The above is more a point of context and details than "it has a reason so it's not an issue" (because it definitely is an issue), and I don't think there's an open issue about that on this repo so I'll keep that one open to track the status of this.
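
To make the distinction between member count and state-event count concrete, here is a toy sketch. It is not the state resolution algorithm and not Synapse code: the event tuples and the helper names `current_state` and `joined_members` are made up for illustration, and real events carry many more fields. It only shows that a room's state history can keep growing while the member count stays small.

```python
# Hypothetical, heavily simplified state events: (type, state_key, content).
# Real Matrix events also carry event IDs, auth events, prev events, etc.
history = [
    ("m.room.create", "", {"creator": "@alice:example.org"}),
    ("m.room.member", "@alice:example.org", {"membership": "join"}),
    ("m.room.member", "@bob:example.org", {"membership": "join"}),
    ("m.room.topic", "", {"topic": "Haskell"}),
    ("m.room.member", "@bob:example.org", {"membership": "leave"}),
    ("m.room.member", "@bob:example.org", {"membership": "join"}),
    ("m.room.topic", "", {"topic": "Haskell, bridged from IRC"}),
]


def current_state(events):
    """Fold a linear state history into a map keyed by (type, state_key)."""
    state = {}
    for event_type, state_key, content in events:
        # A later event with the same key replaces the earlier one.
        state[(event_type, state_key)] = content
    return state


def joined_members(state):
    """Return the user IDs whose current membership is 'join'."""
    return sorted(
        state_key
        for (event_type, state_key), content in state.items()
        if event_type == "m.room.member" and content.get("membership") == "join"
    )


state = current_state(history)
members = joined_members(state)
print(len(history), len(state), len(members))  # 7 4 2
```

Seven state events produce only two joined members here. A room where users join and leave repeatedly accumulates state events far faster than it accumulates members.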

@c7hm4r

c7hm4r commented Sep 17, 2020

@babolivier: This algorithm [...] needs to load most (if not all) state events in that room in memory to work.

Every algorithm can be implemented using little RAM, though possibly at the cost of more I/O to persistent storage (such as a DB) and lower speed. This is a tradeoff decision.

The current implementation decisions exclude users of cheap hardware (for home servers) from joining larger rooms. IMO this is a bug, isn’t it?

If the algorithm implementation were tied more closely to the DB, and the DB implemented caching appropriately, the memory usage would probably adapt automatically to the amount of available memory, and it might not be much slower when plenty of RAM is available.

Another idea: Repeatedly check available free memory during execution of the algorithm, and if the requirements are not met, abort cleanly, send an error message to the user, and fall back to some (maybe less secure) alternative, instead of hoping for the OOM killer to do the right thing (after a phase in which the whole system nearly freezes).
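
A minimal sketch of that "check memory and abort cleanly" idea, assuming a Unix system where Python's standard `resource` module is available. The names `MemoryBudgetExceeded`, `peak_rss_bytes` and `process_with_budget` are hypothetical, and nothing here reflects how Synapse actually handles joins. Note that `ru_maxrss` reports the peak resident set size, not the current one, so this guard can only trip, never recover.

```python
import resource
import sys


class MemoryBudgetExceeded(Exception):
    """Raised when the process has gone over its memory budget."""


def peak_rss_bytes():
    """Peak resident set size of this process, in bytes.

    ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak if sys.platform == "darwin" else peak * 1024


def process_with_budget(items, handle, budget_bytes, check_every=1000,
                        probe=peak_rss_bytes):
    """Run handle() on each item, checking memory every `check_every` items.

    Raises MemoryBudgetExceeded instead of waiting for the OOM killer.
    """
    done = 0
    for done, item in enumerate(items, start=1):
        handle(item)
        if done % check_every == 0 and probe() > budget_bytes:
            raise MemoryBudgetExceeded(f"aborted after {done} items")
    return done


# Demo with a fake probe that always reports 0 bytes in use.
processed = process_with_budget(
    range(10), handle=lambda _event: None,
    budget_bytes=1, check_every=5, probe=lambda: 0,
)
print(processed)  # 10
```

The caller can catch `MemoryBudgetExceeded` and report a proper error to the user. Whether a safe fallback exists after aborting is a separate question that this sketch does not answer.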

@mxvin

mxvin commented Oct 27, 2020

What I think is: why don't we just give this chore to the homeserver where the room resides?
Say I want to join a room on homeserver X. My HS just tells X's HS "Hey, I want to join room "xyz"", and my HS also records that we joined room "xyz" on X. Then event history sync and future traffic like texts, media, etc. are simply proxied by our homeserver straight from homeserver X (I guess media delivery uses this approach too).

Why does every server that wants to join such a room need to process every state event and all of that logic? I think that can be bypassed.
Federation is the core of Matrix. With this approach, anyone, even with very limited computing resources like a Raspberry Pi, can spin up their own homeserver and join any room they like.

@immanuelfodor

Bootstrapping room state quickly from a data/db sync, I like the idea.

@auscompgeek
Contributor

@mxvin a room is replicated to all homeservers that participate in it; rooms don't live on a single server like in XMPP.

@lqdev

lqdev commented Nov 21, 2020

Similar issues for me. Though not sure if it's because of RAM consumption. I used htop to track the processes and RAM almost never goes above 500MB.

Currently running a homeserver on a Raspberry Pi 4 B with 4GB RAM. Initially, I was running on a Raspberry Pi 3 with 1 GB RAM. I've been able to join rooms like Element Android (2.5k), Synapse Admins (719). I'm using SQLite DB at the moment.

Trying to join a room like Matrix HQ (7.8k), though, takes an extremely long time. Eventually, my server crashes and I get a 502 Bad Gateway error.

@ptman
Contributor

ptman commented Nov 21, 2020

@lqdev First switch from sqlite to postgres. You shouldn't federate with sqlite.

@lqdev

lqdev commented Nov 22, 2020

Thanks @ptman I'll give that a try.

@lqdev

lqdev commented Nov 22, 2020

@ptman federation is a bit snappier after migrating to Postgres. Thanks for that suggestion. Though I'm still intermittently running into issues. I'm guessing part of that is the fact I have everything running on a RPi. To clarify, it appears it's large rooms that are bridged that I have trouble with, so I can see how that might be an issue (i.e. #techlore:matrix.org)

@c7hm4r

c7hm4r commented Feb 1, 2021

It seems that with Synapse 1.26 memory consumption is much lower. Now, my server can join rooms with complexity between 20 and 30, but the largest rooms on matrix.org are still prohibitive.

@jkufner

jkufner commented Feb 22, 2021

Memory usage is certainly a problem. A server's memory usage should not depend on the number of historical events in a room.

Ideally, the memory consumption should be constant. If there is a session state or event queue for each client, then it should be linear in the number of clients. Other than that, it should be possible to run the server in constant memory space. We have a powerful SQL database available; Synapse should use it.

Anyway, if a large room is defined by its number of events, can we make a state snapshot from time to time, and then synchronize from the last snapshot? This way, we can throw away (or lazy load) the history before the snapshot and every room becomes a small room. The snapshot may be a hash of the state or something like that, not necessarily representing the complete state. If a client desires the earlier history, it could be provided on demand (nobody reads it all anyway).
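
As a rough illustration of the "let the SQL database do the work" point, the sketch below streams rows in fixed-size batches instead of loading a whole room at once. Everything in it is an assumption: the `state_events` table is a toy schema, `iter_state_events` is a hypothetical helper, and `sqlite3` is used only because it ships with Python. It shows how to bound the memory used for reading rows. It does not show that state resolution itself can run in constant memory, since that algorithm needs to cross-reference events.

```python
import sqlite3


def iter_state_events(conn, room_id, batch_size=500):
    """Yield a room's rows batch by batch rather than via one big fetchall()."""
    cur = conn.execute(
        "SELECT event_id, type, state_key FROM state_events "
        "WHERE room_id = ? ORDER BY event_id",
        (room_id,),
    )
    while True:
        rows = cur.fetchmany(batch_size)  # at most batch_size rows in memory
        if not rows:
            break
        yield from rows


# Toy in-memory database with 2000 membership events for a single room.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE state_events ("
    "event_id TEXT PRIMARY KEY, room_id TEXT, type TEXT, state_key TEXT)"
)
conn.executemany(
    "INSERT INTO state_events VALUES (?, ?, ?, ?)",
    [
        (f"$e{i:05d}", "!room:example.org", "m.room.member", f"@u{i}:example.org")
        for i in range(2000)
    ],
)

count = sum(1 for _ in iter_state_events(conn, "!room:example.org", batch_size=100))
print(count)  # 2000
```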

@ptman
Contributor

ptman commented Feb 22, 2021

@jkufner complexity (resource use) does not depend on the number of events (messages, attachments, etc.) but on the number of state events (related to e.g. federation, permission calculation, ...): https://github.com/matrix-org/synapse/blob/master/synapse/storage/databases/main/events_worker.py#L1072
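
For a sense of scale, here is a sketch of how such a complexity score might be computed. It assumes the score is the room's state-event count divided by 500 and rounded to two decimals, which is how I recall the linked function working. Treat that formula and the name `room_complexity_v1` as assumptions, and check the linked file before relying on them.

```python
def room_complexity_v1(state_event_count):
    """Assumed formula: state events / 500, rounded to two decimals."""
    return round(state_event_count / 500, 2)


for n in (500, 10_000, 15_000):
    print(n, room_complexity_v1(n))
# 500 1.0
# 10000 20.0
# 15000 30.0
```

Under that assumption, the "complexity between 20 and 30" mentioned earlier in the thread would correspond to roughly 10,000 to 15,000 state events.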

@ptman
Contributor

ptman commented Feb 22, 2021

#8659

@jkufner

jkufner commented Feb 22, 2021

@ptman Ok, sorry for the inaccuracy; however, the argument still stands.

@AnInternetTroll

Any update on this?

@c7hm4r

c7hm4r commented Jun 8, 2021

@erikjohnston
Member

This should hopefully be significantly improved in the upcoming v1.36.0 release. I'm going to close this for now; if people still see issues after updating, feel free to open a new issue.

@gabrix73

I installed the Debian 11 matrix-synapse-py3 package on a CX21 server with 2 CPUs, 4 GB of RAM and a 40 GB disk. It’s not really a small homeserver or a Raspberry Pi 4.
I am the only user and I am trying to explore just one room.
Loading #matrix:matrix.org from clients takes the server's RAM usage up to 70%, and the same goes for both CPUs.
I assume that normal use of my server, with more users and a large list of rooms of various kinds loaded, remains impossible.
