Don't lock up when joining large rooms #16903

Merged
merged 4 commits into from
Feb 20, 2024

Conversation

@erikjohnston erikjohnston commented Feb 12, 2024

Hopefully fixes the issue #16895

@erikjohnston erikjohnston marked this pull request as ready for review February 12, 2024 12:19
@erikjohnston erikjohnston requested a review from a team as a code owner February 12, 2024 12:19
@erikjohnston erikjohnston marked this pull request as draft February 12, 2024 12:56
@erikjohnston erikjohnston removed the request for review from a team February 12, 2024 12:56
@anoadragon453 anoadragon453 marked this pull request as ready for review February 20, 2024 14:05
@anoadragon453 anoadragon453 self-requested a review February 20, 2024 14:07

@anoadragon453 anoadragon453 left a comment

We've tested this change internally and found that it helps alleviate performance issues for a server running v1.101.0.

@anoadragon453 anoadragon453 enabled auto-merge (squash) February 20, 2024 14:11
@anoadragon453 anoadragon453 merged commit cdbbf36 into develop Feb 20, 2024
38 checks passed
@anoadragon453 anoadragon453 deleted the erikj/faster_join_yield branch February 20, 2024 14:29
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Mar 9, 2024
# Synapse 1.102.0 (2024-03-05)

### Bugfixes

- Revert element-hq/synapse#16756, which caused incorrect notification counts on mobile clients since v1.100.0. ([\#16979](element-hq/synapse#16979))


# Synapse 1.102.0rc1 (2024-02-20)

### Features

- A metric was added for emails sent by Synapse, broken down by type: `synapse_emails_sent_total`. Contributed by Remi Rampin. ([\#16881](element-hq/synapse#16881))

### Bugfixes

- Do not send multiple concurrent requests for keys for the same server. ([\#16894](element-hq/synapse#16894))
- Fix performance issue when joining very large rooms that can cause the server to lock up. Introduced in v1.100.0. ([\#16903](element-hq/synapse#16903))
- Always prefer unthreaded receipt when >1 exist ([MSC4102](matrix-org/matrix-spec-proposals#4102)). ([\#16927](element-hq/synapse#16927))

### Improved Documentation

- Fix a small typo in the Rooms section of the Admin API documentation. Contributed by @RainerZufall187. ([\#16857](element-hq/synapse#16857))

### Internal Changes

- Don't invalidate the entire event cache when we purge history. ([\#16905](element-hq/synapse#16905))
- Add experimental config option to not send device list updates for specific users. ([\#16909](element-hq/synapse#16909))
- Fix incorrect docker hub link in release script. ([\#16910](element-hq/synapse#16910))



### Updates to locked dependencies

* Bump attrs from 23.1.0 to 23.2.0. ([\#16899](element-hq/synapse#16899))
* Bump bcrypt from 4.0.1 to 4.1.2. ([\#16900](element-hq/synapse#16900))
* Bump pygithub from 2.1.1 to 2.2.0. ([\#16902](element-hq/synapse#16902))
* Bump sentry-sdk from 1.40.0 to 1.40.3. ([\#16898](element-hq/synapse#16898))
erikjohnston pushed a commit that referenced this pull request Mar 12, 2024
This PR aims to fix #16895, caused by a regression in #7 and not fixed
by #16903. The PR #16903 only fixes a starvation issue, where the CPU
isn't released. There is a second issue, where the execution is blocked.
This theory is supported by the flame graphs provided in #16895 and the
fact that I see the CPU usage dropping to far below the limit.
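
A starvation issue of this kind, where a long-running loop never hands control back to the event loop, can be illustrated with a minimal sketch. It uses `asyncio` rather than Synapse's Twisted reactor, and the names `process_all`, `yield_every`, and `heartbeat` are hypothetical; it is not the code from #16903.

```python
import asyncio


async def process_all(items, yield_every=100):
    """Process items, handing control back to the event loop every `yield_every` items."""
    processed = 0
    for index, item in enumerate(items, start=1):
        processed += item  # placeholder for the real per-item work
        if index % yield_every == 0:
            # sleep(0) suspends this coroutine once so other tasks get a turn.
            await asyncio.sleep(0)
    return processed


async def main():
    ticks = []

    async def heartbeat():
        # A second task that can only progress while the loop above yields.
        for _ in range(3):
            ticks.append("tick")
            await asyncio.sleep(0)

    total, _ = await asyncio.gather(process_all(range(1000)), heartbeat())
    return total, ticks


total, ticks = asyncio.run(main())
```

Without the periodic `await`, `heartbeat` would not run until the whole loop had finished, which is the starvation symptom. Yielding does nothing for work that is slow in itself.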

Since the changes in #7, the method `check_state_independent_auth_rules`
is called with the additional parameter `batched_auth_events`:


https://github.com/element-hq/synapse/blob/6fa13b4f927c10b5f4e9495be746ec28849f5cb6/synapse/handlers/federation_event.py#L1741-L1743


This makes the execution enter the following `if` clause, introduced with #15195:


https://github.com/element-hq/synapse/blob/6fa13b4f927c10b5f4e9495be746ec28849f5cb6/synapse/event_auth.py#L178-L189

There are two issues in the above code snippet.

First, there is the blocking issue. I'm not entirely sure if this is a
deadlock, starvation, or something different. At first, I suspected the
copy operation, but it was not responsible. Then I investigated the
nested `store.get_events` call inside the function `update`. This was
not causing the blocking issue either. The blocking was resolved only
when I replaced the set difference operation (`-`) with a list
comprehension. Creating and comparing sets with a very large number of
events seems to be problematic.
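
The two ways of computing "which IDs are still missing" can be compared in a minimal sketch. The names `needed_ids` and `batched_events` and the string IDs are hypothetical stand-ins, not the actual code in `event_auth.py`.

```python
# Hypothetical stand-ins: plain strings play the role of event IDs.
def missing_via_set_difference(needed_ids, batched_events):
    """Return the IDs absent from batched_events, using set difference."""
    return set(needed_ids) - set(batched_events)


def missing_via_comprehension(needed_ids, batched_events):
    """Return the same IDs as a list, using dict membership tests only."""
    return [event_id for event_id in needed_ids if event_id not in batched_events]


batched_events = {"$a": object(), "$b": object()}
needed_ids = ["$a", "$c", "$d"]

by_set = missing_via_set_difference(needed_ids, batched_events)
by_list = missing_via_comprehension(needed_ids, batched_events)
```

Both variants select the same IDs. The comprehension keeps the input order and any duplicates, and it builds no intermediate sets.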

This is how the flamegraph looks now while persisting outliers. As you
can see, the execution no longer locks up in the above function.

![output_2024-02-28_13-59-40](https://github.com/element-hq/synapse/assets/13143850/6db9c9ac-484f-47d0-bdde-70abfbd773ec)

Second, the copying here doesn't serve any purpose, because only a
shallow copy is created. This means the same objects from the original
dict are referenced, which defeats the intention of protecting these
objects from mutation. The review of the original PR
matrix-org/synapse#15195 had an extensive
discussion about this matter.
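
The shallow-copy behaviour can be shown with a small sketch. `FakeEvent` and its `rejected` attribute are hypothetical stand-ins for a mutable event object, not Synapse classes.

```python
class FakeEvent:
    """Hypothetical mutable stand-in for an event object."""

    def __init__(self, event_id, rejected=False):
        self.event_id = event_id
        self.rejected = rejected


auth_events = {"$a": FakeEvent("$a")}
shallow = auth_events.copy()  # new dict, same value objects

# Adding a key to the copy leaves the original dict untouched...
shallow["$b"] = FakeEvent("$b")

# ...but mutating a value through the copy is visible through the original.
shallow["$a"].rejected = True

shared = shallow["$a"] is auth_events["$a"]
original_mutated = auth_events["$a"].rejected
```

The copy protects only the mapping itself (which keys exist), not the objects it points to.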

Various approaches to copying the auth_events were attempted:
1) Implementing a deepcopy caused issues due to
builtins.EventInternalMetadata not being pickleable.
2) Creating a dict with new objects akin to a deepcopy.
3) Creating a dict with new objects containing only necessary
attributes.

In conclusion, there is no easy way to create an actual copy of the
objects. Opting for a deepcopy can significantly strain memory and CPU
resources, making it an inefficient choice. I don't see why the copy is
necessary in the first place. Therefore I'm proposing to remove it
altogether.
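
For illustration, approach 3 from the list above might look roughly like the following sketch. The classes `FakeEvent` and `AuthEventView`, the chosen attributes, and the `snapshot` helper are all hypothetical; the actual attempts are not shown in this PR.

```python
from dataclasses import dataclass


class FakeEvent:
    """Hypothetical mutable event with more state than the checks need."""

    def __init__(self, event_id, type, state_key, content):
        self.event_id = event_id
        self.type = type
        self.state_key = state_key
        self.content = content


@dataclass(frozen=True)
class AuthEventView:
    """Immutable snapshot holding only the attributes assumed to be necessary."""

    event_id: str
    type: str
    state_key: str


def snapshot(auth_events):
    """Build a new dict of new objects, one small allocation per event."""
    return {
        key: AuthEventView(ev.event_id, ev.type, ev.state_key)
        for key, ev in auth_events.items()
    }


original = {"$a": FakeEvent("$a", "m.room.member", "@user:example.org", {"membership": "join"})}
views = snapshot(original)

# Mutating the original no longer affects the snapshot.
original["$a"].state_key = "@other:example.org"
```

Even this reduced form allocates one object per event, so the cost still grows with the size of the room.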

After these changes, I was able to successfully join these rooms,
without the main worker locking up:
- #synapse:matrix.org
- #element-android:matrix.org
- #element-web:matrix.org
- #ecips:matrix.org
- #ipfs-chatter:ipfs.io
- #python:matrix.org
- #matrix:matrix.org