Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In sync wait for worker to catch up since token #17215

Merged
merged 9 commits into from
May 30, 2024

Conversation

erikjohnston
Copy link
Member

Otherwise things will get confused.

An alternative would be to make sure that for lagging stream we don't return anything (and make sure the returned next_batch token doesn't go backwards). But that is a faff.

reivilibre
reivilibre previously approved these changes May 18, 2024
Copy link
Contributor

@reivilibre reivilibre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks reasonable, maybe something better could be done about the sleeps but I agree with the 'TODO be better' :D

@erikjohnston
Copy link
Member Author

Oh bleurgh, the typing stream can legitimately go backwards 🙄

@erikjohnston erikjohnston marked this pull request as ready for review May 18, 2024 12:24
@erikjohnston erikjohnston requested a review from a team as a code owner May 18, 2024 12:24
@reivilibre
Copy link
Contributor

somewhat-flakey Complement failure is because the events stream (and possibly another) is going backwards. I don't know why this happens but we've noticed it before in Complement.

Waiting for current token to reach StreamToken(room_key=RoomStreamToken(stream=9, instance_map=immutabledict({}), topological=None), presence_key=12, typing_key=0, receipt_key=MultiWriterStreamToken(stream=1, instance_map=immutabledict({})), account_data_key=1, push_rules_key=1, to_device_key=1, device_list_key=6, groups_key=0, un_partial_stated_rooms_key=4); currently at StreamToken(room_key=RoomStreamToken(stream=8, instance_map=immutabledict({}), topological=None), presence_key=16, typing_key=0, receipt_key=MultiWriterStreamToken(stream=1, instance_map=immutabledict({})), account_data_key=1, push_rules_key=1, to_device_key=1, device_list_key=7, groups_key=0, un_partial_stated_rooms_key=4)

as an example of where it gets stuck, when locally enhanced with logging

@erikjohnston erikjohnston force-pushed the erikj/wait_for_stream_pos branch from b1300c4 to a84615e Compare May 20, 2024 10:10
@erikjohnston
Copy link
Member Author

Right, so the problem is that there are occasions where we ask for a new stream ID from the generator, but then don't persist it in the DB. If this happens before a restart then the StreamIDGenerator will try and read the largest used value in the DB, and so the stream will start with an ID less than what it was before the restart.

This doesn't affect MultiWriterIdGenerator as on Postgres that is backed by a proper DB sequence.

I'm not quite sure what the right way forward is her here, ideally we'd make it so that stream IDs must be used, but that is hard.

@erikjohnston
Copy link
Member Author

I'm not quite sure what the right way forward is her here, ideally we'd make it so that stream IDs must be used, but that is hard.

Actually, maybe we use stream_positions table, as matrix-org/synapse#8374 sounds basically the same problem.

@erikjohnston erikjohnston dismissed reivilibre’s stale review May 21, 2024 15:48

we need changes

@erikjohnston erikjohnston removed the request for review from a team May 22, 2024 10:18
@erikjohnston
Copy link
Member Author

I'm not quite sure what the right way forward is her here, ideally we'd make it so that stream IDs must be used, but that is hard.

Actually, maybe we use stream_positions table, as matrix-org/synapse#8374 sounds basically the same problem.

#17226

erikjohnston added a commit that referenced this pull request May 29, 2024
There is a problem with `StreamIdGenerator` where it can go backwards
over restarts when a stream ID is requested but then not inserted into
the DB. This is problematic if we want to land #17215, and is generally
a potential cause for all sorts of nastiness.

Instead of trying to fix `StreamIdGenerator`, we may as well move to
`MultiWriterIdGenerator` that does not suffer from this problem (the
latest positions are stored in `stream_positions` table). This involves
adding SQLite support to the class.

This only changes id generators that were already using
`MultiWriterIdGenerator` under postgres, a separate PR will move the
rest of the uses of `StreamIdGenerator` over.
@erikjohnston erikjohnston requested a review from reivilibre May 30, 2024 11:29
@erikjohnston erikjohnston merged commit 5624c8b into develop May 30, 2024
38 checks passed
@erikjohnston erikjohnston deleted the erikj/wait_for_stream_pos branch May 30, 2024 13:03
H-Shay pushed a commit to H-Shay/hq_synapse that referenced this pull request May 31, 2024
)

There is a problem with `StreamIdGenerator` where it can go backwards
over restarts when a stream ID is requested but then not inserted into
the DB. This is problematic if we want to land element-hq#17215, and is generally
a potential cause for all sorts of nastiness.

Instead of trying to fix `StreamIdGenerator`, we may as well move to
`MultiWriterIdGenerator` that does not suffer from this problem (the
latest positions are stored in `stream_positions` table). This involves
adding SQLite support to the class.

This only changes id generators that were already using
`MultiWriterIdGenerator` under postgres, a separate PR will move the
rest of the uses of `StreamIdGenerator` over.
H-Shay pushed a commit to H-Shay/hq_synapse that referenced this pull request May 31, 2024
Otherwise things will get confused.

An alternative would be to make sure that for lagging stream we don't
return anything (and make sure the returned next_batch token doesn't go
backwards). But that is a faff.
erikjohnston added a commit that referenced this pull request Jun 6, 2024
erikjohnston added a commit that referenced this pull request Jun 10, 2024
Mic92 pushed a commit to Mic92/synapse that referenced this pull request Jun 14, 2024
)

There is a problem with `StreamIdGenerator` where it can go backwards
over restarts when a stream ID is requested but then not inserted into
the DB. This is problematic if we want to land element-hq#17215, and is generally
a potential cause for all sorts of nastiness.

Instead of trying to fix `StreamIdGenerator`, we may as well move to
`MultiWriterIdGenerator` that does not suffer from this problem (the
latest positions are stored in `stream_positions` table). This involves
adding SQLite support to the class.

This only changes id generators that were already using
`MultiWriterIdGenerator` under postgres, a separate PR will move the
rest of the uses of `StreamIdGenerator` over.
Mic92 pushed a commit to Mic92/synapse that referenced this pull request Jun 14, 2024
Otherwise things will get confused.

An alternative would be to make sure that for lagging stream we don't
return anything (and make sure the returned next_batch token doesn't go
backwards). But that is a faff.
Mic92 pushed a commit to Mic92/synapse that referenced this pull request Jun 14, 2024
yingziwu added a commit to yingziwu/synapse that referenced this pull request Jun 26, 2024
- Fix the building of binary wheels for macOS by switching to macOS 12 CI runners. ([\#17319](element-hq/synapse#17319))

- When rolling back to a previous Synapse version and then forwards again to this release, don't require server operators to manually run SQL. ([\#17305](element-hq/synapse#17305), [\#17309](element-hq/synapse#17309))

- Use the release branch for sytest in release-branch PRs. ([\#17306](element-hq/synapse#17306))

- Fix bug where one-time-keys were not always included in `/sync` response when using workers. Introduced in v1.109.0rc1. ([\#17275](element-hq/synapse#17275))
- Fix bug where `/sync` could get stuck due to edge case in device lists handling. Introduced in v1.109.0rc1. ([\#17292](element-hq/synapse#17292))

- Add the ability to auto-accept invites on the behalf of users. See the [`auto_accept_invites`](https://element-hq.github.io/synapse/latest/usage/configuration/config_documentation.html#auto-accept-invites) config option for details. ([\#17147](element-hq/synapse#17147))
- Add experimental [MSC3575](matrix-org/matrix-spec-proposals#3575) Sliding Sync `/sync/e2ee` endpoint for to-device messages and device encryption info. ([\#17167](element-hq/synapse#17167))
- Support [MSC3916](matrix-org/matrix-spec-proposals#3916) by adding unstable media endpoints to `/_matrix/client`. ([\#17213](element-hq/synapse#17213))
- Add logging to tasks managed by the task scheduler, showing CPU and database usage. ([\#17219](element-hq/synapse#17219))

- Fix deduplicating of membership events to not create unused state groups. ([\#17164](element-hq/synapse#17164))
- Fix bug where duplicate events could be sent down sync when using workers that are overloaded. ([\#17215](element-hq/synapse#17215))
- Ignore attempts to send to-device messages to bad users, to avoid log spam when we try to connect to the bad server. ([\#17240](element-hq/synapse#17240))
- Fix handling of duplicate concurrent uploading of device one-time-keys. ([\#17241](element-hq/synapse#17241))
- Fix reporting of default tags to Sentry, such as worker name. Broke in v1.108.0. ([\#17251](element-hq/synapse#17251))
- Fix bug where typing updates would not be sent when using workers after a restart. ([\#17252](element-hq/synapse#17252))

- Update the LemonLDAP documentation to say that claims should be explicitly included in the returned `id_token`, as Synapse won't request them. ([\#17204](element-hq/synapse#17204))

- Improve DB usage when fetching related events. ([\#17083](element-hq/synapse#17083))
- Log exceptions when failing to auto-join new user according to the `auto_join_rooms` option. ([\#17176](element-hq/synapse#17176))
- Reduce work of calculating outbound device lists updates. ([\#17211](element-hq/synapse#17211))
- Improve performance of calculating device lists changes in `/sync`. ([\#17216](element-hq/synapse#17216))
- Move towards using `MultiWriterIdGenerator` everywhere. ([\#17226](element-hq/synapse#17226))
- Replaces all usages of `StreamIdGenerator` with `MultiWriterIdGenerator`. ([\#17229](element-hq/synapse#17229))
- Change the `allow_unsafe_locale` config option to also apply when setting up new databases. ([\#17238](element-hq/synapse#17238))
- Fix errors in logs about closing incorrect logging contexts when media gets rejected by a module. ([\#17239](element-hq/synapse#17239), [\#17246](element-hq/synapse#17246))
- Clean out invalid destinations from `device_federation_outbox` table. ([\#17242](element-hq/synapse#17242))
- Stop logging errors when receiving invalid User IDs in key querys requests. ([\#17250](element-hq/synapse#17250))

* Bump anyhow from 1.0.83 to 1.0.86. ([\#17220](element-hq/synapse#17220))
* Bump bcrypt from 4.1.2 to 4.1.3. ([\#17224](element-hq/synapse#17224))
* Bump lxml from 5.2.1 to 5.2.2. ([\#17261](element-hq/synapse#17261))
* Bump mypy-zope from 1.0.3 to 1.0.4. ([\#17262](element-hq/synapse#17262))
* Bump phonenumbers from 8.13.35 to 8.13.37. ([\#17235](element-hq/synapse#17235))
* Bump prometheus-client from 0.19.0 to 0.20.0. ([\#17233](element-hq/synapse#17233))
* Bump pyasn1 from 0.5.1 to 0.6.0. ([\#17223](element-hq/synapse#17223))
* Bump pyicu from 2.13 to 2.13.1. ([\#17236](element-hq/synapse#17236))
* Bump pyopenssl from 24.0.0 to 24.1.0. ([\#17234](element-hq/synapse#17234))
* Bump serde from 1.0.201 to 1.0.202. ([\#17221](element-hq/synapse#17221))
* Bump serde from 1.0.202 to 1.0.203. ([\#17232](element-hq/synapse#17232))
* Bump twine from 5.0.0 to 5.1.0. ([\#17225](element-hq/synapse#17225))
* Bump types-psycopg2 from 2.9.21.20240311 to 2.9.21.20240417. ([\#17222](element-hq/synapse#17222))
* Bump types-pyopenssl from 24.0.0.20240311 to 24.1.0.20240425. ([\#17260](element-hq/synapse#17260))
erikjohnston added a commit that referenced this pull request Jul 2, 2024
Fixes #17274, hopefully.

Basically, old versions of Synapse could advance streams without
persisting anything in the DB (fixed in #17229). On restart those
updates would get lost, and so the position of the stream would revert
to an older position. If this happened across an upgrade to a later
Synapse version which included #17215, then sync could get blocked
indefinitely (until the stream advanced to the position in the token).

We fix this by bounding the stream positions we'll wait for to the
maximum position of the underlying stream ID generator.
erikjohnston added a commit that referenced this pull request Jul 2, 2024
Fixes #17274, hopefully.

Basically, old versions of Synapse could advance streams without
persisting anything in the DB (fixed in #17229). On restart those
updates would get lost, and so the position of the stream would revert
to an older position. If this happened across an upgrade to a later
Synapse version which included #17215, then sync could get blocked
indefinitely (until the stream advanced to the position in the token).

We fix this by bounding the stream positions we'll wait for to the
maximum position of the underlying stream ID generator.
reivilibre added a commit that referenced this pull request Jul 23, 2024
…ced in v1.109.0. (#17428)

Introduced in: #17215

This caused us a minor bit of grief as the volume of logs produced was
much higher than normal

---------

Signed-off-by: Olivier 'reivilibre <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants