Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Synapse initial sync becomes impossible for clients after possible silent database corruption #12878

Closed
donicrosby opened this issue May 25, 2022 · 6 comments
Labels
X-Needs-Info This issue is blocked awaiting information from the reporter

Comments

@donicrosby
Copy link

Description

After upgrading synapse to either 1.57.1 or 1.58.1 (I'm not sure which version caused this) clients can no longer perform an initial sync with the home server. All of these clients have the same issue:

  • self-hosted Element,
  • Android Element,
  • Desktop Element,
  • Fluffychat
  • app.element.io

All will not allow users to perform an initial sync with affected home server. The symptom of this is that the homeserver never sends a response to the client causing a timeout with the client causing a retry loop with no errors shown in the client console or in the logs (looks to them as if the server became unresponsive). There are no errors being shown in the logs on the WARN log level on the INFO log level I have found multiple of these events that show up at server startup:

2022-05-25 16:57:22,738 - synapse.storage.util.partial_state_events_tracker - 77 - INFO - GET-536 - Awaiting un-partial-stating of events ['$noFqPKbjeQ9okjKQdMWGZqo5o677M-Fc8rcpuvPdnRs', '$SXZ69Jge2FYtpWhI7hai0jXoZBfivkx_Pz3D7Qu8zc0', '$XQ5YFZbBhxXUHttFkAkxtPucvher_OPq8B2IdudC1JQ', '$MKAvTHRWCZEoGI4yl3ssPKkycsrVhodKawHu8WMsHzE', '$TEbbDjIJAFy895QHPfGwwktIx94s6XNbu2gxwFjcI0M', '$r6HO96BQwOrs9mS2mOtkpmzaKh9tzQK4YT_cJuGSdRs', '$pPxK1DQMheNlqIUVrp8522ywM2xM8sAyOdyO6olJX-Y', '$hVtbTrTlHwzJVPWVymjosJG9bAq5nrHprDaI4-sCDTY', '$aADmLV6wErtaB0l9gy7Jj104emw8fYV5bdH4RPlWCHM', '$Fl0-SYIvOfUD0lKxTu7iQh9DwxttpKhS-HglS2EOiRo', '$Xv69QFRfq4YMF_UHhSvDpfeTCRZVVrRAmjsTfTYWY-k', '$1pd1UCw_LuTssCwDFzd601ewZbBr_RUkzNWdtggiF0I', '$Fvii-D7df5g-iPONJq5QWKhvvr0KaQHMfG3HMQC8k5s', '$ekn9gJ1KW7_cpH8hugBaBW_HcIM5KEa5FjdiobxK9N0', '$xY3KiZxOTyqoFrvEVHpsguT8QaS63hx7W0-utsVVINs', '$9ARX1px7IPtD1lbIKHMP-a2OtgZciXAm5zru8mxwX9k', '$iQASmEc4L6ouFztq1yricUGhiDdZQGuPKGNF02LWGf8', '$CICDdYEqs087x0zX7p3Z_Hktikl2c64zdD3gI-joKgk', '$AUCD67_kghOdyKp5HBsMxLA3NU9S0B5Ol7_i-jqHBs4', '$N2QLviuJrrVmHrjKxO7TlmqgrFhjZZc_XP0SzCmPCR4', '$n2XaLRinSKGCXPHOz_wXnFNqIA6LKFS5gxvvYZ48Ckw', '$A8TGMXq4SZURkIwf-KlPiWHbtVPpznXPxZh2Wf9KNVQ', '$01snaZolMquSInYjpH--PpGIn7O-ynQTFdcCO76fDBc', '$k_p5YlNGU8q1-iEk29_fEWjS9T4nT9RpOADbweHix-E', '$w_kKPK6N2fdT0uCSrYxMNkREx3nJsNIGxqnR_ICopNA', '$9t5rsJ9KADjB_YNH40tluGJo2mUpc8ncaighG4TGdXA', '$XXOXkxmjvM10wsqvipwKra3FcRHy_ZK6lpDFX688nrU', '$PDPL82CMe4gfsCnV_FE-5U4wV2V22j5mxGu5DYMuL0s', '$H1y1qsRN5f4hCQI-GRXaJgiCrlBXzuJODK0yxPYEpME', '$Ti_VpNrEainDSA1QBnDJqv8m_XP3IDq-BEVM5ovrYcQ', '$aqeHFefmTNAWils6czlvlzaMI8Q-3i1sjKvFJFPrM9s', '$bEhxnKhOu9LtmnR8cf2LLzS0pXwDWX5YWh4NBYZlFEQ', '$_U6ps-1uPNpThf7f9SD60GP1zJ7spsVeDn6i9k4SfPg', '$1y5U_HVRI-Yekau3Ih040NKP6vzJiTr-qqwHFhVmt8U', '$Ia8w5QTXbdrEXW6F_qTblgj0eLUgkBRai4vwA8-OD4A', '$5B9J4UI5rKLigJMNL89j1k_3MkNyKs9c8jVrbl1sQ9Y', '$MC-_4JzSsnmpI_VSGgd26n9qJnPm7nz72nScAWXLzFU', '$g0B-R2rxdVHkv09ZTSYrLY8z9INFpRHw3yYy08fkodA', '$S0U2KiPfwtTSKj9MhtVViEHIpPv4uqy29dXqf4sVLuY', '$QBcujRtKfnYVrXGGg1q3667VvA5E6pAPfSO5F9s72zY', '$KUd1-tWZw2i_6fx9J84b_KE-6uv7AlsXh9Hxi7w5DmQ']
Wed, May 25 2022 12:57:22 pm | Stack (most recent call last):
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
Wed, May 25 2022 12:57:22 pm | return _run_code(code, main_globals, None,
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
Wed, May 25 2022 12:57:22 pm | exec(code, run_globals)
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/synapse/app/homeserver.py", line 460, in <module>
Wed, May 25 2022 12:57:22 pm | main()
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/synapse/app/homeserver.py", line 456, in main
Wed, May 25 2022 12:57:22 pm | run(hs)
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/synapse/app/homeserver.py", line 435, in run
Wed, May 25 2022 12:57:22 pm | _base.start_reactor(
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/synapse/app/_base.py", line 184, in start_reactor
Wed, May 25 2022 12:57:22 pm | run()
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/synapse/app/_base.py", line 166, in run
Wed, May 25 2022 12:57:22 pm | run_command()
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/base.py", line 1315, in run
Wed, May 25 2022 12:57:22 pm | self.mainLoop()
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/base.py", line 1325, in mainLoop
Wed, May 25 2022 12:57:22 pm | reactorBaseSelf.runUntilCurrent()
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/base.py", line 964, in runUntilCurrent
Wed, May 25 2022 12:57:22 pm | f(*a, **kw)
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 661, in callback
Wed, May 25 2022 12:57:22 pm | self._startRunCallbacks(result)
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 763, in _startRunCallbacks
Wed, May 25 2022 12:57:22 pm | self._runCallbacks()
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 857, in _runCallbacks
Wed, May 25 2022 12:57:22 pm | current.result = callback( # type: ignore[misc]
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1750, in gotResult
Wed, May 25 2022 12:57:22 pm | current_context.run(_inlineCallbacks, r, gen, status)
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1663, in _inlineCallbacks
Wed, May 25 2022 12:57:22 pm | status.deferred.callback(getattr(e, "value", None))
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 661, in callback
Wed, May 25 2022 12:57:22 pm | self._startRunCallbacks(result)
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 763, in _startRunCallbacks
Wed, May 25 2022 12:57:22 pm | self._runCallbacks()
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 857, in _runCallbacks
Wed, May 25 2022 12:57:22 pm | current.result = callback( # type: ignore[misc]
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 661, in callback
Wed, May 25 2022 12:57:22 pm | self._startRunCallbacks(result)
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 763, in _startRunCallbacks
Wed, May 25 2022 12:57:22 pm | self._runCallbacks()
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 857, in _runCallbacks
Wed, May 25 2022 12:57:22 pm | current.result = callback( # type: ignore[misc]
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1750, in gotResult
Wed, May 25 2022 12:57:22 pm | current_context.run(_inlineCallbacks, r, gen, status)
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1663, in _inlineCallbacks
Wed, May 25 2022 12:57:22 pm | status.deferred.callback(getattr(e, "value", None))
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 661, in callback
Wed, May 25 2022 12:57:22 pm | self._startRunCallbacks(result)
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 763, in _startRunCallbacks
Wed, May 25 2022 12:57:22 pm | self._runCallbacks()
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 857, in _runCallbacks
Wed, May 25 2022 12:57:22 pm | current.result = callback( # type: ignore[misc]
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/synapse/util/caches/descriptors.py", line 497, in complete_all
Wed, May 25 2022 12:57:22 pm | d1.callback(val)
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 661, in callback
Wed, May 25 2022 12:57:22 pm | self._startRunCallbacks(result)
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 763, in _startRunCallbacks
Wed, May 25 2022 12:57:22 pm | self._runCallbacks()
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 857, in _runCallbacks
Wed, May 25 2022 12:57:22 pm | current.result = callback( # type: ignore[misc]
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/synapse/util/async_helpers.py", line 120, in callback
Wed, May 25 2022 12:57:22 pm | observer.callback(r)
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 661, in callback
Wed, May 25 2022 12:57:22 pm | self._startRunCallbacks(result)
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 763, in _startRunCallbacks
Wed, May 25 2022 12:57:22 pm | self._runCallbacks()
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 857, in _runCallbacks
Wed, May 25 2022 12:57:22 pm | current.result = callback( # type: ignore[misc]
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1337, in _cbDeferred
Wed, May 25 2022 12:57:22 pm | self.callback(cast(_DeferredListResultListT, self.resultList))
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 661, in callback
Wed, May 25 2022 12:57:22 pm | self._startRunCallbacks(result)
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 763, in _startRunCallbacks
Wed, May 25 2022 12:57:22 pm | self._runCallbacks()
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 857, in _runCallbacks
Wed, May 25 2022 12:57:22 pm | current.result = callback( # type: ignore[misc]
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 661, in callback
Wed, May 25 2022 12:57:22 pm | self._startRunCallbacks(result)
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 763, in _startRunCallbacks
Wed, May 25 2022 12:57:22 pm | self._runCallbacks()
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 857, in _runCallbacks
Wed, May 25 2022 12:57:22 pm | current.result = callback( # type: ignore[misc]
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1750, in gotResult
Wed, May 25 2022 12:57:22 pm | current_context.run(_inlineCallbacks, r, gen, status)
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1660, in _inlineCallbacks
Wed, May 25 2022 12:57:22 pm | result = current_context.run(gen.send, result)
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/synapse/util/async_helpers.py", line 227, in _concurrently_execute_inner
Wed, May 25 2022 12:57:22 pm | await maybe_awaitable(func(value))
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 1613, in handle_room_entries
Wed, May 25 2022 12:57:22 pm | await self._generate_room_entry(
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 2059, in _generate_room_entry
Wed, May 25 2022 12:57:22 pm | batch = await self._load_filtered_recents(
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 592, in _load_filtered_recents
Wed, May 25 2022 12:57:22 pm | loaded_recents = await filter_events_for_client(
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/synapse/visibility.py", line 85, in filter_events_for_client
Wed, May 25 2022 12:57:22 pm | event_id_to_state = await storage.state.get_state_for_events(
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/synapse/storage/state.py", line 676, in get_state_for_events
Wed, May 25 2022 12:57:22 pm | event_to_groups = await self._get_state_group_for_events(event_ids)
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/synapse/storage/state.py", line 806, in _get_state_group_for_events
Wed, May 25 2022 12:57:22 pm | await self._partial_state_events_tracker.await_full_state(event_ids)
Wed, May 25 2022 12:57:22 pm | File "/usr/local/lib/python3.9/site-packages/synapse/storage/util/partial_state_events_tracker.py", line 77, in await_full_state
Wed, May 25 2022 12:57:22 pm | logger.info(

Steps to reproduce

Note: I'm unsure what settings or events causes this and as such I'll explain the symptoms of the issue that should be able to be pointed out to users/server admins

  • Wait for events in encrypted rooms to be "off-by-one" (a new message causes the previous message to be able to be decrypted)
  • Clear cache and reload/ try to create a fresh login
  • On a web client open developer tools under networking it will show that any of the /sync endpoints connection is still trying to be established
  • You will no longer be able to perform initial syncs with the server

Version information

  • Homeserver: This issue is present on just my self-hosted home server
  • Version:

    • server_version: 1.59.1
    • python_version: 3.9.13
  • Install method:

    • Docker/k8s
    • Homegrown helm chart that is not publicly available
  • Platform:

    • kubernetes cluster hosted on amd64 VMs
    • Uses synapse workers
@squahtx
Copy link
Contributor

squahtx commented May 25, 2022

Thanks for digging out the stack trace.

As noted by richvdh, it looks like you may have experimental_features: faster_joins: true in your config, which is likely the source of the trouble. That feature does not yet work across restarts (#12642) or deployments with workers.

Could you:

  1. Disable faster_joins in the config
  2. Pull out the list of stuck rooms from your database using psql or sqlite: SELECT * FROM partial_state_rooms;
  3. Purge the stuck rooms using the admin API. An example bash script to do that:
    rooms=(
        !ehXvUhWNASUkSLvAGP:matrix.org
        !ehXvUhWNASUkSLvAGP:matrix.org
        ...
    )
    for room_id in "${rooms[@]}"
    do
        curl -X DELETE --header "Authorization: Bearer syt_..." "https://your-homeserver/_synapse/admin/v1/rooms/$room_id" --data '{"block": false, "purge": true}'
    done
    See https://github.com/matrix-org/synapse/tree/develop/docs/usage/administration/admin_api for instructions on authenticating with the admin API.
  4. Optionally clean up the faster joins tables using psql or sqlite: TRUNCATE partial_state_events; TRUNCATE partial_state_rooms; TRUNCATE partial_state_rooms_servers;
  5. Restart the deployment.

@squahtx squahtx added the X-Needs-Info This issue is blocked awaiting information from the reporter label May 25, 2022
@donicrosby
Copy link
Author

I'm having an issue with the server stuck on the deletion part currently it's stuck on:

DELETE-6134 - Shutting down room '<room_id>'
DELETE-6134 - Kicking '[user]:[myserver]' from '<room_id>'...

It never seems to complete and just eventually times out

@squahtx
Copy link
Contributor

squahtx commented May 25, 2022

I wonder if the deletion is stuck for the same reason /sync is stuck. Do you see any DELETE-6134 - Awaiting un-partial-stating of events stack traces?

@donicrosby
Copy link
Author

Yes I am

@richvdh
Copy link
Member

richvdh commented May 26, 2022

I think you need to stop the server, then TRUNCATE partial_state_rooms CASCADE;, then restart it, before doing the purge.

And, for goodness sake, turn off faster_joins.

@richvdh
Copy link
Member

richvdh commented May 28, 2022

I'm going to close this, as I think the above instructions will fix it, and it's really somewhat self-inflicted

@richvdh richvdh closed this as completed May 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
X-Needs-Info This issue is blocked awaiting information from the reporter
Projects
None yet
Development

No branches or pull requests

3 participants