Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Error doing background update 'chain_cover' on v1.26.0rc1 #9208

Closed
intelfx opened this issue Jan 22, 2021 · 10 comments · Fixed by #9210
Closed

Error doing background update 'chain_cover' on v1.26.0rc1 #9208

intelfx opened this issue Jan 22, 2021 · 10 comments · Fixed by #9210
Labels
S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@intelfx
Copy link
Contributor

intelfx commented Jan 22, 2021

Description

After upgrading v1.25.0 synapse installation to v1.26.0rc1 and restarting the instance one more time, it started to continuously report this error:

янв 22 19:05:51 stratofortress.nexus.i.intelfx.name synapse[373164]: synapse.storage.background_updates: [background_updates-0] Starting update batch on background update 'chain_cover'
янв 22 19:05:51 stratofortress.nexus.i.intelfx.name synapse[373164]: synapse.storage.background_updates: [background_updates-0] Error doing update
                                                                        Traceback (most recent call last):
                                                                          File "/usr/lib/python3.9/site-packages/synapse/storage/background_updates.py", line 116, in run_background_updates
                                                                            result = await self.do_next_background_update(
                                                                          File "/usr/lib/python3.9/site-packages/synapse/storage/background_updates.py", line 227, in do_next_background_update
                                                                            await self._do_background_update(desired_duration_ms)
                                                                          File "/usr/lib/python3.9/site-packages/synapse/storage/background_updates.py", line 264, in _do_background_update
                                                                            items_updated = await update_handler(progress, batch_size)
                                                                          File "/usr/lib/python3.9/site-packages/synapse/storage/databases/main/events_bg_updates.py", line 748, in _chain_cover_index
                                                                            result = await self.db_pool.runInteraction(
                                                                          File "/usr/lib/python3.9/site-packages/synapse/storage/database.py", line 656, in runInteraction
                                                                            result = await self.runWithConnection(
                                                                          File "/usr/lib/python3.9/site-packages/synapse/storage/database.py", line 739, in runWithConnection
                                                                            return await make_deferred_yieldable(
                                                                          File "/usr/lib/python3.9/site-packages/twisted/python/threadpool.py", line 250, in inContext
                                                                            result = inContext.theWork()
                                                                          File "/usr/lib/python3.9/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
                                                                            inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
                                                                          File "/usr/lib/python3.9/site-packages/twisted/python/context.py", line 122, in callWithContext
                                                                            return self.currentContext().callWithContext(ctx, func, *args, **kw)
                                                                          File "/usr/lib/python3.9/site-packages/twisted/python/context.py", line 85, in callWithContext
                                                                            return func(*args,**kw)
                                                                          File "/usr/lib/python3.9/site-packages/twisted/enterprise/adbapi.py", line 306, in _runWithConnection
                                                                            compat.reraise(excValue, excTraceback)
                                                                          File "/usr/lib/python3.9/site-packages/twisted/python/compat.py", line 464, in reraise
                                                                            raise exception.with_traceback(traceback)
                                                                          File "/usr/lib/python3.9/site-packages/twisted/enterprise/adbapi.py", line 297, in _runWithConnection
                                                                            result = func(conn, *args, **kw)
                                                                          File "/usr/lib/python3.9/site-packages/synapse/storage/database.py", line 734, in inner_func
                                                                            return func(db_conn, *args, **kwargs)
                                                                          File "/usr/lib/python3.9/site-packages/synapse/storage/database.py", line 534, in new_transaction
                                                                            r = func(cursor, *args, **kwargs)
                                                                          File "/usr/lib/python3.9/site-packages/synapse/storage/databases/main/events_bg_updates.py", line 920, in _calculate_chain_cover_txn
                                                                            PersistEventsStore._add_chain_cover_index(
                                                                          File "/usr/lib/python3.9/site-packages/synapse/storage/databases/main/events.py", line 634, in _add_chain_cover_index
                                                                            existing_chain_id = chain_map[auth_id]
                                                                        KeyError: '$2PBTqUSs6gQtrZ3jZW8xVUSHvDbUYR3TKpFBoDHHwJk'

Steps to reproduce

  • upgrade v1.25.0 to v1.26.0rc1
  • restart synapse, wait until the migration and background updates are completed
  • restart synapse one more time

Version information

  • Homeserver: intelfx.name

If not matrix.org:

  • Version: {"server_version":"1.26.0rc1","python_version":"3.9.1"}

  • Install method: hand packaged

  • Platform: x86_64 Arch Linux

@callahad callahad added T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. X-Release-Blocker Must be resolved before making a release S-Major Major functionality / product severely impaired, no satisfactory workaround. S-Minor Blocks non-critical functionality, workarounds exist. and removed S-Major Major functionality / product severely impaired, no satisfactory workaround. labels Jan 22, 2021
@callahad
Copy link
Contributor

Thank you for the report! We're currently investigating.

Aside from the logspam, things should continue working correctly. Even though the background update to create a chain cover index has failed, Synapse will fall back to the previous state resolution algorithm which does not depend on that index being present.

@erikjohnston
Copy link
Member

I'm very confused how this is possible. Could you run the following patch and send over the logs (filtering by background_updates-0 will work)? Thanks

diff --git a/synapse/storage/databases/main/events.py b/synapse/storage/databases/main/events.py
index 3216b3f3c..d72585303 100644
--- a/synapse/storage/databases/main/events.py
+++ b/synapse/storage/databases/main/events.py
@@ -587,6 +587,11 @@ class PersistEventsStore:
                     auth_id not in chain_map
                     and auth_id not in events_to_calc_chain_id_for
                 ):
+                    logger.info(
+                        "Discarding event ID %s due to missing auth event %s",
+                        event_id,
+                        auth_id,
+                    )
                     events_to_calc_chain_id_for.discard(event_id)
 
                     # If this is an event we're trying to persist we add it to
@@ -614,6 +619,8 @@ class PersistEventsStore:
         if not events_to_calc_chain_id_for:
             return
 
+        logger.info("Calculating chain IDs for: %s", events_to_calc_chain_id_for)
+
         # We now calculate the chain IDs/sequence numbers for the events. We
         # do this by looking at the chain ID and sequence number of any auth
         # event with the same type/state_key and incrementing the sequence
@@ -628,9 +635,11 @@ class PersistEventsStore:
         for event_id in sorted_topologically(
             events_to_calc_chain_id_for, event_to_auth_chain
         ):
+            logger.info("Calculating chain ID for %s", event_id)
             existing_chain_id = None
             for auth_id in event_to_auth_chain.get(event_id, []):
                 if event_to_types.get(event_id) == event_to_types.get(auth_id):
+                    logger.info("Basing chain ID on %s", auth_id)
                     existing_chain_id = chain_map[auth_id]
                     break
 
@@ -666,6 +675,7 @@ class PersistEventsStore:
 
             chains_tuples_allocated.add(new_chain_tuple)
 
+            logger.info("Calculated chain ID for %s: %s", event_id, new_chain_tuple)
             chain_map[event_id] = new_chain_tuple
             new_chain_tuples[event_id] = new_chain_tuple
 

@intelfx
Copy link
Contributor Author

intelfx commented Jan 22, 2021

@erikjohnston ~1 minute worth of logs (all workers): 2021-01-22-synapse-9208.log

@erikjohnston
Copy link
Member

Huh, something odd is happening. It almost looks like we're not currently sorting topologically. I'd appreciate if you could try with some more logging:

diff --git a/synapse/storage/databases/main/events.py b/synapse/storage/databases/main/events.py
index 3216b3f3c..29ddf0831 100644
--- a/synapse/storage/databases/main/events.py
+++ b/synapse/storage/databases/main/events.py
@@ -579,9 +579,17 @@ class PersistEventsStore:
                 else:
                     chain_map[auth_id] = (chain_id, sequence_number)
 
+        logger.info("event_to_auth_chain: %s", event_to_auth_chain)
+
         # Now we check if we have any events where we don't have auth chain,
         # this should only be out of band memberships.
         for event_id in sorted_topologically(event_to_auth_chain, event_to_auth_chain):
+            logger.info(
+                "Checking auth events for %s (in %s): %s",
+                event_id,
+                event_to_room_id.get(event_id),
+                event_to_auth_chain[event_id],
+            )
             for auth_id in event_to_auth_chain[event_id]:
                 if (
                     auth_id not in chain_map

@intelfx
Copy link
Contributor Author

intelfx commented Jan 22, 2021

@erikjohnston 2021-01-22-synapse-9208-2.log

@erikjohnston
Copy link
Member

Ah, sorted_topologically doesn't handle duplicate edges correctly. This is usually not a problem as I believe we've started enforcing that we don't have duplicate entries in auth_events(?).

Can you see if the following patch fixes it?

diff --git a/synapse/util/iterutils.py b/synapse/util/iterutils.py
index 6ef2b008a..8d2411513 100644
--- a/synapse/util/iterutils.py
+++ b/synapse/util/iterutils.py
@@ -78,7 +78,7 @@ def sorted_topologically(
         if node not in degree_map:
             continue
 
-        for edge in edges:
+        for edge in set(edges):
             if edge in degree_map:
                 degree_map[node] += 1
 

@intelfx
Copy link
Contributor Author

intelfx commented Jan 22, 2021

@erikjohnston yep, the exception spam is gone.

Is this a sign of some latent issue with intelfx.name database?

@erikjohnston
Copy link
Member

Ish, they're probably old events from way back before we added sanity checks. They're not "wrong" per se, just that we don't see that situation in modern events and so our testing didn't pick the situation up :)

@erikjohnston
Copy link
Member

Also, yay that that fixed it.

@clokep clokep closed this as completed Jan 22, 2021
@clokep
Copy link
Member

clokep commented Jan 22, 2021

Closing since #9210 didn't see to close it automatically. 🤷

@callahad callahad removed the X-Release-Blocker Must be resolved before making a release label Jan 28, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants