This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Faster joins: handle gappy /syncs #12645

Closed
richvdh opened this issue May 5, 2022 · 4 comments
Assignees
squahtx
Labels
A-Federated-Join (joins over federation generally suck) · A-Sync (defects related to /sync) · T-Task (refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks)

Comments

@richvdh
Member

richvdh commented May 5, 2022

We need to make sure that we're handling gappy /syncs correctly. It would be easy for us to think we have sent clients more state than we actually have; we'll need to check this and write tests.

To do this, we'll first have to understand the existing code for tracking which state we have sent to clients. It looks entirely magical.

@anoadragon453 added the T-Task label on May 5, 2022
@erikjohnston
Member

I think the code we care about is here:

elif batch.limited:
    if batch:
        state_at_timeline_start = (
            await self.state_store.get_state_ids_for_event(
                batch.events[0].event_id, state_filter=state_filter
            )
        )
    else:
        # We can get here if the user has ignored the senders of all
        # the recent events.
        state_at_timeline_start = await self.get_state_at(
            room_id, stream_position=now_token, state_filter=state_filter
        )

    # for now, we disable LL for gappy syncs - see
    # https://github.com/vector-im/riot-web/issues/7211#issuecomment-419976346
    # N.B. this slows down incr syncs as we are now processing way
    # more state in the server than if we were LLing.
    #
    # We still have to filter timeline_start to LL entries (above) in order
    # for _calculate_state's LL logic to work, as we have to include LL
    # members for timeline senders in case they weren't loaded in the initial
    # sync. We do this by (counterintuitively) by filtering timeline_start
    # members to just be ones which were timeline senders, which then ensures
    # all of the rest get included in the state block (if we need to know
    # about them).
    state_filter = StateFilter.all()

    # If this is an initial sync then full_state should be set, and
    # that case is handled above. We assert here to ensure that this
    # is indeed the case.
    assert since_token is not None
    state_at_previous_sync = await self.get_state_at(
        room_id, stream_position=since_token, state_filter=state_filter
    )

    if batch:
        current_state_ids = await self.state_store.get_state_ids_for_event(
            batch.events[-1].event_id, state_filter=state_filter
        )
    else:
        # Its not clear how we get here, but empirically we do
        # (#5407). Logging has been added elsewhere to try and
        # figure out where this state comes from.
        current_state_ids = await self.get_state_at(
            room_id, stream_position=now_token, state_filter=state_filter
        )

    state_ids = _calculate_state(
        timeline_contains=timeline_state,
        timeline_start=state_at_timeline_start,
        previous=state_at_previous_sync,
        current=current_state_ids,
        # we have to include LL members in case LL initial sync missed them
        lazy_load_members=lazy_load_members,
    )

which basically gets the state at the end of the last sync response and the state at the start of the current sync response, takes the delta, and removes any state events that will appear in the timeline anyway.
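
To make that concrete, here is a minimal sketch of the delta logic, assuming state is represented as dicts mapping (event type, state_key) to event IDs. The function name and the exact set arithmetic are illustrative, not a verbatim copy of _calculate_state:

import itertools

def sketch_calculate_state(timeline_contains, timeline_start, previous, current):
    # Map event_id back to its (type, state_key) so we can rebuild a dict at the end.
    event_id_to_key = {
        event_id: key
        for key, event_id in itertools.chain(
            timeline_contains.items(),
            timeline_start.items(),
            previous.items(),
            current.items(),
        )
    }

    c_ids = set(current.values())             # state at the end of the timeline
    ts_ids = set(timeline_start.values())     # state at the start of the timeline
    p_ids = set(previous.values())            # state at the previous sync
    tc_ids = set(timeline_contains.values())  # state events already in the timeline

    # Everything the client hasn't seen since the previous sync, minus anything
    # it will see in the timeline anyway.
    state_ids = ((c_ids | ts_ids) - p_ids) - tc_ids
    return {event_id_to_key[e]: e for e in state_ids}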

One way of fixing this is to somehow set the state_at_previous_sync to the empty set if we were still in the process of lazy joining at the time?
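
A minimal sketch of that idea, assuming a hypothetical helper is_partial_state_at() that reports whether the room still had partial state at the given stream position (no such helper exists today):

# Hypothetical sketch only: is_partial_state_at() is an assumed helper, not a
# real Synapse API.
if await self.store.is_partial_state_at(room_id, since_token):
    # The faster join hadn't finished at the previous sync, so we can't trust
    # what we think we sent; pretend we sent nothing.
    state_at_previous_sync = {}
else:
    state_at_previous_sync = await self.get_state_at(
        room_id, stream_position=since_token, state_filter=state_filter
    )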

@MadLittleMods added the A-Sync and A-Federated-Join labels on Jun 3, 2022
@squahtx self-assigned this on Jun 8, 2022
@squahtx
Contributor

squahtx commented Jun 8, 2022

I think this is fine. The only state events that we might think we have sent, but actually haven't, are m.room.member events.

When lazy_load_members is off, we want to pretend the room has not been joined yet: #12989
When lazy_load_members is on, we:

  1. first decide that we always want to return m.room.member events in the /sync response, by kicking them out of state-at-the-end-of-the-previous-sync:
    # If we are lazyloading room members, we explicitly add the membership events
    # for the senders in the timeline into the state block returned by /sync,
    # as we may not have sent them to the client before. We find these membership
    # events by filtering them out of timeline_start, which has already been filtered
    # to only include membership events for the senders in the timeline.
    # In practice, we can do this by removing them from the p_ids list,
    # which is the list of relevant state we know we have already sent to the client.
    # see https://github.com/matrix-org/synapse/pull/2970/files/efcdacad7d1b7f52f879179701c7e0d9b763511f#r204732809
    if lazy_load_members:
        p_ids.difference_update(
            e for t, e in timeline_start.items() if t[0] == EventTypes.Member
        )
  2. then filter out m.room.member events that we have already sent, by checking against an in-memory cache, which we update at the same time (see the stand-alone sketch after this list):
    if lazy_load_members and not include_redundant_members:
        cache_key = (sync_config.user.to_string(), sync_config.device_id)
        cache = self.get_lazy_loaded_members_cache(cache_key)

        # if it's a new sync sequence, then assume the client has had
        # amnesia and doesn't want any recent lazy-loaded members
        # de-duplicated.
        if since_token is None:
            logger.debug("clearing LruCache for %r", cache_key)
            cache.clear()
        else:
            # only send members which aren't in our LruCache (either
            # because they're new to this client or have been pushed out
            # of the cache)
            logger.debug("filtering state from %r...", state_ids)
            state_ids = {
                t: event_id
                for t, event_id in state_ids.items()
                if cache.get(t[1]) != event_id
            }
            logger.debug("...to %r", state_ids)

        # add any member IDs we are about to send into our LruCache
        for t, event_id in itertools.chain(
            state_ids.items(), timeline_state.items()
        ):
            if t[0] == EventTypes.Member:
                cache.set(t[1], event_id)
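
To make the de-duplication concrete, here is a small self-contained sketch of the same idea, using a plain dict in place of the per-(user, device) LruCache; the names and the member-only filtering are illustrative simplifications, not Synapse's actual code:

from typing import Dict, Tuple

# user_id -> event_id of the membership event we last sent to this client
sent_members: Dict[str, str] = {}

def filter_members_to_send(
    state_ids: Dict[Tuple[str, str], str],
    timeline_state: Dict[Tuple[str, str], str],
) -> Dict[Tuple[str, str], str]:
    """Drop member events the client has already seen, then record what we send."""
    filtered = {
        (etype, skey): event_id
        for (etype, skey), event_id in state_ids.items()
        if etype != "m.room.member" or sent_members.get(skey) != event_id
    }
    # Remember every membership event we are about to send, whether it goes in
    # the state block or appears in the timeline.
    for (etype, skey), event_id in {**filtered, **timeline_state}.items():
        if etype == "m.room.member":
            sent_members[skey] = event_id
    return filtered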

so in the following sequence, the gappy /sync will include the m.room.member event for Bob (sketched after the list below):

  1. Alice joins room
  2. /sync
  3. Synapse finishes syncing partial state
  4. Bob sends two messages
  5. gappy /sync with limit: 1
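
For illustration only, the relevant fragment of that gappy /sync response (step 5) would be expected to look roughly like this; the event IDs, user IDs and token are made up:

# Rough shape of rooms.join.<room_id> in the gappy /sync response (step 5).
expected_room_fragment = {
    "timeline": {
        "limited": True,  # gappy: only the most recent event is included
        "events": ["$bob_second_message"],
        "prev_batch": "some_token",
    },
    "state": {
        # Bob's membership must be delivered here: it was never sent before,
        # and his earlier message (whose sender it describes) fell outside the
        # limited timeline.
        "events": ["$bob_membership_event"],
    },
}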

@squahtx
Contributor

squahtx commented Jun 8, 2022

Complement test: matrix-org/complement#390

@squahtx
Contributor

squahtx commented Jun 10, 2022

The test has been merged. Please reopen if I've missed anything.

@squahtx closed this as completed on Jun 10, 2022