Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Complement can_paginate_after_getting_remote_event_from_timestamp_to_event_endpoint is flakey(?) #13944

Closed
DMRobertson opened this issue Sep 29, 2022 · 2 comments · Fixed by #14215
Assignees
Labels
A-Testing Issues related to testing in complement, synapse, etc T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. Z-Dev-Wishlist Makes developers' lives better, but doesn't have direct user impact Z-Flake Tests that give intermittent failures

Comments

@DMRobertson
Copy link
Contributor

DMRobertson commented Sep 29, 2022

https://github.com/matrix-org/synapse/actions/runs/3149623842/jobs/5121449357#step:5:5805

❌ TestJumpToDateEndpoint (14.08s)
❌ TestJumpToDateEndpoint/parallel (4.03s)
❌ TestJumpToDateEndpoint/parallel/federation (0s)
❌ TestJumpToDateEndpoint/parallel/federation/can_paginate_after_getting_remote_event_from_timestamp_to_event_endpoint (1.54s)
      msc3030_test.go:195: SendEventSynced waiting for event ID $aEEhbyDijeJ586WpiJCNmpxF41yCQ5Sw0Hd2-0HkpRw
      msc3030_test.go:195: SendEventSynced waiting for event ID $NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU
      msc3030_test.go:200: CSAPI.MustDoFunc response return non-2xx code: 403 Forbidden - body: {"errcode":"M_FORBIDDEN","error":"You don't have permission to access that event."}

Not sure if this is related to @MadLittleMods' work?

@DMRobertson DMRobertson added T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. Z-Flake Tests that give intermittent failures Z-Dev-Wishlist Makes developers' lives better, but doesn't have direct user impact labels Sep 29, 2022
@squahtx squahtx added the A-Testing Issues related to testing in complement, synapse, etc label Sep 30, 2022
@MadLittleMods
Copy link
Contributor

MadLittleMods commented Oct 6, 2022

It looks like this might be related to #13444 given we're seeing this log line Unexpectedly did not have cached prev group for $NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU for the event that /context is failing to get.

2022-09-29T08:23:25.7520090Z     client.go:604: [CSAPI] GET hs2/_matrix/client/r0/rooms/!EhRVzzPYBOfRplzGyM:hs1/context/$NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU => 403 Forbidden (26.804412ms)

2022-09-29T08:23:25.7599143Z synapse_main | 2022-09-29 08:21:08,284 - synapse.federation.sender - 453 - INFO - process_event_queue_for_federation-3 - Unexpectedly did not have cached prev group for $aEEhbyDijeJ586WpiJCNmpxF41yCQ5Sw0Hd2-0HkpRw
2022-09-29T08:23:25.7605062Z synapse_main | 2022-09-29 08:21:08,363 - synapse.federation.sender - 453 - INFO - process_event_queue_for_federation-4 - Unexpectedly did not have cached prev group for $NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU

Maybe a red herring since not sure that would cause an actual failure though.


It is interesting that this reproduced in monolith mode though 🤔

@MadLittleMods
Copy link
Contributor

MadLittleMods commented Oct 6, 2022

Sleuthing the server logs on the CI failure, it looks like hs2:/timestamp_to_event found $NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU as an outlier event locally. Then when we went and asked for it via /context, since it's an outlier, it was filtered out of the results -> You don't have permission to access that event.

This is reproducible when sync_partial_state races and persists $NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU as an outlier before we evaluate get_event_for_timestamp(...). To consistently reproduce locally, just add a delay at the start of get_event_for_timestamp(...) so it always runs after sync_partial_state completes.

# TODO: Make sure to assign `self.hs = hs` above
from twisted.internet import task as twisted_task
# wait 3.5s (without blocking anything)
d = twisted_task.deferLater(self.hs.get_reactor(), 3.5)
await d

In a run where it passes, on hs2, get_event_for_timestamp(...) finds a different event locally which is next to a gap and we request from a closer one from hs1 which gets backfilled. And since the backfilled event is not an outlier, it's returned as expected during /context.


In get_event_for_timestamp(...), it looks like we also need to backfill the locally found event if it's an outlier.

And/or it would be nice if /context would just backfill the event for us. This would also help with #3848

MadLittleMods added a commit that referenced this issue Oct 7, 2022
…timestamp_to_event_endpoint` test flake

See #13944

If the local event we find is an `outlier`, backfill it
so we can get a pagination token for
it with `/context` and paginate `/messages` from this
point. Previously we only did this for a remote event.

In a future PR, it would be nice if `/context` would just backfill the event for us. This would also help with #3848

Sleuthing the server logs on the CI failure, it looks like `hs2:/timestamp_to_event` found `$NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU` as an `outlier` event locally. Then when we went and asked for it via `/context`, since it's an `outlier`, it was filtered out of the results -> `You don't have permission to access that event.`

This is reproducible when `sync_partial_state` races and persists `$NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU` as an `outlier` before we evaluate `get_event_for_timestamp(...)`. To consistently reproduce locally, just add a delay at the [start of `get_event_for_timestamp(...)`](https://github.com/matrix-org/synapse/blob/cb20b885cb4bd1648581dd043a184d86fc8c7a00/synapse/handlers/room.py#L1470-L1496) so it always runs after `sync_partial_state` completes.

```py
from twisted.internet import task as twisted_task
d = twisted_task.deferLater(self.hs.get_reactor(), 3.5)
await d
```

In a run where it passes, on `hs2`, `get_event_for_timestamp(...)` finds a different event locally which is next to a gap and we request from a closer one from `hs1` which gets backfilled. And since the backfilled event is not an `outlier`, it's returned as expected during `/context`.
MadLittleMods added a commit that referenced this issue Oct 7, 2022
…timestamp_to_event_endpoint` test flake. Make sure we backfill the closest local event if it's an `outlier` so we can use it with `/context` and paginate `/messages` from this point. Previously we only did this for a remote event.

Fix  #13944

Why did this fail before? Why was it flakey?
====================================================

Sleuthing the server logs on the CI failure, it looks like `hs2:/timestamp_to_event` found `$NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU` as an `outlier` event locally. Then when we went and asked for it via `/context`, since it's an `outlier`, it was filtered out of the results -> `You don't have permission to access that event.`

This is reproducible when `sync_partial_state` races and persists `$NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU` as an `outlier` before we evaluate `get_event_for_timestamp(...)`. To consistently reproduce locally, just add a delay at the [start of `get_event_for_timestamp(...)`](https://github.com/matrix-org/synapse/blob/cb20b885cb4bd1648581dd043a184d86fc8c7a00/synapse/handlers/room.py#L1470-L1496) so it always runs after `sync_partial_state` completes.

```py
from twisted.internet import task as twisted_task
d = twisted_task.deferLater(self.hs.get_reactor(), 3.5)
await d
```

In a run where it passes, on `hs2`, `get_event_for_timestamp(...)` finds a different event locally which is next to a gap and we request from a closer one from `hs1` which gets backfilled. And since the backfilled event is not an `outlier`, it's returned as expected during `/context`.

Future changes
==========================

In a future PR, it would be nice if `/context` would just backfill the event for us. This would also help with #3848
MadLittleMods added a commit that referenced this issue Oct 7, 2022
…timestamp_to_event_endpoint` test flake. Make sure we backfill the closest local event if it's an `outlier` so we can use it with `/context` and paginate `/messages` from this point. Previously we only did this for a remote event.

Fix  #13944

Why did this fail before? Why was it flakey?
====================================================

Sleuthing the server logs on the CI failure, it looks like `hs2:/timestamp_to_event` found `$NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU` as an `outlier` event locally. Then when we went and asked for it via `/context`, since it's an `outlier`, it was filtered out of the results -> `You don't have permission to access that event.`

This is reproducible when `sync_partial_state` races and persists `$NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU` as an `outlier` before we evaluate `get_event_for_timestamp(...)`. To consistently reproduce locally, just add a delay at the [start of `get_event_for_timestamp(...)`](https://github.com/matrix-org/synapse/blob/cb20b885cb4bd1648581dd043a184d86fc8c7a00/synapse/handlers/room.py#L1470-L1496) so it always runs after `sync_partial_state` completes.

```py
from twisted.internet import task as twisted_task
d = twisted_task.deferLater(self.hs.get_reactor(), 3.5)
await d
```

In a run where it passes, on `hs2`, `get_event_for_timestamp(...)` finds a different event locally which is next to a gap and we request from a closer one from `hs1` which gets backfilled. And since the backfilled event is not an `outlier`, it's returned as expected during `/context`.

Future changes
==========================

In a future PR, it would be nice if `/context` would just backfill the event for us. This would also help with #3848
MadLittleMods added a commit that referenced this issue Oct 7, 2022
…timestamp_to_event_endpoint` test flake. Make sure we backfill the closest local event if it's an `outlier` so we can use it with `/context` and paginate `/messages` from this point. Previously we only did this for a remote event.

Fix  #13944

Why did this fail before? Why was it flakey?
====================================================

Sleuthing the server logs on the CI failure, it looks like `hs2:/timestamp_to_event` found `$NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU` as an `outlier` event locally. Then when we went and asked for it via `/context`, since it's an `outlier`, it was filtered out of the results -> `You don't have permission to access that event.`

This is reproducible when `sync_partial_state` races and persists `$NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU` as an `outlier` before we evaluate `get_event_for_timestamp(...)`. To consistently reproduce locally, just add a delay at the [start of `get_event_for_timestamp(...)`](https://github.com/matrix-org/synapse/blob/cb20b885cb4bd1648581dd043a184d86fc8c7a00/synapse/handlers/room.py#L1470-L1496) so it always runs after `sync_partial_state` completes.

```py
from twisted.internet import task as twisted_task
d = twisted_task.deferLater(self.hs.get_reactor(), 3.5)
await d
```

In a run where it passes, on `hs2`, `get_event_for_timestamp(...)` finds a different event locally which is next to a gap and we request from a closer one from `hs1` which gets backfilled. And since the backfilled event is not an `outlier`, it's returned as expected during `/context`.

Future changes
==========================

In a future PR, it would be nice if `/context` would just backfill the event for us. This would also help with #3848
MadLittleMods added a commit that referenced this issue Oct 17, 2022
Spawned while investigating #13944

This way we might get some more context whenever an `403 Forbidden - body: {"errcode":"M_FORBIDDEN","error":"You don't have permission to access that event."}` error is produced.

`log_config.yaml`
```yaml
loggers:
    synapse:
        level: INFO

    synapse.visibility:
        level: DEBUG
```
MadLittleMods added a commit that referenced this issue Oct 19, 2022
… idea whether are near a gap or not (#14215)

Fix MSC3030 `/timestamp_to_event` endpoint returning `outliers` that it has no idea whether are near a gap or not (and therefore unable to determine whether it's actually the closest event). The reason Synapse doesn't know whether an `outlier` is next to a gap is because our gap checks rely on entries in the `event_edges`, `event_forward_extremeties`, and `event_backward_extremities` tables which is [not the case for `outliers`](https://github.com/matrix-org/synapse/blob/2c63cdcc3f1aa4625e947de3c23e0a8133c61286/docs/development/room-dag-concepts.md#outliers).

Also fixes MSC3030 Complement `can_paginate_after_getting_remote_event_from_timestamp_to_event_endpoint` test flake.  Although this acted flakey in Complement, if `sync_partial_state` raced and beat us before `/timestamp_to_event`, then even if we retried the failing `/context` request it wouldn't work until we made this Synapse change. With this PR, Synapse will never return an `outlier` event so that test will always go and ask over federation.

Fix  #13944


### Why did this fail before? Why was it flakey?

Sleuthing the server logs on the [CI failure](https://github.com/matrix-org/synapse/actions/runs/3149623842/jobs/5121449357#step:5:5805), it looks like `hs2:/timestamp_to_event` found `$NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU` as an `outlier` event locally. Then when we went and asked for it via `/context`, since it's an `outlier`, it was filtered out of the results -> `You don't have permission to access that event.`

This is reproducible when `sync_partial_state` races and persists `$NP6-oU7mIFVyhtKfGvfrEQX949hQX-T-gvuauG6eurU` as an `outlier` before we evaluate `get_event_for_timestamp(...)`. To consistently reproduce locally, just add a delay at the [start of `get_event_for_timestamp(...)`](https://github.com/matrix-org/synapse/blob/cb20b885cb4bd1648581dd043a184d86fc8c7a00/synapse/handlers/room.py#L1470-L1496) so it always runs after `sync_partial_state` completes.

```py
from twisted.internet import task as twisted_task
d = twisted_task.deferLater(self.hs.get_reactor(), 3.5)
await d
```

In a run where it passes, on `hs2`, `get_event_for_timestamp(...)` finds a different event locally which is next to a gap and we request from a closer one from `hs1` which gets backfilled. And since the backfilled event is not an `outlier`, it's returned as expected during `/context`.

With this PR, Synapse will never return an `outlier` event so that test will always go and ask over federation.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Testing Issues related to testing in complement, synapse, etc T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. Z-Dev-Wishlist Makes developers' lives better, but doesn't have direct user impact Z-Flake Tests that give intermittent failures
Projects
None yet
3 participants