-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add more tests for receiving events while resyncing for a partial join #419
Add more tests for receiving events while resyncing for a partial join #419
Conversation
`testReceiveEventDuringPartialStateJoin()` sends an event over federation, checks that a client can see it and checks the state at the event.
…isisng_events` requests
var prevEvents interface{} | ||
if prevEventIDs == nil { | ||
prevEvents = nil | ||
} else { | ||
prevEvents = prevEventIDs | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do this because interface{}(nil)
is not equal to []string(nil)
// since we have not yet given the homeserver the full state at the join event and allowed | ||
// the partial join to complete, it can't possibly know the full state at the last event. | ||
// While it may be possible for the response to be correct by some accident of state res, | ||
// the homeserver is still wrong in spirit. | ||
t.Fatalf("/state_ids request for event %s did not block when it should have", event.EventID()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part tests matrix-org/synapse#13002 . If Synapse does not correctly flag that it does not have partial state in the CanReceiveEventsWithMissingPrevWithHalfMissingPrevsDuringPartialStateJoin
test case, it will erroneously return state (which was derived via state res and only happens to be correct) instead of blocking.
}) | ||
|
||
// we should be able to receive events with missing prevs over federation during the resync | ||
t.Run("CanReceiveEventsWithMissingPrevWithHalfMissingPrevsDuringPartialStateJoin", func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestions for better names are welcome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe "can receive events with some missing grandparents" or something?
… missing and one known prev event
6b43730
to
5e38528
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks generally good, a few suggestions.
@@ -143,66 +143,8 @@ func TestPartialStateJoin(t *testing.T) { | |||
|
|||
// derek sends an event in the room | |||
event := psjResult.CreateMessageEvent(t, "derek", nil) | |||
psjResult.Server.MustSendTransaction(t, deployment, "hs1", []json.RawMessage{event.JSON()}, nil) | |||
t.Logf("Derek sent event event ID %s", event.EventID()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/sent/created/ here
// fire off a /state_ids request for the last event. | ||
// it must either: | ||
// * block because the homeserver does not have full state at the last event | ||
// * or 403 because the homeserver does not have full state yet and does not consider the | ||
// Complement homeserver to be in the room |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we know which, and test for that specifically?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now only the last test case blocks instead of 403s. I'm somewhat reluctant to be more specific in the tests, because it feels like an implementation detail of synapse (which will change when we get around to matrix-org/synapse#13288).
If you're really keen on being more specific, I'm happy to add a boolean parameter or suchlike. (It's not easy to move this part of the test out, since it needs to happen after the /event request above has succeeded)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah right. tbh I don't think either blocking or returning a 403 is correct behaviour, though I'm not quite sure what is.
It would be very useful to add a link to #13288 and say that this is somewhat temporary until we solve it properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, is there a better way to test whether synapse thinks an event has partial state?
this part of the test was intended to check that _resolve_state_at_missing_prevs
calculates the partial state flag correctly and it sounds like it's not a good idea to rely on /state_ids
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
none that I can immediately think of. The ideal thing would be to check that after the resync, /state_ids returns some result that would have been impossible without the information conveyed during the resync - but that sounds rather fiddly for a test (and, probably, a hard-to-understand test failure if it later turns out not to be true).
select { | ||
case <-time.After(1 * time.Second): | ||
t.Logf("/state_ids request for event %s blocked as expected", event.EventID()) | ||
defer func() { <-stateIdsResponseChan }() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does this defer thing do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when the /state_ids
request blocks, the send of the response happens some time later, perhaps even after the test ends.
If we don't read from the channel, the goroutine above blocks on sending into the channel and complement gets very unhappy 10 minutes later because it's still not finished.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, so we're making sure we read from the channel so that the goroutine doesn't block. Can you add a comment to that effect?
if err := psjResult.Server.SendFederationRequest(deployment, stateReq, &respStateIDs); err != nil { | ||
httpErr, ok := err.(gomatrix.HTTPError) | ||
t.Logf("%v", httpErr) | ||
if ok && httpErr.Code == 403 { | ||
stateIdsResponseChan <- nil | ||
return | ||
} | ||
t.Errorf("/state_ids request returned non-200: %s", err) | ||
return | ||
} | ||
stateIdsResponseChan <- &respStateIDs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than matching for the 403 here and sending a nil
magic value down the channel, I suggest you define
type stateIDsResult struct {
RespStateIDs gomatrixserverlib.RespStateIDs
Error error
}
... and then send that back down the channel, and do the matching in the main thread.
result.ServerRoom, | ||
result.fedStateIdsRequestReceivedWaiter, | ||
result.fedStateIdsSendResponseWaiter, | ||
result.fedStateIdsAllowedEvents, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looks a lot cleaner. I've borrowed that commit and it works great. Do we want to split that change into a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I'll do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
}) | ||
|
||
// we should be able to receive events with missing prevs over federation during the resync | ||
t.Run("CanReceiveEventsWithMissingPrevWithHalfMissingPrevsDuringPartialStateJoin", func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe "can receive events with some missing grandparents" or something?
t.Logf("Derek sent event A with ID %s", eventA.EventID()) | ||
t.Logf("Derek sent event B with ID %s", eventB.EventID()) | ||
t.Logf("Derek sent event C with ID %s", eventC.EventID()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/sent/created/. He hasn't sent them to hs1 yet.
… to complete" This reverts commit 2cbeaae.
I'm going to add a test which will involve *multiple* /state and /state_ids requests, so we need to make the registered handlers more selective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
// read from the channel and discard the result, so that the goroutine above doesn't block | ||
// indefinitely on the send |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I'm a bit surprised that closing the channel isn't sufficient for this, but fair enough)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We get a panic: send on closed channel
without this. To clarify:
no close and no read -> complement times out after 10 minutes
defer close + no read -> panic
defer close + defer read -> complement is happy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really see why this is done in a goroutine at all - is this because SendFederationRequest
can block indefinitely? There is a default 30s timeout - if that is too long then you need to send a custom context with a timeout here -
complement/internal/federation/server.go
Line 227 in d842670
return httpClient.DoRequestAndParseResponse(context.Background(), httpReq, resBody) |
context.Background()
so it won't ever expire, but you can change that easily by doing:
ctx, cancel := context.WithTimeout(context.Background(), time.Second)
defer cancel()
httpClient.DoRequestAndParseResponse(ctx, httpReq, resBody)
This will require a change to Complement's API to:
func (s *Server) SendFederationRequest(ctx context.Context, deployment *docker.Deployment, req gomatrixserverlib.FederationRequest, resBody interface{}) error {
but that's fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion, that's a lot cleaner!
…ns_calculate_partial_state_flag_for_backfill
Before we merge, is there a better way to check that Synapse has calculated the partial state flag for an event correctly? |
if time.Since(start) > time.Second { | ||
t.Fatalf("timeout waiting for received event to be visible") | ||
} | ||
res := alice.DoFunc(t, "GET", []string{"_matrix", "client", "r0", "rooms", psjResult.ServerRoom.RoomID, "event", event.EventID()}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can now use WithRetryUntil in your DoFunc
so you can remove the for
loop etc. It would look a bit like:
res := alice.DoFunc(t, "GET", []string{"_matrix", "client", "r0", "rooms", psjResult.ServerRoom.RoomID, "event", event.EventID()},
client.WithRetryUntil(time.Second, func(res *http.Response) bool {
if res.StatusCode == 200 {
return true
}
eventResBody := client.ParseJSON(t, res)
if res.StatusCode == 404 && gjson.GetBytes(eventResBody, "errcode").String() == "M_NOT_FOUND" {
return false
}
t.Fatalf("GET /event failed with %d: %s", res.StatusCode, string(eventResBody))
return false
})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use client.WithRetryUntil
and add a context
to SendFederationRequest
and you should be able to remove a fair chunk of boilerplate here, and help other test authors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm otherwise.
.vscode/settings.json
Outdated
{ | ||
"go.testEnvVars": { | ||
"COMPLEMENT_BASE_IMAGE": "complement-synapse" | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if we should be commiting this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally not, as it will make Dendrite people sad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good spot! I must have added it by accident while merging in main
.
Thank you both for the reviews! |
Add tests for three cases where a homeserver receives a latest event with missing prev events while it is still resyncing state after a partial join.
Namely:
and
where event M is the last event the homeserver knows about and the rightmost event is the one it receives.
The last case tests the code path mentioned in matrix-org/synapse#13002
Best reviewed commit by commit.