-
Notifications
You must be signed in to change notification settings - Fork 385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MSC2638: Ability for clients to request homeservers to resync device lists #2638
Conversation
bc03f4d
to
505d24a
Compare
505d24a
to
bc02e25
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks to be probably the correct solution for the time being, though the lack of flood protection specified by the MSC is a bit concerning.
without the proper retry mechanisms. It can also happen if a homeserver was down | ||
for enough time to miss out on device list updates. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can also happen if a homeserver was down for enough time to miss out on device list updates.
Surely this is an implementation bug if it doesn't send them once the server returns? The spec doesn't say to give up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementations are free to decide when they retry. Currently Synapse isn't too clever about that (related: matrix-org/synapse#2526) - afaik it retries with a backoff meaning that if you've been down for long enough you could have to wait a long time (I think we go up to 24h but I'm not 100% certain) without getting the transactions it's missed. Also I don't think the "retry as soon as the server has returned" behaviour is mandated in the spec so it doesn't entirely sounds like an implementation bug to me.
The spec doesn't say to give up.
Well surely implementations will give up at some point. It sounds like a gigantic waste of resources to e.g. retry 100s of servers that have been down for weeks. And if the spec doesn't specifically mandate not to give up (which again would feel a bit meh to me) then imho it just means that the homeserver is free to decide whether to give up and when to do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These all pass through the transaction endpoints in the federation spec, which does recommend backing off but doesn't say that the sending server should ditch the request and start at a more convenient place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not seeing how that makes the issue this MSC tries to solve disappear TBH. Both backing off for hours/days or considering the server down and stopping sending it transactions entirely if it's been down for long enough don't sound unreasonable nor do they look like implementation bugs to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i agree with @turt2live that it feels wrong that we have to bodge around devicelists being unreliable by exposing an endpoint for clients to manually refresh them. It'd be better for device lists to be reliable in the first place, which means fixing matrix-org/synapse#2526 - which feels like a much better use of time. Are there other bugs that would cause device lists to get out of sync that we know of beyond that bug? (cc @kegsay)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixing matrix-org/synapse#2526 - which feels like a much better use of time
Is that (retrying pending transactions as soon as a server comes back to life) something that's enforced in the spec? If so I agree. Otherwise it feels like we're fixing the issue only for Synapse and ignoring people using an implementation that's handling that in another way (either because it's less advanced or by design) which feels wrong to me.
it feels wrong that we have to bodge around device lists being unreliable by exposing an endpoint for clients to manually refresh them
I tend to agree; my concern is that it's a feature that's easy to get wrong (as shown by the multiple Synapse bugs related to it), and any mishap ruins a whole part of the E2EE UX for a user - my opinion was that it could be nice to have a fallback just in case an implementation, whether it's Synapse or another one, screws up. Implementing inbound device list updates over federation requires to implement some things (e.g. retries) the spec doesn't mention, and screwing it up will likely leave the homeserver with a bunch of stale device lists it can't trivially fix even if the code gets fixed.
The proposal also addresses another issue, which is how to fix a server screwing up on top of just fixing the bug in the implementation. For example, with the multiple Synapse bugs, it's fair to say we're likely to have quite a few stale device lists, with no other way to fix them other than having the remote user rename a device (which isn't intuitive at all). Homeservers can't really figure out which device lists are stale without resyncing the device list of every single remote user they knows of, whereas it's (imho) much more trivial for clients to figure out when a device list is likely to be out of date.
FWIW the main cause of user grumpiness following the issues we had in Synapse was that users that aren't operating their own homeserver basically can't fix the stale device lists they could see for remote users, making the E2EE experience quite disappointing. It seems likely to me that this happens with other implementations in the future (or even in Synapse if we introduce a regression in that code by mistake).
I guess my point is that implementing device list updates is quite prone to bugs and design flaws and either can be very annoying to both server admins and users. If this is an issue then we should probably explore alternative designs for this feature.
Are there other bugs that would cause device lists to get out of sync that we know of beyond that bug?
There is one I know of (though apparently I haven't taken the time to open an issue for it yet) which is that Synapse expects updates to arrive sorted on their stream_id
, which the spec doesn't enforce - so if Synapse receives updates in the wrong order and the other homeserver fails to respond to the resync request the device list will become stale until the remote server starts to successfully resync again (or we're retrying again, which can be hampered due to the current backoff and to matrix-org/synapse#2526).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unaware of any other issues which can cause them to get out of sync. However, I would push for an explicit resync mechanism over the proposed "just make /send
reliable".
The core issue here is one of timing - much like matrix-org/synapse#2526 - but there's also issues around unbounded growth of messages to send to servers. Unlike PDUs, servers don't have to store all these device list updates in their DB unless they are forced to, so making them free to discard them is desirable.
With regards to timing, the options are:
- Sending server immediately informs other servers of device list updates and backoff retries if they are down. This causes flooding issues - Performance of device list updates could be improved (aka /login is unusably slow) synapse#7721 - and the retry mechanism has been a source of bugs and subtleties around the unbounded growth of messages e.g Synapse relies on unspecced behaviour for the remote server to resync device lists after
prune_outbound_device_list_pokes
synapse#7173 - I regard this option as eagerly sending. - Receiving client requests keys as and when they determine that they need the device keys. This doesn't cause flooding issues when someone updates their devices as people will gradually request keys. However, sometimes clients will not know that they are missing keys (e.g signing in on a new device) so it would need to be augmented with some sort of ping/poke to inform clients that there are new devices. We already have this mechanism in the CS API (
device_lists.changed
in/sync
) but no such mechanism exists in Federation. I regard this option as lazy sending as the keys are only fetched when requested by a client, even though they've been eagerly notified that there is a change in the device list (though we could be smarter about when we notify servers about the new device as we know the metadata on events). This would have also averted Cross-signing signatures not being always federated correctly synapse#7418 because instead of being told "Alice has a new device and here is the key" you'd be told "Alice has a new device" and then have to fetch the list (which would pull in any missing keys).
As for the proposed endpoint, I don't think it's neccessary. We already have https://matrix.org/docs/spec/client_server/r0.6.1#post-matrix-client-r0-keys-query so just use that? We may need to add a refresh
flag to stop servers from serving up local stale copies though.
The proposal descibed here fixes this issue by adding a new endpoint to the | ||
client-server API, allowing clients to request homeservers to resync device | ||
lists. Clients could then either expose a button a user can use when necessary, | ||
or call this endpoint on specific actions, or both. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Counter proposal (that I'm not convinced entirely on): implementations could be smarter about how they cache/detect failure. If a client hits the device list endpoint several times (where several is subjective), the server could realize that something may be wrong and re-sync. Additionally, as mentioned above, the server would already be able to detect some trivial cases of self-imposed corruption, such as having keys but no signatures, and request it automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, as mentioned above, the server would already be able to detect some trivial cases of self-imposed corruption, such as having keys but no signatures
I don't mention anywhere that servers detect that kind of things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
implementations could be smarter about how they cache/detect failure
I don't think that would solve the issue. If we miss a bunch of updates because we went down, when we come back up there's no way to know what we've missed, and I wouldn't be surprised if there were other situations where it would be tricky for the homeserver to figure out when to resync. This proposal is meant as a way for clients to tell the server "you didn't detect this failure correctly, but it failed, please resync".
fwiw this counter proposal is already described in the "Alternatives" section of this MSC, with a summary of the explanation above.
The homeserver responds to a request to this endpoint with a 202 Accepted | ||
response code and an empty body (`{}`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also support rate limiting (strongly) and error codes/error ranges for typical failures, like the homeserver being invalid, dead, etc or the user being invalid, local, etc.
If this is meant to work for local users too, that should be specified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@babolivier this is where the rate limiting needs to be mentioned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, thanks. That sounded like a security thing to me so I wrote it in the security section. I'll move it here.
XXX: I'm unsure whether the spec should be mandating when a client uses this | ||
endpoint or whether this is left as an implementation detail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
implementation detail. The spec is a toolbag, and it's your responsibility to grab a hammer to put a nail in the wall.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification!
This is already covered in the "Security considerations" section (including rate-limiting), is that not the right place for this kind of things? |
If the request to the aforementioned federation endpoint succeeds (i.e. the | ||
destination responds with a 200 response code and a device list for the target | ||
user), then the homeserver would include the updated device list in sync | ||
responses in order to relay it to clients. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that you have to hit the endpoint and then wait for something to come down /sync
is still quite potentially racy and feels like a bad smell. What is the client supposed to do in the meantime? Will a client be able to retry the sync and not lose the device keys? Is the response to /devices
undefined in the meantime - how do you know if it has updated?
Co-authored-by: Kegsay <[email protected]>
Rendered