-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
If you restart (e.g. upgrade) Element while it's waiting to process a m.room_key, it'll drop it and you'll get UISIs #23113
Comments
it's worth noting that in a Sliding Sync world, we don't need to worry about reliably persisting so much stuff (other than to-device messages!) given we can just re-request it from the server at launch. |
So in a sliding sync world you have an alternative solution here to guarantee at-least-once processing without having to persist to-device messages (which incurs perf hits as you have to hit the disk). The flow would look like:
This isn't possible in sync v2 because:
|
this is the kicker, though - for e2ee, processing more than once is a disaster, given it's a ratchet and the ratchet will get out of sync. my hunch is that we should grab the to-device msgs as rapidly as possible and chuck them reliably to disk, and then consume from there, rather than relying on the (admittedly cool) feature that sliding sync lets you rollback. |
If that is such a problem then persisting to device messages isn't enough, you need a two phase or atomic commit process to avoid reprocessing. E.g. write all the messages before you ack the token, then for each start a transaction and process a message and delete it then commit it. The processing part cannot store anything outside the transaction. It becomes a bit harder to manage. I'm surprised the protocol is so fragile that duplicate messages make it break though.. |
thinking more, yeah: i think Olm should handle duplicate messages okay, so in a v3 world we could indeed rerequest them from the server. |
Annoyingly, I got bitten by this one a second time (this time a lot worse, loosing the majority of enc. messages while I was away on holiday (~1.5k megolm sessions)). Linking here for reference -- https://github.com/matrix-org/element-web-rageshakes/issues/15185 |
It's worth noting that this will have been significantly exacerbated by some recent changes that add an extra check to the processing of incoming encrypted to-device messages (https://github.com/matrix-org/matrix-js-sdk/blame/v21.0.0/src/crypto/algorithms/olm.ts#L225-L243) |
It'll fix the crypto-related failure modes though, which is what this issue seems to be about (and I'm told that the rust matrix-sdk-crypto makes sure to persist the to-device messages it cares about, so IMHO this issue will be fixed by #21972). If to-device messages are being used for other things, I'd suggest opening separate issues for them specifically. |
Looking at this in more detail: the difference with the rust crypto implementation is not that it persists to-device messages per se, but that it does all processing of inbound to-device messages before returning control to the application - which means that the application will not make another (The main reason it is able to do this, whereas it is more difficult in the js-sdk, is that the js-sdk makes an outbound network request to fetch the sender's device keys before decryption, whereas the rust sdk delays this process until the keys are used.) So, the upshot remains: Element-R will fix this particular usage of to-device messages. Other applications which also use to-device messages may still be subject to the problem and will need separate fixes. |
@BillCarsonFr points out that we persist the sync token very early in /sync processing, so it doesn't really matter if the rest of the process is properly atomic: if the client is restarted between persisting that token and processing the keys, the keys will be dropped. |
tl;dr: JS SDK with Rust crypto does not have this problem because we only write AFTER processing /sync responses, once every 5 minutes. Okay, this is confusing so bear with my explanation. At first glance, it looks bad because this happens as soon as there's a response: // set the sync token NOW *before* processing the events. We do this so
// if something barfs on an event we can skip it rather than constantly
// polling with the same token.
this.client.store.setSyncToken(data.next_batch); However, the Later in the loop, we call if (data.to_device && Array.isArray(data.to_device.events) && data.to_device.events.length > 0) {
let toDeviceMessages: IToDeviceEvent[] = data.to_device.events.filter(noUnsafeEventProps);
if (this.syncOpts.cryptoCallbacks) {
toDeviceMessages = await this.syncOpts.cryptoCallbacks.preprocessToDeviceMessages(toDeviceMessages);
} Specifically, we Only AFTER all this is await'd do we do this: // tell databases that everything is now in a consistent state and can be saved.
await this.client.store.save(); This will NOT write to indexeddb until WRITE_DELAY_MS has elapsed, which defaults to 5 minutes. Only after that time will the token from private async doSyncToDatabase(userTuples: UserTuple[]): Promise<void> {
try {
const syncData = this.syncAccumulator.getJSON(true);
await Promise.all([
this.persistUserPresenceEvents(userTuples),
this.persistAccountData(syncData.accountData),
this.persistSyncData(syncData.nextBatch, syncData.roomsData),
]);
} finally {
this.syncToDatabasePromise = undefined;
}
} This is what happens when the client is restarted at various times in this process:
There's a complement-crypto test for this called |
Follow up Q: surely
This introduces a source of concurrency which could mean that:
|
TL;DR: this remains an issue in legacy crypto |
Steps to reproduce
I sent Rick some messages in an E2EE room for the first time in over a week, so it created a new megolm session. He couldn't decrypt them. It looks like it correctly sent it to him on vector.modular.im via to-device (although tracing to-device messages client->server->server->client is still incredibly hard; filed here).
Rick then turned on Element Desktop in the morning, which received a huge batch of 1515 to-device messages queued up overnight: https://rageshakes.element.io/api/listing/2022-08-19/081925-YR54YZWK/logs-0001.log.gz. The vast majority were keyshare requests from himself (presumably his Element Android trying to fix UISIs?)
Amongst these, there are a bunch of to-device messages from me:
of which only one
m.room.encrypted
is actually decrypted into anm.room_key
before Rick (quite reasonably) clicked 'upgrade' on Nightly at2022-08-19T08:11:41.387Z
, thus knifing it in the middle of working through the to-device backlog.The m.room_key which got decrypted looked like this, and was from a different session to the one with the UISIs:
Meanwhile, on next launch, there were neither m.room.encrypted or m.room_keys from me - i.e. the to-device messages with the keys got lost forever:
Looking at the code,
onToDeviceEvent()
callsevent.attemptDecryption()
which then asynchronously callsdecryptionLoop()
to work through decrypting the message... but at no point does the todevice message actually get persisted anywhere.To fix this, we should transactionally persist to-device messages before we consider that /sync request processed, and then reliably process them, much like Hydrogen does: it's classic 'what goes wrong' reliable transactions. Switching to matrix-rust-sdk-crypto would presumably not fix this failure mode, given it's a matter of reliably processing to-device messages in general, rather than anything crypto specific.
Therefore, we need to switch js-sdk to a more hydrogen or rust-sdk style approach of persisting stuff received in /sync locally in order to reliably receive it from the server, and then process based on that local store, so that if the client explodes we don't drop data on the floor.
Outcome
What did you expect?
No UISIs
What happened instead?
UISIs when the recipient restarts while their app is decrypting messages.
Operating system
No response
Application version
Nightly
How did you install the app?
Nightly
Homeserver
matrix.org
Will you send logs?
Yes
The text was updated successfully, but these errors were encountered: