E2BE-enabled AppServices crash a while (IO error: could not acquire lock) #293

jaller94 · 2023-01-10T14:35:42Z

Describe the bug
An appservice that enables E2BE-capabilities, may crash after a few hours. It seems to be based on a timer.

matrix-org/matrix-hookshot#609

To Reproduce

Write an encrypted appservice, e.g. Hookshot
Run it for a while (1 to 4 hours)

Expected behavior
Don't crash.

Log snippet
This Hookshot crash happened exactly one hour after starting the bot, making me think that it is a timeout or other internal event, instead of an incoming network event.

WARN 13:07:44:863 [Config] The `widgets.openIdOverrides` config value SHOULD NOT be used in a production environment.
INFO 13:07:44:873 [Appservice] Initialising Redis storage (on localhost:6379)
INFO 13:07:44:901 [Bridge] Starting up
INFO 13:07:44:901 [UserTokenStore] Loading token key file passkey.pem
INFO 13:07:44:920 [RedisASProvider] Successfully connected
INFO 13:07:44:942 [Bridge] Connecting to homeserver and fetching joined rooms..
INFO 13:07:44:966 [Config] Prefilling room membership for permissions for 0 rooms
INFO 13:07:45:174 [FeedConnection] Connection 0e343504299411063c7324cc5166baa8 created for !MBKReTGbuldfQteUbW:localhost, {"url":"https://chrpaul.de/feed.xml","label":"Christian Paul"}
INFO 13:07:45:176 [FeedConnection] Connection e125d0ada12e1f980c8dfdbaa4020344 created for !MBKReTGbuldfQteUbW:localhost, {"url":"https://www.tagesschau.de/xml/rss2/","label":"Tagesschau"}
INFO 13:07:45:232 [Bridge] Bridge is now ready. Found 2 connections
INFO 13:07:45:235 [ListenerService] Listening on http://0.0.0.0:9000 for webhooks
INFO 13:07:45:235 [ListenerService] Listening on http://127.0.0.1:9001 for metrics, provisioning
INFO 13:07:45:236 [ListenerService] Listening on http://0.0.0.0:9002 for widgets
INFO 13:09:15:146 [Appservice] Processing transaction 7935
INFO 13:09:15:147 [Appservice] Processing event of type m.room.message
INFO 13:09:15:160 [Appservice] 127.0.0.1 - - [10/Jan/2023:12:09:15 +0000] "PUT /transactions/7935?access_token=%3Credacted%3E HTTP/1.1" 200 2 "-" "Synapse/1.74.0"
…
INFO 13:35:15:521 [Appservice] Processing transaction 8038
INFO 13:35:15:521 [Appservice] Processing ephemeral event of type m.presence
INFO 13:35:15:522 [Appservice] 127.0.0.1 - - [10/Jan/2023:12:35:15 +0000] "PUT /transactions/8038?access_token=%3Credacted%3E HTTP/1.1" 200 2 "-" "Synapse/1.74.0"

ERROR 14:07:45:815 [Appservice] Failed to set up crypto on intent @hookshot:localhost [Error: IO error: could not acquire lock on "/home/jaller94/Git/matrix-hookshot/data/encryption/2d03b2ce09b8e551150e45d1418338aa0f43c531f79e8a34a1758bcbbbc850d0/matrix-sdk-crypto/db": Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }] {
  code: 'GenericFailure'
}
Error: IO error: could not acquire lock on "/home/jaller94/Git/matrix-hookshot/data/encryption/2d03b2ce09b8e551150e45d1418338aa0f43c531f79e8a34a1758bcbbbc850d0/matrix-sdk-crypto/db": Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }
error Command failed with exit code 1.

Additional context
I've seen the same issue in a project different to Hookshot. I don't have access to their code, but as Element employees we can request the code of this customer appservice.

The text was updated successfully, but these errors were encountered:

turt2live · 2023-01-10T16:13:13Z

my hunch is this would be the Intent object expiring, but not cleaning up crypto properly, so when it gets recreated it has a conflicting lock deep within the rust-sdk.

Half-Shot · 2023-01-18T10:53:08Z

Having looked at this, I'm unsure. The only expiry I can see for intents is the LRU cache, which defaults to 10000 entries. I'd certainly never expect a inactive hookshot to get close to that number. LRU doesn't expire objects until the maximum object threshold is reached, so my gut feeling is it's not expiry that's causing it.

That said: @jaller94 mentioned that the timeout was exactly an hour, and the intents have a TTL of a hour.

The sled database is the thing that's crying out here (spacejam/sled#1234). Sounds like Rust would impliclty cause the DB to be dropped when it falls out of scope, but I don't really know how that works in a JS world. I presume the act of the Intent dropping would cause the OlmMachine to drop, which would cause the Sled DB to drop but maybe there is something missing here.

Half-Shot · 2023-01-18T15:28:50Z

Right, so. Lots of digging led me to realise that the LRU store will evict things when their ttl is up, regardless of how many items are in the store. This leads to Intents being deleted and immediately recreated, which leads to the Sled DB conflicting with itself as it hasn't been GC'd.

Ultimately, Sled is a bit poor for this use case and we should be using something like Redis or Postgres (both of which appear to be in progress for the rust-sdk). In the meantime, my proposals would be to either disable the LRU, or check the lock status on the Sled DB prior to loading crypto.

Half-Shot · 2023-01-18T16:53:13Z

Checking the lock status is horrible in Node. I'm going to push for matrix-org/matrix-rust-sdk#1256 support to ultimately fix our DB woes, but in the meantime look at disabling the LRU for encrypted bridges.

turt2live · 2023-01-18T17:12:23Z

Disabling the LRU has memory exhaustion concerns, fwiw. It currently patches over a memory leak that'd be ideal to fix, but so does just throwing away objects.

The rust-sdk's destructor should be able to handle the lock status internally?

turt2live · 2023-01-18T17:35:32Z

from oob: hookshot's usecase is covered by disabling the LRU, with consequences being known due to urgency.

It's not clear why the rust-sdk isn't destructing itself properly, but the hope is we can just abandon Sled instead of thinking about it.

AndrewFerr · 2023-04-14T14:32:29Z

This is fixed by using the SQLite-based crypto store that was recently added to the rust-sdk.

rust-sdk change: feat(crypto-nodejs): OlmMachine.initialize takes a new optional store type matrix-org/matrix-rust-sdk#1521
bot-sdk PR (only in the vector-im fork for now): Allow usage of the SQLite-based crypto store element-hq/matrix-bot-sdk#10

Half-Shot · 2023-05-18T14:27:12Z

I think we can close this now @turt2live

turt2live · 2023-05-18T16:04:13Z

I'm still seeing some crashes come up, though agreed that the majority of cases appear to be fixed.

AndrewFerr · 2023-06-01T14:56:48Z

@turt2live are the crashes you saw from t2bot bridges? Do you have any stack traces / error logs available?

turt2live · 2023-06-01T14:58:37Z

They were t2bot.io bridges and local development. I don't have stacktraces on hand, sorry (there's other problems with the bridges which are taking my attention).

jaller94 added the bug Something isn't working label Jan 10, 2023

jaller94 changed the title ~~E2EE-eanbled AppServices crash a while (IO error: could not acquire lock)~~ E2EE-enabled AppServices crash a while (IO error: could not acquire lock) Jan 10, 2023

jaller94 changed the title ~~E2EE-enabled AppServices crash a while (IO error: could not acquire lock)~~ E2BE-enabled AppServices crash a while (IO error: could not acquire lock) Jan 10, 2023

turt2live added the type:appservices Appservice APIs label Jan 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

E2BE-enabled AppServices crash a while (IO error: could not acquire lock) #293

E2BE-enabled AppServices crash a while (IO error: could not acquire lock) #293

jaller94 commented Jan 10, 2023 •

edited

Loading

turt2live commented Jan 10, 2023

Half-Shot commented Jan 18, 2023

Half-Shot commented Jan 18, 2023

Half-Shot commented Jan 18, 2023

turt2live commented Jan 18, 2023

turt2live commented Jan 18, 2023

AndrewFerr commented Apr 14, 2023

Half-Shot commented May 18, 2023

turt2live commented May 18, 2023

AndrewFerr commented Jun 1, 2023

turt2live commented Jun 1, 2023

E2BE-enabled AppServices crash a while (IO error: could not acquire lock) #293

E2BE-enabled AppServices crash a while (IO error: could not acquire lock) #293

Comments

jaller94 commented Jan 10, 2023 • edited Loading

turt2live commented Jan 10, 2023

Half-Shot commented Jan 18, 2023

Half-Shot commented Jan 18, 2023

Half-Shot commented Jan 18, 2023

turt2live commented Jan 18, 2023

turt2live commented Jan 18, 2023

AndrewFerr commented Apr 14, 2023

Half-Shot commented May 18, 2023

turt2live commented May 18, 2023

AndrewFerr commented Jun 1, 2023

turt2live commented Jun 1, 2023

jaller94 commented Jan 10, 2023 •

edited

Loading