-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
E2BE-enabled AppServices crash a while (IO error: could not acquire lock) #293
Comments
my hunch is this would be the Intent object expiring, but not cleaning up crypto properly, so when it gets recreated it has a conflicting lock deep within the rust-sdk. |
Having looked at this, I'm unsure. The only expiry I can see for intents is the LRU cache, which defaults to 10000 entries. I'd certainly never expect a inactive hookshot to get close to that number. That said: @jaller94 mentioned that the timeout was exactly an hour, and the intents have a TTL of a hour. The sled database is the thing that's crying out here (spacejam/sled#1234). Sounds like Rust would impliclty cause the DB to be dropped when it falls out of scope, but I don't really know how that works in a JS world. I presume the act of the Intent dropping would cause the OlmMachine to drop, which would cause the Sled DB to drop but maybe there is something missing here. |
Right, so. Lots of digging led me to realise that the LRU store will evict things when their Ultimately, Sled is a bit poor for this use case and we should be using something like Redis or Postgres (both of which appear to be in progress for the rust-sdk). In the meantime, my proposals would be to either disable the LRU, or check the lock status on the Sled DB prior to loading crypto. |
Checking the lock status is horrible in Node. I'm going to push for matrix-org/matrix-rust-sdk#1256 support to ultimately fix our DB woes, but in the meantime look at disabling the LRU for encrypted bridges. |
Disabling the LRU has memory exhaustion concerns, fwiw. It currently patches over a memory leak that'd be ideal to fix, but so does just throwing away objects. The rust-sdk's destructor should be able to handle the lock status internally? |
from oob: hookshot's usecase is covered by disabling the LRU, with consequences being known due to urgency. It's not clear why the rust-sdk isn't destructing itself properly, but the hope is we can just abandon Sled instead of thinking about it. |
This is fixed by using the SQLite-based crypto store that was recently added to the rust-sdk.
|
I think we can close this now @turt2live |
I'm still seeing some crashes come up, though agreed that the majority of cases appear to be fixed. |
@turt2live are the crashes you saw from t2bot bridges? Do you have any stack traces / error logs available? |
They were t2bot.io bridges and local development. I don't have stacktraces on hand, sorry (there's other problems with the bridges which are taking my attention). |
Describe the bug
An appservice that enables E2BE-capabilities, may crash after a few hours. It seems to be based on a timer.
matrix-org/matrix-hookshot#609
To Reproduce
Expected behavior
Don't crash.
Log snippet
This Hookshot crash happened exactly one hour after starting the bot, making me think that it is a timeout or other internal event, instead of an incoming network event.
Additional context
I've seen the same issue in a project different to Hookshot. I don't have access to their code, but as Element employees we can request the code of this customer appservice.
The text was updated successfully, but these errors were encountered: