Relayer startup is slow #3454
I think part of the problem is also that the interface to rocksdb isn't async. So we block when performing rocksdb IO, and we sometimes do this in loops. From https://ryhl.io/blog/async-what-is-blocking/:
There seem to be places where we probably block for way longer than 100 microseconds, like when we call this for the first time on startup and it loops through tens of thousands of message nonces without ever hitting an .await: hyperlane-monorepo/rust/agents/relayer/src/msg/processor.rs, lines 119 to 148 in dcb67e9
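A minimal sketch of the kind of mitigation the blog suggests for loops like this: move the synchronous scan onto Tokio's blocking thread pool with `spawn_blocking` so it can't starve the executor. `MessageDb` and `retrieve_message_by_nonce` are hypothetical stand-ins for the relayer's actual store and accessor, not the real `processor.rs` code.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the relayer's RocksDB-backed message store.
struct MessageDb;

impl MessageDb {
    // In the real code this would be a synchronous RocksDB read.
    fn retrieve_message_by_nonce(&self, _nonce: u32) -> Option<Vec<u8>> {
        None
    }
}

// Run the long synchronous nonce scan on the blocking thread pool so other
// tasks can still be scheduled while it walks tens of thousands of nonces.
async fn find_first_missing_nonce(db: Arc<MessageDb>, start: u32) -> u32 {
    tokio::task::spawn_blocking(move || {
        let mut nonce = start;
        while db.retrieve_message_by_nonce(nonce).is_some() {
            nonce += 1;
        }
        nonce
    })
    .await
    .expect("blocking scan panicked")
}

#[tokio::main]
async fn main() {
    let next = find_first_missing_nonce(Arc::new(MessageDb), 0).await;
    println!("next unprocessed nonce: {next}");
}
```

When the work can be chunked instead, inserting `tokio::task::yield_now().await` between batches keeps the loop cooperative without moving it off the async runtime.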
As part of this, it might be nice to allow relayer operators to opt out of merkle tree processing.
Ah, that's a good idea. #3414 is similar - we will no longer block on it, but will still do the work to eventually build the merkle tree. When we get closer to doing this we can consider the stakeholders & whether that's attractive.
I assume this means backfill processing? We still need forward-fill merkle tree processing for the multisig ISMs.
Chatted w/ @daniel-savu - we'll likely do this after the throughput work. The plan is to:
Instrumented tokio and was able to confirm that RocksDB IO is blocking, and there isn't really anything we can do to avoid that. The message processor tasks have almost zero idle time even after 5 mins, and the merkle processors aren't doing great either.

RocksDB is write-optimized and synchronous, which is essentially the opposite of what we need. Our writes happen after indexing and after confirming a submission, which are network-bound tasks themselves - the gain from having fast writes is almost zero. On the other hand, we currently do one read for every message ever sent that passes the relayer whitelist (millions at this point). Even after parallelizing the relayer runtime, it takes 8.5 mins to start submitting to high-volume chains like Optimism. We have two DB-IO-bound processors per chain (…). We're opting for a simpler approach now:
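Purely as an illustration of cutting the one-read-per-historical-message cost (not necessarily the approach the team settled on), a processor could persist a single high-water mark and resume from it on startup instead of re-reading every message. `ProcessorState` and its fields are invented names for this sketch.

```rust
use std::sync::Mutex;

// Invented for illustration; in the relayer this state would live in RocksDB
// under a dedicated key rather than in memory.
#[derive(Default)]
struct ProcessorState {
    highest_processed_nonce: Mutex<Option<u32>>,
}

impl ProcessorState {
    // Called after a message is confirmed processed.
    fn record_processed(&self, nonce: u32) {
        let mut guard = self.highest_processed_nonce.lock().unwrap();
        *guard = Some(guard.map_or(nonce, |n| n.max(nonce)));
    }

    // One read at startup instead of one read per message ever sent.
    fn resume_from(&self) -> u32 {
        self.highest_processed_nonce
            .lock()
            .unwrap()
            .map_or(0, |n| n + 1)
    }
}

fn main() {
    let state = ProcessorState::default();
    state.record_processed(41);
    assert_eq!(state.resume_from(), 42);
}
```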
### Description

- Started off adding tokio-metrics but then realised those are quite general, so while we do have instrumentation it's not exposed in our metrics endpoint
- Switched to adding [tokio-console](https://github.com/tokio-rs/console/tree/main), which does give insight into the lifetime of specific tasks, so we can check which ones take up a long time during relayer startup. These are only visible at the `dependencyTrace` log level, so don't affect performance in the `hyperlane` context.

### Drive-by changes

<!-- Are there any minor or drive-by changes also included? -->

### Related issues

- Helps debug #3454 and any future performance issues
- Does half the work for #3239 (still need to expose these in the metrics endpoint and import the grafana template)

### Backward compatibility

<!-- Are these changes backward compatible? Are there any infrastructure implications, e.g. changes that would prohibit deploying older commits using this infra tooling? Yes/No -->

### Testing

<!-- What kind of testing have these changes undergone? None/Manual/Unit Tests -->
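For reference, a minimal sketch of how tokio-console is typically wired up with the `console-subscriber` crate; the monorepo's actual tracing setup is more involved and, per the description above, gated behind the `dependencyTrace` log level.

```rust
// Requires the `console-subscriber` crate and building with
// `RUSTFLAGS="--cfg tokio_unstable"` so tokio emits task instrumentation.
#[tokio::main]
async fn main() {
    // Installs a tracing subscriber layer and serves task telemetry
    // (by default on 127.0.0.1:6669) for the `tokio-console` CLI to read.
    console_subscriber::init();

    // ... spawn the agent's tasks here and inspect their poll/idle times.
    tokio::time::sleep(std::time::Duration::from_secs(60)).await;
}
```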
@tkporter @daniel-savu when you merge this, can you ping @ltyu? Syncing on Sepolia was taking a long time for him; I think this addresses that.
@ltyu this has mostly been fixed, you can use the latest commit on |
@tkporter reported that startup seems to be slow again. Only running with a subset of chains seems to fix this, so it's probably due to the high number of chains the omniscient relayer is currently operating. 3 mins into a new relayer run, line 132 (the prepare task - here) takes most of the busy time. (Screenshot: a view into one prepare task's lifecycle, showing how it takes up a lot of busy time on startup.) With more than 20 prepare tasks, it makes sense that some can't be scheduled, because the machine doesn't have that many cores.
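One way to picture the oversubscription (only an illustration of the scheduling problem, not the relayer's actual task structure): if each chain spawns its own busy prepare loop, a shared `Semaphore` sized to the core count caps how many run their hot section at once, so the remaining tasks still get scheduled.

```rust
use std::{sync::Arc, time::Duration};
use tokio::sync::Semaphore;

// Illustration only: cap concurrent "prepare" work to roughly the core count
// so 20+ per-chain loops don't all contend for the scheduler at once.
async fn prepare_loop(chain: &'static str, permits: Arc<Semaphore>) {
    loop {
        let permit = permits.acquire().await.expect("semaphore closed");
        // Stand-in for the busy portion of preparing a batch for `chain`.
        println!("preparing batch for {chain}");
        tokio::time::sleep(Duration::from_millis(50)).await;
        drop(permit);
        // Yield so other loops get a turn promptly.
        tokio::task::yield_now().await;
    }
}

#[tokio::main]
async fn main() {
    let cores = std::thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
    let permits = Arc::new(Semaphore::new(cores));
    for chain in ["optimism", "arbitrum", "base"] {
        tokio::spawn(prepare_loop(chain, permits.clone()));
    }
    tokio::time::sleep(Duration::from_secs(1)).await;
}
```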
Problem
Whitelist configuration
log here at 16:56:29 https://cloudlogging.app.goo.gl/XTcjMyFk8jN4DCe38
Solution
Tasks