Relayer startup is slow #3454

Open · 4 of 8 tasks · Tracked by #3321
tkporter opened this issue Mar 20, 2024 · 10 comments

@tkporter (Collaborator) commented Mar 20, 2024

Problem

Solution

  • Fast delivery as outlined in Epic: Refactor Agent Indexing: better indexing positioning, relayer fast-delivery #3414 is closely related; in particular, moving away from the nonce-by-nonce reading in the db will help cut down the time to work on operations in the serial submitter. However, we'd still be slow at building merkle trees.
  • Maybe moving away from the current_thread flavor on tokio would help us build merkle trees more concurrently than we do now? (a sketch of what that could look like is below)
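A minimal sketch of switching the runtime flavor, purely illustrative (the relayer's actual entrypoint and thread count may differ):

```rust
use tokio::runtime::Builder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Multi-threaded scheduler so CPU-heavy work like rebuilding merkle trees
    // can proceed on several worker threads instead of one current_thread loop.
    let runtime = Builder::new_multi_thread()
        .worker_threads(8) // illustrative; defaults to the number of CPU cores
        .enable_all()
        .build()?;

    runtime.block_on(async {
        // spawn the relayer's tasks here
    });
    Ok(())
}
```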

Tasks

  1. relayer (assignee: daniel-savu)
@tkporter (Collaborator, Author)

I think part of the problem is also that the interface to rocksdb isn't async. So we block when performing rocksdb IO, and we sometimes do this in loops.

From https://ryhl.io/blog/async-what-is-blocking/:

> To give a sense of scale of how much time is too much, a good rule of thumb is no more than 10 to 100 microseconds between each .await. That said, this depends on the kind of application you are writing.

I wonder if it makes sense to move our DB operations to a spawn_blocking closure or something?

There seem to be places where we probably block for way longer than 100 microseconds, like when we call this for the first time on startup and it loops through tens of thousands of message nonces without ever hitting an .await:

```rust
fn try_get_unprocessed_message(&mut self) -> Result<Option<HyperlaneMessage>> {
    loop {
        // First, see if we can find the message so we can update the gauge.
        if let Some(message) = self.db.retrieve_message_by_nonce(self.message_nonce)? {
            // Update the latest nonce gauges
            self.metrics
                .max_last_known_message_nonce_gauge
                .set(message.nonce as i64);
            if let Some(metrics) = self.metrics.get(message.destination) {
                metrics.set(message.nonce as i64);
            }
            // If this message has already been processed, on to the next one.
            if !self
                .db
                .retrieve_processed_by_nonce(&self.message_nonce)?
                .unwrap_or(false)
            {
                return Ok(Some(message));
            } else {
                debug!(nonce=?self.message_nonce, "Message already marked as processed in DB");
                self.message_nonce += 1;
            }
        } else {
            trace!(nonce=?self.message_nonce, "No message found in DB for nonce");
            return Ok(None);
        }
    }
}
```
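A rough sketch of the spawn_blocking idea, written as a generic helper rather than the relayer's actual API (the bare `rocksdb` handle and the closure shape are assumptions):

```rust
use std::sync::Arc;
use tokio::task;

/// Run a synchronous RocksDB read on tokio's blocking thread pool so the
/// async executor thread isn't stalled by disk IO. `Arc<rocksdb::DB>` is
/// cheap to clone into the closure.
async fn read_blocking<T, F>(db: Arc<rocksdb::DB>, read: F) -> T
where
    F: FnOnce(&rocksdb::DB) -> T + Send + 'static,
    T: Send + 'static,
{
    task::spawn_blocking(move || read(&*db))
        .await
        .expect("blocking read task panicked")
}
```

Usage would look something like `let raw = read_blocking(db.clone(), move |db| db.get(key)).await?;` (with `key` owned so the closure is `'static`), at the cost of one thread hop per read.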

@nambrot (Contributor) commented Apr 19, 2024

as part of this, it might be nice to allow relayer operators to opt out of merkle tree processing

@tkporter (Collaborator, Author)

Ah, that's a good idea. #3414 is similar - we will no longer block on it, but will still do the work to eventually build the merkle tree. When we get closer to doing this we can consider the stakeholders & whether that's attractive.

@yorhodes (Member)

> as part of this, it might be nice to allow relayer operators to opt out of merkle tree processing

I assume this means backfill processing? We still need forward-fill merkle tree processing for the multisig ISMs.

@tkporter (Collaborator, Author) commented May 7, 2024

Chatted w/ @daniel-savu - we'll likely do this after the throughput work. Plan is to:

  1. just slap the multithreaded runtime on and see what gains (if any) we get over single thread, and whether we run into any weird concurrency issues like deadlocks
  2. consider what options we have when it comes to blocking rocksdb interactions - it'd be nice if we can make db interactions async in some way. It seems like the most common path is to just wrap db interactions with block_in_place (see the sketch after this list)? Some possibly useful resources:
    a. How to use rocksdb in/with async/await code? rust-rocksdb/rust-rocksdb#687
    b. Feature Request: Add an async interface rust-rocksdb/rust-rocksdb#822
    c. Remove blocking IO inside async code fedimint/fedimint#1528
    d. https://rocksdb.org/blog/2022/10/07/asynchronous-io-in-rocksdb.html
    e. Use block_in_place for synchronous Rocksdb operations fedimint/fedimint#1568
    f. A bonus, which I thought was interesting: https://www.reddit.com/r/rust/comments/10pf7m8/fish_shell_porting_to_rust_from_c/j6kxeui/?context=3
  3. if blocking rocksdb interactions seems futile, it'd still be nice to make sure that, in places like the one I describe in Relayer startup is slow #3454 (comment), we yield frequently
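A minimal sketch of the block_in_place option from item 2; the tokio and rust-rocksdb calls are real, but the wrapper itself is hypothetical:

```rust
use rocksdb::DB;
use tokio::task;

/// A synchronous RocksDB read wrapped in block_in_place, as it would be called
/// from async code: the scheduler moves other queued tasks off this worker
/// thread before we block it on disk IO. Note this only works on the
/// multi-threaded runtime - block_in_place panics on a current_thread runtime.
async fn get_without_starving(db: &DB, key: &[u8]) -> Result<Option<Vec<u8>>, rocksdb::Error> {
    task::block_in_place(|| db.get(key))
}
```

Compared to spawn_blocking this avoids a thread hop, but it still ties up the current worker thread for the duration of the read.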

@daniel-savu daniel-savu self-assigned this May 7, 2024
@daniel-savu daniel-savu moved this from Next Sprint to Sprint in Hyperlane Tasks May 7, 2024
@daniel-savu daniel-savu moved this from Sprint to In Progress in Hyperlane Tasks May 7, 2024
@daniel-savu (Contributor)

Instrumented tokio and was able to confirm that rocksdb IO is blocking, and there isn't much we can do to avoid that. The message processor tasks have almost zero idle time even after 5 mins, and the merkle tree processors aren't doing great either:
(screenshots, 2024-05-13: tokio-console task views)

RocksDB is write-optimized and synchronous, which is essentially the opposite of what we need. Our writes happen after indexing and after confirming a submission, which are network-bound tasks themselves - so the gain from having fast writes is almost zero.

On the other hand, we currently do one read for every message ever sent that passes the relayer whitelist (millions at this point). Even after parallelizing the relayer runtime, it takes 8.5 mins to start submitting to high-volume chains like Optimism.

We have two DB-IO-bound processors per chain (message and merkle_tree), and 20 chains in the hyperlane context. That means we'd need 40 cores (and growing) to parallelize every chain, or we'd have to shard by deploying on different machines. This is more trouble than it's worth for now.

We're opting for a simpler approach now:

  • instead of iterating the DB from nonce zero, store the last seen message and change the processor iteration logic to go forward-backward, always prioritizing more recent messages (see the sketch after this list). Old messages will still take very long to reach, but those are unlikely to have become processable anyway - whereas recent messages are very likely to be processable and are the main reason for prepare queue spikes.
  • start using a new db prefix that essentially stores a view of the main DB, but only with unprocessed messages. Old messages will be significantly quicker to process. Requirements:
    • a migration to populate the "view db"
    • a new write / delete interface that updates both the regular and the view dbs
    • changed iteration logic in the processor to iterate the view db without a nonce
    • delivery would still be bound on IGP indexing, which is currently forward indexed and needs refactoring
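A rough sketch of the forward-backward iteration idea from the first bullet; the cursor type and the exact alternation policy are illustrative, not the final implementation:

```rust
/// Alternates between newer and older nonces starting from the last seen
/// message, so recent (likely processable) messages are tried first while the
/// backfill toward nonce zero still makes progress.
struct ForwardBackwardCursor {
    forward: u32,          // next nonce at or above the last seen message
    backward: Option<u32>, // next nonce below it; None once we've reached zero
    go_forward: bool,
}

impl ForwardBackwardCursor {
    fn new(last_seen_nonce: u32) -> Self {
        Self {
            forward: last_seen_nonce,
            backward: last_seen_nonce.checked_sub(1),
            go_forward: true,
        }
    }

    /// Returns the next nonce to check, alternating directions.
    fn next_nonce(&mut self) -> u32 {
        match (self.go_forward, self.backward) {
            (false, Some(nonce)) => {
                self.backward = nonce.checked_sub(1);
                self.go_forward = true;
                nonce
            }
            _ => {
                let nonce = self.forward;
                self.forward += 1;
                self.go_forward = false;
                nonce
            }
        }
    }
}
```

The processor would seed the cursor with the last seen nonce and keep calling `next_nonce()`, so new messages get checked immediately while older ones are gradually backfilled.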

@daniel-savu daniel-savu moved this from In Progress to In Review in Hyperlane Tasks May 20, 2024
daniel-savu added a commit that referenced this issue May 22, 2024
### Description

- Started off adding tokio-metrics but then realised those are quite general, so while we do have instrumentation it's not exposed in our metrics endpoint
- Switched to adding [tokio-console](https://github.com/tokio-rs/console/tree/main), which does give insight into the lifetime of specific tasks, so we can check which ones take up a long time during relayer startup. These are only visible at the `dependencyTrace` log level, so don't affect performance in the `hyperlane` context.


### Related issues

- Helps debug #3454 and any future performance issues
- Does half the work for #3239 (still need to expose these in the metrics endpoint and import the grafana template)

tkporter pushed a commit that referenced this issue May 24, 2024
@avious00 (Contributor)

@tkporter @daniel-savu when you merge this can you ping @ltyu? syncing on sepolia was taking a long time for him, i think this addresses that

@daniel-savu (Contributor)

@ltyu this has mostly been fixed, you can use the latest commit on main (docker image 0cf692e-20240526-164442)

@daniel-savu (Contributor)

@tkporter reported that startup seems to be slow again. Only running with a subset of chains seems to fix this, so it's probably due to the high number of chains the omniscient relayer is currently operating. tokio-console indicates that the prepare_tasks are the issue, since they take up the most busy time of the runtime, particularly at startup. I wasn't able to narrow this down further, although I suspect we must be doing some CPU-intensive looping in there.

3 mins into a new relayer run, line 132 (the prepare task - here) takes most of the busy time:
(screenshot, 2024-06-25: tokio-console task list)

A view into one prepare task's lifecycle, showing how it takes up a lot of busy time on startup. With there already being >20 prepare tasks, it makes sense that some can't be scheduled, because the machine doesn't have that many cores.
(screenshot, 2024-06-25: tokio-console view of a single prepare task's lifecycle)
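For anyone reproducing this analysis, a minimal sketch of enabling the tokio-console instrumentation referenced above, assuming the `console-subscriber` crate (the relayer's actual wiring gates this behind its `dependencyTrace` log level):

```rust
// Cargo.toml (sketch):
//   console-subscriber = "0.2"
//   tokio = { version = "1", features = ["full", "tracing"] }
// Build with RUSTFLAGS="--cfg tokio_unstable" so tokio emits task instrumentation.

#[tokio::main]
async fn main() {
    // Registers the tracing layer that the `tokio-console` TUI connects to
    // (127.0.0.1:6669 by default) to show per-task busy vs. idle time.
    console_subscriber::init();

    // ... spawn relayer tasks as usual ...
}
```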

@daniel-savu (Contributor)

Last time we looked into this (June 2024), the relayer was running ~25 chains and we got startup down to ~2 mins. Now we're running ~60 chains in the hyperlane relayer and startup time is ~10 mins. We should investigate again.

@cmcewen cmcewen moved this from In Review to Backlog in Hyperlane Tasks Dec 27, 2024
Projects
Status: Backlog

5 participants