Relayer startup is slow #3454

Open · 4 of 8 tasks · Tracked by #3321
tkporter opened this issue Mar 20, 2024 · 10 comments

@tkporter (Collaborator) commented Mar 20, 2024

Problem

Solution

  • Fast delivery as outlined in Epic: Refactor Agent Indexing: better indexing positioning, relayer fast-delivery #3414 is closely related; in particular, moving away from the nonce-by-nonce reading in the db will help cut down the time to work on operations in the serial submitter. However, we'd still be slow at building merkle trees.
  • Maybe moving away from the current_thread flavor on tokio would help us build merkle trees more concurrently than we do now? (a sketch of what that could look like is below)
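A minimal sketch of switching the runtime flavor, purely illustrative (the relayer's actual entrypoint and thread count may differ):

```rust
use tokio::runtime::Builder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Multi-threaded scheduler so CPU-heavy work like rebuilding merkle trees
    // can proceed on several worker threads instead of one current_thread loop.
    let runtime = Builder::new_multi_thread()
        .worker_threads(8) // illustrative; defaults to the number of CPU cores
        .enable_all()
        .build()?;

    runtime.block_on(async {
        // spawn the relayer's tasks here
    });
    Ok(())
}
```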

Tasks

  1. relayer (assignee: daniel-savu)
@tkporter (Collaborator, Author)

I think part of the problem is also that the interface to rocksdb isn't async. So we block when performing rocksdb IO, and we sometimes do this in loops.

From https://ryhl.io/blog/async-what-is-blocking/:

> To give a sense of scale of how much time is too much, a good rule of thumb is no more than 10 to 100 microseconds between each .await. That said, this depends on the kind of application you are writing.

I wonder if it makes sense to move our DB operations to a spawn_blocking closure or something?

There seem to be places where we probably block for way longer than 100 microseconds, like when we call this for the first time on startup and it loops through tens of thousands of message nonces without ever hitting an .await:

```rust
fn try_get_unprocessed_message(&mut self) -> Result<Option<HyperlaneMessage>> {
    loop {
        // First, see if we can find the message so we can update the gauge.
        if let Some(message) = self.db.retrieve_message_by_nonce(self.message_nonce)? {
            // Update the latest nonce gauges
            self.metrics
                .max_last_known_message_nonce_gauge
                .set(message.nonce as i64);
            if let Some(metrics) = self.metrics.get(message.destination) {
                metrics.set(message.nonce as i64);
            }
            // If this message has already been processed, on to the next one.
            if !self
                .db
                .retrieve_processed_by_nonce(&self.message_nonce)?
                .unwrap_or(false)
            {
                return Ok(Some(message));
            } else {
                debug!(nonce=?self.message_nonce, "Message already marked as processed in DB");
                self.message_nonce += 1;
            }
        } else {
            trace!(nonce=?self.message_nonce, "No message found in DB for nonce");
            return Ok(None);
        }
    }
}
```
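A rough sketch of the spawn_blocking idea, written as a generic helper rather than the relayer's actual API (the bare `rocksdb` handle and the closure shape are assumptions):

```rust
use std::sync::Arc;
use tokio::task;

/// Run a synchronous RocksDB read on tokio's blocking thread pool so the
/// async executor thread isn't stalled by disk IO. `Arc<rocksdb::DB>` is
/// cheap to clone into the closure.
async fn read_blocking<T, F>(db: Arc<rocksdb::DB>, read: F) -> T
where
    F: FnOnce(&rocksdb::DB) -> T + Send + 'static,
    T: Send + 'static,
{
    task::spawn_blocking(move || read(&*db))
        .await
        .expect("blocking read task panicked")
}
```

Usage would look something like `let raw = read_blocking(db.clone(), move |db| db.get(key)).await?;` (with `key` owned so the closure is `'static`), at the cost of one thread hop per read.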

@nambrot (Contributor) commented Apr 19, 2024

as part of this, it might be nice to allow relayer operators to opt out of merkle tree processing

@tkporter (Collaborator, Author)

Ah, that's a good idea. #3414 is similar - we will no longer block on it, but will still do the work to eventually build the merkle tree. When we get closer to doing this we can consider the stakeholders & whether that's attractive.

@yorhodes (Member)

> as part of this, it might be nice to allow relayer operators to opt out of merkle tree processing

I assume this means backfill processing? We still need forward-fill merkle tree processing for the multisig ISMs.

@tkporter (Collaborator, Author) commented May 7, 2024

Chatted w/ @daniel-savu - we'll likely do this after the throughput work. Plan is to:

  1. just slap the multithreaded runtime on and see what gains (if any) we get over single thread, and whether we run into any weird concurrency issues like deadlocks
  2. consider what options we have when it comes to blocking rocksdb interactions - it'd be nice if we can make db interactions async in some way. It seems like the most common path is to just wrap db interactions with block_in_place (see the sketch after this list)? Some possibly useful resources:
    a. How to use rocksdb in/with async/await code? rust-rocksdb/rust-rocksdb#687
    b. Feature Request: Add an async interface rust-rocksdb/rust-rocksdb#822
    c. Remove blocking IO inside async code fedimint/fedimint#1528
    d. https://rocksdb.org/blog/2022/10/07/asynchronous-io-in-rocksdb.html
    e. Use block_in_place for synchronous Rocksdb operations fedimint/fedimint#1568
    f. A bonus, which I thought was interesting: https://www.reddit.com/r/rust/comments/10pf7m8/fish_shell_porting_to_rust_from_c/j6kxeui/?context=3
  3. if blocking rocksdb interactions seems futile, it'd still be nice to make sure that, in places like the one I describe in Relayer startup is slow #3454 (comment), we yield frequently
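A minimal sketch of the block_in_place option from item 2; the tokio and rust-rocksdb calls are real, but the wrapper itself is hypothetical:

```rust
use rocksdb::DB;
use tokio::task;

/// A synchronous RocksDB read wrapped in block_in_place, as it would be called
/// from async code: the scheduler moves other queued tasks off this worker
/// thread before we block it on disk IO. Note this only works on the
/// multi-threaded runtime - block_in_place panics on a current_thread runtime.
async fn get_without_starving(db: &DB, key: &[u8]) -> Result<Option<Vec<u8>>, rocksdb::Error> {
    task::block_in_place(|| db.get(key))
}
```

Compared to spawn_blocking this avoids a thread hop, but it still ties up the current worker thread for the duration of the read.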

@daniel-savu daniel-savu self-assigned this May 7, 2024
@daniel-savu daniel-savu moved this from Next Sprint to Sprint in Hyperlane Tasks May 7, 2024
@daniel-savu daniel-savu moved this from Sprint to In Progress in Hyperlane Tasks May 7, 2024
@daniel-savu (Contributor)

Instrumented tokio and was able to confirm that rocksdb IO is blocking, and there isn't much we can do to avoid that. The message processor tasks have almost zero idle time even after 5 mins, and the merkle tree processors aren't doing great either:
(screenshots, 2024-05-13: tokio-console task views)

RocksDB is write-optimized and synchronous, which is essentially the opposite of what we need. Our writes happen after indexing and after confirming a submission, which are network-bound tasks themselves - so the gain from having fast writes is almost zero.

On the other hand, we currently do one read for every message ever sent that passes the relayer whitelist (millions at this point). Even after parallelizing the relayer runtime, it takes 8.5 mins to start submitting to high-volume chains like Optimism.

We have two DB-IO-bound processors per chain (message and merkle_tree), and 20 chains in the hyperlane context. That means we'd need 40 cores (and growing) to parallelize every chain, or we'd have to shard by deploying on different machines. This is more trouble than it's worth for now.

We're opting for a simpler approach now:

  • instead of iterating the DB from nonce zero, store the last seen message and change the processor iteration logic to go forward-backward, always prioritizing more recent messages (see the sketch after this list). Old messages will still take very long to reach, but those are unlikely to have become processable anyway - whereas recent messages are very likely to be processable and are the main reason for prepare queue spikes.
  • start using a new db prefix that essentially stores a view of the main DB, but only with unprocessed messages. Old messages will be significantly quicker to process. Requirements:
    • a migration to populate the "view db"
    • a new write / delete interface that updates both the regular and the view dbs
    • changed iteration logic in the processor to iterate the view db without a nonce
    • delivery would still be bound on IGP indexing, which is currently forward indexed and needs refactoring
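A rough sketch of the forward-backward iteration idea from the first bullet; the cursor type and the exact alternation policy are illustrative, not the final implementation:

```rust
/// Alternates between newer and older nonces starting from the last seen
/// message, so recent (likely processable) messages are tried first while the
/// backfill toward nonce zero still makes progress.
struct ForwardBackwardCursor {
    forward: u32,          // next nonce at or above the last seen message
    backward: Option<u32>, // next nonce below it; None once we've reached zero
    go_forward: bool,
}

impl ForwardBackwardCursor {
    fn new(last_seen_nonce: u32) -> Self {
        Self {
            forward: last_seen_nonce,
            backward: last_seen_nonce.checked_sub(1),
            go_forward: true,
        }
    }

    /// Returns the next nonce to check, alternating directions.
    fn next_nonce(&mut self) -> u32 {
        match (self.go_forward, self.backward) {
            (false, Some(nonce)) => {
                self.backward = nonce.checked_sub(1);
                self.go_forward = true;
                nonce
            }
            _ => {
                let nonce = self.forward;
                self.forward += 1;
                self.go_forward = false;
                nonce
            }
        }
    }
}
```

The processor would seed the cursor with the last seen nonce and keep calling `next_nonce()`, so new messages get checked immediately while older ones are gradually backfilled.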

@daniel-savu daniel-savu moved this from In Progress to In Review in Hyperlane Tasks May 20, 2024
daniel-savu added a commit that referenced this issue May 22, 2024
### Description

- Started off adding tokio-metrics but then realised those are quite general, so while we do have instrumentation it's not exposed in our metrics endpoint
- Switched to adding [tokio-console](https://github.com/tokio-rs/console/tree/main), which does give insight into the lifetime of specific tasks, so we can check which ones take up a long time during relayer startup. These are only visible at the `dependencyTrace` log level, so don't affect performance in the `hyperlane` context.


### Related issues

- Helps debug #3454 and any future performance issues
- Does half the work for #3239 (still need to expose these in the metrics endpoint and import the grafana template)

tkporter pushed a commit that referenced this issue May 24, 2024
@avious00 (Contributor)

@tkporter @daniel-savu when you merge this can you ping @ltyu? syncing on sepolia was taking a long time for him, i think this addresses that

@daniel-savu (Contributor)

@ltyu this has mostly been fixed, you can use the latest commit on main (docker image 0cf692e-20240526-164442)

@daniel-savu (Contributor)

@tkporter reported that startup seems to be slow again. Only running with a subset of chains seems to fix this, so it's probably due to the high number of chains the omniscient relayer is currently operating. tokio-console indicates that the prepare_tasks are the issue, since they take up the most busy time of the runtime, particularly at startup. I wasn't able to narrow this down further, although I suspect we must be doing some CPU-intensive looping in there.

3 mins into a new relayer run, line 132 (the prepare task - here) takes most of the busy time:
(screenshot, 2024-06-25: tokio-console task list)

A view into one prepare task's lifecycle, showing how it takes up a lot of busy time on startup. With there already being >20 prepare tasks, it makes sense that some can't be scheduled, because the machine doesn't have that many cores.
(screenshot, 2024-06-25: tokio-console view of a single prepare task's lifecycle)
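For anyone reproducing this analysis, a minimal sketch of enabling the tokio-console instrumentation referenced above, assuming the `console-subscriber` crate (the relayer's actual wiring gates this behind its `dependencyTrace` log level):

```rust
// Cargo.toml (sketch):
//   console-subscriber = "0.2"
//   tokio = { version = "1", features = ["full", "tracing"] }
// Build with RUSTFLAGS="--cfg tokio_unstable" so tokio emits task instrumentation.

#[tokio::main]
async fn main() {
    // Registers the tracing layer that the `tokio-console` TUI connects to
    // (127.0.0.1:6669 by default) to show per-task busy vs. idle time.
    console_subscriber::init();

    // ... spawn relayer tasks as usual ...
}
```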

@daniel-savu (Contributor)

Last time we looked into this (June 2024), the relayer was running ~25 chains and we got startup down to ~2 mins. Now we're running ~60 chains in the hyperlane relayer and startup time is ~10 mins. We should investigate again.

@cmcewen cmcewen moved this from In Review to Backlog in Hyperlane Tasks Dec 27, 2024
Projects
Status: Backlog

5 participants