Forward-Backward Message Processor #3796

Closed
Tracked by #3321
daniel-savu opened this issue May 16, 2024 · 0 comments · Fixed by #3775

To lower the lead time until the relayer starts submitting messages, start iterating the message db from the highest known nonce instead of from zero, and then use a forward-backward strategy to eventually process the old messages as well. Context below:

I instrumented tokio and was able to confirm that RocksDB IO is blocking, and there isn't really anything we can do to avoid that. The message processor tasks have almost zero idle time even after 5 mins, and the merkle processors aren't doing great either:

[Two screenshots, dated 2024-05-13: tokio instrumentation output]

RocksDB is write-optimized and synchronous, which is essentially the opposite of what we need. Our writes happen after indexing and after confirming a submission, which are network-bound tasks themselves, so the gain from having fast writes is almost zero.

On the other hand, we currently do one read for every message ever sent that passes the relayer whitelist (millions at this point). Even after parallelizing the relayer runtime, it takes 8.5 mins to start submitting to high volume chains like Optimism.

We have two DB-IO-bound processors per chain (message and merkle_tree), and 20 chains on the hyperlane context. This means we'd need 40 cores (and growing) to parallelize each chain, or we'd have to shard by deploying on different machines. This is more trouble than it's worth for now.

We're opting for a simpler approach now:

  • instead of iterating the DB from nonce zero, store the last seen message and change the processor iteration logic to go forward-backward (always prioritizing more recent messages). Old messages will still take very long to reach, but those are unlikely to have become processable anyway - whereas recent messages are very likely to be processable and the main reason for prepare queue spikes.

  • start using a new db prefix that essentially stores a view of the main DB, but only with unprocessed messages. Old messages will be significantly quicker to process. Requirements:

    • migration to populate the "view db"

    • new write / delete interface that updates both the regular and the view dbs

    • changed iteration logic in the processor to iterate the view db without a nonce

    • delivery would still be bound on IGP indexing, which is currently forward indexed and needs refactoring
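
The "view db" idea in the second bullet can be sketched roughly as follows. This is a minimal in-memory illustration, not the relayer's code: `MessageStore`, `store_message`, `mark_processed`, and `pending_nonces` are hypothetical names, and two `BTreeMap`s stand in for the two RocksDB key prefixes.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory stand-in for the regular db and the view db.
#[derive(Default)]
struct MessageStore {
    /// Regular db: every indexed message, keyed by nonce.
    all: BTreeMap<u32, Vec<u8>>,
    /// View db: only the messages that have not been processed yet.
    unprocessed: BTreeMap<u32, Vec<u8>>,
}

impl MessageStore {
    /// Write interface: a new message goes into both the regular and the view db.
    fn store_message(&mut self, nonce: u32, body: Vec<u8>) {
        self.all.insert(nonce, body.clone());
        self.unprocessed.insert(nonce, body);
    }

    /// Delete interface: a processed message leaves the view but stays in the regular db.
    /// Returns false if the nonce was not pending.
    fn mark_processed(&mut self, nonce: u32) -> bool {
        self.unprocessed.remove(&nonce).is_some()
    }

    /// Iterate the view directly (newest first), without probing every nonce.
    fn pending_nonces(&self) -> Vec<u32> {
        self.unprocessed.keys().rev().copied().collect()
    }
}
```

With this shape, the cost of a pass over pending work depends on the number of unprocessed messages rather than on every message ever sent.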

Originally posted by @daniel-savu in #3454

@daniel-savu daniel-savu self-assigned this May 16, 2024
This was referenced May 16, 2024
@daniel-savu daniel-savu moved this to In Review in Hyperlane Tasks May 16, 2024
daniel-savu added a commit that referenced this issue May 24, 2024
### Description

- adds logic for iterating message nonces in the processor task in
forward-backward fashion
- removes all manual calls that were updating the next nonce to query.
This is done automatically in the iterator struct now
- stores the highest known message nonce in the db, which is used to
initialize the iterator
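
As a rough illustration of the forward-backward iteration described above, here is a minimal sketch. It assumes the iterator prefers the forward direction whenever the next higher nonce is already indexed and otherwise steps backward toward zero. The struct layout and the `next_nonce` / `is_indexed` names are hypothetical and may not match the merged implementation.

```rust
/// Hypothetical forward-backward nonce iterator (a sketch, not the relayer's struct).
struct ForwardBackwardIterator {
    /// Next nonce to try in the forward (newer) direction.
    next_forward: u32,
    /// Next nonce to try in the backward (older) direction; None once nonce 0 was returned.
    next_backward: Option<u32>,
}

impl ForwardBackwardIterator {
    /// Start from the highest known nonce stored in the db.
    fn new(highest_known_nonce: u32) -> Self {
        Self {
            next_forward: highest_known_nonce,
            next_backward: highest_known_nonce.checked_sub(1),
        }
    }

    /// Return the next nonce to process, always prioritizing recent messages.
    /// `is_indexed` reports whether a message with the given nonce is in the db.
    fn next_nonce(&mut self, is_indexed: impl Fn(u32) -> bool) -> Option<u32> {
        if is_indexed(self.next_forward) {
            let nonce = self.next_forward;
            self.next_forward = self.next_forward.saturating_add(1);
            return Some(nonce);
        }
        // Nothing new in the forward direction: fall back to older messages.
        let nonce = self.next_backward?;
        self.next_backward = nonce.checked_sub(1);
        Some(nonce)
    }
}
```

Because the iterator advances its own cursors, callers do not need to update the next nonce manually.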

### Drive-by changes

- Converts the concrete HyperlaneDb type stored in the processor to a
trait object, to enable mocking db responses
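
A minimal sketch of what holding the db behind a trait object makes possible in tests. The trait `MessageDb`, its single method, `MockDb`, and `Processor` are hypothetical names chosen for illustration; the real trait and its methods are not shown in this issue.

```rust
use std::collections::HashMap;

/// Hypothetical trait covering the db reads the processor needs.
trait MessageDb {
    fn retrieve_message_by_nonce(&self, nonce: u32) -> Option<Vec<u8>>;
}

/// Test double that serves canned responses from memory instead of RocksDB.
struct MockDb {
    messages: HashMap<u32, Vec<u8>>,
}

impl MessageDb for MockDb {
    fn retrieve_message_by_nonce(&self, nonce: u32) -> Option<Vec<u8>> {
        self.messages.get(&nonce).cloned()
    }
}

/// The processor depends on the trait rather than a concrete db type,
/// so a test can swap in MockDb.
struct Processor {
    db: Box<dyn MessageDb>,
}

impl Processor {
    fn has_message(&self, nonce: u32) -> bool {
        self.db.retrieve_message_by_nonce(nonce).is_some()
    }
}
```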

### Related issues

- Fixes #3796
- Fixes #3816

### Backward compatibility

Yes - if there is no db entry for the new prefix, the processor starts
from nonce zero, so no migration is required
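
The fallback can be pictured like this. It is a sketch only: the key name and the `HashMap` standing in for the db are assumptions, not the actual prefix or storage API.

```rust
use std::collections::HashMap;

/// Hypothetical key under the new prefix; the real key name is not given here.
const HIGHEST_SEEN_NONCE_KEY: &str = "highest_seen_message_nonce";

/// Pick the nonce the iterator starts from.
/// A db written by an older relayer has no entry, so iteration starts at zero.
fn initial_nonce(db: &HashMap<String, u32>) -> u32 {
    db.get(HIGHEST_SEEN_NONCE_KEY).copied().unwrap_or(0)
}
```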

### Testing

Unit testing + e2e
@github-project-automation github-project-automation bot moved this from In Review to Done in Hyperlane Tasks May 24, 2024
yorhodes pushed a commit that referenced this issue May 26, 2024