v2.1: rpc: improve latency by not blocking worker threads polling IO notifications (backport of #3242) #4412

Open
wants to merge 1 commit into
base: v2.1

mergify[bot]
@mergify mergify bot commented Jan 11, 2025

Problem

Some RPC operations are CPU bound and run for a significant amount of time. These operations block worker threads that are also used to handle IO notifications, so notifications are not polled often enough and the whole RPC server can become slow and exhibit high latency. When latency gets high enough it can exceed request timeouts, leading to failed requests.

Summary of Changes

This PR makes some of the most CPU-expensive RPC methods use tokio::task::spawn_blocking to run their CPU-hungry code. This way the worker threads doing IO don't get blocked, and latency improves.

The methods changed so far include:

  • getMultipleAccounts
  • getProgramAccounts
  • getAccountInfo
  • getTokenAccountsByDelegate
  • getTokenAccountsByOwner

I'm not super familiar with RPC, so I changed the methods that, from reading the code, seem to load or copy a lot of data. Please feel free to suggest more!
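For readers less familiar with the pattern, here is a minimal std-only sketch of the idea: hand the CPU-bound part of a handler to another thread so the calling thread stays free. It is not the code in this PR. The PR uses tokio::task::spawn_blocking on the RPC runtime, whereas `offload` and `expensive_scan` below are hypothetical names made up for illustration.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a CPU-bound handler body, such as copying
/// and encoding many large accounts.
fn expensive_scan(n: u64) -> u64 {
    (0..n).map(|x| x % 7).sum()
}

/// Runs `job` on a separate OS thread and returns a receiver that yields
/// its result, so the caller can keep working until it needs the value.
fn offload<T, F>(job: F) -> mpsc::Receiver<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone; ignoring the send error is fine here.
        let _ = tx.send(job());
    });
    rx
}

fn main() {
    // Hand the heavy work to another thread...
    let pending = offload(|| expensive_scan(5_000_000));

    // ...while this thread stays free for cheap work (polling IO, light requests).
    let cheap_requests_served = (0..1_000).count();

    // Collect the result only when it is needed.
    let result = pending.recv().expect("worker thread panicked");
    println!("served {cheap_requests_served} cheap requests, scan result = {result}");
}
```

Unlike this sketch, which spawns one thread per job, the PR runs the work on a bounded blocking pool (see `rpc_blocking_threads`), so heavy requests cannot spawn an unbounded number of threads.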

Test plan

Methodology for selection of CPU defaults

Run this blocks benchmark script while tweaking CPU params. This was run on a 48 CPU machine.

| rpc_threads | rpc_blocking_threads | Average | Median | p90 | p99 |
| --- | --- | --- | --- | --- | --- |
| cpus | cpus / 2 | 21880 | 22136 | 22546 | 22572 |
| cpus | cpus / 4 | 20617 (-5.7%) | 20627 (-6.8%) | 21040 (-6.7%) | 21149 (-6.3%) |
| cpus | cpus / 8 | 21366 (-2.4%) | 21367 (-3.8%) | 21434 (-4.9%) | 21477 (-4.9%) |
| cpus / 2 | cpus / 2 | 21642 (-1.1%) | 21525 (-2.8%) | 23202 (+2.9%) | 23235 (+2.9%) |
| cpus / 2 | cpus / 4 | 20033 (-8.4%) | 20044 (-9.4%) | 20430 (-9.4%) | 20598 (-8.7%) |
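As a reading aid for the table, the sketch below shows how the `cpus / N` settings translate into thread counts on the 48-CPU machine, and how a percentage can be derived. It assumes the percentages are relative changes against the first row. `threads_for` and `percent_change` are hypothetical helpers, not functions from the codebase, and the clamp to at least one thread is an assumption of the sketch.

```rust
/// Thread count for a "cpus / divisor" setting. Clamping to at least 1 is
/// an assumption of this sketch, not something the PR states.
fn threads_for(cpus: usize, divisor: usize) -> usize {
    (cpus / divisor).max(1)
}

/// Relative change of `value` against `baseline`, in percent.
/// Negative means lower (better) latency.
fn percent_change(baseline: f64, value: f64) -> f64 {
    (value - baseline) / baseline * 100.0
}

fn main() {
    let cpus = 48; // machine size reported for the benchmark run

    // (rpc_threads divisor, rpc_blocking_threads divisor, Average) from the table.
    let rows = [
        (1, 2, 21880.0),
        (1, 4, 20617.0),
        (1, 8, 21366.0),
        (2, 2, 21642.0),
        (2, 4, 20033.0),
    ];
    let baseline = rows[0].2;

    for (rpc_div, blocking_div, average) in rows {
        println!(
            "rpc_threads={:>2} rpc_blocking_threads={:>2} average={} ({:+.1}%)",
            threads_for(cpus, rpc_div),
            threads_for(cpus, blocking_div),
            average,
            percent_change(baseline, average),
        );
    }
}
```

On 48 CPUs, `cpus / 4` works out to 12 blocking threads.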

Methodology

Using this script for computing metrics: https://gist.github.com/steveluscher/b4959b9601093b0009f1d7646217b030, I ran each of these accounts-cluster-bench suites before and after this PR:

  • account-info
  • block
  • blocks
  • first-available-block
  • multiple-accounts
  • slot
  • supply
  • token-accounts-by-delegate
  • token-accounts-by-owner
  • token-supply
  • transaction
  • transaction-parsed
  • version

Using a command similar to this:

 % (
       bash -c 'while ! curl -s localhost:8899/health | grep -q "ok"; do echo "Waiting for validator" && sleep 1; done;' \
           ${IFS# Set this higher if you want the test to run with more blocks having been committed } \
           && sleep 15 \
           && echo "Running bench" \
           && cd accounts-cluster-bench \
           && cargo run --release -- \
               -u l \
               --identity ~/.config/solana/id.json \
               ${IFS# Optional for benches that require token accounts} \
               ${IFS# https://gist.github.com/steveluscher/19261b5321f56a89dc75804070b61dc4} \
               ${IFS# --mint UhrKsjtPJJ8ndhSdrcCbQaiw8L8a6gH1FbmtJ4XpVJR } \
               --iterations 100 \
               --num-rpc-bench-threads 100 \
               --rpc-bench supply 2>&1 \
           | grep -Po "Supply average success_time: \K(\d+)" \
           | ~/stats.sh 
   ) \
       & (
           (cd accounts-cluster-bench && cargo build --release) \
           && (
               cd validator \
                   && rm -rf test-ledger/ \
                   && cargo run --release \
                       --manifest-path=./Cargo.toml \
                       --bin solana-test-validator -- \
                           ${IFS# Put this in ~/fixtures/ } \
                           ${IFS# https://gist.github.com/steveluscher/19261b5321f56a89dc75804070b61dc4 } \
                           --account-dir ~/fixtures \
                           --quiet
               ) \
       )
Average: 34293.3
Median: 31708.5
90th Percentile: 44640
99th Percentile: 45166

Note

You can adjust the sleep 15 if you want the validator to stack up more slots before starting the bench.

Warning

When running benches that require token accounts, supply a mint, space, and actually create the token account using the fixture found here.

Results

Warning

These results are a little skewed: the benchmark script emits averages over 3s windows, so the avg/p50/p90/p99 in this table are statistics of those window averages rather than of individual request times. Not exact, but directionally correct.
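To make the caveat concrete, here is a small sketch with made-up numbers that compares percentiles of raw latencies against percentiles of fixed-size window averages. The data, the window size, and all function names are hypothetical. It does not reproduce the benchmark or the stats script, and only shows why the two can disagree.

```rust
/// Nearest-rank percentile (`p` in 1..=100) of a non-empty slice.
fn percentile(values: &[f64], p: usize) -> f64 {
    assert!(!values.is_empty() && (1..=100).contains(&p));
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("latencies must not be NaN"));
    // Integer ceiling of p% of n gives the 1-based nearest rank.
    let rank = (p * sorted.len() + 99) / 100;
    sorted[rank - 1]
}

/// Arithmetic mean of a non-empty slice.
fn mean(values: &[f64]) -> f64 {
    values.iter().sum::<f64>() / values.len() as f64
}

/// Collapses samples into means of consecutive windows of `window` samples,
/// a stand-in for the script's 3s windows.
fn window_means(samples: &[f64], window: usize) -> Vec<f64> {
    samples.chunks(window).map(mean).collect()
}

/// Made-up data: 100 request latencies, all 10.0 except two slow outliers.
fn sample_latencies() -> Vec<f64> {
    let mut samples = vec![10.0; 100];
    samples[5] = 1000.0;
    samples[55] = 1000.0;
    samples
}

fn main() {
    let raw = sample_latencies();
    let windows = window_means(&raw, 10);

    println!("mean: raw={} windowed={}", mean(&raw), mean(&windows));
    for p in [50, 90, 99] {
        println!(
            "p{p}: raw={} windowed={}",
            percentile(&raw, p),
            percentile(&windows, p)
        );
    }
}
```

With this made-up data the mean survives windowing unchanged (the windows are equal-sized), but the windowed p99 understates the slowest requests and the windowed p90 overstates typical ones.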

Note

Filling in this grid would take a long time, especially if run against a mainnet RPC with production traffic. We may just choose to land this as ‘certainly better, how much we can't say exactly.’

| Suite | Average | Median | p90 | p99 |
| --- | --- | --- | --- | --- |
| account-info | TBD (-99%) | TBD (-99%) | TBD (-99%) | TBD (-99%) |
| block | TBD (-99%) | TBD (-99%) | TBD (-99%) | TBD (-99%) |
| blocks | TBD (-99%) | TBD (-99%) | TBD (-99%) | TBD (-99%) |
| first-available-block | TBD (-99%) | TBD (-99%) | TBD (-99%) | TBD (-99%) |
| multiple-accounts | TBD (-99%) | TBD (-99%) | TBD (-99%) | TBD (-99%) |
| slot | TBD (-99%) | TBD (-99%) | TBD (-99%) | TBD (-99%) |
| supply | TBD (-99%) | TBD (-99%) | TBD (-99%) | TBD (-99%) |
| token-accounts-by-delegate | TBD (-99%) | TBD (-99%) | TBD (-99%) | TBD (-99%) |
| token-accounts-by-owner | TBD (-99%) | TBD (-99%) | TBD (-99%) | TBD (-99%) |
| token-supply | TBD (-99%) | TBD (-99%) | TBD (-99%) | TBD (-99%) |
| transaction | TBD (-99%) | TBD (-99%) | TBD (-99%) | TBD (-99%) |
| transaction-parsed | TBD (-99%) | TBD (-99%) | TBD (-99%) | TBD (-99%) |
| version | TBD (-99%) | TBD (-99%) | TBD (-99%) | TBD (-99%) |

…cations (#3242)

* rpc: limit the number of blocking threads in tokio runtime

By default tokio allows up to 512 blocking threads. We don't want that
many threads, as they'd slow down other validator threads.

* rpc: make getMultipleAccounts async

Make the function async and use tokio::task::spawn_blocking() to execute
CPU-bound code in background. This prevents stalling the worker threads
polling IO notifications and serving other non CPU-bound rpc methods.

* rpc: make getAccount async

* rpc: run get_filtered_program_accounts with task::spawn_blocking

get_filtered_program_accounts can be used to retrieve _a list_ of
accounts that match some filters. This is CPU bound and can block the
calling thread for a significant amount of time when copying many/large
accounts.

* rpc: use our custom runtime to spawn blocking tasks

Pass the custom runtime to JsonRpcRequestProcessor and use it to spawn
blocking tasks from rpc methods.

* Make `get_blocks()` and `get_block()` yieldy

When these methods reach out to Blockstore, yield the thread

* Make `get_supply()` yieldy

When this method reaches out to accounts_db (through a call to `calculate_non_circulating_supply()`), yield the thread.

* Make `get_first_available_block()` yieldy

When this method reaches out to blockstore, yield the thread

* Make `get_transaction()` yieldy

When this method reaches out to blockstore, yield the thread

* Make `get_token_supply()` yieldy

When this method reaches out to methods on bank that do reads, yield the thread

* Make the choice of `cpus / 4` as the default for `rpc_blocking_threads`

* Encode blocks async

* Revert "Make `get_first_available_block()` yieldy"

This blockstore method doesn't actually do expensive reads.

This reverts commit 3bbc57f.

* Revert "Make `get_blocks()` and `get_block()` yieldy"

Kept the `spawn_blocking` around:

* Call to `get_rooted_block`
* Call to `get_complete_block`

This reverts commit 710f9c6.

* Revert "Make `get_token_supply()` yieldy"

* Reverted the change to `interest_bearing_config`
* Reverted moving `bank.get_account(&mint)` to the background pool

This reverts commit 02f5c94.

* Share spawned call to `calculate_non_circulating_supply` between `get_supply` and `get_largest_accounts`

* Create a shim for `get_filtered_indexed_accounts` that sends the work to the background thread internally

* Send call to `get_largest_accounts` to the background pool

---------

Co-authored-by: Steven Luscher <[email protected]>
(cherry picked from commit c6f3e1b)
@mergify mergify bot requested a review from a team as a code owner January 11, 2025 03:21

mergify bot commented Jan 11, 2025

If this PR represents a change to the public RPC API:

  1. Make sure it includes a complementary update to rpc-client/ (example)
  2. Open a follow-up PR to update the JavaScript client @solana/web3.js (example)

Thank you for keeping the RPC clients in sync with the server API @mergify[bot].
