rpc: improve latency by not blocking worker threads polling IO notifications #3242
Conversation
This is a rebase of solana-labs#34675
If this PR represents a change to the public RPC API:
Thank you for keeping the RPC clients in sync with the server API @alessandrod.
Some other methods that should probably go in background (intel from RPC provider 🙏 )
validator/src/cli.rs
Outdated
@@ -2325,6 +2335,7 @@ impl DefaultArgs {
        .retry_pool_max_size
        .to_string(),
    rpc_threads: num_cpus::get().to_string(),
    rpc_blocking_threads: (num_cpus::get() / 2).to_string(),
this needs to be set a lot lower
I'll tweak these and see how the choice of cpus affects the benchmark latencies.
I'll keep going with this.
- Apply the same mitigation to:
  - getConfirmedBlocks
  - getSupply
  - getFirstAvailableBlock
  - getConfirmedTransaction
  - getBlock
  - getTokenSupply
- Add `RpcBench` routines to cover all of them
- Tweak the CPU numbers to find some lower bound
- Paste before/after benchmark results into the PR description
Force-pushed from 5435ef5 to e1ab1b6.
Force-pushed from e1ab1b6 to 30e4e0a.
@t-nelson @alessandrod: I've completed a first pass, and I'm working on blasting out some measurements right now. Feel free to ignore this, but if you would like to take an early crack at feedback, please feel free. Also: #4048 when you have a sec.
Great stuff! Just one note about selecting the right number of rpc_threads: I think that we'll need help from operators to properly test that part, because the results you'll get from the bench wrt number of threads won't be representative of what happens while doing RPC requests against a mnb node, which has 400+ threads busy doing other stuff. As a default I think we should be fairly conservative and use small numbers.
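For reference, the two knobs under discussion map onto tokio's runtime builder. A sketch with placeholder numbers (not the PR's defaults) and a made-up thread name:

```rust
use tokio::runtime::{Builder, Runtime};

fn build_rpc_runtime() -> std::io::Result<Runtime> {
    Builder::new_multi_thread()
        .worker_threads(8)        // rpc_threads: these poll IO notifications
        .max_blocking_threads(2)  // rpc_blocking_threads: run spawn_blocking work
        .thread_name("rpcWorker") // hypothetical name for illustration
        .enable_all()
        .build()
}
```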
I've completed some measurements, as seen in the chart, but I can't say that I really know what to set it to. Given that this is configurable via CLI arguments, what's your guidance?
Hey @NicolasPennie, @WilfredAlmeida, would you be willing to build an RPC using this PR, tweak …
Force-pushed from 6fcf0b2 to 234f234.
Computing the benchmarks using the script in the PR description will take a long time. That said, I have not run a benchmark locally yet that hasn't shown positive results. I invite you to take the position that this is ‘probably better’ without the need to quantify exactly how much, and to iterate from here.
When this method reaches out to blockstore, yield the thread
When this method reaches out to methods on bank that do reads, yield the thread
Force-pushed from b6b71a7 to 02f5c94.
Updated!
Generally looks great! Left some comments
rpc/src/rpc.rs
Outdated
})?)
self.runtime
    .spawn_blocking(move || {
        bank.get_filtered_indexed_accounts(
Maybe we should make `get_filtered_indexed_accounts` do the `spawn_blocking` internally, like `get_filtered_program_accounts` and `get_filtered_spl_token_accounts_by_owner`?
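A sketch of the suggested shape, with hypothetical stand-in types and signatures (the real method takes many more parameters): the `spawn_blocking` lives inside the method, so every call site gets the yielding behavior for free.

```rust
use std::sync::Arc;
use tokio::runtime::Runtime;

// Hypothetical stand-ins for the real Bank and account types.
struct Bank;
impl Bank {
    fn scan_indexed_accounts(&self, _index_key: u64) -> Vec<Vec<u8>> {
        Vec::new() // CPU-bound scan/copy in the real implementation
    }
}

struct JsonRpcRequestProcessor {
    runtime: Arc<Runtime>,
}

impl JsonRpcRequestProcessor {
    // Callers just .await; they never have to remember to yield themselves.
    async fn get_filtered_indexed_accounts(
        &self,
        bank: Arc<Bank>,
        index_key: u64,
    ) -> Vec<Vec<u8>> {
        self.runtime
            .spawn_blocking(move || bank.scan_indexed_accounts(index_key))
            .await
            .expect("Failed to spawn blocking task")
    }
}
```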
rpc/src/rpc.rs
Outdated
.spawn_blocking({
    let bank = bank.clone();
    move || {
        calculate_non_circulating_supply(&bank).map_err(|e| RpcCustomError::ScanError {
`calculate_non_circulating_supply` is called from some other methods, maybe we should make it async?
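A sketch of what making it async could look like, with hypothetical stand-in types: once the helper owns the `spawn_blocking`, every caller shares the same non-blocking path instead of wrapping it at each call site.

```rust
use std::sync::Arc;
use tokio::runtime::Runtime;

struct Bank; // hypothetical stand-in
type NonCirculatingSupply = u64; // hypothetical payload

fn scan_non_circulating(_bank: &Bank) -> NonCirculatingSupply {
    0 // the expensive account scan in the real code
}

// Async wrapper: getSupply, getLargestAccounts, etc. all go through the
// blocking pool without having to remember to do it themselves.
async fn calculate_non_circulating_supply(
    runtime: &Runtime,
    bank: &Arc<Bank>,
) -> NonCirculatingSupply {
    let bank = Arc::clone(bank);
    runtime
        .spawn_blocking(move || scan_non_circulating(&bank))
        .await
        .expect("Failed to spawn blocking task")
}
```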
rpc/src/rpc.rs
Outdated
.runtime
    .spawn_blocking({
        let blockstore = Arc::clone(&self.blockstore);
        move || blockstore.get_first_available_block().unwrap_or_default()
What's the reasoning for doing this in the bg pool? Looking at the implementation it doesn't look slow?
That's fair. My operating assumption was that every time we talk to blockstore we have to presume that it might block for {reasons}. Looking into this call for `first_available_block`, I can't see anything that would.
rpc/src/rpc.rs
Outdated
move || blockstore.get_rooted_block(slot, true)
})
.await
.expect("Failed to spawn blocking task");
self.check_blockstore_root(&result, slot)?;
let encode_block = |confirmed_block: ConfirmedBlock| -> Result<UiConfirmedBlock> {
encoding probably needs to go in the blocking pool too?
Can do! Is this just because it's CPU hungry, or because there's an IOWait in there somewhere I haven't found yet?
CPU hungry on large accounts (accounts can be up to 10MB)
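A sketch of moving the encoding onto the blocking pool, with hypothetical block types and an `encode` function standing in for the real encoding closure:

```rust
use std::sync::Arc;
use tokio::runtime::Runtime;

// Hypothetical stand-ins for the real block types.
struct ConfirmedBlock;
struct UiConfirmedBlock;

fn encode(block: ConfirmedBlock) -> UiConfirmedBlock {
    let _ = block; // CPU-heavy in the real code: accounts can be up to 10MB
    UiConfirmedBlock
}

async fn encode_block(runtime: Arc<Runtime>, block: ConfirmedBlock) -> UiConfirmedBlock {
    // The encode step runs on the blocking pool too, instead of on the IO
    // worker thread after the blockstore read completes.
    runtime
        .spawn_blocking(move || encode(block))
        .await
        .expect("Failed to spawn blocking task")
}
```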
rpc/src/rpc.rs
Outdated
.runtime
    .spawn_blocking({
        let blockstore = Arc::clone(&self.blockstore);
        move || blockstore.get_first_available_block().unwrap_or_default()
Same here, not sure that this needs to go in the bg pool.
rpc/src/rpc.rs
Outdated
let blockstore = Arc::clone(&self.blockstore);
move || -> Result<Vec<_>> {
    let blocks = blockstore
        .rooted_slot_iterator(max(start_slot, lowest_blockstore_slot))
Same here: I think that this iterator is fast, it just gets a bunch of u64s from rocks. In general I'm not sure that this function should be async; it has a confusing name but it just returns a vec of slot numbers.
rpc/src/rpc.rs
Outdated
async move {
    let unix_timestamp = self
        .runtime
        .spawn_blocking(move || bank.clock().unix_timestamp)
I don't think that this should go in the bg pool? The clock sysvar account is smol and pretty much guaranteed to be cached.
rpc/src/rpc.rs
Outdated
let bank = Arc::clone(&bank);
let mint = Pubkey::clone(mint);
move || {
    bank.get_account(&mint).ok_or_else(|| {
Token accounts are small, so not sure that we should send this to the bg pool? Especially considering it's memecoin szn so this is probably going to be called a lot? We must be careful to not put too many things in the bg pool, or we accidentally serialize frequent, fast requests behind chonky, slow ones we put in the bg pool.
This blockstore method doesn't actually do expensive reads. This reverts commit 3bbc57f.
Kept the `spawn_blocking` around:
* Call to `get_rooted_block`
* Call to `get_complete_block`
This reverts commit 710f9c6.
* Reverted the change to `interest_bearing_config`
* Reverted moving `bank.get_account(&mint)` to the background pool
This reverts commit 02f5c94.
…_supply` and `get_largest_accounts`
… to the background thread internally
Force-pushed from 8e2335c to aa83125.
OK @alessandrod, ready for another pass. Let me know if I interpreted anything incorrectly.
}
};
- let accounts = if is_known_spl_token_id(program_id)
+ let accounts = if is_known_spl_token_id(&program_id)
Does getProgramAccounts actually return program data? If so, I suspect we should do the encoding in the bg pool (programs are large).
Not usually. It returns accounts owned by the specified program. If the specified program in question was ‘the loader program’ then yeah, you'd get program data, but most RPCs would fatal with `scan aborted: The accumulated scan results exceeded the limit` before that happened (because of `byte_limit_for_scan`).
.calculate_non_circulating_supply(&bank)
.await
.map_err(|e| RpcCustomError::ScanError {
    message: e.to_string(),
This looks good! Below where we call `bank.get_largest_accounts` - that one probably needs to go in bg too? It does an account scan which I think is slow by definition.
Can do. Maybe more concerning is that the `largest_accounts_cache` doesn't appear to me to be a read-through cache. I think that given an empty cache, multiple callers may perform that fetch many times, rather than all locking on a single read.
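A sketch of the read-through shape being described, with hypothetical types and no expiry logic (the real cache has more state): holding one async lock across both the check and the fill means concurrent callers that miss queue behind a single scan instead of each kicking off their own.

```rust
use tokio::sync::Mutex;

#[derive(Clone)]
struct LargestAccounts(Vec<(String, u64)>); // hypothetical payload

struct LargestAccountsCache {
    inner: Mutex<Option<LargestAccounts>>,
}

impl LargestAccountsCache {
    async fn get_or_fetch(&self) -> LargestAccounts {
        // One lock guards both the check and the fill: later callers that
        // miss will wait here and then see the first caller's result.
        let mut guard = self.inner.lock().await;
        if let Some(cached) = guard.as_ref() {
            return cached.clone();
        }
        let fresh = fetch_largest_accounts().await;
        *guard = Some(fresh.clone());
        fresh
    }
}

async fn fetch_largest_accounts() -> LargestAccounts {
    // stand-in for the expensive account scan (spawn_blocking in practice)
    LargestAccounts(Vec::new())
}
```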
Looking good! We're almost there, just a couple more comments
self.blockstore.get_rooted_transaction(signature)
};
let confirmed_transaction = self
    .runtime
I am not sure about get_transaction. It's one of those cases where ideally it wouldn't block the io loop, but in practice most will be fast enough that if we put them in the bg pool behind heavy stuff (like getProgramAccounts) then we make it needlessly slow? Not sure.
Ran my benchmark script above, fwiw:

| | Before | After |
|---|---|---|
| Average | 9813.12µs | 6027.08µs |
| Median | 10554µs | 6400.5µs |
| 90th Percentile | 11922µs | 8229µs |
| 99th Percentile | 12271µs | 8284µs |

This doesn't disprove your hypothesis (making it slower by putting it in a pool behind heavier stuff) but it puts numbers to the thing to help make the decision.
Do these numbers put it in the ‘fast enough, no point backgrounding’ bucket?
These numbers are way larger than I thought - why is it so slow!? Def bg unless we find what makes it so slow and make it faster
Alright. I left it as is: backgrounded.
Ready to re-review @alessandrod! I can't click the ‘request review’ button since this PR belongs to you.
Thanks a lot for doing all this work! Code looks good. We'll likely need to tune more things as we gather feedback and do more benches, but no need to delay this further since it's already a huge improvement.
We were pinning to an ancient version of tokio because of alleged issues with RPC and the io loop. We've merged anza-xyz#3242 which should improve the io loop situation with RPCs. Update to tokio so we finally pick up tokio-rs/tokio#6512 which fixes bad perf for timeouts.
Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.
…cations (#3242)

* rpc: limit the number of blocking threads in tokio runtime
  By default tokio allows up to 512 blocking threads. We don't want that many threads, as they'd slow down other validator threads.
* rpc: make getMultipleAccounts async
  Make the function async and use tokio::task::spawn_blocking() to execute CPU-bound code in background. This prevents stalling the worker threads polling IO notifications and serving other non CPU-bound rpc methods.
* rpc: make getAccount async
* rpc: run get_filtered_program_accounts with task::spawn_blocking
  get_filtered_program_accounts can be used to retrieve _a list_ of accounts that match some filters. This is CPU bound and can block the calling thread for a significant amount of time when copying many/large accounts.
* rpc: use our custom runtime to spawn blocking tasks
  Pass the custom runtime to JsonRpcRequestProcessor and use it to spawn blocking tasks from rpc methods.
* Make `get_blocks()` and `get_block()` yieldy
  When these methods reach out to Blockstore, yield the thread
* Make `get_supply()` yieldy
  When this method reaches out to accounts_db (through a call to `calculate_non_circulating_supply()`), yield the thread.
* Make `get_first_available_block()` yieldy
  When this method reaches out to blockstore, yield the thread
* Make `get_transaction()` yieldy
  When this method reaches out to blockstore, yield the thread
* Make `get_token_supply()` yieldy
  When this method reaches out to methods on bank that do reads, yield the thread
* Make `cpus / 4` the default for `rpc_blocking_threads`
* Encode blocks async
* Revert "Make `get_first_available_block()` yieldy"
  This blockstore method doesn't actually do expensive reads. This reverts commit 3bbc57f.
* Revert "Make `get_blocks()` and `get_block()` yieldy"
  Kept the `spawn_blocking` around:
  * Call to `get_rooted_block`
  * Call to `get_complete_block`
  This reverts commit 710f9c6.
* Revert "Make `get_token_supply()` yieldy"
  * Reverted the change to `interest_bearing_config`
  * Reverted moving `bank.get_account(&mint)` to the background pool
  This reverts commit 02f5c94.
* Share spawned call to `calculate_non_circulating_supply` between `get_supply` and `get_largest_accounts`
* Create a shim for `get_filtered_indexed_accounts` that sends the work to the background thread internally
* Send call to `get_largest_accounts` to the background pool

---------

Co-authored-by: Steven Luscher <[email protected]>
(cherry picked from commit c6f3e1b)
Problem
Some RPC operations are CPU bound and run for a significant amount of time. These operations end up blocking worker threads that are also used to handle IO notifications, leading to notifications not being polled often enough, which can make the whole RPC server slow and exhibit high latency. When latency gets high enough it can exceed request timeouts, leading to failed requests.
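A toy repro of this failure mode (not from the PR; a `current_thread` runtime is used to make the effect easy to see):

```rust
use std::time::{Duration, Instant};

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let start = Instant::now();
    let timer = tokio::spawn(async move {
        tokio::time::sleep(Duration::from_millis(10)).await;
        println!("timer fired after {:?}", start.elapsed());
    });
    // Let the timer task run once so its sleep gets registered.
    tokio::task::yield_now().await;

    // CPU-bound work directly on the worker thread: the runtime can't poll
    // IO/timer events until this loop finishes, so the 10ms timer fires
    // far later than requested.
    let mut acc = 0u64;
    for i in 0..1_000_000_000u64 {
        acc = acc.wrapping_add(i);
    }
    println!("spin done, acc = {acc}");
    timer.await.unwrap();
}
```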
Summary of Changes
This PR makes some of the most CPU-expensive RPC methods use `tokio::task::spawn_blocking` to run CPU-hungry code. This way the worker threads doing IO don't get blocked and latency is improved. The methods changed so far include:
- getMultipleAccounts
- getProgramAccounts
- getAccountInfo
- getTokenAccountsByDelegate
- getTokenAccountsByOwner
I'm not super familiar with RPC, so I've changed the methods that, from looking at the code, seem to load and copy a lot of data around. Please feel free to suggest more!
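A minimal sketch of the before/after shape of the change, with hypothetical `Bank`/`Processor` types standing in for the real ones in `rpc/src/rpc.rs`:

```rust
use std::sync::Arc;
use tokio::runtime::Runtime;

struct Bank; // hypothetical stand-in
impl Bank {
    fn get_multiple_accounts(&self) -> Vec<Vec<u8>> {
        Vec::new() // copies potentially many/large accounts in the real code
    }
}

struct Processor {
    runtime: Arc<Runtime>,
    bank: Arc<Bank>,
}

impl Processor {
    // Before: the copy runs on the IO worker thread, stalling notification
    // polling for every other in-flight request.
    fn get_multiple_accounts_blocking(&self) -> Vec<Vec<u8>> {
        self.bank.get_multiple_accounts()
    }

    // After: the CPU-bound copy moves to the blocking pool; the IO worker
    // only awaits the JoinHandle and stays free to poll notifications.
    async fn get_multiple_accounts(&self) -> Vec<Vec<u8>> {
        let bank = Arc::clone(&self.bank);
        self.runtime
            .spawn_blocking(move || bank.get_multiple_accounts())
            .await
            .expect("Failed to spawn blocking task")
    }
}
```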
Test plan
Methodology for selection of CPU defaults
Run this `blocks` benchmark script while tweaking CPU params. This was run on a 48-CPU machine. (Results grid with `rpc_threads` and `rpc_blocking_threads` columns omitted.)
Methodology
Using this script for computing metrics: https://gist.github.com/steveluscher/b4959b9601093b0009f1d7646217b030, ran each of these `account-cluster-bench` suites before and after this PR:

- account-info
- block
- blocks
- first-available-block
- multiple-accounts
- slot
- supply
- token-accounts-by-delegate
- token-accounts-by-owner
- token-supply
- transaction
- transaction-parsed
- version
Using a command similar to this:
Note
You can adjust the `sleep 15` if you want the validator to stack up more slots before starting the bench.

Warning
When running benches that require token accounts, supply a `mint`, `space`, and actually create the token account using the fixture found here.

Results
Warning
These results are a little messed up, because what's actually happening here is that the benchmark script is spitting out averages in 3s windows. The avg/p50/p90/p99 of those numbers is what you're seeing in this table. Not correct, but directionally correct.
Note
Filling in this grid would take a long time, especially if run against a mainnet RPC with production traffic. We may just choose to land this as ‘certainly better, how much we can't say exactly.’
(Per-suite before/after latency table omitted; rows covered: account-info, block, blocks, first-available-block, multiple-accounts, slot, supply, token-accounts-by-delegate, token-accounts-by-owner, token-supply, transaction, transaction-parsed, version.)