
Tune performance of fetch pending hashes #13219

Closed
Elvis339 opened this issue Dec 9, 2024 · 10 comments
Labels
C-enhancement New feature or request C-perf A change motivated by improving speed, memory usage or disk footprint S-stale This issue/PR is stale and will close with no further activity

Comments

@Elvis339
Contributor

Elvis339 commented Dec 9, 2024

Describe the feature

This issue builds on Profile tx fetching to improve transaction fetching performance.

Optimizations Applied

  • Removed nested call stacks in hash computations
  • Eliminated function passing through multiple levels
  • Converted deferred/batch cleanups to eager processing
  • Added strategic early returns
  • Inlining
  • Streamlined LRU cache access patterns

These changes resulted in a ~51% performance improvement in transaction fetching operations.

Additional Consideration: LRU Usage

Currently, some fields like seen_hashes are using an LRU cache, but this might not be optimal.

Why LRU might not be ideal:

  1. Do we actually need ordering? The main purpose seems to be deduplication.
  2. LRU operations have overhead from maintaining order
  3. Cache eviction order might not align with transaction processing patterns

Would love feedback on whether ordering is actually important, as moving to a simpler data structure could yield a further performance win.
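If ordering really is incidental and the goal is just bounded deduplication, a minimal std-only sketch might look like the following. The `FifoSet` type is hypothetical (not reth's actual API): a capacity-bounded set that never reorders entries on lookup, unlike an LRU.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical dedup-only set with a capacity bound. Lookups never
/// reorder entries (no LRU bookkeeping); eviction is plain FIFO.
struct FifoSet {
    set: HashSet<[u8; 32]>,
    order: VecDeque<[u8; 32]>,
    capacity: usize,
}

impl FifoSet {
    fn new(capacity: usize) -> Self {
        Self { set: HashSet::new(), order: VecDeque::new(), capacity }
    }

    /// Returns true if the hash was newly inserted (i.e. not a duplicate).
    /// Evicts the oldest entry once the capacity bound is hit.
    fn insert(&mut self, hash: [u8; 32]) -> bool {
        if !self.set.insert(hash) {
            return false; // duplicate: no reordering work, unlike an LRU hit
        }
        self.order.push_back(hash);
        if self.order.len() > self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.set.remove(&oldest);
            }
        }
        true
    }

    fn contains(&self, hash: &[u8; 32]) -> bool {
        self.set.contains(hash)
    }
}

fn main() {
    let mut seen = FifoSet::new(2);
    assert!(seen.insert([1; 32]));
    assert!(!seen.insert([1; 32])); // deduplicated
    assert!(seen.insert([2; 32]));
    assert!(seen.insert([3; 32])); // evicts [1; 32], the oldest entry
    assert!(!seen.contains(&[1; 32]));
    assert!(seen.contains(&[3; 32]));
}
```

This keeps the "evict older hashes" behavior while dropping the per-access ordering maintenance an LRU does.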

Additional context

Flamegraphs

Not Optimized: [flamegraph image]

Optimized: [flamegraph image]

Benchmark

Not Optimized: [benchmark screenshot]

Optimized: [benchmark screenshot]

@mattsse
Collaborator

mattsse commented Dec 11, 2024

> Do we actually need ordering? Seems like main purpose is deduplication?

we don't need ordering, we only need to evict older hashes

> These changes resulted in ~51% performance improvement in transaction fetching operations.

awesome

> Converted deferred/batch cleanups to eager processing

that makes sense, no need to buffer them

@mattsse mattsse added C-perf A change motivated by improving speed, memory usage or disk footprint and removed S-needs-triage This issue needs to be labelled labels Dec 11, 2024
@Elvis339
Contributor Author

> we don't need ordering we only need to evict older hashes

That's very cool; I think this will really improve perf. I will propose a different data structure, but I need a bit more time for this.

@malik672
Contributor

malik672 commented Dec 18, 2024

> I will propose different data structure - need a bit more time for this.

a circular buffer will work here; it will just evict older hashes as new ones enter
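A minimal sketch of that idea (hypothetical type, std-only): a fixed-size ring of hashes where each insert overwrites the oldest slot once the buffer is full. Note that membership checks, and any targeted remove, are linear scans over the buffer.

```rust
/// Minimal fixed-size circular buffer of 32-byte hashes. Inserting into a
/// full buffer overwrites (evicts) the oldest slot. `contains` is O(n),
/// but the scan runs over contiguous memory.
struct HashRing {
    slots: Vec<[u8; 32]>,
    head: usize, // index of the next slot to overwrite once full
    capacity: usize,
}

impl HashRing {
    fn new(capacity: usize) -> Self {
        Self { slots: Vec::with_capacity(capacity), head: 0, capacity }
    }

    /// Returns the evicted hash, if a slot was overwritten.
    fn push(&mut self, hash: [u8; 32]) -> Option<[u8; 32]> {
        if self.slots.len() < self.capacity {
            self.slots.push(hash);
            None
        } else {
            let evicted = std::mem::replace(&mut self.slots[self.head], hash);
            self.head = (self.head + 1) % self.capacity;
            Some(evicted)
        }
    }

    fn contains(&self, hash: &[u8; 32]) -> bool {
        // Linear scan, but over a contiguous allocation.
        self.slots.iter().any(|h| h == hash)
    }
}

fn main() {
    let mut ring = HashRing::new(2);
    assert_eq!(ring.push([1; 32]), None);
    assert_eq!(ring.push([2; 32]), None);
    assert_eq!(ring.push([3; 32]), Some([1; 32])); // oldest hash evicted
    assert!(ring.contains(&[2; 32]));
    assert!(!ring.contains(&[1; 32]));
}
```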

@malik672
Contributor

@Elvis339 correct me if I'm wrong, but in the optimized version the most resource-intensive part seems to be multithreading.

@Elvis339
Contributor Author

Elvis339 commented Dec 18, 2024

> a circular buffer will work here, will just evict older hashes as new one enters

A circular buffer definitely seems interesting. I did a quick check on the most-used methods of hashes_pending_fetch, and they seem to be iter, remove, and len. From a quick look, remove on a circular buffer is O(n), so I feel a simpler data structure such as a hash set would be more performant. Honestly, I haven't done much research on alternative data structures because I have more pressing issues at the moment.

if let (_, Some(evicted_hash)) = self.hashes_pending_fetch.insert_and_get_evicted(hash)
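One way to get FIFO eviction without a plain ring's O(n) remove is lazy deletion. The sketch below is hypothetical (it is not reth's `insert_and_get_evicted`, just an illustration of the idea): a HashSet for O(1) membership/remove plus a VecDeque for eviction order, where removed hashes linger in the queue as tombstones that eviction skips over.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical pending-hash set: FIFO eviction like a ring buffer, but
/// O(1) `remove` via lazy deletion. Removed hashes stay in `queue` as
/// tombstones and are skipped when eviction pops the front.
struct PendingSet {
    live: HashSet<[u8; 32]>,
    queue: VecDeque<[u8; 32]>,
    capacity: usize,
}

impl PendingSet {
    fn new(capacity: usize) -> Self {
        Self { live: HashSet::new(), queue: VecDeque::new(), capacity }
    }

    /// Inserts a hash; returns the evicted hash if the capacity was exceeded.
    fn insert(&mut self, hash: [u8; 32]) -> Option<[u8; 32]> {
        if !self.live.insert(hash) {
            return None; // already pending
        }
        self.queue.push_back(hash);
        while self.live.len() > self.capacity {
            // Pop until we hit an entry that is still live (skip tombstones).
            if let Some(old) = self.queue.pop_front() {
                if self.live.remove(&old) {
                    return Some(old);
                }
            }
        }
        None
    }

    /// O(1): only the set is touched; the queue keeps a tombstone.
    fn remove(&mut self, hash: &[u8; 32]) -> bool {
        self.live.remove(hash)
    }

    fn len(&self) -> usize {
        self.live.len()
    }

    fn iter(&self) -> impl Iterator<Item = &[u8; 32]> + '_ {
        self.live.iter()
    }
}

fn main() {
    let mut pending = PendingSet::new(2);
    assert_eq!(pending.insert([1; 32]), None);
    assert_eq!(pending.insert([2; 32]), None);
    assert!(pending.remove(&[1; 32])); // O(1), leaves a tombstone
    assert_eq!(pending.insert([3; 32]), None); // still under capacity
    assert_eq!(pending.insert([4; 32]), Some([2; 32])); // skips the tombstone
    assert_eq!(pending.len(), 2);
    assert_eq!(pending.iter().count(), 2);
}
```

The trade-off is that a burst of removes lets the queue grow with tombstones until eviction catches up; a periodic compaction pass could bound that if it mattered.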

@malik672
Contributor

malik672 commented Dec 18, 2024

> From a quick check the remove in circular buffer data structure seems to be O(n) I feel like a simpler data structure such as hash set would be more useful.

It depends, tbh. A hash set stores its entries on the heap: to remove one, you take a pointer from the stack and use it to remove data at a precise heap location. The catch is that the CPU can't optimize that access pattern; it always has to fetch the data from main memory.

A circular buffer is interesting because, depending on the size of the element you store (here a 32-byte hash), it is far more amenable to CPU optimization. When the CPU fetches data at address 1, it doesn't fetch exactly the size you asked for; it fetches a whole cache line (often 64 bytes). So it pulls in the data at address 2 as well, hands you the data at address 1, and keeps the neighbor in cache, expecting that the next request can be served straight from cache.

The hash set can seem better on paper, but it can't exploit the cache this way because it isn't contiguous, and the best performance optimization is simply using the cache effectively. The hash set gives you lots of cache misses; the circular buffer won't.

Of course, this needs a benchmark; all of this is theory, tbh.

@mattsse what do you think?

@Elvis339
Contributor Author

> while the hash set gives you lot of cache misses the circular buffer won't

Interesting points about memory locality, but I believe some of the statements about CPU optimization need clarification:

  1. "The CPU can never optimize a hash set" is not entirely accurate. Modern CPUs have optimization techniques:

    • Hardware prefetchers can learn access patterns even for non-contiguous memory, check SO. But in general you're right: pointer chasing is detrimental from a performance-optimization perspective, and contiguous data is definitely superior.
  2. Circular buffers vs. hash sets is more nuanced:

    • Yes, circular buffers can have better spatial locality. But hash set implementations often use techniques such as open addressing to improve locality, which can be quite cache-friendly.

The core question isn't about picking one over the other universally, but rather identifying the most appropriate and simplest data structure for our specific use case. Looking at the current implementation, the LRU cache is performing redundant work: maintaining ordering when we might not need it.

I agree with most of your statements and disagree with some; I think you're on a good path, and I like the idea of CPU cache-friendly data structures!

You're absolutely right that benchmarking would tell us the real story. Would be cool to test with:

  • Different sizes of data (small/medium/large sets)
  • Various access patterns (sequential, random, mixed)
  • Different load factors to see when things break down

Got any specific benchmark scenarios in mind? Would be interesting to see how different patterns affect cache behavior.
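A rough std-only harness along those lines (indicative only; real numbers would need a proper benchmark framework with warm-up and statistics, and the sizes here are arbitrary) that varies the set size and cross-checks that both structures agree on membership:

```rust
use std::collections::HashSet;
use std::time::Instant;

/// Hypothetical micro-benchmark sketch: membership queries against a HashSet
/// vs a linear scan over a contiguous Vec, for a given set size `n`.
/// Returns (hashset nanos, vec-scan nanos, whether both agreed on every probe).
fn bench_membership(n: usize) -> (u128, u128, bool) {
    // Synthesize `n` distinct 32-byte "hashes" from a counter.
    let hashes: Vec<[u8; 32]> = (0..n)
        .map(|i| {
            let mut h = [0u8; 32];
            h[..8].copy_from_slice(&(i as u64).to_le_bytes());
            h
        })
        .collect();
    let set: HashSet<[u8; 32]> = hashes.iter().copied().collect();
    let probe = &hashes[n / 2]; // a hash known to be present

    let t = Instant::now();
    let mut set_hits = 0u32;
    for _ in 0..1000 {
        if set.contains(probe) {
            set_hits += 1;
        }
    }
    let set_ns = t.elapsed().as_nanos();

    let t = Instant::now();
    let mut vec_hits = 0u32;
    for _ in 0..1000 {
        if hashes.iter().any(|h| h == probe) {
            vec_hits += 1;
        }
    }
    let vec_ns = t.elapsed().as_nanos();

    (set_ns, vec_ns, set_hits == vec_hits)
}

fn main() {
    for n in [64, 1024, 16_384] {
        let (set_ns, vec_ns, agree) = bench_membership(n);
        assert!(agree); // both structures must answer identically
        println!("n={n}: HashSet {set_ns} ns, Vec scan {vec_ns} ns");
    }
}
```

Extending this with random vs. sequential probe patterns and varying load factors would cover the scenarios listed above.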

@malik672
Contributor

> The core question isn't about picking one over the other universally, but rather identifying the most appropriate and simplest data structure for our specific use case.

I disagree with both 1 and 2. A hash set is just a data structure with binding logic, nothing more; the circular buffer will always have higher locality of reference than a hash map, all things being equal.

The CPU optimization technique in this case is the prefetcher, and it works on contiguous data structures; that's why an array can sometimes beat a hash map for indexed lookups, depending on the size.

A circular buffer solves the LRU problem: it behaves like an LRU, only it uses a FIFO eviction mechanism.

Any benchmark should do, tbh.


github-actions bot commented Jan 9, 2025

This issue is stale because it has been open for 21 days with no activity.

@github-actions github-actions bot added the S-stale This issue/PR is stale and will close with no further activity label Jan 9, 2025

This issue was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 17, 2025
@github-project-automation github-project-automation bot moved this from Todo to Done in Reth Tracker Jan 17, 2025