
Tune performance of fetch pending hashes #13219

Closed
Elvis339 opened this issue Dec 9, 2024 · 10 comments
Labels
C-enhancement New feature or request C-perf A change motivated by improving speed, memory usage or disk footprint S-stale This issue/PR is stale and will close with no further activity

Comments

@Elvis339
Contributor

Elvis339 commented Dec 9, 2024

Describe the feature

This issue builds on Profile tx fetching to improve transaction fetching performance.

Optimizations Applied

  • Removed nested call stacks in hash computations
  • Eliminated function passing through multiple levels
  • Converted deferred/batch cleanups to eager processing
  • Added strategic early returns
  • Inlining
  • Streamlined LRU cache access patterns

These changes resulted in a ~51% performance improvement in transaction fetching operations.

Additional Consideration: LRU Usage

Currently, some fields like seen_hashes are using an LRU cache, but this might not be optimal.

Why LRU might not be ideal:

  1. Do we actually need ordering? The main purpose seems to be deduplication.
  2. LRU operations have overhead from maintaining order
  3. Cache eviction order might not align with transaction processing patterns

Would love feedback on whether ordering is actually important, as moving to a simpler data structure could yield a further performance win.
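If ordering really is incidental and the goal is just bounded deduplication, a minimal std-only sketch might look like the following. The `FifoSet` type is hypothetical (not reth's actual API): a capacity-bounded set that never reorders entries on lookup, unlike an LRU.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical dedup-only set with a capacity bound. Lookups never
/// reorder entries (no LRU bookkeeping); eviction is plain FIFO.
struct FifoSet {
    set: HashSet<[u8; 32]>,
    order: VecDeque<[u8; 32]>,
    capacity: usize,
}

impl FifoSet {
    fn new(capacity: usize) -> Self {
        Self { set: HashSet::new(), order: VecDeque::new(), capacity }
    }

    /// Returns true if the hash was newly inserted (i.e. not a duplicate).
    /// Evicts the oldest entry once the capacity bound is hit.
    fn insert(&mut self, hash: [u8; 32]) -> bool {
        if !self.set.insert(hash) {
            return false; // duplicate: no reordering work, unlike an LRU hit
        }
        self.order.push_back(hash);
        if self.order.len() > self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.set.remove(&oldest);
            }
        }
        true
    }

    fn contains(&self, hash: &[u8; 32]) -> bool {
        self.set.contains(hash)
    }
}

fn main() {
    let mut seen = FifoSet::new(2);
    assert!(seen.insert([1; 32]));
    assert!(!seen.insert([1; 32])); // deduplicated
    assert!(seen.insert([2; 32]));
    assert!(seen.insert([3; 32])); // evicts [1; 32], the oldest entry
    assert!(!seen.contains(&[1; 32]));
    assert!(seen.contains(&[3; 32]));
}
```

This keeps the "evict older hashes" behavior while dropping the per-access ordering maintenance an LRU does.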

Additional context

Flamegraphs

Not Optimized: [flamegraph image]

Optimized: [flamegraph image]

Benchmark

Not Optimized: [benchmark screenshot]

Optimized: [benchmark screenshot]

@mattsse
Collaborator

mattsse commented Dec 11, 2024

> Do we actually need ordering? Seems like main purpose is deduplication?

we don't need ordering, we only need to evict older hashes

> These changes resulted in ~51% performance improvement in transaction fetching operations.

awesome

> Converted deferred/batch cleanups to eager processing

that makes sense, no need to buffer them

@mattsse mattsse added C-perf A change motivated by improving speed, memory usage or disk footprint and removed S-needs-triage This issue needs to be labelled labels Dec 11, 2024
@Elvis339
Contributor Author

> we don't need ordering we only need to evict older hashes

That's very cool; I think this will really improve perf. I will propose a different data structure, but I need a bit more time for this.

@malik672
Contributor

malik672 commented Dec 18, 2024

> I will propose different data structure - need a bit more time for this.

a circular buffer will work here; it will just evict older hashes as new ones enter
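A minimal sketch of that idea (hypothetical type, std-only): a fixed-size ring of hashes where each insert overwrites the oldest slot once the buffer is full. Note that membership checks, and any targeted remove, are linear scans over the buffer.

```rust
/// Minimal fixed-size circular buffer of 32-byte hashes. Inserting into a
/// full buffer overwrites (evicts) the oldest slot. `contains` is O(n),
/// but the scan runs over contiguous memory.
struct HashRing {
    slots: Vec<[u8; 32]>,
    head: usize, // index of the next slot to overwrite once full
    capacity: usize,
}

impl HashRing {
    fn new(capacity: usize) -> Self {
        Self { slots: Vec::with_capacity(capacity), head: 0, capacity }
    }

    /// Returns the evicted hash, if a slot was overwritten.
    fn push(&mut self, hash: [u8; 32]) -> Option<[u8; 32]> {
        if self.slots.len() < self.capacity {
            self.slots.push(hash);
            None
        } else {
            let evicted = std::mem::replace(&mut self.slots[self.head], hash);
            self.head = (self.head + 1) % self.capacity;
            Some(evicted)
        }
    }

    fn contains(&self, hash: &[u8; 32]) -> bool {
        // Linear scan, but over a contiguous allocation.
        self.slots.iter().any(|h| h == hash)
    }
}

fn main() {
    let mut ring = HashRing::new(2);
    assert_eq!(ring.push([1; 32]), None);
    assert_eq!(ring.push([2; 32]), None);
    assert_eq!(ring.push([3; 32]), Some([1; 32])); // oldest hash evicted
    assert!(ring.contains(&[2; 32]));
    assert!(!ring.contains(&[1; 32]));
}
```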

@malik672
Contributor

@Elvis339 correct me if I'm wrong, but in the optimized version the most resource-intensive part seems to be multithreading.

@Elvis339
Contributor Author

Elvis339 commented Dec 18, 2024

> a circular buffer will work here, will just evict older hashes as new one enters

A circular buffer definitely seems interesting. I did a quick check on the most-used methods of hashes_pending_fetch, and they seem to be iter, remove, and len. From a quick look, remove on a circular buffer is O(n), so I feel a simpler data structure such as a hash set would be more performant. Honestly, I haven't done much research on alternative data structures because I have more pressing issues at the moment.

if let (_, Some(evicted_hash)) = self.hashes_pending_fetch.insert_and_get_evicted(hash)
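One way to get FIFO eviction without a plain ring's O(n) remove is lazy deletion. The sketch below is hypothetical (it is not reth's `insert_and_get_evicted`, just an illustration of the idea): a HashSet for O(1) membership/remove plus a VecDeque for eviction order, where removed hashes linger in the queue as tombstones that eviction skips over.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical pending-hash set: FIFO eviction like a ring buffer, but
/// O(1) `remove` via lazy deletion. Removed hashes stay in `queue` as
/// tombstones and are skipped when eviction pops the front.
struct PendingSet {
    live: HashSet<[u8; 32]>,
    queue: VecDeque<[u8; 32]>,
    capacity: usize,
}

impl PendingSet {
    fn new(capacity: usize) -> Self {
        Self { live: HashSet::new(), queue: VecDeque::new(), capacity }
    }

    /// Inserts a hash; returns the evicted hash if the capacity was exceeded.
    fn insert(&mut self, hash: [u8; 32]) -> Option<[u8; 32]> {
        if !self.live.insert(hash) {
            return None; // already pending
        }
        self.queue.push_back(hash);
        while self.live.len() > self.capacity {
            // Pop until we hit an entry that is still live (skip tombstones).
            if let Some(old) = self.queue.pop_front() {
                if self.live.remove(&old) {
                    return Some(old);
                }
            }
        }
        None
    }

    /// O(1): only the set is touched; the queue keeps a tombstone.
    fn remove(&mut self, hash: &[u8; 32]) -> bool {
        self.live.remove(hash)
    }

    fn len(&self) -> usize {
        self.live.len()
    }

    fn iter(&self) -> impl Iterator<Item = &[u8; 32]> + '_ {
        self.live.iter()
    }
}

fn main() {
    let mut pending = PendingSet::new(2);
    assert_eq!(pending.insert([1; 32]), None);
    assert_eq!(pending.insert([2; 32]), None);
    assert!(pending.remove(&[1; 32])); // O(1), leaves a tombstone
    assert_eq!(pending.insert([3; 32]), None); // still under capacity
    assert_eq!(pending.insert([4; 32]), Some([2; 32])); // skips the tombstone
    assert_eq!(pending.len(), 2);
    assert_eq!(pending.iter().count(), 2);
}
```

The trade-off is that a burst of removes lets the queue grow with tombstones until eviction catches up; a periodic compaction pass could bound that if it mattered.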

@malik672
Contributor

malik672 commented Dec 18, 2024

> From a quick check the remove in circular buffer data structure seems to be O(n) I feel like a simpler data structure such as hash set would be more useful.

It depends, tbh. A hash set stores its entries on the heap: to remove one, you take a pointer from the stack and use it to remove data at a precise heap location. The catch is that the CPU can't optimize that access pattern; it always has to fetch the data from main memory.

A circular buffer is interesting because, depending on the size of the element you store (here a 32-byte hash), it is far more amenable to CPU optimization. When the CPU fetches data at address 1, it doesn't fetch exactly the size you asked for; it fetches a whole cache line (often 64 bytes). So it pulls in the data at address 2 as well, hands you the data at address 1, and keeps the neighbor in cache, expecting that the next request can be served straight from cache.

The hash set can seem better on paper, but it can't exploit the cache this way because it isn't contiguous, and the best performance optimization is simply using the cache effectively. The hash set gives you lots of cache misses; the circular buffer won't.

Of course, this needs a benchmark; all of this is theory, tbh.

@mattsse what do you think?

@Elvis339
Contributor Author

> while the hash set gives you lot of cache misses the circular buffer won't

Interesting points about memory locality, but I believe some of the statements about CPU optimization need clarification:

  1. "The CPU can never optimize a hash set" is not entirely accurate. Modern CPUs have optimization techniques:

    • Hardware prefetchers can learn access patterns even for non-contiguous memory, check SO. But in general you're right: pointer chasing is detrimental from a performance-optimization perspective, and contiguous data is definitely superior.
  2. Circular buffers vs. hash sets is more nuanced:

    • Yes, circular buffers can have better spatial locality. But hash set implementations often use techniques such as open addressing to improve locality, which can be quite cache-friendly.

The core question isn't about picking one over the other universally, but rather identifying the most appropriate and simplest data structure for our specific use case. Looking at the current implementation, the LRU cache is performing redundant work: maintaining ordering when we might not need it.

I agree with most of your statements and disagree with some; I think you're on a good path, and I like the idea of CPU cache-friendly data structures!

You're absolutely right that benchmarking would tell us the real story. Would be cool to test with:

  • Different sizes of data (small/medium/large sets)
  • Various access patterns (sequential, random, mixed)
  • Different load factors to see when things break down

Got any specific benchmark scenarios in mind? Would be interesting to see how different patterns affect cache behavior.
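A rough std-only harness along those lines (indicative only; real numbers would need a proper benchmark framework with warm-up and statistics, and the sizes here are arbitrary) that varies the set size and cross-checks that both structures agree on membership:

```rust
use std::collections::HashSet;
use std::time::Instant;

/// Hypothetical micro-benchmark sketch: membership queries against a HashSet
/// vs a linear scan over a contiguous Vec, for a given set size `n`.
/// Returns (hashset nanos, vec-scan nanos, whether both agreed on every probe).
fn bench_membership(n: usize) -> (u128, u128, bool) {
    // Synthesize `n` distinct 32-byte "hashes" from a counter.
    let hashes: Vec<[u8; 32]> = (0..n)
        .map(|i| {
            let mut h = [0u8; 32];
            h[..8].copy_from_slice(&(i as u64).to_le_bytes());
            h
        })
        .collect();
    let set: HashSet<[u8; 32]> = hashes.iter().copied().collect();
    let probe = &hashes[n / 2]; // a hash known to be present

    let t = Instant::now();
    let mut set_hits = 0u32;
    for _ in 0..1000 {
        if set.contains(probe) {
            set_hits += 1;
        }
    }
    let set_ns = t.elapsed().as_nanos();

    let t = Instant::now();
    let mut vec_hits = 0u32;
    for _ in 0..1000 {
        if hashes.iter().any(|h| h == probe) {
            vec_hits += 1;
        }
    }
    let vec_ns = t.elapsed().as_nanos();

    (set_ns, vec_ns, set_hits == vec_hits)
}

fn main() {
    for n in [64, 1024, 16_384] {
        let (set_ns, vec_ns, agree) = bench_membership(n);
        assert!(agree); // both structures must answer identically
        println!("n={n}: HashSet {set_ns} ns, Vec scan {vec_ns} ns");
    }
}
```

Extending this with random vs. sequential probe patterns and varying load factors would cover the scenarios listed above.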

@malik672
Contributor

> The core question isn't about picking one over the other universally, but rather identifying the most appropriate and simplest data structure for our specific use case.

I disagree with both 1 and 2. A hash set is just a data structure with binding logic, nothing more; the circular buffer will always have higher locality of reference than a hash map, all things being equal.

The CPU optimization technique in this case is the prefetcher, and it works on contiguous data structures; that's why an array can sometimes beat a hash map for indexed lookups, depending on the size.

A circular buffer solves the LRU problem: it behaves like an LRU, only it uses a FIFO eviction mechanism.

Any benchmark should do, tbh.


github-actions bot commented Jan 9, 2025

This issue is stale because it has been open for 21 days with no activity.

@github-actions github-actions bot added the S-stale This issue/PR is stale and will close with no further activity label Jan 9, 2025

This issue was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 17, 2025
@github-project-automation github-project-automation bot moved this from Todo to Done in Reth Tracker Jan 17, 2025