Add PodHashMap #97
Conversation
The printout claims it's only doing 16K active entries but in reality it's doing 32K; I'll have to figure out some nicer perf tests/benchmarks if we decide to use this for the lwaftr binding table. I guess I should end by noting that if the cache misses parallelize, then this would be a fine binding table data structure when running multiple lwaftrs on one machine. The lookup appears to really be just some 70ns in the worst case of a cache miss, which is what we budgeted for. We'll need two tables, one for the from-IPv6 side and one for the from-IPv4 side, but that's fine. We appear to remain on target perf-wise.
```lua
   -- simulate only 16K active flows: i = bit.band(i, 0xffff)
   result = rhh:lookup(hash_i32(i), i)
end
stop = ffi.C.get_time_ns()
```
Can localize `C`.
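For instance, a minimal sketch of that suggestion, reusing the names from the hunk above and assuming `get_time_ns` is cdef'd elsewhere in the file:

```lua
local ffi = require("ffi")
local C = ffi.C            -- hoist the ffi.C namespace lookup out of the hot path

local start = C.get_time_ns()
for i = 1, count do
   result = rhh:lookup(hash_i32(i), i)
end
local stop = C.get_time_ns()
print(tonumber(stop - start) / count, "ns per lookup")
```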
Looking good, promising results.
This is awesome stuff, Andy!
I am fascinated by the idea of parallelizing cache lookups to increase throughput by amortizing latency costs :-). cc @alexandergall I see four main ways this can be done:
I have done an experiment with (4) ILP now and the results are interesting! This is inspired by an idea from Mike Pall: have machine code that looks up multiple keys in such a way that the CPU can use ILP to issue the loads from L3/memory in parallel. In TCP/networking terms this would be "increasing the congestion window" by having multiple requests to the cache outstanding at the same time. That way, if the latency is (say) 60ns, then you would be getting the results of N lookups every 60ns instead of just one. For example, if N=10 then you would be able to complete an L3 lookup on average every 6ns.

I wrote some proof-of-concept assembler code to see if this works. (It does seem to.) I have three simple benchmarks: p1, p2, and p4. Each one fetches values from L3 cache in a loop. Parallelism is limited because the address of each fetch depends on the result of the previous one; this is compensated in p2 by issuing two separate fetches at a time, and in p4 with four fetches at a time.

The hypothesised runtime behavior is like this (to borrow a TCP-style diagram): I expect that each routine will have the same number of L3 cache misses and the same L3 latency, but that the throughput of lookup operations will be higher for the parallel versions (more results/second received from L3).

I did this experiment. See code and results. Below is the overview.

p1

Inner loop:
PMU report for 100M lookups:
Result: ~54 cycles per lookup (and one L3 cache hit confirmed). (There is a misnomer here: it would be more accurate to print ...)

p2

Inner loop:
PMU report for 100M lookups:
Result: ~27 cycles per lookup (and one L3 cache hit confirmed).

p4

Inner loop:
PMU report for 100M lookups:
Result: ~14 cycles per lookup (and one L3 cache hit confirmed).

End

So! Looks like ILP can be used to issue parallel cache lookups, and this can reduce the average lookup cost to far below the access time. Interesting to bring into a practical data structure?
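To make the serial-versus-parallel pattern concrete without the assembler, here is a hypothetical pointer-chasing sketch in plain Lua (illustrative only; the real p1/p2/p4 benchmarks use assembler and cdata, not Lua tables):

```lua
-- Build a random cyclic permutation so each load's target is unpredictable.
local function make_chain(n)
   local perm = {}
   for i = 1, n do perm[i] = i end
   for i = n, 2, -1 do                       -- Fisher-Yates shuffle
      local j = math.random(i)
      perm[i], perm[j] = perm[j], perm[i]
   end
   local chain = {}
   for i = 1, n do chain[perm[i]] = perm[i % n + 1] end  -- follow the cycle
   return chain
end

local n = 1000000
local chain = make_chain(n)

-- p1-style: each load's address depends on the previous result, so only
-- one request can be outstanding at a time.
local i = 1
for _ = 1, n do i = chain[i] end

-- p2-style: two independent chains, so two loads can be in flight at once
-- and their latencies overlap.
local chain_a, chain_b = make_chain(n), make_chain(n)
local a, b = 1, 1
for _ = 1, n / 2 do
   a = chain_a[a]
   b = chain_b[b]
end
print(i, a, b)   -- keep the results live so the loops aren't dead code
```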
This code looks really nice!
In LuaJIT it seems like the worst-case time needs to also take into account the potential non-local effects of calling the lookup function, particularly the risk of branching off to a side trace (lukego/blog#8). I would want to be confident that the probing loop is predictable enough to LuaJIT (e.g. heavily biased towards one iteration count) that exits to side traces here would be rare. You might be able to check that with ... This seems important for these subroutines that can be called from tight traces. Alex Gall decided to hide such control flow from LuaJIT by writing it in C and calling it via FFI. (It's a pain in the ass to be sure.)
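For reference, the "move the probing loop into C" approach looks roughly like this; a hedged sketch only (the phm_lookup function, library name, and struct layout are assumptions, not the code Alex actually wrote):

```lua
local ffi = require("ffi")

ffi.cdef[[
typedef struct { uint32_t hash, key, value; } entry_t;
/* The data-dependent probing loop lives in C, so LuaJIT sees a single
   opaque call instead of branches that could spawn side traces. */
int32_t phm_lookup(entry_t *entries, uint32_t mask, uint32_t hash, uint32_t key);
]]

local phm = ffi.load("phm")   -- hypothetical shared library holding the C loop

-- Usage from a trace: entries/mask come from the table, k is the key.
local idx = phm.phm_lookup(entries, mask, hash_i32(k), k)
```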
Most of this is over my head :) I just performed the "insertion rate test" in isolation to see how the compiler deals with it, given my really bad experience with small inner loops in my own effort at this kind of thing. The code
generates almost 80 traces and many of them are side traces. I think that this is the reason for the relatively low rate (<4M per second on my machine). @andywingo did you try to call this code from an actual packet-processing loop yet? That's where the real performance hit would happen.
On Tue 17 Nov 2015 08:19, Alexander Gall [email protected] writes:
Building the hash table is not something I want to time. The reason is
Nope! Going to give it a go soon. However it would be :lookup in the

Andy
I tried to do some prefetching just with LuaJIT and wasn't really getting the right thing, not yet anyway. Need to try it on a Haswell system tho. The latest commit adds the ability to save the PHM to disk and load it back with mmap. It also adds more minimal test programs. Having done this:
If I do -jdump I get this as the lookup loop:

```asm
->LOOP:
0bcafa99 mov r15d, ebp
0bcafa9c mov ebp, ebx
0bcafa9e shl ebp, 0x0f
0bcafaa1 not ebp
0bcafaa3 add ebp, ebx
0bcafaa5 mov r14d, ebp
0bcafaa8 shr r14d, 0x0a
0bcafaac xor r14d, ebp
0bcafaaf mov ebp, r14d
0bcafab2 shl ebp, 0x03
0bcafab5 add ebp, r14d
0bcafab8 mov r14d, ebp
0bcafabb shr r14d, 0x06
0bcafabf xor r14d, ebp
0bcafac2 mov ebp, r14d
0bcafac5 shl ebp, 0x0b
0bcafac8 not ebp
0bcafaca add ebp, r14d
0bcafacd mov r14d, ebp
0bcafad0 shr r14d, 0x10
0bcafad4 xor r14d, ebp
0bcafad7 or r14d, 0x80000000
0bcafade mov ebp, r14d
0bcafae1 and ebp, r8d
0bcafae4 movsxd r13, ebp
0bcafae7 imul r13, r13, +0x0c
0bcafaeb cmp r14d, [r13+rdi+0x0]
0bcafaf0 jz 0x0bca0034 ->9
0bcafaf6 add ebp, +0x01
0bcafaf9 and ebp, r8d
0bcafafc movsxd r13, ebp
0bcafaff imul r13, r13, +0x0c
0bcafb03 cmp r14d, [r13+rdi+0x0]
0bcafb08 jnz 0x0bca0038 ->10
0bcafb0e mov r13d, [r13+rdi+0x4]
0bcafb13 xorps xmm6, xmm6
0bcafb16 cvtsi2sd xmm6, r13
0bcafb1b xorps xmm7, xmm7
0bcafb1e cvtsi2sd xmm7, ebx
0bcafb22 ucomisd xmm7, xmm6
0bcafb26 jpe 0x0bca003c ->11
0bcafb2c jnz 0x0bca003c ->11
0bcafb32 add ebx, +0x01
0bcafb35 cmp ebx, eax
0bcafb37 jle 0x0bcafa99 ->LOOP
0bcafb3d jmp 0x0bca0040 ->12
---- TRACE 7 stop -> loop
```
I added a lookup2 test. On this three-year-old Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, the best perf I got on lookup1 was 14.5M lookups/s, and with lookup2 I got 16.5M. Here is the inner loop of the lookup2 test:

```lua
for i = 1, count, 2 do
   local i2 = i + 1
   local hash1 = hash_i32(i)
   local prefetch1 = rhh:prefetch(hash1)
   local hash2 = hash_i32(i2)
   local prefetch2 = rhh:prefetch(hash2)
   result1 = rhh:lookup_with_prefetch(hash1, i, prefetch1)
   result2 = rhh:lookup_with_prefetch(hash2, i2, prefetch2)
end
```

Interestingly I couldn't get any prefetching benefit unless I hoisted the i2 = i + 1, I think because Lua's range inference isn't all that good and so it was inserting an overflow check for the "i+1" to make sure it stayed within the 32-bit range. I think that branch was stalling the memory prefetch; dunno tho. Here's the assembly for the inner loop:

```asm
->LOOP:
0bcaf984 mov r14d, ebp
0bcaf987 mov ebp, ebx
0bcaf989 mov r13d, r15d
0bcaf98c add r13d, +0x01
0bcaf990 jo 0x0bca0030 ->8
0bcaf996 mov ebx, r15d
0bcaf999 shl ebx, 0x0f
0bcaf99c not ebx
0bcaf99e add ebx, r15d
0bcaf9a1 mov r12d, ebx
0bcaf9a4 shr r12d, 0x0a
0bcaf9a8 xor r12d, ebx
0bcaf9ab mov ebx, r12d
0bcaf9ae shl ebx, 0x03
0bcaf9b1 add ebx, r12d
0bcaf9b4 mov r12d, ebx
0bcaf9b7 shr r12d, 0x06
0bcaf9bb xor r12d, ebx
0bcaf9be mov ebx, r12d
0bcaf9c1 shl ebx, 0x0b
0bcaf9c4 not ebx
0bcaf9c6 add ebx, r12d
0bcaf9c9 mov r12d, ebx
0bcaf9cc shr r12d, 0x10
0bcaf9d0 xor r12d, ebx
0bcaf9d3 mov ebx, r12d
0bcaf9d6 and ebx, r9d
0bcaf9d9 movsxd rbx, ebx
0bcaf9dc imul rbx, rbx, +0x0c
0bcaf9e0 mov edi, [rbx+r8]
0bcaf9e4 lea esi, [r15+0x1]
0bcaf9e8 mov ebx, esi
0bcaf9ea shl ebx, 0x0f
0bcaf9ed not ebx
0bcaf9ef add ebx, esi
0bcaf9f1 mov esi, ebx
0bcaf9f3 shr esi, 0x0a
0bcaf9f6 xor esi, ebx
0bcaf9f8 mov ebx, esi
0bcaf9fa shl ebx, 0x03
0bcaf9fd add ebx, esi
0bcaf9ff mov esi, ebx
0bcafa01 shr esi, 0x06
0bcafa04 xor esi, ebx
0bcafa06 mov ebx, esi
0bcafa08 shl ebx, 0x0b
0bcafa0b not ebx
0bcafa0d add ebx, esi
0bcafa0f mov esi, ebx
0bcafa11 shr esi, 0x10
0bcafa14 xor esi, ebx
0bcafa16 mov ebx, esi
0bcafa18 and ebx, r9d
0bcafa1b movsxd rbx, ebx
0bcafa1e imul rbx, rbx, +0x0c
0bcafa22 mov edx, [rbx+r8]
0bcafa26 mov r11d, r12d
0bcafa29 or r11d, 0x80000000
0bcafa30 mov ebx, r11d
0bcafa33 and ebx, r9d
0bcafa36 cmp r11d, edi
0bcafa39 jnz 0x0bca0034 ->9
0bcafa3f movsxd r10, ebx
0bcafa42 imul r10, r10, +0x0c
0bcafa46 mov r10d, [r10+r8+0x4]
0bcafa4b xorps xmm6, xmm6
0bcafa4e cvtsi2sd xmm6, r10
0bcafa53 xorps xmm7, xmm7
0bcafa56 cvtsi2sd xmm7, r15d
0bcafa5b ucomisd xmm7, xmm6
0bcafa5f jpe 0x0bca0038 ->10
0bcafa65 jnz 0x0bca0038 ->10
0bcafa6b mov r11d, esi
0bcafa6e or r11d, 0x80000000
0bcafa75 mov ebp, r9d
0bcafa78 and ebp, r11d
0bcafa7b cmp r11d, edx
0bcafa7e jnz 0x0bca003c ->11
0bcafa84 movsxd r10, ebp
0bcafa87 imul r10, r10, +0x0c
0bcafa8b mov r10d, [r10+r8+0x4]
0bcafa90 xorps xmm6, xmm6
0bcafa93 cvtsi2sd xmm6, r10
0bcafa98 xorps xmm7, xmm7
0bcafa9b cvtsi2sd xmm7, r13d
0bcafaa0 ucomisd xmm7, xmm6
0bcafaa4 jpe 0x0bca0040 ->12
0bcafaaa jnz 0x0bca0040 ->12
0bcafab0 add r15d, +0x02
0bcafab4 cmp r15d, eax
0bcafab7 jle 0x0bcaf984 ->LOOP
0bcafabd jmp 0x0bca0044 ->13
---- TRACE 7 stop -> loop
```
A lookup4 test case aborts, like actually aborts the process :) Also some of the trace compiles abort because of 'register coalescing too complex' :P
OK, finally a lookup4 test that works. Inner loop:

```lua
-- NOTE! Results don't flow out of this loop, so LuaJIT is free to
-- kill the whole loop. Currently that's not the case but you need
-- to verify the traces to ensure that all is well. Caveat emptor!
for i = 1, count, 4 do
   local i2, i3, i4 = i+1, i+2, i+3
   local prefetch1 = rhh:prefetch(hash_i32(i))
   local prefetch2 = rhh:prefetch(hash_i32(i2))
   local prefetch3 = rhh:prefetch(hash_i32(i3))
   local prefetch4 = rhh:prefetch(hash_i32(i4))
   rhh:lookup_with_prefetch(hash_i32(i), i, prefetch1)
   rhh:lookup_with_prefetch(hash_i32(i2), i2, prefetch2)
   rhh:lookup_with_prefetch(hash_i32(i3), i3, prefetch3)
   rhh:lookup_with_prefetch(hash_i32(i4), i4, prefetch4)
end
```

I verified that LuaJIT doesn't kill the loop body, even though it is technically dead. Otherwise, having 4 result phi variables is too much of a burden on the compiler. Loop body:

```asm
->LOOP:
0bcaf3ac mov ebx, ebp
0bcaf3ae add ebx, +0x01
0bcaf3b1 jo 0x0bca0048 ->14
0bcaf3b7 mov r15d, ebp
0bcaf3ba add r15d, +0x02
0bcaf3be jo 0x0bca0048 ->14
0bcaf3c4 mov r14d, ebp
0bcaf3c7 add r14d, +0x03
0bcaf3cb jo 0x0bca0048 ->14
0bcaf3d1 mov r12d, ebp
0bcaf3d4 shl r12d, 0x0f
0bcaf3d8 not r12d
0bcaf3db add r12d, ebp
0bcaf3de mov r13d, r12d
0bcaf3e1 shr r13d, 0x0a
0bcaf3e5 xor r13d, r12d
0bcaf3e8 mov r12d, r13d
0bcaf3eb shl r12d, 0x03
0bcaf3ef add r12d, r13d
0bcaf3f2 mov r13d, r12d
0bcaf3f5 shr r13d, 0x06
0bcaf3f9 xor r13d, r12d
0bcaf3fc mov r12d, r13d
0bcaf3ff shl r12d, 0x0b
0bcaf403 not r12d
0bcaf406 add r12d, r13d
0bcaf409 mov r13d, r12d
0bcaf40c shr r13d, 0x10
0bcaf410 xor r13d, r12d
0bcaf413 mov r12d, r13d
0bcaf416 and r12d, r9d
0bcaf419 movsxd r12, r12d
0bcaf41c imul r12, r12, +0x0c
0bcaf420 mov edi, [r12+r8]
0bcaf424 mov [rsp+0x18], edi
0bcaf428 lea edi, [rbp+0x1]
0bcaf42b mov r12d, edi
0bcaf42e shl r12d, 0x0f
0bcaf432 not r12d
0bcaf435 add r12d, edi
0bcaf438 mov edi, r12d
0bcaf43b shr edi, 0x0a
0bcaf43e xor edi, r12d
0bcaf441 mov r12d, edi
0bcaf444 shl r12d, 0x03
0bcaf448 add r12d, edi
0bcaf44b mov edi, r12d
0bcaf44e shr edi, 0x06
0bcaf451 xor edi, r12d
0bcaf454 mov r12d, edi
0bcaf457 shl r12d, 0x0b
0bcaf45b not r12d
0bcaf45e add r12d, edi
0bcaf461 mov ecx, r12d
0bcaf464 shr ecx, 0x10
0bcaf467 xor ecx, r12d
0bcaf46a mov r12d, ecx
0bcaf46d and r12d, r9d
0bcaf470 movsxd r12, r12d
0bcaf473 imul r12, r12, +0x0c
0bcaf477 mov r12d, [r12+r8]
0bcaf47b lea esi, [rbp+0x2]
0bcaf47e mov edi, esi
0bcaf480 shl edi, 0x0f
0bcaf483 not edi
0bcaf485 add edi, esi
0bcaf487 mov esi, edi
0bcaf489 shr esi, 0x0a
0bcaf48c xor esi, edi
0bcaf48e mov edi, esi
0bcaf490 shl edi, 0x03
0bcaf493 add edi, esi
0bcaf495 mov esi, edi
0bcaf497 shr esi, 0x06
0bcaf49a xor esi, edi
0bcaf49c mov edi, esi
0bcaf49e shl edi, 0x0b
0bcaf4a1 not edi
0bcaf4a3 add edi, esi
0bcaf4a5 mov r11d, edi
0bcaf4a8 shr r11d, 0x10
0bcaf4ac xor r11d, edi
0bcaf4af mov edi, r11d
0bcaf4b2 and edi, r9d
0bcaf4b5 movsxd rdi, edi
0bcaf4b8 imul rdi, rdi, +0x0c
0bcaf4bc mov edi, [rdi+r8]
0bcaf4c0 lea edx, [rbp+0x3]
0bcaf4c3 mov esi, edx
0bcaf4c5 shl esi, 0x0f
0bcaf4c8 not esi
0bcaf4ca add esi, edx
0bcaf4cc mov edx, esi
0bcaf4ce shr edx, 0x0a
0bcaf4d1 xor edx, esi
0bcaf4d3 mov esi, edx
0bcaf4d5 shl esi, 0x03
0bcaf4d8 add esi, edx
0bcaf4da mov edx, esi
0bcaf4dc shr edx, 0x06
0bcaf4df xor edx, esi
0bcaf4e1 mov esi, edx
0bcaf4e3 shl esi, 0x0b
0bcaf4e6 not esi
0bcaf4e8 add esi, edx
0bcaf4ea mov edx, esi
0bcaf4ec shr edx, 0x10
0bcaf4ef xor edx, esi
0bcaf4f1 mov esi, edx
0bcaf4f3 and esi, r9d
0bcaf4f6 movsxd rsi, esi
0bcaf4f9 imul rsi, rsi, +0x0c
0bcaf4fd mov esi, [rsi+r8]
0bcaf501 or r13d, 0x80000000
0bcaf508 mov [rsp+0x14], r13d
0bcaf50d mov r10d, r13d
0bcaf510 and r10d, r9d
0bcaf513 mov [rsp+0x10], r10d
0bcaf518 cmp r13d, [rsp+0x18]
0bcaf51d jnz 0x0bca004c ->15
0bcaf523 movsxd r10, r10d
0bcaf526 imul r10, r10, +0x0c
0bcaf52a mov r10d, [r10+r8+0x4]
0bcaf52f xorps xmm6, xmm6
0bcaf532 cvtsi2sd xmm6, r10
0bcaf537 xorps xmm7, xmm7
0bcaf53a cvtsi2sd xmm7, ebp
0bcaf53e ucomisd xmm7, xmm6
0bcaf542 jpe 0x0bca0050 ->16
0bcaf548 jnz 0x0bca0050 ->16
0bcaf54e or ecx, 0x80000000
0bcaf554 mov [rsp+0xc], ecx
0bcaf558 mov r10d, ecx
0bcaf55b and r10d, r9d
0bcaf55e cmp ecx, r12d
0bcaf561 jnz 0x0bca0054 ->17
0bcaf567 movsxd rcx, r10d
0bcaf56a imul rcx, rcx, +0x0c
0bcaf56e mov ecx, [rcx+r8+0x4]
0bcaf573 xorps xmm6, xmm6
0bcaf576 cvtsi2sd xmm6, rcx
0bcaf57b xorps xmm7, xmm7
0bcaf57e cvtsi2sd xmm7, ebx
0bcaf582 ucomisd xmm7, xmm6
0bcaf586 jpe 0x0bca0058 ->18
0bcaf58c jnz 0x0bca0058 ->18
0bcaf592 or r11d, 0x80000000
0bcaf599 mov r10d, r11d
0bcaf59c and r10d, r9d
0bcaf59f cmp r11d, edi
0bcaf5a2 jnz 0x0bca005c ->19
0bcaf5a8 movsxd rcx, r10d
0bcaf5ab imul rcx, rcx, +0x0c
0bcaf5af mov ecx, [rcx+r8+0x4]
0bcaf5b4 xorps xmm6, xmm6
0bcaf5b7 cvtsi2sd xmm6, rcx
0bcaf5bc xorps xmm7, xmm7
0bcaf5bf cvtsi2sd xmm7, r15d
0bcaf5c4 ucomisd xmm7, xmm6
0bcaf5c8 jpe 0x0bca0060 ->20
0bcaf5ce jnz 0x0bca0060 ->20
0bcaf5d4 or edx, 0x80000000
0bcaf5da mov r11d, r9d
0bcaf5dd and r11d, edx
0bcaf5e0 cmp edx, esi
0bcaf5e2 jnz 0x0bca0064 ->21
0bcaf5e8 movsxd r10, r11d
0bcaf5eb imul r10, r10, +0x0c
0bcaf5ef mov r10d, [r10+r8+0x4]
0bcaf5f4 xorps xmm6, xmm6
0bcaf5f7 cvtsi2sd xmm6, r10
0bcaf5fc xorps xmm7, xmm7
0bcaf5ff cvtsi2sd xmm7, r14d
0bcaf604 ucomisd xmm7, xmm6
0bcaf608 jpe 0x0bca0068 ->22
0bcaf60e jnz 0x0bca0068 ->22
0bcaf614 add ebp, +0x04
0bcaf617 cmp ebp, eax
0bcaf619 jle 0x0bcaf3ac ->LOOP
0bcaf61f jmp 0x0bca006c ->23
---- TRACE 12 stop -> loop
```

So, lots of side traces, but at least the prefetches all seem to happen unconditionally before the side traces, and cdata isn't transferred between traces, so we don't have big GC issues. On the other hand, we have the classic LuaJIT problem that rejoining the loop means re-evaluating the loop header, which hoists a bunch of things like the :prefetch and :lookup_with_prefetch bindings, etc. Oh well.
Interestingly, although LuaJIT is able to CSE the two hash_i32(i) calls, manually hoisting by declaring hash1, hash2, etc. variables causes miscompilation somehow :(
It looks like the first preload is spilled, actually. I think trying to preload 4 values is too much for poor LuaJIT :)
Sorry for the digression.. I have the overwhelming urge to try and impose my own morality onto cold unfeeling silicon. Maybe I should spin this idea off onto a separate Issue. Anyway:

Morally, I don't see why this kind of lookup table performance should be sensitive to cache latency. The connection from the CPU to the L3 cache is a long fat pipe. If we pipeline our requests then we should get high throughput. This requires us to have a mechanism to make pipelined requests and to know in advance which requests to make. It seems like we have both of those things: Haswell CPUs can have 72 memory loads in flight at the same time, and with a hashtable based on open addressing and linear probing we should be able to calculate all potentially interesting addresses before we load anything from memory.

The problem is a lot like browsing the web from Australia with high bandwidth and high latency. The browser needs to parse the HTML and request all of the images, etc., at the same time; then it feels fast. If it waited until they are really needed, e.g. when you scroll down on the page, then it would be unbearably slow and the image would never be there when your eyes go looking for it.

So here is an idea for a step-by-step algorithm for high-throughput loads from L3 or DRAM:
Sane? I suppose that before an idea like this would become interesting, one would want to know that performance was actually bottlenecked on L3 cache latency, which could perhaps be sanity checked e.g. by looking at PMU counters like instructions/cycle to see if the processor is stalled. Like I admitted before, this may be a major off-topic digression :).
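For what it's worth, here is a rough Lua sketch of that shape for an open-addressed table with a known maximum displacement (the names and layout are assumptions for illustration, not the PR's API):

```lua
local bit = require("bit")
local band = bit.band

-- Phase 1: compute every slot a batch of keys could possibly occupy
-- (hash .. hash + max_displacement) without loading anything yet.
-- `keys`, `batch_size`, `mask`, `max_displacement`, `entries`, and
-- `hash_i32` are all assumed from context.
local slots, n = {}, 0
for k = 1, batch_size do
   local h = band(hash_i32(keys[k]), mask)
   for d = 0, max_displacement do
      n = n + 1
      slots[n] = band(h + d, mask)
   end
end

-- Phase 2: touch all of those slots back to back; the loads are
-- independent, so many cache-line requests can be in flight at once.
local sum = 0
for j = 1, n do
   sum = sum + entries[slots[j]].hash   -- real code would copy into a lookup buffer
end

-- Phase 3: finish each key's probe from the now-buffered/warm entries.
```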
Yeah I see where you are going @lukego, and some of my tests have not been on Haswell. So, to do this on a table with linear probing you would need to precompute the maximum displacement in your table, and then read in up to that many entries for all packets, and I guess store those entries in a prefetch buffer; while in theory you have enough registers to do this, I don't feel comfortable relying on this in LuaJIT. I don't know whether the store would cause stalls; we'll have to see. I'll give it a poke and test on a Haswell machine :)
New patches add an API that works like this:

```lua
rhh:load(filename)
local keys, results = rhh:prepare_lookup_bufs(stride)
local count = rhh.occupancy
for i = 1, count, stride do
   local n = math.min(stride, count + 1 - i)
   for j = 0, n-1 do
      keys[j].hash = hash_i32(i+j)
      keys[j].key = i+j
   end
   rhh:fill_lookup_bufs(keys, results, n)
   for j = 0, n-1 do
      local result = rhh:lookup_from_bufs(keys, results, j)
      -- result is an index into `results'
   end
end
```

Unfortunately, for a table of 1e7 elements and the hash function I have, the max_displacement is 18, so the results buf has (4 bytes for hash + 4 bytes for key + 4 bytes for value) * 18 = 216 bytes, so 4 cache misses at least per prefetch stride. Just the fill_lookup_bufs phase seems to max out at around 6M lookups per second on this older i7 machine, or around 1.5 GB/s, far below the max memory bandwidth for this chip; but I'll try on our new Haswell and see. Could be there's something introducing some memory dependency somewhere that we don't want. Could be I should rewrite fill_lookup_bufs to use ffi.copy instead of looping, but then we have to determine the split point in case of wraparound.
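If it helps, the wraparound split for an ffi.copy-based fill could look something like this sketch (names and types are assumptions, not the PR's code):

```lua
local ffi = require("ffi")

-- Copy `count` consecutive entries starting at slot `start` from a table
-- of `size` slots into `dst`, splitting the copy when the run wraps past
-- the end. `dst` and `entries` are assumed to be entry-typed cdata
-- pointers, so pointer arithmetic advances by whole entries.
local function copy_run(dst, entries, start, count, size, entry_size)
   local first = math.min(count, size - start)        -- entries before the end
   ffi.copy(dst, entries + start, first * entry_size)
   if first < count then                              -- wrapped around
      ffi.copy(dst + first, entries, (count - first) * entry_size)
   end
end
```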
Well, worst-case perf on an enormous table is about 66 ns/lookup currently. I tried using non-temporal loads, but that requires write-combining memory, which seems really fiddly to set up. A shame, as with 32-byte entries and a max displacement of 9, each packet will cause 320 bytes to be streamed in from the hash table. It would be nice to avoid that kind of cache pollution, given that our strategy is to stream in and copy. I guess it's good to leave some potential perf improvements to the future though. Still, 66ns/entry for at least 5 LLC misses per entry is not that bad of a worst-case perf, I guess.

Given that with big tables we're probably missing the TLB cache as well, I'm going to tweak our AVX2 multi-copy routine to go wide instead of deep: instead of pipelining N fetches on up to 4 keys, it will pipeline 1 fetch on up to N keys. I wish dynasm actually supported ymm(n) / Rq(n) for n >= 8. grumble.
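In loop-order terms, the "deep vs. wide" distinction is roughly the following sketch of intent (hypothetical names; the real routine is dynasm/AVX2, not Lua):

```lua
local bit = require("bit")
local band = bit.band

-- `entries`, `hashes`, `nkeys`, `max_displacement`, and `mask` are assumed
-- from context; `touch` stands in for streaming one entry into a buffer.
local sum = 0
local function touch(slot) sum = sum + entries[slot].hash end

-- "Deep": for each key, walk its whole displacement range before moving on,
-- so at most a handful of independent request streams exist at once.
for k = 1, nkeys do
   for d = 0, max_displacement do touch(band(hashes[k] + d, mask)) end
end

-- "Wide": issue one load per key across the whole batch before advancing to
-- the next displacement, keeping up to nkeys independent loads in flight.
for d = 0, max_displacement do
   for k = 1, nkeys do touch(band(hashes[k] + d, mask)) end
end
```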
Could be appropriate to file a bug about that on the luajit/luajit repo; maybe somebody can help or point in the right direction (e.g. @corsix).
Closing this PR as #163 is the actionable one. I'm a bit dissatisfied with the perf still, but I have done all that I think is possible short of non-temporal loads, so time to leave this for some other day :)
One more idle idea in passing... Could it be useful to represent a hashtable using a thousand dollars' worth of RAM (128GB)? I mean: would the extra address space for hash keys reduce the average number of indirections and misses?
Multiplied by N to run N lwaftrs in parallel unless they can share this huge table...
@lukego It's possible. Drastically decreasing the load factor -- like going down to 10% or less -- would decrease the amount of data you'd have to fetch. Like you might get down to a max displacement of 4 or less maybe with a good hash function. But perf has been surprisingly resistant to optimization -- I too believe we should be seeing 10ns lookups, not 70ns lookups, even in the worst case, and it's not clear that simply throwing memory at the problem would be sufficient. Would be funny to pay for memory knowing that the only reason you are doing so is to keep it empty :) Some day I will understand why it takes 500-700ns to load 3200 bytes into cache. It just doesn't seem right!
@kbara random thought re: sharing. In an OpenStack context you could request a VM with (say) 8 CPU cores, 8 x 10G ports, and 100GB of RAM. Then in principle you could run multiple AFTR processes in there, each serving a different port and sharing the table locally with an mmap'd file. (Under the hood you would have 8 Snabb NFV processes serving those 10G ports with our usual setup.)

On the other hand, this would be a bit monolithic compared with running a separate VM for each port (or pair of ports), and that might make it inconvenient to deploy, e.g. a single hardware failure would take out a lot of AFTR capacity, and maybe you would end up having to mix cores from multiple NUMA nodes (either in the VM or between the VM and the hypervisor networking). Just thinking aloud :).
@lukego yeah; the memory layout option of that is something I've been pondering in the back of my head for a few months. I hate the anti-process-isolation properties, but like not having a bunch of copies of the same large table. A single hardware failure would potentially take out a lot of AFTR capacity either way. I haven't pondered the OpenStack angle. :)
I am contemplating ways of doing full VREG support for x64 dynasm.
For anyone interested, see LuaJIT/LuaJIT#119.
Just following up on this ancient thread: it was pointed out to me on Twitter that 7-cpu publishes empirically measured throughput (parallelism * latency) numbers for L1/L2/L3/RAM on various processors. See for example the Skylake results. Could be useful for understanding how much parallelism is possible and likely to be beneficial for things like fetching hashtable entries.

For example, Skylake RAM latency is 51ns + 42 cycles (let's call it 65ns) and parallel random RAM read throughput is one result (64B cache line) every 5.9ns. So that sounds like approximately 11 parallel requests, which is not bad! This is also a significant speedup over Haswell, which is rated at one result every 8ns according to the same site.
This is a WIP PR -- it is an implementation of an open-addressed Robin Hood hash table with linear probing. The implementation is designed to operate on cdata objects.
There is also an attempt at a caching layer, the idea being that if only 1MB of your 40MB table is hot, you can get better perf by adding a cache table. That table would maximize cache line utilization and also reduce TLB misses. It works well on my i7 laptop but I think not so well on interlaken. The best-case perf of adding a cache should be significantly better but in the pessimal case that every lookup misses the cache, total perf would be worse. Dunno, it exposes the same interface, so you can choose.
Results on interlaken:
This PR also includes a simple hash function, one of Bob Jenkins'. Anyway. Hashing appears to be ridiculously cheap; it's memory that's costly. This E5-2620v3 appears to have a worst-case lookup rate of 15M lookups/second, or 66 nanoseconds/lookup. This is about the cost of a cache miss/RAM fetch per lookup. In this test, both keys and values are 4-byte integers; I think that, given that the hash function spreads out our lookups so that a cache line is only ever used for the hash and key, it doesn't matter how big the value is -- for lookup time, anyway. I could have used 100-byte values, is what I'm saying here.
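Reading the inlined hash off the trace assembly earlier in the thread, it appears to be a shift/xor integer mix along these lines; a reconstruction for illustration only, so the actual hash_i32 in the PR may differ in details:

```lua
local bit = require("bit")
local bnot, bxor = bit.bnot, bit.bxor
local lshift, rshift = bit.lshift, bit.rshift

local function hash_i32(i)
   i = bnot(lshift(i, 15)) + i        -- shl 15 / not / add
   i = bxor(i, rshift(i, 10))         -- shr 10 / xor
   i = i + lshift(i, 3)               -- shl 3  / add
   i = bxor(i, rshift(i, 6))          -- shr 6  / xor
   i = bnot(lshift(i, 11)) + i        -- shl 11 / not / add
   return bxor(i, rshift(i, 16))      -- shr 16 / xor
end
```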
The cache didn't appear to do anything for me. I guess TLB misses aren't a consideration for us, and really if they are, we should be asking for huge pages instead.
Cc @lukego @kbara for your thoughts, if you have any :)