Add PodHashMap #97

Closed
wants to merge 57 commits into from

Conversation

wingo commented Nov 15, 2015

This is a WIP PR -- it implements an open-addressed Robin Hood hash table with linear probing, designed to operate on cdata objects.
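
For readers new to the technique, here is a minimal sketch of the core lookup over cdata entries -- the names, the 12-byte entry layout, and the hash-zero-means-empty convention are illustrative assumptions, not this PR's actual code:

   local ffi = require("ffi")
   local bit = require("bit")

   -- Hypothetical 12-byte entry; the table itself would be e.g.
   -- ffi.new("struct entry[?]", size).
   ffi.cdef[[ struct entry { uint32_t hash, key, value; }; ]]

   -- Assumes a power-of-two table (mask = size - 1) and that hash == 0
   -- marks an empty slot.
   local function lookup(entries, mask, hash, key)
      local index = bit.band(hash, mask)
      local distance = 0             -- our displacement from our home slot
      while true do
         local entry = entries[index]
         if entry.hash == hash and entry.key == key then
            return entry.value       -- hit
         end
         -- Robin Hood invariant: entries are ordered by displacement, so
         -- at an empty slot, or at an entry closer to its own home slot
         -- than we are to ours, the key cannot be present further along.
         local home = bit.band(entry.hash, mask)
         if entry.hash == 0 or bit.band(index - home, mask) < distance then
            return nil               -- miss
         end
         index = bit.band(index + 1, mask)
         distance = distance + 1
      end
   end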

There is also an attempt at a caching layer, the idea being that if only 1MB of your 40MB table is hot, you can get better perf by adding a small cache table in front. That table would maximize cache-line utilization and also reduce TLB misses. It works well on my i7 laptop but, I think, not so well on interlaken. The best-case perf of adding a cache should be significantly better, but in the pessimal case where every lookup misses the cache, total perf would be worse. It exposes the same interface either way, so you can choose.
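
To sketch the shape of that idea (names and promotion policy here are assumptions, not this PR's code): a small front table is probed first and the big table only on a cache miss, which is also why the all-miss pessimal case pays for both probes.

   -- `main' and `cache' are assumed to be PodHashMap-like objects that
   -- expose the same lookup()/add() interface, so the wrapper does too.
   local CachingPHM = {}
   CachingPHM.__index = CachingPHM

   function CachingPHM.new(main, cache)
      return setmetatable({main = main, cache = cache}, CachingPHM)
   end

   function CachingPHM:lookup(hash, key)
      local v = self.cache:lookup(hash, key)
      if v ~= nil then return v end     -- hot path: small working set
      v = self.main:lookup(hash, key)   -- miss: probe the big table too
      if v ~= nil then self.cache:add(hash, key, v) end
      return v
   end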

Results on interlaken:

wingo@interlaken:~/snabbswitch/src$ ./snabb snsh apps/lwaftr/phmtest.lua
hash rate test
574.5279184241 million hashes per second (final result: 1543808256)
insertion rate test
5.0559188134633 million insertions per second
verification
lookup speed test (hits, uniform distribution)
15.615401383563 million lookups per second (final result: 4602468)
lookup speed test (hits, only 16K active entries)
29.995174646259 million lookups per second (final result: 8149854)
lookup speed test (hits, only 16K active entries, 64K entry cache)
32.036958553015 million lookups per second (final result: 23391)
lookup speed test (warm cache hits, only 16K active entries, 64K entry cache)
32.969044296113 million lookups per second (final result: 23391)
cache usage: 32767/65536
cache verification
success

This PR also includes a simple hash function, one of Bob Jenkins'. Hashing appears to be ridiculously cheap; it's memory that's costly. This E5-2620v3 appears to have a worst-case lookup rate of 15M lookups/second, or 66 nanoseconds/lookup -- about the cost of one cache miss/RAM fetch per lookup. In this test both keys and values are 4-byte integers, but since the hash function spreads lookups out so that a cache line is only ever touched for the hash and key, the size of the value shouldn't matter for lookup time; the values could just as well have been 100 bytes each.
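
For reference, the six-shift integer mix that shows up in the -jdump traces later in this thread looks like this in Lua -- a reconstruction from that assembly, not a copy of the PR's source; the OR with 0x80000000 appears separately at the lookup sites, presumably so that a stored hash is never zero:

   local bit = require("bit")
   local bnot, bxor = bit.bnot, bit.bxor
   local lshift, rshift = bit.lshift, bit.rshift

   -- LuaJIT's bit ops truncate to 32 bits, matching the 32-bit registers
   -- in the trace assembly below.
   local function hash_i32(i)
      i = bnot(lshift(i, 15)) + i     -- i += ~(i << 15)
      i = bxor(i, rshift(i, 10))      -- i ^=  (i >> 10)
      i = i + lshift(i, 3)            -- i +=  (i << 3)
      i = bxor(i, rshift(i, 6))       -- i ^=  (i >> 6)
      i = bnot(lshift(i, 11)) + i     -- i += ~(i << 11)
      return bxor(i, rshift(i, 16))   -- i ^=  (i >> 16)
   end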

The cache didn't appear to do anything for me. I guess TLB misses aren't a consideration for us, and really if they are, we should be asking for huge pages instead.

Cc @lukego @kbara for their thoughts, if you have any :)

wingo commented Nov 15, 2015

The printout claims it's only doing 16K active entries but in reality it's doing 32K; I'll have to figure out some nicer perf tests/benchmarks if we decide to use this for the lwaftr binding table.

I guess I should end by noting that if the cache misses parallelize, then this would be a fine binding table data structure when running multiple lwaftrs on one machine. The lookup appears to really be just some 70ns in the worst case of a cache miss, which is what we budgeted for. We'll need two tables, one for the from-IPv6 and the from-IPv4 sides, but that's fine. We appear to remain on target perf-wise.

-- simulate only 16K active flows: i = bit.band(i, 0xffff)
result = rhh:lookup(hash_i32(i), i)
end
stop = ffi.C.get_time_ns()

A project member commented on this hunk:

Can localize C (i.e. local C = ffi.C).

dpino commented Nov 16, 2015

Looking good, promising results.

lukego commented Nov 17, 2015

This is awesome stuff, Andy!

I guess I should end by noting that if the cache misses parallelize, then this would be a fine binding table data structure when running multiple lwaftrs on one machine.

I am fascinated by the idea of parallelizing cache lookups to increase throughput by amortizing latency costs :-). cc @alexandergall

I see four main ways this can be done:

  1. Parallel requests from separate cores.
  2. Parallel requests from hyperthreads.
  3. PREFETCH instructions to fire off loads.
  4. Instruction-level parallelism (ILP) between load instructions.

I have done an experiment with (4) ILP now and the results are interesting! This is inspired by an idea from Mike Pall.

The idea is to have machine code that looks up multiple keys in such a way that the CPU can use ILP to issue the loads from L3/memory in parallel. In TCP/networking terms this would be "increasing the congestion window" by having multiple requests to the cache outstanding at the same time. That way if the latency is (say) 60ns then you would be getting the results of N lookups every 60ns instead of just one. For example if N=10 then you would be able to complete an L3 lookup on average every 6ns.

I wrote some proof-of-concept assembler code to see if this works. (It does seem to.) I have three simple benchmarks: p1, p2, and p4. Each one fetches values from L3 cache in a loop. Parallelism is limited because the address of each fetch depends on the result of the previous one. This is compensated in p2 by issuing two separate fetches at a time and in p4 with four fetches at a time.
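
For flavor, here is a rough Lua/FFI analogue of p1 and p2 (the measured benchmarks are assembler; this shuffled-table version is only a sketch): each slot holds the next index to visit, so every load depends on the previous one, and p2 keeps two independent chains in flight.

   local ffi = require("ffi")

   local size = 2^20                        -- 4MB of uint32_t: bigger than
   local t = ffi.new("uint32_t[?]", size)   -- L2, small enough to sit in L3
   for i = 0, size - 1 do t[i] = i end
   math.randomseed(42)
   for i = size - 1, 1, -1 do               -- Fisher-Yates shuffle: a random
      local j = math.random(0, i)           -- permutation to chase through
      t[i], t[j] = t[j], t[i]
   end

   local function p1(n)                     -- one serial dependency chain
      local a = 0
      for _ = 1, n do a = t[a] end
      return a
   end

   local function p2(n)                     -- two chains the CPU can overlap
      local a, b = 0, 1
      for _ = 1, n / 2 do a, b = t[a], t[b] end
      return a + b
   end
   -- p4 is the same idea with four chains.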

The hypothesised runtime behavior is like this (to borrow a TCP-style diagram):

[hand-drawn TCP-style diagram: multiple loads in flight at once, overlapping their latencies like a larger congestion window]

I expect each routine to have the same number of L3 cache misses and the same L3 latency, but higher lookup throughput for the parallel versions (more results/second received from L3).

I did this experiment. See code and results. Below is the overview.

p1

Inner loop:

   | mov r8, [rsi+r8*8]

PMU report for 100M lookups:

EVENT                                             TOTAL       /loop
cycles                                    5,352,323,813      53.523
ref_cycles                                4,015,400,520      40.154
instructions                                402,419,855       4.024
mem_load_uops_retired.l1_hit                      1,314       0.000
mem_load_uops_retired.l2_hit                     17,900       0.000
mem_load_uops_retired.l3_hit                 99,727,124       0.997
mem_load_uops_retired.l3_miss                   257,335       0.003
loop                                        100,000,000       1.000

Result: ~54 cycles per lookup (and one L3 cache hit confirmed).

(There is a misnomer here: it would be more accurate to print /lookup rather than /loop.)

p2

Inner loop:

   | mov r8, [rsi+r8*8]
   | mov r9, [rsi+r9*8]

PMU report for 100M lookups:

EVENT                                             TOTAL       /loop
cycles                                    2,724,715,550      27.247
ref_cycles                                2,044,031,688      20.440
instructions                                301,660,952       3.017
mem_load_uops_retired.l1_hit                     18,096       0.000
mem_load_uops_retired.l2_hit                  7,019,903       0.070
mem_load_uops_retired.l3_hit                 92,723,309       0.927
mem_load_uops_retired.l3_miss                   241,453       0.002
loop                                        100,000,000       1.000

Result: ~27 cycles per lookup (and one L3 cache hit confirmed).

p4

Inner loop:

   | mov r8, [rsi+r8*8]
   | mov r9, [rsi+r9*8]
   | mov r10, [rsi+r10*8]
   | mov r11, [rsi+r11*8]

PMU report for 100M lookups:

EVENT                                             TOTAL       /loop
cycles                                    1,394,710,512      13.947
ref_cycles                                1,046,561,472      10.466
instructions                                250,792,091       2.508
mem_load_uops_retired.l1_hit                     11,931       0.000
mem_load_uops_retired.l2_hit                     33,882       0.000
mem_load_uops_retired.l3_hit                 99,679,009       0.997
mem_load_uops_retired.l3_miss                   276,827       0.003
loop                                        100,000,000       1.000

Result: ~14 cycles per lookup (and one L3 cache hit confirmed).

End

So! Looks like ILP can be used to issue parallel cache lookups and this can reduce the average lookup cost to far below the access time. Interesting to bring into a practical data structure?

lukego commented Nov 17, 2015

This code looks really nice!

This E5-2620v3 appears to have a worst-case lookup rate of 15M lookups/second, or 66 nanoseconds/lookup.

In LuaJIT it seems like the worst-case time also needs to take into account the potential non-local effects of calling the lookup function, particularly the risk of branching off to a side trace (lukego/blog#8). I would want to be confident that the probing loop is predictable enough for LuaJIT (e.g. heavily biased towards one iteration count) that exits to side traces would be rare. You might be able to check that with traceprof (snabbco#623) to see whether one looping trace dominates the time spent. (Maybe this could be influenced by throwing memory at the problem: an application that is already using a whole CPU core should not be shy about using a couple of gigabytes of memory.)

This seems important for these subroutines that can be called from tight traces. Alex Gall decided to hide such control flow from LuaJIT by writing it in C and calling it via FFI. (It's a pain in the ass to be sure.)

alexandergall commented Nov 17, 2015

Most of this is over my head :) I just performed the "insertion rate test" in isolation to see how the compiler deals with it, given my really bad experience with small inner loops in my own effort at this kind of thing. The code

for i = 1, count do
   local h = hash_i32(i)
   local v = bit.bnot(i)
   rhh:add(h, i, v)
end

generates almost 80 traces and many of them are side traces. I think that this is the reason for the relatively low rate (<4M per second on my machine). @andywingo did you try to call this code from an actual packet-processing loop yet? That's where the real performance hit would happen.

wingo commented Nov 17, 2015

On Tue 17 Nov 2015 08:19, Alexander Gall [email protected] writes:

The code

for i = 1, count do
local h = hash_i32(i)
local v = bit.bnot(i)
rhh:add(h, i, v)
end

generates almost 80 traces and many of them are side traces.

Building the hash table is not something I want to time. The reason is that it doubles the backing store of the hash table every time it fills up, and in my use case I build the table beforehand and add/remove entries relatively rarely. To eliminate the extra traces you could call rhh:resize(bit.lshift(1, 24)) or so before adding elements.

@andywingo did you try to call this code from an actual
packet-processing loop yet?

Nope! Going to give it a go soon. However it would be :lookup in the
loop, not :add, and I think we'll get one hot trace out of that.

Andy

wingo commented Nov 17, 2015

I tried to do some prefetching just with LuaJIT and wasn't really getting the right thing, not yet anyway. Need to try it on a Haswell system, though.

The latest commit adds the ability to save the PHM to disk and load it back with mmap. It also adds more minimal test programs.

Having done this:

./snabb snsh apps/lwaftr/test_phm_create.lua 1e7 foo.phm
./snabb snsh apps/lwaftr/test_phm_lookup1.lua foo.phm

If I do -jdump I get this as the lookup loop:

->LOOP:
0bcafa99  mov r15d, ebp
0bcafa9c  mov ebp, ebx
0bcafa9e  shl ebp, 0x0f
0bcafaa1  not ebp
0bcafaa3  add ebp, ebx
0bcafaa5  mov r14d, ebp
0bcafaa8  shr r14d, 0x0a
0bcafaac  xor r14d, ebp
0bcafaaf  mov ebp, r14d
0bcafab2  shl ebp, 0x03
0bcafab5  add ebp, r14d
0bcafab8  mov r14d, ebp
0bcafabb  shr r14d, 0x06
0bcafabf  xor r14d, ebp
0bcafac2  mov ebp, r14d
0bcafac5  shl ebp, 0x0b
0bcafac8  not ebp
0bcafaca  add ebp, r14d
0bcafacd  mov r14d, ebp
0bcafad0  shr r14d, 0x10
0bcafad4  xor r14d, ebp
0bcafad7  or r14d, 0x80000000
0bcafade  mov ebp, r14d
0bcafae1  and ebp, r8d
0bcafae4  movsxd r13, ebp
0bcafae7  imul r13, r13, +0x0c
0bcafaeb  cmp r14d, [r13+rdi+0x0]
0bcafaf0  jz 0x0bca0034 ->9
0bcafaf6  add ebp, +0x01
0bcafaf9  and ebp, r8d
0bcafafc  movsxd r13, ebp
0bcafaff  imul r13, r13, +0x0c
0bcafb03  cmp r14d, [r13+rdi+0x0]
0bcafb08  jnz 0x0bca0038    ->10
0bcafb0e  mov r13d, [r13+rdi+0x4]
0bcafb13  xorps xmm6, xmm6
0bcafb16  cvtsi2sd xmm6, r13
0bcafb1b  xorps xmm7, xmm7
0bcafb1e  cvtsi2sd xmm7, ebx
0bcafb22  ucomisd xmm7, xmm6
0bcafb26  jpe 0x0bca003c    ->11
0bcafb2c  jnz 0x0bca003c    ->11
0bcafb32  add ebx, +0x01
0bcafb35  cmp ebx, eax
0bcafb37  jle 0x0bcafa99    ->LOOP
0bcafb3d  jmp 0x0bca0040    ->12
---- TRACE 7 stop -> loop

wingo commented Nov 17, 2015

I added a lookup2 test. On this three-year-old Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, the best perf I got on lookup1 was 14.5M lookups/s, and with lookup2 I got 16.5M. Here is the inner loop of the lookup2 test:

   for i = 1, count, 2 do
      local i2 = i + 1
      local hash1 = hash_i32(i)
      local prefetch1 = rhh:prefetch(hash1)
      local hash2 = hash_i32(i2)
      local prefetch2 = rhh:prefetch(hash2)

      result1 = rhh:lookup_with_prefetch(hash1, i, prefetch1)
      result2 = rhh:lookup_with_prefetch(hash2, i2, prefetch2)
   end

Interestingly, I couldn't get any prefetching benefit unless I hoisted the i2 = i + 1. I think LuaJIT's range inference isn't all that good, so it was inserting an overflow check for the "i+1" to make sure it stayed within the 32-bit range, and that branch was stalling the memory prefetch; I'm not sure, though. Here's the assembly for the inner loop:

->LOOP:
0bcaf984  mov r14d, ebp
0bcaf987  mov ebp, ebx
0bcaf989  mov r13d, r15d
0bcaf98c  add r13d, +0x01
0bcaf990  jo 0x0bca0030 ->8
0bcaf996  mov ebx, r15d
0bcaf999  shl ebx, 0x0f
0bcaf99c  not ebx
0bcaf99e  add ebx, r15d
0bcaf9a1  mov r12d, ebx
0bcaf9a4  shr r12d, 0x0a
0bcaf9a8  xor r12d, ebx
0bcaf9ab  mov ebx, r12d
0bcaf9ae  shl ebx, 0x03
0bcaf9b1  add ebx, r12d
0bcaf9b4  mov r12d, ebx
0bcaf9b7  shr r12d, 0x06
0bcaf9bb  xor r12d, ebx
0bcaf9be  mov ebx, r12d
0bcaf9c1  shl ebx, 0x0b
0bcaf9c4  not ebx
0bcaf9c6  add ebx, r12d
0bcaf9c9  mov r12d, ebx
0bcaf9cc  shr r12d, 0x10
0bcaf9d0  xor r12d, ebx
0bcaf9d3  mov ebx, r12d
0bcaf9d6  and ebx, r9d
0bcaf9d9  movsxd rbx, ebx
0bcaf9dc  imul rbx, rbx, +0x0c
0bcaf9e0  mov edi, [rbx+r8]
0bcaf9e4  lea esi, [r15+0x1]
0bcaf9e8  mov ebx, esi
0bcaf9ea  shl ebx, 0x0f
0bcaf9ed  not ebx
0bcaf9ef  add ebx, esi
0bcaf9f1  mov esi, ebx
0bcaf9f3  shr esi, 0x0a
0bcaf9f6  xor esi, ebx
0bcaf9f8  mov ebx, esi
0bcaf9fa  shl ebx, 0x03
0bcaf9fd  add ebx, esi
0bcaf9ff  mov esi, ebx
0bcafa01  shr esi, 0x06
0bcafa04  xor esi, ebx
0bcafa06  mov ebx, esi
0bcafa08  shl ebx, 0x0b
0bcafa0b  not ebx
0bcafa0d  add ebx, esi
0bcafa0f  mov esi, ebx
0bcafa11  shr esi, 0x10
0bcafa14  xor esi, ebx
0bcafa16  mov ebx, esi
0bcafa18  and ebx, r9d
0bcafa1b  movsxd rbx, ebx
0bcafa1e  imul rbx, rbx, +0x0c
0bcafa22  mov edx, [rbx+r8]
0bcafa26  mov r11d, r12d
0bcafa29  or r11d, 0x80000000
0bcafa30  mov ebx, r11d
0bcafa33  and ebx, r9d
0bcafa36  cmp r11d, edi
0bcafa39  jnz 0x0bca0034    ->9
0bcafa3f  movsxd r10, ebx
0bcafa42  imul r10, r10, +0x0c
0bcafa46  mov r10d, [r10+r8+0x4]
0bcafa4b  xorps xmm6, xmm6
0bcafa4e  cvtsi2sd xmm6, r10
0bcafa53  xorps xmm7, xmm7
0bcafa56  cvtsi2sd xmm7, r15d
0bcafa5b  ucomisd xmm7, xmm6
0bcafa5f  jpe 0x0bca0038    ->10
0bcafa65  jnz 0x0bca0038    ->10
0bcafa6b  mov r11d, esi
0bcafa6e  or r11d, 0x80000000
0bcafa75  mov ebp, r9d
0bcafa78  and ebp, r11d
0bcafa7b  cmp r11d, edx
0bcafa7e  jnz 0x0bca003c    ->11
0bcafa84  movsxd r10, ebp
0bcafa87  imul r10, r10, +0x0c
0bcafa8b  mov r10d, [r10+r8+0x4]
0bcafa90  xorps xmm6, xmm6
0bcafa93  cvtsi2sd xmm6, r10
0bcafa98  xorps xmm7, xmm7
0bcafa9b  cvtsi2sd xmm7, r13d
0bcafaa0  ucomisd xmm7, xmm6
0bcafaa4  jpe 0x0bca0040    ->12
0bcafaaa  jnz 0x0bca0040    ->12
0bcafab0  add r15d, +0x02
0bcafab4  cmp r15d, eax
0bcafab7  jle 0x0bcaf984    ->LOOP
0bcafabd  jmp 0x0bca0044    ->13
---- TRACE 7 stop -> loop

wingo commented Nov 17, 2015

A lookup4 test case aborts, like actually aborts the process :) Also some of the trace compiles abort because 'register coalescing too complex' :P

wingo commented Nov 17, 2015

OK, finally a lookup4 test that works. Inner loop:

   -- NOTE!  Results don't flow out of this loop, so LuaJIT is free to
   -- kill the whole loop.  Currently that's not the case but you need
   -- to verify the traces to ensure that all is well.  Caveat emptor!
   for i = 1, count, 4 do
      local i2, i3, i4 = i+1, i+2, i+3
      local prefetch1 = rhh:prefetch(hash_i32(i))
      local prefetch2 = rhh:prefetch(hash_i32(i2))
      local prefetch3 = rhh:prefetch(hash_i32(i3))
      local prefetch4 = rhh:prefetch(hash_i32(i4))

      rhh:lookup_with_prefetch(hash_i32(i), i, prefetch1)
      rhh:lookup_with_prefetch(hash_i32(i2), i2, prefetch2)
      rhh:lookup_with_prefetch(hash_i32(i3), i3, prefetch3)
      rhh:lookup_with_prefetch(hash_i32(i4), i4, prefetch4)
   end

I verified that LuaJIT doesn't kill the loop body, even though it is technically dead. Otherwise having 4 result phi variables is too much of a burden on the compiler. Loop body:

->LOOP:
0bcaf3ac  mov ebx, ebp
0bcaf3ae  add ebx, +0x01
0bcaf3b1  jo 0x0bca0048 ->14
0bcaf3b7  mov r15d, ebp
0bcaf3ba  add r15d, +0x02
0bcaf3be  jo 0x0bca0048 ->14
0bcaf3c4  mov r14d, ebp
0bcaf3c7  add r14d, +0x03
0bcaf3cb  jo 0x0bca0048 ->14
0bcaf3d1  mov r12d, ebp
0bcaf3d4  shl r12d, 0x0f
0bcaf3d8  not r12d
0bcaf3db  add r12d, ebp
0bcaf3de  mov r13d, r12d
0bcaf3e1  shr r13d, 0x0a
0bcaf3e5  xor r13d, r12d
0bcaf3e8  mov r12d, r13d
0bcaf3eb  shl r12d, 0x03
0bcaf3ef  add r12d, r13d
0bcaf3f2  mov r13d, r12d
0bcaf3f5  shr r13d, 0x06
0bcaf3f9  xor r13d, r12d
0bcaf3fc  mov r12d, r13d
0bcaf3ff  shl r12d, 0x0b
0bcaf403  not r12d
0bcaf406  add r12d, r13d
0bcaf409  mov r13d, r12d
0bcaf40c  shr r13d, 0x10
0bcaf410  xor r13d, r12d
0bcaf413  mov r12d, r13d
0bcaf416  and r12d, r9d
0bcaf419  movsxd r12, r12d
0bcaf41c  imul r12, r12, +0x0c
0bcaf420  mov edi, [r12+r8]
0bcaf424  mov [rsp+0x18], edi
0bcaf428  lea edi, [rbp+0x1]
0bcaf42b  mov r12d, edi
0bcaf42e  shl r12d, 0x0f
0bcaf432  not r12d
0bcaf435  add r12d, edi
0bcaf438  mov edi, r12d
0bcaf43b  shr edi, 0x0a
0bcaf43e  xor edi, r12d
0bcaf441  mov r12d, edi
0bcaf444  shl r12d, 0x03
0bcaf448  add r12d, edi
0bcaf44b  mov edi, r12d
0bcaf44e  shr edi, 0x06
0bcaf451  xor edi, r12d
0bcaf454  mov r12d, edi
0bcaf457  shl r12d, 0x0b
0bcaf45b  not r12d
0bcaf45e  add r12d, edi
0bcaf461  mov ecx, r12d
0bcaf464  shr ecx, 0x10
0bcaf467  xor ecx, r12d
0bcaf46a  mov r12d, ecx
0bcaf46d  and r12d, r9d
0bcaf470  movsxd r12, r12d
0bcaf473  imul r12, r12, +0x0c
0bcaf477  mov r12d, [r12+r8]
0bcaf47b  lea esi, [rbp+0x2]
0bcaf47e  mov edi, esi
0bcaf480  shl edi, 0x0f
0bcaf483  not edi
0bcaf485  add edi, esi
0bcaf487  mov esi, edi
0bcaf489  shr esi, 0x0a
0bcaf48c  xor esi, edi
0bcaf48e  mov edi, esi
0bcaf490  shl edi, 0x03
0bcaf493  add edi, esi
0bcaf495  mov esi, edi
0bcaf497  shr esi, 0x06
0bcaf49a  xor esi, edi
0bcaf49c  mov edi, esi
0bcaf49e  shl edi, 0x0b
0bcaf4a1  not edi
0bcaf4a3  add edi, esi
0bcaf4a5  mov r11d, edi
0bcaf4a8  shr r11d, 0x10
0bcaf4ac  xor r11d, edi
0bcaf4af  mov edi, r11d
0bcaf4b2  and edi, r9d
0bcaf4b5  movsxd rdi, edi
0bcaf4b8  imul rdi, rdi, +0x0c
0bcaf4bc  mov edi, [rdi+r8]
0bcaf4c0  lea edx, [rbp+0x3]
0bcaf4c3  mov esi, edx
0bcaf4c5  shl esi, 0x0f
0bcaf4c8  not esi
0bcaf4ca  add esi, edx
0bcaf4cc  mov edx, esi
0bcaf4ce  shr edx, 0x0a
0bcaf4d1  xor edx, esi
0bcaf4d3  mov esi, edx
0bcaf4d5  shl esi, 0x03
0bcaf4d8  add esi, edx
0bcaf4da  mov edx, esi
0bcaf4dc  shr edx, 0x06
0bcaf4df  xor edx, esi
0bcaf4e1  mov esi, edx
0bcaf4e3  shl esi, 0x0b
0bcaf4e6  not esi
0bcaf4e8  add esi, edx
0bcaf4ea  mov edx, esi
0bcaf4ec  shr edx, 0x10
0bcaf4ef  xor edx, esi
0bcaf4f1  mov esi, edx
0bcaf4f3  and esi, r9d
0bcaf4f6  movsxd rsi, esi
0bcaf4f9  imul rsi, rsi, +0x0c
0bcaf4fd  mov esi, [rsi+r8]
0bcaf501  or r13d, 0x80000000
0bcaf508  mov [rsp+0x14], r13d
0bcaf50d  mov r10d, r13d
0bcaf510  and r10d, r9d
0bcaf513  mov [rsp+0x10], r10d
0bcaf518  cmp r13d, [rsp+0x18]
0bcaf51d  jnz 0x0bca004c    ->15
0bcaf523  movsxd r10, r10d
0bcaf526  imul r10, r10, +0x0c
0bcaf52a  mov r10d, [r10+r8+0x4]
0bcaf52f  xorps xmm6, xmm6
0bcaf532  cvtsi2sd xmm6, r10
0bcaf537  xorps xmm7, xmm7
0bcaf53a  cvtsi2sd xmm7, ebp
0bcaf53e  ucomisd xmm7, xmm6
0bcaf542  jpe 0x0bca0050    ->16
0bcaf548  jnz 0x0bca0050    ->16
0bcaf54e  or ecx, 0x80000000
0bcaf554  mov [rsp+0xc], ecx
0bcaf558  mov r10d, ecx
0bcaf55b  and r10d, r9d
0bcaf55e  cmp ecx, r12d
0bcaf561  jnz 0x0bca0054    ->17
0bcaf567  movsxd rcx, r10d
0bcaf56a  imul rcx, rcx, +0x0c
0bcaf56e  mov ecx, [rcx+r8+0x4]
0bcaf573  xorps xmm6, xmm6
0bcaf576  cvtsi2sd xmm6, rcx
0bcaf57b  xorps xmm7, xmm7
0bcaf57e  cvtsi2sd xmm7, ebx
0bcaf582  ucomisd xmm7, xmm6
0bcaf586  jpe 0x0bca0058    ->18
0bcaf58c  jnz 0x0bca0058    ->18
0bcaf592  or r11d, 0x80000000
0bcaf599  mov r10d, r11d
0bcaf59c  and r10d, r9d
0bcaf59f  cmp r11d, edi
0bcaf5a2  jnz 0x0bca005c    ->19
0bcaf5a8  movsxd rcx, r10d
0bcaf5ab  imul rcx, rcx, +0x0c
0bcaf5af  mov ecx, [rcx+r8+0x4]
0bcaf5b4  xorps xmm6, xmm6
0bcaf5b7  cvtsi2sd xmm6, rcx
0bcaf5bc  xorps xmm7, xmm7
0bcaf5bf  cvtsi2sd xmm7, r15d
0bcaf5c4  ucomisd xmm7, xmm6
0bcaf5c8  jpe 0x0bca0060    ->20
0bcaf5ce  jnz 0x0bca0060    ->20
0bcaf5d4  or edx, 0x80000000
0bcaf5da  mov r11d, r9d
0bcaf5dd  and r11d, edx
0bcaf5e0  cmp edx, esi
0bcaf5e2  jnz 0x0bca0064    ->21
0bcaf5e8  movsxd r10, r11d
0bcaf5eb  imul r10, r10, +0x0c
0bcaf5ef  mov r10d, [r10+r8+0x4]
0bcaf5f4  xorps xmm6, xmm6
0bcaf5f7  cvtsi2sd xmm6, r10
0bcaf5fc  xorps xmm7, xmm7
0bcaf5ff  cvtsi2sd xmm7, r14d
0bcaf604  ucomisd xmm7, xmm6
0bcaf608  jpe 0x0bca0068    ->22
0bcaf60e  jnz 0x0bca0068    ->22
0bcaf614  add ebp, +0x04
0bcaf617  cmp ebp, eax
0bcaf619  jle 0x0bcaf3ac    ->LOOP
0bcaf61f  jmp 0x0bca006c    ->23
---- TRACE 12 stop -> loop

So, lots of side traces, but at least the prefetches all seem to happen unconditionally before the side traces, and cdata isn't transferred between traces, so we don't have big GC issues. On the other hand, we have the classic LuaJIT problem that rejoining the loop means re-executing the loop header, which contains hoisted things like the :prefetch and :lookup_with_prefetch bindings, etc. Oh well.

wingo commented Nov 17, 2015

Interestingly, although LuaJIT is able to CSE the two hash_i32(i) calls, manually hoisting them by declaring hash1, hash2, etc. variables causes miscompilation somehow :(

wingo commented Nov 17, 2015

It looks like the first preload is spilled, actually. I think trying to preload 4 values is too much for poor LuaJIT :)

lukego commented Nov 19, 2015

Sorry for the digression... I have the overwhelming urge to try to impose my own morality onto cold, unfeeling silicon. Maybe I should spin this idea off into a separate issue.

Anyway: morally, I don't see why this kind of lookup-table performance should be sensitive to cache latency.

The connection from the CPU to the L3 cache is a long fat pipe. If we pipeline our requests then we should get high throughput. This requires us to have a mechanism to make pipelined requests and to know in advance which requests to make. It seems like we have both of those things: Haswell CPUs can have 72 memory loads in flight at the same time and with a hashtable based on open addressing and linear probing we should be able to calculate all potentially interesting addresses before we load anything from memory.

The problem is a lot like browsing the web from Australia with high bandwidth and high latency. The browser needs to parse the HTML and request all of the images, etc., at the same time. Then it feels fast. If it waited until they were really needed, e.g. when you scroll down the page, it would be unbearably slow and the image would never be there when your eyes went looking for it.

So here is an idea for a step-by-step algorithm for high-throughput loads from L3 or DRAM:

  1. For N(=10) packets, calculate the hashes and determine every memory address that we will potentially load while looking for the values. (Since we are operating on a struct link it should be convenient to find the N packets together in a batch.)
  2. Copy all of that memory into a special "lookup struct" in parallel.
  3. Perform the N actual lookups on this local lookup struct (now in L1 cache) without reference to the main table memory.

Sane?

I suppose that before an idea like this would become interesting one would want to know that performance was actually bottlenecked on L3 cache latency, which could perhaps be sanity checked e.g. by looking at PMU counters like instructions/cycle to see if the processor is stalled.

Like I admitted before, this may be a major off-topic digression :).

wingo commented Nov 19, 2015

Yeah, I see where you are going @lukego, and some of my tests have not been on Haswell. So, to do this on a table with linear probing you would need to precompute the maximum displacement of your table, then read in that many entries for every packet and, I guess, store those entries in a prefetch buffer; while in theory you have enough registers to do this, I don't feel comfortable relying on that in LuaJIT. I don't know whether the stores would cause stalls; we'll have to see.

I'll give it a poke and test on a Haswell machine :)

wingo commented Nov 19, 2015

New patches add an API that works like this:

   rhh:load(filename)
   local keys, results = rhh:prepare_lookup_bufs(stride)
   local count = rhh.occupancy
   for i = 1, count, stride do
      local n = math.min(stride, count + 1 - i)
      for j = 0, n-1 do
         keys[j].hash = hash_i32(i+j)
         keys[j].key = i+j
      end
      rhh:fill_lookup_bufs(keys, results, n)
      for j = 0, n-1 do
         local result = rhh:lookup_from_bufs(keys, results, j)
         -- result is an index into `results'
      end
   end

Unfortunately, for a table of 1e7 elements and the hash function I have, the max_displacement is 18, so the results buf holds (4 bytes of hash + 4 bytes of key + 4 bytes of value) * 18 = 216 bytes per key -- at least 4 cache misses per key in each prefetch stride. Just the fill_lookup_bufs phase seems to max out at around 6M lookups per second on this older i7 machine, or around 1.5 GB/s, far below the chip's maximum memory bandwidth; I'll try on our new Haswell and see. It could be that something is introducing a memory dependency we don't want. It could also be that I should rewrite fill_lookup_bufs to use ffi.copy instead of looping, but then we have to determine the split point in case of wraparound.
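
To make the copy phase concrete, here is roughly what a fill_lookup_bufs-style routine has to do (field names and layout are assumptions; the real code may differ, and as noted, replacing the inner loop with ffi.copy means splitting at the wraparound point):

   local bit = require("bit")

   -- For each queued key, copy its whole probe window out of the main
   -- table into a dense per-key region of `results', so that the later
   -- scans in lookup_from_bufs touch only this local buffer.
   local function fill_lookup_bufs(self, keys, results, n)
      local span = self.max_displacement      -- 18 in the example above
      for j = 0, n - 1 do
         local base = bit.band(keys[j].hash, self.mask)
         for k = 0, span - 1 do
            -- struct assignment copies the whole 12-byte entry; modular
            -- indexing handles wraparound at the end of the table
            results[j * span + k] = self.entries[bit.band(base + k, self.mask)]
         end
      end
   end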

wingo commented Dec 9, 2015

Well, worst-case perf on an enormous table is about 66 ns/lookup currently. I tried using non-temporal loads but that requires write-combining memory, which seems really fiddly to set up. A shame, as with 32-byte entries and a max displacement of 9, each packet will cause 320 bytes to be streamed in from the hash table. It would be nice to avoid that kind of cache pollution, given that our strategy is to stream in and copy. I guess it's good to leave some potential perf improvements to the future though. Still, 66ns/entry for at least 5 LLC misses per entry is not that bad of a worst-case perf I guess.

Given that with big tables we're probably missing the TLB cache as well, I'm going to tweak our AVX2 multi-copy routine to go wide instead of deep: instead of pipelining N fetches on up to 4 keys, it will pipeline 1 fetch on up to N keys. I wish dynasm actually supported ymm(n) / Rq(n) for n >= 8. grumble.

lukego commented Dec 9, 2015

Given that with big tables we're probably missing the TLB cache as well, I'm going to tweak our AVX2 multi-copy routine to go wide instead of deep: instead of pipelining N fetches on up to 4 keys, it will pipeline 1 fetch on up to N keys. I wish dynasm actually supported ymm(n) / Rq(n) for n >= 8. grumble.

Could be appropriate to file a bug on that to luajit/luajit repo and maybe somebody can help or point in the right direction (e.g. @corsix).

@wingo wingo mentioned this pull request Dec 9, 2015
wingo commented Dec 9, 2015

Closing this PR as #163 is the actionable one. I'm a bit dissatisfied with the perf still but I have done all that I think is possible short of non-temporal loads, so time to leave this for some other day :)

@wingo wingo closed this Dec 9, 2015
@wingo wingo deleted the wip-phm branch December 9, 2015 15:17
lukego commented Dec 9, 2015

One more idle idea in passing...

Could it be useful to represent a hashtable using a thousand dollars' worth of RAM (128GB)? I mean: would the extra address space for hash keys reduce the average number of indirections and misses?

kbara commented Dec 9, 2015

Multiplied by N to run N lwaftrs in parallel unless they can share this huge table...

wingo commented Dec 9, 2015

@lukego It's possible. Drastically decreasing the load factor -- going down to 10% or less -- would decrease the amount of data you'd have to fetch; with a good hash function you might get down to a max displacement of 4 or less. But perf has been surprisingly resistant to optimization -- I too believe we should be seeing 10ns lookups, not 70ns lookups, even in the worst case, and it's not clear that simply throwing memory at the problem would be sufficient. It would be funny to pay for memory knowing that the only reason you are doing so is to keep it empty :)

Some day I will understand why it takes 500-700ns to load 3200 bytes into cache. It just doesn't seem right!

lukego commented Dec 10, 2015

@kbara random thought re: sharing. In an OpenStack context you could request a VM with (say) 8 CPU cores, 8 x 10G ports, and 100GB of RAM. Then in principle you could run multiple AFTR processes in there, each serving a different port, and sharing the table locally with an mmap'd file. (Under the hood you would have 8 Snabb NFV processes serving those 10G ports with our usual setup.)

On the other hand this would be a bit monolithic compared with running a separate VM for each port (or pair of ports), and that might make it inconvenient to deploy: e.g. a single hardware failure would take out a lot of AFTR capacity, and maybe you would end up having to mix cores from multiple NUMA nodes (either in the VM or between the VM and the hypervisor networking).

Just thinking aloud :).

kbara commented Dec 10, 2015

@lukego yeah; the memory layout option of that is something I've been pondering in the back of my head for a few months. I hate the anti-process-isolation properties, but like not having a bunch of copies of the same large table. A single hardware failure would potentially take out a lot of AFTR capacity either way. I haven't pondered the OpenStack angle. :)

corsix commented Dec 12, 2015

Given that with big tables we're probably missing the TLB cache as well, I'm going to tweak our AVX2 multi-copy routine to go wide instead of deep: instead of pipelining N fetches on up to 4 keys, it will pipeline 1 fetch on up to N keys. I wish dynasm actually supported ymm(n) / Rq(n) for n >= 8. grumble.

Could be appropriate to file a bug on that to luajit/luajit repo and maybe somebody can help or point in the right direction (e.g. @corsix).

I am contemplating ways of doing full VREG support for x64 dynasm.

corsix commented Dec 25, 2015

I am contemplating ways of doing full VREG support for x64 dynasm.

For anyone interested, see LuaJIT/LuaJIT#119.

lukego commented Dec 6, 2016

Just following up on this ancient thread:

It was pointed out to me on Twitter that 7-cpu publishes empirically measured throughput (parallelism * latency) numbers for L1/L2/L3/RAM on various processors. See for example Skylake results. Could be useful for understanding how much parallelism is possible and likely to be beneficial for things like fetching hashtable entries.

For example Skylake RAM latency is 51ns + 42 cycles (let's call it 65ns) and parallel random RAM read throughput is one result (64B cache line) every 5.9ns. So that sounds like approximately 11 parallel requests, which is not bad! This is also a significant speedup over Haswell which is rated at one result every 8ns according to the same site.
