-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Measure loop alignment's performance impact on Microbenchmarks #44051
Comments
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label. |
All measurements done on:
QuickSortSpanThe benchmark did regressed after my 32B method alignment changes as I pointed in DrewScoggins/performance-2#2319 (comment). I thought that the loop alignment changes would help but because of #43713 we don’t record the loops in the benchmarks in IniArrayThis benchmark regressed after my 32B method alignment changes as I pointed out in DrewScoggins/performance-2#2267 (comment). I did tested my loop alignment changes on this benchmark and it gives us 2X speed improvement as I wrote in DrewScoggins/performance-2#2267 (comment). SequenceCompareToThis benchmark shows regression after my 32B method alignment changes but it is in the bimodal level. I have given my analysis in DrewScoggins/performance-2#2286 (comment), but there is not much actionable at this point. I didn’t get chance to verify my loop alignment changes on this one yet. LoopReturnThis benchmark shows 20% win after my loop alignment changes. I have shared some data in DrewScoggins/performance-2#2674 (comment). |
All measurements done on:
CSieveThis benchmark shows around 6% improvement with loop alignment changes:
Assembly code G_M3098_IG04 and G_M3098_IG08; =========================== 32B boundary ===========================
;; bbWeight=4 PerfScore 2.00
G_M3098_IG04: ;; offset=0022H
00007ffb`a5116b42 4D63C8 movsxd r9, r8d
00007ffb`a5116b45 42C644081001 mov byte ptr [rax+r9+16], 1
00007ffb`a5116b4b 41FFC0 inc r8d
00007ffb`a5116b4e 4181F8FE1F0000 cmp r8d, 0x1FFE
00007ffb`a5116b55 7EEB jle SHORT G_M3098_IG04
;; bbWeight=16 PerfScore 44.00
G_M3098_IG05: ;; offset=0037H
00007ffb`a5116b57 41B802000000 mov r8d, 2
;; bbWeight=4 PerfScore 1.00
G_M3098_IG06: ;; offset=003DH
00007ffb`a5116b5d 4D63C8 movsxd r9, r8d
; =========================== 32B boundary ===========================
00007ffb`a5116b60 42807C081000 cmp byte ptr [rax+r9+16], 0
00007ffb`a5116b66 7424 je SHORT G_M3098_IG10
;; bbWeight=16 PerfScore 52.00
G_M3098_IG07: ;; offset=0048H
00007ffb`a5116b68 478D0C00 lea r9d, [r8+r8]
00007ffb`a5116b6c 4181F9FE1F0000 cmp r9d, 0x1FFE
00007ffb`a5116b73 7F15 jg SHORT G_M3098_IG09
;; bbWeight=8 PerfScore 14.00
G_M3098_IG08: ;; offset=0055H
00007ffb`a5116b75 4D63D1 movsxd r10, r9d
00007ffb`a5116b78 42C644101000 mov byte ptr [rax+r10+16], 0
00007ffb`a5116b7e 4503C8 add r9d, r8d
; =========================== 32B boundary ===========================
00007ffb`a5116b81 4181F9FE1F0000 cmp r9d, 0x1FFE
00007ffb`a5116b88 7EEB jle SHORT G_M3098_IG08
Assembly code G_M3098_IG04 and G_M3098_IG08; =========================== 32B boundary ===========================
00007ffb`a50f6f62 align
00007ffb`a50f6f62 align
00007ffb`a50f6f62 align
;; bbWeight=4 PerfScore 5.00
G_M3098_IG04: ;; offset=0022H
00007ffb`a50f6f62 4D63C8 movsxd r9, r8d
00007ffb`a50f6f65 42C644081001 mov byte ptr [rax+r9+16], 1
00007ffb`a50f6f6b 41FFC0 inc r8d
00007ffb`a50f6f6e 4181F8FE1F0000 cmp r8d, 0x1FFE
00007ffb`a50f6f75 7EEB jle SHORT G_M3098_IG04
;; bbWeight=16 PerfScore 44.00
G_M3098_IG05: ;; offset=0037H
00007ffb`a50f6f77 41B802000000 mov r8d, 2
;; bbWeight=4 PerfScore 1.00
G_M3098_IG06: ;; offset=003DH
00007ffb`a50f6f7d 4D63C8 movsxd r9, r8d
; =========================== 32B boundary ===========================
00007ffb`a50f6f80 42807C081000 cmp byte ptr [rax+r9+16], 0
00007ffb`a50f6f86 7443 je SHORT G_M3098_IG10
;; bbWeight=16 PerfScore 52.00
G_M3098_IG07: ;; offset=0048H
00007ffb`a50f6f88 478D0C00 lea r9d, [r8+r8]
00007ffb`a50f6f8c 4181F9FE1F0000 cmp r9d, 0x1FFE
00007ffb`a50f6f93 7F34 jg SHORT G_M3098_IG09
00007ffb`a50f6f95 6666660F1F840000000000 align
; =========================== 32B boundary ===========================
00007ffb`a50f6fa0 align
00007ffb`a50f6fa0 align
;; bbWeight=8 PerfScore 20.00
G_M3098_IG08: ;; offset=0060H
00007ffb`a50f6fa0 4D63D1 movsxd r10, r9d
00007ffb`a50f6fa3 42C644101000 mov byte ptr [rax+r10+16], 0
00007ffb`a50f6fa9 4503C8 add r9d, r8d
00007ffb`a50f6fac 4181F9FE1F0000 cmp r9d, 0x1FFE
00007ffb`a50f6fb3 7EEB jle SHORT G_M3098_IG08
HeapSortI don't see much changes in this benchmark, however I do see loops getting aligned which makes me wonder if there is some data alignment that is going wrong. Below is section of assembly code of Assembly code G_M60435_IG04 (before loop alignment)G_M60435_IG04: ;; offset=003EH
00007ff8`d10e6ebe 453BD0 cmp r10d, r8d
; =========================== 32B boundary ===========================
00007ff8`d10e6ec1 0F83F3000000 jae G_M60435_IG13
00007ff8`d10e6ec7 4963F2 movsxd rsi, r10d
00007ff8`d10e6eca 8B74B110 mov esi, dword ptr [rcx+4*rsi+16]
00007ff8`d10e6ece 413BF1 cmp esi, r9d
00007ff8`d10e6ed1 7D26 jge SHORT G_M60435_IG05
00007ff8`d10e6ed3 453BD8 cmp r11d, r8d
00007ff8`d10e6ed6 0F83DE000000 jae G_M60435_IG13
00007ff8`d10e6edc 4D63DB movsxd r11, r11d
00007ff8`d10e6edf 4289749910 mov dword ptr [rcx+4*r11+16], esi
; =========================== 32B boundary ===========================
00007ff8`d10e6ee4 458BDA mov r11d, r10d
00007ff8`d10e6ee7 458BD3 mov r10d, r11d
00007ff8`d10e6eea 41C1EA1F shr r10d, 31
00007ff8`d10e6eee 4503D3 add r10d, r11d
00007ff8`d10e6ef1 41D1FA sar r10d, 1
00007ff8`d10e6ef4 4585D2 test r10d, r10d
00007ff8`d10e6ef7 7FC5 jg SHORT G_M60435_IG04
;; bbWeight=16 PerfScore 188.00
Assembly code G_M60435_IG04 (before loop alignment)G_M60435_IG04: ;; offset=0040H
00007ffb`a7d77060 453BD0 cmp r10d, r8d
00007ffb`a7d77063 0F8316010000 jae G_M60435_IG13
00007ffb`a7d77069 4963F2 movsxd rsi, r10d
00007ffb`a7d7706c 8B74B110 mov esi, dword ptr [rcx+4*rsi+16]
00007ffb`a7d77070 413BF1 cmp esi, r9d
00007ffb`a7d77073 7D26 jge SHORT G_M60435_IG05
00007ffb`a7d77075 453BD8 cmp r11d, r8d
00007ffb`a7d77078 0F8301010000 jae G_M60435_IG13
00007ffb`a7d7707e 4D63DB movsxd r11, r11d
; =========================== 32B boundary ===========================
00007ffb`a7d77081 4289749910 mov dword ptr [rcx+4*r11+16], esi
00007ffb`a7d77086 458BDA mov r11d, r10d
00007ffb`a7d77089 458BD3 mov r10d, r11d
00007ffb`a7d7708c 41C1EA1F shr r10d, 31
00007ffb`a7d77090 4503D3 add r10d, r11d
00007ffb`a7d77093 41D1FA sar r10d, 1
00007ffb`a7d77096 4585D2 test r10d, r10d
00007ffb`a7d77099 7FC5 jg SHORT G_M60435_IG04 MulMatrixI see slight improvement of approx. 1% (min/max improves as well). Before:
After:
Below is some portion of assembly code for Assembly code after loop alignmentG_M2371_IG05: ;; offset=0075H
00007ffb`a6d76dd5 4D63D9 movsxd r11, r9d
00007ffb`a6d76dd8 4E8B54D910 mov r10, gword ptr [rcx+8*r11+16]
00007ffb`a6d76ddd 498BF2 mov rsi, r10
; ............................... 32B boundary ...............................
00007ffb`a6d76de0 8B5E08 mov ebx, dword ptr [rsi+8]
00007ffb`a6d76de3 3BC3 cmp eax, ebx
00007ffb`a6d76de5 0F83EF0A0000 jae G_M2371_IG98
00007ffb`a6d76deb 44894CBE10 mov dword ptr [rsi+4*rdi+16], r9d
00007ffb`a6d76df0 4A8B5CDA10 mov rbx, gword ptr [rdx+8*r11+16]
00007ffb`a6d76df5 8B7308 mov esi, dword ptr [rbx+8]
00007ffb`a6d76df8 3BC6 cmp eax, esi
00007ffb`a6d76dfa 0F83DA0A0000 jae G_M2371_IG98
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (jae: 0) 32B boundary ...............................
00007ffb`a6d76e00 896CBB10 mov dword ptr [rbx+4*rdi+16], ebp
00007ffb`a6d76e04 4F8B5CD810 mov r11, gword ptr [r8+8*r11+16]
00007ffb`a6d76e09 8BF5 mov esi, ebp
00007ffb`a6d76e0b 410374BA10 add esi, dword ptr [r10+4*rdi+16]
00007ffb`a6d76e10 413B4308 cmp eax, dword ptr [r11+8]
00007ffb`a6d76e14 0F83C00A0000 jae G_M2371_IG98
00007ffb`a6d76e1a 418974BB10 mov dword ptr [r11+4*rdi+16], esi
00007ffb`a6d76e1f 41FFC1 inc r9d
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (inc: 2) 32B boundary ...............................
00007ffb`a6d76e22 4183F94B cmp r9d, 75
00007ffb`a6d76e26 7CAD jl SHORT G_M2371_IG05
...
...
G_M2371_IG65: ;; offset=07BDH
00007ffb`a6d7751d 4C63D0 movsxd r10, eax
; ............................... 32B boundary ...............................
00007ffb`a6d77520 4F8B54D010 mov r10, gword ptr [r8+8*r10+16]
00007ffb`a6d77525 413B7A08 cmp edi, dword ptr [r10+8]
00007ffb`a6d77529 0F83AB030000 jae G_M2371_IG98
00007ffb`a6d7752f 4863F7 movsxd rsi, edi
00007ffb`a6d77532 4D8D54B210 lea r10, bword ptr [r10+4*rsi+16]
00007ffb`a6d77537 418B32 mov esi, dword ptr [r10]
00007ffb`a6d7753a 4863D8 movsxd rbx, eax
00007ffb`a6d7753d 488B5CD910 mov rbx, gword ptr [rcx+8*rbx+16]
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 2) 32B boundary ...............................
00007ffb`a6d77542 443B5B08 cmp r11d, dword ptr [rbx+8]
00007ffb`a6d77546 0F838E030000 jae G_M2371_IG98
00007ffb`a6d7754c 4963EB movsxd rbp, r11d
00007ffb`a6d7754f 8B5CAB10 mov ebx, dword ptr [rbx+4*rbp+16]
00007ffb`a6d77553 498BE9 mov rbp, r9
00007ffb`a6d77556 3B7D08 cmp edi, dword ptr [rbp+8]
00007ffb`a6d77559 0F837B030000 jae G_M2371_IG98
00007ffb`a6d7755f 4C63F7 movsxd r14, edi
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (movsxd: 2) 32B boundary ...............................
00007ffb`a6d77562 420FAF5CB510 imul ebx, dword ptr [rbp+4*r14+16]
00007ffb`a6d77568 03F3 add esi, ebx
00007ffb`a6d7756a 418932 mov dword ptr [r10], esi
00007ffb`a6d7756d FFC0 inc eax
00007ffb`a6d7756f 83F84B cmp eax, 75
00007ffb`a6d77572 7CA9 jl SHORT G_M2371_IG65 Assembly code after loop alignment00007ffb`a6d870f9 0F1F8000000000 align
; ............................... 32B boundary ...............................
00007ffb`a6d87100 align
00007ffb`a6d87100 align
;; bbWeight=4 PerfScore 51.00
G_M2371_IG05: ;; offset=0080H
00007ffb`a6d87100 4D63D9 movsxd r11, r9d
00007ffb`a6d87103 4E8B54D910 mov r10, gword ptr [rcx+8*r11+16]
00007ffb`a6d87108 498BF2 mov rsi, r10
00007ffb`a6d8710b 8B5E08 mov ebx, dword ptr [rsi+8]
00007ffb`a6d8710e 3BC3 cmp eax, ebx
00007ffb`a6d87110 0F833C0D0000 jae G_M2371_IG98
00007ffb`a6d87116 44894CBE10 mov dword ptr [rsi+4*rdi+16], r9d
00007ffb`a6d8711b 4A8B5CDA10 mov rbx, gword ptr [rdx+8*r11+16]
; ............................... 32B boundary ...............................
00007ffb`a6d87120 8B7308 mov esi, dword ptr [rbx+8]
00007ffb`a6d87123 3BC6 cmp eax, esi
00007ffb`a6d87125 0F83270D0000 jae G_M2371_IG98
00007ffb`a6d8712b 896CBB10 mov dword ptr [rbx+4*rdi+16], ebp
00007ffb`a6d8712f 4F8B5CD810 mov r11, gword ptr [r8+8*r11+16]
00007ffb`a6d87134 8BF5 mov esi, ebp
00007ffb`a6d87136 410374BA10 add esi, dword ptr [r10+4*rdi+16]
00007ffb`a6d8713b 413B4308 cmp eax, dword ptr [r11+8]
00007ffb`a6d8713f 0F830D0D0000 jae G_M2371_IG98
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (jae: 5) 32B boundary ...............................
00007ffb`a6d87145 418974BB10 mov dword ptr [r11+4*rdi+16], esi
00007ffb`a6d8714a 41FFC1 inc r9d
00007ffb`a6d8714d 4183F94B cmp r9d, 75
00007ffb`a6d87151 7CAD jl SHORT G_M2371_IG05
...
...
00007ffb`a6d87878 0F1F840000000000 align
; ............................... 32B boundary ...............................
00007ffb`a6d87880 align
00007ffb`a6d87880 align
;; bbWeight=16 PerfScore 172.00
G_M2371_IG65: ;; offset=0800H
00007ffb`a6d87880 4C63D0 movsxd r10, eax
00007ffb`a6d87883 4F8B54D010 mov r10, gword ptr [r8+8*r10+16]
00007ffb`a6d87888 413B7A08 cmp edi, dword ptr [r10+8]
00007ffb`a6d8788c 0F8350040000 jae G_M2371_IG98
00007ffb`a6d87892 4863F7 movsxd rsi, edi
00007ffb`a6d87895 4D8D54B210 lea r10, bword ptr [r10+4*rsi+16]
00007ffb`a6d8789a 418B32 mov esi, dword ptr [r10]
00007ffb`a6d8789d 4863D8 movsxd rbx, eax
; ............................... 32B boundary ...............................
00007ffb`a6d878a0 488B5CD910 mov rbx, gword ptr [rcx+8*rbx+16]
00007ffb`a6d878a5 443B5B08 cmp r11d, dword ptr [rbx+8]
00007ffb`a6d878a9 0F8333040000 jae G_M2371_IG98
00007ffb`a6d878af 4963EB movsxd rbp, r11d
00007ffb`a6d878b2 8B5CAB10 mov ebx, dword ptr [rbx+4*rbp+16]
00007ffb`a6d878b6 498BE9 mov rbp, r9
00007ffb`a6d878b9 3B7D08 cmp edi, dword ptr [rbp+8]
00007ffb`a6d878bc 0F8320040000 jae G_M2371_IG98
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (jae: 2) 32B boundary ...............................
00007ffb`a6d878c2 4C63F7 movsxd r14, edi
00007ffb`a6d878c5 420FAF5CB510 imul ebx, dword ptr [rbp+4*r14+16]
00007ffb`a6d878cb 03F3 add esi, ebx
00007ffb`a6d878cd 418932 mov dword ptr [r10], esi
00007ffb`a6d878d0 FFC0 inc eax
00007ffb`a6d878d2 83F84B cmp eax, 75
00007ffb`a6d878d5 7CA9 jl SHORT G_M2371_IG65 Base64EncodeInPlaceI noticed that the benchmark code is aligned at 32B boundary. The loop inside the benchmark is big and it doesn't meet the current threshold of 96 bytes (3 * 32B chunks) to get aligned. However, looking at the bimodal behavior of test, it looks like this benchmark is hitting data alignment issue as I called out in DrewScoggins/performance-2#2291 (comment). IndexerCheckPathLengthMy method alignment changes affected this benchmark because the loop was earlier in one chunk of 32B and it got pushed to be in two chunks instead. As described here, DrewScoggins/performance-2#2290 (comment), the loop alignment should have got back the performance, but I am not sure if it helps because of benchSparseMultThe loop alignment changes has no effect on loops because it is already aligned correctly. You can see the perf numbers in DrewScoggins/performance-2#2271 (comment). That makes me wonder if this is another memory alignment related issue that depends on the alignment of arrays created for the benchmark System.Collections.IterateForEach.ImmutableArrayI am still investigating this one. As seen in my comments in DrewScoggins/performance-2#2285 (comment), the loop alignment won't help here because the loop is already aligned correctly. System.Collections.ContainsTrue.SpanAs seen in data I shared at DrewScoggins/performance-2#2284 (comment) , I see negligible improvement in this benchmark but was hoping to see some more. However, as Andy calls out:
System.Collections.ContainsKeyFalse<Int32, Int32>.ImmutableDictionaryThis one surprisingly sees regression as I pointed in DrewScoggins/performance-2#2281 (comment) . Andy made a good point if it is suffering from JCC erratum. I am doing some experiments to test this out. |
@dotnet/jit-contrib , @adamsitnik |
@kunalspathak thanks for the great analysis and being so transparent with all the data! I think that it would be valuable to run all the benchmarks we have with and without your loop alignment changes. Using the ResultsComparer could give us a good overview of how many benchmarks would improve and regress in total. The tool also allows getting top X best|worst differences so it could help us to find other benchmarks affected by the alignment. git clone https://github.com/dotnet/performance.git
cd performance
py .\scripts\benchmarks_ci.py -f net6.0 --filter * --bdn-arguments "--artifacts C:\results\without" --corerun $pathToCoreRunWithoutLoopAlignmentChanges
py .\scripts\benchmarks_ci.py -f net6.0 --filter * --bdn-arguments "--artifacts C:\results\with" --corerun $pathToCoreRunWithLoopAlignmentChanges
cd src\tools\ResultsComparer\
dotnet run -c Release --base "C:\results\without" --diff "C:\results\with" --threshold 3% --noise 1ns Please be warned, running all of them takes around 5-6 hours ;) |
Thanks a lot @adamsitnik for the suggestion. I ran the benchmarks for various heuristics that I was trying and the |
System.Collections.IndexerSet.Span(Size: 512)This benchmark improved by approx. 36%.
The benchmark appears to be bimodal as seen below, and data alignment could also be the contributor to the bimodality. But on my machine, I saw consistent improvement after loop alignment changes. Looking at the disassembly, the improvements were coming out because of alignment of loop Assembly code before loop alignment; Assembly listing for method System.Collections.IndexerSet`1[Int32][System.Int32]:Span():int:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; Lcl frame size = 0
G_M6697_IG01: ;; offset=0000H
;; bbWeight=1 PerfScore 0.00
G_M6697_IG02: ;; offset=0000H
00007ffb`a51132a0 488B4110 mov rax, gword ptr [rcx+16]
00007ffb`a51132a4 4885C0 test rax, rax
00007ffb`a51132a7 7506 jne SHORT G_M6697_IG04
;; bbWeight=1 PerfScore 3.25
G_M6697_IG03: ;; offset=0009H
00007ffb`a51132a9 33D2 xor rdx, rdx
00007ffb`a51132ab 33C9 xor ecx, ecx
00007ffb`a51132ad EB07 jmp SHORT G_M6697_IG05
;; bbWeight=0.50 PerfScore 1.25
G_M6697_IG04: ;; offset=000FH
00007ffb`a51132af 488D5010 lea rdx, bword ptr [rax+16]
00007ffb`a51132b3 8B4808 mov ecx, dword ptr [rax+8]
;; bbWeight=0.50 PerfScore 1.25
G_M6697_IG05: ;; offset=0016H
00007ffb`a51132b6 33C0 xor eax, eax
00007ffb`a51132b8 85C9 test ecx, ecx
00007ffb`a51132ba 7E10 jle SHORT G_M6697_IG07
;; bbWeight=1 PerfScore 1.50
G_M6697_IG06: ;; offset=001CH
00007ffb`a51132bc 4C63C0 movsxd r8, eax
00007ffb`a51132bf 4533C9 xor r9d, r9d
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (xor: 2) 32B boundary ...............................
00007ffb`a51132c2 46890C82 mov dword ptr [rdx+4*r8], r9d
00007ffb`a51132c6 FFC0 inc eax
00007ffb`a51132c8 3BC1 cmp eax, ecx
00007ffb`a51132ca 7CF0 jl SHORT G_M6697_IG06
;; bbWeight=4 PerfScore 12.00
G_M6697_IG07: ;; offset=002CH
00007ffb`a51132cc 33C0 xor eax, eax
;; bbWeight=1 PerfScore 0.25
G_M6697_IG08: ;; offset=002EH
00007ffb`a51132ce C3 ret
;; bbWeight=1 PerfScore 1.00
; Total bytes of code 47, prolog size 0, PerfScore 25.20, instruction count 19 (MethodHash=9a77e5d6) for method System.Collections.IndexerSet`1[Int32][System.Int32]:Span():int:this
; ============================================================ Assembly code after loop alignment; Assembly listing for method System.Collections.IndexerSet`1[Int32][System.Int32]:Span():int:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
; Lcl frame size = 0
G_M6697_IG01: ;; offset=0000H
;; bbWeight=1 PerfScore 0.00
G_M6697_IG02: ;; offset=0000H
00007ffb`a5133580 488B4110 mov rax, gword ptr [rcx+16]
00007ffb`a5133584 4885C0 test rax, rax
00007ffb`a5133587 7506 jne SHORT G_M6697_IG04
;; bbWeight=1 PerfScore 3.25
G_M6697_IG03: ;; offset=0009H
00007ffb`a5133589 33D2 xor rdx, rdx
00007ffb`a513358b 33C9 xor ecx, ecx
00007ffb`a513358d EB07 jmp SHORT G_M6697_IG05
;; bbWeight=0.50 PerfScore 1.25
G_M6697_IG04: ;; offset=000FH
00007ffb`a513358f 488D5010 lea rdx, bword ptr [rax+16]
00007ffb`a5133593 8B4808 mov ecx, dword ptr [rax+8]
;; bbWeight=0.50 PerfScore 1.25
G_M6697_IG05: ;; offset=0016H
00007ffb`a5133596 33C0 xor eax, eax
00007ffb`a5133598 85C9 test ecx, ecx
00007ffb`a513359a 7E1F jle SHORT G_M6697_IG07
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 4 bytes, loopsize= 16 bytes, minBlocksNeeded= 1, extraBytesNotInLoop= 16, alignmentBoundary= 32B in (Span)
00007ffb`a513359c 0F1F4000 align
; ............................... 32B boundary ...............................
;; bbWeight=1 PerfScore 1.75
G_M6697_IG06: ;; offset=0020H
00007ffb`a51335a0 4C63C0 movsxd r8, eax
00007ffb`a51335a3 4533C9 xor r9d, r9d
00007ffb`a51335a6 46890C82 mov dword ptr [rdx+4*r8], r9d
00007ffb`a51335aa FFC0 inc eax
00007ffb`a51335ac 3BC1 cmp eax, ecx
00007ffb`a51335ae 7CF0 jl SHORT G_M6697_IG06
;; bbWeight=4 PerfScore 12.00
G_M6697_IG07: ;; offset=0030H
00007ffb`a51335b0 33C0 xor eax, eax
;; bbWeight=1 PerfScore 0.25
G_M6697_IG08: ;; offset=0032H
00007ffb`a51335b2 C3 ret
;; bbWeight=1 PerfScore 1.00
; Total bytes of code 51, prolog size 0, PerfScore 26.95, instruction count 20 (MethodHash=9a77e5d6) for method System.Collections.IndexerSet`1[Int32][System.Int32]:Span():int:this
; ============================================================ |
PerfLabTests.GetMemberBefore loop alignment:
After loop alignment:
It is surprising that the Allocated is different between 2 runs and not sure why. I tried dumping disassembly of all the methods to see which methods got loop aligned, and there were not much benchmark related. Methods where alignment was done; alignment= 2 bytes, loopsize= 75 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 21, alignmentBoundary= 32B in (TryInsert)
; alignment= 2 bytes, loopsize= 48 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 16, alignmentBoundary= 32B in (MoveNext)
; alignment= 7 bytes, loopsize= 21 bytes, minBlocksNeeded= 1, extraBytesNotInLoop= 11, alignmentBoundary= 32B in (Contains)
; alignment= 1 bytes, loopsize= 95 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 1, alignmentBoundary= 32B in (CopyTo)
; alignment= 3 bytes, loopsize= 43 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 21, alignmentBoundary= 32B in (MoveNext)
; alignment= 2 bytes, loopsize= 67 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 29, alignmentBoundary= 32B in (WidenAsciiToUtf16)
; alignment= 2 bytes, loopsize= 53 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 11, alignmentBoundary= 32B in (ToArray)
; alignment= 3 bytes, loopsize= 43 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 21, alignmentBoundary= 32B in (MoveNext)
; alignment= 1 bytes, loopsize= 122 bytes, minBlocksNeeded= 4, extraBytesNotInLoop= 6, alignmentBoundary= 16B in (Add)
; alignment= 7 bytes, loopsize= 61 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 3, alignmentBoundary= 32B in (GenerateRandomId)
; alignment= 3 bytes, loopsize= 90 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 6, alignmentBoundary= 16B in (ReportIfAny)
; alignment= 12 bytes, loopsize= 19 bytes, minBlocksNeeded= 1, extraBytesNotInLoop= 13, alignmentBoundary= 32B in (OverheadActionNoUnroll)
; alignment= 12 bytes, loopsize= 19 bytes, minBlocksNeeded= 1, extraBytesNotInLoop= 13, alignmentBoundary= 32B in (WorkloadActionNoUnroll)
; alignment= 3 bytes, loopsize= 80 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 16, alignmentBoundary= 32B in (Variance)
; alignment= 1 bytes, loopsize= 95 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 1, alignmentBoundary= 32B in (SumWithoutOutliers)
; alignment= 6 bytes, loopsize= 64 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 0, alignmentBoundary= 32B in (InverseTwoTailStudent)
; alignment= 2 bytes, loopsize= 91 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 5, alignmentBoundary= 16B in (Print)
; alignment= 1 bytes, loopsize= 95 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 1, alignmentBoundary= 32B in (CopyTo)
; alignment= 7 bytes, loopsize= 56 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 8, alignmentBoundary= 16B in (ToArray) Same thing with |
In theory, it should never happen, but it can be caused by the Tiered JIT background thread allocating memory like in dotnet/BenchmarkDotNet#1542 As long as it's rare you can ignore it (we would need to use a memory profiler and see what exactly is being allocated in both runs) |
Most of the analysis is done and "loop alignment" feature will be ON in .NET 6.0. |
As mentioned in #43227, one of the task we have identified for stabilizing the performance is performing loop alignment for hot loops. Since the alignment can be very sensitive for the performance, this issue tracks the investigation work we will be doing to measure the impact of loop alignment work on Microbenchmarks. All the findings will be tracked in this issue.
Currently, we are tracking the benchmarks that our performance team has identified to have effect of alignment. The list of issues can be seen at https://github.com/drewScoggins/performance-2/issues?q=is%3Aopen+is%3Aissue+label%3AAlignment
The text was updated successfully, but these errors were encountered: