JIT: Add scalar evolution analysis and do IV widening based on it #97865
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

This adds a new phase meant for optimizing induction variables. It adds infrastructure for SSA-based analysis of induction variables (scalar evolution analysis), and uses it to do induction variable widening.

For example, with this optimization, codegen for

[MethodImpl(MethodImplOptions.NoInlining)]
static int Foo(int[] arr)
{
int sum = 0;
foreach (int v in arr)
{
sum += v;
}
return sum;
}

goes from

; Assembly listing for method ConsoleApp34.Program:Foo(int[]):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T02] ( 4, 7 ) ref -> rcx class-hnd single-def <int[]>
; V01 loc0 [V01,T01] ( 4, 10 ) int -> rax
; V02 loc1 [V02,T00] ( 5, 17 ) int -> rdx
; V03 OutArgs [V03 ] ( 1, 1 ) struct (32) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
; V04 cse0 [V04,T03] ( 3, 6 ) int -> r8 "CSE - aggressive"
;
; Lcl frame size = 40
G_M8112_IG01:
sub rsp, 40
;; size=4 bbWeight=1 PerfScore 0.25
G_M8112_IG02:
xor eax, eax
xor edx, edx
mov r8d, dword ptr [rcx+0x08]
test r8d, r8d
jle SHORT G_M8112_IG04
align [0 bytes for IG03]
;; size=13 bbWeight=1 PerfScore 3.75
G_M8112_IG03:
mov r10d, edx
add eax, dword ptr [rcx+4*r10+0x10]
inc edx
cmp r8d, edx
jg SHORT G_M8112_IG03
;; size=15 bbWeight=4 PerfScore 19.00
G_M8112_IG04:
add rsp, 40
ret
;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code 37, prolog size 4, PerfScore 24.25, instruction count 14, allocated bytes for code 37 (MethodHash=d1cce04f) for method ConsoleApp34.Program:Foo(int[]):int (FullOpts)
; ============================================================

to

; V00 arg0 [V00,T02] ( 4, 7 ) ref -> rcx class-hnd single-def <int[]>
; V01 loc0 [V01,T01] ( 4, 10 ) int -> rax
;* V02 loc1 [V02,T04] ( 0, 0 ) int -> zero-ref
; V03 OutArgs [V03 ] ( 1, 1 ) struct (32) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
; V04 tmp1 [V04,T00] ( 5, 17 ) long -> r8 "Widened primary induction variable"
; V05 cse0 [V05,T03] ( 3, 6 ) int -> rdx "CSE - aggressive"
;
; Lcl frame size = 40
G_M8112_IG01:
sub rsp, 40
;; size=4 bbWeight=1 PerfScore 0.25
G_M8112_IG02:
xor eax, eax
mov edx, dword ptr [rcx+0x08]
test edx, edx
jle SHORT G_M8112_IG04
xor r8d, r8d
jmp SHORT G_M8112_IG03
align [0 bytes for IG03]
;; size=14 bbWeight=1 PerfScore 5.75
G_M8112_IG03:
add eax, dword ptr [rcx+4*r8+0x10]
inc r8d
cmp edx, r8d
jg SHORT G_M8112_IG03
;; size=13 bbWeight=4 PerfScore 18.00
G_M8112_IG04:
add rsp, 40
ret
;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code 36, prolog size 4, PerfScore 25.25, instruction count 14, allocated bytes for code 36 (MethodHash=d1cce04f) for method ConsoleApp34.Program:Foo(int[]):int (FullOpts)

where we were able to drop a zero extension of the index inside the loop. In the future I plan to build strength reduction on top of the same analysis package.

The analysis is inspired by [1] and by LLVM's scalar evolution package. It provides a small IR that represents the evolving value of IR nodes inside loops. At the core of this IR is the notion of an "add recurrence", which describes a changing value as <loop, start, step>; the value of such an add recurrence is start + N * step, where N is the iteration index. Currently only simple add recurrences are supported where the start and step are either constants or invariant locals, but the framework generalizes nicely to allow chains of recurrences if we wish to support that. The IR also supports constants, invariant locals and operators on top of these (casts, adds, multiplications and shifts).

Analysis example

For the IR for the above, the analysis produces the following:

Analyzing scalar evolution in L00 header: BB03
Members (1): BB03
Entry: BB02 -> BB03
Exit: BB03 -> BB04
Back: BB03 -> BB03
BB03 [0001] [006..016) -> BB03,BB04 (cond), preds={BB02,BB03} succs={BB04,BB03}
STMT00009 ( ??? ... ??? )
N004 ( 0, 0) [000045] DA--------- ▌ STORE_LCL_VAR int V01 loc0 d:3
N003 ( 0, 0) [000044] ----------- └──▌ PHI int
N001 ( 0, 0) [000050] ----------- pred BB03 ├──▌ PHI_ARG int V01 loc0 u:4
N002 ( 0, 0) [000047] ----------- pred BB02 └──▌ PHI_ARG int V01 loc0 u:2
[000047] => 0
STMT00007 ( ??? ... ??? )
N004 ( 0, 0) [000041] DA--------- ▌ STORE_LCL_VAR int V02 loc1 d:3
N003 ( 0, 0) [000040] ----------- └──▌ PHI int
N001 ( 0, 0) [000051] ----------- pred BB03 ├──▌ PHI_ARG int V02 loc1 u:4
N002 ( 0, 0) [000048] ----------- pred BB02 └──▌ PHI_ARG int V02 loc1 u:2
[000051] => <L00, 1, 1>
[000048] => 0
[000041] => <L00, 0, 1>

For example, here it was able to show that V02, the index, is a primary induction variable; it is an add recurrence that starts at 0 and steps by 1 every iteration of the loop. It also showed that the value that comes from the backedge is similarly an add recurrence, except that it starts at 1 in the first loop.

STMT00003 ( 0x006[E-] ... 0x00B )
N015 ( 8, 9) [000015] DA--GO----- ▌ STORE_LCL_VAR int V01 loc0 d:4
N014 ( 8, 9) [000014] ----GO----- └──▌ ADD int
N012 ( 6, 7) [000033] ----GO-N--- ├──▌ COMMA int
N001 ( 0, 0) [000025] ----------- │ ├──▌ NOP void
N011 ( 6, 7) [000034] n---GO----- │ └──▌ IND int
N010 ( 3, 5) [000032] -----O----- │ └──▌ ARR_ADDR byref int[]
N009 ( 3, 5) [000031] -------N--- │ └──▌ ADD byref
N002 ( 1, 1) [000022] ----------- │ ├──▌ LCL_VAR ref V00 arg0 u:1
N008 ( 4, 5) [000030] -------N--- │ └──▌ ADD long
N006 ( 3, 4) [000028] -------N--- │ ├──▌ LSH long
N004 ( 2, 3) [000026] ---------U- │ │ ├──▌ CAST long <- uint
N003 ( 1, 1) [000023] ----------- │ │ │ └──▌ LCL_VAR int V02 loc1 u:3
N005 ( 1, 1) [000027] -------N--- │ │ └──▌ CNS_INT long 2
N007 ( 1, 1) [000029] ----------- │ └──▌ CNS_INT long 16
N013 ( 1, 1) [000009] ----------- └──▌ LCL_VAR int V01 loc0 u:3 (last use)
[000022] => V00.1
[000023] => <L00, 0, 1>
[000026] => <L00, 0, 1>
[000027] => 2
[000028] => <L00, 0, 4>
[000029] => 16
[000030] => <L00, 16, 4>
[000031] => <L00, (V00.1 + 16), 4>
[000032] => <L00, (V00.1 + 16), 4>

This one is more interesting since we can see hints of how strength reduction is going to utilize the information. In particular, the analysis was able to show that the address [000032] is also an add recurrence; it starts at value (V00.1 + 16) (the address of the first array element) and steps by 4 in every iteration.

STMT00004 ( 0x00C[E-] ... 0x00F )
N004 ( 3, 3) [000019] DA--------- ▌ STORE_LCL_VAR int V02 loc1 d:4
N003 ( 3, 3) [000018] ----------- └──▌ ADD int
N001 ( 1, 1) [000016] ----------- ├──▌ LCL_VAR int V02 loc1 u:3 (last use)
N002 ( 1, 1) [000017] ----------- └──▌ CNS_INT int 1
[000016] => <L00, 0, 1>
[000017] => 1
[000018] => <L00, 1, 1>
[000019] => <L00, 1, 1>
STMT00002 ( 0x010[E-] ... 0x014 )
N005 ( 7, 7) [000008] ---X------- ▌ JTRUE void
N004 ( 5, 5) [000007] J--X---N--- └──▌ GT int
N002 ( 3, 3) [000006] ---X------- ├──▌ ARR_LENGTH int
N001 ( 1, 1) [000005] ----------- │ └──▌ LCL_VAR ref V00 arg0 u:1
N003 ( 1, 1) [000004] ----------- └──▌ LCL_VAR int V02 loc1 u:4
[000005] => V00.1
[000004] => <L00, 1, 1>

Here the analysis shows that the array object is invariant (otherwise analysis would fail) and that the compared index is an add recurrence that starts at 1.

I do not think we can enable this by default yet, for two reasons:
Also, there is work necessary to properly prove that some analysis and folding is ok in the face of potential overflows. Still, I do not see why the use in this PR for IV widening should not be correct, so I want to see what CI finds. The implementation makes use of the fact that
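To make the add-recurrence notation above concrete, here is a toy model of the folding the analysis performs; `AddRec`, `MulByConst` and `AddConst` are invented names for illustration, not the JIT's SCEV types, and start/step are restricted to constants here even though the real IR also allows invariant locals. It reproduces the steps from the dump above, where `<L00, 0, 1>` shifted left by 2 becomes `<L00, 0, 4>` and then `<L00, 16, 4>` after adding 16:

```cpp
#include <cstdint>

// Toy model of a SCEV add recurrence <L, start, step>: at iteration N of
// loop L its value is start + N * step.
struct AddRec
{
    int64_t start;
    int64_t step;

    int64_t ValueAt(int64_t n) const
    {
        return start + n * step;
    }
};

// Folding rules used while walking the IR:
//   <L, a, b> * c  =>  <L, a*c, b*c>   (left shifts fold the same way)
AddRec MulByConst(AddRec r, int64_t c)
{
    return {r.start * c, r.step * c};
}

//   <L, a, b> + c  =>  <L, a+c, b>
AddRec AddConst(AddRec r, int64_t c)
{
    return {r.start + c, r.step};
}
```

With these rules the byref address in the example folds to `<L00, 16, 4>` relative to the array base, which is exactly what strength reduction would iterate with directly.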
[1] Michael Wolfe. 1992. Beyond induction variables. In Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation (PLDI '92). Association for Computing Machinery, New York, NY, USA, 162–174. https://doi.org/10.1145/143095.143131
Diff results for #97865

Assembly diffs

Assembly diffs for linux/arm64 ran on windows/x64: Diffs are based on 2,507,286 contexts (1,007,092 MinOpts, 1,500,194 FullOpts). MISSED contexts: base: 27 (0.00%), diff: 32 (0.00%)
  Overall (+1,719,060 bytes)
  FullOpts (+1,719,060 bytes)

Assembly diffs for linux/x64 ran on windows/x64: Diffs are based on 2,517,881 contexts (991,070 MinOpts, 1,526,811 FullOpts). MISSED contexts: 28 (0.00%)
  Overall (+572,703 bytes)
  FullOpts (+572,703 bytes)

Assembly diffs for osx/arm64 ran on windows/x64: Diffs are based on 2,270,841 contexts (932,669 MinOpts, 1,338,172 FullOpts). MISSED contexts: base: 26 (0.00%), diff: 29 (0.00%)
  Overall (+1,124,576 bytes)
  FullOpts (+1,124,576 bytes)

Assembly diffs for windows/arm64 ran on windows/x64: Diffs are based on 2,341,080 contexts (938,449 MinOpts, 1,402,631 FullOpts). MISSED contexts: base: 34 (0.00%), diff: 37 (0.00%)
  Overall (+1,320,572 bytes)
  FullOpts (+1,320,572 bytes)

Assembly diffs for windows/x64 ran on windows/x64: Diffs are based on 2,512,182 contexts (997,391 MinOpts, 1,514,791 FullOpts). MISSED contexts: base: 29 (0.00%), diff: 30 (0.00%)
  Overall (-18,030 bytes)
  FullOpts (-18,030 bytes)

Details here

Throughput diffs

Throughput diffs for linux/arm64 ran on windows/x64:
  Overall (+0.78% to +2.55%)
  MinOpts (-0.00% to +0.01%)
  FullOpts (+0.80% to +3.29%)

Throughput diffs for linux/x64 ran on windows/x64:
  Overall (+0.82% to +2.31%)
  FullOpts (+0.84% to +2.94%)

Throughput diffs for osx/arm64 ran on windows/x64:
  Overall (+0.78% to +2.27%)
  FullOpts (+0.80% to +3.18%)

Throughput diffs for windows/arm64 ran on windows/x64:
  Overall (+0.77% to +2.53%)
  FullOpts (+0.79% to +3.08%)

Throughput diffs for windows/x64 ran on windows/x64:
  Overall (+0.83% to +2.44%)
  FullOpts (+0.85% to +2.80%)

Details here

Throughput diffs for linux/arm64 ran on linux/x64:
  Overall (+0.68% to +2.26%)
  FullOpts (+0.70% to +2.91%)

Throughput diffs for linux/x64 ran on linux/x64:
  Overall (+0.72% to +2.06%)
  FullOpts (+0.74% to +2.58%)

Details here

Throughput diffs for linux/arm ran on windows/x86:
  Overall (+0.31% to +1.70%)
  FullOpts (+0.33% to +2.77%)

Throughput diffs for windows/x86 ran on windows/x86:
  Overall (+0.48% to +1.86%)
  FullOpts (+0.50% to +2.77%)

Details here
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn
Azure Pipelines successfully started running 3 pipeline(s).
Test failures look like #97892.
We currently support scaled addressing modes when the index also needs an extension through contained `BFIZ` nodes. However, we did not support scaled addressing modes if the index was 64 bits. This adds that support as a natural extension to the `GT_LEA`.
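As a sketch of what the addressing-mode support buys (the element size, 16-byte data offset and function names here are illustrative, matching the listings earlier in the thread rather than the JIT's internal representation): a 32-bit index must be zero-extended before it can participate in address arithmetic, whereas a widened 64-bit index can feed a scaled addressing mode (`base + 4*index + offset`) directly:

```cpp
#include <cstdint>

// Address of a 4-byte array element at index i, with the payload starting
// 16 bytes into the object (as in the listings above; layout illustrative).

uint64_t ElemAddrNarrow(uint64_t objAddr, uint32_t i)
{
    // 32-bit IV: an explicit zero-extension is needed before scaling --
    // this is the 'mov r10d, edx' that widening removed in the example.
    uint64_t idx = static_cast<uint64_t>(i);
    return objAddr + 16 + 4 * idx;
}

uint64_t ElemAddrWide(uint64_t objAddr, uint64_t i)
{
    // 64-bit IV: the whole computation fits one scaled addressing
    // mode (a single GT_LEA), with no separate extension.
    return objAddr + 16 + 4 * i;
}
```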
Diff results for #97865

Throughput diffs

Throughput diffs for linux/arm64 ran on windows/x64:
  Overall (+0.30% to +1.63%)
  MinOpts (-0.03% to +0.00%)
  FullOpts (+0.42% to +2.10%)

Throughput diffs for linux/x64 ran on windows/x64:
  Overall (+0.33% to +1.50%)
  MinOpts (+0.00% to +0.01%)
  FullOpts (+0.47% to +1.91%)

Throughput diffs for osx/arm64 ran on windows/x64:
  Overall (+0.37% to +1.43%)
  MinOpts (-0.04% to +0.01%)
  FullOpts (+0.42% to +2.02%)

Throughput diffs for windows/arm64 ran on windows/x64:
  Overall (+0.39% to +1.50%)
  MinOpts (-0.03% to +0.00%)
  FullOpts (+0.42% to +1.96%)

Throughput diffs for windows/x64 ran on windows/x64:
  Overall (+0.43% to +1.31%)
  MinOpts (+0.00% to +0.01%)
  FullOpts (+0.47% to +1.79%)

Details here

Throughput diffs for linux/arm ran on windows/x86:
  MinOpts (-0.02% to -0.00%)

Throughput diffs for windows/x86 ran on windows/x86:
  Overall (+0.01%)
  MinOpts (+0.00% to +0.02%)
  FullOpts (+0.01%)

Details here
Diff results for #97865

Assembly diffs

Assembly diffs for linux/arm64 ran on windows/x64: Diffs are based on 2,505,877 contexts (1,007,092 MinOpts, 1,498,785 FullOpts). MISSED contexts: base: 1,433 (0.06%), diff: 1,441 (0.06%)
  Overall (+904,416 bytes)
  MinOpts (-22,888 bytes)
  FullOpts (+927,304 bytes)

Assembly diffs for linux/x64 ran on windows/x64: Diffs are based on 2,516,325 contexts (991,070 MinOpts, 1,525,255 FullOpts). MISSED contexts: 1,584 (0.06%)
  Overall (+565,740 bytes)
  FullOpts (+565,740 bytes)

Assembly diffs for osx/arm64 ran on windows/x64: Diffs are based on 2,270,092 contexts (932,669 MinOpts, 1,337,423 FullOpts). MISSED contexts: base: 772 (0.03%), diff: 778 (0.03%)
  Overall (+539,840 bytes)
  MinOpts (-21,812 bytes)
  FullOpts (+561,652 bytes)

Assembly diffs for windows/arm64 ran on windows/x64: Diffs are based on 2,339,802 contexts (938,449 MinOpts, 1,401,353 FullOpts). MISSED contexts: base: 1,309 (0.06%), diff: 1,315 (0.06%)
  Overall (+659,716 bytes)
  MinOpts (-21,684 bytes)
  FullOpts (+681,400 bytes)

Assembly diffs for windows/x64 ran on windows/x64: Diffs are based on 2,510,841 contexts (997,391 MinOpts, 1,513,450 FullOpts). MISSED contexts: base: 1,370 (0.05%), diff: 1,371 (0.05%)
  Overall (-20,210 bytes)
  FullOpts (-20,210 bytes)

Details here

Throughput diffs

Throughput diffs for linux/arm64 ran on windows/x64:
  Overall (+0.30% to +1.63%)
  MinOpts (-0.03% to +0.00%)
  FullOpts (+0.42% to +2.10%)

Throughput diffs for linux/x64 ran on windows/x64:
  Overall (+0.33% to +1.50%)
  MinOpts (+0.00% to +0.01%)
  FullOpts (+0.47% to +1.91%)

Throughput diffs for osx/arm64 ran on windows/x64:
  Overall (+0.37% to +1.43%)
  MinOpts (-0.04% to +0.01%)
  FullOpts (+0.42% to +2.02%)

Throughput diffs for windows/arm64 ran on windows/x64:
  Overall (+0.39% to +1.50%)
  MinOpts (-0.03% to +0.00%)
  FullOpts (+0.42% to +1.96%)

Throughput diffs for windows/x64 ran on windows/x64:
  Overall (+0.43% to +1.31%)
  MinOpts (+0.00% to +0.01%)
  FullOpts (+0.47% to +1.79%)

Details here

Throughput diffs for linux/arm ran on windows/x86:
  MinOpts (-0.02% to -0.00%)

Throughput diffs for windows/x86 ran on windows/x86:
  Overall (+0.01%)
  MinOpts (+0.00% to +0.02%)
  FullOpts (+0.01%)

Details here

Throughput diffs for linux/arm64 ran on linux/x64:
  Overall (+0.32% to +1.53%)
  MinOpts (+0.01% to +0.08%)
  FullOpts (+0.42% to +1.96%)

Throughput diffs for linux/x64 ran on linux/x64:
  Overall (+0.31% to +1.43%)
  MinOpts (+0.00% to +0.01%)
  FullOpts (+0.44% to +1.75%)

Details here
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress
Azure Pipelines successfully started running 2 pipeline(s).
cc @dotnet/jit-contrib

PTAL @BruceForstall @AndyAyersMS

Diffs. I spent a long time analyzing these; see #97865 (comment) and #97865 (comment).

There were a bunch of regressions from widening in cases where the same local is the primary IV of multiple loops. This particularly affects cloning. I've disabled this case for now; I want to look closer at it and figure out how to enable it separately.

The initial implementation always placed the initialization of the widened IV in the preheader. However, if the reaching def from outside the loop comes from a block much further back, then the lifetime of the widened IV is shorter than the old IV's. In a lot of cases that resulted in suboptimal register allocation of the widened IV (because other live variables ended up taking the "good" register before we got to the initialization of the widened IV). So this implementation tries to place the initialization of the widened IV in the same spot as the old narrow IV, and then does a best-effort pass to replace uses of the narrow IV with the wide IV on the way to the loop. That seems to work well, but it's a bit of a hack -- I hope to reevaluate it in the future, perhaps with some LSRA work.

Throughput impact is in the 0.2-0.3% range for most collections, with benchmarks.run_pgo being a bit worse.
Did a first review pass and it looks good overall. Will look again later.
I wonder if you might save some TP if you kept track of the uses during the locate phase and used that to drive replacement.
More generally if we want to recognize secondary IVs it seems like walking forward from the primary IV appearances might be a good plan. Perhaps when we get to strength reduction?
// => <L, 0, <L, 1, 1>>
//
// The main issue is that it requires cache invalidation afterwards
// and turning the recursive result into an addrec.
Any idea how prevalent these more complicated IVs might be?
I haven't checked, but I have some (hacky) code lying around for it that I was planning to look at the impact of later.

There are some other notable missing things in the SCEV IR, like a subtraction operator, so loops that subtract by an invariant local are not representable (subtractions with a constant are usually represented as GT_ADD, so those are ok).
}

return nullptr;
}
Hoisting has a more refined notion of loop invariance that can handle more tree shapes; any thought of using something like that here?
Do you mean by switching to use VN? I suppose that would be possible (although somewhat annoying since the VN wouldn't give me things like "reaching def from outside the loop" that I'm making use of here). I think it would mainly get us the ability to reason about loop invariant loads (so you could have e.g. <L00, 0, IND(x)>, provided we also started representing loads in the SCEV IR).
I pushed a change that switches the profitability check to use the tree list and also saves all statements in which appearances were found, such that the replacement step doesn't need to walk all the loop IR again. The savings are relatively modest (TP diff) -- I think most TP is eaten by the DFS and loop finding. But indeed it seems like we can reuse it for strength reduction as well, to limit what nodes we invoke the scalar evolution analysis on.
LGTM.
Will be interesting to see how this evolves when you start to look for secondary IVs.
I've started looking at this. |
Some comments:
- It would be valuable for this code to have more design documentation. E.g., a Markdown document (in the JIT "doc" tree) on SCEV, on the IV widening algorithm design, etc. This could also take the form of more extensive design-oriented comments.
- Would it make sense to extract the SCEV code to separate scev.h/cpp files? That is, separate the analysis and optimization code.
- Add cScev/dScev debug dumpers?
- Is there any useful dumper to be added to clrjt.natvis?
- To match LLVM conventions, should the SCEV display the addition operator (even though that is currently the only one), e.g., <loop, start, +, step> or <start, +, step>(loop)?
// Parameters:
//   scev - The scev node.
//
void ScalarEvolutionContext::DumpScev(Scev* scev)
I presume this can't be static for some reason? If it was static it would be easier to call from a debugger, perhaps.
The main thing was that I wanted to display the specific loop for ScevAddRec, and ScevAddRec itself doesn't have the loop field. However, I just added that loop field as a debug-only field, and made the dumping a member method on Scev. So now we have cScev and dScev.
}

//------------------------------------------------------------------------
// optBestEffortReplaceNarrowIVUsesWith: Try to find and replace uses of the specified
What makes this "best effort"?
It seems awkward to end the function name with "With".
Renamed it to optBestEffortReplaceNarrowIVUses, and added remarks:
// Remarks:
// This function is best effort; it might not find all uses of the provided
// SSA num, particularly because it does not follow into joins. Note that we
// only use this to replace uses of the narrow IV outside the loop; inside
// the loop we do ensure that all uses/defs are replaced.
// Keeping it best-effort outside the loop is ok; there is no correctness
// issue since we do not invalidate the value of the old narrow IV in any
// way, but it may mean we end up leaving the narrow IV live concurrently
// with the new widened IV, increasing register pressure.
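A minimal sketch of such a best-effort walk, on a toy flow-graph representation (the `Block` struct and `ReplaceUses` helper are invented for illustration; the actual JIT code walks its real flow graph and SSA uses): process a block's uses, then keep walking predecessors only while there is exactly one, i.e. never follow into joins:

```cpp
#include <vector>

// Toy flow-graph block: predecessor edges plus the local numbers used in it.
struct Block
{
    std::vector<Block*> preds;
    std::vector<int>    uses;
};

// Best-effort replacement of narrowLcl with wideLcl, walking backwards
// from 'from'. We stop at a join (block with multiple preds) because we
// cannot cheaply tell which incoming path the widened def covers. Missing
// a use is fine here: it only keeps the narrow IV live a bit longer.
int ReplaceUses(Block* from, int narrowLcl, int wideLcl)
{
    int replaced = 0;
    for (Block* b = from; b != nullptr;)
    {
        for (int& lcl : b->uses)
        {
            if (lcl == narrowLcl)
            {
                lcl = wideLcl;
                replaced++;
            }
        }
        // Best effort: do not follow into joins.
        b = (b->preds.size() == 1) ? b->preds[0] : nullptr;
    }
    return replaced;
}
```

The correctness observation from the remarks carries over: a missed use is never wrong, it only costs register pressure, which is why a linear single-pred walk suffices.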
Makes sense, but personally I prefer to have it in the relevant files, so I added some documentation there.
Yeah, seems good to me. I split it into scev.h, scev.cpp and inductionvariableopts.cpp.
Done.
Added some basic display of the SCEV oper/type.
I actually matched the notation from the paper (I think LLVM uses curly braces as well, and has a different representation of chains of recurrences). If we end up with more complex recurrences then we can consider changing the notation.
SPMI 0 sized diffs look like #98996. All CI jobs were green last time it ran, and since then I only reorganized some things without any diffs, so the failures/timeouts look unrelated.
When the induction variable has uses after the loop, the widening pass will store the widened version back to the old local. This is only possible if all exits where the old local is live-in are not critical blocks, in the sense that all their preds must come from inside the loop. Exit canonicalization ensures that this is usually the case, but RBO/assertion prop may have uncovered new natural loops, so we still have to repeat the check.
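The exit-block condition can be sketched as follows (toy representation; the `Blk` struct, integer block numbering and helper name are invented for illustration): the store-back at an exit is only legal when every pred of the exit block is a loop block:

```cpp
#include <algorithm>
#include <vector>

// Toy flow-graph block: just its predecessor block numbers.
struct Blk
{
    std::vector<int> preds;
};

// A loop exit may receive the "store widened IV back to the old local"
// only if all of its preds come from inside the loop; otherwise the
// store could execute on a path that never entered the loop.
bool CanStoreBackAtExit(const Blk& exitBlk, const std::vector<int>& loopBlocks)
{
    for (int pred : exitBlk.preds)
    {
        if (std::find(loopBlocks.begin(), loopBlocks.end(), pred) == loopBlocks.end())
        {
            return false; // pred from outside the loop: skip store-back
        }
    }
    return true;
}
```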