JIT: Optimize redundant memory barriers on arm/arm64 #60219
Conversation
Tagging subscribers to this area: @JulieLeeMSFT

Issue Details

Fixes #60023

class Prog
{
    volatile int a;

    void Test1(Prog p) => p.a = a;
    void Test2() => a++;
}

Codegen diff:

; Method Prog:Test1(Prog):this
G_M46938_IG01:
stp fp, lr, [sp,#-16]!
mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M46938_IG02:
ldr w0, [x0,#8]
- dmb ishld
dmb ish
str w0, [x1,#8]
;; bbWeight=1 PerfScore 24.00
G_M46938_IG03:
ldp fp, lr, [sp],#16
ret lr
;; bbWeight=1 PerfScore 2.00
-; Total bytes of code: 32
+; Total bytes of code: 28

; Method Prog:Test2():this
G_M48563_IG01:
stp fp, lr, [sp,#-16]!
mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M48563_IG02:
ldr w1, [x0,#8]
- dmb ishld
+ dmb ish
add w1, w1, #1
- dmb ish
str w1, [x0,#8]
;; bbWeight=1 PerfScore 24.50
G_M48563_IG03:
ldp fp, lr, [sp],#16
ret lr
;; bbWeight=1 PerfScore 2.00
-; Total bytes of code: 36
+; Total bytes of code: 32
/azp run runtime-coreclr jitstressregs, runtime-coreclr outerloop, runtime-coreclr jitstress2-jitstressregs, runtime-coreclr gcstress0x3-gcstress0xc
PTAL @dotnet/jit-contrib
LGTM, but may want to get another review.
cc @echesakovMSFT
@EgorBo In the case where a volatile write is followed by a volatile read (the inverse of your example), we will still insert a barrier before the write and one after the read to guarantee their release/acquire semantics. Is this a correct interpretation of the change?
volatile int a;

int Test2()
{
    a = 42;   // store
    return a; // load
}

In this case we emit:

stp fp, lr, [sp,#-16]!
mov fp, sp
G_M18266_IG02:
mov w1, #42
dmb ish
str w1, [x0,#8]
ldr w0, [x0,#8]
dmb ishld
G_M18266_IG03:
ldp fp, lr, [sp],#16
ret lr

Nothing is optimized in this case, since there are memory accesses between the two barriers.
LGTM
#ifdef TARGET_ARMARCH
    // Reset emitLastMemBarrier for new IG
    emitLastMemBarrier = nullptr;
#endif
This should really be:

if (!emitAdd)
{
    emitLastMemBarrier = nullptr;
}

because if we're saving this IG for the purpose of creating a new, "overflow" IG, we're not creating a label, so it should be treated as contiguous.
It's unfortunate that this "peephole optimization" is spread around the emitter instead of being handled entirely within codegen. E.g., instead of emitLastMemBarrier, couldn't there be a codegen variable that keeps track of the last memory barrier, and clear that at BasicBlock boundaries (not IG boundaries)? You're leveraging a convenient chokepoint in the emitter, appendToCurIG, to clear this if a load/store is found, but maybe there is something not much more difficult in codegen only (and not touching the emitter).
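For illustration, a rough sketch of what that codegen-only alternative might look like - genLastMemBarrier is a hypothetical member, and none of this is the actual change:

// Hypothetical codegen-side tracking: remember the last barrier emitted and
// forget it at BasicBlock boundaries, where control flow can join.
void CodeGen::genCodeForBBlist()
{
    for (BasicBlock* block = compiler->fgFirstBB; block != nullptr; block = block->bbNext)
    {
        genLastMemBarrier = nullptr; // new block: prior barriers can't be trusted

        // ... generate code for the block's nodes; emitting any load or store
        // would also clear genLastMemBarrier, and barrier emission would check
        // it before producing a new dmb ...
    }
}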
I tried to do that in codegen first, but it didn't look safe to me - I wasn't sure that if I removed the GTF_VOLATILE flag from a load, it wouldn't then be optimized further in the emitter or elsewhere in codegen. I chose the emitter because on arm the only instructions we need a memory barrier for are ldr and str (and their variations), unlike x86, where it's much more complicated. And it's where the other last-chance peepholes live.
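For illustration, here is a minimal sketch of that emitter-side peephole, assuming emitLastMemBarrier points at the last barrier emitted in the current IG and appendToCurIG clears it whenever a load or store is appended; the entry point and helpers below are hypothetical names, not the actual change:

// Sketch only, not the actual dotnet/runtime code.
// emitLastMemBarrier: last dmb emitted in the current IG with no load/store
// after it, or nullptr.
void emitter::emitMemoryBarrier(bool loadOnly) // hypothetical entry point
{
    if (emitLastMemBarrier != nullptr)
    {
        // No memory access since the previous barrier, so the two can merge.
        if (!loadOnly && emitLastBarrierIsLoadOnly()) // hypothetical query
        {
            // dmb ishld followed by dmb ish: strengthen the earlier barrier
            // to a full one instead of emitting a second dmb.
            emitStrengthenLastBarrierToFull(); // hypothetical helper
        }
        // The earlier (possibly strengthened) barrier now subsumes the
        // requested one, so skip emitting it.
        return;
    }
    emitIns_BARR(INS_dmb, loadOnly ? INS_BARRIER_ISHLD : INS_BARRIER_ISH);
    emitLastMemBarrier = emitLastIns; // remember it for the next request
}

This matches the diffs above: in Test2 the first dmb ishld is strengthened to dmb ish and the second barrier disappears.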
> because if we're saving this IG for the purpose of creating a new, "overflow" IG, we're not creating a label, so it should be treated as contiguous.
Yeah, I kept that in mind, but it didn't produce any diffs, so I decided to leave it as is. I was mostly interested in optimizing things like volatileField++ or volatileField *= 10, etc., that we have in some places in Task. Fortunately, we don't emit many memory barriers on arm - for performance-critical code we mostly use Volatile.Read/Write, which don't produce them.
Fixes #60023
Fixes #60021
This PR removes redundant load-only or full memory barriers when we have already emitted one in the current IG and there were no memory accesses after it. By "memory accesses" I mean all instructions in the instruction table instrsarm64.h (and instrsarm.h for arm32) with "ST" or "LD" in the 3rd column.
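As a hedged sketch of that check: the arm/arm64 emitters expose queries such as emitInsIsLoad, emitInsIsStore, and emitInsIsLoadOrStore backed by those LD/ST flags, and an invalidation test could be written along these lines (the wrapper name is hypothetical):

// Sketch: does this instruction count as a "memory access" that should
// invalidate the remembered barrier? Anything flagged LD and/or ST qualifies.
// The three queries overlap; OR-ing them all keeps the sketch conservative.
bool emitter::emitInsInvalidatesMemBarrier(instruction ins) // hypothetical name
{
    return emitInsIsLoad(ins) || emitInsIsStore(ins) || emitInsIsLoadOrStore(ins);
}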
Examples: the codegen diff shown above.
Diffs (altjit):

- coreclr_tests.pmi.Linux.arm64.checked.mch: Detail diffs
- libraries.crossgen2.Linux.arm64.checked.mch: Detail diffs
- libraries.pmi.Linux.arm64.checked.mch: Detail diffs
- libraries_tests.pmi.Linux.arm64.checked.mch: Detail diffs