-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/compile: generate BT (bit-test) instruction on 386/amd64 #18943
Comments
This is useful also for the constant case (that is, when For instance, if t.wall&hasMonotonic { generate this:
while this could actually be a single (In fact, testing against highest bit could also be done with |
I started playing with this, figuring it'd be an easy rule to add. (Haha.) The natural place to put this is AMD64.rules. However, the rule ends up being really contorted, due to ordering problems--despite my best efforts, the shift gets lowered first, which means the rule needs to match the fully expanded shift pattern. And this will be fragile--it'll be broken by any changes to how shifts get lowered (which might well be the cause of #18946). I see three mediocre options. (1) Add a generic OpBitTest, and write a generic matching rule. Let each architecture lower BitTest however appropriate, possibly just expanding it back out to shifts and compares. Upsides: Unified, simple. Downsides: Extra work for architectures without bit test instructions, might still need arch-specific rules as well to catch opportunities introduced after generic rewrites. (2) Split the arch-specific optimization and lowering into multiple rewrite passes. This could then be done in an early arch pass. Upsides: More targeted, no extra generic ops. Downsides: Having multiple passes seems complicated (how are they specified? do we run each pass once or cycle through them to a fixed point?). (3) Add some kind of complexity ordering to rulegen, such that more specific rules always match before more general ones. Upsides: Magic. Downsides: Magic, generality is not obviously well-defined, might be harder to implement and to understand, might make rewrite pass slower. Input requested, both specific to this optimization and more generally on the design question. |
(4) Force through the contorted rule and rely on the test to detect breakage. I'm leaning towards (1), guarded by a condition checking arch support for bittest. |
Yes, the logic needed for >63 bit shifts throws a wrench into the rewrite
rules.
That same wrench hits the bit test rules also, so that's unsurprising in
retrospect.
I don't want to do 2 or 3, that's a lot of generic complication to solve
this particular problem (unless there's lots of other cases lurking like
this). Either 1 or 4 is ok with me.
…On Sun, Feb 5, 2017 at 5:52 PM, Josh Bleecher Snyder < ***@***.***> wrote:
(4) Force through the contorted rule and rely on the test to detect
breakage.
I'm leaning towards (1), guarded by a condition checking arch support for
bittest.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#18943 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AGkgIHLNU3H4M_naMSq7hGGbkNjZHIfSks5rZnzTgaJpZM4L3dEU>
.
|
...and I just found out what you meant by this, the hard way. :) Bummer. I don't think BTQ buys us anything here in the general case after all. Fallback plan: Implement only for constant b. Shouldn't be too hard, since I have the general case (mis-)implemented. |
Even in the non-constant case, there needs to be a way to special-case when the compiler knows that B is < 63. This is exactly the same special-case that is required for other bit-level operations like rotate (#18940), and in this case BTQ helps a lot. So I wouldn't throw the code away. I'm not qualified to comment on the options you proposed, but I'll notice that many platforms have bit-level operations, or can do bit operations more efficiently than generic AND/OR operations. For instance, ARM32 can materialise single bit masks (while it cannot materialise generic 32-bit words), though maybe the ARM backend has already enough information to do that. I'll check this later today (wanna see the code generated by |
CL https://golang.org/cl/36329 mentions this issue. |
CL coming soon. It handles:
It doesn't handle:
I leave the decision about other architectures up to @cherrymui and others. Memory args is a general problem. The other patterns would make a good project for someone getting up to speed on SSA rules, now that the trail has been broken. Marking this as Suggested. |
Updates golang#18943 Change-Id: If3080d6133bb6d2710b57294da24c90251ab4e08
Updates #18943 Change-Id: If3080d6133bb6d2710b57294da24c90251ab4e08 Reviewed-on: https://go-review.googlesource.com/36329 Run-TryBot: Josh Bleecher Snyder <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Keith Randall <[email protected]>
I have this partially working (i.e., handling the other cases mentioned in the original report, amd64 only). Might be a while before I send the CL since I'm going on vacation soon. |
@cespare still planning on sending the CL? |
Sorry, I missed a few weeks of notifications due to gmail misconfiguration. I had forgotten about this. Thanks for reminding me. I'll get back to it shortly, but not before the freeze (tomorrow). If this is important to anyone, please feel free to implement it before I send a CL -- I'm mostly doing it to learn about SSA anyway. |
@cespare any news on this? Do you have some code somewhere that we can work on? |
I haven't made time to work on this again. I have a partially working version (somewhere, can't find the branch right now) but it wasn't anything special, just a few fairly obvious rules. Please feel free to pick this up and send a CL. |
Change https://golang.org/cl/94766 mentions this issue: |
Change https://golang.org/cl/94764 mentions this issue: |
Spotted while working on #18943, it triggers once during bootstrap. Change-Id: Ia4330ccc6395627c233a8eb4dcc0e3e2a770bea7 Reviewed-on: https://go-review.googlesource.com/94764 Reviewed-by: Keith Randall <[email protected]>
Is it worth porting these rules to 386, since it should be entirely straightforward? |
Possibly, but I personally never use the 386 backend. |
Pardon the uninformed drive-by question.. In the CR description, it says "Go1 benchmarks seem unaffected, and I would be surprised otherwise". I would have expected this sort of change to be done to improve performance, but that statement indicates both that it does not improve performance and that an improvement was not expected. What is the benefit of these changes and how are they being prioritized if there's no measurable effect on the end program? I've seen a number of similar changes, and I was really curious about it. Thanks! |
It doesn't improve Go1 because Go1 doesn't contain any hot-loop related to bit manipulation. It does make the code more compact, make it use less registers (so improve register pressure), and improve performance of specific code that does lots of bit manipulation. Basically this is in a "long tail" of things that do slight improvements, and eventually all small benefits add up. |
Keep in mind the "Go1 benchmarks seem unaffected" and "performance seem unaffected" are two very different statements. We don't want to let deficiencies in our benchmarks (of which there are many) impede progress. That said, for performance improvement CLs it is always nice to at least have a microbenchmark that demonstrates improvement. Measurements of code size improvements are also useful. For small changes with no obvious downsides (like this one), sometimes we Just Do It™®©. |
Makes sense, thank you for the explanation! I would have expected the crypto benchmarks to show something, but this makes sense. |
Yeah. I only use it when I need to test something on a 32 bit platform easily. I wonder whether anyone cares much about it (performance-wise). |
The x86 instruction set has a BT (bit test) opcode, which is the moral equivalent of A & (1<<B), with no write back and only carry flag affected.
LLVM uses it when it detects (at least) one of the following patterns:
(note that the bitwise not is not emitted, it's used to reverse the conditional branch)
Currently, Go generates the whole shift sequence, which is much longer and wastes two registers (the intermediate shift result, and ECX as shift count -- the latter because regalloc always emit a MOV to populate ECX when needed for shifting).
The text was updated successfully, but these errors were encountered: