Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: x*x*x*x -> (x*x)*(x*x) transformation #65141

Closed
ShreyasJejurkar opened this issue Feb 10, 2022 · 18 comments
Closed

JIT: x*x*x*x -> (x*x)*(x*x) transformation #65141

ShreyasJejurkar opened this issue Feb 10, 2022 · 18 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner

Comments

@ShreyasJejurkar
Copy link
Contributor

ShreyasJejurkar commented Feb 10, 2022

As usual, was playing with the JIT ASM compiler for C# and found an interesting optimization that can be done for codegen.

Demo link -> https://godbolt.org/z/1v9sMz7Y5 (contains both C# and C++ compiler example)

As you can see above, in C#, we always move the parameter to the eax and call imul on it as many times as there is multiplication. This is ok when the multiplication is having 2,3 or even 5 with itself, but when it exceeds there is possible optimization can be made just like gcc one.

If you look at the GCC compiler, whenever it finds there is the multiplication of number itself in even number of times, before even moving parameter to eax, it does imul of it first and then moves it to the eax , resulting in a less imul instruction.

This is applicable for multiple even multiplications (2^n). Just update the code with multiple (num * num) and see gcc will always result in one less instruction compared to C# ones!

I know it's not a groundbreaking optimization idea, but still one lesser assembly instruction.

cc @EgorBo @AndyAyersMS

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Feb 10, 2022
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@EgorBo
Copy link
Member

EgorBo commented Feb 10, 2022

Thanks, I love the fact it's now possible to compare C++ and C# back to back on the same godbolt link! But we need to fix the ordering 🙂

As for the request - yeah, this optimization is useful when you need to compute Power function Math.Pow(X, const) (especially for floats in ffast-math) but we don't expand it for const power and we don't even have Math.Pow for integers, so I assume this pattern is highly unlikely to be seen in the wild, unless you point me to some place where we can meet it 🙂

@EgorBo EgorBo changed the title A chance for optimization. JIT: x*x*x*x -> (x*x)*(x*x) transformation Feb 10, 2022
@KieranFoot
Copy link

@EgorBo (x*x)*(x*x) reduces the instruction count nicely.

Is there a reason why there is no int version of Math.Pow() ? Is there simply no requirement for it?

@EgorBo
Copy link
Member

EgorBo commented Feb 10, 2022

@EgorBo (x*x)*(x*x) reduces the instruction count nicely.

technically it removes just one imul which is not super expensive 🤷‍♂️ (Lat=3, Rec.TP=1 on Zen3)

Is there a reason why there is no int version of Math.Pow() ? Is there simply no requirement for it?

Feel free to file an API proposal, it seems like developers (and even us) sometimes use double-one and cast back to int
image

@ShreyasJejurkar
Copy link
Contributor Author

While checking the above stuff, found another one. Why there is a huge codegen difference in this. Although in the end, both are doing the same stuff.

https://sharplab.io/#v2:EYLgtghglgdgNAFxBAzmAPgAQEwEYCwAUJgMwAEOZAwmQN5FmMXkA2A9jAOZkCyAFLARkYAVzABKOgyYzMAdmFiyAKkVgVajaLABuaYwC++ssdJl2XMgBUBMIdsn1CM2Qp4QEACwB0AISicPCIsfNpwapKq7l5+AUEhYRF6zkxGhAZAA

@EgorBo
Copy link
Member

EgorBo commented Feb 10, 2022

While checking the above stuff, found another one. Why there is a huge codegen difference in this. Although in the end, both are doing the same stuff.

https://sharplab.io/#v2:EYLgtghglgdgNAFxBAzmAPgAQEwEYCwAUJgMwAEOZAwmQN5FmMXkA2A9jAOZkCyAFLARkYAVzABKOgyYzMAdmFiyAKkVgVajaLABuaYwC++ssdJl2XMgBUBMIdsn1CM2Qp4QEACwB0AISicPCIsfNpwapKq7l5+AUEhYRF6zkxGhAZAA

it's 32bit. 64bit integers are terrible for 32bit targets. Either switch to x64 on sharplab or return back to godbolt 🙂

@ShreyasJejurkar
Copy link
Contributor Author

@KieranFoot
Copy link

@EgorBo it removed the movapd and one mulsd.

As you say, this isn't so significant for a one off call, but if an app were doing this many times it would be a good perf improvement.

@SingleAccretion SingleAccretion added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 10, 2022
@ghost
Copy link

ghost commented Feb 10, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

As usual, was playing with the JIT ASM compiler for C# and found an interesting optimization that can be done for codegen.

Demo link -> https://godbolt.org/z/1v9sMz7Y5 (contains both C# and C++ compiler example)

As you can see above, in C#, we always move the parameter to the eax and call imul on it as many times as there is multiplication. This is ok when the multiplication is having 2,3 or even 5 with itself, but when it exceeds there is possible optimization can be made just like gcc one.

If you look at the GCC compiler, whenever it finds there is the multiplication of number itself in even number of times, before even moving parameter to eax, it does imul of it first and then moves it to the eax , resulting in a less imul instruction.

This is applicable for multiple even multiplications (2^n). Just update the code with multiple (num * num) and see gcc will always result in one less instruction compared to C# ones!

I know it's not a groundbreaking optimization idea, but still one lesser assembly instruction.

cc @EgorBo @AndyAyersMS

Author: ShreyasJejurkar
Assignees: -
Labels:

area-CodeGen-coreclr, untriaged

Milestone: -

@tannergooding
Copy link
Member

Is there a reason why there is no int version of Math.Pow() ? Is there simply no requirement for it?

Simply put, because it isn't useful in practice. -- There is a related discussion here dotnet/csharplang#2585

Integers have a very narrow range comparatively and the number of valid inputs are fairly exact and limited. Most inputs will actually throw or are better handled a different way.

Ignoring x^1 and 1^x (which are just x and 1, respectively), there are 48,036 valid powers that are between 0 and int.MaxValue.

Now that might sound like a lot, but lets actually break it down:

  • 2 goes up to the power of 30
  • 3 goes up to the power of 19
  • 4 goes up to the power of 15
  • 5 goes up to the power of 13
  • 6 and 7 go up to the power of 11
  • 8 goes up to the power of 10
  • 9 and 10 go up to the power of 9
  • 11 through 14 go up to the power of 8
  • 15 through 21 go up to the power of 7
  • 22 through 35 go up to the power of 6
  • 36 through 73 go up to the power of 5
  • 74 through 215 go up to the power of 4
  • 216 through 1290 go up to the power of 3
  • 1291 through 46340 go up to the power of 2

This means that there is a very very small window of valid integer values that can be represented as powers in the first place; most of which (~45,000 of them; which is ~93.7%) are only valid as a power of two (e.g. x * x).

These values are also very specific and represent a very small portion of the overall 2.14 billion positive integers you can otherwise have. Their usage outside 2^x and 10^x are also going to be fairly limited, both of which have better alternatives that are perhaps more clear (e.g. 2 << x for powers of two or 10_000_000 or similar for powers of ten).

@tannergooding
Copy link
Member

tannergooding commented Feb 10, 2022

It is not worthwhile, in my opinion, to expose a Math.Pow function for integers that will only be able to handle approx. 3000 scenarios (and where otherwise just x * x is better). At that point you're simply better off manually handling your scenario if the perf difference is enough to be meaningful

@AndyAyersMS
Copy link
Member

There is a different message here perhaps -- I have worried in the past that the jit's tree costing/balancing is too oriented towards reducing register consumption. Tree height reduction (see for example) is often useful if we think there are enough registers around to support a more aggressive computation shape.

@bartonjs
Copy link
Member

x1 * x2 * x3 * x4 being re-evaluated from ((x1 * x2) * x3) * x4 to (x1 * x2) * (x3 * x4) isn't valid in a checked scope.

Let x1 = 0x2000_0000, x2 = 2, x3 = 4, x4 = 0.

  • x1 * x2 = 0x4000_0000
    • x1 * x2 * x3 = 0x0000_0001_0000_0000
      • Both Int32 and UInt32 trip OverflowException here
    • x3 * x4 = 0
      • (x1 * x2) * (x3 * x4) = 0
      • No OverflowException

For the checked scope case, there's not only a difference in whether or not an exception was thrown, but whether or not the x4 expression should have evaluated.

Though it looks like the issue is really only about computing x^4 (so x2=x1, x3=x1, x4=x1). Assuming it's already bound to a single local, so there's not a side effect lost by the exception, or gained by it not happening), I can't think how it'd be a problem for that so long as the overflow flags are checked after both halves.

@bartonjs
Copy link
Member

@tannergooding You forgot all 2^32 - 1 values of 0^x = 1, and 0^0 whose value, apparently, leaves room for debate. 😏

@tannergooding
Copy link
Member

sigh 😆

@AndyAyersMS
Copy link
Member

@bartonjs agreed life is more complicated with checked math. But the jit doesn't optimize checked constructs, by and large.

@JulieLeeMSFT
Copy link
Member

From the discussion it is unlikely that we will optimize it for integer.
@ShreyasJejurkar is it ok to close this issue?

@ShreyasJejurkar
Copy link
Contributor Author

By looking at the discussion and resulting codegen, it's just one instruction less which is imul, if it's not making a huge perf difference, then am totally ok to close this issue.

Thanks, everyone!

@ghost ghost locked as resolved and limited conversation to collaborators Mar 13, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner
Projects
None yet
Development

No branches or pull requests

8 participants