Performance regression in Vector<T>.Vector<T>(T) on x86/x64 #108929
Comments
@ap5d Your example uses x64 machine code. Could you update the issue title to reflect this?
As a workaround, you can use This is an easy thing to fix, but potentially not worth backporting without additional user reports.
push rbp
sub rsp, 48
lea rbp, [rsp+0x30]
mov dword ptr [rbp-0x30], esi
mov dword ptr [rbp-0x2C], esi
mov dword ptr [rbp-0x28], esi
mov dword ptr [rbp-0x24], esi
mov dword ptr [rbp-0x20], esi
mov dword ptr [rbp-0x1C], esi
mov dword ptr [rbp-0x18], esi
mov dword ptr [rbp-0x14], esi
vmovups ymm0, ymmword ptr [rbp-0x30]
vmovups ymmword ptr [rdi], ymm0
mov rax, rdi
vzeroupper
add rsp, 48
pop rbp
ret
Unnecessary zeroing of memory is elided here.
Put up #108945 to resolve the issue. The entry was missing a flag and in the incorrect place to be handled. It was broken by accident as part of a larger refactoring that was simplifying the
Description
With .NET 9 RC2, the Vector<T> constructor that broadcasts a scalar to all elements of a vector is not optimized to a broadcast instruction on x86/x64. The .NET 8 compiler performs this optimization.
Reproduction Steps
The regression can be reproduced by compiling the following function:
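A minimal sketch of such a function, assuming a `Vector<int>` broadcast. The 32-bit stores through `esi` in the listing in the comments suggest a 32-bit integer element, but the element type and the names `BroadcastRepro` and `Broadcast` are assumptions made for illustration:

```csharp
using System.Numerics;

public static class BroadcastRepro
{
    // Hypothetical repro: the Vector<T>(T) constructor copies `value`
    // into every element of the returned vector.
    public static Vector<int> Broadcast(int value)
    {
        return new Vector<int>(value);
    }
}
```

One way to inspect the generated code for a method like this is to set the `DOTNET_JitDisasm` environment variable to the method name before running the program.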
Expected behavior
I would expect the compiler to use only a few instructions to broadcast the scalar to all elements. This is what the .NET 8 compiler produces:
So a single vpbroadcastd does the job when AVX2 is enabled.
Actual behavior
With .NET 9 RC2, the following machine code is generated:
As you can see, the compiler writes each element individually to a buffer on the stack and then loads the vector from that buffer, which is much slower.
Regression?
No response
Known Workarounds
Use .NET 8, or select the Vector128/256/512.Create method based on the Vector<T> length:
This workaround results in the following machine code with .NET 9 RC2:
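For reference, the length-based selection described in this workaround could be written roughly as follows. This is C# source, not the generated machine code, and it is only a sketch: the `int` element type and the names `BroadcastWorkaround` and `Broadcast` are assumptions, and `Vector512` requires .NET 8 or later.

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

public static class BroadcastWorkaround
{
    // Hypothetical helper: picks the fixed-width Create overload whose
    // element count matches Vector<int>.Count, then reinterprets the
    // result as Vector<int>.
    public static Vector<int> Broadcast(int value)
    {
        if (Vector<int>.Count == Vector512<int>.Count)
        {
            return Vector512.Create(value).AsVector();
        }

        if (Vector<int>.Count == Vector256<int>.Count)
        {
            return Vector256.Create(value).AsVector();
        }

        // Remaining case: Vector<int> is 128 bits wide.
        return Vector128.Create(value).AsVector();
    }
}
```

The comparisons involve only values that are fixed for the lifetime of the process, so the JIT is expected to keep just the matching branch, although that is worth confirming in the disassembly.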
Configuration
No response
Other information
No response