Finish BitArray performance optimizations using AVX from #39173 #31161
Comments
I'm interested; Vectorisation always seemed fun :^) Hopefully I can pick up some more knowledge on this. |
@Gnbrkm41 awesome! I've assigned you ;) Please let me know if you need any help |
Note on benchmarks: they seem to measure only fixed-size (500?) bit arrays, and I suspect the AVX2 implementation will regress small arrays (that's why I didn't add it in my initial SIMD BitArray PR), so I'd run them for a range of sizes starting from 1. |
Yes, I am aware of the issues with the 256-bit path from my personal experiments with this. I'm not sure whether it would be worth doing a length check and then picking whichever path is appropriate... It would also be helpful to know what the average length of a BitArray is. |
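For illustration, a length check of the kind mentioned above could look roughly like this. This is only a sketch under assumptions, not the code from dotnet/corefx#39173: the `BitwiseAndSketch` class, the `And` helper and the `Avx2MinWords` cutoff are hypothetical names, and the cutoff value is a placeholder that would have to come from measurements.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class BitwiseAndSketch
{
    // Hypothetical cutoff in 32-bit words (8 words = 256 bits); not a measured value.
    private const int Avx2MinWords = 8;

    // Computes left[i] &= right[i] for every word, choosing a path by length.
    public static void And(int[] left, int[] right)
    {
        if (left.Length != right.Length)
            throw new ArgumentException("Arrays must have the same length.");

        int i = 0;

        if (Avx2.IsSupported && left.Length >= Avx2MinWords)
        {
            // Reinterpret the words as 256-bit vectors (8 ints each).
            Span<Vector256<int>> l = MemoryMarshal.Cast<int, Vector256<int>>(left.AsSpan());
            Span<Vector256<int>> r = MemoryMarshal.Cast<int, Vector256<int>>(right.AsSpan());
            for (int j = 0; j < l.Length; j++)
                l[j] = Avx2.And(l[j], r[j]);
            i = l.Length * Vector256<int>.Count;
        }
        else if (Sse2.IsSupported)
        {
            // Reinterpret the words as 128-bit vectors (4 ints each).
            Span<Vector128<int>> l = MemoryMarshal.Cast<int, Vector128<int>>(left.AsSpan());
            Span<Vector128<int>> r = MemoryMarshal.Cast<int, Vector128<int>>(right.AsSpan());
            for (int j = 0; j < l.Length; j++)
                l[j] = Sse2.And(l[j], r[j]);
            i = l.Length * Vector128<int>.Count;
        }

        // Scalar tail; also the whole job when no intrinsics are available.
        for (; i < left.Length; i++)
            left[i] &= right[i];
    }
}
```

Whether the extra branch pays for itself is exactly the open question here, so the cutoff should be treated as something to benchmark across sizes rather than a recommendation.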
|
Another thing to keep in mind is that (at least on Intel) using AVX2 instructions may downclock the CPU. This could regress application-wide performance if BitArray usage doesn't represent a sizeable chunk of an application's overall workload. We should be running real world benchmarks for these particular perf optimizations, not microbenchmarks. |
To make progress, it would be best to omit AVX2 completely in the first iteration. Focus on preparing a PR with just the SSE2 path first. Look into the AVX2 optimization only after the SSE2 PR is done and accepted. |
Old wrong benchmark - see below
Here are the results of the benchmarks as of the state from dotnet/corefx#39173. I've added the test case for Size = 4, and ran with all intrinsics disabled, with AVX2 disabled, and with everything enabled: benchmarks.zip
Before the change (SDK 5.0.100-alpha1-014885)
BenchmarkDotNet=v0.11.5.1159-nightly, OS=Windows 10.0.18999
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-alpha1-014885
[Host] : .NET Core 5.0.0-alpha1.19507.3 (CoreCLR 5.0.19.50101, CoreFX 5.0.19.50407), X64 RyuJIT
Job-NNUBIL : .NET Core 5.0.0-alpha1.19507.3 (CoreCLR 5.0.19.50101, CoreFX 5.0.19.50407), X64 RyuJIT
Job-VKYLDZ : .NET Core 5.0.0-alpha1.19507.3 (CoreCLR 5.0.19.50101, CoreFX 5.0.19.50407), X64 RyuJIT
Job-SWUKUE : .NET Core 5.0.0-alpha1.19507.3 (CoreCLR 5.0.19.50101, CoreFX 5.0.19.50407), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Runtime=.NET Core 5.0 IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
After the change (#39173)
BenchmarkDotNet=v0.11.5.1159-nightly, OS=Windows 10.0.18999
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-alpha1-014885
[Host] : .NET Core 5.0.0-alpha1.19507.3 (CoreCLR 5.0.19.50101, CoreFX 5.0.19.50407), X64 RyuJIT
Job-IJWQXJ : .NET Core ? (CoreCLR 5.0.19.51001, CoreFX 5.0.19.51501), X64 RyuJIT
Avx2 Enabled : .NET Core 5.0.0-alpha1.19507.3 (CoreCLR 5.0.19.50101, CoreFX 5.0.19.50407), X64 RyuJIT
Intrinsics Disabled : .NET Core 5.0.0-alpha1.19507.3 (CoreCLR 5.0.19.50101, CoreFX 5.0.19.50407), X64 RyuJIT
Sse2 Enabled : .NET Core 5.0.0-alpha1.19507.3 (CoreCLR 5.0.19.50101, CoreFX 5.0.19.50407), X64 RyuJIT
Runtime=.NET Core 5.0
(Why do the displayed columns/jobs differ between the runs? I have no idea why; did |
Actually, I'm not sure if the one with the custom corefx ran properly?
Job-IJWQXJ : .NET Core ? (CoreCLR 5.0.19.51001, CoreFX 5.0.19.51501), X64 RyuJIT
Avx2 Enabled : .NET Core 5.0.0-alpha1.19507.3 (CoreCLR 5.0.19.50101, CoreFX 5.0.19.50407), X64 RyuJIT
Intrinsics Disabled : .NET Core 5.0.0-alpha1.19507.3 (CoreCLR 5.0.19.50101, CoreFX 5.0.19.50407), X64 RyuJIT
Sse2 Enabled : .NET Core 5.0.0-alpha1.19507.3 (CoreCLR 5.0.19.50101, CoreFX 5.0.19.50407), X64 RyuJIT
Only the job at the top seems to have run with the custom corefx; the rest seem to have run with the SDK one. cc @adamsitnik can you spot anything wrong here? I've basically added this class deriving from
public class ConfigWithNoISA_Sse2_Avx2 : ManualConfig
{
    public ConfigWithNoISA_Sse2_Avx2()
    {
        // All hardware intrinsics disabled: software fallback only.
        Add(Job.Default.With(CoreRuntime.Core50)
            .WithEnvironmentVariable(IsaConfigurationKnobs.HWIntrinsic, "0")
            .WithId("Intrinsics Disabled"));
        // Only AVX2 disabled, so the SSE2 path is used.
        Add(Job.Default.With(CoreRuntime.Core50)
            .WithEnvironmentVariable(IsaConfigurationKnobs.X86.Avx2, "0")
            .WithId("Sse2 Enabled"));
        // No overrides: AVX2 is used where the hardware supports it.
        Add(Job.Default.With(CoreRuntime.Core50)
            .WithId("Avx2 Enabled"));
    }
}
namespace System.Collections.Tests
{
    [Config(typeof(ConfigWithNoISA_Sse2_Avx2))]
    [BenchmarkCategory(Categories.CoreFX, Categories.Collections)]
    public class Perf_BitArray
    {
        // ...
Commands used are: |
This is a BenchmarkDotNet limitation|design issue: if you are using a custom Assuming that you have created a copy of the
public class ConfigWithNoISA_Sse2_Avx2 : ManualConfig
{
    public ConfigWithNoISA_Sse2_Avx2()
    {
        var before = new CoreRunToolchain(
            new FileInfo(@"C:\Projects\coreclr\bin\tests\Windows_NT.x64.Release\before\Core_Root\CoreRun.exe"),
            targetFrameworkMoniker: "netcoreapp5.0", displayName: "before");
        var after = new CoreRunToolchain(
            new FileInfo(@"C:\Projects\coreclr\bin\tests\Windows_NT.x64.Release\after\Core_Root\CoreRun.exe"),
            targetFrameworkMoniker: "netcoreapp5.0", displayName: "after");
        Add(Job.Default
            .With(before)
            .WithId("before")
            .AsBaseline());
        Add(Job.Default
            .With(after)
            .WithEnvironmentVariable(IsaConfigurationKnobs.HWIntrinsic, "0")
            .WithId("Intrinsics Disabled"));
        Add(Job.Default
            .With(after)
            .WithEnvironmentVariable(IsaConfigurationKnobs.X86.Avx2, "0")
            .WithId("Sse2 Enabled"));
        Add(Job.Default
            .With(after)
            .WithId("Avx2 Enabled"));
    }
} |
Here's the result:
Old wrong benchmark - see below
Benchmarks
BenchmarkDotNet=v0.11.5.1159-nightly, OS=Windows 10.0.18999
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-alpha1-014885
[Host] : .NET Core 5.0.0-alpha1.19507.3 (CoreCLR 5.0.19.50101, CoreFX 5.0.19.50407), X64 RyuJIT
Job-DITMAU : .NET Core ? (CoreCLR 5.0.19.51202, CoreFX 5.0.19.51501), X64 RyuJIT
Job-XXAQCY : .NET Core ? (CoreCLR 5.0.19.51202, CoreFX 5.0.19.51501), X64 RyuJIT
Job-WMTZFA : .NET Core ? (CoreCLR 5.0.19.51202, CoreFX 5.0.19.51501), X64 RyuJIT
Job-JIJWRS : .NET Core ? (CoreCLR 5.0.19.51202, CoreFX 5.0.19.51501), X64 RyuJIT
Job-KOEHHM : .NET Core ? (CoreCLR 5.0.19.51202, CoreFX 5.0.19.51501), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1
|
These results suggest there is no advantage in using AVX for this. Am I reading the results correctly? |
Old wrong benchmark - see below
I've experimented with various sizes, on both the AVX2 and SSE2 paths, and here are some results:
Benchmarks
BenchmarkDotNet=v0.11.5.1159-nightly, OS=Windows 10.0.18999
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-alpha1-014885
[Host] : .NET Core 5.0.0-alpha1.19507.3 (CoreCLR 5.0.19.50101, CoreFX 5.0.19.50407), X64 RyuJIT
Job-ONEQZA : .NET Core ? (CoreCLR 5.0.19.51202, CoreFX 5.0.19.51501), X64 RyuJIT
Job-NZWJST : .NET Core ? (CoreCLR 5.0.19.51202, CoreFX 5.0.19.51501), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Toolchain=After IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
I haven't done any statistical analysis on the data, but to me the results seem to suggest that AVX2 is actually slower (by a few nanoseconds) when the size of the array is about
Based on dotnet/corefx#33367 (comment), as far as .NET repositories are concerned, it looks like ML could possibly benefit from this, but I don't really see much benefit in pursuing the AVX2 path, given that the SSE2 path is already fast enough. |
Side note: could we possibly vectorise |
Regarding the second entry in the TODO list:
I personally feel that using Span.Fill() is better than using simple loops, since BitArray would also benefit from any changes made to Span.Fill(). If it turns out to be slower, maybe we can look into optimising Span.Fill() instead? |
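To make the comparison concrete, the two options might be sketched as below. This assumes the operation in question fills every backing word with all ones or all zeros, in the spirit of BitArray.SetAll; the `SetAllSketch` class and both method names are made up for illustration and are not the actual BitArray code.

```csharp
using System;

public static class SetAllSketch
{
    // Option 1: a simple loop over the backing words.
    public static void SetAllWithLoop(int[] words, bool value)
    {
        int fill = value ? -1 : 0; // -1 has all 32 bits set
        for (int i = 0; i < words.Length; i++)
            words[i] = fill;
    }

    // Option 2: delegate to Span<T>.Fill, so any future improvement
    // to Fill is picked up without touching this code.
    public static void SetAllWithFill(int[] words, bool value)
    {
        words.AsSpan().Fill(value ? -1 : 0);
    }
}
```

Both methods leave the array in the same state, so the choice comes down to measured cost and to which code path is more likely to be optimised in the future.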
@Gnbrkm41 it's not easy to analyze the results you've posted; could you please make a table where the columns are the different settings (e.g. AVX, SSE, software fallback)?
I suspect for performance critical hot-paths they manually do SIMD work (e.g. alignment stuff)
It's a bit complicated so should be an interesting task to practice SIMD ;-)
it's a known issue (Span.ctor codegen) but I suspect the fix for it is not trivial so it's still there |
@Gnbrkm41 Does the machine you are measuring on have AVX2 support? It looks like you measure essentially no difference between SSE2 and AVX2, yet I measured a huge improvement with AVX2 and the vectorized implementations. |
I'm using Intel i7-8700, which according to Intel supports AVX2. Running |
This actually is odd; it appears like the environment variables are not being applied correctly... I get the same result even if I have the environment variable set to turn intrinsics off. EDIT: nope, it actually is working properly. |
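One quick sanity check for this kind of doubt is to ask the runtime directly which instruction sets it reports as usable in the current process. The sketch below is a hypothetical helper (`IsaReport` and `Describe` are invented names); it relies only on the `IsSupported` properties of the intrinsics classes, which reflect both the hardware and any runtime configuration that disables an instruction set.

```csharp
using System;
using System.Runtime.Intrinsics.X86;

public static class IsaReport
{
    // Summarises what the runtime reports for this process.
    public static string Describe() =>
        $"Sse2={Sse2.IsSupported}, Avx2={Avx2.IsSupported}";

    public static void Main() => Console.WriteLine(Describe());
}
```

If the AVX2 knob is set but this still reports Avx2 as supported, the setting is most likely not reaching the process at all.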
I re-ran the dotnet/performance benchmarks on my machine, and saw the following, which indicate a huge improvement with AVX2 intrinsics.
Baseline:
With dotnet/corefx#39173:
|
The numbers definitely seem very promising. I just need to figure out what caused the numbers to come out that badly then. |
Turns out that I typed the environment variables without the |
To-dos:
Looks like there could be a slight slowdown for sizes smaller than 256, but otherwise it appears that AVX2 is faster overall.
Running summary: No Slower results for the provided threshold = 1% and noise filter = 0.3ns.
It appears that the first element of Long since it has been merged. |
In the past I have been similarly caught out by using the wrong casing (eg |
Some benchmarks for the bool array constructor.
Before change:
BenchmarkDotNet=v0.11.5.1159-nightly, OS=Windows 10.0.18999
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-alpha1-014888
[Host] : .NET Core 5.0.0-alpha1.19507.3 (CoreCLR 5.0.19.50101, CoreFX 5.0.19.50407), X64 RyuJIT
Job-OENBYZ : .NET Core 5.0.0-alpha1.19507.3 (CoreCLR 5.0.19.50101, CoreFX 5.0.19.50407), X64 RyuJIT
Job-KZUUDN : .NET Core 5.0.0-alpha1.19507.3 (CoreCLR 5.0.19.50101, CoreFX 5.0.19.50407), X64 RyuJIT
Job-IEXRMQ : .NET Core 5.0.0-alpha1.19507.3 (CoreCLR 5.0.19.50101, CoreFX 5.0.19.50407), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1
After change:
BenchmarkDotNet=v0.11.5.1159-nightly, OS=Windows 10.0.18999
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-alpha1-014888
[Host] : .NET Core 5.0.0-alpha1.19507.3 (CoreCLR 5.0.19.50101, CoreFX 5.0.19.50407), X64 RyuJIT
Job-HQREBH : .NET Core ? (CoreCLR 5.0.19.51405, CoreFX 5.0.19.51701), X64 RyuJIT
AVX2 Disabled : .NET Core ? (CoreCLR 5.0.19.51405, CoreFX 5.0.19.51701), X64 RyuJIT
Intrinsics Disabled : .NET Core ? (CoreCLR 5.0.19.51405, CoreFX 5.0.19.51701), X64 RyuJIT
|
Opened dotnet/corefx#41896; PTAL if you have some time :^) |
@BruceForstall has done a great job optimizing BitArray in dotnet/corefx#39173. The PR was closed because Bruce currently has no time to finish it.
A contributor who would like to work on this issue should:
This issue should be a great exercise for somebody who wants to learn more about vectorizing code using the new .NET Core 3.0 CPU Intrinsics API.