Speed up BitArray #39173

BruceForstall · 2019-07-03T18:33:13Z

Use AVX2, if available, for And/Or/Xor
Vectorize Not
Use Span.Fill() for SetAll()
Add more test sizes to account for And/Or/Xor/Not loop unrolling cases

BruceForstall · 2019-07-03T18:35:18Z

BruceForstall · 2019-07-03T18:35:59Z

Perf results:

summary:
better: 22, geomean: 1.722
worse: 2, geomean: 1.128
total diff: 24

| Slower                                                          | diff/base | Base Median (ns) | Diff Median (ns) | Modality|
| --------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Collections.Tests.Perf_BitArray.BitArrayOr(Size: 32)     |      1.18 |             4.27 |             5.03 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayGet(Size: 20480) |      1.08 |        155654.16 |        168092.73 |         |

| Faster                                                                      | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| --------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| -------- |
| System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 512)               |      5.96 |            64.24 |            10.79 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 20480)             |      5.91 |          2095.84 |           354.42 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 256)               |      5.74 |            35.80 |             6.24 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 32)                |      2.35 |             5.07 |             2.16 |         |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 512)            |      1.76 |            64.45 |            36.52 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayXor(Size: 512)               |      1.72 |            22.85 |            13.28 |         |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 20480)          |      1.67 |          2331.80 |          1396.23 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayOr(Size: 512)                |      1.57 |            20.96 |            13.34 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayAnd(Size: 512)               |      1.56 |            21.63 |            13.83 | bimodal |
| System.Collections.Tests.Perf_BitArray.BitArrayXor(Size: 20480)             |      1.52 |           992.41 |           651.68 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayAnd(Size: 20480)             |      1.52 |           988.19 |           649.90 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayOr(Size: 20480)              |      1.52 |           990.61 |           651.54 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayCopyToIntArray(Size: 32)     |      1.45 |            22.64 |            15.59 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayAnd(Size: 256)               |      1.37 |            12.74 |             9.31 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayOr(Size: 256)                |      1.33 |            12.48 |             9.36 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayXor(Size: 256)               |      1.33 |            12.35 |             9.31 |         |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 256)            |      1.21 |            22.79 |            18.87 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayAnd(Size: 32)                |      1.20 |             4.35 |             3.62 |         |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 32)             |      1.17 |             4.87 |             4.14 |         |
| System.Collections.Tests.Perf_BitArray.BitArraySetLengthShrink(Size: 256)   |      1.09 |            98.23 |            90.07 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayLengthValueCtor(Size: 20480) |      1.08 |           378.30 |           349.60 |         |
| System.Collections.Tests.Perf_BitArray.BitArrayCopyToByteArray(Size: 32)    |      1.06 |            33.76 |            31.84 |         |

I presume the "Slower" cases are due to insufficiently quiescent machine.

GrabYourPitchforks · 2019-07-03T19:08:26Z

Are these modifications "safe"? If somebody (like a separate thread) is modifying either BitArray instance concurrently with this pointer-based code running, is there a potential that memory corruption could occur due to references being swapped out mid-stream without anybody noticing?

Edit: I see that this change isn't introducing unsafe code; it's modifying existing unsafe code. I'll follow up offline regarding the existing code. Meanwhile don't let my comment here block checkin.

BruceForstall · 2019-07-03T20:06:35Z

@GrabYourPitchforks Well, my changes I think are no less safe than what was already there.

And the documentation (https://docs.microsoft.com/en-us/dotnet/api/system.collections.bitarray) says:

Thread Safety
Public static (Shared in Visual Basic) members of this type are thread safe. Any instance members are not guaranteed to be thread safe.

This implementation does not provide a synchronized (thread safe) wrapper for a BitArray.

BruceForstall · 2019-07-03T20:10:21Z

@stephentoub

btw, are changes like this acceptable at this point in 3.0, or should it wait?

stephentoub · 2019-07-03T21:21:55Z

btw, are changes like this acceptable at this point in 3.0, or should it wait?

Thanks, Bruce. I suggest we wait on this until after we fork master. In the past we've sometimes had some bug tail from vectorization work, and at this point we don't have a lot of time to react to that well.

stephentoub · 2019-07-03T21:24:07Z

src/System.Collections/src/System/Collections/BitArray.cs

-            }
-
+            int arrayLength = GetInt32ArrayLengthFromBitLength(Length);
+            m_array.AsSpan(0, arrayLength).Fill(fillValue);


@ahsonkhan had previously looked into doing this and found there were regressions for smaller inputs. See #33367 (comment) for a discussion with him and @GrabYourPitchforks.

I expected that, but the perf results for a single "int" (32-bit) "SetAll" still showed an improvement, which is surprising:

| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 32) | 1.17 | 4.87 | 4.14 | |

I should probably run them again, and try harder to get a quiescent machine.

stephentoub · 2019-07-03T21:27:15Z

src/System.Collections/src/System/Collections/BitArray.cs

+                {
+                    for (; i < count - (Vector256<int>.Count - 1); i += Vector256<int>.Count)
+                    {
+                        Vector256<int> leftVec = Avx.LoadVector256(leftPtr + i);


Are there any potential alignment issues here?

@tannergooding Are there any Vector256 alignment issues to be aware of? (also, if you could look at this change, that would be great)

Not from a correctness point of view as these loads/stores are unaligned anyway. It may be a perf issue as unaligned access that straddles cache lines is kind of slow. Typical vectorized implementations attempt to first align the pointer but it looks like the existing Vector128 code didn't bother with that so...

While I agree AVX + aligned loads + partially unrolled loops to do 4 XORs per iteration is super fast for large inputs I find SSE + unaligned loads is more friendly for small inputs and the code looks (and actually) much smaller. Just take a look at this "Calculate a sum of floats" impl with all of those in mind: https://github.com/dotnet/machinelearning/blob/287bc3e9b84f640e9b75e7b288c2663a536a9863/src/Microsoft.ML.CpuMath/AvxIntrinsics.cs#L988-L1095

Btw, the extended jump table probably regresses bitarrays with Length > 3 && Length < 8

Yeah, as mikedn said, these are all "unaligned" loads anyways and the only real concern is for reads/writes that cross a cache-line or page boundary, which can be significantly slower. The current optimization manual indicates ~4.5 cycles vs 1 cycle for modern processors for cache-line load splits and ~150 cycles for a cache-line store split.

So, you might see some perf benefit (especially for large arrays) by aligning the pointer. You can generally still perform the first (handling unaligned elements) and last (handling trailing elements) via vectorization by masking out elements you don't care about.

danmoseley · 2019-07-03T22:14:36Z

Marking no-merge to wait for the last merge from master into release.

danmoseley · 2019-07-03T22:16:27Z

I assume you'll run the tests with the various COMPlus_xx environment variables so it goes down each of the various ISA paths? I don't think we (?@tannergooding) have set up infrastrcuture yet to make that easy in CoreFX tests.

mikedn · 2019-07-04T05:57:51Z

Are these modifications "safe"? If somebody (like a separate thread) is modifying either BitArray instance concurrently with this pointer-based code running, is there a potential that memory corruption could occur due to references being swapped out mid-stream without anybody noticing?

The current BitArray code is already broken as it makes no attempt to prevent races leading to memory corruption. That should be fixed, it is OK for BitArray to not be thread safe but in general the lack of thread safety in .NET shouldn't lead to memory corruption. It is expected that changing the length of the BitArray while an operation is in progress may lead to an IndexOutOfRangeException or something similar in the worst case.

tannergooding · 2019-07-04T15:19:56Z

src/System.Collections/src/System/Collections/BitArray.cs

                case 3: m_array[2] &= value.m_array[2]; goto case 2;
                case 2: m_array[1] &= value.m_array[1]; goto case 1;
                case 1: m_array[0] &= value.m_array[0]; goto Done;
                case 0: goto Done;
            }

            int i = 0;
-            if (Sse2.IsSupported)
+            if (Avx2.IsSupported)


Do we know if BitArray is commonly large enough that we won't regress perf for the 16-32 byte scenario since Avx2 will force that back to the scalar case?

It's likely worth determining the cutoff point at which AVX2 starts significantly winning out over SSE2 and account for that (I have seen it generally be at 48-128 bytes, rather than at exactly 32-bytes).

GrabYourPitchforks · 2019-07-04T23:17:11Z

It looks like the unsafe code was originally added back in #33781. I followed up with some folks offline and the consensus is that thread safety violations shouldn't lead to type safety violations or memory corruption. This gels with @mikedn's observations above and fits the pattern already used by some of our other types like Memory<T>.

I opened https://github.com/dotnet/corefx/issues/39204 to track the issue. I have no preference on whether the issue is addressed as part of this PR or as a separate follow-up PR.

jkotas · 2019-07-04T23:47:20Z

no preference on whether the issue is addressed as part of this PR or as a separate follow-up PR.

#39204 should be addressed in separate PR for 3.0. This vectorization work is not for 3.0.

Gnbrkm41 · 2019-07-05T15:42:02Z

Please close #37946 if this get merged :^)

BruceForstall · 2019-07-08T17:57:38Z

To summarize the work left to do:

Test and measure performance with AVX2 disabled, and AVX2 and SSE2 both disabled. For performance, this is to see when AVX2 starts being better.
Meaure the performance of SetAll for various small sizes to see if the previous simple loop implementation is faster, and at what point vectorized implementations in Span.Fill() are better.
Investigate aligning Vector256 (and Vector128) reads/writes: split operations into unaligned head, aligned body, unaligned tail. Measure by forcing unaligned and forcing aligned incoming pointers to see the difference. Measure for small and large BitArray sizes.
Merge with Ensure BitArray unsafe accesses are within bounds #39270

adamsitnik · 2019-07-31T07:17:02Z

To summarize the work left to do

@BruceForstall any progress on this? Could you please add a test case of length=4 to the existing performance repo benchmark

https://github.com/dotnet/performance/blob/fb4d42a60f1e79e41212a46a75583bd41653f71d/src/benchmarks/micro/corefx/System.Collections/Perf.BitArray.cs#L17

+ [Params(4, 512)]
- [Params(512)]

BruceForstall · 2019-07-31T18:16:15Z

@adamsitnik I've been waiting until master is post-3.0 (i.e., now), and to find some time. Thanks for the suggestion; will do that.

maryamariyan · 2019-08-07T21:32:44Z

@BruceForstall master branch is now open for 5.0 submissions. There is some conflicts on this PR, would you be updating this or prefer to close and reopen it at a later time?

Asking because, the aim is to have less PRs open that are inactive for multiple weeks.

BruceForstall · 2019-08-07T23:17:29Z

@maryamariyan I plan to update it. Hopefully relatively soon.

1. Use AVX2, if available, for And/Or/Xor 2. Vectorize Not 3. Use Span<T>.Fill() for SetAll() 4. Add more test sizes to account for And/Or/Xor/Not loop unrolling cases

ViktorHofer · 2019-10-08T13:19:36Z

ping @BruceForstall

BruceForstall · 2019-10-08T18:57:09Z

@ViktorHofer I haven't forgotten, but new responsibilities + vacation has gotten in the way. If someone else wants to pick this up and finish it, I'm willing to help. Or, I'll get to it eventually...

maryamariyan · 2019-10-08T19:04:14Z

If nobody will be working on this PR anytime soon, it is best to link this PR to an issue and close this PR.
@BruceForstall do you agree? (and mark as up-for-grabs or something)

BruceForstall · 2019-10-08T19:23:20Z

@maryamariyan There is already an issue: https://github.com/dotnet/corefx/issues/37946. I'd prefer to keep this open a little longer, but if you have a "not recently touched PR" requirement to close it, then I guess that's ok.

ViktorHofer · 2019-10-08T19:38:58Z

we don't have such a requirement but prefer to have a not too long list of open PRs. That said, I'm fine with keeping this one open.

maryamariyan · 2019-10-08T20:26:46Z

Thanks Bruce. That makes sense. We're trying to reduce the backlog of old PR'S so we can focus on being responsive on the active ones. I'm going to close this for now. Please do reopen when you're ready to pick this up again

adamsitnik · 2019-10-14T12:23:00Z

I would really love to see these optmizations merged. I've created a new issue to keep track of it and added the Hacktoberfest label, hoping that some contributor is going to pick it up.

https://github.com/dotnet/corefx/issues/41762

stephentoub reviewed Jul 3, 2019

View reviewed changes

stephentoub requested a review from tannergooding July 3, 2019 21:29

danmoseley added the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Jul 3, 2019

tannergooding reviewed Jul 4, 2019

View reviewed changes

stephentoub removed * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) labels Jul 17, 2019

adamsitnik added the tenet-performance Performance related issue label Jul 31, 2019

karelz added this to the 5.0 milestone Aug 3, 2019

danmoseley added the area-System.Collections label Aug 6, 2019

eiriktsarpalis assigned safern Aug 8, 2019

BruceForstall force-pushed the SpeedUpBitArray branch from f78c3fe to e3726f3 Compare August 22, 2019 18:05

Speed up BitArray

579ec7f

1. Use AVX2, if available, for And/Or/Xor 2. Vectorize Not 3. Use Span<T>.Fill() for SetAll() 4. Add more test sizes to account for And/Or/Xor/Not loop unrolling cases

BruceForstall force-pushed the SpeedUpBitArray branch from e3726f3 to 579ec7f Compare September 23, 2019 06:52

maryamariyan closed this Oct 8, 2019

Gnbrkm41 mentioned this pull request Oct 18, 2019

Vectorise BitArray #41896

Merged

Gnbrkm41 mentioned this pull request Nov 10, 2019

Add benchmark data for testing non-vectorised paths dotnet/performance#1026

Merged

adamsitnik mentioned this pull request Feb 1, 2020

Finish BitArray performance optimizations using AVX from #39173 dotnet/runtime#31161

Closed

GrabYourPitchforks mentioned this pull request Aug 3, 2021

Optimize CollectionsMarshal.AsSpan with inlining and bounds check removal dotnet/runtime#56773

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up BitArray #39173

Speed up BitArray #39173

BruceForstall commented Jul 3, 2019

BruceForstall commented Jul 3, 2019

BruceForstall commented Jul 3, 2019

GrabYourPitchforks commented Jul 3, 2019 •

edited

Loading

BruceForstall commented Jul 3, 2019

BruceForstall commented Jul 3, 2019

stephentoub commented Jul 3, 2019

stephentoub Jul 3, 2019

BruceForstall Jul 3, 2019

BruceForstall Jul 3, 2019

stephentoub Jul 3, 2019

BruceForstall Jul 3, 2019

mikedn Jul 4, 2019

EgorBo Jul 4, 2019 •

edited

Loading

tannergooding Jul 4, 2019

danmoseley commented Jul 3, 2019 •

edited

Loading

danmoseley commented Jul 3, 2019

mikedn commented Jul 4, 2019

tannergooding Jul 4, 2019

GrabYourPitchforks commented Jul 4, 2019

jkotas commented Jul 4, 2019

Gnbrkm41 commented Jul 5, 2019

BruceForstall commented Jul 8, 2019

adamsitnik commented Jul 31, 2019

BruceForstall commented Jul 31, 2019

maryamariyan commented Aug 7, 2019 •

edited

Loading

BruceForstall commented Aug 7, 2019

ViktorHofer commented Oct 8, 2019

BruceForstall commented Oct 8, 2019

maryamariyan commented Oct 8, 2019 •

edited

Loading

BruceForstall commented Oct 8, 2019

ViktorHofer commented Oct 8, 2019

maryamariyan commented Oct 8, 2019

adamsitnik commented Oct 14, 2019

Speed up BitArray #39173

Speed up BitArray #39173

Conversation

BruceForstall commented Jul 3, 2019

BruceForstall commented Jul 3, 2019

BruceForstall commented Jul 3, 2019

GrabYourPitchforks commented Jul 3, 2019 • edited Loading

BruceForstall commented Jul 3, 2019

BruceForstall commented Jul 3, 2019

stephentoub commented Jul 3, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

EgorBo Jul 4, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

danmoseley commented Jul 3, 2019 • edited Loading

danmoseley commented Jul 3, 2019

mikedn commented Jul 4, 2019

Choose a reason for hiding this comment

GrabYourPitchforks commented Jul 4, 2019

jkotas commented Jul 4, 2019

Gnbrkm41 commented Jul 5, 2019

BruceForstall commented Jul 8, 2019

adamsitnik commented Jul 31, 2019

BruceForstall commented Jul 31, 2019

maryamariyan commented Aug 7, 2019 • edited Loading

BruceForstall commented Aug 7, 2019

ViktorHofer commented Oct 8, 2019

BruceForstall commented Oct 8, 2019

maryamariyan commented Oct 8, 2019 • edited Loading

BruceForstall commented Oct 8, 2019

ViktorHofer commented Oct 8, 2019

maryamariyan commented Oct 8, 2019

adamsitnik commented Oct 14, 2019

GrabYourPitchforks commented Jul 3, 2019 •

edited

Loading

EgorBo Jul 4, 2019 •

edited

Loading

danmoseley commented Jul 3, 2019 •

edited

Loading

maryamariyan commented Aug 7, 2019 •

edited

Loading

maryamariyan commented Oct 8, 2019 •

edited

Loading