Conversation
BruceForstall commented Jul 3, 2019
- Use AVX2, if available, for And/Or/Xor
- Vectorize Not
- Use Span.Fill() for SetAll()
- Add more test sizes to account for And/Or/Xor/Not loop unrolling cases
Related: dotnet/performance#610
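The And/Or/Xor shape described above can be sketched in plain C (this is an illustrative stand-in, not the PR's C# code): a body that consumes 8 ints per iteration, mirroring `Vector256<int>.Count`, followed by a scalar tail for the leftovers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical plain-C sketch of the vectorized And: the inner 8-wide
 * block stands in for one AVX2 iteration; the trailing loop handles the
 * remainder that doesn't fill a full vector. */
static void and_words(int *left, const int *right, size_t count)
{
    size_t i = 0;
    for (; i + 8 <= count; i += 8)          /* stand-in for the AVX2 body */
        for (size_t j = 0; j < 8; j++)
            left[i + j] &= right[i + j];
    for (; i < count; i++)                  /* scalar tail */
        left[i] &= right[i];
}
```

A count of 19 exercises both the 8-wide body (twice) and a 3-element tail, which is exactly the kind of boundary the added test sizes are meant to cover.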
Perf results:

I presume the "Slower" cases are due to an insufficiently quiescent machine.

Are these modifications "safe"? If somebody (like a separate thread) is modifying either
Edit: I see that this change isn't introducing unsafe code; it's modifying existing unsafe code. I'll follow up offline regarding the existing code. Meanwhile, don't let my comment here block checkin.

@GrabYourPitchforks I think my changes are no less safe than what was already there. And the documentation (https://docs.microsoft.com/en-us/dotnet/api/system.collections.bitarray) says:

btw, are changes like this acceptable at this point in 3.0, or should it wait?

Thanks, Bruce. I suggest we wait on this until after we fork master. In the past we've sometimes had some bug tail from vectorization work, and at this point we don't have a lot of time to react to that well.
```csharp
int arrayLength = GetInt32ArrayLengthFromBitLength(Length);
m_array.AsSpan(0, arrayLength).Fill(fillValue);
```
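The word-count math behind `GetInt32ArrayLengthFromBitLength` is a ceiling division; a hypothetical C rendering (the helper name and exact C# implementation are from the repo, but this arithmetic sketch is mine):

```c
#include <assert.h>

/* How many 32-bit words are needed to hold n bits: ceil(n / 32).
 * Assumes n >= 0 and n + 31 does not overflow int. */
static int words_for_bits(int n)
{
    return (n + 31) >> 5;   /* equivalent to (n + 31) / 32 */
}
```

Note that 0 bits yields 0 words, so a `Fill` over that span is a no-op rather than an error.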
@ahsonkhan had previously looked into doing this and found there were regressions for smaller inputs. See #33367 (comment) for a discussion with him and @GrabYourPitchforks.
I expected that, but the perf results for a single "int" (32-bit) "SetAll" still showed an improvement, which is surprising:
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 32) | 1.17 | 4.87 | 4.14 |
I should probably run them again, and try harder to get a quiescent machine.
```csharp
{
    for (; i < count - (Vector256<int>.Count - 1); i += Vector256<int>.Count)
    {
        Vector256<int> leftVec = Avx.LoadVector256(leftPtr + i);
```
Are there any potential alignment issues here?
@tannergooding Are there any Vector256 alignment issues to be aware of? (also, if you could look at this change, that would be great)
Not from a correctness point of view, as these loads/stores are unaligned anyway. It may be a perf issue, as unaligned access that straddles cache lines is kind of slow. Typical vectorized implementations attempt to first align the pointer, but it looks like the existing Vector128 code didn't bother with that, so...
While I agree that AVX + aligned loads + partially unrolled loops doing 4 XORs per iteration is super fast for large inputs, I find SSE + unaligned loads more friendly for small inputs, and the code looks (and actually is) much smaller. Just take a look at this "Calculate a sum of floats" impl with all of those in mind: https://github.com/dotnet/machinelearning/blob/287bc3e9b84f640e9b75e7b288c2663a536a9863/src/Microsoft.ML.CpuMath/AvxIntrinsics.cs#L988-L1095
Btw, the extended jump table probably regresses bitarrays with Length > 3 && Length < 8
Yeah, as mikedn said, these are all "unaligned" loads anyways and the only real concern is for reads/writes that cross a cache-line or page boundary, which can be significantly slower. The current optimization manual indicates ~4.5 cycles vs 1 cycle for modern processors for cache-line load splits and ~150 cycles for a cache-line store split.
So, you might see some perf benefit (especially for large arrays) by aligning the pointer. You can generally still perform the first (handling unaligned elements) and last (handling trailing elements) via vectorization by masking out elements you don't care about.
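The alignment-prologue idea above can be sketched with plain pointer arithmetic (a hypothetical helper, not code from the PR): count how many leading `int` elements to process with scalar code so that the remaining pointer is 32-byte aligned for `Vector256`-style loads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdalign.h>

/* Number of leading int elements to handle scalar before the pointer
 * reaches a 32-byte boundary. Assumes p is at least 4-byte aligned, so
 * the distance to the boundary is a whole number of ints. */
static size_t scalar_prologue_len(const int *p)
{
    uintptr_t mis = (uintptr_t)p & 31;   /* bytes past the last 32-byte boundary */
    return mis == 0 ? 0 : (32 - mis) / sizeof(int);
}
```

After the prologue, every vector load starts on a 32-byte boundary and cannot straddle a cache line; a masked first/last vector, as suggested above, is the other common way to handle the same edges.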
Marking no-merge to wait for the last merge from master into release.
I assume you'll run the tests with the various COMPlus_xx environment variables so it goes down each of the various ISA paths? I don't think we (@tannergooding?) have set up infrastructure yet to make that easy in CoreFX tests.
```csharp
case 3: m_array[2] &= value.m_array[2]; goto case 2;
case 2: m_array[1] &= value.m_array[1]; goto case 1;
case 1: m_array[0] &= value.m_array[0]; goto Done;
case 0: goto Done;
}
```

```diff
 int i = 0;
-if (Sse2.IsSupported)
+if (Avx2.IsSupported)
```
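The remainder "jump table" above (C# `goto case` fall-through) maps directly onto a plain C switch with fall-through cases; a hypothetical rendering for the And case:

```c
#include <assert.h>

/* After the vector loop, at most 3 leftover words remain; each case
 * intentionally falls into the next, so rem == 3 performs three ANDs. */
static void and_tail(int *left, const int *right, int rem)
{
    switch (rem)
    {
        case 3: left[2] &= right[2]; /* fall through */
        case 2: left[1] &= right[1]; /* fall through */
        case 1: left[0] &= right[0]; /* fall through */
        case 0: break;
    }
}
```

Widening the vector from 128 to 256 bits grows the maximum remainder from 3 words to 7, which is why the diff extends this table — and why small arrays (4-7 words) may spend more time here, as noted above.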
Do we know if BitArray is commonly large enough that we won't regress perf for the 16-32 byte scenario, since Avx2 will force that back to the scalar case?
It's likely worth determining the cutoff point at which AVX2 starts significantly winning out over SSE2 and account for that (I have seen it generally be at 48-128 bytes, rather than at exactly 32-bytes).
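The cutoff-based dispatch being suggested might look like this hypothetical sketch (the 64-byte AVX2 cutoff is an assumed placeholder, not a measured value — the comment above suggests the real break-even tends to land somewhere in the 48-128 byte range):

```c
#include <assert.h>
#include <stddef.h>

enum simd_path { PATH_SCALAR, PATH_VEC128, PATH_VEC256 };

/* Pick a code path from the input size and the available ISA extensions.
 * Below the AVX2 cutoff, the 128-bit path still handles 16+ bytes instead
 * of falling all the way back to scalar. */
static enum simd_path choose_path(size_t byte_count, int has_sse2, int has_avx2)
{
    if (has_avx2 && byte_count >= 64)   /* assumed cutoff; tune via benchmarks */
        return PATH_VEC256;
    if (has_sse2 && byte_count >= 16)   /* one full 128-bit vector of data */
        return PATH_VEC128;
    return PATH_SCALAR;
}
```

The key property is that enabling AVX2 never makes a 16-32 byte input scalar: it only upgrades inputs past the cutoff.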
It looks like the unsafe code was originally added back in #33781. I followed up with some folks offline, and the consensus is that thread-safety violations shouldn't lead to type-safety violations or memory corruption. This gels with @mikedn's observations above and fits the pattern already used by some of our other types.

I opened https://github.com/dotnet/corefx/issues/39204 to track the issue. I have no preference on whether the issue is addressed as part of this PR or as a separate follow-up PR.
#39204 should be addressed in a separate PR for 3.0. This vectorization work is not for 3.0.
Please close #37946 if this gets merged :^)
To summarize the work left to do:
@BruceForstall any progress on this? Could you please add a test case of length=4 to the existing performance repo benchmark?

```diff
-[Params(512)]
+[Params(4, 512)]
```
@adamsitnik I've been waiting until master is post-3.0 (i.e., now), and to find some time. Thanks for the suggestion; will do that.

@BruceForstall master branch is now open for 5.0 submissions. There are some conflicts on this PR; would you be updating this, or would you prefer to close and reopen it at a later time? Asking because the aim is to have fewer PRs open that are inactive for multiple weeks.

@maryamariyan I plan to update it. Hopefully relatively soon.
Force-pushed from f78c3fe to e3726f3:

1. Use AVX2, if available, for And/Or/Xor
2. Vectorize Not
3. Use Span&lt;T&gt;.Fill() for SetAll()
4. Add more test sizes to account for And/Or/Xor/Not loop unrolling cases

Force-pushed from e3726f3 to 579ec7f.
ping @BruceForstall
@ViktorHofer I haven't forgotten, but new responsibilities + vacation have gotten in the way. If someone else wants to pick this up and finish it, I'm willing to help. Or, I'll get to it eventually...

If nobody will be working on this PR anytime soon, it is best to link this PR to an issue and close this PR.

@maryamariyan There is already an issue: https://github.com/dotnet/corefx/issues/37946. I'd prefer to keep this open a little longer, but if you have a "not recently touched PR" requirement to close it, then I guess that's ok.

We don't have such a requirement, but we prefer to keep the list of open PRs from getting too long. That said, I'm fine with keeping this one open.
Thanks Bruce. That makes sense. We're trying to reduce the backlog of old PRs so we can focus on being responsive on the active ones. I'm going to close this for now. Please do reopen when you're ready to pick this up again.
I would really love to see these optimizations merged. I've created a new issue to keep track of it and added the Hacktoberfest label, hoping that some contributor is going to pick it up.