-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modernisation/Performance boost POC #39
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's quite some code duplication.
Art the officially unsupported targets (pre .NET 6) worth to vectorized them?
Can the both Ulid.cs-implementations be unified? Maybe with the help of common helper-methods if not doable otherwise.
@@ -155,21 +183,58 @@ internal Ulid(ReadOnlySpan<char> base32) | |||
randomness8 = (byte)((CharToBase32[base32[22]] << 7) | (CharToBase32[base32[23]] << 2) | (CharToBase32[base32[24]] >> 3)); | |||
} | |||
|
|||
#if NETCOREAPP3_0_OR_GREATER | |||
private static readonly Vector128<byte> guidVecShuffle = Vector128.Create((byte)3, 2, 1, 0, 5, 4, 7, 6, 8, 9, 10, 11, 12, 13, 14, 15); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For newer targets (.NET 6.0 onwards) this constant should be given "inline" for allowing the JIT to emit better code. As at the moment only .NET 6+ is officially supported I wouldn't bother with .NET Core 3.1 optimizations anymore. So the below #if
def can be dropped. For the Shuffle
you could create a private method that handles this depending on the target.
Good point, it probably isn't worth supporting them, I'll remove them unless the owner want to keep them. Should remove some bloat. I'll see if I can create one or two helper functions. Could I ask you a question? From what I gathered from testing .Net 3, 3.1, 5 should have constants declared outside the method and .Net 7 should have constants delcared inline. Finally .Net 6 uniquely runs the same with constants declared outside or inline. Is this correct? |
.NET 6 onwards should use the vector constants inline. The JIT has then the ability to optimize the usage. It streamlines the C# code also. |
Hi, seems a lot of great progress has been made. Any chance this might get merged and a new nuget version produced ? |
Sorry for the late reply and thanks for the great PR. |
There seems to be some discussion left unanswered.
Currently, we want the same code to work in Unity 2019.3 so we may not change it until we change the supported version.
This is probably a reasonable change.
Hmm, it looks like you are right. I would like to check with the original author.
Although we do not want to add too much complexity to the value generation for the sake of speed, Thanks for the suggestions. further pr is welcome. |
Partially resolving #38 , #37 #35 . Modernised and implemented interfaces to be more like
Guid
, optimized code using intrinsics and bit hacks.IComparable
.ISpanParsable
andISpanParsable<Ulid>
.Time
by 30%.public Ulid(Guid)
for a 80% and 20% speed up.GetHashCode
to use modern unsafe code.ToGuid
and added bit hack optimization for a 80-90% and 20% speed up.I've updated Ulid, PerfBenchmarks and Ulid,Tests to support .Net 7.0, this is to add
ISpanParsable
and modernVector
methods. All changes should be back compatible with older versions of dotnet - modern code is added with the#if
pre processor directives.A lot of intrinsics code appears to be repeated for .Net 3+ and for .Net 7, this is deliberate as the cross platform
Vector128.Shuffle
was only added in .Net 7. This way Ssse compatible devices will get the benefit of vectorisation in most modern versions of .Net.Because of JIT related issues (I think) if .Net 3 vector code doesn't constant vectors outside of method bodies for performance reasons. Likewise .Net 7 vector code should only use constant vectors placed inside its method body, again for performance reasons. I'm not sure why this happens (might be similar to this or this), all I know is that when I use a shuffle indices vector the wrong way the performance takes a hit. I'd be curious to see if other people experience the same performance improvements I've had.
Note that my benchmarks may not be accurate, the times are in the nano seconds and vary quite a bit. For some benchmarks I've added a loop to get more reliable numbers, but due to the magic of software these numbers are probably not representative. I've labeled these results accordingly. Some of that changes may not be significant and It's possible that I've added performance regressions for other versions of .Net or devices.
Finally please note that I'm still unexperienced with unsafe code and vectorisation. It's likely that I've made a mistake or overlooked key points. ie memory alignment, pointer safety, architecture compatibility, Unsafe.As vs Unsafe.As(Unsafe.AsRef), accidental mutations.
Implemented
IComparable
Implemented
IComparable
, returns 1 if null, raises an exception if not an Ulid, otherwise the defaultCompareTo
method is used. I also added a comparison unit test.ISpanFormattable & ISpanParsable
Implemented both interfaces for their respective versions with additional unit tests. I'm not familiar with these interfaces (can't find any docs/articles) and treated the methods as wrappers for TryWriteStringify/ToString and Parse/TryParse, ignoring
IFormatProvider
andReadOnlySpan<char> format
.Looking at the Guid internals they ignored
IFormatProvider
but would change the serialization method depending on values insideReadOnlySpan<char> format
. I assumed that it isn't mandatory for implementors to use these parameters. Perhaps you could consider custom formatting options here.Changed
Time
Updated the Time property to use Unsafe code to reverse the first 6 bytes. I kept the original method for big endian devices.
Benchmarking shows a 30% speedup with times equal to or faster than vectorisation so I opted to only use byte shuffling.
Looped benchmark (100 iterations)
Constructor public Ulid Guid
Added/updated 3 methods for
public Ulid(Guid guid)
. For .Net 7+ or .Net 3+ I used shuffle intrinsics to shuffle the structure of the guid and cast it into a Ulid. For little endian intrinsic incompatible devices I used Unsafe andBinaryPrimitives.ReverseEndianness
to alter the Guid structure to Ulid.Equality
Added a vectorized equality check using either .Net 7+ Vector128 extensions, a manual Ssse3 comparison for .Net 3.0-.Net 6.0 or a refactored unsafe equality check.
Looped benchmark (100 iterations)
GetHashCode
Refactored the previous unsafe scoped implementation to use modern Unsafe code.
ToString
Used
string.Create
this letsTryWriteStringify
directly mutate the string during constructions, before it is made immutable.ToGuid
Added/updated 3 methods for
public Ulid(Guid guid)
. For .Net 7+ or .Net 3+ I used shuffle intrinsics to shuffle the structure of the Ulid and cast it into a Guid. For little endian intrinsic incompatible devices I used Unsafe andBinaryPrimitives.ReverseEndianness
to alter the Ulid structure to Guid.Testing
To support vectorised/non vectorised code a way of testing with and without hardware intrinsics enabled should be added. I'm not familiar with github actions but perhaps the following could be added to build-debug.yml.
Future changes/ Thoughts
<
,<=
,>
,>=
.public string ToBase64
usesArrayPool
, stackalloc could instead be used here. TheSkipLocalsInit
could be used to remove the overhead of clearing the span.Would the logic used in the original get
Time
andinternal Ulid(long timestampMilliseconds...
work on big endian machines?I have a working vectorized implementation of CrockfordBase32 that is compatible with the Ulid/Nulid implementation (Accepts inputs of 0-122, treats lowercase as uppercase, invalid characters become 31). It needs more work, decoding is 30% faster than the array version but encoding is still slower.
Experimental change to `internal Ulid(long,..)`
Internal Ulid long
Experimented with using Unsafe and
BinaryPrimitives.ReverseEndianness
to speed up internal constructors. Benchmarks showed little improvement.Benchmark
Looped benchmark (100 iterations)