-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Is it time to replace Marvin? #85206
Comments
Tagging subscribers to this area: @dotnet/area-system-runtime Issue DetailsWe could, for example, bring XxHash3 down into corelib from System.IO.Hashing, delete Marvin, and use XxHash3 in place of it. public static IEnumerable<int> GetLengths()
{
for (int i = 0; i < 100; i++) yield return i;
yield return 1000;
}
[ParamsSource(nameof(GetLengths))]
public int Length { get; set; }
private string _str;
private static readonly long s_seed = BitConverter.ToInt64(RandomNumberGenerator.GetBytes(8));
[GlobalSetup]
public void Setup() => _str = RandomNumberGenerator.GetString("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", Length);
[Benchmark(Baseline = true)]
public int Marvin() => string.GetHashCode(_str);
[Benchmark]
public int XXH3() => (int)(long)XxHash3.HashToUInt64(MemoryMarshal.AsBytes(_str.AsSpan()), s_seed);
The above is for case-sensitive. Not sure exactly how we'd want to tweak the XXH3 algorithm for case-insensitive, but we could probably do something similar to what we do with Marvin today. (It's also possible to improve on these numbers a bit for this purpose for XxHash3. There are some calculations in the current implementation that are done on every call, based on the seed supplied to that call, but for string hashing purposes where the seed is constant for the process lifetime, those calculations could be done once for the process when the random seed for the process is selected.) cc: @GrabYourPitchforks, @jkotas
|
Some relevant info in #68616. If it has comparable security guarantees and is faster - seems potentially a good idea. |
I'm going to guess the 99% case is hashing less than 100 bytes. I wonder whether there are algorithms that perhaps don't perform as well across all input sizes as xxHash but do even better for such short inputs. |
I wonder if with this XXH3 we can even remove the fast hashcode that we use in Dictionary till we hit collisitions and fallback to the default randomized one. |
Does XxHash3 provide same or better protection against DoS attacks compared to Marvin? Is there a test that can measure it? |
Most inputs here are in the 6 - 10 char range for strings which represent identifiers. For strings which represent people's names, they can of course be arbitrarily long, but less than 16 chars is a good rule of thumb. The primary scenario for this is strings which are used within record-like types. Strings which are used as keys in collections are already special cased to go down a faster code path, assuming that optimization remains in place. |
No one else I can find uses or evaluates Marvin, so it doesn't show up in any academic / industry comparisons I've seen nor gets much in the way of external scrutiny (that in and of itself might be reason to move on). But XXH3 gets good marks on standard test suites like SMHasher. Levi can likely comment further. We could also adapt such a suite and measure it ourselves if desired. My suggestion also isn't tied specifically to XXH3... we just have a good implementation of it now, it looks to be well-respected, and its the fastest algorithm we've implemented. If something better came along, we could switch to that instead. |
FWIW, I was curious to see how XXH3 compared against FNV-1a, which is what C# uses for hashing strings in switches. I was wondering if it'd make sense for the C# compiler itself to switch to depend on XxHash3 if doing so didn't require an additional assembly reference. But on my machine, the crossover point where XxHash3 starts being faster is at ~13 characters, which I expect is larger than the average string length used in switches, so it's unlikely to be a win. (For very long string lengths, XXH3 is a clear winner; on my machine, at 100 characters, FNV-1a is ~5x slower than XXH3, and at 1000 characters is ~10x slower.) |
Putting this into future as we won't be doing it for 8. We can move it to 9 when and if work actually starts on this to avoid needing to keep pushing it out to the next milestone. |
The main point it is likely that Marvin loses to XxHash3 on all criteria, with the latter being optimal choice on collision rate + performance across all input lenghts:
Given this would affect one of the foundational behaviors in .NET, this probably warrants a more extensive research rather than a simple "go vs no go", with non-trivial performance wins on the table across the board with improved dictionary/hashset lookups and code simplification (likely no more need for custom implementations for specific containers). |
@stephentoub I'm seeing String.GetNonRandomizedHashCode on a hot path (I guess, |
Yes please |
DoS against hashmap data structures comes from key collisions, and while both Marvin and xxHash have good mixing properties and, therefore, a low rate of collision, protection against DoS should come from seeding the hash functions rather than their collision resistance. That being said, the lack of (at least public) analysis of MarvinHash, and the fact that xxHash has been analyzed (flaws were found and fixed), will by default make xxHash the preferred choice. |
It is worth noting that reducing the xxHash function to a mixer (ulong -> mixed ulong) considerably speeds up the function. Specializing the function rather than using a generic version often provides a very large speedup. |
My FastHash project has a benchmark suite for common hash functions with sizes 8, 32, 128, etc. I've included Marvin, as well as both xxHash v2 and xxHash v3 hash functions. I've made little to no changes to the original algorithms. There is a test suite that tests against published test vectors to ensure correctness. General notes:
Benchmark:
Benchmark notes:
|
Can you elaborate? On what hardware? This is contrary to what we previously saw when measuring optimizations here, e.g. #77881 (comment).
You're looking for the HashToUInt64 method. It doesn't create a byte array; it just returns a ulong. |
I tested against the release version of the package. Seems the HashToUint64 was introduced in the pre-release of version 8.0 packages. I'll redo my benchmark against the pre-release version. The byte array allocation overhead certainly will have impacted performance. |
Ok. In that case, you also weren't using XxHash3, which along with XxHash128 was added in the 8.0 package. Thanks. |
Here are the new benchmarks. These functions are the ones from System.IO.Hashing:
The others are my implementations from Genbox.FastHash.
Config/Hardware:
My previous testing was against System.IO.Hashing 7.0.0, which did not have xxhash v3 or the HashToUInt64/HashToUInt32 functions which are a really nice addition. xxhash v3 from System.IO.Hashing 8.0.0-rc is indeed fast. It gains traction on 16+ bytes, but has overall good performance. |
Good to see Marvin being challenged! It performs well for small input sizes but significantly falls behind some other algorithms such as xxh3 as input size grows. I've put some of my free time lately into making another non-cryptographic algorithm named gxhash, leveraging SIMD and with high ILP. With the rust implementation the benchmark results are compelling, outperforming all counterparts I have found for all input sizes on my ARM and x64 machines. As a quick test I have ported the algorithm in full managed C# (using portable SIMD from Another benefit is that it compiles for a much shorter assembly code compared Please let me know if you think it's worth putting a little more effort into this. If you are interested I can clean the C# implementation a bit and share it. Of course any feedback is more than welcome. Here are some numbers with the current implementation (.NET 8, MacBook M1 pro ARM).
|
@ogxd interesting! Interested in thoughts of others but I imagine we'd want to stick to something that's industry tested for something this fundamental and it sounds like it's relatively early days for your algorithm, although the performance sounds encouraging. |
Yes I completely agree, I was just thinking it was worth mentioning it here, despite it being new, given that XxH3 performs worse than Marvin on small input sizes, making the switch to XxH3 more of a compromise than a 100% win. |
As some seem to be interested I took some more time on this C# implementation for Some benchmark results from a github runner:
|
We could, for example, bring XxHash3 down into corelib from System.IO.Hashing, delete Marvin, and use XxHash3 in place of it.
The above is for case-sensitive. Not sure exactly how we'd want to tweak the XXH3 algorithm for case-insensitive, but we could probably do something similar to what we do with Marvin today.
(It's also possible to improve on these numbers a bit for this purpose for XxHash3. There are some calculations in the current implementation that are done on every call, based on the seed supplied to that call, but for string hashing purposes where the seed is constant for the process lifetime, those calculations could be done once for the process when the random seed for the process is selected.)
cc: @GrabYourPitchforks, @jkotas
The text was updated successfully, but these errors were encountered: