-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Marvin32 to System.IO.Hashing #68616
Comments
Tagging subscribers to this area: @dotnet/area-system-security, @vcsjones Issue DetailsDuring the initial standup of System.IO.Hashing there were requests for Marvin32, and a couple more have trickled in from other channels, so we may as well go ahead and add it. Marvin32 is a 32-bit output hash that uses a 64-bit seed. As such, it looks like the XxHash32 type except the data type for The default seed for the Marvin32 algorithm is namespace System.IO.Hashing
{
public sealed partial class Marvin32 : System.IO.Hashing.NonCryptographicHashAlgorithm
{
public const long DefaultSeed = unchecked((long)0xD53CD9CECD0893B7);
public Marvin32() : base (sizeof(int)) { }
public Marvin32(long seed) : base (sizeof(int)) { }
public static byte[] Hash(byte[] source);
public static byte[] Hash(byte[] source, long seed);
public static byte[] Hash(System.ReadOnlySpan<byte> source, long seed = DefaultSeed);
public static int Hash(System.ReadOnlySpan<byte> source, System.Span<byte> destination, long seed = DefaultSeed);
public static bool TryHash(System.ReadOnlySpan<byte> source, System.Span<byte> destination, out int bytesWritten, long seed = DefaultSeed);
public override void Append(System.ReadOnlySpan<byte> source);
protected override void GetCurrentHashCore(System.Span<byte> destination);
public override void Reset();
}
}
|
would it make sense to expose also .NET randomized seed (same as used for i.e. string.GetHashCode())? |
I think we'd need a very good reason as to why we'd expose it. The risk of exposing it ends up being that it gets accidentally disclosed via a service and makes it easier for a bad actor to compute string hash collisions. Not knowing a good reason to expose it I think the balance says not to. (For comparison, we also didn't expose the random seed used for XxHash32 in System.HashCode) |
I'd propose removing the default seed const. If somebody really cares, they can look it up from the spec. And it removes the temptation for people to say "I'm good because I'm using Marvin!" while still passing in a globally known seed, which would defeat the point of using the algorithm in the first place. |
Presumably you think we should also get rid of the default ctor and I was debating that when drafting the proposal, but since a default seed was specified I went with "keep an existing shape". And the default seed is better than 0, since a 0 seed is bad for the algorithm. So namespace System.IO.Hashing
{
public sealed partial class Marvin32 : System.IO.Hashing.NonCryptographicHashAlgorithm
{
public Marvin32(long seed) : base (sizeof(int)) { }
public static byte[] Hash(byte[] source, long seed);
public static byte[] Hash(System.ReadOnlySpan<byte> source, long seed);
public static int Hash(System.ReadOnlySpan<byte> source, System.Span<byte> destination, long seed);
public static bool TryHash(System.ReadOnlySpan<byte> source, System.Span<byte> destination, long seed, out int bytesWritten);
public override void Append(System.ReadOnlySpan<byte> source);
protected override void GetCurrentHashCore(System.Span<byte> destination);
public override void Reset();
}
} |
What is the use case for anyone using this hash function? My understanding is that, since Marvin32 was invented, 20 years of fast hash research have occurred. Modern fast hashes should just be better. |
I believe Marvin was invented in 2012, or thereabouts; which might make it newer than the XxHash hashes we've already added.
The concrete requests I've seen seem to stem from "we're already using a copy of the code" and/or "interop". |
In my estimation, this hash algorithm is likely to be legacy tech 5-10 years from now (perhaps even right now). I don't believe it should be eternalized in the framework to be supported forever. Creative alternative: Refactor the existing code to be easy to copy out. Those users who require that will then have a decent workaround. This strategy avoids the support burden. |
Something to consider is that the entire System.IO.Hashing package ships out-of-band. It's not part of the inbox SDK. To me, this makes it no more "eternal" than any other OOB package which has shipped over the years, and those packages don't enjoy a "supported forever" policy. (Case in point: all of the OOB System.Web.* releases over the years, most of which are no longer supported.) |
A bit of info for those not well versed in non-cryptographic hash functions that might take an interest in this conversation. Marvin hash was patented back in 2012. It is described as a fast, seeded, and secure non-cryptographic hash function. It is currently used as the default implementation in GetHashCode() on string. It uses a mixer function based on Xor-Shift-Add to achieve avalancing in parallel on x86 systems. xxHash2 was first released back in 2014, it has since changed a lot. It is also a fast, seeded and secure non-cryptographic hash function. It not only has a 32bit version, but also a 64bit version that is optimized for 64bit systems. It acheves its speed through a mixer function using the Add-Shift-Multiply scheme and a final avalanche function that mixes using Xor-Shift-Multiply. I belive this trick comes from the Murmur2 hash. For comparison, I've taken HashDepot, uranium62, System.IO.Hashing, and my own implementation (Genbox) and benchmarked them. Size is the number of bytes of random input.
Speed is one thing, but another more important factor is random distribution (part of security). Unfortunately, I've not had the time to port it to the SMhasher framework, so I'm unable to backup my claims, but I would belive Marvin would perform fine, as it mixes pretty well with well-studied techniques. |
I'm with @GSPP on this one. Marvin hash is not a first-order citizen like many other hash functions (farmHash, xxHash, SipHash), so it has not seen widespread implementation in software. To me, there are only three reasons why anything should be implemented in .NET:
As Marvin32 is none of the above, I'd say time is better spent optimizing on xxHash, as it seems to be 2-3x slower than other implementations. |
Based on low current demand, and there being arguably better alternatives already implemented (such as xxHash32), this doesn't seem like a thing we should be adding at this time. (If this makes you sad and you really wanted this, then please speak up.) |
During the initial standup of System.IO.Hashing there were requests for Marvin32, and a couple more have trickled in from other channels, so we may as well go ahead and add it.
Marvin32 is a 32-bit output hash that uses a 64-bit seed. As such, it generally looks like the XxHash32 type except the data type for
seed
has been changed fromint
tolong
. One major deviation from the shape of the XxHash32 type is that for Marvin32 the seed is a mandatory parameter.The text was updated successfully, but these errors were encountered: