Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hash/maphash: add func Comparable #54670

Closed
Tracked by #13091
dsnet opened this issue Aug 25, 2022 · 40 comments
Closed
Tracked by #13091

hash/maphash: add func Comparable #54670

dsnet opened this issue Aug 25, 2022 · 40 comments

Comments

@dsnet
Copy link
Member

dsnet commented Aug 25, 2022

I'm trying to implement my own map-like data structure. I'm using generics where I have a comparable type on hand, but I don't have a way to hash it. To workaround it, I'm adding a Hash() uint64 constraint, but that requires every key type to have a custom Hash method, which is quite unfortunate, given that the Go runtime knows how to hash comparable types.

I propose we add:

// Comparable returns the hash of comparable value v with the given seed
// such that Comparable(s, v1) == Comparable(s, v2) if v1 == v2.
// If v contains a floating-point NaN, then the hash is non-deterministically random.
func Comparable[T comparable](seed Seed, v T) uint64

Under the hood, this function is a thin wrapper over mapType.hasher.

There are some caveats with the operation of this function, but they're no different than the caveats of using ==.

@gopherbot gopherbot added this to the Proposal milestone Aug 25, 2022
@dsnet
Copy link
Member Author

dsnet commented Aug 25, 2022

I'm realizing a major caveat with this is that this can hash pointers. In order for Comparable(s, v1) == Comparable(s, v2) if v1 == v2 to hold, this probably implies that Go doesn't use a moving GC.

@ianlancetaylor ianlancetaylor moved this to Incoming in Proposals Aug 25, 2022
@ianlancetaylor
Copy link
Member

Note that in #28322 we discussed and rejected the idea of having the package that became hash/maphash support general types.

@randall77
Copy link
Contributor

I'm realizing a major caveat with this is that this can hash pointers. In order for Comparable(s, v1) == Comparable(s, v2) if v1 == v2 to hold, this probably implies that Go doesn't use a moving GC.

Even today, we have moving GC if you count stacks. The built-in maps avoid this problem by never allocating on the stack objects that map pointer keys refer to. For maphash.Comparable we'd need to make sure that it escapes its second argument so that anything it refers to will not be on the stack.

@andy-wm-arthur
Copy link

I wanted something like Comparable[T]() for a package I was working on. When I couldn't find a general-purpose solution, I wrote my version that uses unsafe to get access to the runtime hash implementation:

https://github.com/dolthub/maphash

My use case is a simple stripe-locked map, where a generic hash function is used to pick the stripe:

type Map[K comparable, V any] struct {
	stripes [64]map[K]V
	locks   [64]sync.Mutex
} 

Looking through the history of this issue, it seems like there are a number of interesting applications for this proposal. @bcmills listed a number of relevant examples in #21195

@dsnet
Copy link
Member Author

dsnet commented Feb 13, 2023

In my application, I'm writing a generic data structure for packet processing, where I only care about hashing byte arrays.

As a half-step measure, I wonder if there's value in providing a Comparable function that panics if the type transitively contains any Go pointers, interfaces, or channels. This avoids the gnarly question of how pointers should be handled until a later point in time. There's always the possibility of providing it later, while providing value for many comparable types today.

The downside is that generic data structures (that expose comparable as part of their API) are not fully type-safe since only a subset of comparable types are supported. It also functionally introduces a third definition of "comparability". It's already confusing enough that the Go specification distinguishes between "comparable" and "strictly comparable". And we'd be introducing "very strictly comparable" (or something) that's a subset of "strictly comparable".

@dsnet
Copy link
Member Author

dsnet commented Feb 13, 2023

For posterity, here are the thorny issues with pointers, interfaces, and channels:

  • This assumes that the GC is non-moving.
  • Even for today's GC, there's the movement due to stacks that may cause pointer addresses to change. See hash/maphash: add func Comparable #54670 (comment).
  • If we hash a pointer p1 as h1 and let p1 be GC'd, and then allocate p2 as the same type with the exact same address, should hashing p2 as h2 result in h1 == h2? Intuitively, the answer seems to be no since these are strictly different objects p1 was destroyed, and p2 was created in its place and just happened to use the same address. However, naively hashing the address will result in equal hashes.
  • If we resolve runtime: types are not garbage collected #28783, where dynamically created types can be GC'd, then we need to be careful how the rtype in an interface is hashed. If we naively hash the pointer, then two Go types that are semantically identical, but alive at different times in the program's lifespan may have different hashes since the rtype (which semantically represents the same Go type) was allocated from different memory addresses).

@aclements
Copy link
Member

This assumes that the GC is non-moving.

I don't think it does. Broadly, there are two solutions to pointer hashing in a moving GC:

  1. Rehash all objects that contain pointers.
  2. When an object is moved, record the original hash of pointers to that object and use that for future hashing.

Today, option 1 would mean rebuilding all maps. This is possible, but would be extremely difficult to do in a low-latency way. If we were to expose hash values of pointers, this would become essentially untenable. (You could imagine that the hash values we expose are opaque, but that severely limits what you can do with them.)

Option 2 is, I believe, fairly reasonable. This is what Java does for the default hashCode implementation. The main downside is that, today, hashing a pointer value requires no indirection–we simply hash the pointer value–and this would require us to perform an object lookup to check for an original hash (not just an indirection because it could be an interior pointer). It might be possible on some architectures to use a bit in the pointer to indicate that the object has moved and we need to do this extra work. It would also require us to have somewhere to store the original hash that's easy to access from the pointer. But if we're moving objects, we'd probably have a different heap layout, too, and this would simply have to be a consideration.

Even for today's GC, there's the movement due to stacks that may cause pointer addresses to change.

I think maphash.Comparable would have to escape its argument in the current implementation. That's not ideal, but doesn't seem like a big deal to me.

@mpx
Copy link
Contributor

mpx commented Sep 16, 2023

Note that in #28322 we discussed and rejected the idea of having the package that became hash/maphash support general types.

I think there are a couple of additions to Go that make this highly worth reconsidering:

  • Type Parameters greatly increases the desire for working with generic types and non-builtin data structures
  • comparable makes it possible to refer to types that support equality which are suitable for hashing (reinforcing the above).

The lack of maphash.Comparable (or similar) leaves developers and the ecosystem with 2 bad options:

  1. Write a lot more code for slower, buggy custom hash implementations for their types (I think it's unreasonable to expect most developers to do this well)
  2. Leverage Go's fast/safe hash implementation via runtime hacks which are tricky and risk future incompatibility (along with situations like assume-no-moving-gc).

Providing maphash.Comparable would be extremely useful and remove this effective limitation of Type Parameters.

@aclements
Copy link
Member

aclements commented Mar 28, 2024

Edit 2024-06-05: Tried to clarify table.

We were discussing this in the Google compiler/runtime team and had some more thoughts.

Even for today's GC, there's the movement due to stacks that may cause pointer addresses to change.

Stepping back, the basic property of hashing is that x == y implies hash(x) == hash(y). Values of type *T that point to the stack cause trouble for this because currently the hash of a pointer is simply a function of the pointer, and stack pointers can change. That is, performing hash(&x) before and after a stack movement will result in different values.

I'm being very careful to say "values of type *T" because not all stack pointers are a problem. Notably, values with underlying type string structurally contain a pointer that may well point to the stack, but the value of this pointer does not affect the result of == or of the hash function. We know this is always true because all hash functions are constructed by the compiler.

Values with underlying type chan T behave like *T, so for brevity we won't discuss them explicitly below.

Values of interface type are more complicated. Their dynamic type may be *T and thus they could be a stack pointer. This means static analysis must conservatively assume they may be stack pointers. Dynamic solutions, on the other hand, can inspect the type and value and determine if it's a stack pointer that would affect the hash function.

It's informative to look at how built-in maps deal with all this today. Built-in maps are known to the compiler's escape analysis and follow very simple rules: m[k] lookups do not escape either m or k, delete(m, k) also does not escape m or k, but m[k] as a lvalue for an insert always escapes k. This works out because we know these operations are going to combine hashing with the map operation. Insert needs to escape the value regardless because it's about to put it in the map (this could perhaps be relaxed if the map itself were stack allocated, but the compiler currently doesn't reason about this). Lookup/delete only need the hash temporarily, so it doesn't have to be stable for stack pointers. Furthermore, because insert always escapes its value, lookup/delete of a stack pointer cannot possibly succeed, so the value of the hash doesn't matter.

Options

Once we decouple hashing from data structure operations using that hash, things get more complicated. I see a few options:

  1. We change how pointers to the stack are hashed. This is a purely dynamic solution.

  2. We escape values of type *T and interfaces passed to maphash.Comparable, so the hash function is never affected by a stack pointer. Since this is a static solution, it has to treat interfaces conservatively and we have two options for dealing with string values:

    i. We indiscriminately escape the argument to maphash.Comparable.

    ii. We make escape analysis distinguish strings passed to maphash.Comparable and not escape them.

  3. We use dynamic escape analysis: when maphash.Comparable is called at run-time with a value of type *T that points to the stack, we move its target to the heap. We don't have this capability right now, but we're already exploring it for other reasons. This is a dynamic solution.

  4. We somehow track the data flow of the result of maphash.Comparable to figure out if it's short-lived for a lookup or long-lived for an insert.

1. Change how pointers to the stack are hashed

The hash function for *T would normalize the pointer before hashing it such that the normalized value would not change across stack moves. Probably the simplest implementation is to check if the pointer points within the stack bounds and subtract the pointer value from the top-of-stack pointer. This is nice because it's totally localized to the hash function for *T, and doesn't affect strings or interface values. However, it has some probably minor performance penalty and restricts some future design directions.

Today, composite types can sometimes be hashed directly with a single large memhash operation, and this would narrow the set of circumstances in which that optimization were possible. There are already many things that disallow this optimization, including string and interface fields, and even padding between fields, so in practice further narrowing it may not matter much. We could potentially use different hash functions for built-in maps and maphash to mitigate this, but that would make binaries bigger and disadvantage maphash. We can also make this decision on the fly: because of escape analysis, if the object we're hashing isn't itself on the stack, we know it can't contain stack pointers, so we can do more efficient hashing. This check is fast and only needs to be done once per hash operation.

The more concerning thing to me is that this may restrict some future design directions. For example, we're currently exploring dynamic escape analysis, where values can move from stack to the heap at run-time. This would make the simple normalized hash function no longer stable. We could apply the same solutions as a moving GC, but that has a cost for hashing of all pointers. There may be other ways this would hamstring us.

2. Escape *T and interfaces passed to maphash

For a user-defined container that uses maphash, insert is going to escape the key, while lookup and delete can avoid this. This option has the following consequences:

Op Type Built-in map User map Note
Lookup *T No escape Escapes Lookup can never succeed for a stack pointer, so this is almost certainly rare.
Lookup interface No escape Escapes ⚠ Unfortunate.
Lookup string (no special case) No escape Escapes ⚠⚠ Very unfortunate.
Lookup string (w/ special case) No escape No escape
Insert *T Escapes Escapes Both hashing and storing cause escape.
Insert interface Escapes Escapes Same.
Insert string (no special case) Escapes Escapes Same.
Insert string (w/ special case) Escapes Escapes Storing causes escape in user map.

With this approach, lookups in user-defined maps are slightly disadvantaged relative to the built-in map. For *T, this probably doesn't matter in practice. For interfaces it's unfortunate because interface values often contain pointers that don't affect the hash result. For example, if the interface's dynamic type is int, the compiler has to box that value and store it as a pointer to an int value. This solution would force these to escape.

For string values (and strings embedded in other types), if we treat them naively, they'll always escape, but we can special-case them in escape analysis to close this particular gap.

3. Dynamically escape stack pointers passed to maphash

Dynamic escape analysis allows the compiler's escape analysis to be less conservative by enabling the runtime to move values allocated on the stack to the heap only when necessary. We don't currently have this ability, but we're exploring it as a general optimization when we have hot/cold path information from PGO and the compiler can determine that all escapes of a value happen on cold paths. We could apply it to maphash as well.

4. Track the data flow of hash values

I include this here for completeness, but I'm not sure how plausible it is. The idea is that, if we can bound the lifetime of a hash value, it doesn't need to escape the hashed valued. This may be possible if we could represent hash values as a distinct type, but in general data structures need to use the numerical value of a hash, e.g., to select as hash bucket, and I think at that point it would be very easy to lose track of the data flow.

@rsc
Copy link
Contributor

rsc commented May 15, 2024

The easiest implementation would be to accept that Comparable's arguments escape. We see how to do that very easily. My instinct is that it probably is still useful in that case, but maybe I'm wrong. Is it still useful if arguments escape?

If arguments can't escape then it will take a lot longer to move forward with this.

@rsc rsc changed the title proposal: hash/maphash: add Comparable to hash comparable values proposal: hash/maphash: add func Comparable May 15, 2024
@rsc
Copy link
Contributor

rsc commented May 15, 2024

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals May 15, 2024
DmitriyMV added a commit to DmitriyMV/gen that referenced this issue May 19, 2024
This PR adds a hashtriemap implementation to the project. This is mostly a direct copy from unique package [^1]
internals [^2] with some adjustments to make it work with our project. Once `maphash.Comparable` lands we can remove
dependency on Go internals [^3]. Michael Knyszek says he plans to use this implementation as a general idea for the
future generic sync/v2.Map, but nothing is in stone yet.

[^1]: golang/go#62483
[^2]: https://go-review.googlesource.com/c/go/+/573956
[^3]: golang/go#54670

Signed-off-by: Dmitriy Matrenichev <[email protected]>
DmitriyMV added a commit to DmitriyMV/gen that referenced this issue May 19, 2024
This PR adds a hashtriemap implementation to the project. This is mostly a direct copy from unique package [^1]
internals [^2] with some adjustments to make it work with our project. Once `maphash.Comparable` lands we can remove
dependency on Go internals [^3]. Michael Knyszek says he plans to use this implementation as a general idea for the
future generic sync/v2.Map, but nothing is in stone yet.

[^1]: golang/go#62483
[^2]: https://go-review.googlesource.com/c/go/+/573956
[^3]: golang/go#54670

Signed-off-by: Dmitriy Matrenichev <[email protected]>
@rsc
Copy link
Contributor

rsc commented May 23, 2024

@dsnet I'd love to hear your take on whether Comparable escaping its arguments is still a useful API.

(Let's assume the compiler transparently rewrites maphash.Comparable[string] -> maphash.String so the escape question is not important in that specific case.)

Also it seems odd to have the global func but no way to update a *Hash with a comparable. The choices are to have a method taking an any, or else to have a global func like

func WriteComparable[T comparable](h *Hash, x T)

I assume these routines panic if you use something like maphash.Comparable[any](seed, []int(nil)), where the comparable type is any but the actual value is not comparable?

@thepudds
Copy link
Contributor

thepudds commented May 24, 2024

Overall, this seems worth doing even for the simple implementation.

That said, when re-reading, I’m not always 100% certain when some folks say in some spots above something like "always escape k" if they are using that as shorthand for "always escape k in some cases" (such as if k is a pointer, interface, chan, or other variations).

For clarity, I think the key would not escape if it is for example an integer type or struct of integers, even under the simple implementation?

(Let's assume the compiler transparently rewrites maphash.Comparable[string] -> maphash.String so the escape question is not important in that specific case.)

Would it be reasonable to also avoid escaping types like the following (either in the first implementation, or with modest follow-up effort)?

type t1 struct { k1, k2 string }
type t2 struct { k1, k2 string; k3 int }
type t3 struct { k1 t1; k2 t2 }

(That seems plausible based on a perhaps naive guess at an implementation).

In any event, even the "simple" implementation seems pretty useful, and hard cases could come later or never.

@DmitriyMV
Copy link
Contributor

Can we use this:

//go:nosplit
func noEscapePtr[T any](p *T) *T {
	x := uintptr(unsafe.Pointer(p))
	return (*T)(unsafe.Pointer(x ^ 0))
}

func main() {
...
	h := maphash.Comparable(*noEscapePtr(&myStruct))

To trick maphash.Comparable into thinking that variable does't escape? The same way HashTrieMap internals currently does.

DmitriyMV added a commit to DmitriyMV/gen that referenced this issue May 27, 2024
This PR adds a hashtriemap implementation to the project. This is mostly a direct copy from unique package [^1]
internals [^2] with some adjustments to make it work with our project. Once `maphash.Comparable` lands we can remove
dependency on Go internals [^3]. Michael Knyszek says he plans to use this implementation as a general idea for the
future generic sync/v2.Map, but nothing is in stone yet.

[^1]: golang/go#62483
[^2]: https://go-review.googlesource.com/c/go/+/573956
[^3]: golang/go#54670

Signed-off-by: Dmitriy Matrenichev <[email protected]>
DmitriyMV added a commit to DmitriyMV/gen that referenced this issue May 27, 2024
This PR adds a hashtriemap implementation to the project. This is mostly a direct copy from unique package [^1]
internals [^2] with some adjustments to make it work with our project. Once `maphash.Comparable` lands we can remove
dependency on Go internals [^3]. Michael Knyszek says he plans to use this implementation as a general idea for the
future generic sync/v2.Map, but nothing is in stone yet.

[^1]: golang/go#62483
[^2]: https://go-review.googlesource.com/c/go/+/573956
[^3]: golang/go#54670

Signed-off-by: Dmitriy Matrenichev <[email protected]>
DmitriyMV added a commit to DmitriyMV/gen that referenced this issue May 27, 2024
This PR adds a hashtriemap implementation to the project. This is mostly a direct copy from unique package [^1]
internals [^2] with some adjustments to make it work with our project. Once `maphash.Comparable` lands we can remove
dependency on Go internals [^3]. Michael Knyszek says he plans to use this implementation as a general idea for the
future generic sync/v2.Map, but nothing is in stone yet.

[^1]: golang/go#62483
[^2]: https://go-review.googlesource.com/c/go/+/573956
[^3]: golang/go#54670

Signed-off-by: Dmitriy Matrenichev <[email protected]>
DmitriyMV added a commit to DmitriyMV/gen that referenced this issue May 27, 2024
This PR adds a hashtriemap implementation to the project. This is mostly a direct copy from unique package [^1]
internals [^2] with some adjustments to make it work with our project. Once `maphash.Comparable` lands we can remove
dependency on Go internals [^3]. Michael Knyszek says he plans to use this implementation as a general idea for the
future generic sync/v2.Map, but nothing is in stone yet.

[^1]: golang/go#62483
[^2]: https://go-review.googlesource.com/c/go/+/573956
[^3]: golang/go#54670

Signed-off-by: Dmitriy Matrenichev <[email protected]>
DmitriyMV added a commit to DmitriyMV/gen that referenced this issue May 27, 2024
This PR adds a hashtriemap implementation to the project. This is mostly a direct copy from unique package [^1]
internals [^2] with some adjustments to make it work with our project. Once `maphash.Comparable` lands we can remove
dependency on Go internals [^3]. Michael Knyszek says he plans to use this implementation as a general idea for the
future generic sync/v2.Map, but nothing is in stone yet.

[^1]: golang/go#62483
[^2]: https://go-review.googlesource.com/c/go/+/573956
[^3]: golang/go#54670

Signed-off-by: Dmitriy Matrenichev <[email protected]>
DmitriyMV added a commit to DmitriyMV/gen that referenced this issue May 27, 2024
This PR adds a hashtriemap implementation to the project. This is mostly a direct copy from unique package [^1]
internals [^2] with some adjustments to make it work with our project. Once `maphash.Comparable` lands we can remove
dependency on Go internals [^3]. Michael Knyszek says he plans to use this implementation as a general idea for the
future generic sync/v2.Map, but nothing is in stone yet.

[^1]: golang/go#62483
[^2]: https://go-review.googlesource.com/c/go/+/573956
[^3]: golang/go#54670

Signed-off-by: Dmitriy Matrenichev <[email protected]>
@rsc
Copy link
Contributor

rsc commented Aug 7, 2024

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

The proposal is to add:

// Comparable returns the hash of comparable value v with the given seed
// such that Comparable(s, v1) == Comparable(s, v2) if v1 == v2.
// If v contains a floating-point NaN, then the hash is non-deterministically random.
func Comparable[T comparable](seed Seed, v T) uint64

// WriteComparable adds x to the data hashed by h. 
func WriteComparable[T comparable](h *Hash, x T)

There were some questions about implementation but I believe we have worked them out and are confident we can implement it efficiently enough to be useful.

@rsc rsc moved this from Likely Accept to Accepted in Proposals Aug 7, 2024
qiulaidongfeng added a commit to qiulaidongfeng/go that referenced this issue Aug 30, 2024
By default, runtime.memhash is used.
When purego is used, reflect is used to generate the same []byte
with the same value, and then hash the []byte.

Fixes golang#54670

Change-Id: Ibd0538a7dfb3d831c5145970cac7c910692bca69
qiulaidongfeng added a commit to qiulaidongfeng/go that referenced this issue Aug 30, 2024
By default, runtime.memhash is used.
When purego is used, reflect is used to generate the same []byte
with the same value, and then hash the []byte.

Fixes golang#54670

Change-Id: Ibd0538a7dfb3d831c5145970cac7c910692bca69
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/609761 mentions this issue: hash/maphash: add WriteComparable and Comparable

qiulaidongfeng added a commit to qiulaidongfeng/go that referenced this issue Aug 30, 2024
By default, runtime.memhash is used.
When purego is used, reflect is used to generate the same []byte
with the same value, and then hash the []byte.

Fixes golang#54670

Change-Id: Ibd0538a7dfb3d831c5145970cac7c910692bca69
qiulaidongfeng added a commit to qiulaidongfeng/go that referenced this issue Aug 30, 2024
By default, runtime.memhash is used.
When purego is used, reflect is used to generate the same []byte
with the same value, and then hash the []byte.

Fixes golang#54670

Change-Id: Ibd0538a7dfb3d831c5145970cac7c910692bca69
qiulaidongfeng added a commit to qiulaidongfeng/go that referenced this issue Aug 30, 2024
By default, runtime.memhash is used.
When purego is used, reflect is used to generate the same []byte
with the same value, and then hash the []byte.

Fixes golang#54670

Change-Id: Ibd0538a7dfb3d831c5145970cac7c910692bca69
qiulaidongfeng added a commit to qiulaidongfeng/go that referenced this issue Aug 30, 2024
By default, runtime.memhash is used.
When purego is used, reflect is used to generate the same []byte
with the same value, and then hash the []byte.

Fixes golang#54670

Change-Id: Ibd0538a7dfb3d831c5145970cac7c910692bca69
@randall77
Copy link
Contributor

Circling back here with an issue related to collision avoidance.

I think we should strive to avoid collisions between two values of the same type, but not between two values of different types. The reason is that I think it would be hard for an attacker to populate a single map with many different key types. There just aren't that many paths for attackers to force a program to generate lots of differently-typed keys for a single map. Normally, this package will be used with just one type for all the calls to Comparable (with the same seed and whose collisions matter, at least).

We do want to defend against different values of the same type having identical hashes. So the case [2]string{"foo",""} vs [2]string{"","foo"} vs [2]string{"f","oo"}, we want different hashes. But one of them could be the same as the hash of struct{a,b string}{"foo",""}.

So consider this a clarifying mini-proposal. Let me know if anyone has a different thought in this area.

@nightlyone
Copy link
Contributor

The question is how would the current map hashing code handle those cases?

The original intent of runtime/maphash is to expose the runtime internal hashing for user defined containers needing hashes benefitting from any improvements made to runtime internal hashing.

The current implementation is not type variant as every type gets its own algorithm generated at compile time.

Value variance must be handled already in existing hashing code as one can use the above mentioned types as keys in a map.

I believe the hash collisions created by identical byte sequences in a data structure is also ok currently in map hashes of existing maps.

Strings might include the length in the hash, so that might lead to different hashes in arrays for the simple cases.

So I would suggest to not over engineer a solution that is only supposed to expose an existing hashing and it's constraints.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Accepted
Development

Successfully merging a pull request may close this issue.