-
-
Notifications
You must be signed in to change notification settings - Fork 404
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generalized hashing of keys for memoization #1074
Conversation
Just some quick examples/performance statistics:
Unsurprisingly, big keys hash slower than small keys (I expect most keys to be pretty small). Note that |
Looks good to me. I doubt we'll catch all the cases we'll need straight away so it's good you can add |
supplied, the objects id is used - this may be an issue when dealing | ||
with mutable objects. For this reason, it is important to support | ||
common mutable types such as pandas Dataframes and numpy arrays | ||
(supported). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bit discussing mutable objects is confusing -- it calls out Pandas and Numpy, but doesn't mention dicts, lists, sets, etc. -- surely those are both mutable and supported as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The distinction is between what is supported directly by JSON (dicts, lists) and things that we need to add support for it HashGenerator
. Currently support is added for sets, datetime, numpy arrays and pandas dataframes/series. More things may need to be explicitly supported in future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, please reword to convey that.
""" | ||
Extends JSONEncoder to generate a hashable string for as many types | ||
of key as possible. These keys are supplied by widgets and Stream | ||
parameters and may be nested. If an unrecognized object type is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the docstring for this class should mention widgets or Streams in any way other than something explicitly labeled as an example. It could mention memoization, though, so that we know the goal of this object is to generate an id that can tell whether two objects differ, not to provide an encoding of the object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I'm confused about whether it's truly a HashGenerator -- the docstring says it will generate a hashable string, not a hash, and the code indicates that sometimes it's one, sometimes it's the other. Which is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HashableStringGenerator
is a pretty ugly name but probably more accurate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the docstring for this class should mention widgets or Streams in any way other than something explicitly labeled as an example. It could mention memoization, though, so that we know the goal of this object is to generate an id that can tell whether two objects differ, not to provide an encoding of the object.
Happy to update the docstring as suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do! Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, all strings are hashable already. It is about taking an arbitrary thing and generating a unique string from it suitable to generate a hash. I agree that is a less ugly name though... @philippjfr Any suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StringForHashing? Not great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HashableJSON
might be better - the output is a JSON string with some customised support for additional types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HashableJSONEncoder
if we want to be pedantic...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HashableJSON seems fine to me.
|
||
|
||
|
||
def keyhash(key, as_string=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or is this hashable_string?
In any case, I'm not sure in what sense the argument is a "key".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or is this
hashable_string
?
No. I'll get rid of the as_string
argument as I only needed it temporarily for debugging. The output is an integer, as returned by the Python hash
function.
I am happy to rename it. In the context it is used (memoisation) the argument is what we have always called a 'key'. I'll just call it obj
here... but then this function will then need a new name. We can't simply use hash
clashes with the built-in. Again, any suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, you'll use it as a key, but that doesn't mean the function has anything to do with keys. obj is a good name for the argument; that's what it is (an arbitrary object). For the function name, maybe objhash (to contrast it with the regular hash(), which doesn't work for arbitrary objects)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, you'll use it as a key, but that doesn't mean the function has anything to do with keys.
Agreed! Not that keen on objhash
as a name though...
Implemented suggestions and added unit tests. Ready for final review/merge once tests pass. |
(supported). | ||
of object as possible including nested objects and objects that are | ||
not normally hashable. The purpose of this class is to generate | ||
unique strings that once hashed are suitable for memoization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strictly speaking, it's not the hashed strings that are memoized, it's the actual results (and only implicitly even those, in this case). The hashed strings are useful for memoization, but they themselves are not memoized.
So maybe "The purpose of this class is to generate unique strings that once hashed are suitable for use in memoization and other cases where deep equality must be tested without storing the entire object".
|
||
By default JSONEncoder supports booleans, numbers, strings, lists, | ||
tuples and dictionaries. In order to support other types such as | ||
sets, datetime objects and mutable objects such as pandas Dataframes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and custom mutable
Not sure why it's failing, but it looks good to me apart from the minor docstring fixes suggested. |
Python 3 has an additional limitation - the If we are happy to accept these limitations, the PR is ready to merge now the test have passed. Even with its caveats/holes, this approach greatly expands what can be memoized (before this PR, even a simple list wouldn't allow memoization). |
Looks good to me. |
Annoying this turned out to be such a pain. We'll just have to see if we run into any issues in practice. Currently the types of objects that will appear in a stream are fairly limited so I don't foresee any immediate issues. Will merge. |
This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
Addresses #1067.
This approach isn't perfect but it will allow a lot more keys to be memoized than before (sets, lists, dictionaries). The main limitation is that composite values as dictionary keys are not allowed by the JSON spec.
Currently pandas Dataframes/Series and numpy arrays are allowed as mutables. In general you will want to support mutable types explicitly as using the
id
does not reflect the contents of the object.This PR isn't ready yet - I want to add a bunch of unit tests.