Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalized hashing of keys for memoization #1074

Merged
merged 14 commits into from
Jan 18, 2017
Merged

Conversation

jlstevens
Copy link
Contributor

@jlstevens jlstevens commented Jan 18, 2017

Addresses #1067.

This approach isn't perfect but it will allow a lot more keys to be memoized than before (sets, lists, dictionaries). The main limitation is that composite values as dictionary keys are not allowed by the JSON spec.

Currently pandas Dataframes/Series and numpy arrays are allowed as mutables. In general you will want to support mutable types explicitly as using the id does not reflect the contents of the object.

This PR isn't ready yet - I want to add a bunch of unit tests.

@jlstevens
Copy link
Contributor Author

jlstevens commented Jan 18, 2017

Just some quick examples/performance statistics:

%%timeit
   ...: keyhash([datetime.datetime(1,2,3), set([1,2,3]), 
                 pd.DataFrame({'a':[1,2],'b':[3,4]}), 
                 np.array([1,2,3]), np.int64(34)])
   ...:
1000 loops, best of 3: 659 µs per loop
%%timeit
   ...: keyhash([1,2,3, (4,5)])
   ...:
10000 loops, best of 3: 23.5 µs per loop

Unsurprisingly, big keys hash slower than small keys (I expect most keys to be pretty small). Note that datetime objects are also supported.

@philippjfr
Copy link
Member

Looks good to me. I doubt we'll catch all the cases we'll need straight away so it's good you can add repr_hashable and str_hashable types. Happy to merge when you've added some unit tests.

supplied, the objects id is used - this may be an issue when dealing
with mutable objects. For this reason, it is important to support
common mutable types such as pandas Dataframes and numpy arrays
(supported).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bit discussing mutable objects is confusing -- it calls out Pandas and Numpy, but doesn't mention dicts, lists, sets, etc. -- surely those are both mutable and supported as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The distinction is between what is supported directly by JSON (dicts, lists) and things that we need to add support for it HashGenerator. Currently support is added for sets, datetime, numpy arrays and pandas dataframes/series. More things may need to be explicitly supported in future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, please reword to convey that.

"""
Extends JSONEncoder to generate a hashable string for as many types
of key as possible. These keys are supplied by widgets and Stream
parameters and may be nested. If an unrecognized object type is
Copy link
Member

@jbednar jbednar Jan 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the docstring for this class should mention widgets or Streams in any way other than something explicitly labeled as an example. It could mention memoization, though, so that we know the goal of this object is to generate an id that can tell whether two objects differ, not to provide an encoding of the object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I'm confused about whether it's truly a HashGenerator -- the docstring says it will generate a hashable string, not a hash, and the code indicates that sometimes it's one, sometimes it's the other. Which is it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HashableStringGenerator is a pretty ugly name but probably more accurate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the docstring for this class should mention widgets or Streams in any way other than something explicitly labeled as an example. It could mention memoization, though, so that we know the goal of this object is to generate an id that can tell whether two objects differ, not to provide an encoding of the object.

Happy to update the docstring as suggested.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do! Thanks.

Copy link
Contributor Author

@jlstevens jlstevens Jan 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, all strings are hashable already. It is about taking an arbitrary thing and generating a unique string from it suitable to generate a hash. I agree that is a less ugly name though... @philippjfr Any suggestions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StringForHashing? Not great.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HashableJSON might be better - the output is a JSON string with some customised support for additional types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HashableJSONEncoder if we want to be pedantic...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HashableJSON seems fine to me.




def keyhash(key, as_string=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is this hashable_string?

In any case, I'm not sure in what sense the argument is a "key".

Copy link
Contributor Author

@jlstevens jlstevens Jan 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is this hashable_string?

No. I'll get rid of the as_string argument as I only needed it temporarily for debugging. The output is an integer, as returned by the Python hash function.

I am happy to rename it. In the context it is used (memoisation) the argument is what we have always called a 'key'. I'll just call it obj here... but then this function will then need a new name. We can't simply use hash clashes with the built-in. Again, any suggestions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, you'll use it as a key, but that doesn't mean the function has anything to do with keys. obj is a good name for the argument; that's what it is (an arbitrary object). For the function name, maybe objhash (to contrast it with the regular hash(), which doesn't work for arbitrary objects)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, you'll use it as a key, but that doesn't mean the function has anything to do with keys.

Agreed! Not that keen on objhash as a name though...

@jlstevens
Copy link
Contributor Author

Implemented suggestions and added unit tests. Ready for final review/merge once tests pass.

(supported).
of object as possible including nested objects and objects that are
not normally hashable. The purpose of this class is to generate
unique strings that once hashed are suitable for memoization.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking, it's not the hashed strings that are memoized, it's the actual results (and only implicitly even those, in this case). The hashed strings are useful for memoization, but they themselves are not memoized.

So maybe "The purpose of this class is to generate unique strings that once hashed are suitable for use in memoization and other cases where deep equality must be tested without storing the entire object".


By default JSONEncoder supports booleans, numbers, strings, lists,
tuples and dictionaries. In order to support other types such as
sets, datetime objects and mutable objects such as pandas Dataframes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and custom mutable

@jbednar
Copy link
Member

jbednar commented Jan 18, 2017

Not sure why it's failing, but it looks good to me apart from the minor docstring fixes suggested.

@jlstevens
Copy link
Contributor Author

jlstevens commented Jan 18, 2017

Python 3 has an additional limitation - the sort_keys option in json.dumps fails if it encounters dictionaries with heterogenous keys. This is a very clear bug as the final JSON keys are always strings which can be sorted but unfortunately this issue is unlikely to be fixed.

If we are happy to accept these limitations, the PR is ready to merge now the test have passed. Even with its caveats/holes, this approach greatly expands what can be memoized (before this PR, even a simple list wouldn't allow memoization).

@jbednar
Copy link
Member

jbednar commented Jan 18, 2017

Looks good to me.

@philippjfr
Copy link
Member

Annoying this turned out to be such a pain. We'll just have to see if we run into any issues in practice. Currently the types of objects that will appear in a stream are fairly limited so I don't foresee any immediate issues. Will merge.

@philippjfr philippjfr merged commit facf323 into master Jan 18, 2017
@philippjfr philippjfr deleted the nested_key_hashes branch January 27, 2017 02:53
Copy link

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 26, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants