improve hash efficiency by directly using str/unicode hash #746
Conversation
During investigation of a performance issue in my graph pattern learner, I noticed (sadly not for the first time) that rdflib spends massive amounts of time hashing Identifiers: in a minimal example, about 34 % of the execution time was spent in Identifier.__hash__.

This change makes hashing about 25 % more efficient for URIRefs and 15 % for Literals. For my own code (which uses various sets, dicts, ...), this change alone means a speedup of ~4x.

Before, hashing performed several string concatenations to build the fully qualified class name (fqn), hashed that, and XORed the result with the plain str/unicode hash. It did this to avoid potential hash collisions between 'foo', URIRef('foo') and Literal('foo'). However, those scenarios can be considered corner cases. Testing the new hashing version in worst-case collision scenarios, it performs very close to the old behavior, but it clearly outperforms the old version in more typical ones.

Test code for ipython:

```python
from rdflib import URIRef, Literal

def test():
    # worst case collisions
    s = set()
    for i in range(100000):
        _s = u'foo:%d' % i
        s.add(_s)
        s.add(URIRef(_s))
        s.add(Literal(_s))
    assert len(s) == 300000

%timeit test()

"more natural ones:"
%timeit set(URIRef('asldkfjlsadkfsaldfj:%d' % i) for i in range(100000))
%timeit set(Literal('asldkfjlsadkfsaldfj%d' % i) for i in range(100000))
```

Results:

Old:
1 loop, best of 3: 940 ms per loop
1 loop, best of 3: 334 ms per loop
1 loop, best of 3: 610 ms per loop

New:
1 loop, best of 3: 945 ms per loop
1 loop, best of 3: 250 ms per loop
1 loop, best of 3: 515 ms per loop
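For illustration, here is a rough sketch of the old and new hashing strategies described above. This is hypothetical code based on the description, not the exact rdflib source: OldIdentifier and NewIdentifier are made-up names, and text_type is assumed to be str on py3 and unicode on py2.

```python
text_type = str  # py3; on py2 this would be unicode

class OldIdentifier(text_type):
    def __eq__(self, other):
        # only equal to instances of the exact same type with equal text
        return type(self) == type(other) and text_type.__eq__(self, other)

    def __hash__(self):
        # old approach: concatenate module and class name to get the fqn,
        # hash it, and XOR with the plain string hash so that 'foo',
        # URIRef('foo') and Literal('foo') hash differently
        fqn = type(self).__module__ + '.' + type(self).__name__
        return hash(fqn) ^ text_type.__hash__(self)

class NewIdentifier(text_type):
    def __eq__(self, other):
        return type(self) == type(other) and text_type.__eq__(self, other)

    # new approach: reuse the string hash directly; cross-type clashes
    # are rare, and __eq__ remains the cheap fallback when they do occur
    __hash__ = text_type.__hash__
```

The reviewed hunk below shows the actual new implementation in rdflib.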
```python
# clashes of 'foo', URIRef('foo') and Literal('foo') are typically so rare
# that they don't justify additional overhead. Notice that even in case of
# clash __eq__ is still the fallback and very quick in those cases.
__hash__ = text_type.__hash__
```
Does this just give you something to attach the comment to? Otherwise this line does nothing, right? The class already inherits from text_type.
In py2 this is irrelevant, but in py3 it is needed (from https://docs.python.org/3/reference/datamodel.html#object.__hash__):
> A class that overrides __eq__() and does not define __hash__() will have its __hash__() implicitly set to None. When the __hash__() method of a class is None, instances of the class will raise an appropriate TypeError when a program attempts to retrieve their hash value, and will also be correctly identified as unhashable when checking isinstance(obj, collections.Hashable). If a class that overrides __eq__() needs to retain the implementation of __hash__() from a parent class, the interpreter must be told this explicitly by setting __hash__ = <ParentClass>.__hash__.
We override __eq__ in two places: Identifier and Literal. Both also have an explicit __hash__, as they would otherwise fail to be hashable in py3.
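A minimal py3 snippet demonstrating the quoted behavior (the class names Broken and Fixed are made up for illustration):

```python
import collections.abc

class Broken(str):
    def __eq__(self, other):
        return str.__eq__(self, other)
    # no __hash__ here, so py3 implicitly sets Broken.__hash__ = None

class Fixed(str):
    def __eq__(self, other):
        return str.__eq__(self, other)
    __hash__ = str.__hash__  # explicitly retain the parent class's hash

try:
    hash(Broken('foo'))
except TypeError as err:
    print(err)  # unhashable type: 'Broken'

print(isinstance(Broken('foo'), collections.abc.Hashable))  # False
print(hash(Fixed('foo')) == hash('foo'))                    # True
```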
I see! You learn something new every day!
Good catch!