
improve hash efficiency by directly using str/unicode hash #746

Merged (1 commit merged into RDFLib:master, May 26, 2017)

Conversation

@joernhees (Member) commented on May 25, 2017

During the investigation of a performance issue in my graph pattern learner, I noticed (sadly not for the first time) that rdflib spends massive amounts of time hashing Identifiers. For example, below you can see that about 34 % of the execution time of a minimal example is spent in Identifier.__hash__:
[profiler screenshot, 2017-05-25 22:43]

This change makes hashing about 25 % more efficient for URIRefs and about 15 % for Literals.

After this change (nothing else changed), my code (which uses various sets, dicts, ...) sees a speedup of roughly 4×:
[profiler screenshot, 2017-05-25 22:44]

Before, hashing performed several string concatenations to build the fully qualified class name (fqn), hashed that, and XORed the result with the plain str/unicode hash. It did this to avoid potential hash collisions between 'foo', URIRef('foo') and Literal('foo').

However, those scenarios can be considered corner cases. Testing the new hashing version in worst-case collision scenarios, it performs very close to the old behavior, but clearly outperforms it in more normal ones.
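
To make the before/after concrete, here is a rough sketch of the two hashing strategies (an illustration only, not the exact rdflib implementation; the class names are made up):

class OldStyleURIRef(str):
    def __eq__(self, other):
        return type(self) == type(other) and str.__eq__(self, other)

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        # build the fully qualified class name on every call, hash it,
        # and XOR it with the plain string hash
        fqn = self.__class__.__module__ + "." + self.__class__.__name__
        return hash(fqn) ^ str.__hash__(self)

class NewStyleURIRef(str):
    def __eq__(self, other):
        # type-aware equality; rare value clashes such as
        # 'foo' vs. NewStyleURIRef('foo') are resolved here
        return type(self) == type(other) and str.__eq__(self, other)

    def __ne__(self, other):
        return not self.__eq__(other)

    # reuse the parent's hash directly, no per-call string building
    __hash__ = str.__hash__

s = {u'foo', NewStyleURIRef(u'foo')}
assert len(s) == 2  # same hash value, but __eq__ keeps them distinct
assert hash(NewStyleURIRef(u'foo')) == hash(u'foo')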

Test code for ipython:

from rdflib import URIRef, Literal

def test():
    # worst case collisions
    s = set()
    for i in range(100000):
        _s = u'foo:%d' % i
        s.add(_s)
        s.add(URIRef(_s))
        s.add(Literal(_s))
    assert len(s) == 300000

%timeit test()

"more natural ones:"
%timeit set(URIRef('asldkfjlsadkfsaldfj:%d' % i) for i in range(100000))
%timeit set(Literal('asldkfjlsadkfsaldfj%d' % i) for i in range(100000))

Results (in order: the worst-case test(), the URIRef set, the Literal set):

Old:
1 loop, best of 3: 940 ms per loop
1 loop, best of 3: 334 ms per loop
1 loop, best of 3: 610 ms per loop

New:
1 loop, best of 3: 945 ms per loop
1 loop, best of 3: 250 ms per loop
1 loop, best of 3: 515 ms per loop

@joernhees added the enhancement and performance labels on May 25, 2017
@joernhees added this to the rdflib 5.0.0 milestone on May 25, 2017
@joernhees requested a review from gromgull on May 25, 2017
# clashes of 'foo', URIRef('foo') and Literal('foo') are typically so rare
# that they don't justify additional overhead. Notice that even in case of
# clash __eq__ is still the fallback and very quick in those cases.
__hash__ = text_type.__hash__
@gromgull (Member) commented on the lines above:

This gives you something to attach the comment to, but otherwise does this line do anything? The class already inherits from text_type?

@joernhees (Member, Author) replied:

In py2 this is irrelevant; in py3 (from https://docs.python.org/3/reference/datamodel.html#object.__hash__):

A class that overrides __eq__() and does not define __hash__() will have its __hash__() implicitly set to None. When the __hash__() method of a class is None, instances of the class will raise an appropriate TypeError when a program attempts to retrieve their hash value, and will also be correctly identified as unhashable when checking isinstance(obj, collections.Hashable).

If a class that overrides __eq__() needs to retain the implementation of __hash__() from a parent class, the interpreter must be told this explicitly by setting __hash__ = <ParentClass>.__hash__.

We override __eq__ in two places, Identifier and Literal; both now also have an explicit __hash__, as they would otherwise not be hashable in py3.
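
A minimal, self-contained demonstration of that rule (illustrative classes only, not rdflib code):

class Broken(str):
    def __eq__(self, other):
        return type(self) == type(other) and str.__eq__(self, other)
    # no explicit __hash__: Python 3 implicitly sets __hash__ = None

class Fixed(str):
    def __eq__(self, other):
        return type(self) == type(other) and str.__eq__(self, other)
    __hash__ = str.__hash__  # explicitly keep the parent class's hash

try:
    {Broken('foo')}
except TypeError as e:
    print(e)  # unhashable type: 'Broken'

print(hash(Fixed('foo')) == hash('foo'))  # True: str's hash is reused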

@gromgull (Member) replied:

I see! You learn something new every day!

@gromgull (Member) commented:

Good catch!

@gromgull merged commit 7c65b34 into RDFLib:master on May 26, 2017
@joernhees deleted the hash_efficiency branch on May 26, 2017