
improve hash efficiency by directly using str/unicode hash #746

Merged (1 commit merged into RDFLib:master, May 26, 2017)

Conversation

@joernhees (Member) commented on May 25, 2017

During the investigation of a performance issue in my graph pattern learner, I noticed (sadly not for the first time) that rdflib spends massive amounts of time hashing Identifiers. For example, below you can see that about 34 % of the execution time of a minimal example is spent in Identifier.__hash__:
[profiler screenshot, 2017-05-25 22:43]

This change makes hashing about 25 % more efficient for URIRefs and about 15 % for Literals.

After this change (nothing else changed), my code (which uses various sets, dicts, ...) sees a speedup of roughly 4×:
[profiler screenshot, 2017-05-25 22:44]

Before, hashing performed several string concatenations to build the fully qualified class name (fqn), hashed that, and XORed the result with the plain str/unicode hash. It did this to avoid potential hash collisions between 'foo', URIRef('foo') and Literal('foo').

However, those scenarios can be considered corner cases. Testing the new hashing version in worst-case collision scenarios, it performs very close to the old behavior, but clearly outperforms it in more normal ones.
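
To make the before/after concrete, here is a rough sketch of the two hashing strategies (an illustration only, not the exact rdflib implementation; the class names are made up):

class OldStyleURIRef(str):
    def __eq__(self, other):
        return type(self) == type(other) and str.__eq__(self, other)

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        # build the fully qualified class name on every call, hash it,
        # and XOR it with the plain string hash
        fqn = self.__class__.__module__ + "." + self.__class__.__name__
        return hash(fqn) ^ str.__hash__(self)

class NewStyleURIRef(str):
    def __eq__(self, other):
        # type-aware equality; rare value clashes such as
        # 'foo' vs. NewStyleURIRef('foo') are resolved here
        return type(self) == type(other) and str.__eq__(self, other)

    def __ne__(self, other):
        return not self.__eq__(other)

    # reuse the parent's hash directly, no per-call string building
    __hash__ = str.__hash__

s = {u'foo', NewStyleURIRef(u'foo')}
assert len(s) == 2  # same hash value, but __eq__ keeps them distinct
assert hash(NewStyleURIRef(u'foo')) == hash(u'foo')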

Test code for ipython:

from rdflib import URIRef, Literal

def test():
    # worst case collisions
    s = set()
    for i in range(100000):
        _s = u'foo:%d' % i
        s.add(_s)
        s.add(URIRef(_s))
        s.add(Literal(_s))
    assert len(s) == 300000

%timeit test()

"more natural ones:"
%timeit set(URIRef('asldkfjlsadkfsaldfj:%d' % i) for i in range(100000))
%timeit set(Literal('asldkfjlsadkfsaldfj%d' % i) for i in range(100000))

Results (in order: the worst-case test(), the URIRef set, the Literal set):

Old:
1 loop, best of 3: 940 ms per loop
1 loop, best of 3: 334 ms per loop
1 loop, best of 3: 610 ms per loop

New:
1 loop, best of 3: 945 ms per loop
1 loop, best of 3: 250 ms per loop
1 loop, best of 3: 515 ms per loop

@joernhees added the enhancement and performance labels on May 25, 2017
@joernhees added this to the rdflib 5.0.0 milestone on May 25, 2017
@joernhees requested a review from gromgull on May 25, 2017
# clashes of 'foo', URIRef('foo') and Literal('foo') are typically so rare
# that they don't justify additional overhead. Notice that even in case of
# clash __eq__ is still the fallback and very quick in those cases.
__hash__ = text_type.__hash__
@gromgull (Member) commented on the lines above:

This gives you something to attach the comment to, but otherwise does this line do anything? The class already inherits from text_type?

@joernhees (Member, Author) replied:

In py2 this is irrelevant; in py3 (from https://docs.python.org/3/reference/datamodel.html#object.__hash__):

A class that overrides __eq__() and does not define __hash__() will have its __hash__() implicitly set to None. When the __hash__() method of a class is None, instances of the class will raise an appropriate TypeError when a program attempts to retrieve their hash value, and will also be correctly identified as unhashable when checking isinstance(obj, collections.Hashable).

If a class that overrides __eq__() needs to retain the implementation of __hash__() from a parent class, the interpreter must be told this explicitly by setting __hash__ = <ParentClass>.__hash__.

We override __eq__ in two places, Identifier and Literal; both now also have an explicit __hash__, as they would otherwise not be hashable in py3.
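
A minimal, self-contained demonstration of that rule (illustrative classes only, not rdflib code):

class Broken(str):
    def __eq__(self, other):
        return type(self) == type(other) and str.__eq__(self, other)
    # no explicit __hash__: Python 3 implicitly sets __hash__ = None

class Fixed(str):
    def __eq__(self, other):
        return type(self) == type(other) and str.__eq__(self, other)
    __hash__ = str.__hash__  # explicitly keep the parent class's hash

try:
    {Broken('foo')}
except TypeError as e:
    print(e)  # unhashable type: 'Broken'

print(hash(Fixed('foo')) == hash('foo'))  # True: str's hash is reused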

@gromgull (Member) replied:

I see! You learn something new every day!

@gromgull (Member) commented:

Good catch!

@gromgull merged commit 7c65b34 into RDFLib:master on May 26, 2017
@joernhees deleted the hash_efficiency branch on May 26, 2017