Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RGDA1 graph canonicalization sometimes still collapses distinct BNodes #725

Closed
joernhees opened this issue Mar 29, 2017 · 2 comments
Closed
Labels
bug Something isn't working fix-in-progress
Milestone

Comments

@joernhees
Copy link
Member

joernhees commented Mar 29, 2017

During the evaluation of my graph pattern learner i'm currently trying to generate all possible (different) SPARQL BGPs of a given length (5 at the moment). With up to 11 variables, enumerating all of those graphs might be stretching it a bit, but i'm nearly there. However, to do this, i need the canonical representations of SPARQL BGPs. As discussed before (#483), i'm reducing SPARQL BGPs (and especially their variables) to RDF graphs with BNodes (see here if interested), then run RGDA1 on it, and map the canonical BNode labels to the SPARQL Variables.

Similarly to #494, I noticed that sometimes during this process i still lose nodes. Minimal test-case below (PR with test will follow):

g = rdflib.Graph()
g += [
 (rdflib.term.BNode('N0a76d42406b84fe4b8029d0a7fa04244'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#object'),
  rdflib.term.BNode('v2')),
 (rdflib.term.BNode('N0a76d42406b84fe4b8029d0a7fa04244'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate'),
  rdflib.term.BNode('v0')),
 (rdflib.term.BNode('N0a76d42406b84fe4b8029d0a7fa04244'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#subject'),
  rdflib.term.URIRef(u'urn:gp_learner:fixed_var:target')),
 (rdflib.term.BNode('N0a76d42406b84fe4b8029d0a7fa04244'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement')),
 (rdflib.term.BNode('N2f62af5936b94a8eb4b1e4bfa8e11d95'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#object'),
  rdflib.term.BNode('v1')),
 (rdflib.term.BNode('N2f62af5936b94a8eb4b1e4bfa8e11d95'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate'),
  rdflib.term.BNode('v0')),
 (rdflib.term.BNode('N2f62af5936b94a8eb4b1e4bfa8e11d95'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#subject'),
  rdflib.term.URIRef(u'urn:gp_learner:fixed_var:target')),
 (rdflib.term.BNode('N2f62af5936b94a8eb4b1e4bfa8e11d95'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement')),
 (rdflib.term.BNode('N5ae541f93e1d4e5880450b1bdceb6404'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#object'),
  rdflib.term.BNode('v5')),
 (rdflib.term.BNode('N5ae541f93e1d4e5880450b1bdceb6404'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate'),
  rdflib.term.BNode('v4')),
 (rdflib.term.BNode('N5ae541f93e1d4e5880450b1bdceb6404'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#subject'),
  rdflib.term.URIRef(u'urn:gp_learner:fixed_var:target')),
 (rdflib.term.BNode('N5ae541f93e1d4e5880450b1bdceb6404'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement')),
 (rdflib.term.BNode('N86ac7ca781f546ae939b8963895f672e'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#object'),
  rdflib.term.URIRef(u'urn:gp_learner:fixed_var:source')),
 (rdflib.term.BNode('N86ac7ca781f546ae939b8963895f672e'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate'),
  rdflib.term.BNode('v0')),
 (rdflib.term.BNode('N86ac7ca781f546ae939b8963895f672e'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#subject'),
  rdflib.term.URIRef(u'urn:gp_learner:fixed_var:target')),
 (rdflib.term.BNode('N86ac7ca781f546ae939b8963895f672e'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement')),
 (rdflib.term.BNode('Nac82b883ca3849b5ab6820b7ac15e490'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#object'),
  rdflib.term.BNode('v1')),
 (rdflib.term.BNode('Nac82b883ca3849b5ab6820b7ac15e490'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate'),
  rdflib.term.BNode('v3')),
 (rdflib.term.BNode('Nac82b883ca3849b5ab6820b7ac15e490'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#subject'),
  rdflib.term.URIRef(u'urn:gp_learner:fixed_var:target')),
 (rdflib.term.BNode('Nac82b883ca3849b5ab6820b7ac15e490'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement'))]

cg = rdflib.compare.to_canonical_graph(g)

for g we will get the following "stats":

print 'graph length: %d, nodes: %d' % (len(g), len(g.all_nodes()))
print 'triple_bnode degrees:'
for triple_bnode in g.subjects(rdflib.RDF['type'], rdflib.RDF['Statement']):
    print len(list(g.triples([triple_bnode, None, None])))
print 'all node out-degrees:'
print sorted([len(list(g.triples([node, None, None]))) for node in g.all_nodes()])
print 'all node in-degrees:'
print sorted([len(list(g.triples([None, None, node]))) for node in g.all_nodes()])

output:

graph length: 20, nodes: 14
triple_bnode degrees:
4
4
4
4
4
all node out-degrees:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4]
all node in-degrees:
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 3, 5, 5]

for cg we'll get the following:

print 'graph length: %d, nodes: %d' % (len(cg), len(cg.all_nodes()))
print 'triple_bnode degrees:'
for triple_bnode in cg.subjects(rdflib.RDF['type'], rdflib.RDF['Statement']):
    print len(list(cg.triples([triple_bnode, None, None])))
print 'all node out-degrees:'
print sorted([len(list(cg.triples([node, None, None]))) for node in cg.all_nodes()])
print 'all node in-degrees:'
print sorted([len(list(cg.triples([None, None, node]))) for node in cg.all_nodes()])

output:

graph length: 20, nodes: 13
triple_bnode degrees:
4
4
4
4
4
all node out-degrees:
[0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4]
all node in-degrees:
[0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 5, 5]

@jimmccusker could you maybe have another look?

@joernhees joernhees added the bug Something isn't working label Mar 29, 2017
@joernhees joernhees added this to the rdflib 5.0.0 milestone Mar 29, 2017
joernhees added a commit to joernhees/rdflib that referenced this issue Mar 29, 2017
@joernhees joernhees changed the title RDGA1 graph canonicalization sometimes still collapses distinct BNodes RGDA1 graph canonicalization sometimes still collapses distinct BNodes Mar 29, 2017
@jpmccu
Copy link
Contributor

jpmccu commented Apr 9, 2017

You managed to hit a hash collision, where two branches of node refinements result in the same hash. I've changed things two ways: 1) node refinements always continue to be applied even when the node color groups are already singular, and 2) if there are any duplicate node colors at the end of refinement they are combined so they can be individuated later on.

joernhees added a commit to joernhees/rdflib that referenced this issue Apr 9, 2017
…ation

* upstream/pull/730:
  Removed extraneous debug statements.
  Fixed RDFLib#725 by fully refining all bnode colors and combining color groups if there is a hash collision.
joernhees added a commit that referenced this issue Apr 9, 2017
@joernhees
Copy link
Member Author

awesome, big thanks to @jimmccusker :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working fix-in-progress
Projects
None yet
Development

No branches or pull requests

2 participants