RGDA1 graph canonicalization sometimes still collapses distinct BNodes #725

joernhees · 2017-03-29T22:46:36Z

During the evaluation of my graph pattern learner i'm currently trying to generate all possible (different) SPARQL BGPs of a given length (5 at the moment). With up to 11 variables, enumerating all of those graphs might be stretching it a bit, but i'm nearly there. However, to do this, i need the canonical representations of SPARQL BGPs. As discussed before (#483), i'm reducing SPARQL BGPs (and especially their variables) to RDF graphs with BNodes (see here if interested), then run RGDA1 on it, and map the canonical BNode labels to the SPARQL Variables.

Similarly to #494, I noticed that sometimes during this process i still lose nodes. Minimal test-case below (PR with test will follow):

g = rdflib.Graph()
g += [
 (rdflib.term.BNode('N0a76d42406b84fe4b8029d0a7fa04244'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#object'),
  rdflib.term.BNode('v2')),
 (rdflib.term.BNode('N0a76d42406b84fe4b8029d0a7fa04244'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate'),
  rdflib.term.BNode('v0')),
 (rdflib.term.BNode('N0a76d42406b84fe4b8029d0a7fa04244'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#subject'),
  rdflib.term.URIRef(u'urn:gp_learner:fixed_var:target')),
 (rdflib.term.BNode('N0a76d42406b84fe4b8029d0a7fa04244'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement')),
 (rdflib.term.BNode('N2f62af5936b94a8eb4b1e4bfa8e11d95'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#object'),
  rdflib.term.BNode('v1')),
 (rdflib.term.BNode('N2f62af5936b94a8eb4b1e4bfa8e11d95'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate'),
  rdflib.term.BNode('v0')),
 (rdflib.term.BNode('N2f62af5936b94a8eb4b1e4bfa8e11d95'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#subject'),
  rdflib.term.URIRef(u'urn:gp_learner:fixed_var:target')),
 (rdflib.term.BNode('N2f62af5936b94a8eb4b1e4bfa8e11d95'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement')),
 (rdflib.term.BNode('N5ae541f93e1d4e5880450b1bdceb6404'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#object'),
  rdflib.term.BNode('v5')),
 (rdflib.term.BNode('N5ae541f93e1d4e5880450b1bdceb6404'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate'),
  rdflib.term.BNode('v4')),
 (rdflib.term.BNode('N5ae541f93e1d4e5880450b1bdceb6404'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#subject'),
  rdflib.term.URIRef(u'urn:gp_learner:fixed_var:target')),
 (rdflib.term.BNode('N5ae541f93e1d4e5880450b1bdceb6404'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement')),
 (rdflib.term.BNode('N86ac7ca781f546ae939b8963895f672e'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#object'),
  rdflib.term.URIRef(u'urn:gp_learner:fixed_var:source')),
 (rdflib.term.BNode('N86ac7ca781f546ae939b8963895f672e'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate'),
  rdflib.term.BNode('v0')),
 (rdflib.term.BNode('N86ac7ca781f546ae939b8963895f672e'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#subject'),
  rdflib.term.URIRef(u'urn:gp_learner:fixed_var:target')),
 (rdflib.term.BNode('N86ac7ca781f546ae939b8963895f672e'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement')),
 (rdflib.term.BNode('Nac82b883ca3849b5ab6820b7ac15e490'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#object'),
  rdflib.term.BNode('v1')),
 (rdflib.term.BNode('Nac82b883ca3849b5ab6820b7ac15e490'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate'),
  rdflib.term.BNode('v3')),
 (rdflib.term.BNode('Nac82b883ca3849b5ab6820b7ac15e490'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#subject'),
  rdflib.term.URIRef(u'urn:gp_learner:fixed_var:target')),
 (rdflib.term.BNode('Nac82b883ca3849b5ab6820b7ac15e490'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
  rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement'))]

cg = rdflib.compare.to_canonical_graph(g)

for g we will get the following "stats":

print 'graph length: %d, nodes: %d' % (len(g), len(g.all_nodes()))
print 'triple_bnode degrees:'
for triple_bnode in g.subjects(rdflib.RDF['type'], rdflib.RDF['Statement']):
    print len(list(g.triples([triple_bnode, None, None])))
print 'all node out-degrees:'
print sorted([len(list(g.triples([node, None, None]))) for node in g.all_nodes()])
print 'all node in-degrees:'
print sorted([len(list(g.triples([None, None, node]))) for node in g.all_nodes()])

output:

graph length: 20, nodes: 14
triple_bnode degrees:
4
4
4
4
4
all node out-degrees:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4]
all node in-degrees:
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 3, 5, 5]

for cg we'll get the following:

print 'graph length: %d, nodes: %d' % (len(cg), len(cg.all_nodes()))
print 'triple_bnode degrees:'
for triple_bnode in cg.subjects(rdflib.RDF['type'], rdflib.RDF['Statement']):
    print len(list(cg.triples([triple_bnode, None, None])))
print 'all node out-degrees:'
print sorted([len(list(cg.triples([node, None, None]))) for node in cg.all_nodes()])
print 'all node in-degrees:'
print sorted([len(list(cg.triples([None, None, node]))) for node in cg.all_nodes()])

output:

graph length: 20, nodes: 13
triple_bnode degrees:
4
4
4
4
4
all node out-degrees:
[0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4]
all node in-degrees:
[0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 5, 5]

@jimmccusker could you maybe have another look?

The text was updated successfully, but these errors were encountered:

jpmccu · 2017-04-09T17:43:10Z

You managed to hit a hash collision, where two branches of node refinements result in the same hash. I've changed things two ways: 1) node refinements always continue to be applied even when the node color groups are already singular, and 2) if there are any duplicate node colors at the end of refinement they are combined so they can be individuated later on.

…ation * upstream/pull/730: Removed extraneous debug statements. Fixed RDFLib#725 by fully refining all bnode colors and combining color groups if there is a hash collision.

fixes #725, huge thanks to @jimmccusker

joernhees · 2017-04-09T19:04:22Z

awesome, big thanks to @jimmccusker :D

joernhees added the bug Something isn't working label Mar 29, 2017

joernhees added this to the rdflib 5.0.0 milestone Mar 29, 2017

joernhees added a commit to joernhees/rdflib that referenced this issue Mar 29, 2017

test for issue RDFLib#725: canonicalization collapses BNodes

6ab6174

joernhees mentioned this issue Mar 29, 2017

test for issue #725: canonicalization collapses BNodes #726

Merged

joernhees changed the title ~~RDGA1 graph canonicalization sometimes still collapses distinct BNodes~~ RGDA1 graph canonicalization sometimes still collapses distinct BNodes Mar 29, 2017

joernhees added the fix-in-progress label Apr 9, 2017

joernhees closed this as completed in 7283c6b Apr 9, 2017

joernhees added a commit that referenced this issue Apr 9, 2017

Merge pull request #726 from joernhees/fix_canonicalization

ee2c08b

fixes #725, huge thanks to @jimmccusker

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RGDA1 graph canonicalization sometimes still collapses distinct BNodes #725

RGDA1 graph canonicalization sometimes still collapses distinct BNodes #725

joernhees commented Mar 29, 2017 •

edited

Loading

jpmccu commented Apr 9, 2017

joernhees commented Apr 9, 2017

RGDA1 graph canonicalization sometimes still collapses distinct BNodes #725

RGDA1 graph canonicalization sometimes still collapses distinct BNodes #725

Comments

joernhees commented Mar 29, 2017 • edited Loading

jpmccu commented Apr 9, 2017

joernhees commented Apr 9, 2017

joernhees commented Mar 29, 2017 •

edited

Loading