This issue was moved to a discussion. You can continue the conversation there.
SPARQL Optimisations #689

Closed
gromgull opened this issue Jan 19, 2017 · 2 comments
gromgull commented Jan 19, 2017

This is mainly a place for me to gather some notes.

SPARQL Lazy joins

In SPARQL, lots of things end up being a JOIN between two parts. The SPARQL semantics then tell you to evaluate each part independently, loop over both solution sets, and check whether each combination is compatible, like this: https://github.com/RDFLib/rdflib/blob/master/rdflib/plugins/sparql/evalutils.py#L27-L31

In cases like:

SELECT ?p WHERE { 
  ?p foaf:knows ?other . 
  ?other foaf:knows ex:Bob . 
} VALUES ( ?p ) { ( ex:Bill ) } 

we will have to find everyone who knows someone who knows Bob, and then only at the join stage find out that we actually only care about Bill.
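The nested-loop join described above can be sketched like this (a minimal illustration, not rdflib's actual code; solution mappings are plain dicts and `compatible`/`join` are made-up names):

```python
# Naive join per the SPARQL semantics: evaluate both sides fully,
# then test every combination of solutions for compatibility.

def compatible(s1, s2):
    """Two solution mappings are compatible if they agree on shared variables."""
    return all(s1[v] == s2[v] for v in s1.keys() & s2.keys())

def join(part1, part2):
    """Nested-loop join: materialise both sides, test every pair."""
    for s1 in part1:
        for s2 in part2:
            if compatible(s1, s2):
                yield {**s1, **s2}

# Everyone who knows someone who knows Bob...
left = [{"p": "Alice", "other": "Carol"}, {"p": "Bill", "other": "Dave"}]
# ...joined against VALUES ( ?p ) { ( ex:Bill ) } only at the very end.
right = [{"p": "Bill"}]
print(list(join(left, right)))  # only the Bill solution survives
```

All the work spent computing the Alice row is thrown away at the join stage.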

My optimisation for this was, in many cases, to do a "lazy join" (my terminology): instead of evaluating the two parts independently, we evaluate one first, then for each solution pass its bindings into the evaluation of the other part and check: https://github.com/RDFLib/rdflib/blob/master/rdflib/plugins/sparql/evaluate.py#L87-L97
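The lazy-join idea can be sketched as follows (a toy evaluator under stated assumptions: `match`, `values` and `lazy_join` are illustrative helpers, not rdflib's API, and IRIs are plain strings):

```python
# Toy graph: who knows whom.
DATA = {("Bill", "knows", "Dave"), ("Dave", "knows", "Bob"),
        ("Ann", "knows", "Eve"), ("Eve", "knows", "Bob")}

def match(pattern):
    """Triple-pattern evaluator: extend incoming bindings against DATA."""
    def evaluate(bindings):
        for triple in DATA:
            out = dict(bindings)
            for term, slot in zip(pattern, triple):
                if term.startswith("?"):
                    if out.setdefault(term, slot) != slot:
                        break          # conflicts with an existing binding
                elif term != slot:
                    break              # constant term does not match
            else:
                yield out
    return evaluate

def values(var, vals):
    """VALUES clause: only keep solutions compatible with the given values."""
    def evaluate(bindings):
        for v in vals:
            if bindings.get(var, v) == v:
                yield {**bindings, var: v}
    return evaluate

def lazy_join(part1, part2):
    """Instead of evaluating part2 independently, seed it with each solution of part1."""
    def evaluate(bindings):
        for s1 in part1(bindings):
            yield from part2(s1)
    return evaluate

# Pushing the VALUES bindings in first means only Bill's neighbourhood is explored.
query = lazy_join(values("?p", ["Bill"]),
                  lazy_join(match(("?p", "knows", "?other")),
                            match(("?other", "knows", "Bob"))))
print(list(query({})))
```

Because `?p` is already bound to Bill when the first triple pattern runs, the Ann/Eve part of the graph is never touched.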

If this goes well, i.e. if we already know that ?p above has to be Bill, we save tons of work. Unfortunately, SPARQL scoping rules mean that sometimes this is not allowed, and we leak bindings from "further up the tree" that should not be visible. The solution is to hard-code for those cases and "forget" the bindings at the right moments, like here: https://github.com/RDFLib/rdflib/blob/master/rdflib/plugins/sparql/evaluate.py#L155

Finding all the cases where this is required has led to lots of bugs, like #580, #615, #294 and probably more.

This was my ad-hoc fix. I wonder what other SPARQL engines do here; I suspect there is a more principled, conceptually cleaner way to solve this. Maybe re-arranging the query tree, producing a query plan that is still semantically identical, but better.

If we keep the lazy joins, an easy improvement is to make a conscious choice about which part of the join to evaluate first. Currently we always do the one called part1, but that could be anything.
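One hypothetical ordering heuristic (not what rdflib does today): evaluate the more selective side first, using the count of constant terms in a pattern as a crude selectivity estimate. A real planner could instead use cardinality statistics.

```python
# Crude selectivity estimate: patterns with more constant (non-variable)
# terms are likely to match fewer triples, so evaluate them first.

def constants(pattern):
    return sum(1 for term in pattern if not term.startswith("?"))

def order_join(part1, part2):
    # more constants -> likely fewer matches -> run that side first
    return (part1, part2) if constants(part1) >= constants(part2) else (part2, part1)

first, second = order_join(("?p", "knows", "?other"), ("?other", "knows", "Bob"))
# the pattern mentioning the constant Bob is chosen to run first
```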

Split BGPs into disjoint parts and join them

Another thing to think about: sometimes the query splits into two parts without any overlapping variables. A slightly contrived example:

SELECT ?p WHERE { 
  ?p foaf:knows ?other1 . 
  ?other1 foaf:knows ex:Bob . 
  ?p2 foaf:knows ?other2 . 
  ?other2 foaf:knows ex:Bill . 
} 

Currently we will evaluate the triples in some order, and probably recurse much more than needed. We could split the body into two disjoint BGP clauses and then join them.
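Splitting a BGP into variable-disjoint groups amounts to finding connected components over shared variables. A sketch, with illustrative helper names and triples as tuples of strings:

```python
from itertools import combinations

def variables(triple):
    """The set of variables mentioned by a triple pattern."""
    return {t for t in triple if t.startswith("?")}

def split_bgp(triples):
    """Merge patterns into groups until no two groups share a variable."""
    groups = [[t] for t in triples]
    merged = True
    while merged:
        merged = False
        for g1, g2 in combinations(groups, 2):
            v1 = set().union(*(variables(t) for t in g1))
            v2 = set().union(*(variables(t) for t in g2))
            if v1 & v2:
                g1.extend(g2)
                groups.remove(g2)
                merged = True
                break  # restart: the group list changed
    return groups

bgp = [("?p", "knows", "?other1"), ("?other1", "knows", "Bob"),
       ("?p2", "knows", "?other2"), ("?other2", "knows", "Bill")]
parts = split_bgp(bgp)
# two independent groups: one over ?p/?other1, one over ?p2/?other2
```

Each resulting group can then be evaluated as its own BGP, and the groups combined with a plain cross-product join, since by construction they share no variables.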

joernhees (Member) commented:
all sounds good...

is this a 4.2.2 blocker?

gromgull (Member, Author) commented:

Hell no, it's not even a 6.0-blocker :)

@ghost ghost locked and limited conversation to collaborators Dec 25, 2021
@ghost ghost converted this issue into discussion #1537 Dec 25, 2021
