This is mainly a place for me to gather some notes.
SPARQL Lazy joins
In SPARQL, lots of things end up being a JOIN between two parts. The SPARQL semantics then tell you to evaluate each part independently, loop over both solution sets, and check whether each combination is valid, like this: https://github.com/RDFLib/rdflib/blob/master/rdflib/plugins/sparql/evalutils.py#L27-L31
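To make that baseline concrete, here is a minimal pure-Python sketch of the evaluate-both-then-combine join. It is not the rdflib code linked above: solutions are plain dicts, and the names `compatible` and `naive_join` are made up for illustration.

```python
from itertools import product


def compatible(a, b):
    """Two solutions are compatible if they agree on every shared variable."""
    return all(b[var] == val for var, val in a.items() if var in b)


def naive_join(left, right):
    """Check every pair of solutions and keep the compatible combinations."""
    for a, b in product(left, right):
        if compatible(a, b):
            yield {**a, **b}


# Hypothetical solution sets, each computed independently of the other.
left = [{"?p": "bill"}]
right = [{"?p": "bill", "?q": "carol"}, {"?p": "dave", "?q": "erin"}]
joined = list(naive_join(left, right))
```

The cost is the point here: `right` is computed in full before the join gets a chance to throw most of it away.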
In cases like:
we will have to find everyone who knows someone who knows Bob, and then only at the join stage find out that we actually only care about Bill.
My optimisation for this was, in many cases, to do a "lazy join" (my terminology): instead of evaluating the two parts independently, we evaluate one first, then for each solution pass its bindings into the other part and check: https://github.com/RDFLib/rdflib/blob/master/rdflib/plugins/sparql/evaluate.py#L87-L97
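A rough sketch of the idea, assuming a toy in-memory triple list and string terms where a leading `?` marks a variable. The helpers `match_pattern`, `eval_bgp` and `lazy_join`, and the data, are hypothetical and only meant to show the shape of the technique, not the rdflib implementation.

```python
def match_pattern(pattern, triple, bindings):
    """Extend `bindings` so that `pattern` matches `triple`; None on conflict."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or check it against its existing binding.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def eval_bgp(patterns, triples, bindings=None):
    """Yield every solution for `patterns`, starting from optional `bindings`."""
    bindings = bindings or {}
    if not patterns:
        yield bindings
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        extended = match_pattern(first, triple, bindings)
        if extended is not None:
            yield from eval_bgp(rest, triples, extended)


def lazy_join(part1, part2, triples):
    """Evaluate part1 first, then evaluate part2 once per part1 solution."""
    for solution in eval_bgp(part1, triples):
        yield from eval_bgp(part2, triples, solution)


# Hypothetical data: only Bill's chain should survive the join.
triples = [
    ("bill", "name", "Bill"),
    ("bill", "knows", "carol"),
    ("carol", "knows", "bob"),
    ("bob", "name", "Bob"),
    ("dave", "knows", "erin"),
    ("erin", "knows", "bob"),
]
part1 = [("?p", "name", "Bill")]
part2 = [("?p", "knows", "?q"), ("?q", "knows", "?r"), ("?r", "name", "Bob")]
result = list(lazy_join(part1, part2, triples))
```

Because `?p` is already bound when `part2` starts, the dave/erin chain is never explored.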
If this goes well, we already know that ?p above has to be Bill, and we save tons of work. Unfortunately, SPARQL scoping rules mean that sometimes this is not allowed, and we leak bindings from "further up the tree" that should not be visible. The solution is to hard-code for those cases and "forget" the bindings at the right moments, like here: https://github.com/RDFLib/rdflib/blob/master/rdflib/plugins/sparql/evaluate.py#L155
Finding all the cases where this is required has led to lots of bugs, like: #580, #615, #294 and probably more.
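One case where this kind of "forgetting" is needed is a sub-select: SPARQL only makes the projected variables of a subquery visible outside it, so an outer binding for a non-projected variable of the same name must not be pushed in. The sketch below is an illustration under that assumption, with a hypothetical helper `forget_except`. It is not the code linked above, and it ignores modifiers such as LIMIT, which complicate things further.

```python
def forget_except(bindings, visible):
    """Drop every binding the inner scope is not allowed to see."""
    return {var: val for var, val in bindings.items() if var in visible}


# Hypothetical outer solution, about to be passed into something like
#   { SELECT ?p WHERE { ?p knows ?x } }
# The inner ?x is local to the sub-select, so the outer ?x must be forgotten.
outer_solution = {"?p": "bill", "?x": "carol"}
inner_start = forget_except(outer_solution, {"?p"})
```

Passing `outer_solution` in unchanged would wrongly restrict the inner `?x` to carol.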
This was my ad-hoc fix. I wonder what other SPARQL engines do here; I suspect there is a more principled, conceptually cleaner way to solve this. Maybe re-arranging the query tree into a query plan that is still semantically identical, but better.
If we keep the lazy joins, an easy improvement is to make a conscious choice of which part of the join to do first. Currently we do the one called part1, but that could be anything.
Split BGPs into disjoint parts and join them
Another thing to think about is that sometimes the query splits into two parts without any overlapping variables, slightly contrived:
Currently we will do the triples in some order, and probably recurse much more than needed. We can split the body into two BGP clauses and then join them.
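A possible way to do the split, sketched in plain Python: treat each triple pattern as a node and group patterns that share a variable, directly or through other patterns. The function names and the example BGP are assumptions for illustration, with the same string convention as before (a leading `?` marks a variable).

```python
def variables(pattern):
    """Return the set of variables used in one triple pattern."""
    return {term for term in pattern if term.startswith("?")}


def split_bgp(patterns):
    """Group triple patterns into components that share no variables."""
    components = []  # list of (variable set, pattern list) pairs
    for pattern in patterns:
        new_vars, new_patterns = variables(pattern), [pattern]
        untouched = []
        for comp_vars, comp_patterns in components:
            if comp_vars & new_vars:
                # The new pattern links to this component: merge them.
                new_vars = new_vars | comp_vars
                new_patterns = comp_patterns + new_patterns
            else:
                untouched.append((comp_vars, comp_patterns))
        components = untouched + [(new_vars, new_patterns)]
    return [comp_patterns for _, comp_patterns in components]


# Hypothetical BGP with two unrelated halves interleaved.
bgp = [
    ("?a", "knows", "?b"),
    ("?c", "knows", "?d"),
    ("?b", "name", "Bob"),
    ("?d", "name", "Dan"),
]
parts = split_bgp(bgp)
```

Each component can then be evaluated on its own. Since the components share no variables, joining them is just a cross product of their solutions, so neither half is re-evaluated for every solution of the other.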