This issue was moved to a discussion. You can continue the conversation there.
SPARQL Optimisations #689

Closed
gromgull opened this issue Jan 19, 2017 · 2 comments
gromgull commented Jan 19, 2017

This is mainly a place for me to gather some notes.

SPARQL Lazy joins

In SPARQL, lots of things end up being a JOIN between two parts. The SPARQL semantics then tell you to evaluate each part independently, loop over both solution sets, and check whether each combination is compatible, like this: https://github.com/RDFLib/rdflib/blob/master/rdflib/plugins/sparql/evalutils.py#L27-L31

In cases like:

SELECT ?p WHERE { 
  ?p foaf:knows ?other . 
  ?other foaf:knows ex:Bob . 
} VALUES ( ?p ) { ( ex:Bill ) } 

we will have to find everyone who knows someone who knows Bob, and then only at the join stage find out that we actually only care about Bill.
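The nested-loop join described above can be sketched like this (a minimal illustration, not rdflib's actual code; solution mappings are plain dicts and `compatible`/`join` are made-up names):

```python
# Naive join per the SPARQL semantics: evaluate both sides fully,
# then test every combination of solutions for compatibility.

def compatible(s1, s2):
    """Two solution mappings are compatible if they agree on shared variables."""
    return all(s1[v] == s2[v] for v in s1.keys() & s2.keys())

def join(part1, part2):
    """Nested-loop join: materialise both sides, test every pair."""
    for s1 in part1:
        for s2 in part2:
            if compatible(s1, s2):
                yield {**s1, **s2}

# Everyone who knows someone who knows Bob...
left = [{"p": "Alice", "other": "Carol"}, {"p": "Bill", "other": "Dave"}]
# ...joined against VALUES ( ?p ) { ( ex:Bill ) } only at the very end.
right = [{"p": "Bill"}]
print(list(join(left, right)))  # only the Bill solution survives
```

All the work spent computing the Alice row is thrown away at the join stage.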

My optimisation for this was, in many cases, to do a "lazy join" (my terminology): instead of evaluating the two parts independently, we evaluate one first, then for each solution pass its bindings into the evaluation of the other part and check: https://github.com/RDFLib/rdflib/blob/master/rdflib/plugins/sparql/evaluate.py#L87-L97
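The lazy-join idea can be sketched as follows (a toy evaluator under stated assumptions: `match`, `values` and `lazy_join` are illustrative helpers, not rdflib's API, and IRIs are plain strings):

```python
# Toy graph: who knows whom.
DATA = {("Bill", "knows", "Dave"), ("Dave", "knows", "Bob"),
        ("Ann", "knows", "Eve"), ("Eve", "knows", "Bob")}

def match(pattern):
    """Triple-pattern evaluator: extend incoming bindings against DATA."""
    def evaluate(bindings):
        for triple in DATA:
            out = dict(bindings)
            for term, slot in zip(pattern, triple):
                if term.startswith("?"):
                    if out.setdefault(term, slot) != slot:
                        break          # conflicts with an existing binding
                elif term != slot:
                    break              # constant term does not match
            else:
                yield out
    return evaluate

def values(var, vals):
    """VALUES clause: only keep solutions compatible with the given values."""
    def evaluate(bindings):
        for v in vals:
            if bindings.get(var, v) == v:
                yield {**bindings, var: v}
    return evaluate

def lazy_join(part1, part2):
    """Instead of evaluating part2 independently, seed it with each solution of part1."""
    def evaluate(bindings):
        for s1 in part1(bindings):
            yield from part2(s1)
    return evaluate

# Pushing the VALUES bindings in first means only Bill's neighbourhood is explored.
query = lazy_join(values("?p", ["Bill"]),
                  lazy_join(match(("?p", "knows", "?other")),
                            match(("?other", "knows", "Bob"))))
print(list(query({})))
```

Because `?p` is already bound to Bill when the first triple pattern runs, the Ann/Eve part of the graph is never touched.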

If this goes well, i.e. if we already know that ?p above has to be Bill, we save tons of work. Unfortunately, SPARQL scoping rules mean that sometimes this is not allowed, and we leak bindings from "further up the tree" that should not be visible. The solution is to hard-code for those cases and "forget" the bindings at the right moments, like here: https://github.com/RDFLib/rdflib/blob/master/rdflib/plugins/sparql/evaluate.py#L155

Finding all the cases where this is required has led to lots of bugs, like #580, #615, #294 and probably more.

This was my ad-hoc fix. I wonder what other SPARQL engines do here; I suspect there is a more principled, conceptually cleaner way to solve this. Maybe re-arranging the query tree, producing a query plan that is still semantically identical, but better.

If we keep the lazy joins, an easy improvement is to make a conscious choice about which part of the join to evaluate first. Currently we always do the one called part1, but that could be anything.
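One hypothetical ordering heuristic (not what rdflib does today): evaluate the more selective side first, using the count of constant terms in a pattern as a crude selectivity estimate. A real planner could instead use cardinality statistics.

```python
# Crude selectivity estimate: patterns with more constant (non-variable)
# terms are likely to match fewer triples, so evaluate them first.

def constants(pattern):
    return sum(1 for term in pattern if not term.startswith("?"))

def order_join(part1, part2):
    # more constants -> likely fewer matches -> run that side first
    return (part1, part2) if constants(part1) >= constants(part2) else (part2, part1)

first, second = order_join(("?p", "knows", "?other"), ("?other", "knows", "Bob"))
# the pattern mentioning the constant Bob is chosen to run first
```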

Split BGPs into disjoint parts and join them

Another thing to think about: sometimes the query splits into two parts without any overlapping variables. A slightly contrived example:

SELECT ?p WHERE { 
  ?p foaf:knows ?other1 . 
  ?other1 foaf:knows ex:Bob . 
  ?p2 foaf:knows ?other2 . 
  ?other2 foaf:knows ex:Bill . 
} 

Currently we will evaluate the triples in some order, and probably recurse much more than needed. We could split the body into two disjoint BGP clauses and then join them.
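Splitting a BGP into variable-disjoint groups amounts to finding connected components over shared variables. A sketch, with illustrative helper names and triples as tuples of strings:

```python
from itertools import combinations

def variables(triple):
    """The set of variables mentioned by a triple pattern."""
    return {t for t in triple if t.startswith("?")}

def split_bgp(triples):
    """Merge patterns into groups until no two groups share a variable."""
    groups = [[t] for t in triples]
    merged = True
    while merged:
        merged = False
        for g1, g2 in combinations(groups, 2):
            v1 = set().union(*(variables(t) for t in g1))
            v2 = set().union(*(variables(t) for t in g2))
            if v1 & v2:
                g1.extend(g2)
                groups.remove(g2)
                merged = True
                break  # restart: the group list changed
    return groups

bgp = [("?p", "knows", "?other1"), ("?other1", "knows", "Bob"),
       ("?p2", "knows", "?other2"), ("?other2", "knows", "Bill")]
parts = split_bgp(bgp)
# two independent groups: one over ?p/?other1, one over ?p2/?other2
```

Each resulting group can then be evaluated as its own BGP, and the groups combined with a plain cross-product join, since by construction they share no variables.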

joernhees (Member) commented:
all sounds good...

is this a 4.2.2 blocker?

gromgull (Member, Author) commented:

Hell no, it's not even a 6.0-blocker :)

@ghost ghost locked and limited conversation to collaborators Dec 25, 2021
@ghost ghost converted this issue into discussion #1537 Dec 25, 2021
