-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Improve skipping logic for after values in sort query #13313
Comments
Is my understanding correct that this would possibly help with deep pagination but hurt with the first few pages since it may force Lucene to evaluate a BKD range query that matches most docs? |
FWIW there is a similar problem for sorting by score, as document scores may be clustered in the doc ID space, which increases the time for the minimum competitite score to converge to the actual score of the top k-th hit. There are a few approaches that have been documented to evaluate a lower bound of the top-k-th score, see e.g. https://culpepper.io/publications/pm+19-sigir.pdf. I wonder if there's anything to learn from this for sorting by field. |
@jpountz yes, I am thinking to relax the 88% (current) non-competitive document threshold would be great solution instead blindly invoking skipping. Like for This will be really one time BKD intersect to skip value based on This will relly help not eve deep pagination, but mid-start, mid level pagination as well. basically any pagination which are not too early in the iterator. The performance gain I see in |
I was thinking if we can make these threshold |
I'd rather not make this configurable, but if you can suggest a new threshold that works better for this adversarial case and doesn't hurt sorted queries such as tasks from nightly benchmarks, I'd probably be good with it. |
Description
Background
Lucene sort queries are using skipping logic for faster execution and skip non-competitive documents by updating its competitive iterator whenever it updates its bottom value in priority queue. Reference
This works for fields indexed with BKD.
In case of
after
value,after
value is considerd as topValue we try to execute skipping logic with bothtopValue
andafterValue
. The skipping logic has a constraint, and with that, the estimated number of competitive documents must reduce to1/8
to intersect BKD skipping logic. That can be problamatic in like described below.Problem
With #12333 change, we started using
topValue
to skip documents in casetopValue
is able to skip7/8
number of documents. But what if we only able to skip6/8
number of documents withtopvalue
and those all non-competitive documents6/8
are ahead ofafter
document in docsIdIterator ?This problem specifically comes in
timeseries
workload.i.e. we have time series segment where documents ids and document values are in nearly sorted order and lets assume we have
100
documents and its time field values are from 1,2,3,....100.Now if I trigger sort query
sort
on this field withafter
values as87
, we wont able to skip first 86 documents and they will end up being in comparison here in PagingFieldCollector.Assume this happening for millions of documents in case of time series workload.
Proposed solution
Lets invoke skipping logic in case
topValue
is known butbottomvalue
is unknown ir-respective of number of docuements we are able to skip. That will be one time invocation of skipping logic in caseafter
value is specific and we will skip all documents which are non-competitive w.r.t after value. And from next iteration, we will know aboutbottomValue
once priority queue is full and can have constraint of 7/8 documents skipping.....The text was updated successfully, but these errors were encountered: