[HTML search] optimization: don't loop over all document terms and title terms during partial-matching. #12045
Resolved by #11958.
Re-opening because I'm puzzling over this potential performance problem. Some thoughts:
Edit: and a second item questioning how often the problem occurs in practice.
Out of curiosity, I ran a search through Chrome's devtools profiler, and here were the results: the JavaScript portion of the search is around 16ms on my Dell i7. No doubt it would be a little slower on older machines, but I doubt by much. A more significant optimization would likely be to fetch results for the search previews in parallel. That seems to be the main bottleneck to showing a full result.
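As a rough illustration of that parallel-fetch idea (a sketch only, not searchtools.js code; `fetchPreviewHtml` and `renderPreview` are hypothetical names):

```js
// Fetch all search-result preview snippets concurrently instead of awaiting
// each one in turn; the slowest single fetch then bounds the total wait.
async function loadPreviews(results) {
  const previews = await Promise.all(
    results.map((result) =>
      fetchPreviewHtml(result) // hypothetical helper wrapping fetch()
        .catch(() => null)     // tolerate individual failures
    )
  );
  previews.forEach((html, i) => {
    if (html !== null) renderPreview(results[i], html);
  });
}
```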
Thanks @wlach - I'll try to do some performance profiling soon too. Could you share the query you used when collecting that profile trace? (it seems that the Python 3[.12?] docs were the dataset?) Some thoughts:
I also had a weird idea about substring matching: if we sorted the characters within the terms (so …
From the URL in the screenshot: https://docs.python.org/3.13/search.html?q=asyncio
Thank you @AA-Turner! So for that example, the code path I'm wary of would not currently be followed, because an exact-match does exist:

>>> import json
>>> index = json.loads(open('searchindex.json', 'r').read())  # data from searchindex.js:setIndex
>>> len(index['terms'])
40195
>>> 'asyncio' in index['terms']
True

That's good news. What I'd like to examine next is the cost of the substring matching (complete …
For me on an RPi400 (Cortex-A72) the difference within Query for …
For the Python documentation sources, I anticipate the size increase in the Sphinx search index to support the proposed efficient substring matching approach above would be ~25%, based on:

>>> index = json.loads(open('searchindex.json', 'r').read())  # data from searchindex.js:setIndex
>>> len(index['terms'])
40195
>>> len(set("".join(sorted(term)) for term in index['terms']))
30881
Agh, ignore me. This idea for an index structure doesn't help for the inner-string substring matching problem at all -- because the substring isn't guaranteed to contain the lowest-ordinal character -- the needle that we'd use to begin and navigate the binary search stage. Back to the drawing board for now... ngrams do seem to be a valid alternative here.
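To make the failure mode concrete, a tiny illustration (`sortChars` is just for demonstration):

```js
const sortChars = (s) => [...s].sort().join("");

sortChars("asyncio"); // "acinosy" -- the stored index key
sortChars("sync");    // "cnsy"    -- the query needle

// A binary search over sorted keys can only locate entries that start with
// the needle, but "acinosy" begins with 'a' -- a character that the genuine
// substring "sync" doesn't contain -- so the lookup would miss the match.
```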
Ok, here's an alternative strategy.
The algorithm to perform substring matching itself would be to iterate (in JavaScript) over the trigrams of the query term, and to successively set-intersect the lists of terms found in the ngram-term lookaside trie, until either the intersection set is empty (no results) or we reach the end of the query term. I think that evaluating the technique on the terms index before deciding whether it's worthwhile to also apply it to the titles index (generally much smaller, so less likely to exhibit noticeable performance improvements) could be worthwhile. Edit: clarification for querying algorithm.
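A minimal sketch of that querying algorithm, assuming a flat Map from trigram to term-set rather than the trie described above (all names are illustrative):

```js
// Build a trigram -> set-of-terms lookaside index. (Illustrative only: the
// real proposal builds the index in Python at documentation-build time.)
function buildTrigramIndex(terms) {
  const index = new Map();
  for (const term of terms) {
    for (let i = 0; i + 3 <= term.length; i++) {
      const gram = term.slice(i, i + 3);
      if (!index.has(gram)) index.set(gram, new Set());
      index.get(gram).add(term);
    }
  }
  return index;
}

// Query: successively intersect the candidate sets for each trigram of the
// query word, bailing out early once the intersection becomes empty.
function substringCandidates(index, word) {
  let candidates = null;
  for (let i = 0; i + 3 <= word.length; i++) {
    const matches = index.get(word.slice(i, i + 3));
    if (!matches) return new Set(); // unseen trigram: no possible match
    candidates = candidates === null
      ? new Set(matches)
      : new Set([...candidates].filter((t) => matches.has(t)));
    if (candidates.size === 0) return candidates;
  }
  return candidates ?? new Set(); // words shorter than 3 chars yield no grams
}
```

Because unrelated terms can share all of a word's trigrams without containing the word contiguously, surviving candidates would still need a final `term.includes(word)` confirmation.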
Would an ideal solution be a suffix tree or even suffix automaton? Because: …
Possibly? 😄 I'm not familiar with either of those, but I'll read about them. There are going to be some tradeoffs with any search approach, I think. Always good to learn about additional options though.
That seems true, yep. The ngram-index implementation in #12596 isn't limited to prefix-matches, incidentally -- there are ngrams in there that begin in the middle of strings and could therefore be used by autosuggest for those kinds of queries.
We definitely don't want to be super-restrictive on the query words/terms. However, using 3-grams doesn't rule out query-matching for words of length 4, for example. If the word …
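For instance (my own illustrative word, since the example above is truncated):

```js
const trigrams = (w) =>
  Array.from({ length: Math.max(w.length - 2, 0) }, (_, i) => w.slice(i, i + 3));

trigrams("chat"); // ["cha", "hat"] -- a length-4 word still yields two grams,
                  // so intersecting their candidate sets can match it.
```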
I'd like to revisit the JavaScript search code again in future, but I think that this bug seems to have been overly-specific (there haven't been reports of performance issues with this particular area of the code). Closing this as wontfix/not-planned.
Is your feature request related to a problem? Please describe.
There seems to be a potentially-large inefficiency in the Sphinx JavaScript search code: for any query word greater than length two (likely to be >90% of them, I'd guess!), we iterate through all document terms and all title terms in the client's search index to check for partial matches.
However, if an exact-match was already found on the query word, then we won't add any of those candidates, even when they do match. That means that we spend JavaScript compute resources iterating through items that are unused.

Since Sphinx 7.3.0, this is no longer true thanks to #11958 -- if an exact-match on a stemmed term is found, we skip checks for partial matches for that query term.

The relevant code is found here:
sphinx/sphinx/themes/basic/static/searchtools.js
Lines 469 to 476 in cf7d275
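For readers without the snippet inline, the shape of that code path is roughly the following (a paraphrased sketch, not the verbatim lines from cf7d275; `partialResults` is an illustrative name):

```js
// Sketch of the partial-matching hot path in searchtools.js (paraphrased):
// for every query word longer than two characters, all keys of the `terms`
// index are scanned for substring matches.
if (word.length > 2) {
  // Guard from #11958 (Sphinx 7.3.0): skip the full scan when the stemmed
  // word already has an exact entry in the index.
  if (!(word in terms)) {
    Object.keys(terms).forEach((term) => {
      if (term.includes(word))
        partialResults.push({ files: terms[term], term });
    });
  }
}
```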
I don't have any stats on the performance impact of this, but it feels like it may be significant, especially for large documentation projects.

Initial results appear to indicate a 9ms vs 48ms difference on a local test machine.

Describe the solution you'd like
If we've already found exact-matches for a document term, then do not iterate over all document terms to check for partial matches.

If we've already found exact-matches for a document title term, then do not iterate over all document title terms to check for partial matches.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context