-
-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[search] issues with the new HTML search algorithm #12391
Comments
Ironically I think fixing the search algorithm so it worked as originally intended made this worse. In previous versions, there were various bugs that made title matches rank lower than they should have. In this case, That's said, there's a few things we could investigate that might improve things:
|
If I may add my two cents here, I think a key part is actually the webpage name: https://docs.python.org/3.12/library/asyncio.html, https://docs.python.org/3.12/library/itertools.html |
There are other possibilities, namely rank the match higher:
EDIT: We had the same idea with the name of the file ! |
Definitely lots of things we could experiment with! I think now that we have a foundation with better testing (esp. after #12102 merges) it will be easier to iterate on things and make sure that we maintain desired behaviour. |
I'll be out for the next days due to Eurocrypt, but I'll be back next week and could perhaps make that happen. In the meantime, @hugovk I suggest you generate the docs with a patched version of searchtools.js so that Python is not broken for too much time (or maybe people can live with that, though it's a bit annoying...). |
Thanks, I think we can wait, let's see. Enjoy Eurocrypt! |
Related to sphinx-doc#12391. Intended as a proof of concept, not a full solution.
Spent a bit of time looking at this today. Some thoughts:
I attempted a quick fix that adjusted some things here:
It seems to give somewhat better results (see PR) but to be honest I'm not super happy with it, it feels very much like a set of hacks (on top of the existing hacks) that might just lead to other problems. Longer term, I think we might want to focus on improving the scoring algorithm for generic term search. I may be misunderstanding what's going on (wouldn't be the first time 😄), but based on my initial spelunking of the https://github.com/sphinx-doc/sphinx/blob/48cbb43e28efe82b198137042811e0ade3599ae7/sphinx/search/__init__.py it doesn't look like we're using tf-idf -- it seems like we just check whether the term is in the document or not. I suspect we could get much better results by using that by default, then implementing title search on top of that (to further boost scores), instead of making it its own thing. |
So, I'm back and here's what I suggest:
I had some lectures about learning theory, tensors and similarities but those were mostly theoretical and not really with implementations. So on that one I cannot really help although I can give ideas. Nonetheless, I can help with the maths behind if needed (and graph-based searches as well). PS: If anyone has experience in search-based engine (e.g., someone that is well-versed in Google techniques :D), they are welcome to help us! |
Do you have rights to edit the pull request? I believe maintainers usually do with the committers' consent. But yes, that makes sense to me.
Yes, I kind of think so though it feels like a large project -- at least 40 to 80 hours, maybe more (especially if you consider product research and testing). One thing to note is that any site hosted by readthedocs (which many sphinx projects use, though not cpython) uses a different server-side implementation of search which doesn't appear to suffer from these problems. Part of me wonders if we should implement some variation of #12393 as an interim solution to at least fix these pathologically bad cases where a subtitle gets ranked higher than a title.
Well, there's definitely prior art out there to look at-- e.g. I think mkdocs and mkdocs-material have pretty good client side search. I believe they use a combination of lunr.js (using https://github.com/yeraydiazdiaz/lunr.py to generate the index) with some other magic. |
I think Jay allowed it.
I'm defending my thesis in one month, so I cannot really afford to spend 40-80h on that task but at least, it's good that we agree that we need to do something. If you however know of anything that is opensource and that can be used as a plug-in replacement or from where we could inspire ourselves, then it'd be definitely a starting point. The lunar.py framework seems suitable to our needs since we do have something that could serve as primary/secondary keys. For instance, with the 'asyncio' use case, we would have multiple matches but only one of them would be a "module"-like match and in this case, it is pretty easy to select which would be the expected one. Now, before doing anything, perhaps we could decide the priority of the fields, like, when you search something, do you first want a module, a class, a property, or what? and in which order? |
Fair enough, I've also been busy and my heart sinks a bit thinking about the size of this project. That said, I feel like this would benefit from a broader discussion (we've gone far beyond the scope of this bug) so started one here with some ideas about how one might move forward: https://github.com/orgs/sphinx-doc/discussions/12419
I think my initial hypothesis is that we might not want to distinguish between these things explicitly and that we'd be better off with a good search implementation that just worked against document titles/content in a more general way. I think one of the nice things about doing this up as an extension (as I suggest in the discussion) is that we can more easily experiment with various approaches. |
Re: @wlach's bug analysis:
I agree that this seems to be the cause of the problem - the JavaScript variable named So if I understand correctly, we're producing a Re: @picnixz's suggested courses of action:
Yep - I would like to treat this bugreport as something we can write a test case for and then have some confidence that it is resolved. I think a search engine / information retrieval expert would say that it is impractical to rely on unit tests to guarantee the continued quality of a search engine where indexed content evolves over time -- and I more-or-less agree with that, but I also think that a few well-chosen test cases can be useful to maintain (and document) desired behaviours of the search engine.
I would really like if we could document what we have already, and I think that would help with this. Re: interim solutions (quoting @wlach)
I think I'm more-or-less-OK with that idea; I don't love the additional code path complexity but I think the idea -- creating a distinction between main-title and section-title match scoring -- is good. It might be nice to clarify those values as ...and it's also worth eeping @Saba-Sabato's note about filenames in mind. In particular: is Re: #12102 edit permissionsIn general, yep feel free to push changes to my branches - I've added a few replies on there and will get around to adding the relevant commits soon. |
I've developed a minimal-ish repro case that I believe exhibits the same problem. It contains:
The search results when querying for the term
Those don't seem ideal to me; I'd suggest we'd want the results to appear instead as: object-name match, main-title match, subsection match. That suggested ordering is based on a sense I have that many of the projects built with Sphinx are technical/reference projects, and so a match on an object name in the project's domain is usually highly relevant to the user's query. And also a sense that if a title is the main title of a document, then that document should score more highly than a subsection heading with the same level of matching. There are likely multiple ways to implement scoring that would achieve the suggested re-ordering; some of those might be good, some of them might be less good. Trying to find an implementation that is straightforward to read and understand (both code and behaviour) and also produces good results would be the next step -- and then we could (re)evaluate that after a subsequent release, based on feedback. |
This is great, thank you @jayaddison! As I said above, I think there's only so far we can go with the current approach, but as an incremental step adding a real test case and then adjusting the scoring weights seems like the way to go. Happy to help with review / adjustments as you go along. As an aside, it's hilarious that we get 7 results for a corpus of three documents. 😭 Really looking forward to #12047 landing.
Yup that makes sense to me too. |
Well spotted :) Yep, I will confirm that that resolves the result duplication in this case too. |
Done - merging #12047 (at c7245d2) into #12441 (at 5eaea64) does include de-duplication as expected. Six (6) results are returned. They mostly appear distinct (unique); the third result is arguably dubious though -- it looks like a term match, but the term also happens to be a subsection title. Even so, I think that is a separate issue to both this and #11961. |
I don't really have a good title for this issue, but the CPython issue speaks for itself. It appears we are still misranking titles in some cases (the 3.11 doc is generated by 7.2.6 but the 3.12 is generated by 7.3.7).
cc @wlach @jayaddison
From python/cpython#119423 (comment)
The text was updated successfully, but these errors were encountered: