-
-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HTML Search: omit anchor reference from document titles in the search index. #12047
HTML Search: omit anchor reference from document titles in the search index. #12047
Conversation
Note: reStructuredText document titles are implicit; ref: https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html#document
…anchor values" This reverts commit d44dc11.
This seems preferable to my solution in #11942, just tested and it has the same results while making the search index smaller. 👍 from me. |
Fine for me (I'm not an expert in that search-related aspect so I'll leave it to you guys). Should I understand that this PR would replace #11942 ? @jayaddison ping me when you want it to be merged |
That's correct. I think there are three things that I don't like about this pull request, in order of priority:
|
I think I'll begin work on a small JavaScript refactor PR to make test coverage easier to add. |
Is it possible to keep a string or is the null type needed? |
If I remember correctly, this conditional needs to be adjusted if we're using empty strings, but otherwise it should be fine. What I would really like is a way to generate the indexes used in the JavaScript tests from the same Python code that builds |
Moved into #12099. |
I'd like to check whether we can get #12102 in place before progressing this pull request further. If that can be added, then I think adding test coverage here will be much easier and more reliable (I'll be able to create a sample Sphinx project that returns duplicate search results, and add test coverage against that). |
(and maybe do the |
@@ -176,7 +179,10 @@ def test_IndexBuilder(): | |||
'index': {'docname1_1', 'docname1_2', 'docname2_1', 'docname2_2'}, | |||
'test': {'docname1_1', 'docname1_2', 'docname2_1', 'docname2_2'}, | |||
} | |||
assert index._title_mapping == {'section_titl': {'docname1_1', 'docname1_2', 'docname2_1', 'docname2_2'}} | |||
assert index._title_mapping == { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, I see that the values are sets and thus can be unordered. Is it possible to actually have a more reliable guarantee on the title_mapping
actually?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To check that I understand correctly: use an assertion that checks not only the contents of the title mapping, but also the ordering of the elements inside it?
(makes sense to me, order can be significant for index data)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If my understanding there was correct: I'd prefer to handle that in a separate issue thread and PR, if that's OK. Or, if I misunderstood: let's make sure to resolve that first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is more than one occurrence of that, yes we could discuss it in another thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep; I think it would be applicable for both _title_mapping
and _mapping
; because both of those currently use a pattern of <mapping>.setdefault(..., set()).add(...)
(implying that insertion-order may not match subsequent access iteration order):
sphinx/sphinx/search/__init__.py
Lines 447 to 464 in b70578f
for word in word_store.title_words: | |
# add stemmed and unstemmed as the stemmer must not remove words | |
# from search index. | |
stemmed_word = stem(word) | |
if _filter(stemmed_word): | |
self._title_mapping.setdefault(stemmed_word, set()).add(docname) | |
elif _filter(word): | |
self._title_mapping.setdefault(word, set()).add(docname) | |
for word in word_store.words: | |
# add stemmed and unstemmed as the stemmer must not remove words | |
# from search index. | |
stemmed_word = stem(word) | |
if not _filter(stemmed_word) and _filter(word): | |
stemmed_word = word | |
already_indexed = docname in self._title_mapping.get(stemmed_word, ()) | |
if _filter(stemmed_word) and not already_indexed: | |
self._mapping.setdefault(stemmed_word, set()).add(docname) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correction: any runtime (in-memory) differences in the ordering of objects within the _title_mapping
and _mapping
set entries are later resolved into deterministic order by an explicit sort that occurs during freezing (serialization) of the search index (thanks to d24bd73).
I'm not aware of any other current bugs causing nondeterminism in our searchindex.js
output since v7.3.0 (95fb0e5 included in that release handled a similar situation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So _title_mapping
are unsorted in the tests, but in production they are sorted? if so, a comment is needed (at least to remember the reason).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, pretty much. I'll be a bit verbose:
- In-memory (and therefore in the tests), each value in
_title_mapping
may store elements in arbitrary ordering. - The tests use Python
set
equality comparison, and that is order-agnostic (as it should be) -- so the tests aren't really checking the order, even though it might appear from the code that they would. - The output of a Sphinx production build to output a
searchindex.js
sorts these entries.
What you suggest about a comment makes sense...
...however.. perhaps I could attempt a separate PR to refactor the code so that sorting occurs in-place and in-memory, and to adjust the unit tests to assert on that. Those would provide stronger properties than a comment in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for that. It's better that at runtime, the objects are represented in sets since they are more efficient. It's only for serialization that you need to sort to ensure reproducibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, agreed (storing the lists in-place for ordered iteration would probably add various types of overhead).
I've pushed a commit to add some commentary as a docstring on the relevant method and also in the unit tests.
@@ -391,7 +391,7 @@ def freeze(self) -> dict[str, Any]: | |||
objtypes = {v: k[0] + ':' + k[1] for (k, v) in self._objtypes.items()} | |||
objnames = self._objnames | |||
|
|||
alltitles: dict[str, list[tuple[int, str]]] = {} | |||
alltitles: dict[str, list[tuple[int, str | None]]] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain the shape of this value, i.e., to what corresponds list[tuple[int, str | None]
(what is the integer, what is the str | None
and) and what is the key str
of the dict
, and why does it change from the all_titles
? (the list has pairs of int and strings and not pairs of strings anymore).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mm, this is not very clear at the moment, but I believe that the integer values here are what I'd call documentId
values - sequential IDs assigned to each of the source .rst
files at build-time. The second value in each tuple -- the str
or empty value -- is a target reference (aka anchor
when talking about HTML) -- it's an indicator about what destination on the page/document the user should be taken to were they to navigate using this entry.
Perhaps this is a situation where we should refactor and extract out meaningful type aliases (a bit like the recent Intersphinx refactorings).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not use aliases, but the int
always seem to me like corresponding to the index in the list, so that you could possibly have multiple times the same element but with different indices. However, I'd be happy to know if this is really the reason why we duplicate that or if we could just live with a list of strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess an example to exlain that could be a title like Recent Changes
that could appear in multiple documents and might have different headings (or perhaps even be the document title in some cases).
alltitles: {"Recent Changes": [0, None], [1, "coolmodule-recentchanges"], [5, "anothermodule-recentchanges"], ...}
(a bit hypothetical, but it's to handle situations like that I believe. again, a good case for checking and documenting)
Co-authored-by: Bénédikt Tran <[email protected]>
I'm struggling a bit to see the problem: aside from deserialization, I can't imagine for what reason we'd be reading index entries from Python code? I skimmed through the code and don't see us doing anything where the different types would be a problem: https://github.com/search?q=repo%3Asphinx-doc%2Fsphinx%20self._titles&type=code
It looks like we explicitly reject any index created with a previous version of Sphinx in this case: sphinx/sphinx/search/__init__.py Line 302 in de4ac2f
|
I think the theory is that the search index -- like individual output files in many cases -- could be rebuilt incrementally rather than having to re-read the entire project content. Ref: b1271fa However, it's not clear to me whether those loaded entries are retained before the updated search index is written, because of the code linked below; is sphinx/sphinx/builders/html/__init__.py Line 1163 in de4ac2f
Ok, that's good to know - but that version number is a static value in the codebase -- it's not the Sphinx version number, nor a counter of the number of builds on the machine where Sphinx is being run: sphinx/sphinx/environment/__init__.py Lines 59 to 61 in de4ac2f
|
Ok, but I don't see anything in the codebase accessing that variable (
You could always increment that number if you wanted to be on the safe side. Not really a lot of downside outside of maybe a very minor build-time penalty in a few obscure cases. |
Ok - that matches my understanding too; the only time we write to the search-related attributes on
Given that we believe that the Python-side changes should be backwards-compatible here (no structural changes to the pickled |
NodeJS tests on this branch began failing from 9fef43b onwards; that isn't easily visible from the CI results in the GitHub web UI because we weren't running those on each commit until #12445 was merged (the latest merge-from-master includes that, explaining the visibility of the test failure). The change at 9fef43b regenerated the search index file for the repro test case. Next I'm going to try to figure out why the tests began failing after that, and what that means. |
Good news, I think.
This index-regeneration occurred directly after e947f6b - a merge from the mainline branch. That merge pulled in support for JavaScript testing of client-side HTML. And that happened to include a test case -- with a relevant TODO note -- indicating that some of the data in it is a workaround and should be removed when #11961 is fixed. This PR fixes #11961 - so it is in fact expected that we should remove that data, and it's good that the test is failing while it is still in place. Edit: remove duplication of quoted comment |
… no longer required with the fix from this branch in place.
@jayaddison Is there anything left to be done here or is this ready for a review by a committer? |
This is ready, yep; the |
I'll review it tomorrow as well! (now that I've successfully defended, I have a bit more time!) |
Well done completing the defence @picnixz :) Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my last interrogation was resolved. I'm also happy that we don't have those duplicated entries anymore! I'll wait for Will to also review it and then we can merge it (just ping me when it's done).
@@ -176,7 +179,10 @@ def test_IndexBuilder(): | |||
'index': {'docname1_1', 'docname1_2', 'docname2_1', 'docname2_2'}, | |||
'test': {'docname1_1', 'docname1_2', 'docname2_1', 'docname2_2'}, | |||
} | |||
assert index._title_mapping == {'section_titl': {'docname1_1', 'docname1_2', 'docname2_1', 'docname2_2'}} | |||
assert index._title_mapping == { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So _title_mapping
are unsorted in the tests, but in production they are sorted? if so, a comment is needed (at least to remember the reason).
…erialized sort-order for title and document terms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All looks good from my end! @picnixz please merge when ready
I'll do it tomorrow (I was working on something else and missed the notification but I'm going offline now and I prefer not to merge something that I won't be able to see the merge result of). |
Thank you Jay (and Will for the review!) |
Thank you @picnixz! |
Feature or Bugfix
Purpose
Detail
docutils
nodes.title
nodes into the title index by omitting theirid
that is used as the hyperlink anchor in the special-case of the document title.Relates