Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop putting article URL if no title is provided #208

Closed
kelson42 opened this issue Feb 23, 2019 · 12 comments
Closed

Stop putting article URL if no title is provided #208

kelson42 opened this issue Feb 23, 2019 · 12 comments
Assignees

Comments

@kelson42
Copy link
Contributor

Following this comment kiwix/kiwix-js#475 (comment), I think we should stop faking the article.title properly if there is not title at all. The problematic code is here:

const std::string& getTitle() const { return title.empty() ? url : title; }

@Jaifroid
Copy link

Just say that I believe you have to be careful not to break binary search or the display of search results with any change to this. There was probably a reason why libzim added the url in place of title when titles were missing. I think it has to do with the way the binary search is implemented. This is something @mossroy and I discovered when we implemented kiwix/kiwix-js#443 (comment), and there is some discussion there of why we have to search on the url when the title is missing. We were getting crazy binary searches because of missing titles in TEDx ZIMs. For the display of search results (but obviously not for sorting), we decided to add square brackets around the "faked" title when it was derived from the url.

@kelson42
Copy link
Contributor Author

kelson42 commented Feb 23, 2019

@Jaifroid this line of code has been introduced at the really beginning of the libzim project, 10 years ago. At that time, there was no concept of title (and title index). This has been introduced at my request. I suspect this piece of code has been made that way to keep compatibility if no title index. In fact I think the main question is even more if we need this title index at all now that we can make suggestion based on the ft index... and the ft index is meanwhile always here. To conclude I don't see any case where the removal of this "hack" should create a problem now (for 10 years it was different).

@Jaifroid
Copy link

@kelson42 OK! (FYI, Kiwix JS still uses a title index, and has no way of accessing any other index. However, Kiwix JS doesn't use libzim, so changing things here won't have any effect, I believe.)

@kelson42
Copy link
Contributor Author

@Jaifroid It is fundamental that Kiwix JS can access both URl and Title index. This should be a priority to fix. I hardly see how it could work without accessing the URL index.

@Jaifroid
Copy link

@kelson42 All I meant was that Kiwix JS cannot currently access the word index in pre-indexed ZIM files. I'm sure we access the directory pointer list ordered by url, or else we wouldn't be able to find directory entries for assets, which we clearly do. Hopefully this is a non-issue!

@mossroy
Copy link

mossroy commented Feb 24, 2019

The full-text index is not implemented in Kiwix-JS. But it does use both URL and Title indexes.

I also think, like @Jaifroid, that this change might break compatibility with some older ZIM files, for Kiwix clients that use the libzim.

For example, I would advise that you test an article search in tedxgeneva-2014_fr_all_2015-03.zim (which probably does not have a built-in full-text index).
In our javascript implementation (which follows the libzim algorithms), the binary search was failing with this ZIM file if the URL was not used for missing titles.

@kelson42
Copy link
Contributor Author

kelson42 commented Feb 24, 2019

@mossroy @Jaifroid Please bring me something concrete where this change breaks something. Then I will evaluate it. AFAIK there is no reason not to do it.

@mgautierfr
Copy link
Collaborator

This is use to displayed "something" in kiwix-desktop if there is no title.
Every content not being a html page has no title (image, js script, css file, metadata, html page without <title> tag)

The fact that we return the url if the title is empty is not a bug. It is part of the spec (See line title in the array about directory entries in https://wiki.openzim.org/wiki/ZIM_file_format, or The title is empty and hence the title is the same as the url: "Auto". in https://wiki.openzim.org/wiki/ZIM_File_Example)

If you umderstand "If an article has no title return the url" this can be a bug.
But if you understand "If an article has a title equal to the url, we don't need to store twice the same content", this is definitively not a bug.

@kelson42
Copy link
Contributor Author

@mgautierfr Thank you for enlightening the discussion. To me we should not have this kind of logic where respective duties of URL and titles are mixed. This is a fact that this is hard to understand and therefore confusing. This is a legacy thing (at the time we have introduced the title & title index on my demand), we do (should) set title properly if not empty (and whatever the value of the URL). As a consequence the best thing is to remove this fallback scenario.

@mgautierfr
Copy link
Collaborator

As a consequence the best thing is to remove this fallback scenario.

This is not a fallback scenario. This is THE scenario.
All zims are made using this classical scenario.
The Title Pointer List is sorted by title using this scenario.
A article with url "A/foo.hmtl" but without title tag in the html content will be sorted between an article "A-letter" and "Ant". This is the normal behavior and it is consistent.

If we remove the fact that "empty title -> use url", we will end with all zims with article not properly sorted by title. And then the binary search will fail.

kiwix-js simply must follow the spec. This is not that at a moment we are comparing url with title. We are always comparing title with title and sometime the title is the same as the url (and then, we avoid duplication)

@Jaifroid
Copy link

@mgautierfr JFYI Kiwix JS does follow the spec: we use exactly the same algorithm (implemented in JS) as libzim for title and url indices, and binary search.

@kelson42
Copy link
Contributor Author

@mgautierfr I understand now that the title index has not been implemented as I think it was (and as I requested 10 years ago). This is unfortunate as it is unnecessary complicated and confusing. To me, at the best, this is an optimization (which brings almost nothing). Lets keep it like it is then. Thank you @Jaifroid @mossroy @mgautierfr for patience to sort this request out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants