Stop putting article URL if no title is provided #208

kelson42 · 2019-02-23T11:19:50Z

Following this comment kiwix/kiwix-js#475 (comment), I think we should stop faking the article.title properly if there is not title at all. The problematic code is here:

libzim/src/_dirent.h

Line 84 in 684c355

const std::string& getTitle() const { return title.empty() ? url : title; }

Jaifroid · 2019-02-23T12:06:24Z

Just say that I believe you have to be careful not to break binary search or the display of search results with any change to this. There was probably a reason why libzim added the url in place of title when titles were missing. I think it has to do with the way the binary search is implemented. This is something @mossroy and I discovered when we implemented kiwix/kiwix-js#443 (comment), and there is some discussion there of why we have to search on the url when the title is missing. We were getting crazy binary searches because of missing titles in TEDx ZIMs. For the display of search results (but obviously not for sorting), we decided to add square brackets around the "faked" title when it was derived from the url.

kelson42 · 2019-02-23T12:36:45Z

@Jaifroid this line of code has been introduced at the really beginning of the libzim project, 10 years ago. At that time, there was no concept of title (and title index). This has been introduced at my request. I suspect this piece of code has been made that way to keep compatibility if no title index. In fact I think the main question is even more if we need this title index at all now that we can make suggestion based on the ft index... and the ft index is meanwhile always here. To conclude I don't see any case where the removal of this "hack" should create a problem now (for 10 years it was different).

Jaifroid · 2019-02-23T13:35:40Z

@kelson42 OK! (FYI, Kiwix JS still uses a title index, and has no way of accessing any other index. However, Kiwix JS doesn't use libzim, so changing things here won't have any effect, I believe.)

kelson42 · 2019-02-23T14:21:39Z

@Jaifroid It is fundamental that Kiwix JS can access both URl and Title index. This should be a priority to fix. I hardly see how it could work without accessing the URL index.

Jaifroid · 2019-02-23T18:18:02Z

@kelson42 All I meant was that Kiwix JS cannot currently access the word index in pre-indexed ZIM files. I'm sure we access the directory pointer list ordered by url, or else we wouldn't be able to find directory entries for assets, which we clearly do. Hopefully this is a non-issue!

mossroy · 2019-02-24T09:31:31Z

The full-text index is not implemented in Kiwix-JS. But it does use both URL and Title indexes.

I also think, like @Jaifroid, that this change might break compatibility with some older ZIM files, for Kiwix clients that use the libzim.

For example, I would advise that you test an article search in tedxgeneva-2014_fr_all_2015-03.zim (which probably does not have a built-in full-text index).
In our javascript implementation (which follows the libzim algorithms), the binary search was failing with this ZIM file if the URL was not used for missing titles.

kelson42 · 2019-02-24T10:02:49Z

@mossroy @Jaifroid Please bring me something concrete where this change breaks something. Then I will evaluate it. AFAIK there is no reason not to do it.

mgautierfr · 2019-02-25T10:52:30Z

This is use to displayed "something" in kiwix-desktop if there is no title.
Every content not being a html page has no title (image, js script, css file, metadata, html page without <title> tag)

The fact that we return the url if the title is empty is not a bug. It is part of the spec (See line title in the array about directory entries in https://wiki.openzim.org/wiki/ZIM_file_format, or The title is empty and hence the title is the same as the url: "Auto". in https://wiki.openzim.org/wiki/ZIM_File_Example)

If you umderstand "If an article has no title return the url" this can be a bug.
But if you understand "If an article has a title equal to the url, we don't need to store twice the same content", this is definitively not a bug.

kelson42 · 2019-02-25T11:57:27Z

@mgautierfr Thank you for enlightening the discussion. To me we should not have this kind of logic where respective duties of URL and titles are mixed. This is a fact that this is hard to understand and therefore confusing. This is a legacy thing (at the time we have introduced the title & title index on my demand), we do (should) set title properly if not empty (and whatever the value of the URL). As a consequence the best thing is to remove this fallback scenario.

mgautierfr · 2019-02-25T12:41:36Z

As a consequence the best thing is to remove this fallback scenario.

This is not a fallback scenario. This is THE scenario.
All zims are made using this classical scenario.
The Title Pointer List is sorted by title using this scenario.
A article with url "A/foo.hmtl" but without title tag in the html content will be sorted between an article "A-letter" and "Ant". This is the normal behavior and it is consistent.

If we remove the fact that "empty title -> use url", we will end with all zims with article not properly sorted by title. And then the binary search will fail.

kiwix-js simply must follow the spec. This is not that at a moment we are comparing url with title. We are always comparing title with title and sometime the title is the same as the url (and then, we avoid duplication)

Jaifroid · 2019-02-25T12:55:47Z

@mgautierfr JFYI Kiwix JS does follow the spec: we use exactly the same algorithm (implemented in JS) as libzim for title and url indices, and binary search.

kelson42 · 2019-02-25T13:22:10Z

@mgautierfr I understand now that the title index has not been implemented as I think it was (and as I requested 10 years ago). This is unfortunate as it is unnecessary complicated and confusing. To me, at the best, this is an optimization (which brings almost nothing). Lets keep it like it is then. Thank you @Jaifroid @mossroy @mgautierfr for patience to sort this request out.

kelson42 added the bug label Feb 23, 2019

kelson42 assigned mgautierfr Feb 23, 2019

kelson42 mentioned this issue Feb 23, 2019

[TEST] New ZIM files without file extension in the URL kiwix/kiwix-js#475

Closed

kelson42 closed this as completed Feb 25, 2019

kelson42 added the invalid label Feb 25, 2019

mossroy mentioned this issue Apr 17, 2019

Title should be based on the DOM node <title> openzim/mwoffliner#605

Closed

kelson42 mentioned this issue Aug 9, 2020

Random should be based off suggestions' pool kiwix/libkiwix#404

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stop putting article URL if no title is provided #208

Stop putting article URL if no title is provided #208

kelson42 commented Feb 23, 2019

Jaifroid commented Feb 23, 2019

kelson42 commented Feb 23, 2019 •

edited

Loading

Jaifroid commented Feb 23, 2019

kelson42 commented Feb 23, 2019

Jaifroid commented Feb 23, 2019

mossroy commented Feb 24, 2019

kelson42 commented Feb 24, 2019 •

edited

Loading

mgautierfr commented Feb 25, 2019

kelson42 commented Feb 25, 2019

mgautierfr commented Feb 25, 2019

Jaifroid commented Feb 25, 2019

kelson42 commented Feb 25, 2019

Stop putting article URL if no title is provided #208

Stop putting article URL if no title is provided #208

Comments

kelson42 commented Feb 23, 2019

Jaifroid commented Feb 23, 2019

kelson42 commented Feb 23, 2019 • edited Loading

Jaifroid commented Feb 23, 2019

kelson42 commented Feb 23, 2019

Jaifroid commented Feb 23, 2019

mossroy commented Feb 24, 2019

kelson42 commented Feb 24, 2019 • edited Loading

mgautierfr commented Feb 25, 2019

kelson42 commented Feb 25, 2019

mgautierfr commented Feb 25, 2019

Jaifroid commented Feb 25, 2019

kelson42 commented Feb 25, 2019

kelson42 commented Feb 23, 2019 •

edited

Loading

kelson42 commented Feb 24, 2019 •

edited

Loading