Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Title should be based on the DOM node <title> #605

Closed
kelson42 opened this issue Feb 25, 2019 · 13 comments · Fixed by #678
Closed

Title should be based on the DOM node <title> #605

kelson42 opened this issue Feb 25, 2019 · 13 comments · Fixed by #678
Assignees
Labels
Milestone

Comments

@kelson42
Copy link
Collaborator

Based on kiwix/kiwix-js#475 (comment)

AFAIK this was done properly before. We should also secure that if we have something like <title>2<b>b</b></title> then the <b> will be removed.

@Jaifroid
Copy link
Collaborator

@ISNIT0 as per kiwix/kiwix-js#475, Wikivoyage_en_all_novid_2019-04.zim is displaying the "missing titles" symptoms discussed in this issue again. See screenshot below from Kiwix JS. All "titles" in square brackets are auto-generated by Kiwix JS from the URL, because the title is blank in the dirEntry. Can someone re-open this issue?

image

@kelson42 kelson42 reopened this Apr 10, 2019
@kelson42 kelson42 modified the milestones: 1.8, 1.9 Apr 10, 2019
@ISNIT0
Copy link
Contributor

ISNIT0 commented Apr 10, 2019

Odd, the code seems like it always provides a title (even if it has to fallback to the articleId)
I'll look into this later, but it's not an obvious issue

@Jaifroid
Copy link
Collaborator

@ISNIT0 , the underlying libzim code in most Kiwix clients fills in the title from the url if there is no title, so the error is masked. In Kiwix JS, we opted to put [...] around the title so that any problem would be obvious. However, using the url as fallback leads to incorrect titles with underscores in them, so mwoffliner should always populate the title field from the DOM title node, never from the url. It seems the code isn't doing that consistently.

@ISNIT0
Copy link
Contributor

ISNIT0 commented Apr 17, 2019

I've modified the logic now to read the (html stripped) title from <title> in the generated HTML (we were using the Parsoid response). If it doesn't find a title, it falls back to articleId.replace(/_/g, ' ')

Does this seem right? PR coming soon

ISNIT0 added a commit that referenced this issue Apr 17, 2019
ISNIT0 added a commit that referenced this issue Apr 17, 2019
🐛 Modifying article title logic #605
@kelson42
Copy link
Collaborator Author

@ISNIT0 Have you verified it fixed the problem reported by Jaifrod. I hardly understand how the Parsoid API title should/could be different from the HTML one. Are you able to explain this?

@ISNIT0
Copy link
Contributor

ISNIT0 commented Apr 17, 2019

Weirdly, zimdump shows the titles are set, but KiwixJS still has the [..]

➜  mwoffliner git:(master) ✗ ../../Projects/zim-tools/zimtools/zimdump -L ./out/wikivoyage_en_articlelist_2019-04.zim | grep html 
A       Laurel_Highlands        Laurel Highlands        54      A       text/html       21459
A       Laurinburg      Laurinburg      55      A       text/html       8991
A       Lausanne        Lausanne        56      A       text/html       197980
A       Lautoka Lautoka 57      A       text/html       17105
A       Waterloo_(Ontario)      Waterloo (Ontario)      58      A       text/html       86126
A       index   Main Page       59      A       text/html       1359

image

@kelson42
Copy link
Collaborator Author

kelson42 commented Apr 17, 2019

@Jaifroid I don't remember the whole discussion with @mgautierfr but the conclusion was I think the format just takes the string from the URL in case there is not title... and the writer of the libzim considering this, never writes the title if this is the same as the URL. So, saying this, the question might be more: why you display it differently if the title from the title or the URL?
@ISNIT0 the real thing to test here is Laurel_Highlands, so you see [Laurel_Highlands] in Kiwix-JS?

@Jaifroid
Copy link
Collaborator

The issue is that clients based on libzim transparently use the dirEntry's URL if the dirEntry's title is empty. In Kiwix JS we decided to add square brackets in such cases as a diagnostic tool. The square brackets are only added by Kiwix JS if the title is empty. The underlying problem is that mwoffliner (I presume) has not correctly added titles in such cases. "Laurel Highlands" does seem to have the correct title (no square brackets above, also no underscore, another telltale symptom that the URL has been used instead of title).

I'm pinging @mossroy in case he can help to diagnose or corroborate, as I'm currently travelling for work.

@mossroy
Copy link

mossroy commented Apr 17, 2019

I confirm that. Here is more detail :

In libzim, the title attribute of a DirEntry is protected and, when you use getTitle(), it returns the URL if the title is empty. See https://github.com/openzim/libzim/blob/684c355a6c56208c3d9e0be7990c4923b7ed8c50/src/_dirent.h#L84
So I suppose that every kiwix reader based on libzim hides empty titles this way.
The removal of this controversial behavior in libzim has been discussed in openzim/libzim#208, and rejected for backward compatibility reasons of the binary search.

In kiwix-js, we had to implement the exact same title/url fallback in the binary search (else some articles could not be found : kiwix/kiwix-js#440) BUT we did not implement it in the DirEntry class : only in the binary search where the bug was. And, when we display a matching article (in search results), and its title is empty, we display instead the URL inside square brackets. All this has been implemented in PR kiwix/kiwix-js#443.

It might be discussed if we should implement the exact same fallback in DirEntry class of kiwix-js. I did not consider it as necessary so far, and don't like getter/setters that do such nasty things in my back.
We don't have a getTitle() function and the title attribute is currently directly read (mostly because we had no native way to make an attribute private in javascript at this time, and ES6 classes are still not implemented on all our supported browsers).

@kelson42 kelson42 reopened this Apr 18, 2019
@kelson42
Copy link
Collaborator Author

@ISNIT0 I reopen the ticket as this is not clear to me if the pb is fixed. Could you please upload somewhere the ZIM file of Wikivoyage you ave created with git master HEAD?

@kelson42 kelson42 self-assigned this Apr 18, 2019
@ISNIT0
Copy link
Contributor

ISNIT0 commented Apr 18, 2019

Using the published Wikivoyage file from download.kiwix.org I'm seeing this:

➜ zimdump -L ./wikivoyage_en_all_nopic_2019-04.zim | grep Laurel_Highlands

A	Laurel_Highlands	Laurel Highlands	54069	A	text/html	23522
➜ zimdump -L ./wikivoyage_en_all_nopic_2019-04.zim | grep Lausanne        

A	Lausanne	Lausanne	54074	A	text/html	197644
A	Lausanne/Ouchy	Lausanne/Ouchy	54075	R	54074

It seems to me the published file is valid. It's odd to me that KiwixJS only shows the [..] around articles with no space character in the title (when the articleId == the title)

@kelson42
Copy link
Collaborator Author

I'm closing the ticket, the whole discussion is going in a direction which has nothing to do with the original bug report.

@Jaifroid
Copy link
Collaborator

OK. Still, something is off, either in mwoffliner or in Kiwix JS. I'll investigate what's happening on the Kiwix JS end, and will open an issue there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants