Badly rewritten URLs in case of "/" in the article title #726

Open
kelson42 opened this issue May 10, 2019 · 36 comments · Fixed by #727, #798 or #998

@kelson42
Collaborator

The forward slash in the URL is causing problems in Kiwix JS (see kiwix/kiwix-js#494). I wonder if it might be causing problems for you too. For example, the article Singapore references its assets like this:

<link href="../-/s/css_modules/ext.kartographer.link.css" …>

Whereas the article Singapore/Riverside references its assets like this:

<link href="../../-/s/css_modules/ext.kartographer.link.css" …>

Note the extra ../, because the browser thinks it is accessing an article Riverside in the subdirectory Singapore. This would all be fine, except that the hrefs in the Singapore/Riverside article are written like this:

<p>The <b>Singapore River</b> forms a central artery in <a href="Singapore"
title="Singapore">Singapore</a>'s densely packed Central Business District.

Note that to access the hyperlink on this page from a browser, we would need to write that link as <a href="../Singapore" ...>.

So there seems to be some inconsistency. Are we inside a subdirectory from the browser's perspective, or not? The subroutine that writes the location of the assets seems to think so, while the one that writes the location of the hyperlinks doesn't! Do you think this is an mwoffliner issue?
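
To make the mismatch concrete, here is a minimal Python sketch of how a browser would resolve the two kinds of href from the Singapore/Riverside page. The host `zim.example` is a placeholder assumed for illustration, and the page is assumed to sit under the `A/` article prefix; `urljoin` from the standard library applies the RFC 3986 reference-resolution rules that browsers follow.

```python
from urllib.parse import urljoin

# Placeholder base URL (assumption); only the path depth matters here.
PAGE = "http://zim.example/A/Singapore/Riverside"

def resolve(href: str, page: str = PAGE) -> str:
    """Resolve a relative href against the page URL (RFC 3986 rules)."""
    return urljoin(page, href)

# The asset link climbs two levels, so it reaches the root as intended.
print(resolve("../../-/s/css_modules/ext.kartographer.link.css"))
# The article link as written stays inside the "Singapore/" pseudo-directory.
print(resolve("Singapore"))
# One "../" is needed to reach the actual Singapore article.
print(resolve("../Singapore"))
```

The second call lands on a non-existent `Singapore/Singapore` page, which is the inconsistency described above.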

@kelson42
Collaborator Author

kelson42 commented Jun 10, 2019

I'm reopening the ticket because this is still not handled properly. Same example with the Wikivoyage Singapore/Riverside article: the link to Suntec City and the Esplanade is wrong. It is ../../A/../Singapore%2FMarina_Bay and it should be ../../A/Singapore/Marina_Bay (so two errors: a spurious ../ and an encoded slash). @ISNIT0 Please create a proper unit test that covers URL rewriting completely, so we can be sure we don't introduce any regression there in the future. Please also add me as a reviewer on the PR, so I can look at the solution in detail.
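
One way to state the expected behaviour is: climb one `../` per slash in the current entry path, then append the target path with its slashes left as real separators. The sketch below assumes that rule; `relative_href` is a hypothetical helper name, not mwoffliner's actual code, and it deliberately ignores the encoding of other special characters. The host `zim.example` is a placeholder.

```python
from urllib.parse import urljoin

def relative_href(current_path: str, target_path: str) -> str:
    """Hypothetical helper: link from one ZIM entry path to another.

    Both paths are relative to the ZIM root, e.g. "A/Singapore/Riverside".
    Slashes in titles are kept as real path separators, never "%2F".
    """
    depth = current_path.count("/")  # number of directories to climb
    return "../" * depth + target_path

current = "A/Singapore/Riverside"
good = relative_href(current, "A/Singapore/Marina_Bay")
bad = "../../A/../Singapore%2FMarina_Bay"  # the href reported above

base = "http://zim.example/" + current  # placeholder host (assumption)
print(good, "->", urljoin(base, good))
print(bad, "->", urljoin(base, bad))
```

A unit test for URL rewriting could assert on pairs like these for titles with zero, one, and several slashes.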

@kelson42 kelson42 reopened this Jun 10, 2019
@kelson42 kelson42 modified the milestones: 1.9, 1.9-maintenance Jun 10, 2019
@ISNIT0
Contributor

ISNIT0 commented Jun 10, 2019

Yea, this is some code I've tried not to touch so far - it's very complicated and old :(

I'd like to refactor once we have good tests for it

@kelson42
Collaborator Author

kelson42 commented Jul 3, 2019

Just tried again, and CSS is not loaded properly in the English Wikivoyage article "Paris/2nd arrondissement".

@kelson42 kelson42 reopened this Jul 3, 2019
@kelson42 kelson42 pinned this issue Jul 3, 2019
@kelson42
Collaborator Author

kelson42 commented Jul 4, 2019

Seems to work now... not sure what happened here.

@kelson42
Collaborator Author

kelson42 commented Jul 15, 2019

No, it's not. I have created a new ZIM file of wikinews FR with git master HEAD, and here is what zimcheck reports (we have broken internal links):

$ zimcheck out/wikinews_fr_all_2019-07.zim 
[INFO] Checking zim file out/wikinews_fr_all_2019-07.zim
[INFO] Verifying Internal Checksum.. 
  [INFO] Internal checksum found correct
[INFO] Searching for metadata entries..
[INFO] Searching for Favicon..
[INFO] Searching for main page..
[INFO] Verifying Articles' content.. 
[INFO] Searching for redundant articles..
  Verifying Similar Articles for redundancies..
[ERROR] Invalid internal links found :
  A/Belgique_:_Belgacom/Évènements_du_10_août_2014 (%C3%89v%C3%A8nements_du_10_ao%C3%BBt_2014) was not found in article A/Belgique_:_Belgacom/swing.be,_le_médiateur_des_télécoms_s'interroge_et_déplore_l'attitude_de_Belgacom.
  A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_1_:_Les_résultats_de_la_vingt-septième_journée (Championnat_de_France_2007%2F2008_de_Ligue_1_%3A_Les_r%C3%A9sultats_de_la_vingt-septi%C3%A8me_journ%C3%A9e#cite_ref-1) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_1_:_Les_résultats_de_la_vingt-septième_journée
  A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_1_:_Lyon_vainqueur_à_Toulouse_et_Lille_à_Marseille (Championnat_de_France_2007%2F2008_de_Ligue_1_%3A_Lyon_vainqueur_%C3%A0_Toulouse_et_Lille_%C3%A0_Marseille#cite_ref-2) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_1_:_Lyon_vainqueur_à_Toulouse_et_Lille_à_Marseille
  A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_1_:_le_suspense_continue (Championnat_de_France_2007%2F2008_de_Ligue_1_%3A_le_suspense_continue#cite_ref-euro_2-0) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_1_:_le_suspense_continue
  A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_1_:_les_résultats_de_la_12ème_journée (Championnat_de_France_2007%2F2008_de_Ligue_1_%3A_les_r%C3%A9sultats_de_la_12%C3%A8me_journ%C3%A9e#cite_ref-1) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_1_:_les_résultats_de_la_12ème_journée
  A/Championnat_de_France_2007/Football_:_décès_d'Antonio_Puerta (Football_%3A_d%C3%A9c%C3%A8s_d'Antonio_Puerta) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_1_:_les_résultats_de_la_6ème_journée_(suite)
  A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_1_:_les_résultats_de_la_trente-cinquième_journée (Championnat_de_France_2007%2F2008_de_Ligue_1_%3A_les_r%C3%A9sultats_de_la_trente-cinqui%C3%A8me_journ%C3%A9e#cite_ref-2) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_1_:_les_résultats_de_la_trente-cinquième_journée
  A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_1_:_les_résultats_de_la_trente-quatrième_journée (Championnat_de_France_2007%2F2008_de_Ligue_1_%3A_les_r%C3%A9sultats_de_la_trente-quatri%C3%A8me_journ%C3%A9e#cite_ref-2) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_1_:_les_résultats_de_la_trente-quatrième_journée
  A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_1_:_les_résultats_de_la_trente-troisième_journée (Championnat_de_France_2007%2F2008_de_Ligue_1_%3A_les_r%C3%A9sultats_de_la_trente-troisi%C3%A8me_journ%C3%A9e#cite_ref-2) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_1_:_les_résultats_de_la_trente-troisième_journée
  A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_2_:_Le_Havre_AC_en_Ligue_1_la_saison_prochaine (Championnat_de_France_2007%2F2008_de_Ligue_2_%3A_Le_Havre_AC_en_Ligue_1_la_saison_prochaine#cite_ref-l1_1-0) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_2_:_Le_Havre_AC_en_Ligue_1_la_saison_prochaine
  A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_2_:_Match_nul_entre_Sedan_et_Bastia (Championnat_de_France_2007%2F2008_de_Ligue_2_%3A_Match_nul_entre_Sedan_et_Bastia#cite_ref-nat_2-0) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_2_:_Match_nul_entre_Sedan_et_Bastia
  A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_2_:_Une_saison_se_termine (Championnat_de_France_2007%2F2008_de_Ligue_2_%3A_Une_saison_se_termine#cite_ref-1) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_2_:_Une_saison_se_termine
  A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_2_:_le_FC_Nantes_en_Ligue_1_la_saison_prochaine (Championnat_de_France_2007%2F2008_de_Ligue_2_%3A_le_FC_Nantes_en_Ligue_1_la_saison_prochaine#cite_ref-nat_2-0) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_2_:_le_FC_Nantes_en_Ligue_1_la_saison_prochaine
  A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_2_:_les_résultats_de_la_trente-quatrième_journée (Championnat_de_France_2007%2F2008_de_Ligue_2_%3A_les_r%C3%A9sultats_de_la_trente-quatri%C3%A8me_journ%C3%A9e#cite_ref-nat_2-0) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_2_:_les_résultats_de_la_trente-quatrième_journée
  A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_2_:_match_nul_entre_Angers_et_Nantes (Championnat_de_France_2007%2F2008_de_Ligue_2_%3A_match_nul_entre_Angers_et_Nantes#cite_ref-nat_2-0) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_2_:_match_nul_entre_Angers_et_Nantes
  A/Championnat_de_France_2007/Championnat_de_France_2007/2008_de_Ligue_2_:_score_vierge_entre_Grenoble_et_Ajaccio (Championnat_de_France_2007%2F2008_de_Ligue_2_%3A_score_vierge_entre_Grenoble_et_Ajaccio#cite_ref-1) was not found in article A/Championnat_de_France_2007/2008_de_Ligue_2_:_score_vierge_entre_Grenoble_et_Ajaccio
  I/m/Flag_of_Scotland_(traditional).svg.png (../I/m/Flag_of_Scotland_(traditional).svg.png) was not found in article A/Championnat_de_Ligue_celtique_2014-2015_:_les_résultats_de_la_dix-neuvième_journée
  A/Congo/Évènements_du_23_avril_2019 (%C3%89v%C3%A8nements_du_23_avril_2019) was not found in article A/Congo/Rwanda_:_une_quarantaine_de_corps_retrouvés_après_naufrage_sur_lac_Kivu
  A/De_l'alcool_et_du_sucre_dans_la_comète_C/Évènements_du_1er_novembre_2015 (%C3%89v%C3%A8nements_du_1er_novembre_2015) was not found in article A/De_l'alcool_et_du_sucre_dans_la_comète_C/2014_Q2_(Lovejoy)
  A/Découverte_d'oxygène_sur_la_comète_67P/Évènements_du_31_octobre_2015 (%C3%89v%C3%A8nements_du_31_octobre_2015) was not found in article A/Découverte_d'oxygène_sur_la_comète_67P/Tchourioumov-Guérassimenko
  A/Europe_de_l'Ouest_:_vigilance_orange_neige/Évènements_du_9_février_2013 (%C3%89v%C3%A8nements_du_9_f%C3%A9vrier_2013) was not found in article A/Europe_de_l'Ouest_:_vigilance_orange_neige/verglas_en_France,_en_Belgique_et_au_Luxembourg
  A/France/Évènements_du_15_avril_2017 (%C3%89v%C3%A8nements_du_15_avril_2017) was not found in article A/France/Suisse_:_mort_de_l'auteur_des_tirs_de_1982_contre_la_centrale_de_Creys-Malville
  A/Les_États-Unis_font_usage_d'une_GBU-43/Évènements_du_12_avril_2017 (%C3%89v%C3%A8nements_du_12_avril_2017) was not found in article A/Les_États-Unis_font_usage_d'une_GBU-43/B_pour_la_première_fois_en_Afghanistan
  A/Passage_de_la_comète_C/Évènements_du_12_mars_2013 (%C3%89v%C3%A8nements_du_12_mars_2013) was not found in article A/Passage_de_la_comète_C/2011_L4_(PANSTARRS)
[ERROR] Invalid external links found :
  https://commons.wikimedia.org/w/api.php?action=timedtext&amp;title=File%3APresident_Obama_on_Death_of_Osama_bin_Laden.ogv&amp;lang=cs&amp;trackformat=srt&amp;origin=%2Ais an external dependence in article A/2011_sur_Wikinews
  https://commons.wikimedia.org/w/api.php?action=timedtext&amp;title=File%3ABangladesh_building_collapse_-_WN.ogv&amp;lang=bn&amp;trackformat=srt&amp;origin=%2Ais an external dependence in article A/Bangladesh_:_effondrement_mortel_d'un_immeuble
  https://commons.wikimedia.org/w/api.php?action=timedtext&amp;title=File%3ABangladesh_building_collapse_-_WN.ogv&amp;lang=bn&amp;trackformat=srt&amp;origin=%2Ais an external dependence in article A/Bangladesh_:_émeutes_et_revendications_après_l'effondrement_mortel_d'un_immeuble
  https://commons.wikimedia.org/w/api.php?action=timedtext&amp;title=File%3AWhite_House_Spokesman_Spicer_Holds_News_Conference.webm&amp;lang=en&amp;trackformat=srt&amp;origin=%2Ais an external dependence in article A/Le_porte-parole_de_la_Maison_Blanche_attaque_les_médias_lors_de_sa_première_allocution
  https://commons.wikimedia.org/w/api.php?action=timedtext&amp;title=File%3APresident_Obama_on_Death_of_Osama_bin_Laden.ogv&amp;lang=cs&amp;trackformat=srt&amp;origin=%2Ais an external dependence in article A/Oussama_ben_Laden_tué_lors_d'un_raid_près_d'Islamabad
  https://commons.wikimedia.org/w/api.php?action=timedtext&amp;title=File%3AWalesCalltoAction.ogv&amp;lang=ca&amp;trackformat=srt&amp;origin=%2Ais an external dependence in article A/Wikipédia_célèbre_son_dixième_anniversaire_en_ligne
  https://commons.wikimedia.org/w/api.php?action=timedtext&amp;title=File%3APresident_Obama_speaks_on_attacks_in_Boston_%282013-04-16%29.ogv&amp;lang=en&amp;trackformat=srt&amp;origin=%2Ais an external dependence in article A/États-Unis_:_3_morts_et_plus_de_170_blessés_dans_l'attentat_de_Boston
  https://commons.wikimedia.org/w/api.php?action=timedtext&amp;title=File%3APresident_Obama_on_Death_of_Osama_bin_Laden.ogv&amp;lang=cs&amp;trackformat=srt&amp;origin=%2Ais an external dependence in article A/États-Unis_:_Obama_demande_une_enquête_sur_Ben_Laden
  https://commons.wikimedia.org/w/api.php?action=timedtext&amp;title=File%3APresident_Obama_speaks_on_explosions_in_Boston_%282013-04-15%29.ogv&amp;lang=en&amp;trackformat=srt&amp;origin=%2Ais an external dependence in article A/États-Unis_:_attentats_à_Boston
[INFO] Overall Test Status: Fail
[INFO] Total time taken by zimcheck: 272 seconds.

If I look at the HTML, I see this kind of link:

href="../../A/Championnat_de_France_2007%2F2008_de_Ligue_2_%3A_Une_saison_se_termine#cite_note-1"

The URL should not be encoded, and in fact the link should be just `href="#cite_note-1"`, since it points to a footnote on the same page.
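
The same-page case can be detected before any path rewriting happens. The following is a minimal sketch under the assumption that the rewriter knows the current article title and sees link targets as a possibly percent-encoded title plus an optional fragment; `same_page_href` is a hypothetical name, not a function from mwoffliner.

```python
from urllib.parse import unquote

def same_page_href(current_title: str, raw_target: str):
    """Return "#fragment" if raw_target points at the current article, else None.

    raw_target is a link target as found in the source HTML: a title that
    may be percent-encoded, optionally followed by "#fragment".
    """
    title, _, fragment = raw_target.partition("#")
    if fragment and unquote(title) == current_title:
        return "#" + fragment
    return None

current = "Championnat_de_France_2007/2008_de_Ligue_2_:_Une_saison_se_termine"
target = ("Championnat_de_France_2007%2F2008_de_Ligue_2_%3A_Une_saison_se_termine"
          "#cite_note-1")
print(same_page_href(current, target))
```

Returning `None` for every other link leaves the normal relative-path rewriting in charge of cross-article targets.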

@kelson42 kelson42 reopened this Jul 15, 2019
@ISNIT0
Contributor

ISNIT0 commented Aug 1, 2019

@kelson42 I was able to reproduce this with master by doing a full fr.wikinews scrape, but now I'm unable to reproduce.
I believe this is fixed.
Will close for now.

@ISNIT0 ISNIT0 closed this as completed Aug 1, 2019
@Jaifroid
Collaborator

I'm afraid this issue has recurred in wikipedia_en_medicine_maxi_2019-08.zim. This is a very recent ZIM, so I assume it should have fixes from this issue? If that's not the case, then please ignore, and I'll check again next month.

Issue: Hyperlinks on the landing page of this wikimed ZIM are missing a ../. Links to stylesheets and other assets are correct, but not the hyperlinks.

The URL of the page is Wikipedia:WikiProject_Medicine/Open_Textbook_of_Medicine2 (note the forward slash). Hyperlinks to articles on this page are written like this:

<a href="Book%3AChildren's_health" title="Book:Children's health">...</a>

As written, this hyperlink should lead to a page Wikipedia%3AWikiProject_Medicine/Book%3AChildren's_health, which is clearly wrong. Our client (Kiwix JS) throws an error (correctly).

Links to assets on this page are correctly written:

<img src="../../I/m/MedLogoNoWiFi.png" …>
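
A check along these lines can be automated: resolve every href against the page's own path and see whether the result is a known entry. The sketch below is an illustration only; the entry set, the `A/Book:Children's_health` path, and the host `zim.example` are assumptions made for the example, and the helper names are hypothetical.

```python
from urllib.parse import urljoin, urlparse, unquote

ROOT = "http://zim.example/"  # placeholder host, only used for resolution

def resolved_entry(page_path: str, href: str) -> str:
    """Resolve href against page_path and return the decoded ZIM entry path."""
    absolute = urljoin(ROOT + page_path, href)
    return unquote(urlparse(absolute).path).lstrip("/")

def broken_links(page_path, hrefs, entries):
    """Return the hrefs whose resolved target is not a known entry."""
    return [h for h in hrefs if resolved_entry(page_path, h) not in entries]

page = "A/Wikipedia:WikiProject_Medicine/Open_Textbook_of_Medicine2"
# Assumed entry list for illustration.
entries = {page, "A/Book:Children's_health", "I/m/MedLogoNoWiFi.png"}
hrefs = ["Book%3AChildren's_health", "../../I/m/MedLogoNoWiFi.png"]
print(broken_links(page, hrefs, entries))
```

With these inputs only the article hyperlink is reported, while the image link resolves to an existing entry.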

@kelson42
Collaborator Author

@Jaifroid I confirm :(((((

@kelson42
Collaborator Author

@ISNIT0 Please re-implement this URL rewriting properly, as suggested in #904, and extend the automated tests so that we never have to reopen this ticket again.

@ISNIT0
Contributor

ISNIT0 commented Aug 25, 2019

@Jaifroid As far as I can tell, this is a different instance of a very general ticket.
I've proposed a fix (#954) for this specific problem, which is that the slash rewriting wasn't being applied properly for desktop scrapes.

@kelson42 #904 is not related to @Jaifroid's bug (or to the fix in #954).

@Jaifroid
Collaborator

Thanks @ISNIT0. When you say "desktop scrapes", do you mean ZIMs with "desktop style" or something else? FWIW, this ZIM appears to have a mobile style. It has the "minerva" mobile style.

@ISNIT0
Contributor

ISNIT0 commented Aug 25, 2019

@Jaifroid sorry, I was unclear - the main page is always scraped as a desktop version, so it gets treated differently to the individual articles

@kelson42
Collaborator Author

kelson42 commented Oct 6, 2019

Still not working properly, and obviously not properly covered by automated tests: http://library.kiwix.org/wikivoyage_en_all_maxi_2019-10/A/Osaka%2FNorth (the link comes from http://library.kiwix.org/wikivoyage_en_all_maxi_2019-10/A/Osaka/Kita) :(

@artiommocrenco

@kelson42 This looks like it is still an issue

sudo docker run --rm -v /output:/output:rw --memory-swappiness 0 ghcr.io/openzim/mwoffliner:1.13.0 mwoffliner --webp --mwUrl="https://neolurk.org/" --format="nodet,nopic:mini" --format="nopic:nopic" --format="novid:maxi" --osTmpDir="/dev/shm" --requestTimeout="1000" --outputDirectory="/output" --adminEmail="[email protected]" --mwRestApiPath="/wiki/"
WARNING: Your kernel does not support memory swappiness capabilities or the cgroup is not mounted. Memory swappiness discarded.
starting redis-server in the background…
(node:13) NOTE: We are formalizing our plans to enter AWS SDK for JavaScript (v2) into maintenance mode in 2023.

Please migrate your code to use AWS SDK for JavaScript (v3).
For more information, check the migration guide at https://a.co/7PzMCcy
(Use `node --trace-warnings ...` to show where the warning was created)
[error] [2024-12-20T03:14:31.871Z] Error downloading article Русская_деревня/Животноводство
[error] [2024-12-20T03:14:32.812Z] Error downloading article Старое_обсуждение:Гоблин/Архив/1
[error] [2024-12-20T03:14:33.934Z] Error downloading article Дохлые_герои/Архив
[error] [2024-12-20T03:14:35.926Z] Error downloading article Desperate_Housewives/Rex_Van_de_Kamp
[error] [2024-12-20T03:14:36.635Z] Error downloading article Русская_деревня/Локации
[error] [2024-12-20T03:14:37.646Z] Error downloading article The_New_Order:_Last_Days_of_Europe/Werbell_III
[error] [2024-12-20T03:14:38.342Z] Error downloading article 4chan/Вариант_из_Луркоморья
[error] [2024-12-20T03:14:39.406Z] Error downloading article EVE_Online/Эпические_битвы
[error] [2024-12-20T03:14:39.893Z] Error downloading article Знают_именно_за_это/IRL/Исторические_деятели_России
[error] [2024-12-20T03:14:42.260Z] Error downloading article My_Little_Pony/Confound
[error] [2024-12-20T03:14:42.894Z] Error downloading article Германия/Персоналии
[error] [2024-12-20T03:14:43.155Z] Error downloading article Blackface/Вариант_из_Posmotre.li
[error] [2024-12-20T03:14:43.756Z] Error downloading article Таня_Гроттер/НП

@kelson42 kelson42 reopened this Dec 20, 2024
@kelson42 kelson42 modified the milestones: 1.9-maintenance, 1.14.0 Dec 20, 2024
@Jaifroid
Collaborator

@artiommocrenco Did you try the same with 1.14 dev?

@artiommocrenco

@Jaifroid For some reason dev does not work at all in my case; see openzim/zim-requests#958 (comment)

@Jaifroid
Collaborator

Hmm, OK, yes it's not quite stable yet from what I understand. While I'm not a dev of mwOffliner, I initially thought the errors may have to do with requesting titles that require a URI-encoded forward-slash to access them on the wiki, but using a string without a URI-encoded slash. However, on your server I can't find https://neolurk.org/wiki/My_Little_Pony%2FConfound nor https://neolurk.org/wiki/My_Little_Pony/Confound . Do these URLs actually exist?
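
For reference, the two URL forms being compared can be generated from a single title with the standard library. This is only a sketch of the two encodings; `candidate_urls` is a hypothetical helper, and which form a given wiki accepts depends on its server configuration.

```python
from urllib.parse import quote

def candidate_urls(wiki_base: str, title: str):
    """Return (slash-encoded URL, slash-preserved URL) for a page title."""
    encoded = wiki_base + quote(title, safe="")     # "/" becomes "%2F"
    preserved = wiki_base + quote(title, safe="/")  # "/" stays a separator
    return encoded, preserved

for url in candidate_urls("https://neolurk.org/wiki/", "My_Little_Pony/Confound"):
    print(url)
```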

@artiommocrenco

> However, on your server I can't find https://neolurk.org/wiki/My_Little_Pony%2FConfound nor https://neolurk.org/wiki/My_Little_Pony/Confound . Do these URLs actually exist?

Second URL works for me @Jaifroid

@Jaifroid
Collaborator

Indeed it does! Apologies, I did actually test it earlier, but I must have made a typo or used wrong case when testing it in the URL field of my browser.

@kelson42 kelson42 modified the milestones: 1.14.0, 1.15.0 Jan 3, 2025
6 participants
@mossroy @kelson42 @Jaifroid @ISNIT0 @artiommocrenco and others