New mobile-html Wikimedia returns random empty responses #2003

Open
benoit74 opened this issue Mar 4, 2024 · 26 comments

@benoit74
Contributor

benoit74 commented Mar 4, 2024

Recipe: https://farm.openzim.org/recipes/wikipedia_arz_all

Error is:

[error] [2024-03-02T00:03:51.082Z] Error downloading article كريس_ستاندرينج
[error] [2024-03-02T00:03:51.086Z] Failed to run mwoffliner after [7351s]: {
	"name": "Error",
	"message": "Cannot render [] into an article"
}
[error] [2024-03-02T00:03:51.086Z] 

**********

Cannot render [] into an article

**********

Looks like the article is not empty online: https://arz.wikipedia.org/wiki/%D9%83%D8%B1%D9%8A%D8%B3_%D8%B3%D8%AA%D8%A7%D9%86%D8%AF%D8%B1%D9%8A%D9%86%D8%AC

@benoit74
Contributor Author

wikipedia_id_all is impacted as well: openzim/zim-requests#879

@kelson42 kelson42 added this to the 1.14.0 milestone Jun 29, 2024
@kelson42
Collaborator

@audiodude This now prevents many Wikipedias from rendering properly. I believe this is not a regression in 1.14, but it seriously hampers our ability to move forward with testing 1.14. The last run of WPAR is impacted: https://farm.openzim.org/pipeline/c8708ce9-f831-4c06-a9d6-748e6e860cec/debug

@kelson42 kelson42 reopened this Jun 29, 2024
@kelson42 kelson42 changed the title wikipedia_arz_all is failing with Cannot render [] into an article wikipedia_arz_all (and a few other WP) is failing with Cannot render [] into an article Jun 29, 2024
@kelson42
Collaborator

@audiodude
Member

Although the error message is the same, I'm not sure this is the same bug.

For WPARZ, I cannot reproduce with an articleList of only كريس_ستاندرينج.

For WPCA, it is 100% reproducible with an articleList of Khalifa_ibn_Askar. However it is also the case that https://ca.wikipedia.org/api/rest_v1/page/mobile-html/Khalifa_ibn_Askar returns empty/missing data: https://gist.github.com/audiodude/139ad898a925733d56fd08fee5a5fb9f

WPID doesn't reproduce the bug when using an article list of IL-2_Sturmovik_(series). However it fails otherwise with the following stack trace: https://gist.github.com/audiodude/7743f8e6020c4dbe9c4f32301c7e5a6e

@audiodude
Member

Finally, realizing that WPAR is different from WPARZ, I tried the former and could not reproduce with articleList of توموت

@kelson42
Collaborator

kelson42 commented Jul 1, 2024

Hmmm, not sure what should be done next. In your log the line:

[warn] [2024-06-30T18:30:00.292Z] Couldn't find strings file for [id]

Seems suspicious.

@audiodude
Member

> Seems suspicious.

That's the new message added in #2050. Before, it would simply fail to find the id file, since there's no translation file for that language, and fall back silently to en. Now it logs a message whenever it can't find a required file.

@kelson42
Collaborator

I get it, somehow this message is missing the keyword "language"...
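
The fallback behaviour discussed here, with a warning that says what kind of file is missing, might look roughly like the sketch below. This is not mwoffliner's actual code: `resolveStringsLanguage`, its parameters, and the exact wording of the warning are all assumptions made for illustration.

```typescript
// Hypothetical helper, not part of mwoffliner: pick the language whose strings
// file should be used, warning when the requested one is unavailable.
function resolveStringsLanguage(
  requested: string,
  available: string[],
  warn: (message: string) => void,
  fallback: string = 'en',
): string {
  if (available.indexOf(requested) !== -1) {
    return requested
  }
  // Assumed wording: name the language explicitly so the log line is self-explanatory.
  warn(`Couldn't find strings file for language [${requested}], falling back to [${fallback}]`)
  return fallback
}

// Example: no translation file exists for "id", so "en" is used and one warning is emitted.
const chosenLanguage = resolveStringsLanguage('id', ['en', 'fr'], (message) => console.warn(message))
console.log(chosenLanguage)
```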

@audiodude
Member

Overall though, this issue is currently non-reproducible and seems due to some kind of upstream bug. Perhaps we should update the code to be more resilient to that. It's not clear what kind of Phabricator ticket we could file other than "JSON endpoint sometimes returns empty response for non-empty articles", and we would have no demonstrable reproduction case to attach to it.

@audiodude
Member

Just re-tested ARZ, and it's failing on a different article now. This is definitely an upstream bug and I've filed https://phabricator.wikimedia.org/T379017

@kelson42
Collaborator

kelson42 commented Dec 4, 2024

@audiodude So it seems that this bug is a new kind of show stopper. Do we know whether some wikis are not impacted at all, or are they all (potentially) impacted?

@audiodude
Member

audiodude commented Dec 4, 2024

@kelson42 So far, we've seen it on arz, es Wiktionary, and any other place that is reporting this bug. It doesn't seem limited to any specific Wikimedia project, so potentially any project is impacted as far as we currently know.

EDIT: We of course have the option of skipping pages that exhibit this bug, but I know you don't like to skip pages.

@kelson42
Collaborator

kelson42 commented Dec 5, 2024

> Overall though, this issue is currently non-reproducible and seems due to some kind of upstream bug. Perhaps we should update the code to be more resilient to that.

Yes, I see two scenarios:

  • Either the API response is syntactically correct but empty, and we should write an "empty" HTML page
  • Or it is not, and we should fail cleanly, reporting that the JSON is not parseable or that a mandatory part of the response is missing.

@audiodude
Member

>   • Either the API response is syntactically correct but empty, and we should write an "empty" HTML page

The API is returning a 200 response that is completely empty (i.e. ''). I'm fine with writing an empty page, but this will likely require some "re-wiring" of mwoffliner, which I believe assumes in many places that the HTML it received is not empty (such as when doing substitutions).

>   • Or it is not, and we should fail cleanly, reporting that the JSON is not parseable or that a mandatory part of the response is missing.

This somewhat contradicts the above statement. It's "syntactically correct" in the sense that an empty string is a string. It is a response, a 200 success, though an empty one. But the article exists per our previous article discovery mechanism and it's live on the wiki; only the API response is empty, so it's definitely some kind of bug.
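
To make that distinction concrete, here is a minimal sketch of how a response could be classified before any rendering is attempted. It is not mwoffliner's actual code: `classifyBody` and the three labels are hypothetical names chosen for illustration. The point is only that a 2xx response with an empty body becomes its own case instead of being confused with a normal success or an HTTP error.

```typescript
// Hypothetical helper (not part of mwoffliner): classify a raw HTTP response
// so that an empty 2xx body is reported separately from other outcomes.
type BodyClass = 'ok' | 'empty-success' | 'http-error'

function classifyBody(status: number, body: string | null | undefined): BodyClass {
  if (status < 200 || status >= 300) {
    return 'http-error'
  }
  // A success status with no usable content: the case described in this thread.
  if (body === null || body === undefined || body.trim() === '') {
    return 'empty-success'
  }
  return 'ok'
}

console.log(classifyBody(200, ''))
console.log(classifyBody(200, '<html><body>content</body></html>'))
console.log(classifyBody(404, 'Not found'))
```

With such a label in hand, the caller can decide separately whether an empty success should stop the scrape with a clear message or be handled some other way.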

@kelson42
Collaborator

kelson42 commented Dec 5, 2024

> >   • Either the API response is syntactically correct but empty, and we should write an "empty" HTML page
>
> The API is returning a 200 response that is completely empty (i.e. ''). I'm fine with writing an empty page, but this will likely require some "re-wiring" of mwoffliner, which I believe assumes in many places that the HTML it received is not empty (such as when doing substitutions).
>
> >   • Or it is not, and we should fail cleanly, reporting that the JSON is not parseable or that a mandatory part of the response is missing.
>
> This somewhat contradicts the above statement. It's "syntactically correct" in the sense that an empty string is a string. It is a response, a 200 success, though an empty one. But the article exists per our previous article discovery mechanism and it's live on the wiki; only the API response is empty, so it's definitely some kind of bug.

OK, I think we can agree that an empty string is not a valid JSON string (even if theoretically it is).

@benoit74
Contributor Author

benoit74 commented Dec 5, 2024

What worries me is that folks in the upstream issue https://phabricator.wikimedia.org/T379017 agree that there is an issue and that it is transient. This means that whatever we do in mwoffliner, we will keep having missing articles in the ZIM because the API returned a bad response. I think we must insist that the upstream issue be found and fixed.

@kelson42
Collaborator

kelson42 commented Dec 5, 2024

@benoit74 We won't have empty articles; we will stop the scrape with a clear error. Once this is handled properly, I will move the issue to the next milestone, as it should not be a blocker for the release of 1.14.0 (there is nothing more we could do).

For the rest I will escalate the Phabricator bug tomorrow.

@benoit74
Contributor Author

benoit74 commented Dec 5, 2024

I suspect a very significant portion of Zimfarm recipes will become unstable until the upstream ticket is fixed, but indeed there is not much I would advise doing, and this should not block 1.14 if we have a clear message.

@kelson42
Collaborator

Wikipedia is also impacted: yesterday the WPCA scrape failed twice on the same article with this error: https://farm.openzim.org/pipeline/cb6b6898-1a06-4e51-bc93-2091a6a01c31/debug

@audiodude
Member

Yes I've been reporting additional articles to the phabricator ticket (https://phabricator.wikimedia.org/T379017) as they fail my scrapes. The good news is that there's some activity on the ticket now.

@audiodude
Member

cc @kelson42

I don't think we can wait for this issue to be resolved upstream before releasing 1.14. My advice is to build in code to "skip" articles that don't return content from the mobile-html endpoint. I think there's already the concept of "missing" in the final mwoffliner output.

Then we can mark the code that does the skipping with a TODO and this issue number, and remove it later when this issue is fixed upstream.
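
As a rough illustration of that proposal, the sketch below separates articles whose fetched HTML is empty from those that can be rendered, and carries the TODO marker suggested above. Everything here is hypothetical: `FetchedArticle`, `partitionFetched`, and the `missing` list are illustrative names, not mwoffliner's existing types or its actual "missing" mechanism, and the sketch assumes the responses have already been fetched.

```typescript
// Hypothetical sketch (illustrative names, not mwoffliner's API).
interface FetchedArticle {
  id: string
  html: string
}

interface Partition {
  renderable: FetchedArticle[]
  missing: string[]
}

function partitionFetched(articles: FetchedArticle[]): Partition {
  const renderable: FetchedArticle[] = []
  const missing: string[] = []
  for (const article of articles) {
    // TODO(#2003): remove this skip once the upstream bug (T379017) is fixed.
    if (article.html.trim() === '') {
      missing.push(article.id)
      continue
    }
    renderable.push(article)
  }
  return { renderable, missing }
}

const examplePartition = partitionFetched([
  { id: 'Article_A', html: '<p>content</p>' },
  { id: 'Article_B', html: '' },
])
console.log(examplePartition.missing)
```

Keeping the skipped identifiers in a list means they can still be reported at the end of the run rather than silently dropped.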

@kelson42 kelson42 removed this from the 1.14.0 milestone Jan 2, 2025
@kelson42 kelson42 added this to the 1.15.0 milestone Jan 2, 2025
@kelson42
Collaborator

kelson42 commented Jan 2, 2025

Yes! This upstream bug should not stop us from moving forward. I have postponed the issue.

We will do what is necessary to get this bug fixed at the WMF as soon as possible.

We don't create passthrough exceptions in scrapers because they lead to broken ZIMs, and people notice, complain, etc.

@benoit74
Contributor Author

benoit74 commented Jan 3, 2025

I've run some tests on WPCA and will summarize them in the upstream issue. The outcome is that fewer than 10 articles out of 765k returned empty content in my tests. They are not always the same articles, but retrying immediately does not help: some are still bad after days, some resolve after a few hours.

This means the impact of a passthrough is really minimal (around 1 per 100k articles).

And I strongly suspect the upstream bug is some kind of cache issue, so it probably impacts all or most Wikimedia instances.

So I don't agree that we can postpone this issue, it is most probably going to be blocking many ZIMs from being produced.
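
A back-of-the-envelope calculation shows why a rate this small can still block large scrapes when any single empty response is fatal. The sketch below takes the WPCA figures quoted in this comment (10 of 765k, used as an upper bound) and adds one assumption that is not established anywhere in this thread: that each article fails independently with the same probability p, so a scrape of n articles hits at least one failure with probability 1 - (1 - p)^n.

```typescript
// Rough estimate only; independence between articles is an assumption.
function failuresPer100k(failures: number, total: number): number {
  return (failures / total) * 100000
}

// 1 - (1 - p)^n, computed via log1p/expm1 to stay accurate for very small p.
function probAtLeastOneFailure(p: number, n: number): number {
  return -Math.expm1(n * Math.log1p(-p))
}

const perArticleRate = 10 / 765000

// About 1.3 failures per 100k articles.
console.log(failuresPer100k(10, 765000))

// For a full scrape of 765k articles, the probability of hitting at least one
// empty response is very close to 1 under these assumptions.
console.log(probAtLeastOneFailure(perArticleRate, 765000))

// A much smaller wiki is far less exposed.
console.log(probAtLeastOneFailure(perArticleRate, 10000))
```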

And I agree that we need to create a passthrough. We already know how to limit the impact on ZIM quality: add a clear "we are sorry" page in place of the missing article, because we know there should be something there but the scraper failed to retrieve it.
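
A "we are sorry" page of this kind could be as simple as the sketch below. The function names, the page wording, and the HTML structure are all assumptions for illustration; nothing here reflects an existing mwoffliner template. The title is HTML-escaped because article titles can contain characters such as `&` or `<`.

```typescript
// Hypothetical sketch: build a placeholder page for an article whose content
// could not be retrieved. Names and wording are illustrative only.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;')
}

function renderPlaceholderPage(title: string): string {
  const safeTitle = escapeHtml(title)
  return [
    '<!DOCTYPE html>',
    '<html>',
    `<head><meta charset="utf-8"><title>${safeTitle}</title></head>`,
    '<body>',
    `<h1>${safeTitle}</h1>`,
    '<p>We are sorry: this article exists on the wiki, but its content could not be retrieved when this ZIM was created.</p>',
    '</body>',
    '</html>',
  ].join('\n')
}

console.log(renderPlaceholderPage('Khalifa ibn Askar'))
```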

@kelson42
Collaborator

kelson42 commented Jan 3, 2025

> So I don't agree that we can postpone this issue, it is most probably going to be blocking many ZIMs from being produced.

I know, but I prefer not to generate any broken-by-design ZIM or scraper. We should rather focus on getting the problem fixed upstream!

@kelson42 kelson42 changed the title wikipedia_arz_all (and a few other WP) is failing with Cannot render [] into an article New mobile-html Wikimedia returns random empty responses Jan 3, 2025
@audiodude
Member

@kelson42 While I appreciate moving this issue from 1.14 to 1.15, it's only a token gesture. If we release 1.14 without what @benoit74 is referring to as a "passthrough", as he pointed out, we will not be able to reliably create ZIMs from WM projects, because as it is this is a fatal error.

@kelson42
Collaborator

kelson42 commented Jan 3, 2025

> @kelson42 While I appreciate moving this issue from 1.14 to 1.15, it's only a token gesture. If we release 1.14 without what @benoit74 is referring to as a "passthrough", as he pointed out, we will not be able to reliably create ZIMs from WM projects, because as it is this is a fatal error.

It is perfectly clear to me that any big ZIM will probably be impacted by this upstream bug, and will be impossible to produce for the moment.

Once 1.14 is released, we should focus on fixing our bugs and, for your part, integrating the latest node-libzim.

I have already taken measures to raise awareness on the Wikimedia side about this serious API bug.
