New mobile-html Wikimedia returns random empty responses #2003

benoit74 · 2024-03-04T08:21:22Z

Recipe: https://farm.openzim.org/recipes/wikipedia_arz_all

Error is:

[error] [2024-03-02T00:03:51.082Z] Error downloading article كريس_ستاندرينج
[error] [2024-03-02T00:03:51.086Z] Failed to run mwoffliner after [7351s]: {
	"name": "Error",
	"message": "Cannot render [] into an article"
}
[error] [2024-03-02T00:03:51.086Z] 

**********

Cannot render [] into an article

**********

Looks like the article is not empty online: https://arz.wikipedia.org/wiki/%D9%83%D8%B1%D9%8A%D8%B3_%D8%B3%D8%AA%D8%A7%D9%86%D8%AF%D8%B1%D9%8A%D9%86%D8%AC


benoit74 · 2024-03-12T14:40:25Z

wikipedia_id_all is impacted as well: openzim/zim-requests#879

kelson42 · 2024-06-29T06:23:13Z

@audiodude This now prevents many Wikipedias from rendering properly. I believe this is not a regression in 1.14, but it seriously hampers our testing of 1.14. The last run of WPAR is impacted: https://farm.openzim.org/pipeline/c8708ce9-f831-4c06-a9d6-748e6e860cec/debug

kelson42 · 2024-06-30T09:14:17Z

WPCA is impacted as well: https://farm.openzim.org/pipeline/1c29259f-d858-40f4-8cfb-530696e2b20f/debug

audiodude · 2024-06-30T18:31:00Z

Although the error message is the same, I'm not sure this is the same bug.

For WPARZ, I cannot reproduce with an articleList of only كريس_ستاندرينج.

For WPCA, it is 100% reproducible with an articleList of Khalifa_ibn_Askar. However it is also the case that https://ca.wikipedia.org/api/rest_v1/page/mobile-html/Khalifa_ibn_Askar returns empty/missing data: https://gist.github.com/audiodude/139ad898a925733d56fd08fee5a5fb9f
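A minimal sketch of how such a check could be scripted, assuming the failure shows up as a success status with an empty body (as described later in this thread). The helper names `mobileHtmlUrl` and `isEmptySuccess` are hypothetical and not part of mwoffliner; only the URL layout is taken from the link above.

```typescript
// Build the REST URL of the mobile-html rendering for one article.
// The path layout mirrors the ca.wikipedia.org link quoted above.
function mobileHtmlUrl(host: string, title: string): string {
  return `https://${host}/api/rest_v1/page/mobile-html/${encodeURIComponent(title)}`
}

// True for the failure mode discussed in this issue:
// a success status whose body carries no content at all.
function isEmptySuccess(status: number, body: string): boolean {
  return status >= 200 && status < 300 && body.trim().length === 0
}

// Usage with the global fetch() of Node 18+ (needs network access):
//   const res = await fetch(mobileHtmlUrl('ca.wikipedia.org', 'Khalifa_ibn_Askar'))
//   console.log(res.status, isEmptySuccess(res.status, await res.text()))
```

Keeping the predicate separate from the network call makes it easy to run against saved responses, which matters here because the bug is transient.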

WPID doesn't reproduce the bug when using an article list of IL-2_Sturmovik_(series). However it fails otherwise with the following stack trace: https://gist.github.com/audiodude/7743f8e6020c4dbe9c4f32301c7e5a6e

audiodude · 2024-06-30T19:01:17Z

Finally, realizing that WPAR is different from WPARZ, I tried the former and could not reproduce with articleList of توموت

kelson42 · 2024-07-01T05:03:45Z

Hmmm, not sure what should be done next. In your log the line:

[warn] [2024-06-30T18:30:00.292Z] Couldn't find strings file for [id]

Seems suspicious.

audiodude · 2024-07-01T15:18:49Z

> Seems suspicious.

That's the new message added in #2050. Before, it would simply fail to find the id file, since there's no translation file for that language, and fall back silently to en. Now it logs a message whenever it can't find a required file.
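The behaviour described here can be sketched roughly as follows. This is an illustration only, not the code from #2050: `loadStrings`, its parameters, and the in-memory `available` map (standing in for the translation files on disk) are hypothetical. Only the warning text is copied from the log line quoted above.

```typescript
type Strings = { [key: string]: string }
type Logger = (message: string) => void

// Hypothetical lookup: return the strings for `lang`, or warn and fall back.
// `available` stands in for the translation files found on disk.
function loadStrings(
  lang: string,
  available: { [lang: string]: Strings },
  warn: Logger,
  fallback: string = 'en',
): Strings {
  if (Object.prototype.hasOwnProperty.call(available, lang)) {
    return available[lang]
  }
  // Previously this fallback happened silently; now it is logged.
  warn(`Couldn't find strings file for [${lang}]`)
  if (!Object.prototype.hasOwnProperty.call(available, fallback)) {
    throw new Error(`No strings file for fallback language [${fallback}]`)
  }
  return available[fallback]
}
```

With only an `en` entry available, calling `loadStrings('id', ...)` returns the English strings and emits exactly one warning, which is why the message appears in the log without indicating a real failure.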

kelson42 · 2024-07-15T17:55:02Z

I get it, somehow this message is missing the keyword "language"...

audiodude · 2024-07-16T00:35:01Z

Overall though, this issue is currently non-reproducible and seems due to some kind of upstream bug. Perhaps we should update the code to be more resilient to that. It's not clear what kind of phabricator ticket we could file other than "JSON endpoint sometimes returns empty response for non-empty articles" but without a demonstrable reproduction case.

audiodude · 2024-11-04T18:15:25Z

Just re-tested ARZ, and it's failing on a different article now. This is definitely an upstream bug and I've filed https://phabricator.wikimedia.org/T379017

kelson42 · 2024-12-04T06:21:28Z

@audiodude So it seems that this bug is a kind of new showstopper. Do we know whether some wikis are not impacted at all, or are they all (potentially) impacted?

audiodude · 2024-12-04T18:44:15Z

@kelson42 So far, we've seen it on arz, es Wiktionary, and any other place that is reporting this bug. It doesn't seem limited to any specific Wikimedia project, so potentially any project is impacted as far as we currently know.

EDIT: We of course have the option of skipping pages that exhibit this bug, but I know you don't like to skip pages.

kelson42 · 2024-12-05T04:03:09Z

> Overall though, this issue is currently non-reproducible and seems due to some kind of upstream bug. Perhaps we should update the code to be more resilient to that.

Yes, I see two scenarios:

- Either the API response is syntactically correct, but empty, and we should write an "empty" HTML page,
- or it is not, and we should die properly, reporting that the JSON is not parseable or that a mandatory part of the response is missing.

audiodude · 2024-12-05T04:55:06Z

> Either the API is syntactically correct - but empty - and we should write an "empty" HTML

The API is returning a 200 response that is completely empty (ie ''). I'm fine with writing an empty page, but this will likely require some "re-wiring" of mwoffliner, which I believe assumes in many places that the HTML it received is not empty (such as when doing substitutions).

> OR it's not and we should die properly telling that json is not parseable or this mandatory part of the response is missing.

This somewhat contradicts the above statement. It's "syntactically correct" in the sense that an empty string is a string. It is a response, a 200 success, though an empty one. But the article exists per our previous article discovery mechanism, it's live on the wiki, just the API response is empty, so it's definitely some kind of bug.

kelson42 · 2024-12-05T05:16:07Z

> > Either the API is syntactically correct - but empty - and we should write an "empty" HTML
>
> The API is returning a 200 response that is completely empty (ie ''). I'm fine with writing an empty page, but this will likely require some "re-wiring" of mwoffliner, which I believe assumes in many places that the HTML it received is not empty (such as when doing substitutions).
>
> > OR it's not and we should die properly telling that json is not parseable or this mandatory part of the response is missing.
>
> This somewhat contradicts the above statement. It's "syntactically correct" in the sense that an empty string is a string. It is a response, a 200 success, though an empty one. But the article exists per our previous article discovery mechanism, it's live on the wiki, just the API response is empty, so it's definitely some kind of bug.

OK, I think we can agree that an empty string is not a valid JSON string (even if theoretically it is).
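The two scenarios discussed above could be told apart along these lines. This is a sketch under stated assumptions, not mwoffliner code: `classifyBody`, the `Outcome` type and the payload shape (a JSON object with an `html` string field) are all hypothetical. The one hard fact it relies on is that `JSON.parse('')` throws a `SyntaxError`, so a completely empty body lands in the "die with a clear error" branch.

```typescript
// Outcome of inspecting one raw API response body.
type Outcome =
  | { kind: 'render'; html: string }   // usable content
  | { kind: 'empty-page' }             // well-formed response without content
  | { kind: 'fatal'; reason: string }  // malformed response: stop with a clear error

// Hypothetical payload shape: a JSON object with an `html` string field.
function classifyBody(articleId: string, body: string): Outcome {
  let parsed: unknown
  try {
    parsed = JSON.parse(body) // throws SyntaxError on '' and on any non-JSON text
  } catch (err) {
    return {
      kind: 'fatal',
      reason: `Response for [${articleId}] is not parseable JSON (body length ${body.length})`,
    }
  }
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    return { kind: 'fatal', reason: `Response for [${articleId}] is not a JSON object` }
  }
  const html = (parsed as { html?: unknown }).html
  if (typeof html !== 'string') {
    return { kind: 'fatal', reason: `Response for [${articleId}] is missing the mandatory "html" part` }
  }
  if (html.trim() === '') {
    return { kind: 'empty-page' }
  }
  return { kind: 'render', html }
}
```

The error messages name the article and the reason, so a failed scrape points directly at the response that caused it.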

benoit74 · 2024-12-05T07:17:58Z

What is worrying from my PoV is that folks in the upstream issue https://phabricator.wikimedia.org/T379017 agree that there is an issue and that it is transient. It means that whatever we do in mwoffliner, we will keep having problems with missing articles in the ZIM because the API returned a bad response. I think we must insist that the upstream issue be found and fixed.

kelson42 · 2024-12-05T08:53:57Z

@benoit74 We won't have empty articles; we will stop the scraping with a clear error. Once this is handled properly, I will move the issue to the next milestone, as it should not be a blocker for the release of 1.14.0 (there is nothing more we can do).

For the rest I will escalate the Phabricator bug tomorrow.

benoit74 · 2024-12-05T10:02:19Z

I suspect a very significant portion of Zimfarm recipes will become unstable until the upstream ticket is fixed, but indeed there is not much more I would advise doing, and this should not block 1.14 if we have a clear message.

kelson42 · 2024-12-11T05:03:57Z

Wikipedia is also impacted: yesterday the WPCA scrape failed twice on the same article with this error: https://farm.openzim.org/pipeline/cb6b6898-1a06-4e51-bc93-2091a6a01c31/debug

audiodude · 2024-12-11T16:50:44Z

Yes I've been reporting additional articles to the phabricator ticket (https://phabricator.wikimedia.org/T379017) as they fail my scrapes. The good news is that there's some activity on the ticket now.

audiodude · 2025-01-02T04:14:28Z

cc @kelson42

I don't think we can wait for this issue to be resolved upstream before releasing 1.14. My advice is to build in code to "skip" articles that don't return content from the mobile-html endpoint. I think there's already the concept of "missing" in the final mwoffliner output.

Then we can mark the code that does the skipping with a TODO and this issue number, and remove it later when this issue is fixed upstream.
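A rough sketch of what such a skip could look like, assuming a simplified synchronous pipeline. `collectArticles`, `ScrapeReport` and the `download`/`save` callbacks are hypothetical names for illustration and do not reflect mwoffliner's real structure.

```typescript
interface ScrapeReport {
  saved: string[]
  missing: string[]
}

// Hypothetical, simplified pipeline: download each article and save it,
// skipping (and recording) articles whose content comes back empty.
function collectArticles(
  articleIds: string[],
  download: (articleId: string) => string, // returns the raw HTML body
  save: (articleId: string, html: string) => void,
): ScrapeReport {
  const report: ScrapeReport = { saved: [], missing: [] }
  for (const articleId of articleIds) {
    const html = download(articleId)
    // TODO(#2003): remove this skip once the upstream bug (T379017) is fixed.
    if (html.trim() === '') {
      report.missing.push(articleId)
      continue
    }
    save(articleId, html)
    report.saved.push(articleId)
  }
  return report
}
```

Recording the skipped IDs in `missing` keeps the workaround visible in the final output instead of silently dropping articles.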

kelson42 · 2025-01-02T07:45:32Z

Yes! This upstream bug should not stop us from moving forward. I have postponed the issue.

We will do what is necessary so that this bug is fixed at the WMF ASAP.

We don't create passthrough exceptions in scrapers because they lead to broken ZIMs, and people notice, complain, etc.

benoit74 · 2025-01-03T08:34:05Z

I've run some tests on WPCA and will summarize them in the upstream issue, but basically the outcome is that fewer than 10 articles out of 765k return empty content in my tests. They are not always the same ones, and retrying immediately does not help: some are still bad after days, some resolve after a few hours.

This means the impact of a passthrough is really minimal (around 1 per 100k articles).

And I strongly suspect the upstream bug is some kind of cache issue, so it probably impacts all or most Wikimedia instances.

So I don't agree that we can postpone this issue, it is most probably going to be blocking many ZIMs from being produced.
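As a back-of-the-envelope check of these figures: 10 failures out of 765k articles is about 1.3 per 100k, yet a scrape that treats each failure as fatal is still likely to hit one. The sketch below assumes failures are independent and occur at a uniform rate, which is a simplification (the tests above suggest some articles stay bad for days). The function names are illustrative.

```typescript
// Observed failure rate expressed per 100,000 articles.
function failuresPer100k(failed: number, total: number): number {
  if (total <= 0) throw new RangeError('total must be positive')
  return (failed / total) * 100000
}

// Probability that a scrape of `articleCount` articles meets at least one
// failing article, ASSUMING independent failures at a uniform rate.
function probabilityOfAtLeastOneFailure(ratePerArticle: number, articleCount: number): number {
  return 1 - Math.pow(1 - ratePerArticle, articleCount)
}

// Upper bound taken from the WPCA test: fewer than 10 failures in 765k articles.
const observedRate = 10 / 765000

console.log(failuresPer100k(10, 765000).toFixed(2)) // prints "1.31"
console.log(probabilityOfAtLeastOneFailure(observedRate, 100000).toFixed(2)) // prints "0.73"
console.log(probabilityOfAtLeastOneFailure(observedRate, 765000)) // very close to 1
```

Under these assumptions, even a 100k-article wiki would fail in most runs while the error stays fatal, although only about one article in 100k is affected.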

And I agree that we need to create a passthrough. We already know how to limit the impact on ZIM quality: add a clear "we are sorry" page in place of the missing article, because we know there should be something here but the scraper failed to retrieve it.
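Such a placeholder could be generated along these lines. This is only a sketch: `placeholderPage`, the markup and the wording are illustrative assumptions, not an existing mwoffliner template.

```typescript
// Escape text so it can be embedded safely in HTML.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
}

// Hypothetical placeholder for an article that exists on the wiki
// but whose content could not be retrieved by the scraper.
function placeholderPage(title: string): string {
  const safeTitle = escapeHtml(title)
  return [
    '<!DOCTYPE html>',
    '<html><head><meta charset="utf-8">',
    `<title>${safeTitle}</title></head>`,
    '<body>',
    `<h1>${safeTitle}</h1>`,
    '<p>We are sorry: this article exists on the wiki, but the scraper could not retrieve its content.</p>',
    '</body></html>',
  ].join('\n')
}
```

Keeping the original title on the page means links from other articles still resolve to something meaningful inside the ZIM.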

kelson42 · 2025-01-03T08:41:45Z

> So I don't agree that we can postpone this issue, it is most probably going to be blocking many ZIMs from being produced.

I know. I prefer not to generate any broken-by-design ZIM or scraper. We should rather focus on getting the problem fixed upstream!

audiodude · 2025-01-03T16:19:58Z

@kelson42 While I appreciate moving this issue from 1.14 to 1.15, it's only a token gesture. If we release 1.14 without what @benoit74 is referring to as a "passthrough", as he pointed out, we will not be able to reliably create ZIMs from WM projects, because as it is this is a fatal error.

kelson42 · 2025-01-03T16:31:14Z

> @kelson42 While I appreciate moving this issue from 1.14 to 1.15, it's only a token gesture. If we release 1.14 without what @benoit74 is referring to as a "passthrough", as he pointed out, we will not be able to reliably create ZIMs from WM projects, because as it is this is a fatal error.

It is perfectly clear to me that any big ZIM will probably be impacted by this upstream bug... and be impossible to produce for the moment.

Once 1.14 is released, we should focus on fixing our bugs and, for your part, on integrating the latest node-libzim.

I have already taken measures to raise the awareness on Wikimedia side about this serious API bug.

benoit74 added the bug label Mar 4, 2024

benoit74 mentioned this issue Mar 4, 2024

wikipedia_arz_all is failing openzim/zim-requests#853

Open

benoit74 mentioned this issue Mar 12, 2024

wikipedia_id_all is failing continuously since 5 months openzim/zim-requests#879

Open

kelson42 added this to the 1.14.0 milestone Jun 29, 2024

kelson42 closed this as completed Jun 29, 2024

kelson42 reopened this Jun 29, 2024

kelson42 changed the title ~~wikipedia_arz_all is failing with Cannot render [] into an article~~ wikipedia_arz_all (and a few other WP) is failing with Cannot render [] into an article Jun 29, 2024

audiodude mentioned this issue Nov 5, 2024

Produce a scrape of a smallish Wiktionary with dev-1.14 for testing #2098

Open

kelson42 removed this from the 1.14.0 milestone Jan 2, 2025

kelson42 added this to the 1.15.0 milestone Jan 2, 2025

kelson42 changed the title ~~wikipedia_arz_all (and a few other WP) is failing with Cannot render [] into an article~~ New mobile-html Wikimedia returns random empty responses Jan 3, 2025
