New mobile-html Wikimedia returns random empty responses #2003
Comments
wikipedia_id_all is impacted as well: openzim/zim-requests#879 |
@audiodude This now prevents many Wikipedias from rendering properly. I believe this is not a regression in 1.14, but it seriously impairs our ability to move forward with testing 1.14. The last run of WPAR is impacted: https://farm.openzim.org/pipeline/c8708ce9-f831-4c06-a9d6-748e6e860cec/debug |
Cannot render [] into an article
Cannot render [] into an article
WPCA impacted as well https://farm.openzim.org/pipeline/1c29259f-d858-40f4-8cfb-530696e2b20f/debug |
Although the error message is the same, I'm not sure this is the same bug. For WPARZ, I cannot reproduce with an articleList of only For WPCA, it is 100% reproducible with an articleList of WPID doesn't reproduce the bug when using an article list of |
Finally, realizing that WPAR is different from WPARZ, I tried the former and could not reproduce with articleList of |
Hmmm, not sure what should be done next. In your log the line:
Seems suspicious. |
That's the new message added in #2050. Before, it would simply fail to find the |
I get it, somehow this message is missing the keyword "language"... |
Overall though, this issue is currently non-reproducible and seems due to some kind of upstream bug. Perhaps we should update the code to be more resilient to that. It's not clear what kind of phabricator ticket we could file other than "JSON endpoint sometimes returns empty response for non-empty articles" but without a demonstrable reproduction case. |
Just re-tested ARZ, and it's failing on a different article now. This is definitely an upstream bug and I've filed https://phabricator.wikimedia.org/T379017 |
@audiodude So it seems that this bug is a new kind of show stopper. Do we know whether some wikis are not impacted at all, or are they all (potentially) impacted? |
@kelson42 So far, we've seen it on arz, es Wiktionary, and any other place that is reporting this bug. It doesn't seem limited to any specific Wikimedia project, so potentially any project is impacted as far as we currently know. EDIT: We of course have the option of skipping pages that exhibit this bug, but I know you don't like to skip pages. |
Yes, I see two scenarios:
|
The API is returning a 200 response that is completely empty (ie
This somewhat contradicts the above statement. It's "syntactically correct" in the sense that an empty string is a string. It is a response, a 200 success, though an empty one. But the article exists per our previous article discovery mechanism, it's live on the wiki, just the API response is empty, so it's definitely some kind of bug. |
OK, I think we can agree that an empty string is not a valid JSON document (even if, strictly speaking, it is still a string). |
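To make the distinction discussed above concrete, here is a minimal sketch of telling an empty 200 body apart from malformed JSON. The names `ParseResult` and `classifyBody` are hypothetical and are not taken from mwoffliner; the sketch only assumes the raw response body is available as a string.

```typescript
// Hypothetical helper for illustration; not mwoffliner's actual code.
type ParseResult =
  | { kind: "ok"; value: unknown }
  | { kind: "empty" }
  | { kind: "malformed"; reason: string };

function classifyBody(body: string): ParseResult {
  // An empty (or whitespace-only) body is reported separately so that
  // the log can name the upstream symptom explicitly.
  if (body.trim() === "") {
    return { kind: "empty" };
  }
  try {
    return { kind: "ok", value: JSON.parse(body) };
  } catch (err) {
    // JSON.parse throws a SyntaxError on invalid input.
    const reason = err instanceof Error ? err.message : String(err);
    return { kind: "malformed", reason };
  }
}
```

Note that this only checks for an empty body; a syntactically valid but content-free payload such as `[]` is still classified as `ok` and would need its own check.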
What is worrying from my PoV is that folks in the upstream issue https://phabricator.wikimedia.org/T379017 agree that there is an issue and that it is transient. It means that whatever we do in mwoffliner, we will keep having problems with articles missing from the ZIM because the API returned a bad response. I think we must insist that the upstream issue be found and fixed. |
@benoit74 We won't have empty articles; we will stop the scraping with a clear error. Once this is handled properly, I will move the issue to the next milestone, as this should not be a blocker for the release of 1.14.0 (there is nothing more we could do). For the rest, I will escalate the Phabricator bug tomorrow. |
I suspect a very significant portion of Zimfarm recipes will become unstable until the upstream ticket is fixed, but indeed there is not much I would advise doing, and this should not block 1.14 if we have a clear message. |
Wikipedia is also impacted: yesterday the WPCA scrape failed twice on the same article with this error: https://farm.openzim.org/pipeline/cb6b6898-1a06-4e51-bc93-2091a6a01c31/debug |
Yes I've been reporting additional articles to the phabricator ticket (https://phabricator.wikimedia.org/T379017) as they fail my scrapes. The good news is that there's some activity on the ticket now. |
cc @kelson42 I don't think we can wait for this issue to be resolved upstream before releasing 1.14. My advice is to build in code to "skip" articles that don't return content from the Then we can mark the code that does the skipping with a TODO and this issue number, and remove it later when this issue is fixed upstream. |
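A rough sketch of what such a temporary skip could look like, assuming the response bodies have already been fetched (real fetching would be asynchronous). `FetchedArticle`, `Partition`, and `partitionEmptyResponses` are hypothetical names for illustration, not existing mwoffliner code.

```typescript
// Hypothetical types and helper for illustration only.
interface FetchedArticle {
  id: string;
  body: string; // raw response body, assumed to be fetched already
}

interface Partition {
  renderable: FetchedArticle[];
  skippedIds: string[];
}

// TODO(#2003): remove this skip once
// https://phabricator.wikimedia.org/T379017 is fixed upstream.
function partitionEmptyResponses(articles: FetchedArticle[]): Partition {
  const renderable: FetchedArticle[] = [];
  const skippedIds: string[] = [];
  for (const article of articles) {
    if (article.body.trim() === "") {
      // Record the ID so skipped articles can be logged and reported upstream.
      skippedIds.push(article.id);
    } else {
      renderable.push(article);
    }
  }
  return { renderable, skippedIds };
}
```

Keeping the skipped IDs in a separate list makes it easy to log them or attach them to the Phabricator ticket.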
Yes! This upstream bug should not stop us from moving forward. I have postponed the issue. We will do what is necessary to get this bug fixed at the WMF asap. We don't create passthrough exceptions in scrapers because they lead to broken ZIMs, and people notice, complain, etc. |
I've run some tests on WPCA and will summarize them in the upstream issue. Basically, fewer than 10 articles out of 765k returned empty content in my tests. They are not always the same articles, but retrying immediately does not help: some are still bad after days, some resolve after a few hours. This means the impact of a passthrough is really minimal (around 1 per 100k articles). I strongly suspect the upstream bug is some kind of cache issue, so it probably impacts all or most Wikimedia instances. So I don't agree that we can postpone this issue; it is most probably going to block many ZIMs from being produced. And I agree that we need to create a passthrough. We already know how to limit the impact on ZIM quality: add a clear "we are sorry" page in place of the missing article, because we know there should be something there but the scraper failed to retrieve it. |
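For illustration, a minimal sketch of the "we are sorry" placeholder idea, together with the back-of-the-envelope rate from the numbers above. All names (`escapeHtml`, `placeholderHtml`, `failuresPer100k`) and the placeholder wording are assumptions, not existing mwoffliner code.

```typescript
// Hypothetical helpers for illustration only.

// Escape characters that are significant in HTML text and attribute values.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build a stand-in page for an article whose content could not be retrieved.
function placeholderHtml(title: string): string {
  const safeTitle = escapeHtml(title);
  return [
    "<html>",
    `<head><title>${safeTitle}</title></head>`,
    "<body>",
    `<h1>${safeTitle}</h1>`,
    "<p>We are sorry: this article exists on the wiki, but the scraper could not retrieve its content.</p>",
    "</body>",
    "</html>",
  ].join("\n");
}

// Failure rate expressed per 100,000 articles.
function failuresPer100k(failures: number, total: number): number {
  return (failures / total) * 100000;
}
```

With the upper bound of 10 failures out of 765,000 articles, `failuresPer100k(10, 765000)` comes to roughly 1.3, which matches the "around 1 per 100k" estimate.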
I know, but I prefer not to generate any broken-by-design ZIM or scraper. We should rather focus on getting the problem fixed upstream!
Cannot render [] into an article
It is perfectly clear to me that any big ZIM will probably be impacted by this upstream bug... and made impossible to produce for the moment. Once 1.14 is released, we should focus on fixing our bugs and, for your part, on integrating the latest node-libzim. I have already taken measures to raise awareness on the Wikimedia side about this serious API bug. |
Recipe: https://farm.openzim.org/recipes/wikipedia_arz_all
Error is:
Looks like the article is not empty online: https://arz.wikipedia.org/wiki/%D9%83%D8%B1%D9%8A%D8%B3_%D8%B3%D8%AA%D8%A7%D9%86%D8%AF%D8%B1%D9%8A%D9%86%D8%AC