RequestTimeTooSkewed error while downloading from cache #2137
When it comes to downloading files, the scraper iterates over the files queued in Redis with a given worker concurrency; when CLI speed is 1, the number of workers is derived as shown in mwoffliner/src/util/saveArticles.ts (lines 27 to 36, at fc2af69) and mwoffliner/src/mwoffliner.lib.ts (lines 129 to 130, at fc2af69).

Each of these "redis" workers gets a share of the files to download, and each will in turn start its own parallel downloads (see mwoffliner/src/util/saveArticles.ts, lines 89 to 126, and line 8, at fc2af69).

All this means that, in general, we have up to 80 workers downloading files at the same time. Since Node.js sockets are limited, not all of these 80 workers can proceed at once, but they all build the download request and then wait for a free socket to use. Under some conditions (network congestion, slow hosts, ...), the time between the moment the request is built and the moment the download really happens becomes too long. While reimplementing #2093 (which was then reverted in #2120) could be a solution, are we really sure that 80 simultaneous downloads is OK? To me it looks like way too much, especially since we allow multiple tasks on the Zimfarm in parallel on a single worker (typically 4) and have multiple workers (at least the 4 mwoffliner machines). All this combined means that mwoffliner tasks alone can easily perform 1k downloads in parallel, many targeting upload.wikimedia.org and the Wasabi S3 bucket by design. But maybe I am missing something, I do not have sufficient background on this scraper.
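To make the failure mode concrete: RequestTimeTooSkewed means the timestamp baked into the signed request differs too much from the S3 server's clock, which is exactly what can happen when a request is built early and then sits waiting for a socket. Below is a minimal sketch of the alternative, building the request only once a download slot is actually free; this is not mwoffliner code, and `p-limit`, the 50-slot budget, and the 30 s axios timeout are assumptions for illustration only.

```ts
import pLimit from 'p-limit'
import axios from 'axios'

// Assumed socket budget for illustration only: many queued downloads,
// but only this many HTTP requests are ever built and in flight at once.
const limit = pLimit(50)

async function downloadOne(url: string): Promise<Buffer> {
  return limit(async () => {
    // The HTTP request is only created here, once a slot is free. Applied to
    // a signed S3 GetObject, the same pattern keeps the signature timestamp
    // fresh instead of letting it go stale while queuing for a socket.
    const resp = await axios.get(url, {
      responseType: 'arraybuffer',
      timeout: 30_000, // assumed value, ideally the same timeout as other HTTP calls
    })
    return Buffer.from(resp.data)
  })
}

// Usage: 80 queued files, but at most 50 requests in flight at any moment.
async function downloadAll(urls: string[]): Promise<Buffer[]> {
  return Promise.all(urls.map((url) => downloadOne(url)))
}
```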
You probably cannot imagine what a journey this issue has taken me on. I will only share conclusions in this comment, but the journey was more tortuous than it might look.

One root of the issue is, as mentioned in #2093, that the scraper does not have enough sockets for S3 calls (the default limit in the AWS SDK is 50 sockets). This only happens when we have the image cached in S3 but the MediaWiki does not answer with a 304 (i.e. content has changed). The real cause is that the S3 GetObject request returns a response stream which keeps its socket busy until it is consumed or destroyed. I have hence modified the code to destroy that stream when its content is not needed.

Another thing I have noticed, and will change, is to remove the limit on the number of sockets usable by the AWS SDK. 50 sockets is a limiting factor; when the limit is removed, the number of sockets in use is free to grow (up to 130 in my tests), which in turn avoids slowdowns due to the asynchronous nature of the scraper. Since we are not on a web server where we want to keep sockets available for clients, it makes much more sense from my PoV to let this grow as needed.

I also noticed that there is no timeout at all on S3 calls, or at least only the default value from the SDK (which I did not find in the documentation), instead of the same timeout we use for other HTTP calls (or should use, because not all axios calls are homogeneous; this will have to be fixed as well).

I also investigated whether 80 parallel calls make sense. When downloading from the S3 cache and the image there is up to date, 80 parallel calls are 10 times faster than 8. When the S3 cached image is outdated or missing from the cache, 80 parallel calls are still 3 to 4 times faster than 8 (we are somewhat limited by the CPU, because re-encoding images is CPU intensive). Since we have many selections of most wikis, in many cases the image will be cached and up to date in S3, and we have not really had any complaints or problems with this, so I suggest not changing the current level of parallelism, which seems to make sense.

All these investigations were only possible thanks to the Clinic doctor tool, which I strongly recommend for investigating mwoffliner behaviour (or any other Node.js program). Below is a screenshot of a run of 100 articles with 3508 images already cached in S3 but outdated. I will open a PR with the required changes (and a bit more ^^).
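For reference, a minimal sketch of the kind of S3 client configuration these conclusions point at, assuming the v3 `@aws-sdk/client-s3` / `@smithy/node-http-handler` stack; the endpoint, region, socket count, and timeout values are placeholders rather than mwoffliner's actual settings (with the v2 `aws-sdk`, the equivalent knobs live under `httpOptions`: `agent`, `timeout`, `connectTimeout`).

```ts
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3'
import { NodeHttpHandler } from '@smithy/node-http-handler'
import { Agent } from 'node:https'
import type { Readable } from 'node:stream'

// Explicit socket pool and timeouts instead of relying on the SDK defaults
// (50 sockets, and no timeout aligned with the scraper's other HTTP calls).
const s3 = new S3Client({
  endpoint: 'https://s3.example.org', // placeholder endpoint
  region: 'us-east-1',                // placeholder region
  requestHandler: new NodeHttpHandler({
    httpsAgent: new Agent({ keepAlive: true, maxSockets: Infinity }),
    connectionTimeout: 10_000, // ms to establish the connection (assumed value)
    requestTimeout: 30_000,    // ms of socket inactivity before aborting (assumed value)
  }),
})

// If a cached object turns out to be outdated, its body must still be
// consumed or destroyed, otherwise the response keeps its socket busy.
async function discardCachedObject(bucket: string, key: string): Promise<void> {
  const resp = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }))
  if (resp.Body) {
    (resp.Body as Readable).destroy()
  }
}
```

This is only a sketch of the direction described above (unbounded sockets, explicit timeouts, destroying unused response streams), not the actual patch.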
This is kind of a recurrence of #2118, but for download this time.
We see this kind of log many times in https://farm.openzim.org/pipeline/fdf407f7-3b58-4160-9e58-8c20734cdbfd/debug