
RequestTimeTooSkewed error while downloading from cache #2137

Open
benoit74 opened this issue Jan 27, 2025 · 3 comments

@benoit74 (Contributor)

This is kind of a recurrence of #2118, but for downloads this time.

We see this kind of log many times in https://farm.openzim.org/pipeline/fdf407f7-3b58-4160-9e58-8c20734cdbfd/debug

[log] [2025-01-25T08:19:57.068Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/3/32/Signal%C3%A9tiqueS%C3%A9curit%C3%A9IncendieLuminescence.JPG/170px-Signal%C3%A9tiqueS%C3%A9curit%C3%A9IncendieLuminescence.JPG due to RequestTimeTooSkewed: The difference between the request time and the current time is too large.
@benoit74 benoit74 added the bug label Jan 27, 2025
@benoit74 benoit74 added this to the 1.14.1 milestone Jan 27, 2025
@benoit74 benoit74 self-assigned this Jan 27, 2025
@benoit74 (Contributor, Author)

When it comes to downloading files, the scraper iterates over the files queued in Redis with a given worker concurrency; when the CLI speed is 1, downloader.speed can be up to 8, meaning up to 8 workers run in parallel on the Redis-processing side.

    await fileStore.iterateItems(downloader.speed, async (fileDownloadPairs, workerId) => {
      logger.info(`Worker [${workerId}] processing batch of [${Object.keys(fileDownloadPairs).length}] files`)
      // todo align fileDownloadPairs and listOfArguments
      const listOfArguments = []
      for (const [path, { url, namespace, mult, width }] of Object.entries(fileDownloadPairs)) {
        listOfArguments.push({ path, url, namespace, mult, width })
      }
      const responses = await downloadBulk(listOfArguments, downloader)

    const cpuCount = Math.min(os.cpus().length, MAX_CPU_CORES)
    const speed = Math.max(1, Math.round(cpuCount * (_speed || 1)))

Each of these "Redis" workers gets a share of the files to download, and each in turn starts (via pmap) up to 10 download workers when needed.

    async function downloadBulk(listOfArguments: any[], downloader: Downloader): Promise<any> {
      try {
        // Enhance arguments array to have an index of the argument at hand
        const argsCopy = [].concat(listOfArguments.map((val, ind) => ({ val, ind })))
        const argList = []
        while (argsCopy.length > 0) {
          const arg = argsCopy.shift()
          argList.push(arg)
        }
        return pmap(
          argList,
          async (arg) => {
            const resp: any = {}
            resp.path = arg.val.path
            resp.url = arg.val.url
            resp.namespace = arg.val.namespace
            resp.mult = arg.val.mult
            resp.width = arg.val.width
            return downloader
              .downloadContent(arg.val.url, false)
              .then((r) => {
                resp.result = r
                resp.path += resp.result.responseHeaders.path_postfix || ''
                return resp
              })
              .catch((err) => {
                resp.status = err.response?.status
                return resp
              })
          },
          { concurrency: CONCURRENCY_LIMIT },
        )
      } catch (err) {
        logger.log(`Not able download in bulk due to ${err}`)
      }
    }

    export const CONCURRENCY_LIMIT = 10

All this means that, in general, we have up to 80 workers (8 Redis workers × 10 download workers each) downloading files at the same time.

Since Node.js sockets are limited, not all of these 80 workers can actually proceed at the same time: they all build their download request and then wait for a free socket. Under some conditions (network congestion, slow hosts, ...), the delay between the moment the request is built and the moment the download really happens becomes too large, which is presumably what triggers the RequestTimeTooSkewed error.
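
To illustrate the queueing effect (this is not the scraper's code; the URLs and numbers are made up): with a keep-alive Agent capped at `maxSockets`, every request beyond the cap waits inside the Agent, and for signed S3 requests the timestamp set when the request was built keeps aging during that wait, until the server rejects it as RequestTimeTooSkewed.

```ts
import https from 'node:https'
import axios from 'axios'

// Made-up URLs, purely for illustration (run as an ES module for top-level await)
const urls = Array.from({ length: 80 }, (_, i) => `https://example.org/image-${i}.jpg`)

// At most 50 concurrent connections per host; requests 51..80 wait in the Agent's queue
const httpsAgent = new https.Agent({ keepAlive: true, maxSockets: 50 })

// All 80 requests are created up front, but only 50 can be on the wire at any time;
// the other 30 sit in the Agent's queue until a socket is freed.
const results = await Promise.allSettled(urls.map((url) => axios.get(url, { httpsAgent, responseType: 'arraybuffer' })))
console.log(`${results.filter((r) => r.status === 'fulfilled').length} of ${urls.length} downloads completed`)
```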

While reimplementing #2093 (which was then reverted in #2120) could be a solution, are we really sure that 80 simultaneous downloads is OK? To me it looks like way too much, especially since we allow multiple tasks in parallel on a single Zimfarm worker (typically 4) and have multiple workers (at least the 4 mwoffliner machines). Combined (4 tasks × 4 machines × 80 downloads), mwoffliner tasks alone can easily perform on the order of 1,000 downloads in parallel, many of them targeting upload.wikimedia.org and the Wasabi S3 bucket by design. But maybe I'm missing something; I do not have sufficient background on this scraper.

@kelson42 (Collaborator)

> While reimplementing #2093 (which was then reverted in #2120) could be a solution, are we really sure that 80 simultaneous downloads is OK?

I fully agree. I would run some basic speed tests with half or even a quarter of this value.

@benoit74 (Contributor, Author)

You probably cannot imagine the journey this issue has taken me on. I will only share the conclusions in this comment, but the journey was more tortuous than it might look.

One root of the issue is, as mentioned in #2093, that the scraper does not have sufficient sockets for S3 calls (the default limit in the AWS SDK is 50 sockets). This only happens when we have the image cached in S3 but MediaWiki does not answer with a 304 (i.e. the content has changed). The real cause is that the S3 GetObject request returns a Readable (a kind of stream) body, so that we can consume it only when needed. But when MediaWiki does not return a 304, we never consume that body ... and the socket hence stays open. Since the scraper keeps going (we are not blocked right after 50 requests have been made), I suspect something detects this as abnormal and frees some sockets, but I don't understand what, and don't need to.

I've hence modified the code to destroy the Readable as soon as we know we do not need it anymore ... and guess what: processing images is at least one order of magnitude faster. I can't guarantee it will make the RequestTimeTooSkewed errors disappear, since I never managed to reproduce them on my machine, but I'm pretty sure it will.
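
For reference, a minimal sketch of that fix, assuming the AWS SDK for JavaScript v3 (`@aws-sdk/client-s3`); the bucket/key/flag names are hypothetical and the actual mwoffliner change may look different:

```ts
import { Readable } from 'node:stream'
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3'

const s3 = new S3Client({}) // endpoint/credentials omitted for brevity

// Hypothetical helper: `upstreamChanged` stands for "MediaWiki did NOT answer with a 304"
async function readCachedImage(bucket: string, key: string, upstreamChanged: boolean): Promise<Buffer | null> {
  const resp = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }))
  const body = resp.Body as Readable

  if (upstreamChanged) {
    // The cached copy is stale and will never be read: destroy the stream so the
    // underlying socket is released instead of staying open.
    body.destroy()
    return null
  }

  // Otherwise consume the stream normally, which also frees the socket when done.
  const chunks: Buffer[] = []
  for await (const chunk of body) {
    chunks.push(Buffer.from(chunk))
  }
  return Buffer.concat(chunks)
}
```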

Another thing I've noticed, and will change, is the limit on the number of sockets usable by the AWS SDK: I will remove it. 50 sockets is a limiting factor; when the limit is removed, the number of sockets in use is free to grow (up to 130 in my tests), which in turn avoids slowdowns due to the asynchronous nature of the scraper. Since we are not running a web server where we want to keep sockets available for clients, it makes much more sense from my point of view to let this grow as needed.

I also noticed that there is no timeout at all on S3 calls, or rather only the SDK's default value (which I could not find in the documentation), instead of the same timeout we use for other HTTP calls (or should use: not all axios calls are configured consistently, which will have to be fixed as well).
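
A minimal sketch of these two configuration changes (lifting the socket cap and setting explicit timeouts), again assuming AWS SDK v3 with `@smithy/node-http-handler` (older releases call `requestTimeout` `socketTimeout`); the timeout values are placeholders, not the ones that will land in the PR:

```ts
import https from 'node:https'
import { S3Client } from '@aws-sdk/client-s3'
import { NodeHttpHandler } from '@smithy/node-http-handler'

const s3 = new S3Client({
  requestHandler: new NodeHttpHandler({
    // Lift the handler's default cap of 50 sockets so parallel downloads are not
    // serialized behind a small pool (Node's own Agent default is Infinity).
    httpsAgent: new https.Agent({ keepAlive: true, maxSockets: Infinity }),
    // Explicit timeouts instead of the SDK defaults; placeholder values that should
    // be aligned with the timeout used for the other HTTP (axios) calls.
    connectionTimeout: 10_000,
    requestTimeout: 60_000,
  }),
})
```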

I also investigated whether 80 parallel calls make sense. When downloading from the S3 cache with an up-to-date image there, 80 parallel calls are 10 times faster than 8, so there is a clear benefit. When the S3-cached image is outdated or missing from the cache, 80 parallel calls are still 3 to 4 times faster than 8 (we are somewhat CPU-bound there because re-encoding images is CPU intensive). Since we build many selections of most wikis, in many cases the image will be cached and up to date in S3, and we have not really had any complaints or problems with this, so I suggest not changing the current level of parallelism, which seems to make sense.

All these investigations were only possible thanks to the Clinic.js Doctor tool, which I strongly recommend for investigating mwoffliner behavior (or any other Node.js program). Below is a screenshot of a run of 100 articles with 3508 images already cached in S3 but outdated.

[Screenshot: Clinic.js Doctor report for the run described above]

I will open a PR with the required changes (and a bit more ^^)
