
RequestTimeTooSkewed error while downloading from cache #2137

Open
benoit74 opened this issue Jan 27, 2025 · 3 comments

@benoit74 (Contributor)

This is kind of a recurrence of #2118, but for downloads this time.

We see this kind of log many times in https://farm.openzim.org/pipeline/fdf407f7-3b58-4160-9e58-8c20734cdbfd/debug

[log] [2025-01-25T08:19:57.068Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/3/32/Signal%C3%A9tiqueS%C3%A9curit%C3%A9IncendieLuminescence.JPG/170px-Signal%C3%A9tiqueS%C3%A9curit%C3%A9IncendieLuminescence.JPG due to RequestTimeTooSkewed: The difference between the request time and the current time is too large.
@benoit74 benoit74 added the bug label Jan 27, 2025
@benoit74 benoit74 added this to the 1.14.1 milestone Jan 27, 2025
@benoit74 benoit74 self-assigned this Jan 27, 2025
@benoit74 (Contributor, Author)

When it comes to downloading files, the scraper iterates over the files queued in Redis with a given worker concurrency; when the CLI speed is 1, downloader.speed can be up to 8, meaning up to 8 workers run in parallel on the Redis-processing side.

    await fileStore.iterateItems(downloader.speed, async (fileDownloadPairs, workerId) => {
      logger.info(`Worker [${workerId}] processing batch of [${Object.keys(fileDownloadPairs).length}] files`)
      // todo align fileDownloadPairs and listOfArguments
      const listOfArguments = []
      for (const [path, { url, namespace, mult, width }] of Object.entries(fileDownloadPairs)) {
        listOfArguments.push({ path, url, namespace, mult, width })
      }
      const responses = await downloadBulk(listOfArguments, downloader)

    const cpuCount = Math.min(os.cpus().length, MAX_CPU_CORES)
    const speed = Math.max(1, Math.round(cpuCount * (_speed || 1)))

Each of these "Redis" workers gets a share of the files to download, and each in turn starts (via pmap) up to 10 download workers when needed.

    async function downloadBulk(listOfArguments: any[], downloader: Downloader): Promise<any> {
      try {
        // Enhance arguments array to have an index of the argument at hand
        const argsCopy = [].concat(listOfArguments.map((val, ind) => ({ val, ind })))
        const argList = []
        while (argsCopy.length > 0) {
          const arg = argsCopy.shift()
          argList.push(arg)
        }
        return pmap(
          argList,
          async (arg) => {
            const resp: any = {}
            resp.path = arg.val.path
            resp.url = arg.val.url
            resp.namespace = arg.val.namespace
            resp.mult = arg.val.mult
            resp.width = arg.val.width
            return downloader
              .downloadContent(arg.val.url, false)
              .then((r) => {
                resp.result = r
                resp.path += resp.result.responseHeaders.path_postfix || ''
                return resp
              })
              .catch((err) => {
                resp.status = err.response?.status
                return resp
              })
          },
          { concurrency: CONCURRENCY_LIMIT },
        )
      } catch (err) {
        logger.log(`Not able download in bulk due to ${err}`)
      }
    }

    export const CONCURRENCY_LIMIT = 10

All this means that, in general, we have up to 80 workers (8 Redis workers × 10 download workers each) downloading files at the same time.

Since Node.js sockets are limited, not all of these 80 workers can actually proceed at the same time: they all build their download request and then wait for a free socket. Under some conditions (network congestion, slow hosts, ...), the delay between the moment the request is built and the moment the download really happens becomes too large, which is presumably what triggers the RequestTimeTooSkewed error.
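
To illustrate the queueing effect (this is not the scraper's code; the URLs and numbers are made up): with a keep-alive Agent capped at `maxSockets`, every request beyond the cap waits inside the Agent, and for signed S3 requests the timestamp set when the request was built keeps aging during that wait, until the server rejects it as RequestTimeTooSkewed.

```ts
import https from 'node:https'
import axios from 'axios'

// Made-up URLs, purely for illustration (run as an ES module for top-level await)
const urls = Array.from({ length: 80 }, (_, i) => `https://example.org/image-${i}.jpg`)

// At most 50 concurrent connections per host; requests 51..80 wait in the Agent's queue
const httpsAgent = new https.Agent({ keepAlive: true, maxSockets: 50 })

// All 80 requests are created up front, but only 50 can be on the wire at any time;
// the other 30 sit in the Agent's queue until a socket is freed.
const results = await Promise.allSettled(urls.map((url) => axios.get(url, { httpsAgent, responseType: 'arraybuffer' })))
console.log(`${results.filter((r) => r.status === 'fulfilled').length} of ${urls.length} downloads completed`)
```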

While reimplementing #2093 (which was then reverted in #2120) could be a solution, are we really sure that 80 simultaneous downloads is OK? To me it looks like way too much, especially since we allow multiple tasks in parallel on a single Zimfarm worker (typically 4) and have multiple workers (at least the 4 mwoffliner machines). Combined (4 tasks × 4 machines × 80 downloads), mwoffliner tasks alone can easily perform on the order of 1,000 downloads in parallel, many of them targeting upload.wikimedia.org and the Wasabi S3 bucket by design. But maybe I'm missing something; I do not have sufficient background on this scraper.

@kelson42 (Collaborator)

> While reimplementing #2093 (which was then reverted in #2120) could be a solution, are we really sure that 80 simultaneous downloads is OK?

I fully agree. I would run some basic speed tests with half or even a quarter of this value.

@benoit74 (Contributor, Author)

You probably cannot imagine the journey this issue has taken me on. I will only share the conclusions in this comment, but the journey was more tortuous than it might look.

One root of the issue is, as mentioned in #2093, that the scraper does not have sufficient sockets for S3 calls (the default limit in the AWS SDK is 50 sockets). This only happens when we have the image cached in S3 but MediaWiki does not answer with a 304 (i.e. the content has changed). The real cause is that the S3 GetObject request returns a Readable (a kind of stream) body, so that we can consume it only when needed. But when MediaWiki does not return a 304, we never consume that body ... and the socket hence stays open. Since the scraper keeps going (we are not blocked right after 50 requests have been made), I suspect something detects this as abnormal and frees some sockets, but I don't understand what, and don't need to.

I've hence modified the code to destroy the Readable as soon as we know we do not need it anymore ... and guess what: processing images is at least one order of magnitude faster. I can't guarantee it will make the RequestTimeTooSkewed errors disappear, since I never managed to reproduce them on my machine, but I'm pretty sure it will.
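
For reference, a minimal sketch of that fix, assuming the AWS SDK for JavaScript v3 (`@aws-sdk/client-s3`); the bucket/key/flag names are hypothetical and the actual mwoffliner change may look different:

```ts
import { Readable } from 'node:stream'
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3'

const s3 = new S3Client({}) // endpoint/credentials omitted for brevity

// Hypothetical helper: `upstreamChanged` stands for "MediaWiki did NOT answer with a 304"
async function readCachedImage(bucket: string, key: string, upstreamChanged: boolean): Promise<Buffer | null> {
  const resp = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }))
  const body = resp.Body as Readable

  if (upstreamChanged) {
    // The cached copy is stale and will never be read: destroy the stream so the
    // underlying socket is released instead of staying open.
    body.destroy()
    return null
  }

  // Otherwise consume the stream normally, which also frees the socket when done.
  const chunks: Buffer[] = []
  for await (const chunk of body) {
    chunks.push(Buffer.from(chunk))
  }
  return Buffer.concat(chunks)
}
```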

Another thing I've noticed, and will change, is the limit on the number of sockets usable by the AWS SDK: I will remove it. 50 sockets is a limiting factor; when the limit is removed, the number of sockets in use is free to grow (up to 130 in my tests), which in turn avoids slowdowns due to the asynchronous nature of the scraper. Since we are not running a web server where we want to keep sockets available for clients, it makes much more sense from my point of view to let this grow as needed.

I also noticed that there is no timeout at all on S3 calls, or rather only the SDK's default value (which I could not find in the documentation), instead of the same timeout we use for other HTTP calls (or should use: not all axios calls are configured consistently, which will have to be fixed as well).
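
A minimal sketch of these two configuration changes (lifting the socket cap and setting explicit timeouts), again assuming AWS SDK v3 with `@smithy/node-http-handler` (older releases call `requestTimeout` `socketTimeout`); the timeout values are placeholders, not the ones that will land in the PR:

```ts
import https from 'node:https'
import { S3Client } from '@aws-sdk/client-s3'
import { NodeHttpHandler } from '@smithy/node-http-handler'

const s3 = new S3Client({
  requestHandler: new NodeHttpHandler({
    // Lift the handler's default cap of 50 sockets so parallel downloads are not
    // serialized behind a small pool (Node's own Agent default is Infinity).
    httpsAgent: new https.Agent({ keepAlive: true, maxSockets: Infinity }),
    // Explicit timeouts instead of the SDK defaults; placeholder values that should
    // be aligned with the timeout used for the other HTTP (axios) calls.
    connectionTimeout: 10_000,
    requestTimeout: 60_000,
  }),
})
```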

I also investigated whether 80 parallel calls make sense. When downloading from the S3 cache with an up-to-date image there, 80 parallel calls are 10 times faster than 8, so there is a clear benefit. When the S3-cached image is outdated or missing from the cache, 80 parallel calls are still 3 to 4 times faster than 8 (we are somewhat CPU-bound there because re-encoding images is CPU intensive). Since we build many selections of most wikis, in many cases the image will be cached and up to date in S3, and we have not really had any complaints or problems with this, so I suggest not changing the current level of parallelism, which seems to make sense.

All these investigations were only possible thanks to the Clinic.js Doctor tool, which I strongly recommend for investigating mwoffliner behavior (or any other Node.js program). Below is a screenshot of a run of 100 articles with 3508 images already cached in S3 but outdated.

[Screenshot: Clinic.js Doctor report for the run described above]

I will open a PR with the required changes (and a bit more ^^)
