Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Logic to set .webp path prefix on reencoded images is skewed #2140

Closed
benoit74 opened this issue Jan 27, 2025 · 2 comments · Fixed by #2143
Closed

Logic to set .webp path prefix on reencoded images is skewed #2140

benoit74 opened this issue Jan 27, 2025 · 2 comments · Fixed by #2143

Comments

@benoit74
Copy link
Contributor

Currently, we have to places where the scraper handles webp reencoding:

  • in image processing, to reencode to webp when possible
  • in HTML processing, to place the proper src attribute on <img> tags

This has been meant to add a .webp extension for images which have been reencoded to webp.

Current logic is however wrong because:

  • in image processing, decision to reencode to webp is based on upstream image format (some images cannot be converted to webp) and on the fact that we do not want to reencode images used by CSS ; image format is the format detected from HTTP content-type response header, or file extension if content-type is missing

private async getCompressedBody(resp: any): Promise<any> {
if (isBitmapImageMimeType(resp.headers['content-type'])) {
if (isWebpCandidateImageMimeType(this.webp, resp.headers['content-type']) && !this.cssDependenceUrls.hasOwnProperty(resp.config.url)) {
resp.data = await (imagemin as any)
.buffer(resp.data, imageminOptions.get('webp').get(resp.headers['content-type']))
.catch(async (err) => {
if (/Unsupported color conversion request/.test(err.stderr)) {
return (imagemin as any)
.buffer(await sharp(resp.data).toColorspace('srgb').toBuffer(), imageminOptions.get('webp').get(resp.headers['content-type']))
.catch(() => {
return resp.data
})
.then((data) => {
resp.headers['content-type'] = 'image/webp'
return data
})
} else {
return (imagemin as any).buffer(resp.data, imageminOptions.get('default').get(resp.headers['content-type'])).catch(() => {
return resp.data
})
}
})
.then((data) => {
resp.headers['content-type'] = 'image/webp'
return data
})
resp.headers.path_postfix = '.webp'
} else {
resp.data = await (imagemin as any).buffer(resp.data, imageminOptions.get('default').get(resp.headers['content-type'])).catch(() => {
return resp.data
})
}
return true
}
return false
}

  • in HTML processing, decision to add the .webp suffix is based only on upstream image format (we do not take into account images used by CSS, probably because we do not have the information at this stage) and image format is only detected from file extension (we do not yet have the HTTP content-type response header

videoEl.setAttribute('poster', isWebpCandidateImageMimeType(webp, getMimeType(newVideoPosterUrl)) ? newVideoPosterUrl + '.webp' : newVideoPosterUrl)

img.setAttribute('src', isWebpCandidateImageMimeType(webp, getMimeType(src)) ? newSrc + '.webp' : newSrc)

articleEntry.internalThumbnailUrl + (isWebpCandidateImageMimeType(webp, getMimeType(articleEntry.internalThumbnailUrl)) ? '.webp' : '')

There are only few (rare obviously) edge cases where the decisions between the two logic will be different, but it is not bullet-proof at all.

I don't know exactly why we took the decision to add a .webp extension to files reencoded to webp, but I would like to question the decision:

  • the key used to upload to S3 cache does not have the .webp extension added ; it is kinda confusing to not have a match between S3 cache key and ZIM path
  • it is probably impossible to make the two logic mentioned above bullet-proof because we do not have sufficient context in HTML processing to take the good decision
  • nothing in HTTP standard or ZIM specifications / conventions mandates that image URLs end with something that looks like a file with proper suffix ; many websites even use images at URLs which do not have a suffix at all ; we have the content-type response header to pass this information to browsers
  • the URL of the images are mostly never visible to the end-user, so not having a proper suffix is not a big deal

Note that warc2zim and mindtouch scraper also do webp reencoding of images and they both store images at a ZIM path / S3 key corresponding to original HTTP URL, only modified to match ZIM path / S3 key requirements, no matter which real file format is used.

@benoit74 benoit74 added this to the 1.14.1 milestone Jan 27, 2025
@benoit74 benoit74 self-assigned this Jan 27, 2025
@kelson42
Copy link
Collaborator

@benoit74 Thx for the issue. I should answer ASAP to it. To me related to #2088.

@benoit74
Copy link
Contributor Author

Indeed, this is the explanation of #2088, almost forget about this one (or #2088 is a practical illustration of this generic issue ^^)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants