Currently, we have two places where the scraper handles webp reencoding:
- in image processing, to reencode to webp when possible
- in HTML processing, to place the proper `src` attribute on `<img>` tags
This was meant to add a `.webp` extension to images which have been reencoded to webp.
The current logic is however wrong because:
- in image processing, the decision to reencode to webp is based on the upstream image format (some images cannot be converted to webp) and on the fact that we do not want to reencode images used by CSS; the image format is the one detected from the HTTP content-type response header, or from the file extension if content-type is missing
- in HTML processing, the decision to add the `.webp` suffix is based only on the upstream image format (we do not take into account images used by CSS, probably because we do not have the information at this stage), and the image format is detected only from the file extension (we do not yet have the HTTP content-type response header)
There are only a few (obviously rare) edge cases where the two pieces of logic reach different decisions, but it is not bullet-proof at all.
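To make the mismatch concrete, here is a minimal sketch of the two decisions as described above. It is not the mwoffliner code: every name (`reencodesToWebp`, `addsWebpSuffix`, `mimeFromExtension`) is hypothetical, and the set of convertible formats (JPEG and PNG) is an assumption made only for illustration.

```typescript
// Hypothetical sketch, not the actual mwoffliner implementation.
// Assumption: only JPEG and PNG are considered convertible to webp.
const CONVERTIBLE_MIME_TYPES = new Set(['image/jpeg', 'image/png'])

const MIME_BY_EXTENSION: Record<string, string> = {
  jpg: 'image/jpeg',
  jpeg: 'image/jpeg',
  png: 'image/png',
  gif: 'image/gif',
  svg: 'image/svg+xml',
}

// Guess the MIME type from the last path segment of the URL, if it has a known extension.
function mimeFromExtension(url: string): string | undefined {
  const path = url.split(/[?#]/)[0]
  const lastSegment = path.substring(path.lastIndexOf('/') + 1)
  const dot = lastSegment.lastIndexOf('.')
  if (dot === -1) return undefined
  return MIME_BY_EXTENSION[lastSegment.substring(dot + 1).toLowerCase()]
}

// Image processing: content-type header first, extension as fallback; CSS images are never reencoded.
function reencodesToWebp(url: string, contentType: string | undefined, usedByCss: boolean): boolean {
  if (usedByCss) return false
  const mime = contentType ? contentType.split(';')[0].trim().toLowerCase() : mimeFromExtension(url)
  return mime !== undefined && CONVERTIBLE_MIME_TYPES.has(mime)
}

// HTML processing: only the URL is known, so only the extension can be used.
function addsWebpSuffix(url: string): boolean {
  const mime = mimeFromExtension(url)
  return mime !== undefined && CONVERTIBLE_MIME_TYPES.has(mime)
}

// Two made-up URLs where the decisions diverge.
const noExtensionUrl = 'https://example.org/w/thumb.php?f=photo'
const cssImageUrl = 'https://example.org/images/logo.png'

console.log(reencodesToWebp(noExtensionUrl, 'image/png', false), addsWebpSuffix(noExtensionUrl))
console.log(reencodesToWebp(cssImageUrl, 'image/png', true), addsWebpSuffix(cssImageUrl))
```

In the first case the image would be reencoded but the HTML would not get the suffix; in the second the HTML would get the suffix for an image that was left untouched. Either way the `src` attribute and the stored path no longer match.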
I don't know exactly why we decided to add a `.webp` extension to files reencoded to webp, but I would like to question that decision:
- the key used to upload to the S3 cache does not have the `.webp` extension added; it is rather confusing not to have a match between the S3 cache key and the ZIM path
- it is probably impossible to make the two pieces of logic mentioned above bullet-proof, because we do not have sufficient context in HTML processing to make the right decision
- nothing in the HTTP standard or in ZIM specifications / conventions mandates that image URLs end with something that looks like a file name with a proper suffix; many websites even serve images at URLs which have no suffix at all; we have the content-type response header to pass this information to browsers
- the URLs of the images are almost never visible to the end user, so not having a proper suffix is not a big deal
Note that the warc2zim and mindtouch scrapers also reencode images to webp, and both store images at a ZIM path / S3 key corresponding to the original HTTP URL, only modified to match ZIM path / S3 key requirements, no matter which real file format is used.
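As a rough illustration of that alternative, the sketch below derives the stored path from the original URL alone and records the real format only as a MIME type. The names `imagePathFromUrl`, `describeStoredImage` and `StoredImage` are hypothetical, and the path rule (host + path + query string, scheme dropped) is an assumption; it is not the actual normalization used by warc2zim, the mindtouch scraper or mwoffliner.

```typescript
// Hypothetical sketch: the path never depends on the final image format.
interface StoredImage {
  path: string // used both as ZIM path and as S3 key in this sketch
  mimeType: string // the format actually stored, reported to the reader
}

// Assumption: the path is the URL without its scheme and fragment.
function imagePathFromUrl(url: string): string {
  const parsed = new URL(url)
  return `${parsed.hostname}${parsed.pathname}${parsed.search}`
}

function describeStoredImage(url: string, reencodedToWebp: boolean, upstreamMimeType: string): StoredImage {
  return {
    path: imagePathFromUrl(url),
    mimeType: reencodedToWebp ? 'image/webp' : upstreamMimeType,
  }
}

const stored = describeStoredImage('https://example.org/images/photo.jpg', true, 'image/jpeg')
console.log(stored.path, stored.mimeType)
```

With this approach HTML processing only needs the URL to compute the `src` value, so it no longer has to guess whether image processing reencoded the file.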
Code referenced in this issue (mwoffliner at commit fc2af69):
- image processing: mwoffliner/src/Downloader.ts, lines 480 to 515
- HTML processing: mwoffliner/src/renderers/abstract.renderer.ts, line 197
- HTML processing: mwoffliner/src/renderers/abstract.renderer.ts, line 310
- HTML processing: mwoffliner/src/util/articleListMainPage.ts, line 11