Logic to set `.webp` path prefix on reencoded images is skewed #2140

benoit74 · 2025-01-27T08:22:02Z

Currently, we have to places where the scraper handles webp reencoding:

in image processing, to reencode to webp when possible
in HTML processing, to place the proper src attribute on <img> tags

This has been meant to add a .webp extension for images which have been reencoded to webp.

Current logic is however wrong because:

in image processing, decision to reencode to webp is based on upstream image format (some images cannot be converted to webp) and on the fact that we do not want to reencode images used by CSS ; image format is the format detected from HTTP content-type response header, or file extension if content-type is missing

Lines 480 to 515 in fc2af69

    
           private async getCompressedBody(resp: any): Promise<any> { 
        
             if (isBitmapImageMimeType(resp.headers['content-type'])) { 
        
               if (isWebpCandidateImageMimeType(this.webp, resp.headers['content-type']) && !this.cssDependenceUrls.hasOwnProperty(resp.config.url)) { 
        
                 resp.data = await (imagemin as any) 
        
                   .buffer(resp.data, imageminOptions.get('webp').get(resp.headers['content-type'])) 
        
                   .catch(async (err) => { 
        
                     if (/Unsupported color conversion request/.test(err.stderr)) { 
        
                       return (imagemin as any) 
        
                         .buffer(await sharp(resp.data).toColorspace('srgb').toBuffer(), imageminOptions.get('webp').get(resp.headers['content-type'])) 
        
                         .catch(() => { 
        
                           return resp.data 
        
                         }) 
        
                         .then((data) => { 
        
                           resp.headers['content-type'] = 'image/webp' 
        
                           return data 
        
                         }) 
        
                     } else { 
        
                       return (imagemin as any).buffer(resp.data, imageminOptions.get('default').get(resp.headers['content-type'])).catch(() => { 
        
                         return resp.data 
        
                       }) 
        
                     } 
        
                   }) 
        
                   .then((data) => { 
        
                     resp.headers['content-type'] = 'image/webp' 
        
                     return data 
        
                   }) 
        
                 resp.headers.path_postfix = '.webp' 
        
               } else { 
        
                 resp.data = await (imagemin as any).buffer(resp.data, imageminOptions.get('default').get(resp.headers['content-type'])).catch(() => { 
        
                   return resp.data 
        
                 }) 
        
               } 
        
               return true 
        
             } 
        
             return false 
        
           }

in HTML processing, decision to add the .webp suffix is based only on upstream image format (we do not take into account images used by CSS, probably because we do not have the information at this stage) and image format is only detected from file extension (we do not yet have the HTTP content-type response header

mwoffliner/src/renderers/abstract.renderer.ts

Line 197 in fc2af69

    
           videoEl.setAttribute('poster', isWebpCandidateImageMimeType(webp, getMimeType(newVideoPosterUrl)) ? newVideoPosterUrl + '.webp' : newVideoPosterUrl)

mwoffliner/src/renderers/abstract.renderer.ts

Line 310 in fc2af69

    
           img.setAttribute('src', isWebpCandidateImageMimeType(webp, getMimeType(src)) ? newSrc + '.webp' : newSrc)

mwoffliner/src/util/articleListMainPage.ts

Line 11 in fc2af69

    
                         articleEntry.internalThumbnailUrl + (isWebpCandidateImageMimeType(webp, getMimeType(articleEntry.internalThumbnailUrl)) ? '.webp' : '')

There are only few (rare obviously) edge cases where the decisions between the two logic will be different, but it is not bullet-proof at all.

I don't know exactly why we took the decision to add a .webp extension to files reencoded to webp, but I would like to question the decision:

the key used to upload to S3 cache does not have the .webp extension added ; it is kinda confusing to not have a match between S3 cache key and ZIM path
it is probably impossible to make the two logic mentioned above bullet-proof because we do not have sufficient context in HTML processing to take the good decision
nothing in HTTP standard or ZIM specifications / conventions mandates that image URLs end with something that looks like a file with proper suffix ; many websites even use images at URLs which do not have a suffix at all ; we have the content-type response header to pass this information to browsers
the URL of the images are mostly never visible to the end-user, so not having a proper suffix is not a big deal

Note that warc2zim and mindtouch scraper also do webp reencoding of images and they both store images at a ZIM path / S3 key corresponding to original HTTP URL, only modified to match ZIM path / S3 key requirements, no matter which real file format is used.

The text was updated successfully, but these errors were encountered:

kelson42 · 2025-01-27T10:12:07Z

@benoit74 Thx for the issue. I should answer ASAP to it. To me related to #2088.

benoit74 · 2025-01-27T10:32:34Z

Indeed, this is the explanation of #2088, almost forget about this one (or #2088 is a practical illustration of this generic issue ^^)

benoit74 added bug pending clarification labels Jan 27, 2025

benoit74 added this to the 1.14.1 milestone Jan 27, 2025

benoit74 self-assigned this Jan 27, 2025

benoit74 mentioned this issue Jan 28, 2025

Revisit handling of images processing and other fixes #2143

Merged

benoit74 closed this as completed in #2143 Feb 10, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Logic to set `.webp` path prefix on reencoded images is skewed #2140

Logic to set `.webp` path prefix on reencoded images is skewed #2140

benoit74 commented Jan 27, 2025

kelson42 commented Jan 27, 2025

benoit74 commented Jan 27, 2025

Logic to set .webp path prefix on reencoded images is skewed #2140

Logic to set .webp path prefix on reencoded images is skewed #2140

Comments

benoit74 commented Jan 27, 2025

kelson42 commented Jan 27, 2025

benoit74 commented Jan 27, 2025

Logic to set `.webp` path prefix on reencoded images is skewed #2140

Logic to set `.webp` path prefix on reencoded images is skewed #2140