Revisit handling of images processing and other fixes #2143

benoit74 · 2025-01-28T10:26:25Z

This is a fairly significant PR that fixes several issues around image processing.

Fix #2140
Fix #2136
Fix #2088
Fix #2138

Changes

- when processing HTML, make a distinction between images, videos and other media (PDF, ...)
  - store this information in redis for later retrieval
  - do not guess whether a given media is an image based on its URL anymore
  - do not guess the content-type from the image URL or from the response header; compute it with file-types based on the real image data
  - remove the corresponding functions (isImageUrl, getMimeType) and constant (IMAGE_URL_REGEX)
  - remove the now-useless mime-type package
  - for now, videos do not get any special treatment (e.g. reencoding), but everything is ready for that
  - other media are always simply downloaded
- do not add a .webp suffix to the path of images which have been converted to webp
  - as mentioned in "Logic to set .webp path prefix on reencoded images is skewed" (#2140), and as observed in "Do not rely on URL filename extension to detect images" (#2088), we cannot have this information at HTML processing time
  - not having a proper extension in the ZIM path has no consequence
  - this also allows converting images referenced in CSS stylesheets to webp without having to worry about this
- stop pushing content-type to S3 metadata
  - we do not need this information anymore
  - there is too much risk that this information is wrong due to a bug
  - we can let things already in S3 with this metadata live as they are; there are essentially no consequences
- define a clear API for the information returned by downloader.downloadContent when downloading content, instead of passing the whole response upstream (which could contain "anything")
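To illustrate what "compute it based on real image data" means, here is a minimal sketch that identifies a few image formats from their leading bytes instead of the URL or the response header. It is not the code of this PR, which delegates detection to a library; `sniffImageMime` and `hasSignature` are hypothetical names, and only four formats are covered.

```typescript
// Hypothetical helper, for illustration only: the PR relies on a library for this.
type SniffedMime = 'image/png' | 'image/jpeg' | 'image/gif' | 'image/webp' | null

// True when `data` contains `signature` at the given byte offset.
function hasSignature(data: Uint8Array, signature: number[], offset = 0): boolean {
  if (data.length < offset + signature.length) return false
  return signature.every((byte, i) => data[offset + i] === byte)
}

// Identify the format from the file content; the URL is never consulted.
function sniffImageMime(data: Uint8Array): SniffedMime {
  // PNG: 89 50 4E 47 0D 0A 1A 0A
  if (hasSignature(data, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return 'image/png'
  // JPEG: FF D8 FF
  if (hasSignature(data, [0xff, 0xd8, 0xff])) return 'image/jpeg'
  // GIF: ASCII "GIF8"
  if (hasSignature(data, [0x47, 0x49, 0x46, 0x38])) return 'image/gif'
  // WebP: ASCII "RIFF", a 4-byte size, then ASCII "WEBP"
  if (hasSignature(data, [0x52, 0x49, 0x46, 0x46]) && hasSignature(data, [0x57, 0x45, 0x42, 0x50], 8)) return 'image/webp'
  return null
}
```

With this kind of detection, a file served from a URL ending in `.png` but containing WebP data is still reported as `image/webp`.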

kelson42 · 2025-01-28T10:40:48Z

@benoit74 Just to be clear, glad to see you working on the issue, but I don't think putting webp content in a path ending with .png (just an example) is a good idea at all. It is simply semantically wrong and we should not do that IMHO. The current approach works (modulo bugs, like always), and if we really want to do better we should keep track of the content mime-type (instead of relying on the extension).

benoit74 · 2025-01-28T10:51:11Z

> Just to be clear, glad to see you working on the issue, but I don't think putting webp content in a path ending with .png (just an example) is a good idea at all. It is simply semantically wrong and we should not do that IMHO.

I agree, but this would mean a significant redesign of the scraper: with the current architecture, as stated in the issue, we cannot know at HTML rewriting time what the result of the image download/conversion will be. To know it, we would need to download the image and attempt the reencoding, which is currently done at a totally different stage.

For now I prefer to have a scraper producing working ZIMs under all conditions, with some semantic incoherence invisible to 99% of our users, rather than non-working ZIMs like in #2088. I do not mind opening an issue to fix this semantic incoherence in the medium / long term. For the record, this semantic incoherence has been present "forever" in the S3 keys used to cache images, and we have lived pretty well with it.
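The two-stage constraint described above can be sketched as follows. This is a simplified illustration, not the scraper's code: `DownloadKind`, `registerMedia`, `planProcessing` and the in-memory `Map` standing in for redis are all hypothetical, and the real decision to convert also depends on the detected image type.

```typescript
// Hypothetical types and names; the Map stands in for the redis store.
type DownloadKind = 'image' | 'video' | 'media'

interface QueuedFile {
  url: string
  kind: DownloadKind
}

const filesToDownload = new Map<string, QueuedFile>()

// Stage 1 (HTML rewriting): only the tag is known, so record the kind
// and keep the ZIM path untouched (no ".webp" suffix is appended).
function registerMedia(tagName: string, url: string, zimPath: string): string {
  const kind: DownloadKind = tagName === 'img' ? 'image' : tagName === 'video' ? 'video' : 'media'
  filesToDownload.set(zimPath, { url, kind })
  return zimPath
}

// Stage 2 (download, much later): the stored kind decides the treatment.
function planProcessing(zimPath: string, webpEnabled: boolean): 'convert-to-webp' | 'store-as-is' {
  const entry = filesToDownload.get(zimPath)
  if (!entry) throw new Error(`No queued file for ${zimPath}`)
  return entry.kind === 'image' && webpEnabled ? 'convert-to-webp' : 'store-as-is'
}
```

Because the path written into the HTML in stage 1 never changes, stage 2 is free to convert or not without any link in the ZIM becoming stale.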

benoit74 · 2025-01-28T10:59:08Z

Sample ZIMs:

wikipedia_hi_basketball_maxi_2025-01.zim : second run on dev S3 bucket (i.e. all images are coming from the S3 cache)
psychonautwiki_en_all_maxi_2025-01.zim : first run on dev S3 bucket (i.e. all images are coming from online)

kelson42 · 2025-01-28T12:23:04Z

> For now I prefer to have a scraper producing working ZIMs under all conditions, with some semantic incoherence invisible to 99% of our users, rather than non-working ZIMs like in #2088. I do not mind opening an issue to fix this semantic incoherence in the medium / long term. For the record, this semantic incoherence has been present "forever" in the S3 keys used to cache images, and we have lived pretty well with it.

There is not, and should not be, any incoherence in S3, because the entry is tagged "webp" AFAIK.

benoit74 · 2025-01-28T12:29:31Z

> There is not, and should not be, any incoherence in S3, because the entry is tagged "webp" AFAIK.

S3 key is computed from online URL directly without any logic handling webp conversion:

mwoffliner/src/Downloader.ts, line 591 in fc2af69:

    await this.s3.uploadBlob(stripHttpFromUrl(url), mwResp.data, etag, mwResp.headers['content-type'], this.webp ? 'webp' : '1')
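To make the point concrete, here is a small sketch of how such a call yields the same key whether or not the image is converted. It is only an illustration: this `stripHttpFromUrl` is a hypothetical re-implementation (the real helper may behave differently), and `buildCacheEntry` with its `key` / `version` fields is an invented name for the first and last arguments of the call above.

```typescript
// Hypothetical re-implementation, for illustration only.
function stripHttpFromUrl(url: string): string {
  return url.replace(/^https?:\/\//, '')
}

interface CacheEntry {
  key: string
  version: string
}

// Mirrors the first and last arguments of the uploadBlob call:
// the key depends only on the URL, the webp flag only on the last argument.
function buildCacheEntry(url: string, webp: boolean): CacheEntry {
  return { key: stripHttpFromUrl(url), version: webp ? 'webp' : '1' }
}

const original = buildCacheEntry('https://upload.example.org/images/Foo.png', false)
const converted = buildCacheEntry('https://upload.example.org/images/Foo.png', true)
```

Both entries share the key `upload.example.org/images/Foo.png`, so webp data ends up stored under a key ending in `.png`; only the separate tag records the conversion.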

codecov · 2025-01-30T11:04:16Z

Codecov Report

Attention: Patch coverage is 80.00000% with 27 lines in your changes missing coverage. Please review.

Project coverage is 75.63%. Comparing base (7c3857a) to head (0402980).
Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/Downloader.ts | 63.26% | 17 Missing and 1 partial ⚠️ |
| src/mwoffliner.lib.ts | 60.00% | 2 Missing ⚠️ |
| src/renderers/abstract.renderer.ts | 94.73% | 2 Missing ⚠️ |
| src/util/misc.ts | 0.00% | 2 Missing ⚠️ |
| src/MediaWiki.ts | 66.66% | 1 Missing ⚠️ |
| src/util/articleListMainPage.ts | 0.00% | 1 Missing ⚠️ |
| src/util/saveArticles.ts | 95.83% | 1 Missing ⚠️ |

❌ Your patch status has failed because the patch coverage (80.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2143      +/-   ##
==========================================
+ Coverage   75.29%   75.63%   +0.33%     
==========================================
  Files          41       41              
  Lines        3202     3213      +11     
  Branches      706      704       -2     
==========================================
+ Hits         2411     2430      +19     
+ Misses        674      666       -8     
  Partials      117      117


kelson42

@benoit74 Thank you for your PR. Even if I was not convinced at first about the approach (around 'webp'), the result is convincing. The approach is coherent and seems correct to me. I didn't do a proper code review for lack of time, but I had a look at the code and have reported a few points. I have also tested the code, which seems to work as expected.

src/types.d.ts

src/Downloader.ts

audiodude

I only briefly looked at the code, I don't have time for a full review atm. But this is amazing! This definitely solves the problem that I identified around "Can't detect image type from extension reliably at HTML creation time" and I'm thrilled to see it fixed!

benoit74 · 2025-02-10T08:38:49Z

Thank you all for your time, I fixed obvious things and moved the rest to issues.

benoit74 · 2025-02-10T09:02:58Z

Rebased on the latest commit of main.

Jaifroid · 2025-02-10T09:10:43Z

Just to let you know that I tested the basketball ZIM above with KJS (PWA), and I don't see any issues in Chrome, IE11 and Firefox. Images display as normal. I'm not quite sure how, but the browser seems capable of displaying a file with a jpeg ending that in fact contains webp data without any MIME type set in the <img> tag.

At least for old browsers, I include in the reader a conversion utility that converts webp to png based on the MIME type declared in the dirEntry, so I suppose that covers most cases where there might have been an incompatibility.

benoit74 · 2025-02-10T09:44:55Z

Thank you @Jaifroid for the test.

I'm pretty sure most browsers never use the "file extension", since the notions of "filename" and "file extension" do not really exist in the HTML / HTTP specs as far as I can tell; there is just URI/URL.

benoit74 self-assigned this Jan 28, 2025

benoit74 force-pushed the images_processing branch from 8ab34d2 to 8cf2027 Compare January 28, 2025 10:36

benoit74 force-pushed the images_processing branch from 8cf2027 to 35b686a Compare January 28, 2025 12:26

benoit74 force-pushed the images_processing branch 2 times, most recently from c413004 to 88bbeec Compare January 30, 2025 10:35

benoit74 marked this pull request as ready for review January 30, 2025 11:31

benoit74 requested a review from kelson42 January 30, 2025 11:32

benoit74 mentioned this pull request Jan 30, 2025

Pre-install all Node.JS dependencies to make image smaller/faster #2148

Merged

kelson42 approved these changes Feb 9, 2025

audiodude approved these changes Feb 9, 2025

benoit74 added 2 commits February 10, 2025 09:02

Revisit handling of images processing and other fixes
d7756ba

Download kind should be an enumerated type from TS perspective to avoid mistakes
0402980

benoit74 force-pushed the images_processing branch from 9699603 to 0402980 Compare February 10, 2025 09:02

benoit74 merged commit 0dcef0b into main Feb 10, 2025
4 of 6 checks passed

benoit74 deleted the images_processing branch February 10, 2025 09:45
