Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revisit handling of images processing and other fixes #2143

Merged
merged 2 commits into from
Feb 10, 2025
Merged

Conversation

benoit74
Copy link
Contributor

@benoit74 benoit74 commented Jan 28, 2025

This is kinda a significant PR to fix many issues around images processing.

Fix #2140
Fix #2136
Fix #2088
Fix #2138

Changes

  • when processing HTML, make a distinction between images, videos and other medias (PDF, ...)
    • store this information in redis for later retrieval
    • do not guess if a given media is an image based on its URL anymore
    • do not guess content-type from image URL or from response header, compute it with file-types based on real image data
    • remove corresponding functions isImageUrl, getMimeType and constants: IMAGE_URL_REGEX
    • remove now useless mime-type package
    • for now, videos do not have any special treatment (e.g. reencoding) but everything is ready for that
    • other medias are always simply downloaded
  • do not add a .webp suffix to the path of images which have been converted to webp
  • stop pushing content-type to S3 metadata
    • we do not need this information anymore
    • there are too many risks this information is wrong due to a bug
    • we can let things already in S3 with this metadata live as they are, there is mostly 0 consequences
  • define a clear API of information returned by downloader.downloadContent when downloading content, instead of the whole response upstream (which could contain "anything")

@benoit74 benoit74 self-assigned this Jan 28, 2025
@kelson42
Copy link
Collaborator

kelson42 commented Jan 28, 2025

@benoit74 Just to be clear, glad to see you working on the issue, but I don't think put webp content in path ending with .png (just an example) is a good idea at all. It is simply semantically wrong and we should not do that IMHO. Current approach works (modulo bugs - like always) and if we really want to do better we should keep track about the content mime-type (instead of relying on the extension).

@benoit74
Copy link
Contributor Author

Just to be clear, glad to see you working on the issue, but I don't think put webp content in path ending with .png (just an example) is a good idea at all. It is simply semantically wrong and we should not do that IMHO.

I agree, but this would mean a significant redesign of the scraper: with current architecture, as stated in the issue, we cannot know at HTML rewriting time what the result of image download/conversion will be ; for this we need to download the image and try the reencoding, which is currently done at a totally different stage.

For now I prefer to have a scraper producing working ZIMs under all conditions with some semantic incoherence invisible to 99% of our users, rather than having non-working ZIMs like #2088. I do not mind to open an issue to fix this semantic incoherence on the medium / long term. For the record, this semantic incoherence is already present since "forever" in S3 keys used to cache image and we lived pretty well with it.

@benoit74
Copy link
Contributor Author

Sample ZIMs:

@kelson42
Copy link
Collaborator

For now I prefer to have a scraper producing working ZIMs under all conditions with some semantic incoherence invisible to 99% of our users, rather than having non-working ZIMs like #2088. I do not mind to open an issue to fix this semantic incoherence on the medium / long term. For the record, this semantic incoherence is already present since "forever" in S3 keys used to cache image and we lived pretty well with it.

It's not and should be any incoherence in S3 because the entry is tagged "webp" AFAIK.

@benoit74
Copy link
Contributor Author

It's not and should be any incoherence in S3 because the entry is tagged "webp" AFAIK.

S3 key is computed from online URL directly without any logic handling webp conversion:

await this.s3.uploadBlob(stripHttpFromUrl(url), mwResp.data, etag, mwResp.headers['content-type'], this.webp ? 'webp' : '1')

@benoit74 benoit74 force-pushed the images_processing branch 2 times, most recently from c413004 to 88bbeec Compare January 30, 2025 10:35
Copy link

codecov bot commented Jan 30, 2025

Codecov Report

Attention: Patch coverage is 80.00000% with 27 lines in your changes missing coverage. Please review.

Project coverage is 75.63%. Comparing base (7c3857a) to head (0402980).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
src/Downloader.ts 63.26% 17 Missing and 1 partial ⚠️
src/mwoffliner.lib.ts 60.00% 2 Missing ⚠️
src/renderers/abstract.renderer.ts 94.73% 2 Missing ⚠️
src/util/misc.ts 0.00% 2 Missing ⚠️
src/MediaWiki.ts 66.66% 1 Missing ⚠️
src/util/articleListMainPage.ts 0.00% 1 Missing ⚠️
src/util/saveArticles.ts 95.83% 1 Missing ⚠️

❌ Your patch status has failed because the patch coverage (80.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2143      +/-   ##
==========================================
+ Coverage   75.29%   75.63%   +0.33%     
==========================================
  Files          41       41              
  Lines        3202     3213      +11     
  Branches      706      704       -2     
==========================================
+ Hits         2411     2430      +19     
+ Misses        674      666       -8     
  Partials      117      117              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@benoit74 benoit74 marked this pull request as ready for review January 30, 2025 11:31
@benoit74 benoit74 requested a review from kelson42 January 30, 2025 11:32
Copy link
Collaborator

@kelson42 kelson42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benoit74 Thank you for your PR. Even if I was not convinced at start about the approach (around 'wepb'), the result is convincing. The approach is coherent and seems correct to me. It didn't make a proper code review because of lack of time, but I had a look to the code and have reported the few points. I have also tested the code which seems to work as expected.

Copy link
Member

@audiodude audiodude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only briefly looked at the code, I don't have time for a full review atm. But this is amazing! This definitely solves the problem that I identified around "Can't detect image type from extension reliably at HTML creation time" and I'm thrilled to see it fixed!

@benoit74
Copy link
Contributor Author

Thank you all for your time, I fixed obvious things and moved the rest to issues.

Verified

This commit was signed with the committer’s verified signature.
ronnnnn Seiya Kokushi
…id mistakes
@benoit74
Copy link
Contributor Author

Rebased on new main latest commit

@Jaifroid
Copy link
Collaborator

Just to let you know that I tested the basketball ZIM above with KJS (PWA), and I don't see any issues in Chrome, IE11 and Firefox. Images display as normal. I'm not quite sure how, but the browser seems capable of displaying a file with a jpeg ending that in fact contains webp data without any MIME type set in the <img> tag.

At least for old browsers, I include in the reader a conversion utility that converts webp to png based on the MIME type declared in the dirEntry, so I suppose that covers most cases where there might have been an incompatibility.

@benoit74
Copy link
Contributor Author

Thank you @Jaifroid for the test.

I'm pretty sure most browsers never use the "file extension" since this notion of "filename" and "file extension" does not really exists in the HTML / HTTP specs as far as I can tell, there is just URI/URL.

@benoit74 benoit74 merged commit 0dcef0b into main Feb 10, 2025
4 of 6 checks passed
@benoit74 benoit74 deleted the images_processing branch February 10, 2025 09:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants