- Sponsor
-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revisit handling of images processing and other fixes #2143
Conversation
8ab34d2
to
8cf2027
Compare
@benoit74 Just to be clear, glad to see you working on the issue, but I don't think put webp content in path ending with .png (just an example) is a good idea at all. It is simply semantically wrong and we should not do that IMHO. Current approach works (modulo bugs - like always) and if we really want to do better we should keep track about the content mime-type (instead of relying on the extension). |
I agree, but this would mean a significant redesign of the scraper: with current architecture, as stated in the issue, we cannot know at HTML rewriting time what the result of image download/conversion will be ; for this we need to download the image and try the reencoding, which is currently done at a totally different stage. For now I prefer to have a scraper producing working ZIMs under all conditions with some semantic incoherence invisible to 99% of our users, rather than having non-working ZIMs like #2088. I do not mind to open an issue to fix this semantic incoherence on the medium / long term. For the record, this semantic incoherence is already present since "forever" in S3 keys used to cache image and we lived pretty well with it. |
Sample ZIMs:
|
It's not and should be any incoherence in S3 because the entry is tagged "webp" AFAIK. |
8cf2027
to
35b686a
Compare
S3 key is computed from online URL directly without any logic handling webp conversion: Line 591 in fc2af69
|
c413004
to
88bbeec
Compare
Codecov ReportAttention: Patch coverage is
❌ Your patch status has failed because the patch coverage (80.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #2143 +/- ##
==========================================
+ Coverage 75.29% 75.63% +0.33%
==========================================
Files 41 41
Lines 3202 3213 +11
Branches 706 704 -2
==========================================
+ Hits 2411 2430 +19
+ Misses 674 666 -8
Partials 117 117 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benoit74 Thank you for your PR. Even if I was not convinced at start about the approach (around 'wepb'), the result is convincing. The approach is coherent and seems correct to me. It didn't make a proper code review because of lack of time, but I had a look to the code and have reported the few points. I have also tested the code which seems to work as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only briefly looked at the code, I don't have time for a full review atm. But this is amazing! This definitely solves the problem that I identified around "Can't detect image type from extension reliably at HTML creation time" and I'm thrilled to see it fixed!
Thank you all for your time, I fixed obvious things and moved the rest to issues. |
…id mistakes
9699603
to
0402980
Compare
Rebased on new |
Just to let you know that I tested the basketball ZIM above with KJS (PWA), and I don't see any issues in Chrome, IE11 and Firefox. Images display as normal. I'm not quite sure how, but the browser seems capable of displaying a file with a jpeg ending that in fact contains webp data without any MIME type set in the At least for old browsers, I include in the reader a conversion utility that converts webp to png based on the MIME type declared in the dirEntry, so I suppose that covers most cases where there might have been an incompatibility. |
Thank you @Jaifroid for the test. I'm pretty sure most browsers never use the "file extension" since this notion of "filename" and "file extension" does not really exists in the HTML / HTTP specs as far as I can tell, there is just URI/URL. |
This is kinda a significant PR to fix many issues around images processing.
Fix #2140
Fix #2136
Fix #2088
Fix #2138
Changes
isImageUrl
,getMimeType
and constants:IMAGE_URL_REGEX
mime-type
package.webp
suffix to the path of images which have been converted to webp.webp
path prefix on reencoded images is skewed #2140, and as observed in Do not rely on URL filename extension to detect images #2088, we cannot have this information at HTML processing timedownloader.downloadContent
when downloading content, instead of the whole response upstream (which could contain "anything")