-
Notifications
You must be signed in to change notification settings - Fork 10.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve the *local* image caching in PartialEvaluator.getOperatorList
#11930
Conversation
Currently the local `imageCache`, as used in `PartialEvaluator.getOperatorList`, will miss certain cases of repeated images because the caching is *only* done by name (usually using a format such as e.g. "Im0", "Im1", ...). However, in some PDF documents the `/XObject` dictionaries many contain hundreds (or even thousands) of distinctly named images, despite them referring to only a handful of actual image objects (via the XRef table). With these changes we'll now cache *local* images using both name and (where applicable) reference, thus improving re-usage of images resources even further. This patch was tested using the PDF file from [bug 857031](https://bugzilla.mozilla.org/show_bug.cgi?id=857031), i.e. https://bug857031.bmoattachments.org/attachment.cgi?id=732270, with the following manifest file: ``` [ { "id": "bug857031", "file": "../web/pdfs/bug857031.pdf", "md5": "", "rounds": 250, "lastPage": 1, "type": "eq" } ] ``` which gave the following results when comparing this patch against the `master` branch: ``` -- Grouped By browser, page, stat -- browser | page | stat | Count | Baseline(ms) | Current(ms) | +/- | % | Result(P<.05) ------- | ---- | ------------ | ----- | ------------ | ----------- | --- | ----- | ------------- firefox | 0 | Overall | 250 | 2749 | 2656 | -93 | -3.38 | faster firefox | 0 | Page Request | 250 | 3 | 4 | 1 | 50.14 | slower firefox | 0 | Rendering | 250 | 2746 | 2652 | -94 | -3.44 | faster ``` While this is certainly an improvement, since we now avoid re-parsing ~1000 images on the first page, all of the image resources are small enough that the total rendering time doesn't improve that much in this particular case. In pathological cases, such as e.g. the PDF document in issue 4958, the improvements with this patch can be very significant. Looking for example at page 2, from issue 4958, the rendering time drops from ~60 seconds with `master` to ~30 seconds with this patch (obviously still slow, but it really showcases the potential of this patch nicely). Finally, note that there's also potential for additional improvements by re-using `LocalImageCache` instances for e.g. /XObject data of the `Form`-type. However, given that recent changes in this area I purposely didn't want to complicate *this* patch more than necessary.
From: Bot.io (Linux m4)ReceivedCommand cmd_test from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.67.70.0:8877/f65d37406d62960/output.txt |
From: Bot.io (Windows)ReceivedCommand cmd_test from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.215.176.217:8877/b7bb26111d73da0/output.txt |
From: Bot.io (Linux m4)FailedFull output at http://54.67.70.0:8877/f65d37406d62960/output.txt Total script time: 25.92 mins
Image differences available at: http://54.67.70.0:8877/f65d37406d62960/reftest-analyzer.html#web=eq.log |
From: Bot.io (Windows)FailedFull output at http://54.215.176.217:8877/b7bb26111d73da0/output.txt Total script time: 29.49 mins
Image differences available at: http://54.215.176.217:8877/b7bb26111d73da0/reftest-analyzer.html#web=eq.log |
…Content` method It turns out that `getTextContent` suffers from *similar* problems with repeated images as `getOperatorList`; please see the previous patch. While only `/XObject` resources of the `Form`-type will actually be *parsed* in `PartialEvaluator.getTextContent`, since those are the only ones that may contain text, we're still forced to fetch repeated image resources where the name differs (but not the reference). Obviously it's less bad in this case, since we're not actually parsing `/XObject`s of e.g. the `Image`-type. However, you still want to avoid even fetching the data whenever possible, since `Stream`s are not cached on the `XRef` instance (given their potential size) and the lookup can thus be somewhat expensive in general. To address these issues, we can simply replace the exiting name-only caching in `PartialEvaluator.getTextContent` with a new cache backed by `LocalImageCache` instead.
From: Bot.io (Linux m4)ReceivedCommand cmd_test from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.67.70.0:8877/ef397ed0260cc0c/output.txt |
From: Bot.io (Windows)ReceivedCommand cmd_test from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.215.176.217:8877/286af8c2734f0ba/output.txt |
From: Bot.io (Linux m4)FailedFull output at http://54.67.70.0:8877/ef397ed0260cc0c/output.txt Total script time: 25.98 mins
Image differences available at: http://54.67.70.0:8877/ef397ed0260cc0c/reftest-analyzer.html#web=eq.log |
From: Bot.io (Windows)FailedFull output at http://54.215.176.217:8877/286af8c2734f0ba/output.txt Total script time: 28.20 mins
Image differences available at: http://54.215.176.217:8877/286af8c2734f0ba/reftest-analyzer.html#web=eq.log |
/botio unittest |
From: Bot.io (Linux m4)ReceivedCommand cmd_unittest from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.67.70.0:8877/bdee6a3bdd60d72/output.txt |
From: Bot.io (Windows)ReceivedCommand cmd_unittest from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.215.176.217:8877/40880d315e9d570/output.txt |
From: Bot.io (Linux m4)SuccessFull output at http://54.67.70.0:8877/bdee6a3bdd60d72/output.txt Total script time: 3.77 mins
|
From: Bot.io (Windows)SuccessFull output at http://54.215.176.217:8877/40880d315e9d570/output.txt Total script time: 4.68 mins
|
/botio-linux preview |
From: Bot.io (Linux m4)ReceivedCommand cmd_preview from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.67.70.0:8877/a0220b67a6221b0/output.txt |
From: Bot.io (Linux m4)SuccessFull output at http://54.67.70.0:8877/a0220b67a6221b0/output.txt Total script time: 3.21 mins Published |
Looks good; thanks a lot! |
Thanks for landing this; hopefully the recent image-caching improvements will help reduce both rendering times and memory usage in lots of PDF files :-) Edit: Also, if you have the time, now might be a good point to create a new release since there's been a fair number of |
I have planned to do that on Monday since, if for example the NPM sync doesn't work, @brendandahl might be able to do that manually on Tuesday for example. |
Currently the local
imageCache
, as used inPartialEvaluator.getOperatorList
, will miss certain cases of repeated images because the caching is only done by name (usually using a format such as e.g. "Im0", "Im1", ...).However, in some PDF documents the
/XObject
dictionaries many contain hundreds (or even thousands) of distinctly named images, despite them referring to only a handful of actual image objects (via the XRef table).With these changes we'll now cache local images using both name and (where applicable) reference, thus improving re-usage of images resources even further.
This patch was tested using the PDF file from bug 857031, i.e. https://bug857031.bmoattachments.org/attachment.cgi?id=732270, with the following manifest file:
which gave the following results when comparing this patch against the
master
branch:While this is certainly an improvement, since we now avoid re-parsing ~1000 images on the first page, all of the image resources are small enough that the total rendering time doesn't improve that much in this particular case.
In pathological cases, such as e.g. the PDF document in issue #4958, the improvements with this patch can be very significant. Looking for example at page 2, from issue #4958, the rendering time drops from ~60 seconds with
master
to ~30 seconds with this patch (obviously still slow, but it really showcases the potential of this patch nicely).Finally, note that there's also potential for additional improvements by re-using
LocalImageCache
instances for e.g. /XObject data of theForm
-type. However, given that recent changes in this area I purposely didn't want to complicate this patch more than necessary.