
Improving memory usage for PDF files loaded by chunks #5332

Closed · wants to merge 3 commits

Conversation

@bh213 (Author) commented Sep 23, 2014

As described in #5329, this significantly reduces memory usage for large files loaded by chunking.



@@ -15,13 +15,22 @@
* limitations under the License.
*/
/* globals assert, MissingDataException, isInt, NetworkManager, Promise,
- isEmptyObj, createPromiseCapability */
+ isEmptyObj, createPromiseCapability */
Collaborator:

Please revert this white-space change.

@Snuffleupagus (Collaborator)

I've added a few initial comments on the style of the code. When you have addressed those, please squash the commits into one; see https://github.com/mozilla/pdf.js/wiki/Squashing-Commits.

@bh213 force-pushed the chunkedstream_with_chunked_memory branch 2 times, most recently from fa040d3 to e464033 on September 24, 2014 at 10:56
@Snuffleupagus (Collaborator) commented Sep 24, 2014

I think that it would be a good idea to benchmark the changes in this PR, to make sure that we don't regress performance improvements from PRs such as #4966. See https://github.com/mozilla/pdf.js/wiki/Benchmarking-your-changes.

Edit: According to the above, this PR reduces memory consumption for large files. Does that apply to smaller files, of more "normal" size, too? Or is there no measurable difference for smaller files?

@bh213 (Author) commented Sep 24, 2014

Yep, I agree about testing performance; I'll take a look. I think there is a lot of room for optimization.

Also note that at the moment there is no memory eviction, mostly because I can't decide what a good way to do that would be. I think the eviction behaviour should be configurable, maybe in API.js: something like evicting the least-recently-used parts when total memory usage exceeds X bytes. I'd also move the chunk size setting in there. Or would pdfDataRangeTransport be a better place?
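
For illustration, a minimal sketch of the kind of LRU eviction being discussed; the LRUChunkCache name and the maxBytes/chunkSize parameters are hypothetical, not part of this PR:

function LRUChunkCache(maxBytes, chunkSize) {
  this.maxBytes = maxBytes;
  this.chunkSize = chunkSize;
  // A Map iterates in insertion order, which gives us LRU ordering for free.
  this.chunks = new Map(); // chunk index -> Uint8Array
}

LRUChunkCache.prototype.get = function (index) {
  var chunk = this.chunks.get(index);
  if (chunk !== undefined) {
    // Re-insert the entry so it becomes the most recently used one.
    this.chunks.delete(index);
    this.chunks.set(index, chunk);
  }
  return chunk;
};

LRUChunkCache.prototype.set = function (index, data) {
  this.chunks.set(index, data);
  // Evict least-recently-used chunks once the byte budget is exceeded.
  while (this.chunks.size * this.chunkSize > this.maxBytes) {
    var lruIndex = this.chunks.keys().next().value;
    this.chunks.delete(lruIndex);
  }
};

An evicted chunk would simply be re-requested over the network if it is needed again, trading some extra traffic for a bounded memory footprint.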

@bh213 (Author) commented Sep 25, 2014

This PR significantly reduces memory usage for partially loaded files. Since there is no eviction, memory usage could even be somewhat larger once a file is 100% loaded and data was requested in non-contiguous chunks, but given how chunks are requested I don't think this is likely. As long as the file is not completely loaded, memory usage is reduced for pretty much every chunk that is not loaded, regardless of file size.

It should be pretty easy to add eviction when memory reaches some threshold, say 10, 50, or 100 MB, but that might cause reloading of some data if the wrong data is evicted. Still, I'd assume this is preferable to a browser/tab crash.

@bh213 force-pushed the chunkedstream_with_chunked_memory branch from e464033 to 3c1a89d on September 25, 2014 at 14:41
@bh213 (Author) commented Sep 25, 2014

Added performance optimizations. I see small performance gains over current master (settings from the benchmarking wiki):

browser | stat         | Count | Baseline (ms) | Current (ms) | +/- | %      | Result (P<.05)
------- | ------------ | ----- | ------------- | ------------ | --- | ------ | --------------
firefox | Overall      |   100 |           141 |          135 |  -6 |  -4.07 |
firefox | Page Request |   100 |             5 |            5 |  -1 | -12.05 |
firefox | Rendering    |   100 |           135 |          130 |  -5 |  -3.76 |

@timvandermeij (Contributor)

@bh213 I have added some review comments. The code already looks really good, so these are quite minor comments. It's also great that you have included tests; thanks for that!

@timvandermeij (Contributor)

Let's also /cc @fkaelberer, @CodingFabian and @nnethercote for an additional look at this.

@nnethercote (Contributor)

I'll let others review the code, but I did some measurements.

On one 19.9 MB file, it reduced peak RSS while loading (from disk) from ~570 MiB to ~550 MiB. A slight improvement, not bad.

But on a 203 MB file (the largest I have on my machine) it reduced peak RSS while loading (from disk) from ~700 MiB to ~510 MiB. Nice! My ad hoc JS memory profiler confirms that it replaced a 203 MB allocation in chunked_stream.js with a 10.5 MB allocation.

I can imagine it makes a huge difference on a 1.9 GB file.

@bh213 force-pushed the chunkedstream_with_chunked_memory branch from 3c1a89d to 81d1fd9 on September 26, 2014 at 08:53
@yurydelendik (Contributor)

We just landed stream loading support, which conflicts with this PR (see #5263). Could you rebase this patch on top of the new master?

@bh213 force-pushed the chunkedstream_with_chunked_memory branch from b820a26 to 768b9ee on May 13, 2016 at 12:07
@bh213 (Author) commented May 19, 2016

Review status: 3 of 8 files reviewed at latest revision, 15 unresolved discussions.


src/core/chunked_stream.js, line 461 [r1] (raw file):

Previously, yurydelendik (Yury Delendik) wrote…

Logic looks somewhat similar to onReceiveProgressiveData; can we merge them into one function? Before, onReceiveData could not handle incomplete and "uneven" chunks; now it looks like it can.

onReceiveData depends on chunks being aligned, since the loaded state is stored in this.loadedChunks. Handling unaligned chunks would be a lot more work.
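
A hypothetical sketch of why alignment matters (not the PR's actual code): with aligned chunks, the loaded state reduces to simple per-chunk-index bookkeeping.

var CHUNK_SIZE = 65536; // illustrative value; the real chunk size is configurable

function markLoaded(loadedChunks, begin, data) {
  // Alignment lets a byte offset map directly to a chunk index.
  if (begin % CHUNK_SIZE !== 0) {
    throw new Error('Received data is not aligned to a chunk boundary.');
  }
  var firstChunk = begin / CHUNK_SIZE;
  var lastChunk = Math.floor((begin + data.length - 1) / CHUNK_SIZE);
  for (var chunk = firstChunk; chunk <= lastChunk; chunk++) {
    loadedChunks[chunk] = true;
  }
}

With unaligned data, a chunk could be only partially covered, so a single boolean per chunk would no longer be enough to describe the loaded state.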


@yurydelendik (Contributor)

src/core/chunked_stream.js, line 461 [r1] (raw file):

Previously, bh213 (Gorazd Breskvar) wrote…

onReceiveData depends on chunks being aligned, since the loaded state is stored in this.loadedChunks. Handling unaligned chunks would be a lot more work.

By reading the code, that's what you are trying to do; if not, then I don't understand the intent. The current implementation is able to handle incomplete chunks from progressive data and range-request chunks (which are always aligned), and both processes can happen simultaneously. Can the new implementation only handle either progressive data or chunked data at one time?


@bh213 (Author) commented May 20, 2016

Review status: 3 of 8 files reviewed at latest revision, 15 unresolved discussions, some commit checks failed.


src/core/chunked_stream.js, line 461 [r1] (raw file):

Previously, yurydelendik (Yury Delendik) wrote…

By reading the code, that's what you are trying to do; if not, then I don't understand the intent. The current implementation is able to handle incomplete chunks from progressive data and range-request chunks (which are always aligned), and both processes can happen simultaneously. Can the new implementation only handle either progressive data or chunked data at one time?

I rechecked the code. What happens in onReceiveData (continuous mode) is that there is special handling for the initial chunk (this was probably in the original code too), so less than a full chunk can be accepted there; after the initial chunk it expects aligned, full-size chunk data only.

Both progressive and chunked data can be active at the same time; I added tests to make sure.
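
A hypothetical sketch of that special case, reusing CHUNK_SIZE from the earlier sketch (names assumed, not the PR's actual code):

function onReceiveDataContinuous(state, begin, data) {
  if (begin === 0 && data.length < CHUNK_SIZE) {
    // Special case: the initial chunk may be shorter than CHUNK_SIZE.
    state.initialData = data;
    return;
  }
  // After the initial chunk, only aligned, full-size chunk data is accepted.
  if (begin % CHUNK_SIZE !== 0 || data.length % CHUNK_SIZE !== 0) {
    throw new Error('Expected aligned, full-size chunk data.');
  }
  // ... copy the data into per-chunk storage and mark chunks as loaded ...
}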



@bh213 (Author) commented May 20, 2016

Review status: 3 of 8 files reviewed at latest revision, 15 unresolved discussions, some commit checks failed.


src/core/chunked_stream.js, line 39 [r1] (raw file):

Previously, Snuffleupagus (Jonas Jenwald) wrote…

Nit: incorrect comment style, please use // instead. And please remember to always add a period at the end of sentences (applies elsewhere in this patch as well). E.g. like this:

// If the length of stream is less than ALLOCATE_NO_CHUNKS_SIZE then use
// ChunkedStreamContinuous which works with single block of memory.

Done.

src/core/chunked_stream.js, line 59 [r1] (raw file):

Previously, yurydelendik (Yury Delendik) wrote…

all three above are not used in the base and continuous stream (move into fragmented only?)

Done.

src/core/chunked_stream.js, line 95 [r1] (raw file):

Previously, yurydelendik (Yury Delendik) wrote…

add "abstract" method for prepareBuffer.

Done.

src/core/chunked_stream.js, line 139 [r1] (raw file):

Previously, yurydelendik (Yury Delendik) wrote…

We are using this.buffer below; can you instantiate it in the constructor above with e.g. a null value (so we know it will be there)?

Done.

src/core/chunked_stream.js, line 264 [r1] (raw file):

Previously, yurydelendik (Yury Delendik) wrote…

Add "abstract" createGetByteFast method. btw, you can call it from "subStream." and update "getByte" in this method (probably name would be ensureGetByteFast)

Done.
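
To illustrate the suggested pattern, a hypothetical ES5 sketch of an "abstract" method with a specialized override (not the PR's actual code):

function ChunkedStreamBase() {
  this.buffer = null;
}

ChunkedStreamBase.prototype.createGetByteFast = function () {
  // "Abstract": every concrete stream must supply its own fast byte accessor.
  throw new Error('createGetByteFast must be implemented by a subclass.');
};

function ChunkedStreamContinuous() {
  ChunkedStreamBase.call(this);
}
ChunkedStreamContinuous.prototype = Object.create(ChunkedStreamBase.prototype);

ChunkedStreamContinuous.prototype.createGetByteFast = function () {
  var self = this;
  // The continuous variant reads straight out of its single buffer.
  return function (pos) {
    return self.buffer[pos];
  };
};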

src/core/chunked_stream.js, line 280 [r1] (raw file):

Previously, Snuffleupagus (Jonas Jenwald) wrote…

Similar to https://github.com/mozilla/pdf.js/pull/5332/files#r60896625.

Done.

src/core/chunked_stream.js, line 297 [r1] (raw file):

Previously, yurydelendik (Yury Delendik) wrote…

nit: ChunkedStreamContinuous_createGetByteFast

Done.

src/core/chunked_stream.js, line 303 [r1] (raw file):

Previously, yurydelendik (Yury Delendik) wrote…

this function body can be placed inside createGetByteFast/ensureGetByteFast

Done.

src/core/chunked_stream.js, line 327 [r1] (raw file):

Previously, yurydelendik (Yury Delendik) wrote…

Not sure what this calculation does; should it be just position?

Done.

src/core/chunked_stream.js, line 469 [r1] (raw file):

Previously, Snuffleupagus (Jonas Jenwald) wrote…

Similar to https://github.com/mozilla/pdf.js/pull/5332/files#r60896625

Done.


@bh213 force-pushed the chunkedstream_with_chunked_memory branch from 768b9ee to f78ef66 on May 20, 2016 at 09:18
@bh213 (Author) commented May 20, 2016

Rebased the code onto the latest master.

}
};

beforeEach(function() {
Collaborator:
Is it necessary to register this before every test? Couldn't you do it once instead? E.g.:

beforeAll(function (done) {
  jasmine.addCustomEqualityTester(arrayComparerThatFixesJasmineIssue786);
  done();
});

@bh213 (Author) commented May 23, 2016

@Snuffleupagus - updated the PR with the test review changes.

@bh213 (Author) commented May 27, 2016

Anything else I should take a look at?

@timvandermeij (Contributor)

/botio-linux preview

@pdfjsbot

From: Bot.io (Linux)


Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://107.21.233.14:8877/960afd1d46ec148/output.txt

@timvandermeij (Contributor)

/botio test

@pdfjsbot

From: Bot.io (Linux)


Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://107.21.233.14:8877/a37442f3e732033/output.txt

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://107.22.172.223:8877/44525715f681ae9/output.txt

@pdfjsbot

From: Bot.io (Windows)


Success

Full output at http://107.22.172.223:8877/44525715f681ae9/output.txt

Total script time: 20.78 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot

From: Bot.io (Linux)


Success

Full output at http://107.21.233.14:8877/a37442f3e732033/output.txt

Total script time: 26.56 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@bh213 (Author) commented Jun 14, 2016

Anything else needed for this PR?

@frankmehlhose

The server that delivers the PDF has to support byte serving (the Accept-Ranges HTTP header) for this PR to work, correct?
We have a use case with PDF/A documents for digital archive systems.
In extreme cases we are facing documents with up to 2000 pages, where each page contains scanned image data and is about 1 MB in size.
In such cases, an eviction policy for the chunk cache might be a good idea, maybe with an LRU eviction algorithm and a configurable cache size.
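
For reference, a minimal sketch of probing a server for byte-serving support; the url variable is assumed and this is not part of the PR:

var xhr = new XMLHttpRequest();
xhr.open('HEAD', url);
xhr.onload = function () {
  // Servers that support byte serving advertise it via "Accept-Ranges: bytes"
  // and will honor Range headers on subsequent requests.
  var supportsRanges = xhr.getResponseHeader('Accept-Ranges') === 'bytes';
  console.log('Byte serving supported:', supportsRanges);
};
xhr.send();

Note that the viewer also needs to learn the total file size (e.g. from a Content-Length header) before it can issue range requests.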

@bh213 (Author) commented Jun 22, 2017

I don't think I'll be rebasing this PR yet again...

@bh213 closed this on Jun 22, 2017
@bh213 deleted the chunkedstream_with_chunked_memory branch on January 11, 2018 at 11:25
@bh213 restored the chunkedstream_with_chunked_memory branch on January 11, 2018 at 11:31
@bh213 deleted the chunkedstream_with_chunked_memory branch on January 11, 2018 at 11:31
@bh213 restored the chunkedstream_with_chunked_memory branch on February 15, 2021 at 13:07