ObjStm compression and PDF linearization doesn't work together #3603
Comments
Thank you for submitting this. This happens inside the base library MuPDF. I am going to transfer the issue to the team for investigation.
MuPDF issue reference: https://bugs.ghostscript.com/show_bug.cgi?id=707835
@JorjMcKie Thanks!
The MuPDF team has determined that object streams and linearization cannot be used together.
Re-opening until the corresponding changes have been published.
The label "fix developed" refers to changes that prevent combined specification of options "linear" and "use_objstms" in |
@JorjMcKie Thanks for the follow-up! Just out of curiosity, can you elaborate on why these two options cannot be used together? For example, is it due to implementation limitations, is it impossible by design or by the spec, or is it something else? I'd like to know whether this is something we can hope to see implemented in the future.
The first thing to realize is that this PDF concept has always been problematic. The PDF specification does not seem to make anyone happy because of its overall (undue) complexity, and its actual benefits are doubted by many. We do see a trend away from linear file formats towards files with optimum compression that can be downloaded fast as a whole across today's high-speed networks. Second, the goal of fast web access contradicts the desire for the smallest possible PDF file. A linear PDF by its very nature duplicates information, thus adding to the file size. These duplicated structures should not be compressed anyway, in order to foster easy access to objects needed early. When linearity is combined with standard compression plus maximum garbage collection, the resulting file size is already good enough for PDFs intended for page-wise display across internet connections. So the MuPDF team came to the described conclusion, which will not ever be reverted as far as we can see.
Your explanation is super informative and helpful, thank you so much!
Soooo true, couldn't agree more. They are a nightmare to deal with.
This is very interesting. At work, we have run into some extreme edge cases where gigantic monster PDF files can be tens of thousands of pages long and over a gigabyte in size. More compression means saving a buck on object storage, and linearization means the customer can view the file faster (we have some enterprise customers who only have a 20 Mbps connection to their desk, so the file could take over 10 minutes to load, which sucks). That's why I tried to use these two options together.
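To put rough numbers on the load time mentioned above, here is a back-of-the-envelope sketch. It assumes an ideal link with no protocol overhead and takes 1 GB to mean 10^9 bytes; the function name is made up for illustration.

```python
def transfer_seconds(size_bytes: int, link_mbps: float) -> float:
    """Ideal time to move size_bytes over a link of link_mbps megabits per second."""
    bits = size_bytes * 8
    return bits / (link_mbps * 1_000_000)


# Assumed file sizes, chosen only to bracket "over a gigabyte".
for size in (1_000_000_000, 1_500_000_000):
    minutes = transfer_seconds(size, 20) / 60
    print(f"{size / 1e9:.1f} GB at 20 Mbps: {minutes:.1f} min")
```

Under these assumptions a 1.5 GB file needs about 10 minutes at 20 Mbps, which is in line with the wait described in the comment. A linearized file would not download any faster as a whole; the benefit is that the first pages can be shown before the transfer completes.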
That explains a lot, so I guess object stream compression versus linearization is pretty much a space-time tradeoff where we have to pick one side.
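One way to make that choice explicit in code is to build the save options from a single switch, so the two settings cannot be combined by accident. This is a minimal sketch: the helper name and the "web"/"size" labels are hypothetical, while garbage, deflate, linear and use_objstms are the Document.save() options discussed in this thread. Whether a given PyMuPDF release still accepts linear is an assumption to check against the installed version.

```python
def save_options(priority: str) -> dict:
    """Return keyword arguments for Document.save() for one side of the tradeoff.

    priority="web"  -> linearized output for page-wise viewing
    priority="size" -> object streams for the smallest file
    """
    # Options shared by both variants: maximum garbage collection plus
    # standard stream compression.
    common = {"garbage": 4, "deflate": True}
    if priority == "web":
        return {**common, "linear": True}
    if priority == "size":
        return {**common, "use_objstms": 1}
    raise ValueError(f"unknown priority: {priority!r}")
```

With an opened document doc, a call would then look like doc.save("out.pdf", **save_options("size")).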
I understand. Thanks for your appreciation!
That's so cool to know! I did know how much space object streams can save, and I went back to check some of the impressive cases I tried. It turns out that the last example I experimented with was originally 19.9 MB, and object streams compressed it down to exactly 8.3 MB. What a coincidence!
Fixed in 1.24.10.
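For code that also has to run on releases older than 1.24.10, the same restriction can be checked on the caller's side before saving. The following is a minimal sketch with a hypothetical helper; it only inspects the keyword arguments and does not reproduce PyMuPDF's own check or error message.

```python
def check_save_kwargs(**kwargs) -> dict:
    """Reject the linear + use_objstms combination before calling Document.save()."""
    if kwargs.get("linear") and kwargs.get("use_objstms"):
        raise ValueError("'linear' and 'use_objstms' cannot be used together")
    # Hand the arguments back unchanged so they can be passed on to save().
    return kwargs
```

The returned dictionary can be passed straight through, for example doc.save("out.pdf", **check_save_kwargs(garbage=4, linear=True)), where doc is assumed to be an opened document.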
Description of the bug
Since v1.24.1 introduced the use_objstms option in Document.save(), setting use_objstms=1 and linear=True together doesn't work on some documents and results in a broken PDF file. On versions >= 1.24.3, some documents even cause the program to crash.
How to reproduce the bug
Here's a minimal reproducible program:
We ran into the problem when processing some internal documents, but managed to reproduce the issue on two random papers downloaded from arXiv. Here are the files:
1706.03762v7.pdf
2401.08541v1.pdf
When running the program, it spits out error logs like the ones below during pixmap generation, possibly because the file is broken.
The resulting PDF file is either blank or contains only some lines with no text when opened in Ubuntu's Evince document viewer. Opening it in Chrome does show text, but the font is altered and the figures are gone.
Also, it seems that turning on garbage collection affects the crash pattern: when using ez_save, the first file crashes the program; when using save with no gc, the second file crashes the program. They all crash with a log like this:
PyMuPDF version
1.24.5
Operating system
Linux
Python version
3.11