Trim duplicate objects #969

Closed
leesei opened this issue Oct 15, 2019 · 10 comments
Labels: performance (Too slow renderings), sponsored (Issues sponsored to be resolved faster)
Milestone: 52

Comments

@leesei

leesei commented Oct 15, 2019

I'm using WeasyPrint to create lucky draw tickets, using a Jinja template with images (logos and such).
It seems WeasyPrint embeds the image for each instance rather than reusing the same object, resulting in a bloated PDF file.

I tried pts/pdfsizeopt (a PDF file size optimizer), and it cut the file size in half.

> pdfsizeopt --do-optimize-images=no --do-require-image-optimizers=no 1.pdf out.pdf
info: prepending to PATH: /usr/bin
info: This is pdfsizeopt ZIP rUNKNOWN size=69734.
info: prepending to PATH: /usr/bin
error: image optimizer not found on PATH: jbig2
error: image optimizer not found on PATH: pngout
error: image optimizer not found on PATH: sam2p
error: image optimizer not found on PATH: sam2p
info: loading PDF from: 1.pdf
info: loaded PDF of 6195019 bytes
info: separated to 827 objs + xref + trailer
info: parsed 827 objs
info: eliminated 208 unused objs, depth=9
info: found 0 Type1 fonts loaded
info: found 2 Type1C fonts loaded
info: optimized 206 streams, kept 203 #orig, 3 zip
info: eliminated 396 duplicate objs
info: compressed 0 streams, kept 0 of them uncompressed
info: saving PDF with 223 objs to: out.pdf
info: generated object stream of 2076 bytes in 114 objects (5%)
info: generated 3130331 bytes (51%)

Is it possible for WeasyPrint to have this optimization built-in?
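
For reference, the "eliminated 396 duplicate objs" step in the log can be pictured as content-hash deduplication. The following is only a minimal sketch of that idea, under simplifying assumptions: the objects are already parsed into a dict of object id to raw bytes, and the function name `deduplicate` is made up for illustration. It is not pdfsizeopt's actual implementation.

```python
import hashlib


def deduplicate(objects):
    """Collapse byte-identical objects onto the first one seen.

    objects: dict mapping object id -> raw object body (bytes).
    Returns (kept, remap): kept holds one body per distinct content,
    remap maps every original id to the id that survives.
    """
    first_seen = {}  # content digest -> id of the first object with that content
    kept = {}
    remap = {}
    for obj_id, body in objects.items():
        digest = hashlib.sha256(body).digest()
        canonical = first_seen.setdefault(digest, obj_id)
        remap[obj_id] = canonical
        if canonical == obj_id:
            kept[obj_id] = body
    return kept, remap
```

A real optimizer would also have to rewrite every reference to a dropped object so that it points at the surviving one, which is what `remap` is meant to support.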

@liZe added the performance label Nov 8, 2019
@DimaBogdan-phie

Is there any progress?
I think the problem is not very difficult to solve.
It looks like this kind of duplicate optimization is already applied within every single document after HTML().render().
But when I try to do this:

from weasyprint import HTML
d = HTML('html.html').render()
d2 = HTML('html.html').render()
documents = [d, d2]
all_pages = [p for doc in documents for p in doc.pages]
documents[0].copy(all_pages).write_pdf('combined.pdf')

I get a double-sized PDF (2.8 MB instead of 1.4 MB).

How can I apply this optimization across multiple documents?

@liZe closed this as completed in 3d4cb04 Jun 22, 2020
@liZe
Member

liZe commented Jun 22, 2020

A new image_cache option can be passed to render() to share the cache used for images.

> I think the problem is not very difficult to solve.

I think it was!

@liZe added this to the 52 milestone Jun 22, 2020
@leesei
Author

leesei commented Jun 29, 2020

Thanks.
So to use it I just need to maintain the cache dict in my app and pass it through various API calls?

@liZe
Member

liZe commented Jun 29, 2020

> So to use it I just need to maintain the cache dict in my app and pass it through various API calls?

That’s it!
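
As a rough sketch of the usage confirmed above: keep one dict in the application and pass it to every render() call whose pages end up in the same PDF. This assumes WeasyPrint 52.x, where render() accepts image_cache; newer releases may differ. The helper name `render_combined` is hypothetical and only used for illustration.

```python
def render_combined(html_paths, output_path):
    """Render several HTML files into one PDF, sharing one image cache.

    render_combined is a hypothetical helper, not part of WeasyPrint.
    """
    if not html_paths:
        raise ValueError("at least one HTML file is required")

    # Imported lazily so the argument check above works without WeasyPrint.
    from weasyprint import HTML

    # One cache shared by every render() call (assumes WeasyPrint 52.x).
    image_cache = {}
    documents = [
        HTML(filename=path).render(image_cache=image_cache)
        for path in html_paths
    ]

    # Merge the pages of all documents into the first one and write a single PDF.
    all_pages = [page for doc in documents for page in doc.pages]
    documents[0].copy(all_pages).write_pdf(output_path)
```

Calling `render_combined(["html.html", "html.html"], "combined.pdf")` mirrors the two-document case reported earlier in the thread.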

@grewn0uille added the sponsored label Jan 26, 2021
@leesei
Author

leesei commented Jun 13, 2021

I revisited and updated my code to add image_cache, but it doesn't seem to work at all.
The generated file is still 6.2MB and pdfsizeopt is still able to cut it in half.
Am I doing anything wrong? I'm using WeasyPrint==52.5

from weasyprint import HTML

# template is the Jinja template and context its data, both defined elsewhere
html = template.render(context=context)
# html is one long page containing 1000 logos and 1000 QR codes (both PNG), totaling 200 A4 pages

image_cache = {}
doc = HTML(string=html, base_url="./")
doc.write_pdf("1.pdf", image_cache=image_cache)

@liZe
Member

liZe commented Jun 13, 2021

Hello!

The feature added in this issue makes it possible to store identical images only once in a PDF that is created from multiple render() calls. When you use the HTML class only once, these images are already de-duplicated.

pdfsizeopt has a lot of ways to optimize the size of PDF files, and some seem to be very efficient with your files. But trimming duplicate images, as discussed in this issue, is already done by WeasyPrint.

You can try the current 53 beta version, because you’ll probably get better results. If pdfsizeopt is still able to greatly optimize your document size, you may open a new issue with your HTML/CSS/images sample, and we’ll try to find a way to generate a smaller file.

@leesei
Author

leesei commented Jun 15, 2021

I see, I'll try and report later.

Any pointers on how to study the instructions in the generated PDF and how pdfsizeopt optimized them?
I can help out in studying this issue.
Thanks.
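
One rough way to start looking, as a sketch only: scan the raw PDF bytes for stream payloads that appear more than once and rank them by wasted space. This assumes the streams are stored as plain objects between the `stream` and `endstream` keywords. The regular expression is approximate, the function name `duplicate_stream_report` is hypothetical, and none of this reflects how pdfsizeopt works internally.

```python
import hashlib
import re
from collections import Counter

# Approximate match for the payload between "stream" and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)


def duplicate_stream_report(pdf_bytes):
    """Return (digest, copies, payload_size) for payloads stored more than once."""
    counts = Counter()
    sizes = {}
    for match in STREAM_RE.finditer(pdf_bytes):
        payload = match.group(1)
        digest = hashlib.sha256(payload).hexdigest()
        counts[digest] += 1
        sizes[digest] = len(payload)
    duplicates = [(d, n, sizes[d]) for d, n in counts.items() if n > 1]
    # Sort so the payloads wasting the most bytes come first.
    return sorted(duplicates, key=lambda item: (item[1] - 1) * item[2], reverse=True)
```

Reading the file with `pathlib.Path("1.pdf").read_bytes()` and passing the result to the function would list any byte-identical streams; an empty list suggests the savings come from something other than repeated streams.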

@leesei
Author

leesei commented Jun 23, 2021

I've tried 53.0b2 but unfortunately it was a fail.
It took 48 seconds to create the PDF of 141MB (the CPU load was not high).
v52.5 took 13 seconds to create the PDF of 5.9MB.

@liZe
Member

liZe commented Jul 10, 2021

> I've tried 53.0b2 but unfortunately it was a fail.
> It took 48 seconds to create the PDF of 141MB (the CPU load was not high).
> v52.5 took 13 seconds to create the PDF of 5.9MB.

That’s a big problem, but probably unrelated to this issue. Could you please open a new issue with a sample showing the problem? Thank you!

@leesei
Author

leesei commented Jul 10, 2021

Repo to reproduce this issue: https://github.com/leesei/WeasyPrint_1392

4 participants