Trim duplicate objects #969

leesei · 2019-10-15T04:39:15Z

I'm using WeasyPrint to create lucky draw tickets from a Jinja template with images (logos and the like).
It seems WeasyPrint embeds the image once per instance rather than reusing the same object, resulting in a bloated PDF file.

I tried pdfsizeopt (pts/pdfsizeopt, a PDF file size optimizer) and the file size was reduced by about half.

> pdfsizeopt --do-optimize-images=no --do-require-image-optimizers=no 1.pdf out.pdf
info: prepending to PATH: /usr/bin
info: This is pdfsizeopt ZIP rUNKNOWN size=69734.
info: prepending to PATH: /usr/bin
error: image optimizer not found on PATH: jbig2
error: image optimizer not found on PATH: pngout
error: image optimizer not found on PATH: sam2p
error: image optimizer not found on PATH: sam2p
info: loading PDF from: 1.pdf
info: loaded PDF of 6195019 bytes
info: separated to 827 objs + xref + trailer
info: parsed 827 objs
info: eliminated 208 unused objs, depth=9
info: found 0 Type1 fonts loaded
info: found 2 Type1C fonts loaded
info: optimized 206 streams, kept 203 #orig, 3 zip
info: eliminated 396 duplicate objs
info: compressed 0 streams, kept 0 of them uncompressed
info: saving PDF with 223 objs to: out.pdf
info: generated object stream of 2076 bytes in 114 objects (5%)
info: generated 3130331 bytes (51%)

Is it possible for WeasyPrint to have this optimization built in?
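
To make the "eliminated 396 duplicate objs" line in the log more concrete, here is a minimal sketch of content-hash deduplication. It is not pdfsizeopt's actual implementation and does not parse PDF; the `dedupe_objects` helper and the sample payloads are hypothetical, and objects are modelled as plain byte strings keyed by object number.

```python
import hashlib


def dedupe_objects(objects):
    """Collapse byte-identical objects (illustrative only, not real PDF handling).

    `objects` maps an object number to its serialized bytes. Returns
    (kept, remap): `kept` holds one copy per distinct payload, and `remap`
    sends every original number to the number of the copy that survives.
    """
    first_seen = {}  # payload digest -> object number that is kept
    kept = {}
    remap = {}
    for number in sorted(objects):
        payload = objects[number]
        digest = hashlib.sha256(payload).digest()
        if digest in first_seen:
            # Same bytes as an earlier object: point at the earlier one.
            remap[number] = first_seen[digest]
        else:
            first_seen[digest] = number
            kept[number] = payload
            remap[number] = number
    return kept, remap


# Hypothetical payloads: the same logo embedded three times, plus one QR code.
objects = {
    1: b"<logo image bytes>",
    2: b"<qr code A>",
    3: b"<logo image bytes>",
    4: b"<logo image bytes>",
}
kept, remap = dedupe_objects(objects)
print(len(objects) - len(kept), "duplicates eliminated")  # prints "2 duplicates eliminated"
```

A real optimizer would also have to rewrite every reference using `remap`, and probably repeat the pass, since objects that pointed at different copies can become identical once their references are unified.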

DimaBogdan-phie · 2020-05-30T09:57:22Z

Is there any progress?
I think the problem is not very difficult to solve.
It looks like this duplicate optimization is already applied within each single document produced by HTML().render().
But when I try to do this:

from weasyprint import HTML
d = HTML('html.html').render()
d2 = HTML('html.html').render()
documents = [d, d2]
all_pages = [p for doc in documents for p in doc.pages]
documents[0].copy(all_pages).write_pdf('combined.pdf')

I get a double-sized PDF (2.8 MB instead of 1.4 MB).

How can I get this optimization across multiple rendered documents?

liZe · 2020-06-22T14:34:15Z

A new image_cache option can be passed to render() to share the cache used for images.

> I think the problem is not very difficult to solve.

I think it was!

leesei · 2020-06-29T06:50:01Z

Thanks.
So to use it I just need to maintain the cache dict in my app and pass it through various API calls?

liZe · 2020-06-29T11:13:25Z

> So to use it I just need to maintain the cache dict in my app and pass it through various API calls?

That’s it!
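
For reference, a minimal sketch of that pattern, assuming WeasyPrint 52.x, where render() accepts the image_cache option mentioned above. The `render_combined` helper and its parameter names are hypothetical; the calls themselves (HTML(...), render(), doc.pages, copy(), write_pdf()) are the ones already used earlier in this thread.

```python
def render_combined(html_files, target):
    """Render several HTML files and write them as one PDF.

    Assumes WeasyPrint 52.x. The import is done lazily so that the module
    is only needed when the function is actually called.
    """
    from weasyprint import HTML

    # One dict owned by the application and shared by every render() call,
    # so an image loaded for the first document is reused by the next ones.
    image_cache = {}

    documents = [
        HTML(path).render(image_cache=image_cache) for path in html_files
    ]

    # Merge the pages of all documents into a single Document and write it.
    all_pages = [page for doc in documents for page in doc.pages]
    documents[0].copy(all_pages).write_pdf(target)


# Example call (file names taken from the earlier comment):
# render_combined(["html.html", "html.html"], "combined.pdf")
```

The cache only helps across separate render() calls; within a single HTML(...).render() the images are expected to be shared already.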

leesei · 2021-06-13T15:17:59Z

I revisited this and updated my code to add image_cache, but it doesn't seem to work at all.
The generated file is still 6.2 MB, and pdfsizeopt can still cut it roughly in half.
Am I doing anything wrong? I'm using WeasyPrint==52.5

html = template.render(context=context)
# html is one long page containing 1000 logos and 1000 QR codes (both PNG), about 200 A4 pages in total

image_cache = {}
doc = HTML(string=html, base_url="./")
doc.write_pdf("1.pdf", image_cache=image_cache)

liZe · 2021-06-13T16:56:20Z

Hello!

The feature added in this issue makes it possible to store identical images only once in the PDF when it’s created using multiple render() calls. When you use the HTML class only once, these images are already de-duplicated.

pdfsizeopt has a lot of ways to optimize the size of PDF files, and some seem to be very efficient with your files. But trimming duplicate images, as discussed in this issue, is already done by WeasyPrint.

You can try to use the current 53 beta version, because you’ll probably get better results. If pdfsizeopt is still able to greatly optimize your document size, you may open a new issue with your HTML/CSS/images sample, we’ll try to find a way to generate a smaller file.

leesei · 2021-06-15T12:19:28Z

I see, I'll try and report later.

Any pointers on how to study the instructions in the generated PDF and how pdfsizeopt optimized them?
I can help out in studying this issue.
Thanks.
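
One rough way to start comparing the two files is to count image dictionaries in the raw bytes before and after optimization. This is only a heuristic sketch, not a PDF parser: it assumes image XObject dictionaries appear as plain text in the file body, and `count_image_xobjects` is a hypothetical helper name.

```python
import re

# An image XObject dictionary carries "/Subtype /Image"; whitespace between
# the two names is optional in PDF syntax.
IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image\b")


def count_image_xobjects(pdf_bytes):
    """Count "/Subtype /Image" markers in raw PDF bytes (heuristic only)."""
    return len(IMAGE_MARKER.findall(pdf_bytes))


# Tiny synthetic sample standing in for a real file: two images, one form.
sample = (
    b"1 0 obj << /Type /XObject /Subtype /Image /Width 10 >> endobj\n"
    b"2 0 obj << /Subtype/Image >> endobj\n"
    b"3 0 obj << /Subtype /Form >> endobj\n"
)
print(count_image_xobjects(sample))  # prints 2

# With real files, for example:
# with open("1.pdf", "rb") as f:
#     print(count_image_xobjects(f.read()))
```

If the count for 1.pdf is far above the number of distinct images in the HTML, duplicated images are a likely cause; if the counts for 1.pdf and out.pdf are similar, the savings probably come from something else.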

leesei · 2021-06-23T15:30:03Z

I've tried 53.0b2 but unfortunately it was a fail.
It took 48 seconds to create the PDF of 141MB (the CPU load was not high).
v52.5 took 13 seconds to create the PDF of 5.9MB.

liZe · 2021-07-10T09:31:09Z

> I've tried 53.0b2 but unfortunately it was a fail.
> It took 48 seconds to create the PDF of 141MB (the CPU load was not high).
> v52.5 took 13 seconds to create the PDF of 5.9MB.

That’s a big problem, but probably unrelated to this issue. Could you please open a new issue with a sample showing the problem? Thank you!

leesei · 2021-07-10T14:00:37Z

Repo to reproduce this issue: https://github.com/leesei/WeasyPrint_1392

liZe added the "performance" (Too slow renderings) label Nov 8, 2019

liZe closed this as completed in 3d4cb04 Jun 22, 2020

liZe added this to the 52 milestone Jun 22, 2020

grewn0uille added the "sponsored" (Issues sponsored to be resolved faster) label Jan 26, 2021

leesei mentioned this issue Jul 10, 2021

53.0b2 generate much larger PDF (and much slower) #1392

Closed
