Not defining the document encoding can be slow when chardet is installed #1183

Closed
EugenMayer opened this issue Aug 4, 2020 · 19 comments
Labels
documentation Problems or improvements needed on the documentation or on the website

Comments

@EugenMayer

I have a 9-page PDF document that includes 5 images. Those 5 images are embedded inline as base64-encoded sources.

  • When I exclude those images during rendering, the entire pipeline takes about 16 seconds.
  • When I include those images, I end up with 49 seconds.

That's a fairly big performance hit. Are there any tools to optimize this at all? The PDF is attached; it is not really complex and is more or less a test document for our printing.
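
For a sense of how much text inline images add to the HTML source, here is a minimal sketch of building a base64 `data:` URI. The `data_uri` helper and the 300 kB payload are hypothetical and not taken from the attached document; the only firm fact used is that base64 emits 4 characters for every 3 input bytes.

```python
import base64


def data_uri(raw: bytes, mime: str = "image/png") -> str:
    """Build an inline data: URI for an image, as used in <img src="...">."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Hypothetical 300 kB payload (zero bytes stand in for real image data).
raw_image = bytes(300_000)
uri = data_uri(raw_image)

# Base64 turns every 3 input bytes into 4 output characters,
# so the inline form is about a third larger than the raw image.
print(len(raw_image), len(uri))
```

Every inline image therefore makes the HTML that has to be read and parsed considerably larger, which is worth keeping in mind when comparing runs with and without images.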

We are using WeasyPrint through a REST API like this:

from flask import Flask, make_response, request
from weasyprint import CSS, HTML

app = Flask(__name__)


@app.route('/pdf', methods=['POST'])
def generate():
    name = request.args.get('filename', 'unnamed.pdf')
    app.logger.info('POST  /pdf?filename=%s' % name)

    # Parse the posted HTML, lay it out, then serialize the result to PDF.
    html = HTML(string=request.data)
    document = html.render(stylesheets=[CSS('css/local.css')], presentational_hints=True)
    pdf = document.write_pdf(zoom=0.7936507936507937)

    response = make_response(pdf)
    response.headers['Content-Type'] = 'application/pdf'
    response.headers['Content-Disposition'] = 'inline;filename=%s' % name
    app.logger.info(' ==> POST  /pdf?filename=%s  ok' % name)
    return response

This whole service runs on a developer machine (Linux desktop) under Docker:

  • Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  • 32 GB RAM
  • very fast M.2 SSD

I would have expected it to be quicker than that, but it seems those images have a huge impact.

@liZe
Member

liZe commented Aug 4, 2020

Hello,

Could you please provide your HTML and CSS files? That would be very helpful to reproduce your problem.

@EugenMayer
Author

Sure. The CSS is inline since we generate it on the backend (I checked whether that causes the slowdown, but it does not).

I had to attach it as a .txt file:

test.txt

IMHO the performance-relevant part is 'with or without images'. Thank you for investigating!

@liZe
Member

liZe commented Aug 5, 2020

Your document takes 1.2 s to render on my (quite old) laptop. Does it really take longer on your machine?

@EugenMayer
Author

Do you render using the CLI variant? If so, which invocation are you using? (And yes, it takes a lot longer.)

@liZe
Member

liZe commented Aug 5, 2020

weasyprint --presentational-hints /tmp/speed.html /tmp/speed.pdf

@EugenMayer
Author

Interesting. Using the CLI in my container:

time weasyprint --presentational-hints test.html test.pdf
real    0m1.595s
user    0m1.511s
sys     0m0.077s

Using the REST-based conversion:

time curl -XPOST --data-binary "@test.html" -H"Content-type:text/html" localhost:5001/pdf?filename=rest.pdf -o rest.pdf
real    0m1.422s
user    0m0.004s
sys     0m0.004s

That is a surprising outcome: in my environment (Docker container), the CLI and the REST-based trigger take about the same time, roughly 1.4 to 1.6 s.

Going through my microservice via REST ends up taking 35 seconds. This obviously seems to be an internal issue; I will investigate and report back or close the issue once I know which aspect is causing this. Thanks for your help once again.

@liZe
Member

liZe commented Aug 5, 2020

Going through my microservice via REST ends up taking 35 seconds. This obviously seems to be an internal issue; I will investigate and report back or close the issue once I know which aspect is causing this. Thanks for your help once again.

Good luck!

@EugenMayer
Author

OK, I found the reason for the discrepancy: I uploaded the wrong file. I have attached the correct file now:

test.html.txt

time weasyprint --presentational-hints test.html test.pdf
real    0m20.161s
user    0m19.568s
sys     0m0.195s

Using the REST-based conversion:

time curl --data-binary '@test.html' -H'Content-type:text/html' http://localhost:5001/pdf -o /tmp/rest.pdf
real    0m21.567s
user    0m0.003s
sys     0m0.012s

So now both the CLI and the direct REST call show the expected long rendering time. I hope this helps with the reproduction.

@liZe
Member

liZe commented Aug 5, 2020

Hmmmmm…

It only takes 3 seconds for me, with Python 3.8, on Linux.

time python -m weasyprint -e utf8 -q --presentational-hints /tmp/speed.html /tmp/speed.pdf

________________________________________________________
Executed in    3,16 secs   fish           external 
   usr time    3,05 secs    0,00 micros    3,05 secs 
   sys time    0,11 secs  991,00 micros    0,11 secs

I don’t know what to do :/…

@EugenMayer
Author

Are you sure you are using the correct file (the latest one)?

Also on Python 3.8 here. Which OS are you using? I would consider setting up a neutral Docker image so that we can both test in the same environment. Do you have Docker?

@liZe
Member

liZe commented Aug 5, 2020

You can even try with the file you uploaded on GitHub:

python -m weasyprint -e utf8 -v --presentational-hints https://github.com/Kozea/WeasyPrint/files/5028609/test.html.txt /tmp/speed.pdf

Downloading the file from GitHub adds 1 second to the rendering time for me, but I’m far from 20 seconds.

@EugenMayer
Author

Interesting:

time weasyprint -e utf8 -v --presentational-hints https://github.com/Kozea/WeasyPrint/files/5028609/test.html.txt /tmp/test.pdf
real    0m5.344s
user    0m3.864s
sys     0m0.083s

So that is a lot faster. Downloading the file via wget and then running it locally gives the same as you had:

time weasyprint -e utf8 -v --presentational-hints test.html.txt /tmp/test.pdf
real    0m3.991s
user    0m3.887s
sys     0m0.080s

After some testing, it seems to be the -e utf8 flag. Without it, directly from GitHub:

time weasyprint -v --presentational-hints https://github.com/Kozea/WeasyPrint/files/5028609/test.html.txt /tmp/test.pdf
real    0m16.292s
user    0m15.276s
sys     0m0.126s

So what is this parameter doing?

@liZe
Member

liZe commented Aug 5, 2020

So what is this parameter doing?

This parameter tells WeasyPrint that the HTML’s encoding is UTF-8. Without it, you’ll get the wrong characters in your PDF. You can (should?) also set the encoding in the HTML file directly.

But… Removing this parameter doesn’t change anything about the rendering time for me. I’m not sure at all that your problem comes from this parameter.
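
Setting the encoding in the HTML file itself usually means a `<meta charset>` declaration near the top of the document. The sketch below checks whether a byte string declares one within the first 1024 bytes, the window the HTML specification requires the declaration to fit in. The regular expression is a simplification rather than the full prescan algorithm, and the `declared_charset` helper name is hypothetical. It also rests on the assumption that the parser consults such a declaration before falling back to detection.

```python
import re

# The HTML spec requires the charset declaration to sit within the first
# 1024 bytes of the document, so only that window is inspected here.
PRESCAN_BYTES = 1024
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([a-zA-Z0-9_-]+)", re.IGNORECASE
)


def declared_charset(html_bytes: bytes):
    """Return the charset named in an early <meta> tag, or None."""
    match = META_CHARSET.search(html_bytes[:PRESCAN_BYTES])
    return match.group(1).decode("ascii").lower() if match else None


print(declared_charset(b'<html><head><meta charset="utf-8"></head></html>'))
print(declared_charset(b"<html><body>no declaration</body></html>"))
```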

@EugenMayer
Author

EugenMayer commented Aug 5, 2020

Well, it is what is causing this. The runs above were done on the same system, once with -e utf8 and once without. We end up with:

  • 0m5.344s (with)
  • 0m16.292s (without).

In my web service I added encoding='utf-8':

pdf = write_pdf(HTML(string=request.data, encoding='utf-8'))

With the same result: the rendering time dropped by a third or more. So whatever this does, it is related. Does it try to guess the encoding and somehow waste time on that?
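
One way to take encoding guessing out of the picture entirely is to decode the request body before handing it to the renderer. This is a minimal sketch, assuming the client always sends UTF-8; `decode_html` is a hypothetical helper, and passing the resulting text on as `HTML(string=text)` is an assumption about how the service would use it.

```python
def decode_html(payload: bytes, encoding: str = "utf-8") -> str:
    """Decode an HTML request body up front so nothing has to be guessed."""
    # errors="strict" raises UnicodeDecodeError on bad input instead of
    # silently producing wrong characters.
    return payload.decode(encoding, errors="strict")


body = "<p>Grüße</p>".encode("utf-8")
text = decode_html(body)
print(text)
# In the service above, `text` would then replace request.data,
# e.g. HTML(string=text) (assumption, not verified here).
```

Strict decoding has the side benefit that a request in the wrong encoding fails loudly instead of yielding a PDF with garbled characters.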

@liZe
Member

liZe commented Aug 5, 2020

With the same result: the rendering time dropped by a third or more. So whatever this does, it is related. Does it try to guess the encoding and somehow waste time on that?

Well… The encoding is just given to html5lib, the library that parses the HTML file. I have no idea what takes so much time, and I have no idea why it doesn’t change anything for me. But if your problem really comes from this parameter, then there’s nothing WeasyPrint can do for you!

@EugenMayer
Author

EugenMayer commented Aug 5, 2020

Well, actually the question may be whether we should document the importance of setting the encoding. I never had an encoding issue in the resulting PDF (which would be one of the obvious reasons to ask about encoding), so finding out that performance is affected in this way is odd.

You should be able to reproduce it with:

docker run -it alpine:latest sh
# install deps
apk --update --upgrade add supervisor cairo pango gdk-pixbuf py3-cffi py3-pillow py-lxml py3-pip
apk --update --upgrade add git gcc musl-dev jpeg-dev zlib-dev libffi-dev cairo-dev gdk-pixbuf-dev
pip3 install gunicorn flask

#install weasyprint
pip3 install git+https://github.com/Kozea/WeasyPrint.git

#install font
apk --no-cache add ttf-liberation
fc-cache

# the test
time weasyprint -v --presentational-hints https://github.com/Kozea/WeasyPrint/files/5028609/test.html.txt /tmp/test.pdf

This should end up with:

real    0m 16.27s
user    0m 15.14s
sys     0m 0.14s

And in the same container with -e utf8:

time weasyprint -e utf8 -v --presentational-hints https://github.com/Kozea/WeasyPrint/files/5028609/test.html.txt /tmp/test.pdf
real    0m 5.40s
user    0m 3.80s
sys     0m 0.08s
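
The two timed runs above can also be scripted, which makes the comparison easy to repeat in other environments. This is a sketch only: `time_command` and `weasyprint_cmd` are hypothetical helpers, the output path is arbitrary, and the only CLI options used are the `-e` and `--presentational-hints` flags that already appear in this thread.

```python
import shutil
import subprocess
import sys
import time


def time_command(cmd):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(
        cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.perf_counter() - start


def weasyprint_cmd(source, target, encoding=None):
    """Build the CLI invocation used in this thread, optionally with -e."""
    cmd = ["weasyprint", "--presentational-hints"]
    if encoding:
        cmd += ["-e", encoding]
    return cmd + [source, target]


# Only runs when a source file or URL is given and the CLI is on the PATH.
if __name__ == "__main__" and len(sys.argv) > 1 and shutil.which("weasyprint"):
    source = sys.argv[1]
    for enc in (None, "utf8"):
        elapsed = time_command(weasyprint_cmd(source, "/tmp/test.pdf", enc))
        print(f"encoding={enc}: {elapsed:.2f}s")
```

Saved under any file name and called with the HTML file or URL as its only argument, it prints one timing per variant.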

@liZe
Member

liZe commented Aug 5, 2020

Problem solved: chardet is slow for your document.

Chardet is an optional dependency of html5lib that tries to detect a document’s encoding. It’s not slow for me, because it’s not installed, and that’s why I had to use -e utf8 to get the correct rendering. You don’t need the -e option, because chardet is installed on your system and (very slowly) detects the right encoding.

Well, actually the question may be whether we should document the importance of setting the encoding.

That’s a good question. I really don’t know why chardet is installed on your system, because it’s not a dependency of gunicorn, flask or WeasyPrint. It’s probably installed as a dependency of your alpine packages.

By the way, the documentation will be rewritten. I can keep this ticket open to add a comment about this.
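
Since the behaviour depends on whether chardet happens to be importable, a quick check in the target environment can explain differences like the ones seen in this thread. A minimal sketch using only the standard library; the `chardet_available` name is hypothetical.

```python
import importlib.util


def chardet_available() -> bool:
    """Return True if the optional chardet package can be imported here."""
    return importlib.util.find_spec("chardet") is not None


if chardet_available():
    print("chardet is installed: undeclared encodings may be auto-detected")
else:
    print("chardet is not installed: pass the encoding explicitly")
```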

@liZe liZe added the documentation Problems or improvements needed on the documentation or on the website label Aug 5, 2020
@liZe liZe changed the title Rendering performance massively degrades with included base64 images Not defining the document encoding can be slow when chardet is installed Aug 5, 2020
@EugenMayer
Author

Great that we could nail this one down. Since this Docker image is only meant for WeasyPrint, I am surprised that I have more than the required dependencies. Alpine usually tries to keep its packages tiny and slim, but well, I don't know where it comes from.

Having this in the docs makes a lot of sense, in both the installation and usage docs.

Thank you so much for your time!

@liZe
Member

liZe commented Aug 17, 2021

This bug has been fixed with version 4.0.0 of chardet 🚀.
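
To see whether a given environment already has that chardet release, the installed version can be read from the package metadata. This is a sketch with hypothetical helper names; the version parsing is deliberately simple and only handles plain dotted numbers such as 4.0.0.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn '4.0.0' into (4, 0, 0), padding short versions with zeros."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple((parts + [0, 0, 0])[:3])


def chardet_has_fix():
    """Return True/False for the installed chardet, or None if it is absent."""
    try:
        installed = metadata.version("chardet")
    except metadata.PackageNotFoundError:
        return None
    return parse_version(installed) >= (4, 0, 0)


print(chardet_has_fix())
```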

@liZe liZe closed this as completed Aug 17, 2021