Not defining the document encoding can be slow when chardet is installed #1183
Comments
Hello, Could you please provide your HTML and CSS files? That would be very helpful to reproduce your problem. |
Sure. The CSS is inline since we generate it on the backend (I checked whether that has an impact, and it does not). I had to attach it as an IMHO the performance-relevant part is 'with or without images'. Thank you for investigating! |
Your document takes 1.2 s to be rendered on my (quite old) laptop. Does it take really longer on your machine? |
Do you render using the CLI variant? If so, which invocation are you using? (And yes, it takes a lot longer here.) |
Interesting. Using the CLI in my container:
time weasyprint --presentational-hints test.html test.pdf
real 0m1.595s
user 0m1.511s
sys 0m0.077s
Using the REST-based conversion:
time curl -XPOST --data-binary "@test.html" -H"Content-type:text/html" localhost:5001/pdf?filename=rest.pdf -o rest.pdf
real 0m1.422s
user 0m0.004s
sys 0m0.004s
That is a surprising outcome: in my environment (Docker container), the CLI and the REST-based trigger take about the same time, ~1.4 s. Going through my microservice via REST ends up taking 35 seconds. This is obviously an internal issue on my side; I will investigate and report back or close the issue once I know which aspect is causing this. Thanks for your help once again. |
Good luck! |
Ok, I found the reason for the discrepancy: I uploaded the wrong file. I have attached the correct file now:
time weasyprint --presentational-hints test.html test.pdf
real 0m20.161s
user 0m19.568s
sys 0m0.195s
Using the REST-based conversion:
time curl --data-binary '@test.html' -H'Content-type:text/html' http://localhost:5001/pdf -o /tmp/rest.pdf
real 0m21.567s
user 0m0.003s
sys 0m0.012s
So both the CLI and the direct REST call now show the long rendering time I expected. Hope this helps with the reproduction. |
Hmmmmm… It only takes 3 seconds for me, with Python 3.8, on Linux.
I don’t know what to do :/… |
Are you sure you are using the correct (latest) file? I am also on Python 3.8 here. Which OS are you using? I would consider setting up a neutral Docker image so we can both test in the same environment. Do you have Docker? |
You can even try with the file you uploaded on GitHub:
Downloading the file from GitHub adds 1 second to the rendering time for me, but I’m far from 20 seconds. |
Interesting:
time weasyprint -e utf8 -v --presentational-hints https://github.com/Kozea/WeasyPrint/files/5028609/test.html.txt /tmp/test.pdf
real 0m5.344s
user 0m3.864s
sys 0m0.083s
So that is a lot faster. Downloading the file via wget and then running it locally, same as you did:
time weasyprint -e utf8 -v --presentational-hints test.html.txt /tmp/test.pdf
real 0m3.991s
user 0m3.887s
sys 0m0.080s
After some testing it seems to be the
time weasyprint -v --presentational-hints https://github.com/Kozea/WeasyPrint/files/5028609/test.html.txt /tmp/test.pdf
real 0m16.292s
user 0m15.276s
sys 0m0.126s
So what is this parameter doing? |
This parameter tells that the HTML’s encoding is utf-8. Without it, you’ll get the wrong characters in your PDF. You can (should?) also set the encoding in the HTML file directly. But… Removing this parameter doesn’t change anything about the rendering time for me. I’m not sure at all that your problem comes from this parameter. |
Well, it is what is causing this. The above measurements were taken on the same system, one time with
In my web service I added pdf = write_pdf(HTML(string=request.data, encoding='utf-8')) with the same result: the rendering time dropped by a third or more. So whatever this does, it is related. Does it try to guess the encoding and somehow waste time on that? |
Well… The encoding is just given to html5lib, the library that parses the HTML file. I have no idea about what takes so much time, I have no idea why it doesn’t change anything for me. But if your problem really comes from this parameter, then there’s nothing WeasyPrint can do for you! |
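For a web service like the one described above, one way to sidestep encoding guessing altogether is to decode the request body before it reaches the parser. The following is only a minimal sketch under assumptions: `decode_html` and `charset_from_content_type` are hypothetical helpers (not part of WeasyPrint or html5lib), and the sketch assumes the client either sends a `charset` in its Content-Type header or uses UTF-8.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type header, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset()


def decode_html(body, content_type=None, default="utf-8"):
    """Decode raw HTML bytes to text using the declared charset.

    Hypothetical helper: falls back to `default` when the header carries
    no charset or names one that Python does not know.
    """
    charset = charset_from_content_type(content_type) or default
    try:
        return body.decode(charset)
    except LookupError:
        # Unknown charset label in the header: use the default instead.
        return body.decode(default)
```

The decoded text could then be passed as `HTML(string=...)`, so the parser receives characters rather than bytes and should have nothing left to detect. Keeping the bytes and passing `encoding='utf-8'`, as in the comment above, is the shorter alternative when the encoding is known in advance.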
Well, actually the question may be whether we should document the importance of setting the encoding. I never had an encoding issue in the resulting PDF (which would be the obvious reason to ask about encoding), so finding out that performance is affected in this way is odd. You should be able to reproduce it with:
docker run -it alpine:latest sh
# install deps
apk --update --upgrade add supervisor cairo pango gdk-pixbuf py3-cffi py3-pillow py-lxml py3-pip
apk --update --upgrade add git gcc musl-dev jpeg-dev zlib-dev libffi-dev cairo-dev cairo-dev gdk-pixbuf-dev
pip3 install gunicorn flask
#install weasyprint
pip3 install git+https://github.com/Kozea/WeasyPrint.git
#install font
apk --no-cache add ttf-liberation
fc-cache
# the test
time weasyprint -v --presentational-hints https://github.com/Kozea/WeasyPrint/files/5028609/test.html.txt /tmp/test.pdf
should end up with
and on the same container with
time weasyprint -e utf8 -v --presentational-hints https://github.com/Kozea/WeasyPrint/files/5028609/test.html.txt /tmp/test.pdf
real 0m 5.40s
user 0m 3.80s
sys 0m 0.08s |
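To compare the two invocations repeatedly without reading `time` output by hand, a small wrapper can help. This is a rough sketch, not a benchmark tool: `time_command` is a hypothetical helper, and the demo at the bottom times a trivial Python process as a stand-in so that the snippet runs without WeasyPrint installed.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv to completion and return (elapsed_seconds, return_code)."""
    start = time.perf_counter()
    completed = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.perf_counter() - start, completed.returncode


if __name__ == "__main__":
    # Stand-in command; replace with the weasyprint invocations from this
    # thread (with and without "-e utf8") to compare them.
    elapsed, code = time_command([sys.executable, "-c", "pass"])
    print(f"exit code {code}, wall time {elapsed:.2f}s")
```

Wall time measured this way corresponds to the `real` line of `time`, not to `user` or `sys`.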
Problem solved: chardet is slow for your document. Chardet is an optional dependency of html5lib that tries to detect a document encoding. It’s not slow for me, because it’s not installed, and that’s why I had to use
That’s a good question. I really don’t know why chardet is installed on your system, because it’s not a dependency of gunicorn, flask or WeasyPrint. It’s probably installed as a dependency of your alpine packages. By the way, the documentation will be rewritten. I can keep this ticket open to add a comment about this. |
Great that we could nail this one down. Since this Docker image is only meant for WeasyPrint, I'm surprised that it has more than the required dependencies. Alpine usually tries to keep packages tiny and slim, but well, I don't know where it comes from. Having this in the docs makes a lot of sense, in both the installation and the usage docs. Thank you so much for your time! |
This bug has been fixed with version 4.0.0 of chardet 🚀. |
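To see whether a given environment is affected, one can check whether chardet is installed and, if so, whether its version predates 4.0.0. A minimal sketch using only the standard library; `installed_version` and `predates_fix` are hypothetical helper names, and the version check looks only at the major number.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def predates_fix(version, fixed_major=4):
    """True if a version string such as '3.0.4' has a major number below fixed_major."""
    major = version.split(".")[0]
    return major.isdigit() and int(major) < fixed_major


if __name__ == "__main__":
    version = installed_version("chardet")
    if version is None:
        print("chardet is not installed")
    elif predates_fix(version):
        print(f"chardet {version} is installed and older than 4.0.0")
    else:
        print(f"chardet {version} is installed")
```

If an older chardet is present and cannot be upgraded, declaring the encoding explicitly, as discussed earlier in the thread, remains the workaround.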
I have a 9-page PDF document that includes 5 images. Those 5 images are embedded inline as base64-encoded sources.
That's a fairly big performance hit. Are there any tools to optimize this at all? You will find the PDF attached; it is not really complex, more or less a test document for our printing.
We are using WeasyPrint through a REST API like this
This whole service runs on a developer machine (Linux desktop) under Docker.
I would have expected it to be quicker than that, but it seems those images have a huge impact.