-
-
Notifications
You must be signed in to change notification settings - Fork 710
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
slow performance on large JSON files #720
Comments
Hello! Thanks a lot for your corner cases, they're very useful! The rendering time is now lower than 2s on my computer 😉. |
Wow, that's even a 5−10% speed improvement for Alice. 🎉 |
Wow, thanks so much for the quick follow-up! Any chance the 2s you saw was based on other changes too? I just installed 07d3da6, but I'm afraid I'm still seeing the below, wherein JSON-to-PDF rendering is definitely faster (19m to 51s), but HTML-to-PDF rendering is actually a bit slower (4m40s to 5m31s)?
Thank you! |
Yes, unfortunately it was based on a truncated JSON file 😒. It's 66s for the whole file on my computer.
It shouldn't be slower. I'll be back. |
I get almost the same rendering time with (301s) or without (306s) optimization. |
Interesting! Just re-ran the tests on another machine and got the below, whereby there's definitely a significant improvement for JSON-to-PDF, though not for HTML-to-PDF (as you note). Remarkably, the former somehow took 26m this time on version 0.42.3! Even ran that test two times to be sure. Do you have a sense of where other bottlenecks might be? Happy to take a look ourselves if we can be helpful. Thank you! JSON to PDF
HTML to PDF
|
Thanks a lot for your tests. My tests with the HTML version were before and after my optimization commit, but your tests are between 0.42.3 and master, so there's probably something wrong between 0.42.3 and 43rc1. I'll check that! |
I've launched the rendering with 0.42.3, 43rc1 (without optimization) and master (with optimization), using the CPU performance governor set to performance. Here are the results:
You're right, 0.42.3 is faster here. The slowdown has been introduced between 0.42.3 and 43rc1. The optimization for JSON seems to (slightly) improve the situation with HTML, as seen in my previous post. But… Trying to have a faster test, I've truncated the HTML file, and the results are strange… With a truncated version of the HTML file (first 1000 lines instead of about 8000):
So I can't bisect the short version. Bisecting a 5-minute-long test is quite painful, and I'm afraid to find that there's only one culprit commit making both the short version faster and the long version slower… Do you have a better idea? |
Hm, so I just ran
which, after 78s, yielded a summary that I've reformatted as a CSV: https://docs.google.com/spreadsheets/d/173l5QJo5scn9MRFEalFLLAy8AmY0oJDRTVP84JrmndQ/edit?usp=sharing If you sort Z-to-A by tottime, it looks like most of the time is spent in https://github.com/Kozea/WeasyPrint/blob/master/weasyprint/text.py#L665-L672 Any chance that yields any insights? Or to be expected? |
I can't reach the document.
Sorry, you're talking about the JSON file. |
Oops, sorry, here you are! https://docs.google.com/spreadsheets/d/173l5QJo5scn9MRFEalFLLAy8AmY0oJDRTVP84JrmndQ/edit?usp=sharing |
It's quite common to have a high number of calls of The problem is that it's sometimes called with the whole text, and that's really, really long. My commit should exactly fix this case, but obviously it doesn't in every case. I have to check why. |
My fix assumes that you can always break lines after spaces. But sometimes you can't, for example in We have a function finding the correct place where to break lines, I'll try to use it instead of my naive optimization based on spaces. |
What we're trying to do here is to avoid the layout of the whole text when trying to layout the first line. When lines have a "normal" available length, we keep a random number of characters set to (available width / font-size × 2.5). We try to render these characters, and if a second line has been found (ie. the first line is full), we keep the first line and everything is OK. When lines have an available length of 0 to find the minimum width of the line, (available width / font-size × 2.5) equals 0, the first line is empty and we use the fallback rendering the whole text. To avoid this, we have to find another random values when available length is 0. My first try was to find the first space and break after this space. It always works for "normal" text, but not in your case as you have JSON with "strange" characters strings where we can't break lines after some spaces. So, here's what I've tried with your JSON file:
For Alice:
For the Odyssey (same as Alice but longer):
Well… It's hard to choose the right solution. I think that the solution that will be faster in most cases (for "normal" text) is the current solution, finding spaces. Finding real break points may be longer for long blocks of texts (it's the case for your JSON file), but it's THE clean solution. Truncating may seem appealing, but… 100 is really a random value. If I use 50 it doesn't work for your file (you have sha keys with 95 characters). Using real break points is the way to go, don't you think? |
(And for the future: we must find a Pango way to get the first line without rendering the whole text.) |
Version 43 ---------- Released on 2018-11-09. Bug fixes: * `#726 <https://github.com/Kozea/WeasyPrint/issues/726>`_: Make empty strings clear previous values of named strings * `#729 <https://github.com/Kozea/WeasyPrint/issues/729>`_: Include tools in packaging This version also includes the changes from unstable rc1 and rc2 versions listed below. Version 43rc2 ------------- Released on 2018-11-02. **This version is experimental, don't use it in production. If you find bugs, please report them!** Bug fixes: * `#706 <https://github.com/Kozea/WeasyPrint/issues/706>`_: Fix text-indent at the beginning of a page * `#687 <https://github.com/Kozea/WeasyPrint/issues/687>`_: Allow query strings in file:// URIs * `#720 <https://github.com/Kozea/WeasyPrint/issues/720>`_: Optimize minimum size calculation of long inline elements * `#717 <https://github.com/Kozea/WeasyPrint/issues/717>`_: Display <details> tags as blocks * `#691 <https://github.com/Kozea/WeasyPrint/issues/691>`_: Don't recalculate max content widths when distributing extra space for tables * `#722 <https://github.com/Kozea/WeasyPrint/issues/722>`_: Fix bookmarks and strings set on images * `#723 <https://github.com/Kozea/WeasyPrint/issues/723>`_: Warn users when string() is not used in page margin Version 43rc1 ------------- Released on 2018-10-15. **This version is experimental, don't use it in production. If you find bugs, please report them!** Dependencies: * Python 3.4+ is now needed, Python 2.x is not supported anymore * Cairo 1.15.4+ is now needed, but 1.10+ should work with missing features (such as links, outlines and metadata) * Pdfrw is not needed anymore New features: * `Beautiful website <https://weasyprint.org>`_ * `#579 <https://github.com/Kozea/WeasyPrint/issues/579>`_: Initial support of flexbox * `#592 <https://github.com/Kozea/WeasyPrint/pull/592>`_: Support @font-face on Windows * `#306 <https://github.com/Kozea/WeasyPrint/issues/306>`_: Add a timeout parameter to the URL fetcher functions * `#594 <https://github.com/Kozea/WeasyPrint/pull/594>`_: Split tests using modern pytest features * `#599 <https://github.com/Kozea/WeasyPrint/pull/599>`_: Make tests pass on Windows * `#604 <https://github.com/Kozea/WeasyPrint/pull/604>`_: Handle target counters and target texts * `#631 <https://github.com/Kozea/WeasyPrint/pull/631>`_: Enable counter-increment and counter-reset in page context * `#622 <https://github.com/Kozea/WeasyPrint/issues/622>`_: Allow pathlib.Path objects for HTML, CSS and Attachment classes * `#674 <https://github.com/Kozea/WeasyPrint/issues/674>`_: Add extensive installation instructions for Windows Bug fixes: * `#558 <https://github.com/Kozea/WeasyPrint/issues/558>`_: Fix attachments * `#565 <https://github.com/Kozea/WeasyPrint/issues/565>`_, `#596 <https://github.com/Kozea/WeasyPrint/issues/596>`_, `#539 <https://github.com/Kozea/WeasyPrint/issues/539>`_: Fix many PDF rendering, printing and compatibility problems * `#614 <https://github.com/Kozea/WeasyPrint/issues/614>`_: Avoid crashes and endless loops caused by a Pango bug * `#662 <https://github.com/Kozea/WeasyPrint/pull/662>`_: Fix warnings and errors when generating documentation * `#666 <https://github.com/Kozea/WeasyPrint/issues/666>`_, `#685 <https://github.com/Kozea/WeasyPrint/issues/685>`_: Fix many table layout rendering problems * `#680 <https://github.com/Kozea/WeasyPrint/pull/680>`_: Don't crash when there's no font available * `#662 <https://github.com/Kozea/WeasyPrint/pull/662>`_: Fix support of some align values in tables
Not sure if this is a regression from #553, but we noticed that large JSON files can take quite a few minutes to render (in an Ubuntu 18.04 Docker container using Docker 18.06.1-ce-mac73 on macOS 10.14 on a 3.1 GHz Intel Core i7). For instance, given https://gist.github.com/dmalan/284f73968484b1fe0eef6bb532235a7c, which contains 7844 lines totaling 274,359 B, it takes 19+ minutes to render the PDF:
Rendering to HTML first with
pygmentize
reduces the time, but it still takes 4+ minutes to render the PDF:Is there perhaps another performance bottleneck that wasn't addressed by 46da8b5 or 6087982?
Thank you!
The text was updated successfully, but these errors were encountered: