Memory heavy to render a 900kB pdf #1950

Chris2L · 2023-08-29T11:18:41Z

I'm using python 3.10.6 with weasyprint 59.0 on a linux VPS. I'm trying to generate a report that is about 220 pages long that is mainly tables (2 very long tables). First column is a name and job title and the rest of the columns are fontawesome icons.

I use the following code to generate a simple test:

from weasyprint import HTML
HTML('./test.html').write_pdf('./test.pdf')

When generating the pdf it takes up about 2 GB of RAM. I also need to exit the python application to reclaim that RAM.
I have a test.html, just not sure how to upload it here.

The questions are:

Why is it so RAM-heavy? I've noticed that removing whitespace from the HTML reduces memory usage by about 200 MB (surely this should not make a difference).
What should I do to get the memory back after calling write_pdf?


pmjdebruijn · 2023-08-29T12:51:51Z

Possibly related? #1923

Chris2L · 2023-08-30T03:42:25Z

Definitely seems related, thank you. It looks like that problem is being actively worked on. Maybe we can look into the remaining problem: how can I reclaim the memory after generating the document?
I've tried the following and it took the memory from 2GB down to 1.3GB:

import gc
from weasyprint import HTML

html = HTML('./test.html')
content = html.write_pdf('./test.pdf')
del(content)
del(html)
gc.collect()

I'm not familiar with memory management in Python. I tried printing the sizes of all the global variables but did not get much useful data.
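Printing the sizes of globals rarely shows where memory goes. A minimal sketch of an alternative, using the standard-library `tracemalloc` module to list the largest allocation sites while a callable runs. The helper name `top_allocations` is hypothetical, and `tracemalloc` only sees memory allocated through Python's allocators, so memory held by native libraries would not appear here.

```python
import tracemalloc


def top_allocations(func, limit=5):
    """Run func() under tracemalloc and return the largest allocation sites."""
    tracemalloc.start()
    try:
        # Keep the result referenced so its memory is still live at snapshot time.
        result = func()
        snapshot = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # Statistics are grouped per source line and sorted by size, largest first.
    return snapshot.statistics("lineno")[:limit]


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with the rendering call to inspect.
    for stat in top_allocations(lambda: [str(i) for i in range(100_000)]):
        print(stat)
```

With WeasyPrint, the callable could be something like `lambda: HTML('./test.html').write_pdf()`, which returns the PDF as bytes when no target is given.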

ssjkamei · 2023-08-30T03:53:06Z

#1496 may help you.

Chris2L · 2023-08-30T07:43:24Z

At first I believed that #1923 was related (and it still might be) but I created a script that uses no CSS and the memory usage is still very high.
I think #1496 might be related, but I don't believe it is a leak as such. The memory just grows A LOT but then stays stable. I would like to know how to reclaim that memory. Here is my script:

import datetime
import os
import time

import psutil
import weasyprint

delim = ' | '
fields = ['rss', 'vms', 'uss', 'swap']
time_fmt = '%y/%m/%d %H:%M:%S'

def printmem():
    pid = os.getpid()
    mem = psutil.Process(pid).memory_full_info()
    now = datetime.datetime.now()
    print(delim.join((
        now.strftime(time_fmt),
        *[str(round(getattr(mem, field, 0) / 1000000)) + 'mb' for field in fields]
    )))


HEADER = \
"""
<!DOCTYPE HTML>
<html lang='en'>
    <head></head>
    <body>
        <div>
            <table>
                <thead>
                    <tr>
                        <th>
                            <div>Name</div>
                            <div>Department - Job Title</div>
                        </th>
                        <th>header 1</th>
                        <th rowspan='2'>header 2</th>
                        <th rowspan='2'>header 3</th>
                        <th rowspan='2'>header 4</th>
                    </tr>
                </thead>
                <tbody>
                """
ROW = \
"""
                    <tr>
                        <td>
                            <div>John Doe</div>
                            <div>The best job - Ever!</div>
                        </td>
                        <td>
                            <div>
                                1
                            </div>
                        </td>
                        <td>
                            <div>
                                2
                            </div>
                        </td>
                        <td>
                            <div>
                                3
                            </div>
                        </td>
                        <td>
                            <div>
                                4
                            </div>
                        </td>
                    </tr>
"""

FOOTER = \
"""
                </tbody>
            </table>
        </div>
    </body>
</html>
"""

the_html = HEADER
# Create some rows
for _ in range(5000):
    the_html += ROW
the_html += FOOTER

printmem()

for i in range(10):
    start = time.perf_counter()

    html = weasyprint.HTML(string=the_html)
    content = html.write_pdf(f'./test_{i}.pdf')

    end = time.perf_counter()

    printmem()
    print(f"It took {(end - start):.2f} s for round {i}")

    time.sleep(1)

while 1:
    time.sleep(10)

It creates an HTML document with a table of 5000 rows. The memory grows to about 1.5 GB, but then stays stable across subsequent generations. A user in #1496 stated that things get cached and that is why it remains so big, but subsequent generations are not any faster; each one takes the same time. 1.5 GB is a lot to just keep allocated. Is there some way for me to reclaim it after generation? The same process might generate a few documents, but some time passes between each, and I would rather allocate the memory only when needed.

ssjkamei · 2023-08-30T09:50:01Z

I don't know much about memory handling either, but I looked at the memory size while stopping the process with the debugger.

for i in range(10):
    start = time.perf_counter()

    html = weasyprint.HTML(string=the_html)
    content = html.write_pdf(f'./test_{i}.pdf')

    html = None
    content = None

    end = time.perf_counter()

    printmem()
    print(f"It took {(end - start):.2f} s for round {i}")

    time.sleep(1)

I tried overwriting the variables with None as shown above, but the memory usage did not change after setting them to None.
When weasyprint.HTML(string=the_html) was executed again, the memory usage dropped to some extent and then grew again.

Although the script uses no CSS, the memory usage still seems to be affected by CSS handling, because default style values are used to calculate the layout.

I think it's probably the same story as #1496.

I think you need to check Python implementation.

Chris2L · 2023-08-30T10:30:57Z

@ssjkamei Thank you for taking a look. I think I'm going to have to spend a few hours trying to understand how weasyprint works under the hood to try and figure out what is going on here.

Not sure what you mean by "check your Python implementation." Do you mean this is what it is and I need to make another plan to restart the process or something to clear out the memory? It's not meant in a bad way, just trying to make sure I understand your statement.

Thanks

liZe · 2023-08-30T12:54:30Z

Hi!

The memory just grows A LOT but then stays stable. I would like to know how to reclaim that memory.

That’s something users want to do, but the short answer is: you can’t. Python doesn’t provide an interface to manage memory, even when using Python’s C interface, and that’s intended. Quoting the documentation:

It is important to understand that the management of the Python heap is performed by the interpreter itself and that the user has no control over it, even if they regularly manipulate object pointers to memory blocks inside that heap.

Manually running the garbage collector can help, but the result is not deterministic: launch the same process multiple times and you may not get the same result. You can even delete the HTML object before running the garbage collector, to be sure that there’s no Python object left related to WeasyPrint.
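A minimal sketch of how one could check that an object really is gone after `del` and `gc.collect()`, using a weak reference as a probe. `Document` and `is_released` are hypothetical names; `Document` only stands in for a large object such as `weasyprint.HTML`.

```python
import gc
import weakref


class Document:
    """Hypothetical stand-in for a large object such as weasyprint.HTML."""

    def __init__(self):
        # A reference cycle: reference counting alone cannot free this object,
        # so it is only reclaimed when the cyclic garbage collector runs.
        self.self_ref = self


def is_released(factory):
    """Create an object, drop it, collect, and report whether it was freed."""
    obj = factory()
    probe = weakref.ref(obj)  # does not keep obj alive
    del obj
    gc.collect()
    return probe() is None


if __name__ == "__main__":
    print(is_released(Document))
```

Even when the probe reports that the object was collected, the resident memory of the process may stay high, because freed memory is not necessarily returned to the operating system.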

So, if you really want to free the memory, the easiest way is probably to use dedicated processes. Here’s a StackOverflow thread about this topic.
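A minimal sketch of the dedicated-process idea, assuming WeasyPrint is installed in the interpreter that `sys.executable` points to. It starts a fresh Python process for each rendering, so all memory used by the rendering is returned to the operating system when that process exits. The names `RENDER_SNIPPET`, `build_command` and `render_in_subprocess`, and the 600-second timeout, are hypothetical choices for illustration.

```python
import subprocess
import sys

# Code executed by the child interpreter. With "python -c CODE a b",
# sys.argv[1] and sys.argv[2] are the two extra arguments.
RENDER_SNIPPET = (
    "import sys\n"
    "from weasyprint import HTML\n"
    "HTML(sys.argv[1]).write_pdf(sys.argv[2])\n"
)


def build_command(html_path, pdf_path):
    """Build the argument list for a one-off rendering process."""
    return [sys.executable, "-c", RENDER_SNIPPET, html_path, pdf_path]


def render_in_subprocess(html_path, pdf_path, timeout=600):
    """Render html_path to pdf_path in a child process and wait for it.

    Raises subprocess.CalledProcessError if the child exits with an error,
    and subprocess.TimeoutExpired if it runs longer than timeout seconds.
    """
    subprocess.run(build_command(html_path, pdf_path), check=True, timeout=timeout)


if __name__ == "__main__":
    render_in_subprocess("./test.html", "./test.pdf")
```

The trade-off is the start-up cost of a new interpreter and a fresh import of WeasyPrint for every document, which is likely acceptable when documents are generated only occasionally.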

ssjkamei · 2023-08-31T02:17:42Z

@Chris2L
I'm sorry, I'm not good at English. I was wrong.

"check your Python implementation."
I wanted to write "Need to check Python itself."

Chris2L · 2023-08-31T04:01:32Z

No problem. Your English might not be good ;-) but your python rocks! Loving the updates you are pushing for your pull request.

liZe · 2023-08-31T18:26:55Z

As the memory part of this issue is already discussed in #1496 and the table part in #1923, could we close this issue and continue the discussion in #1923?

Chris2L closed this as completed Aug 31, 2023

liZe mentioned this issue Aug 31, 2023

WeasyPrint consuming a lot of memory when rendering tables with 5000 rows #1104

Closed
