Memory heavy to render a 900kB pdf #1950

Closed
Chris2L opened this issue Aug 29, 2023 · 10 comments

Comments

@Chris2L

Chris2L commented Aug 29, 2023

I'm using Python 3.10.6 with WeasyPrint 59.0 on a Linux VPS. I'm trying to generate a report of about 220 pages that consists mainly of tables (2 very long tables). The first column holds a name and job title, and the rest of the columns hold Font Awesome icons.

I use the following code to generate a simple test:

from weasyprint import HTML
HTML('./test.html').write_pdf('./test.pdf')

When generating the pdf it takes up about 2 GB of RAM. I also need to exit the python application to reclaim that RAM.
I have a test.html, just not sure how to upload it here.

The questions are:

  1. why is it so RAM-heavy? I've noticed that removing whitespace from the html reduces the memory size by about 200 MB (surely this should not make a difference).
  2. What should I do to get the memory back after calling write_pdf?
@pmjdebruijn
Contributor

Possibly related? #1923

@Chris2L
Author

Chris2L commented Aug 30, 2023

Definitely seems related, thank you. It looks like that problem is being actively worked on. Maybe we can look into the remaining problem: how can I reclaim the memory after rendering the document?
I've tried the following and it took the memory from 2GB down to 1.3GB:

import gc
from weasyprint import HTML

html = HTML('./test.html')
content = html.write_pdf('./test.pdf')
del(content)
del(html)
gc.collect()

I'm not familiar with memory management in Python. I tried printing the sizes of all the global variables but did not get much useful data.
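
Listing global variables rarely shows where memory goes. As a rough sketch, the standard-library `tracemalloc` module can report how much memory Python itself allocated during a call and how much is still held afterwards. The helper name `measure_python_heap` is made up for illustration, and it only sees allocations made through Python's allocator, not what the operating system reports as RSS.

```python
import gc
import tracemalloc


def measure_python_heap(func):
    """Run func() and return (current, peak) bytes traced by tracemalloc.

    Hypothetical helper for illustration only. "current" is what Python
    still holds after func() has returned and a collection has run;
    "peak" is the highest traced usage seen while tracing was active.
    """
    tracemalloc.start()
    try:
        func()          # the return value is discarded on purpose
        gc.collect()    # drop unreachable reference cycles before measuring
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return current, peak
```

For the case above, one could pass `lambda: HTML('./test.html').write_pdf('./test.pdf')` as `func`. A small `current` next to a large RSS would suggest that the objects were freed but the memory was not handed back to the operating system.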

@ssjkamei

#1496 may help you.

@Chris2L
Author

Chris2L commented Aug 30, 2023

At first I believed that #1923 was related (and it still might be) but I created a script that uses no CSS and the memory usage is still very high.
I think #1496 might be related, but I don't believe it is a leak as such. The memory just grows A LOT but then stays stable. I would like to know how to reclaim that memory. Here is my script:

import datetime
import os
import time

import psutil
import weasyprint

delim = ' | '
fields = ['rss', 'vms', 'uss', 'swap']
time_fmt = '%y/%m/%d %H:%M:%S'  # %M is minutes (%I would be the 12-hour clock hour)

def printmem():
    pid = os.getpid()
    mem = psutil.Process(pid).memory_full_info()
    now = datetime.datetime.now()
    print(delim.join((
        now.strftime(time_fmt),
        *[str(round(getattr(mem, field, 0) / 1000000)) + 'mb' for field in fields]
    )))


HEADER = \
"""
<!DOCTYPE HTML>
<html lang='en'>
    <head></head>
    <body>
        <div>
            <table>
                <thead>
                    <tr>
                        <th>
                            <div>Name</div>
                            <div>Department - Job Title</div>
                        </th>
                        <th>header 1</th>
                        <th rowspan='2'>header 2</th>
                        <th rowspan='2'>header 3</th>
                        <th rowspan='2'>header 4</th>
                    </tr>
                </thead>
                <tbody>
                """
ROW = \
"""
                    <tr>
                        <td>
                            <div>John Doe</div>
                            <div>The best job - Ever!</div>
                        </td>
                        <td>
                            <div>
                                1
                            </div>
                        </td>
                        <td>
                            <div>
                                2
                            </div>
                        </td>
                        <td>
                            <div>
                                3
                            </div>
                        </td>
                        <td>
                            <div>
                                4
                            </div>
                        </td>
                    </tr>
"""

FOOTER = \
"""
                </tbody>
            </table>
        </div>
    </body>
</html>
"""

the_html = HEADER
# Create some rows
for _ in range(5000):
    the_html += ROW
the_html += FOOTER

printmem()

for i in range(10):
    start = time.perf_counter()

    html = weasyprint.HTML(string=the_html)
    content = html.write_pdf(f'./test_{i}.pdf')

    end = time.perf_counter()

    printmem()
    print(f"It took {(end - start):.2f} s for round {i}")

    time.sleep(1)

while 1:
    time.sleep(10)

It creates an HTML document containing a table with 5000 rows. The memory grows to about 1.5 GB, but then stays stable across subsequent generations. A user in #1496 stated that stuff gets cached and that is why it remains so big, but subsequent generations are not any faster: each one takes the same time. 1.5 GB is a lot to just keep allocated. Is there some way for me to reclaim it after generation? The same process might generate a few documents with some time passing between each, and I would rather allocate the memory only when needed.

@ssjkamei

ssjkamei commented Aug 30, 2023

I don't know much about memory handling either, but I looked at the memory size while stopping the process with the debugger.

for i in range(10):
    start = time.perf_counter()

    html = weasyprint.HTML(string=the_html)
    content = html.write_pdf(f'./test_{i}.pdf')

    html = None
    content = None

    end = time.perf_counter()

    printmem()
    print(f"It took {(end - start):.2f} s for round {i}")

    time.sleep(1)

I tried overwriting the variables with None as shown above, but the memory usage did not change.
When weasyprint.HTML(string=the_html) was executed again, the memory usage dropped to a certain extent and then grew again.

Although the script does not use any CSS, memory usage still seems to be affected by CSS handling, because default style values are used to calculate the drawing positions.

I think it's probably the same story as #1496.

I think you need to check Python implementation.

@Chris2L
Author

Chris2L commented Aug 30, 2023

@ssjkamei Thank you for taking a look. I think I'm going to have to spend a few hours trying to understand how weasyprint works under the hood to try and figure out what is going on here.

Not sure what you mean by "check your Python implementation." Do you mean this is what it is and I need to make another plan to restart the process or something to clear out the memory? It's not meant in a bad way, just trying to make sure I understand your statement.

Thanks

@liZe
Member

liZe commented Aug 30, 2023

Hi!

> The memory just grows A LOT but then stays stable. I would like to know how to reclaim that memory.

That’s something users want to do, but the short answer is: you can’t. Python doesn’t provide an interface to manage memory, even when using Python’s C interface, and that’s intended. Quoting the documentation:

> It is important to understand that the management of the Python heap is performed by the interpreter itself and that the user has no control over it, even if they regularly manipulate object pointers to memory blocks inside that heap.

Manually running the garbage collector can help, but that’s not deterministic: launch the same process multiple times and you may not get the same result. You can even delete the HTML object before running the garbage collector to be sure that there’s no Python object left related to WeasyPrint.

So, if you really want to free the memory, the easiest way is probably to use dedicated processes. Here’s a StackOverflow thread about this topic.
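
A minimal sketch of the dedicated-process approach, assuming WeasyPrint is importable by the same interpreter (`sys.executable`). The names `RENDER_SNIPPET`, `run_in_child` and `render_pdf` are hypothetical and not part of WeasyPrint. The child imports WeasyPrint, renders, and exits, so whatever memory it used goes back to the operating system when it terminates.

```python
import subprocess
import sys

# Code executed in the short-lived child interpreter. WeasyPrint is imported
# there, so the parent process never loads it. With "python -c", the extra
# command-line arguments arrive as sys.argv[1], sys.argv[2], ...
RENDER_SNIPPET = (
    "import sys\n"
    "from weasyprint import HTML\n"
    "HTML(sys.argv[1]).write_pdf(sys.argv[2])\n"
)


def run_in_child(snippet, *args, timeout=None):
    """Run `snippet` in a fresh Python process and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", snippet, *args],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(f"child process failed: {result.stderr.strip()}")
    return result.stdout


def render_pdf(html_path, pdf_path):
    """Render one document in a dedicated process (illustrative helper)."""
    run_in_child(RENDER_SNIPPET, html_path, pdf_path)
```

Calling `render_pdf('./test.html', './test.pdf')` from a long-running service would keep the large allocation out of the service's own process. The cost is a fresh interpreter start and WeasyPrint import for every document.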

@ssjkamei

@Chris2L
I'm sorry, I'm not good at English. I was wrong.

"check your Python implementation."
I wanted to write "Need to check Python itself."

@Chris2L
Author

Chris2L commented Aug 31, 2023

No problem. Your English might not be good ;-) but your python rocks! Loving the updates you are pushing for your pull request.

@liZe
Member

liZe commented Aug 31, 2023

As the memory part of this issue is already discussed in #1496 and the table part in #1923, could we close this issue and continue the discussion in #1923?
