[BUG] Memory usage increase for big files #376

Closed
josteinl opened this issue Oct 31, 2023 · 1 comment · Fixed by #377
Labels
bug Something isn't working

Comments

josteinl commented Oct 31, 2023

Describe the bug
After upgrading from version 3.2.0 to either 3.3.0 or 3.3.1, I notice a huge increase in memory usage. Running from_bytes() on a 25 MB file now uses almost 3 GB of memory.

To Reproduce
Run this file, placed inside the charset_normalizer folder, with the scalene memory profiler (Linux/WSL):

memory_profile_test.py:

"""
Run from the project root:

    poetry run python3 -m scalene charset_normalizer/memory_profile_test.py

or (with an activated virtual environment)

    pip install scalene
    scalene charset_normalizer/memory_profile_test.py
"""

from charset_normalizer.api import from_bytes

file_name = "data/memory_profile_test.txt"

with open(file_name, "rb") as file:
    data = file.read()
    result = from_bytes(data)
    best = result.best()
    print(f"{best=}")

Data file used (25 MB), placed in the data folder:
memory_profile_test.txt

Profiler result (download and view in browser):
profile_charset_normalizer_3.3.1.html
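If scalene is not available, a rough peak figure can also be obtained with the standard-library tracemalloc module. The following is a minimal sketch, not part of the original report: the helper names measure_peak and detect_with_peak are made up for illustration, and tracemalloc only counts allocations made through Python's allocators, so the number may differ from what scalene or the container runtime reports.

```python
import tracemalloc


def measure_peak(func, *args, **kwargs):
    """Call func once and return (result, peak_bytes) as seen by tracemalloc.

    Hypothetical helper for illustration; it is not part of charset_normalizer.
    """
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def detect_with_peak(file_name="data/memory_profile_test.txt"):
    """Run from_bytes() on a file and print the peak traced memory."""
    # Imported lazily so the helper above works without the package installed.
    from charset_normalizer import from_bytes

    with open(file_name, "rb") as file:
        data = file.read()

    result, peak = measure_peak(from_bytes, data)
    print(f"input size: {len(data) / 1e6:.1f} MB")
    print(f"peak traced memory: {peak / 1e6:.1f} MB")
    print(f"best={result.best()}")
```

Calling detect_with_peak() from the project root on 3.2.0 and then on 3.3.x should make the difference visible as a ratio of peak memory to input size.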

Expected behaviour
I expected the function to use only slightly more memory than the size of the file passed to from_bytes().

Testing Environment

  • OS: Ubuntu on WSL
  • Python version: 3.11.6
  • Package version: 3.3.0 and 3.3.1

Additional context
We use charset-normalizer in a program that runs in containers with strict memory limits. We noticed the change in behaviour after our pods were Out Of Memory (OOM) killed.

After some debugging, it seems that the increase in memory consumption comes from storing the decoded_payload in each CharsetMatch().
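To illustrate why keeping a decoded copy per candidate could add up, here is a toy sketch. It is not the library's actual code: the candidate list and the dictionaries are assumptions made for the example. It decodes the same bytes with several encodings and compares the memory held when every decoded string is retained against keeping only the encoding names.

```python
import sys

# Small stand-in for the 25 MB file: UTF-8 text with a few non-ASCII characters.
payload = ("héllo wörld €\n" * 1000).encode("utf-8")

# Hypothetical candidate encodings; a real detector tries many more.
CANDIDATES = ["utf-8", "cp1252", "latin-1"]

# Eager approach: keep one fully decoded str per candidate that decodes cleanly.
eager = {}
for encoding in CANDIDATES:
    try:
        eager[encoding] = payload.decode(encoding)
    except UnicodeDecodeError:
        continue

eager_bytes = sum(sys.getsizeof(text) for text in eager.values())

# Lazy approach: remember only the encoding name and decode on demand.
lazy = list(eager)
lazy_bytes = sum(sys.getsizeof(name) for name in lazy)

print(f"payload size:       {len(payload)} bytes")
print(f"eager decoded size: {eager_bytes} bytes for {len(eager)} candidates")
print(f"lazy size:          {lazy_bytes} bytes")
```

A Python str can take 1, 2, or 4 bytes per character depending on the widest code point it contains, so each retained copy may be larger than the input bytes, and the total grows with the number of candidates kept.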

Finally
A big thank you to the authors and maintainers! This library is much needed, used and appreciated!

Ousret (Member) commented Oct 31, 2023

You are welcome.
The report you gave us helped us understand and fix the issue quickly.
We will be publishing a patch release soon.

Ousret added a commit that referenced this issue Oct 31, 2023
Ousret removed the "help wanted" (Extra attention is needed) label Oct 31, 2023
2 participants