
High memory usage for string operation previously (<= 0.20.5) using low memory #15847

Closed
karlwiese opened this issue Apr 23, 2024 · 1 comment · Fixed by #15888
Labels: A-dtype-string (Area: string data type), accepted (Ready for implementation), bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments


karlwiese commented Apr 23, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import time

from memory_profiler import profile
import numpy as np
import polars as pl


@profile
def run():
    print(pl.__version__)
    limit = 500_000
    print(limit)

    def gen_long_string(nchar, nrows):
        # Build `nrows` random fixed-width strings of `nchar` characters by
        # viewing uint32 code points in [96, 122) as a NumPy unicode array.
        rng = np.random.default_rng(seed=1)
        return rng.integers(low=96, high=122, size=nrows * nchar, dtype="uint32").view(f"U{nchar}")

    # CASE 1
    # lf = pl.LazyFrame({"id": range(limit), "string": gen_long_string(15, limit * 1)})
    # CASE 2
    lf = pl.LazyFrame({"id": np.repeat(range(limit), 1), "string": gen_long_string(15, limit * 1)})

    # lf = lf.cast({"string": pl.Categorical})  # casting to categorical solves it for CASE 2
    lf.group_by("id").agg(pl.struct(pl.exclude("id"))).collect()


if __name__ == "__main__":
    run()

Log output

CASE 1

0.20.5

0.20.5
500000
keys/aggregates are not partitionable: running default HASH AGGREGATION
group_by keys are sorted; running sorted key fast path

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    71    124.9 MiB    124.9 MiB           1   @profile
    72                                         def run():
    73    124.9 MiB      0.0 MiB           1       print(pl.__version__)
    74    124.9 MiB      0.0 MiB           1       limit = 500_000
    75    124.9 MiB      0.0 MiB           1       print(limit)
    76                                         
    77    124.9 MiB      0.0 MiB           2       def gen_long_string(nchar, nrows):
    78    124.9 MiB      0.0 MiB           1           rng = np.random.default_rng()
    79    153.6 MiB     28.7 MiB           1           return rng.integers(low=96, high=122, size=nrows * nchar, dtype="uint32").view(f"U{nchar}")
    80                                         
    81    183.3 MiB     29.6 MiB           1       lf = pl.LazyFrame({"id": range(limit), "string": gen_long_string(15, limit * 1)})
    82                                             # lf = pl.LazyFrame({"id": np.repeat(range(limit), 1), "string": gen_long_string(15, limit * 1)})
    83                                         
    84    677.8 MiB    494.6 MiB           1       lf.group_by("id").agg(pl.struct(pl.exclude("id"))).collect()

0.20.22

0.20.22
500000
keys/aggregates are not partitionable: running default HASH AGGREGATION
group_by keys are sorted; running sorted key fast path

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    71    117.6 MiB    117.6 MiB           1   @profile
    72                                         def run():
    73    117.6 MiB      0.0 MiB           1       print(pl.__version__)
    74    117.6 MiB      0.0 MiB           1       limit = 500_000
    75    117.6 MiB      0.0 MiB           1       print(limit)
    76                                         
    77    117.6 MiB      0.0 MiB           2       def gen_long_string(nchar, nrows):
    78    117.6 MiB      0.0 MiB           1           rng = np.random.default_rng()
    79    146.3 MiB     28.7 MiB           1           return rng.integers(low=96, high=122, size=nrows * nchar, dtype="uint32").view(f"U{nchar}")
    80                                         
    81    180.1 MiB     33.8 MiB           1       lf = pl.LazyFrame({"id": range(limit), "string": gen_long_string(15, limit * 1)})
    82                                             # lf = pl.LazyFrame({"id": np.repeat(range(limit), 1), "string": gen_long_string(15, limit * 1)})
    83                                         
    84    668.2 MiB    488.1 MiB           1       lf.group_by("id").agg(pl.struct(pl.exclude("id"))).collect()

CASE 2

0.20.5

0.20.5
500000
keys/aggregates are not partitionable: running default HASH AGGREGATION

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    71    111.8 MiB    111.8 MiB           1   @profile
    72                                         def run():
    73    111.8 MiB      0.0 MiB           1       print(pl.__version__)
    74    111.8 MiB      0.0 MiB           1       limit = 500_000
    75    111.8 MiB      0.0 MiB           1       print(limit)
    76                                         
    77    120.8 MiB      9.0 MiB           2       def gen_long_string(nchar, nrows):
    78    120.8 MiB      0.0 MiB           1           rng = np.random.default_rng()
    79    149.5 MiB     28.7 MiB           1           return rng.integers(low=96, high=122, size=nrows * nchar, dtype="uint32").view(f"U{nchar}")
    80                                         
    81                                             # lf = pl.LazyFrame({"id": range(limit), "string": gen_long_string(15, limit * 1)})
    82    170.1 MiB     20.6 MiB           1       lf = pl.LazyFrame({"id": np.repeat(range(limit), 1), "string": gen_long_string(15, limit * 1)})
    83                                         
    84    299.4 MiB    129.3 MiB           1       lf.group_by("id").agg(pl.struct(pl.exclude("id"))).collect()

0.20.22

0.20.22
500000
keys/aggregates are not partitionable: running default HASH AGGREGATION

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    71    119.8 MiB    119.8 MiB           1   @profile
    72                                         def run():
    73    119.8 MiB      0.0 MiB           1       print(pl.__version__)
    74    119.8 MiB      0.0 MiB           1       limit = 500_000
    75    119.8 MiB      0.0 MiB           1       print(limit)
    76                                         
    77    128.8 MiB      9.0 MiB           2       def gen_long_string(nchar, nrows):
    78    128.8 MiB      0.0 MiB           1           rng = np.random.default_rng()
    79    157.5 MiB     28.7 MiB           1           return rng.integers(low=96, high=122, size=nrows * nchar, dtype="uint32").view(f"U{nchar}")
    80                                         
    81                                             # lf = pl.LazyFrame({"id": range(limit), "string": gen_long_string(15, limit * 1)})
    82    186.1 MiB     28.6 MiB           1       lf = pl.LazyFrame({"id": np.repeat(range(limit), 1), "string": gen_long_string(15, limit * 1)})
    83                                         
    84   4858.1 MiB   4672.0 MiB           1       lf.group_by("id").agg(pl.struct(pl.exclude("id"))).collect()

Issue description

The memory usage for the operation in the reproducible example increases sharply between version 0.20.5 and version 0.20.22 (I also tried 0.20.6 and 0.20.21, and both show the same increased memory). The memory usage appears to be linear in the number of elements in the table, so it does not look like an unbounded leak but rather like the operation uses far more space than necessary. Compare CASE 2 for both versions. The problem does not occur when the string column is cast to Categorical, so it appears to be related to the String dtype.
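For reference, a minimal standalone sketch of the Categorical workaround mentioned above (it mirrors the commented-out cast in the repro; it avoids the blow-up in CASE 2 for me, but it is only a workaround, not a fix):

import numpy as np
import polars as pl

limit = 500_000
rng = np.random.default_rng(seed=1)
strings = rng.integers(low=96, high=122, size=limit * 15, dtype="uint32").view("U15")

lf = pl.LazyFrame({"id": np.repeat(range(limit), 1), "string": strings})

# Casting the string column to Categorical before the aggregation keeps the
# memory of the collect() step in the same range as on 0.20.5 (CASE 2).
lf = lf.cast({"string": pl.Categorical})
out = lf.group_by("id").agg(pl.struct(pl.exclude("id"))).collect()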

I also observed other interesting behavior that may help resolve the issue. The only difference between CASE 1 and CASE 2 is how the LazyFrame is constructed; the final result is identical. CASE 1 builds the id column with range(limit), while CASE 2 uses np.repeat(range(limit), 1). With CASE 1, both versions (0.20.5 and 0.20.22) use about the same memory. With CASE 2, however, 0.20.5 uses noticeably less memory than in CASE 1, while 0.20.22 shows the large increase discussed above.

Increment of the collect() line (MiB):

           CASE 1    CASE 2
0.20.5      494.6     129.3
0.20.22     488.1    4672.0
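For clarity, the only lines that differ between the two cases are the LazyFrame constructions, excerpted here from the repro above:

# CASE 1: id column from a plain Python range
lf = pl.LazyFrame({"id": range(limit), "string": gen_long_string(15, limit * 1)})

# CASE 2: id column from a NumPy integer array holding the same values
lf = pl.LazyFrame({"id": np.repeat(range(limit), 1), "string": gen_long_string(15, limit * 1)})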

I checked other issues; some mention memory problems, but it is hard to judge whether they are related. I observed this as a follow-up to #15615.

Expected behavior

Be as memory efficient as in version 0.20.5.

Installed versions

----Optional dependencies----
adbc_driver_manager:  0.10.0
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.2
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              16.0.0
pydantic:             2.5.2
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.29
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@karlwiese karlwiese added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Apr 23, 2024
@stinodego stinodego added the A-dtype-string Area: string data type label Apr 24, 2024
@ritchie46
Member

Fixed by #15888:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     8     90.6 MiB     90.6 MiB           1   @profile
     9                                         def run():
    10     90.6 MiB      0.0 MiB           1       print(pl.__version__)
    11     90.6 MiB      0.0 MiB           1       limit = 500_000
    12     90.6 MiB      0.0 MiB           1       print(limit)
    13                                         
    14     95.0 MiB      4.4 MiB           2       def gen_long_string(nchar, nrows):
    15     95.1 MiB      0.1 MiB           1           rng = np.random.default_rng(seed=1)
    16    123.8 MiB     28.7 MiB           1           return rng.integers(low=96, high=122, size=nrows * nchar, dtype="uint32").view(f"U{nchar}")
    17                                         
    18                                             # CASE 1
    19                                             # lf = pl.LazyFrame({"id": range(limit), "string": gen_long_string(15, limit * 1)})
    20                                             # CASE 2
    21    122.6 MiB     -1.2 MiB           1       lf = pl.LazyFrame({"id": np.repeat(range(limit), 1), "string": gen_long_string(15, limit * 1)})
    22                                         
    23                                             # lf = lf.cast({"string": pl.Categorical})  # casting to categorical solves it for CASE 2
    24    259.7 MiB    137.1 MiB           1       lf.group_by("id").agg(pl.struct(pl.exclude("id"))).collect()

@c-peters c-peters added the accepted Ready for implementation label Apr 29, 2024
@c-peters c-peters added this to Backlog Apr 29, 2024
@c-peters c-peters moved this to Done in Backlog Apr 29, 2024