
High memory usage for string operation previously (<= 0.20.5) using low memory #15847

Closed
karlwiese opened this issue Apr 23, 2024 · 1 comment · Fixed by #15888
Labels: A-dtype-string (Area: string data type), accepted (Ready for implementation), bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments


karlwiese commented Apr 23, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import time

from memory_profiler import profile
import numpy as np
import polars as pl


@profile
def run():
    print(pl.__version__)
    limit = 500_000
    print(limit)

    def gen_long_string(nchar, nrows):
        # Build `nrows` random fixed-width strings of `nchar` characters by
        # viewing uint32 code points in [96, 122) as a NumPy unicode array.
        rng = np.random.default_rng(seed=1)
        return rng.integers(low=96, high=122, size=nrows * nchar, dtype="uint32").view(f"U{nchar}")

    # CASE 1
    # lf = pl.LazyFrame({"id": range(limit), "string": gen_long_string(15, limit * 1)})
    # CASE 2
    lf = pl.LazyFrame({"id": np.repeat(range(limit), 1), "string": gen_long_string(15, limit * 1)})

    # lf = lf.cast({"string": pl.Categorical})  # casting to categorical solves it for CASE 2
    lf.group_by("id").agg(pl.struct(pl.exclude("id"))).collect()


if __name__ == "__main__":
    run()

Log output

CASE 1

0.20.5

0.20.5
500000
keys/aggregates are not partitionable: running default HASH AGGREGATION
group_by keys are sorted; running sorted key fast path

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    71    124.9 MiB    124.9 MiB           1   @profile
    72                                         def run():
    73    124.9 MiB      0.0 MiB           1       print(pl.__version__)
    74    124.9 MiB      0.0 MiB           1       limit = 500_000
    75    124.9 MiB      0.0 MiB           1       print(limit)
    76                                         
    77    124.9 MiB      0.0 MiB           2       def gen_long_string(nchar, nrows):
    78    124.9 MiB      0.0 MiB           1           rng = np.random.default_rng()
    79    153.6 MiB     28.7 MiB           1           return rng.integers(low=96, high=122, size=nrows * nchar, dtype="uint32").view(f"U{nchar}")
    80                                         
    81    183.3 MiB     29.6 MiB           1       lf = pl.LazyFrame({"id": range(limit), "string": gen_long_string(15, limit * 1)})
    82                                             # lf = pl.LazyFrame({"id": np.repeat(range(limit), 1), "string": gen_long_string(15, limit * 1)})
    83                                         
    84    677.8 MiB    494.6 MiB           1       lf.group_by("id").agg(pl.struct(pl.exclude("id"))).collect()

0.20.22

0.20.22
500000
keys/aggregates are not partitionable: running default HASH AGGREGATION
group_by keys are sorted; running sorted key fast path

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    71    117.6 MiB    117.6 MiB           1   @profile
    72                                         def run():
    73    117.6 MiB      0.0 MiB           1       print(pl.__version__)
    74    117.6 MiB      0.0 MiB           1       limit = 500_000
    75    117.6 MiB      0.0 MiB           1       print(limit)
    76                                         
    77    117.6 MiB      0.0 MiB           2       def gen_long_string(nchar, nrows):
    78    117.6 MiB      0.0 MiB           1           rng = np.random.default_rng()
    79    146.3 MiB     28.7 MiB           1           return rng.integers(low=96, high=122, size=nrows * nchar, dtype="uint32").view(f"U{nchar}")
    80                                         
    81    180.1 MiB     33.8 MiB           1       lf = pl.LazyFrame({"id": range(limit), "string": gen_long_string(15, limit * 1)})
    82                                             # lf = pl.LazyFrame({"id": np.repeat(range(limit), 1), "string": gen_long_string(15, limit * 1)})
    83                                         
    84    668.2 MiB    488.1 MiB           1       lf.group_by("id").agg(pl.struct(pl.exclude("id"))).collect()

CASE 2

0.20.5

0.20.5
500000
keys/aggregates are not partitionable: running default HASH AGGREGATION

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    71    111.8 MiB    111.8 MiB           1   @profile
    72                                         def run():
    73    111.8 MiB      0.0 MiB           1       print(pl.__version__)
    74    111.8 MiB      0.0 MiB           1       limit = 500_000
    75    111.8 MiB      0.0 MiB           1       print(limit)
    76                                         
    77    120.8 MiB      9.0 MiB           2       def gen_long_string(nchar, nrows):
    78    120.8 MiB      0.0 MiB           1           rng = np.random.default_rng()
    79    149.5 MiB     28.7 MiB           1           return rng.integers(low=96, high=122, size=nrows * nchar, dtype="uint32").view(f"U{nchar}")
    80                                         
    81                                             # lf = pl.LazyFrame({"id": range(limit), "string": gen_long_string(15, limit * 1)})
    82    170.1 MiB     20.6 MiB           1       lf = pl.LazyFrame({"id": np.repeat(range(limit), 1), "string": gen_long_string(15, limit * 1)})
    83                                         
    84    299.4 MiB    129.3 MiB           1       lf.group_by("id").agg(pl.struct(pl.exclude("id"))).collect()

0.20.22

0.20.22
500000
keys/aggregates are not partitionable: running default HASH AGGREGATION

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    71    119.8 MiB    119.8 MiB           1   @profile
    72                                         def run():
    73    119.8 MiB      0.0 MiB           1       print(pl.__version__)
    74    119.8 MiB      0.0 MiB           1       limit = 500_000
    75    119.8 MiB      0.0 MiB           1       print(limit)
    76                                         
    77    128.8 MiB      9.0 MiB           2       def gen_long_string(nchar, nrows):
    78    128.8 MiB      0.0 MiB           1           rng = np.random.default_rng()
    79    157.5 MiB     28.7 MiB           1           return rng.integers(low=96, high=122, size=nrows * nchar, dtype="uint32").view(f"U{nchar}")
    80                                         
    81                                             # lf = pl.LazyFrame({"id": range(limit), "string": gen_long_string(15, limit * 1)})
    82    186.1 MiB     28.6 MiB           1       lf = pl.LazyFrame({"id": np.repeat(range(limit), 1), "string": gen_long_string(15, limit * 1)})
    83                                         
    84   4858.1 MiB   4672.0 MiB           1       lf.group_by("id").agg(pl.struct(pl.exclude("id"))).collect()

Issue description

The memory usage for the operation in the reproducible example increases sharply between version 0.20.5 and version 0.20.22 (I also tried 0.20.6 and 0.20.21, and both show the same increased memory). The memory usage appears to be linear in the number of elements in the table, so it does not look like an unbounded leak but rather like the operation uses far more space than necessary. Compare CASE 2 for both versions. The problem does not occur when the string column is cast to Categorical, so it appears to be related to the String dtype.
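For reference, a minimal standalone sketch of the Categorical workaround mentioned above (it mirrors the commented-out cast in the repro; it avoids the blow-up in CASE 2 for me, but it is only a workaround, not a fix):

import numpy as np
import polars as pl

limit = 500_000
rng = np.random.default_rng(seed=1)
strings = rng.integers(low=96, high=122, size=limit * 15, dtype="uint32").view("U15")

lf = pl.LazyFrame({"id": np.repeat(range(limit), 1), "string": strings})

# Casting the string column to Categorical before the aggregation keeps the
# memory of the collect() step in the same range as on 0.20.5 (CASE 2).
lf = lf.cast({"string": pl.Categorical})
out = lf.group_by("id").agg(pl.struct(pl.exclude("id"))).collect()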

I also observed other interesting behavior that may help resolve the issue. The only difference between CASE 1 and CASE 2 is how the LazyFrame is constructed; the final result is identical. CASE 1 builds the id column with range(limit), while CASE 2 uses np.repeat(range(limit), 1). With CASE 1, both versions (0.20.5 and 0.20.22) use about the same memory. With CASE 2, however, 0.20.5 uses noticeably less memory than in CASE 1, while 0.20.22 shows the large increase discussed above.

Increment of the collect() line (MiB):

           CASE 1    CASE 2
0.20.5      494.6     129.3
0.20.22     488.1    4672.0
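For clarity, the only lines that differ between the two cases are the LazyFrame constructions, excerpted here from the repro above:

# CASE 1: id column from a plain Python range
lf = pl.LazyFrame({"id": range(limit), "string": gen_long_string(15, limit * 1)})

# CASE 2: id column from a NumPy integer array holding the same values
lf = pl.LazyFrame({"id": np.repeat(range(limit), 1), "string": gen_long_string(15, limit * 1)})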

I checked other issues; some mention memory problems, but it is hard to judge whether they are related. I observed this as a follow-up to #15615.

Expected behavior

Be as memory efficient as in version 0.20.5.

Installed versions

----Optional dependencies----
adbc_driver_manager:  0.10.0
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.2
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              16.0.0
pydantic:             2.5.2
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.29
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@karlwiese karlwiese added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Apr 23, 2024
@stinodego stinodego added the A-dtype-string Area: string data type label Apr 24, 2024
@ritchie46
Member

Fixed by #15888:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     8     90.6 MiB     90.6 MiB           1   @profile
     9                                         def run():
    10     90.6 MiB      0.0 MiB           1       print(pl.__version__)
    11     90.6 MiB      0.0 MiB           1       limit = 500_000
    12     90.6 MiB      0.0 MiB           1       print(limit)
    13                                         
    14     95.0 MiB      4.4 MiB           2       def gen_long_string(nchar, nrows):
    15     95.1 MiB      0.1 MiB           1           rng = np.random.default_rng(seed=1)
    16    123.8 MiB     28.7 MiB           1           return rng.integers(low=96, high=122, size=nrows * nchar, dtype="uint32").view(f"U{nchar}")
    17                                         
    18                                             # CASE 1
    19                                             # lf = pl.LazyFrame({"id": range(limit), "string": gen_long_string(15, limit * 1)})
    20                                             # CASE 2
    21    122.6 MiB     -1.2 MiB           1       lf = pl.LazyFrame({"id": np.repeat(range(limit), 1), "string": gen_long_string(15, limit * 1)})
    22                                         
    23                                             # lf = lf.cast({"string": pl.Categorical})  # casting to categorical solves it for CASE 2
    24    259.7 MiB    137.1 MiB           1       lf.group_by("id").agg(pl.struct(pl.exclude("id"))).collect()

@c-peters c-peters added the accepted Ready for implementation label Apr 29, 2024
@c-peters c-peters added this to Backlog Apr 29, 2024
@c-peters c-peters moved this to Done in Backlog Apr 29, 2024