High memory usage for string operation previously (<= 0.20.5) using low memory #15847
Closed
2 tasks done
Labels
A-dtype-string
Area: string data type
accepted
Ready for implementation
bug
Something isn't working
needs triage
Awaiting prioritization by a maintainer
python
Related to Python Polars
Checks
Reproducible example
Log output
CASE 1
0.20.5
0.20.22
CASE 2
0.20.5
0.20.22
Issue description
The memory usage for the operation in the reproducible example increases from version 0.20.5 to version 0.20.22 (I also tried 0.20.6 and 0.20.21, and both show the same increased memory). The memory usage seems to grow linearly with the number of elements in the table, so it does not look like an unbounded leak; rather, the operation uses more space than necessary. Compare CASE 2 for both versions. The problem does not occur when I cast the string column to categorical, so it seems to be related to the string dtype.
However, I also observed other interesting behavior, which I add here in case it helps resolve the issue. Looking at CASE 1 now: the difference is how the LazyFrame is constructed. The final result of CASE 1 and CASE 2 is the same, but CASE 1 uses range(limit) and CASE 2 uses np.repeat(range(limit), 1). With CASE 1, both versions (0.20.5 and 0.20.22) use the same memory. With CASE 2, version 0.20.5 shows a reduction in memory usage, while version 0.20.22 shows the increase discussed above.
I checked other issues; some mention memory problems, but it is hard to judge whether they are related. I observed this as a follow-up on #15615.
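For anyone who wants to compare the cases across versions, here is a minimal sketch of one way to measure peak memory per run. It is not the script behind the numbers in this report. The helper name peak_rss and the PEAK_RSS marker are made up for illustration, and it assumes a Unix-like system, since it relies on the standard-library resource module. Each snippet runs in a fresh interpreter so that one case cannot inflate the peak of the next.

```python
import subprocess
import sys

# Appended to every snippet so the child process reports its own peak
# resident set size on the last line of stdout. PEAK_RSS is an arbitrary
# marker chosen for this sketch.
_REPORT = (
    "\nimport resource as _r\n"
    "print('PEAK_RSS', _r.getrusage(_r.RUSAGE_SELF).ru_maxrss)\n"
)


def peak_rss(snippet: str) -> int:
    """Run `snippet` in a fresh interpreter and return its peak RSS.

    Units follow the platform: kilobytes on Linux, bytes on macOS.
    """
    result = subprocess.run(
        [sys.executable, "-c", snippet + _REPORT],
        capture_output=True,
        text=True,
        check=True,  # raise if the snippet itself fails
    )
    # Scan from the end so output printed by the snippet is ignored.
    for line in reversed(result.stdout.splitlines()):
        if line.startswith("PEAK_RSS "):
            return int(line.split()[1])
    raise RuntimeError("child process did not report its peak RSS")
```

To use it, pass the source of CASE 1 or CASE 2 as the snippet and repeat the run in an environment with each Polars version installed. Because the value is the peak for the whole process, it also includes interpreter and import overhead, so only differences between runs are meaningful.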
Expected behavior
Be as memory efficient as in version 0.20.5.
Installed versions