Grouping to list of struct is slower in 0.20.6 than in 0.20.5 and leads to out-of-memory eventually (#15615)
Comments
Is it only slower, or does it never release memory? We did a huge rework of our internal string representation, so some stuff might be slower. More stuff sped up, though.
Trying On
Alright, that seems like something quadratic. Will look into it.
First of all, thank you for your amazing speed! The momentum here is awesome!
My "production" code always goes OOM with version 0.20.6. In version 0.20.19 it just runs forever and stays at a certain memory level.
Completely unrelated to your issue, but here's a little function I picked up from someone somewhere that generates random letters way faster than a comprehension over string.ascii_letters. If you want str_len to be random, you'd have to tweak it, though.
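A minimal sketch of that trick (the function name and seed parameter are mine, not from the thread): each `uint32` is one UCS-4 code point, so viewing `n_rows * str_len` random code points as NumPy dtype `U{str_len}` reinterprets them as `n_rows` fixed-length strings with no Python-level loop. This variant draws from 97–123 so it produces exactly `a`–`z`.

```python
import numpy as np


def gen_random_strings(str_len, n_rows, seed=None):
    # One uint32 per UCS-4 code point; viewing n_rows * str_len code points
    # as dtype U{str_len} yields n_rows strings, each of length str_len.
    rng = np.random.default_rng(seed)
    codes = rng.integers(low=97, high=123, size=n_rows * str_len, dtype="uint32")
    return codes.view(f"U{str_len}")


strings = gen_random_strings(15, 4, seed=42)
print(strings.shape)  # (4,)
```

Note the repro code later in this thread uses `low=96, high=122`, which includes a backtick character and excludes `z`; that does not matter for the benchmark itself.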
Relevant backtrace:
Hey! I can still create quadratic behavior. Should I open a new issue, or is it fine to re-use this one?

```python
import time

import numpy as np
import polars as pl


def gen_long_string(str_len, n_rows):
    rng = np.random.default_rng()
    return rng.integers(low=96, high=122, size=n_rows * str_len, dtype="uint32").view(
        f"U{str_len}"
    )


print(pl.__version__)
limit = 30_000

lf1 = pl.LazyFrame({"index1": np.repeat(range(int(limit / 2)), 2), "index2": range(limit)})
lf2 = pl.LazyFrame(
    {
        "index2": np.repeat(range(limit), 2),
        "string": gen_long_string(15, limit * 2),
    }
)
lf2 = lf2.group_by("index2").agg(pl.struct(pl.exclude(["index1", "index2"])))
lf1 = lf1.join(lf2, on="index2")  # this line is now fast and was previously slow (<0.20.20)

t0 = time.time()
(
    lf1
    .group_by("index1")
    .agg(pl.struct(pl.exclude(["index1"])))
    .collect()
)
t1 = time.time()
print(t1 - t0)
```

It prints around 5 seconds. I observed different behavior in my "production" code: while it runs, the memory slowly increases instead of staying at a certain level.
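As a side note, one way to confirm this kind of superlinear scaling without guessing is to time the pipeline at `n` and at `2n` and look at the ratio: roughly 2 suggests linear cost, roughly 4 quadratic. A generic sketch (function names are mine, and the toy `quadratic` workload stands in for the polars query above):

```python
import time


def best_time(fn, n, repeat=3):
    # Best-of-repeat wall-clock time of fn(n), to dampen scheduler noise.
    times = []
    for _ in range(repeat):
        t0 = time.perf_counter()
        fn(n)
        times.append(time.perf_counter() - t0)
    return min(times)


def doubling_ratio(fn, n):
    # ~2 suggests linear cost, ~4 quadratic, ~8 cubic.
    return best_time(fn, 2 * n) / best_time(fn, n)


def quadratic(n):
    # Toy O(n^2) workload standing in for the group_by/agg pipeline.
    total = 0
    for i in range(n):
        for j in range(n):
            total += i ^ j
    return total


print(doubling_ratio(quadratic, 700))  # noticeably closer to 4 than to 2
```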
I can reproduce this on main. As this is a p-high issue, tagging @ritchie46 seems appropriate.
Will take a look tomorrow.
Hey! Thank you for the extremely fast handling of the issue so far, and sorry to bother you again. I see OOM issues now, after the fix of the quadratic growable. I profiled the exact same code as above with memory_profiler. I searched for out-of-memory issues, but reading through them they seem unrelated. Should I create another issue, or is it fine to re-use this one?

Code and output:

```python
import time

import numpy as np
import polars as pl
from memory_profiler import profile


@profile
def run():
    print(pl.__version__)
    limit = 500_000
    print(limit)

    def gen_long_string(str_len, n_rows):
        rng = np.random.default_rng()
        return rng.integers(low=96, high=122, size=n_rows * str_len, dtype="uint32").view(
            f"U{str_len}"
        )

    lf1 = pl.LazyFrame({"index1": np.repeat(range(int(limit / 2)), 2), "index2": range(limit)})
    lf2 = pl.LazyFrame(
        {
            "index2": np.repeat(range(limit), 2),
            "string": gen_long_string(15, limit * 2),
        }
    )
    lf2 = lf2.group_by("index2").agg(pl.struct(pl.exclude(["index1", "index2"])))
    lf1 = lf1.join(lf2, on="index2")  # this line is now fast and was previously slow (<0.20.20)

    t0 = time.time()
    (
        lf1
        .group_by("index1")
        .agg(pl.struct(pl.exclude(["index1"])))
        .collect()  # this line is now fast and was previously slow (<0.20.22) but uses a lot of memory
    )
    t1 = time.time()
    print(t1 - t0)


if __name__ == "__main__":
    run()
```

Output for 0.20.5

Output for 0.20.22
I would guess that creating a new issue may be the better option here? From what I have experienced, it is less likely to get a response in a closed issue unless someone is tagged specifically. I did file #15834 yesterday, which also has a list of structs inside (although I couldn't trigger it without
@cmdlineluser Thank you for your feedback. Looking at your issue, I doubt it is related: your example faults with minimal data, whereas my issue is related to the size of the data. I'll create a new issue.
Checks
Reproducible example
Log output
Issue description
As mentioned in #14201, I originally noticed the issue due to OOM errors. In contrast to #14201, it is not intermittent but always occurs. It seems to me that the combination of putting strings in a `struct` and then grouping the `struct`s to a list causes the issue. I can confirm that casting the strings to categorical helps.

Further, I found that saving the frame after the struct step to a file, reading it again, and then doing the grouping also solves the problem.

I could come up with a reproducible example; please see the code above. Be careful when testing and increasing the `limit`, it's definitely not linear. From the log output you can see that the latest version is faster than 0.20.6 but still much slower than 0.20.5. If you increase the `limit`, you easily get high execution times with the latest version.

Expected behavior
Be as fast and efficient as in 0.20.5
Installed versions