Slow performance with set_sorted on string column #5163
Comments
Your column
In other words, sorting on column
import numpy as np
import polars as pl
from time import perf_counter
iters = 100  # number of repetitions per timing
N = 10_000_000
# Create df with an integer and corresponding string column
df = pl.DataFrame({"i":np.random.randint(0,100,N)}).with_column(pl.col("i").cast(pl.Utf8).alias("s"))
print("Timings with no sorting:")
start = perf_counter()
for _ in range(iters):
    df.select(pl.col("i").max())
print(f"col('i').max(): {perf_counter()-start:0.1f}s")
start = perf_counter()
for _ in range(iters):
    df.select(pl.col("s").max())
print(f"col('s').max(): {perf_counter()-start:0.1f}s")
# sorted
print("\nTimings with sorting:")
df = df.with_columns([
    pl.col("i").sort(),
    pl.col("s").sort(),
])
start = perf_counter()
for _ in range(iters):
    df.select(pl.col("i").max())
print(f"col('i').max(): {perf_counter()-start:0.1f}s")
start = perf_counter()
for _ in range(iters):
    df.select(pl.col("s").max())
print(f"col('s').max(): {perf_counter()-start:0.1f}s")
print("\nTimings with sorting and set_sorted:")
df = df.with_column(pl.col("s").sort().set_sorted())
start = perf_counter()
for _ in range(iters):
    df.select(pl.col("i").max())
print(f"col('i').max(): {perf_counter()-start:0.1f}s")
start = perf_counter()
for _ in range(iters):
    df.select(pl.col("s").max())
print(f"col('s').max(): {perf_counter()-start:0.1f}s")
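The six timing loops in the script above all follow the same pattern, so they could be folded into one small helper. This is only a sketch: `time_repeated` is a hypothetical name that does not appear in the original script, and the workload in the usage line is a stand-in rather than the Polars query.

```python
from time import perf_counter
from typing import Callable


def time_repeated(fn: Callable[[], object], iters: int = 100) -> float:
    """Call fn() iters times and return the total elapsed seconds."""
    start = perf_counter()
    for _ in range(iters):
        fn()
    return perf_counter() - start


# Stand-in workload; in the script above this would be something like
# lambda: df.select(pl.col("s").max())
elapsed = time_repeated(lambda: max(range(1_000)), iters=10)
print(f"stand-in workload: {elapsed:0.4f}s")
```

Timing every case through the same function keeps the loop overhead identical across the unsorted, sorted, and set_sorted runs.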
So,
You're correct. I had tested this on a different dataset where the values are alphabetical strings but chose ints for an easier example. I'm not sure that adding
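For readers following the exchange above, one likely source of confusion with integer-derived strings is that strings sort lexicographically, so the string column's order and maximum need not match the integer column's. The snippet below is a minimal plain-Python illustration of that general fact; the sample values are made up and it does not use Polars.

```python
# Integers cast to strings, mirroring the "i" -> "s" cast in the example.
ints = [9, 10, 2, 50]
strs = [str(v) for v in ints]

sorted_ints = sorted(ints)  # numeric order
sorted_strs = sorted(strs)  # lexicographic order, compared character by character

print(sorted_ints)           # [2, 9, 10, 50]
print(sorted_strs)           # ['10', '2', '50', '9']
print(max(ints), max(strs))  # 50 9
```

With purely alphabetical strings, as in the dataset mentioned in the reply, this mismatch between numeric and string order does not arise.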
Polars version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Issue description
When set_sorted is True on a numerical column there is a major speed up. When it is True for a corresponding string column performance is worse. So with sorting the operation on the string column is 4x slower than with no sorting. This also holds if we sort by the string column using .sort.
Reproducible example
Expected behavior
Should be faster with set_sorted (or at least not worse)
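The expectation rests on a simple idea: if a column is known to be sorted ascending, its maximum is the last element and no scan is needed. The sketch below illustrates that idea in plain Python. `FlaggedColumn` and `sorted_ascending` are hypothetical names invented for this illustration; this is not how Polars implements set_sorted internally.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FlaggedColumn:
    """Hypothetical column wrapper carrying a 'sorted' flag (not a Polars type)."""

    values: List[str]
    sorted_ascending: bool = False

    def max(self) -> str:
        if not self.values:
            raise ValueError("max() of an empty column")
        if self.sorted_ascending:
            # Fast path: trust the flag and read the last element, O(1).
            return self.values[-1]
        # Slow path: full scan, O(n).
        return max(self.values)


col = FlaggedColumn(sorted(["pear", "apple", "fig"]), sorted_ascending=True)
print(col.max())  # pear
```

As in this sketch, a sorted flag is a promise made by the caller: if the data is not actually sorted, the fast path returns a wrong answer.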
Installed versions