Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow performance with set_sorted on string column #5163

Closed
2 tasks done
braaannigan opened this issue Oct 11, 2022 · 2 comments · Fixed by #5184
Closed
2 tasks done

Slow performance with set_sorted on string column #5163

braaannigan opened this issue Oct 11, 2022 · 2 comments · Fixed by #5184
Labels
bug Something isn't working python Related to Python Polars

Comments

@braaannigan
Copy link
Collaborator

braaannigan commented Oct 11, 2022

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

When set_sorted is True on a numerical column there is a major speed up. When it is True for a corresponding string column performance is worse.

So with sorting the operation on the string column is 4x slower than with no sorting. This also holds if we sort by the string column using .sort.

Reproducible example

import numpy as np
import polars as pl

N = 10_000_000
# Create df with an integer and corresponding string column
df = pl.DataFrame({"i":np.random.randint(0,100,N)}).with_column(pl.col("i").cast(pl.Utf8).alias("s"))

# Timings with no sorting
df.select(pl.col("i").max())
# 3 ms
df.select(pl.col("s").max())
# 10 ms

# Sort the df by the integer column
df = df.sort("i")
# Timings with sorting on integers
df.select(pl.col("i").max())
# 0.08 ms
df.select(pl.col("s").max())
# 10 ms

# Do set_sorted on s
df = df.with_column(pl.col("s").set_sorted())
# Timings with sorting on integers
df.select(pl.col("i").max())
# 0.08 ms
df.select(pl.col("s").max())
# 40 ms

Expected behavior

Should be faster with set_sorted (or at least not worse)

Installed versions

---Version info---
Polars: 0.14.18
Index type: UInt32
Platform: macOS-10.16-x86_64-i386-64bit
Python: 3.8.3 (default, May 19 2020, 13:54:14)
[Clang 10.0.0 ]
---Optional dependencies---
pyarrow: 7.0.0
pandas: 1.4.1
numpy: 1.23.0
fsspec: 2022.7.1
connectorx: 0.3.0
xlsx2csv: 0.7.8
matplotlib: 3.3.2
@braaannigan braaannigan added bug Something isn't working python Related to Python Polars labels Oct 11, 2022
@mcrumiller
Copy link
Contributor

mcrumiller commented Oct 11, 2022

Your column s is not sorted. The reason is that the string '9' is not less than the string "14", despite 9 being less than 14:

>>> '9' < '14'
False

In other words, sorting on column i does not properly sort s, so your set_sorted() is invalid. Thankfully, it looks like polars is smart enough to ignore set_sorted when it's improperly set, although I'll need @ritchie46 to verify that (update: it happens to return the correct result, but the documentation warns that you may get incorrect results). Let's re-do your test, but with sorting s. I've added the timing code as well:

import numpy as np
import polars as pl
from time import perf_counter

iters = 100 #

N = 10_000_000
# Create df with an integer and corresponding string column
df = pl.DataFrame({"i":np.random.randint(0,100,N)}).with_column(pl.col("i").cast(pl.Utf8).alias("s"))

print("Timings with no sorting:")
start = perf_counter()
for _ in range(iters):
    df.select(pl.col("i").max())
print(f"col('i').max(): {perf_counter()-start:0.1f}s")

start = perf_counter()
for _ in range(iters):
    df.select(pl.col("s").max())
print(f"col('s').max(): {perf_counter()-start:0.1f}s")


# sorted
print("\nTimings with sorting:")
df = df.with_columns([
    pl.col('i').sort(),
    pl.col('s').sort()
])

start = perf_counter()
for _ in range(iters):
    df.select(pl.col("i").max())
print(f"col('i').max(): {perf_counter()-start:0.1f}s")

start = perf_counter()
for _ in range(iters):
    df.select(pl.col("s").max())
print(f"col('s').max(): {perf_counter()-start:0.1f}s")


print("\nTimings with sorting and set_sorted:")
df = df.with_column(pl.col("s").sort().set_sorted())

start = perf_counter()
for _ in range(iters):
    df.select(pl.col("i").max())
print(f"col('i').max(): {perf_counter()-start:0.1f}s")

start = perf_counter()
for _ in range(iters):
    df.select(pl.col("s").max())
print(f"col('s').max(): {perf_counter()-start:0.1f}s")
Timings with no sorting:
col('i').max(): 0.3s
col('s').max(): 2.0s

Timings with sorting:
col('i').max(): 0.0s
col('s').max(): 4.4s

Timings with sorting and set_sorted:
col('i').max(): 0.0s
col('s').max(): 3.6s

So, set_sorted for column s did indeed improve timing, although not nearly as much as one would expect. @ritchie46, after doing a set_sorted(), wouldn't max() simply return the last element? Why is this still so slow?

@braaannigan
Copy link
Collaborator Author

Your column s is not sorted. The reason is that the string '9' is not less than the string "14", despite 9 being less than 14:

You're correct. I had tested this on a different dataset where the values are alphabetical strings but chose ints for an easier example.

I'm not sure that adding sort.set_sorted would change anything as Polars changes the flags for the column when you call sort. The difference in timings for strings in the last two examples is in the range of variability I see for the calculation, I think they are the same under the hood.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working python Related to Python Polars
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants