feat: dt.truncate supports broadcasting lhs #15768
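For orientation, a minimal sketch of what broadcasting the left-hand side enables (the column names and literal values below are illustrative, not taken from the PR): a length-1 datetime expression can be truncated by a full-length `every` column, with the lhs broadcast to match:

import polars as pl
from datetime import datetime

df = pl.DataFrame({"every": ["1h", "30m", "1d"]})

# The lhs is a single datetime literal; with lhs broadcasting it is matched
# against all three `every` values instead of requiring equal lengths.
out = df.select(
    truncated=pl.lit(datetime(2024, 5, 1, 13, 45)).dt.truncate(pl.col("every"))
)
print(out)

When the datetime column and the `every` column already have the same length (as in the benchmark further down), no broadcasting is needed; this PR covers the length-1 lhs case.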
Conversation
thanks for doing this!
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@           Coverage Diff           @@
##             main   #15768   +/-   ##
=======================================
  Coverage   81.35%   81.35%
=======================================
  Files        1379     1379
  Lines      176619   176683     +64
  Branches     2544     2542      -2
=======================================
+ Hits       143686   143748     +62
- Misses      32449    32453      +4
+ Partials      484      482      -2

☔ View full report in Codecov by Sentry.
CodSpeed Performance Report
Merging #15768 will not alter performance.
Before #15736, I wasn't aware that there would be such a performance penalty if the cache size isn't chosen well.
To be honest, I'm not totally sure, as I didn't benchmark the cache size; I just kept it the same as we use elsewhere.
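Purely as an illustration of the caching idea being discussed (the real cache lives in the Rust duration-parsing code; the parser and the maxsize below are made-up placeholders), a bounded memoization of duration-string parsing looks roughly like this:

from functools import lru_cache

# Toy stand-in for the duration parser; only understands "<n>s" strings.
# The maxsize is a placeholder, not the value used in the PR.
@lru_cache(maxsize=64)
def parse_duration_us(every: str) -> int:
    return int(every.rstrip("s")) * 1_000_000

# Repeated `every` values (as produced by a broadcast column) hit the cache
# instead of being re-parsed row by row.
durations = [parse_duration_us(e) for e in ["10s", "100s"] * 1_000]
print(parse_duration_us.cache_info())

The open question in this thread is whether the chosen cache size matters much in practice.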
thanks for fixing this and for adding the cache!
I think there may have been a performance impact from this, e.g.:

import timeit

import numpy as np

setup = """
import pandas as pd
import polars as pl
import numpy as np

rng = np.random.default_rng(1)
s = pl.Series(rng.integers(-1_000_000, 1_000_000, size=1_000_000)).cast(pl.Datetime)
df = pl.DataFrame({'dt': s})
df = df.with_columns(every=pl.Series(['10s']*(len(s)//2) + ['100s']*(len(s)//2)))
"""

results = np.array(
    timeit.Timer(
        stmt="df.select(pl.col('dt').dt.truncate('10s'))",
        setup=setup,
    ).repeat(7, 3)
) / 3
print(f"min: {min(results)}")
print(f"max: {max(results)}")
print(f"{np.mean(results)} +/- {np.std(results)/np.sqrt(len(results))}")

results = np.array(
    timeit.Timer(
        stmt="df.select(pl.col('dt').dt.truncate(pl.col('every')))",
        setup=setup,
    ).repeat(7, 3)
) / 3
print(f"min: {min(results)}")
print(f"max: {max(results)}")
print(f"{np.mean(results)} +/- {np.std(results)/np.sqrt(len(results))}")
On 0.20.21, the minimum timings are:

On 0.20.22, they are:

So, I think the separate paths need to be restored?
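For context on what "separate paths" refers to, here is a sketch under my own assumptions (not the PR's Rust code; the function names and toy parser are made up): with two paths, a scalar `every` is parsed once for the whole column, whereas a per-row `every` is parsed per element, with the cache absorbing repeated values:

from functools import lru_cache

@lru_cache(maxsize=64)  # placeholder size; toy "<n>s" -> microseconds parser
def parse_duration_us(every: str) -> int:
    return int(every.rstrip("s")) * 1_000_000

def truncate_us(timestamps, every):
    # Scalar fast path: parse `every` exactly once for the whole column.
    if isinstance(every, str):
        width = parse_duration_us(every)
        return [ts - ts % width for ts in timestamps]
    # Per-row path: each element's `every` is parsed (cache handles repeats).
    return [ts - ts % parse_duration_us(e) for ts, e in zip(timestamps, every)]

print(truncate_us([1_234_567_890, 9_876_543_210], "10s"))
print(truncate_us([1_234_567_890, 9_876_543_210], ["10s", "100s"]))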
Probably not by much, but this should also improve performance somewhat, thanks to the duration-parsing cache.
Closes #15743.