dt.truncate is slow #13157
Comments
thanks @stas-sl for the report. This part is all done in Chrono; wondering if there's a way to do it by operating on the underlying integers directly... |
@MarcoGorelli I think we can. Let's think about this a bit. In a non-leap year, the first of every month corresponds to the following day offsets within the year: 0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334.
In a normal year there are 365 days, plus 1 in leap years. 1972 was a leap year, and our dates start at 1970 (day 0 is 1970-01-01).
Let's pick a random date and truncate it to the nearest month, say a date in December 2023.
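For example, plugging December 2023 into the formulas above (an illustrative reconstruction, not the original comment's numbers):
y = 2023 - 1970   # 53 whole years since the epoch
p = (y + 1) // 4  # 13 leap days (1972, 1976, ..., 2020)
s = y * 365 + p   # 19358 = 2023-01-01
s + 334           # 19692 = 2023-12-01 (334 is December's non-leap-year offset)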
Check: >>> pl.Series([19692]).cast(pl.Date)
shape: (1,)
Series: '' [date]
[
2023-12-01
]
Next question is: would this be faster than using chrono, and are there any exceptions? |
if it's correct then yes, almost certainly 😄 If you want to give this a go, the function to search for is …; would need to hypothesis-test the hell out of this, but it would be awesome if it could work |
@MarcoGorelli I'm working on expression-ifying it for now in order to validate as a proof-of-concept. If it holds we can move it into rust. |
@MarcoGorelli for truncating by 1mo:
# truncate by 1mo
df.select(col("date") - pl.duration(days=col("date").dt.day() - 1))
It's less flexible though. Edit: slower than |
Seems like the issue is more general than just truncation; I found another issue with similar symptoms that also involves dates: #12895. And I did a few more tests converting datetime to string:
# converting to string works a bit faster than truncate('1mo'), though it still utilises only a single CPU core
%time df.group_by(pl.col.date.dt.strftime('%Y-%m')).agg(pl.count())
# CPU times: user 30.2 s, sys: 2.67 s, total: 32.9 s
# Wall time: 23 s
# CPU time for duckdb is comparable with polars, but due to utilising all cores, Wall time is much smaller
%time duckdb.sql("select strftime('%Y-%m', date) as month, count() from df group by month").pl()
# CPU times: user 26.6 s, sys: 162 ms, total: 26.7 s
# Wall time: 3.83 s
So, maybe there are 2 separate problems:
If the 1st point could be related to polars, then just fixing it could improve things. |
Another alternative, inspired by duckdb's date_trunc implementation:
%time df.group_by(pl.datetime(year=pl.col.date.dt.year(), month=pl.col.date.dt.month(), day=1)).agg(pl.count())
# CPU times: user 12.9 s, sys: 11 s, total: 23.8 s
# Wall time: 18.6 s
A bit faster, though still not very impressive |
Ok, the expression implementation is actually slower:
import numpy as np
import polars as pl
from polars import col, when
start = np.datetime64("2015-01-01", "s").astype("int")
end = np.datetime64("2020-01-01", "s").astype("int")
n = 1_000_000
df = pl.DataFrame({"date": np.random.randint(start, end, n).astype("datetime64[s]").astype("datetime64[ms]")})
df = df.with_columns(col("date").cast(pl.Date))
t = col("date")
y = t.dt.year() - 1970  # whole years since the 1970 epoch
p = (y + 1) // 4  # leap days before Jan 1 of that year (ignores the 100/400-year rule)
s = y * 365 + p  # days from 1970-01-01 to Jan 1 of the year
d = t - s  # 0-based day of the year
days = pl.Series([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])
nearest_offset = sum((day * (d >= cum_day) for day, cum_day in zip(days, days.cum_sum())))  # day offset of the month start
is_leap_year = (y + 2) % 4 == 0  # true for 1972, 1976, ... (again ignoring the 100/400-year rule)
offset = is_leap_year & (d > 58)  # add a day back for dates past Feb 28 in leap years
truncated = (s + nearest_offset + offset).cast(pl.Date)
# 324 ms ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
df.with_columns(col("date").dt.truncate("1mo"))
# 584 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
df.with_columns(truncated)
Most of the time is taken up by the …
@stas-sl I know you've moved beyond truncate but I'm just having fun here. Edit: faster implementation now with logical sums, still slower than |
It is slower with expressions but it doesn't mean it will be slower if it's coded in rust directly |
@mcrumiller Just experimenting with your example, it seems coalesce + separate when/thens is faster than a single when/then chain:
nearest_offset = pl.coalesce(
when(d < 31).then(0),
when(d < 59).then(31),
...,
334
)
|
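As an aside (not from the thread), such a chain can be generated from the non-leap-year month-start offsets; the column d and the toy frame below are made up for illustration:
import polars as pl
from polars import col, when

# cumulative day-of-year at which each month starts in a non-leap year
month_starts = [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334]
df = pl.DataFrame({"d": [0, 15, 31, 58, 59, 200, 364]})  # hypothetical 0-based day-of-year values
nearest_offset = pl.coalesce(
    [when(col("d") < upper).then(pl.lit(lower))
     for lower, upper in zip(month_starts, month_starts[1:] + [366])]
)
df.select(nearest_offset.alias("month_start"))  # 0, 0, 31, 31, 59, 181, 334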
@cmdlineluser did you compare those to my edit in the above where I replaced the when/then with sums? |
Can someone explain the >10x time difference between the two ways of creating dates below?
n = 100_000_000
%time pl.select(pl.date(year=pl.zeros(n) + 1970, month=1, day=1))
# CPU times: user 8.78 s, sys: 4.07 s, total: 12.9 s
# Wall time: 13.3 s
%time pl.select((pl.zeros(n)).cast(pl.Date))
# CPU times: user 602 ms, sys: 468 ms, total: 1.07 s
# Wall time: 1.1 s |
@stas-sl, a A python |
@mcrumiller hmm, are you sure it creates python datetime.date objects? I thought it returns an Expr which is then handled by Rust directly without any python objects. There is a corresponding Rust datetime function, which I guess is called from pl.date. |
Oh, sorry! I misread your comment and thought you were saying that construction from a list of |
I think this is a completely different issue than |
Unexpectedly:
for trunc in ('1mo', '1d'):
    for streaming in (True, False):
        print(f"{trunc=} {streaming=}")
        %time df.lazy().select(pl.col.date.dt.truncate(trunc)).collect(streaming=streaming)

trunc='1mo' streaming=True
CPU times: user 1min 9s, sys: 1.33 s, total: 1min 11s
Wall time: 9.09 s
trunc='1mo' streaming=False
CPU times: user 16.7 s, sys: 1.35 s, total: 18.1 s
Wall time: 18.1 s
trunc='1d' streaming=True
CPU times: user 3.35 s, sys: 338 ms, total: 3.69 s
Wall time: 812 ms
trunc='1d' streaming=False
CPU times: user 2.03 s, sys: 561 ms, total: 2.59 s
Wall time: 2.6 s *It is due to the streaming engine evaluating the expression in parallel across batches DuckDB has the advantage here in part due to their execution being pipelined by default |
There are two issues. The first is that truncate by month is inefficient, and the second is that polars, by default, parallelizes across columns rather than within them. The first issue can be mitigated by constructing a date from the parts of the input (i.e. something like the pl.datetime(year=..., month=..., day=1) approach above). Using @nameexhaustion's tip to use the streaming engine to force parallelization within the column, along with the date-constructor methodology, gets polars in line with duckdb. I imagine that it'd be pretty straightforward to change how truncate('1mo') works, but getting polars to automatically parallelize within columns would be more of a challenge. Maybe the optimizer could have something like: if n_rows > threshold and n_columns < other_threshold, then switch to the streaming engine. Here's a graph and table of results and the code to generate the above
|
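Not the benchmark code referenced above, just a minimal sketch of the combination it describes, assuming a df with a "date" column as in the earlier snippets:
import polars as pl

out = (
    df.lazy()
    .select(pl.date(pl.col("date").dt.year(), pl.col("date").dt.month(), 1))
    .collect(streaming=True)  # force the streaming engine so work is parallelized within the column
)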
For a single expression, this is expected: https://discord.com/channels/908022250106667068/1014967651656814694/1187398134997975111 Anyway, thanks for having brought attention to this - I think @mcrumiller's formula is on the right track (though the leap-years part needs adjusting - years divisible by 100 but not by 400 are not leap years) and, if done in Rust, I'd expect it to noticeably speed things up. @mcrumiller do you plan to implement this? I can do it if not |
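For reference, the full Gregorian rule the integer formula would eventually need to encode (standard calendar rule, not code from the thread):
def is_leap_year(year: int) -> bool:
    # divisible by 4, except century years, unless also divisible by 400
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

is_leap_year(2000), is_leap_year(1900), is_leap_year(1972)  # True, False, True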
Thanks for the clarification, I wasn't aware that by default polars parallelizes horizontally over columns and not vertically. And I had the impression that streaming=True is mainly for processing data larger than RAM, rather than for parallelization.
CPU times, though... it uses 4x more CPU resources than without streaming: 16 seconds vs 1min 9s.
%time df.lazy().select(pl.col.date.dt.truncate('1mo')).collect(streaming=False)
# CPU times: user 31.9 s, sys: 510 ms, total: 32.4 s
# Wall time: 32.7 s
# POLARS_MAX_THREADS=4
%time df.lazy().select(pl.col.date.dt.truncate('1mo')).collect(streaming=True)
# CPU times: user 1min 55s, sys: 1min 12s, total: 3min 8s
# Wall time: 49.9 s
On my quite dated Intel MacBook, streaming=True is actually slower, though I was able to reproduce the speed-up on my Linux server. But that's probably another story; I will experiment more with different settings. Anyway, regardless of parallelization, there seems to be something fishy with the current date truncation implementation. I am a bit skeptical about @mcrumiller's solution: it seems quite overcomplicated to me, and not even faster as of now. I don't know Rust, but maybe try to look at what's the difference between |
it's not a fair comparison though, the only way to know is to implement it in rust - integer arithmetic should be very fast
truncate_subweeky doesn't do anything with chrono (unless we're in the tz-aware case), and it just deals with fixed durations, so the integer arithmetic there is very simple - that's why it's a lot faster |
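For intuition (a simplified sketch, not the actual Rust code; it ignores time zones, offsets and pre-epoch values), fixed-duration truncation boils down to:
def truncate_fixed(t: int, every: int) -> int:
    # t and every are in the same physical unit, e.g. seconds or nanoseconds since the epoch
    return t - t % every

truncate_fixed(5_000, 3_600)  # 3_600: seconds-since-epoch floored to the hour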
With respect, if @mcrumiller's method is going to utilize the year extractor then is it really going to be any better than just |
@deanm0000, the year and month extractors seem to be relatively fast compared to constructing a new date from them using pl.date:
%time df.select(pl.col.date.dt.year())
# CPU times: user 1.33 s, sys: 152 ms, total: 1.48 s
# Wall time: 1.52 s
%time df.select(pl.col.date.dt.month())
# CPU times: user 1.32 s, sys: 53.9 ms, total: 1.37 s
# Wall time: 1.51 s
%time df.select(pl.date(pl.col.date.dt.year(), pl.col.date.dt.month(), 1))
# CPU times: user 10.9 s, sys: 2.83 s, total: 13.8 s
# Wall time: 12.7 s
# just constants
%time pl.select(pl.date(year=pl.zeros(n) + 2020, month=1, day=1))
# CPU times: user 8.63 s, sys: 2.96 s, total: 11.6 s
# Wall time: 11.8 s
# comparing to duckdb
%time duckdb.sql('select make_date(2020, 1, 1) from range(100000000)').pl()
# CPU times: user 1.54 s, sys: 516 ms, total: 2.06 s
# Wall time: 1.6 s
# datetime construction
%time duckdb.sql('select make_timestamp(2020, 1, 1, 0, 0, 0) from range(100000000)').pl()
# CPU times: user 2.14 s, sys: 973 ms, total: 3.11 s
# Wall time: 2.57 s
I checked the pl.datetime implementation (polars/crates/polars-plan/src/dsl/function_expr/temporal.rs, lines 132 to 155 at 5e3ffd2).
And it looks quite involved, and it uses chrono as well I guess, so it is understandable why it is slow. I can create a separate issue for it as @MarcoGorelli suggested, though I'm not sure if it is possible to do something about it. Or maybe it should be a much broader task to replace chrono with something more performant, if you ever decide to. |
the current solution does:
as far as I can tell, @mcrumiller's soln would be:
that's why I think it's got legs, even though it needs adjusting a bit - let's see |
Actually duckdb's date/time implementation https://github.com/duckdb/duckdb/blob/main/src/common/types/date.cpp is not that scary: only 622 LOC, including a hundred lines of lookup arrays 🙂 Maybe one day you'll manage to reimplement the date/time logic inside polars as well 🤔 Upd: |
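Just to illustrate the lookup-array idea (a toy Python sketch, not duckdb's actual C++):
# cumulative days before each month, for normal and leap years
CUMDAYS      = [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334, 365]
CUMDAYS_LEAP = [0, 31, 60, 91, 121, 152, 182, 213, 244, 274, 305, 335, 366]

def month_from_doy(doy: int, leap: bool) -> int:
    # map a 0-based day-of-year to its 1-based month via the lookup table
    table = CUMDAYS_LEAP if leap else CUMDAYS
    month = 1
    while table[month] <= doy:
        month += 1
    return month

month_from_doy(334, False)  # 12, i.e. December starts at offset 334 in a non-leap year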
might have something... gonna time it and find out. Update: #13192 |
Not sure if I'm supposed to create a separate issue for that, haven't done that before. Anyway, there's another performance problem when using dt.truncate, specifically when it's in a non-UTC timezone. Here's an example:
time_zone = 'America/New_York'
df = pl.DataFrame(
{'time' : pl.datetime_range(pl.date(2024, 6, 7), pl.date(2024, 6, 8),
'1ms', eager=True, time_unit='ns', time_zone=time_zone)}
)
%timeit df.select(pl.col('time').dt.truncate('1h'))
# 10.6 s ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
fast_trunc = lambda x: pl.col('time').dt.convert_time_zone('UTC').dt.truncate(x).dt.convert_time_zone(time_zone)
%timeit df.select(fast_trunc('1h'))
# 398 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
df.select(fast_trunc('1h')).equals(df.select(pl.col('time').dt.truncate('1h')))
# True
The fast method that I showed doesn't match regular truncate for odd intervals though, like '7s':
df.select(fast_trunc('7s'))
# ┌────────────────────────────────┐
# │ time │
# │ --- │
# │ datetime[ns, America/New_York] │
# ╞════════════════════════════════╡
# │ 2024-06-07 00:00:00 EDT │
# │ 2024-06-07 00:00:00 EDT │
# │ … │
# │ 2024-06-07 23:59:54 EDT │
# │ 2024-06-07 23:59:54 EDT │
# └────────────────────────────────┘
df.select(pl.col('time').dt.truncate('7s'))
# ┌────────────────────────────────┐
# │ time │
# │ --- │
# │ datetime[ns, America/New_York] │
# ╞════════════════════════════════╡
# │ 2024-06-06 23:59:54 EDT │
# │ 2024-06-06 23:59:54 EDT │
# │ … │
# │ 2024-06-07 23:59:55 EDT │
# │ 2024-06-07 23:59:55 EDT │
# └────────────────────────────────┘
As far as I understand, if the interval divides 1 hour evenly then we can use the fast method. It wouldn't be much of a stretch to assume that truncate is usually used with "nice" intervals, so checking for that and applying this optimization seems worth it. That's just a simple improvement; I'm sure this method can be made just as fast for any interval, but it's probably a bit more complicated. Finally, it seems operating on the physical representation gives a further performance improvement, but it's marginal:
delta_phys = (pl.col('time').to_physical() - pl.col('time').to_physical() % int(3600e9))
very_fast_trunc = delta_phys.cast(pl.Datetime('ns', time_zone))
%timeit df.select(very_fast_trunc)
# 381 ms ± 1.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) |
@Orventro As an example, #16531 was recently opened about
@MarcoGorelli has been doing all the (great!) work; if they don't notice your update and respond here, I think opening a new issue is OK. |
thanks for the comment / ping
this won't work for certain time zones, e.g. |
@MarcoGorelli I see. I checked the time zones and daylight savings and it seems that all of them offset UTC by multiples of 15 minutes. So if the interval divides 15 minutes evenly, the fast method should be safe for any time zone.
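A sketch of that check (a hypothetical helper, not polars code; it assumes the interval has already been resolved to a fixed number of nanoseconds):
FIFTEEN_MINUTES_NS = 15 * 60 * 1_000_000_000

def utc_truncation_is_safe(every_ns: int) -> bool:
    # per the observation above, UTC offsets (incl. DST) are multiples of 15 minutes;
    # an interval that divides 15 minutes therefore divides every offset, so truncating
    # in UTC and converting back matches truncating in the local zone
    return FIFTEEN_MINUTES_NS % every_ns == 0

utc_truncation_is_safe(5 * 60 * 1_000_000_000)  # True: 5 minutes divides 15 minutes
utc_truncation_is_safe(7 * 1_000_000_000)       # False: 7s does not, matching the mismatch above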
Checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Log output
No response
Issue description
dt.truncate is slow when used with intervals >= '1mo'; for smaller intervals it seems to be faster, though it still falls behind duckdb. It also seems like it doesn't utilise all CPU cores.
Expected behavior
At least dt.truncate for month intervals should work as fast as for smaller intervals, or even better, it could match duckdb performance.
Installed versions