-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Implement string cast operations for Time32 and Time64 #2251
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @stuartcarnie -- this is looking great. I left some feedback but I ran out of time today -- I will complete my review first thing tomorrow.
(Utf8, | ||
Date32 | ||
| Date64 | ||
| Time32(TimeUnit::Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
arrow/src/compute/kernels/cast.rs
Outdated
fn seconds_since_midnight(time: &chrono::NaiveTime) -> i32 { | ||
let sec = time.num_seconds_from_midnight(); | ||
let frac = time.nanosecond(); | ||
let adjust = if frac < 1_000_000_000 { 0 } else { 1 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a while to grok this -- I think it is leap second handling
https://docs.rs/chrono/0.4.19/chrono/trait.Timelike.html#tymethod.nanosecond
Maybe we could add a comment explaining what was going on
let adjust = if frac < 1_000_000_000 { 0 } else { 1 }; | |
// handle leap second | |
// see https://docs.rs/chrono/0.4.19/chrono/trait.Timelike.html#tymethod.nanosecond | |
let adjust = if frac < 1_000_000_000 { 0 } else { 1 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused about this myself. It was my understanding that these were added manually by timekeepers? Does chrono keep a list of historical leap seconds? https://en.wikipedia.org/wiki/Leap_second#:~:text=Between%201972%20and%202020%2C%20a,every%2021%20months%2C%20on%20average.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like they can model them, but don't keep a database of them? https://github.com/chronotope/chrono/blob/ab688c384f7c797466fca69d69bfd5ee3c2cba96/src/naive/time/mod.rs#L480
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, it can read unix TZif files to get the historical data https://github.com/chronotope/chrono/blob/ab688c384f7c797466fca69d69bfd5ee3c2cba96/src/offset/local/tz_info/timezone.rs#L662
https://manpages.ubuntu.com/manpages/xenial/man5/tzfile.5.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is possible to parse a time with a leap second, using the 60th second:
https://docs.rs/chrono/0.3.1/chrono/naive/time/index.html#reading-and-writing-leap-seconds
The leap second is stored as a fractional second of 1_000_000_000 nanoseconds, but I realise now that I don't need any logic to handle it. The code is much simpler 😂
arrow/src/compute/kernels/cast.rs
Outdated
let (frac, adjust) = if frac < 1_000_000_000 { | ||
(frac, 0) | ||
} else { | ||
(frac - 1_000_000_000, MILLIS_PER_SEC) | ||
}; | ||
(sec + adjust + frac / NANOS_PER_MILLI) as i32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is probably a good reason, but my feeble mind can't figure it out. Why is it important to break up frac
and adjust
?
For example, isn't this equivalent?
let (frac, adjust) = if frac < 1_000_000_000 { | |
(frac, 0) | |
} else { | |
(frac - 1_000_000_000, MILLIS_PER_SEC) | |
}; | |
(sec + adjust + frac / NANOS_PER_MILLI) as i32 | |
(sec + frac / NANOS_PER_MILLI) as i32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes indeed, I concluded the same thing and pushed up the changes, thanks! I will move the functions back inline as I originally had them, before I overcomplicated it 😂
arrow/src/compute/kernels/cast.rs
Outdated
@@ -1584,6 +1625,303 @@ fn cast_string_to_date64<Offset: OffsetSizeTrait>( | |||
Ok(Arc::new(array) as ArrayRef) | |||
} | |||
|
|||
fn seconds_since_midnight(time: &chrono::NaiveTime) -> i32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/apache/arrow-rs/blob/master/arrow/src/temporal_conversions.rs may be a good place to put these functions too (so they have a chance of being found / reused)
arrow/src/compute/kernels/cast.rs
Outdated
let iter = (0..string_array.len()).map(|i| { | ||
if string_array.is_null(i) { | ||
None | ||
} else { | ||
string_array | ||
.value(i) | ||
.parse::<chrono::NaiveTime>() | ||
.map(|time| seconds_since_midnight(&time)) | ||
.ok() | ||
} | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will be faster as it will not require checking bounds for calls to is_null
or value
:
let iter = (0..string_array.len()).map(|i| { | |
if string_array.is_null(i) { | |
None | |
} else { | |
string_array | |
.value(i) | |
.parse::<chrono::NaiveTime>() | |
.map(|time| seconds_since_midnight(&time)) | |
.ok() | |
} | |
}); | |
let iter = string_array | |
.iter() | |
.flat_map(|v| { | |
v.map(|v| { | |
v.parse::<chrono::NaiveTime>() | |
.map(|time| seconds_since_midnight(&time)) | |
.ok() | |
}) | |
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the same type of transformation can be applied to the iterator below this as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good – it looks like the string to date transformation functions would benefit from the same treatment, but I'll leave that to another PR so as to not cloud this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried, but unfortunately it fails at runtime with a panic:
thread 'compute::kernels::cast::tests::test_cast_string_to_time32second' panicked at 'trusted_len_unzip requires an upper limit', arrow/src/array/array_primitive.rs:470:25
at
arrow-rs/arrow/src/array/array_primitive.rs
Lines 469 to 470 in 91b2d7e
let (_, upper) = iterator.size_hint(); | |
let len = upper.expect("trusted_len_unzip requires an upper limit"); |
The previous iter
was ultimately of type Range<usize>
, which returns Some(_)
for the size_hint
:
whereas our new iter
does not, as it does not know the size.
There are probably good reasons, but it is unfortunate that size_hint
was not a separate trait, so this could be caught at compile time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for checking @stuartcarnie -- I wonder if flat_map is the problem -- let me take another crack at this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a way to do this cleanly -- in #2284
} else { | ||
let string = string_array | ||
.value(i); | ||
chrono::Duration::days(3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
chrono::Duration::days(3);
Seems like a leftover?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed – it is gone now, thanks for the spot
Remove the unnecessary conditionals to extract the leap second, as it is already handled when converting to a time unit relative to midnight 🤦🏻♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarifications! Hopefully having them on this PR will help someone in the future - they definitely helped me.
LGTM
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #2251 +/- ##
==========================================
- Coverage 82.29% 82.26% -0.04%
==========================================
Files 243 245 +2
Lines 62443 62863 +420
==========================================
+ Hits 51387 51713 +326
- Misses 11056 11150 +94 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this code's logic and tests looks good to me. I do think it would be nice to clean up the iterator logic a bit, though I think we can do that as a follow on PR.
])) as ArrayRef; | ||
let a2 = Arc::new(LargeStringArray::from(vec![ | ||
Some("08:08:35.091323414"), | ||
Some("08:08:60.091323414"), // leap second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 for the leap second
@@ -2854,6 +3172,102 @@ mod tests { | |||
} | |||
} | |||
|
|||
#[test] | |||
fn test_cast_string_to_time32second() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 these are great tests.
assert!(c.is_null(2)); | ||
assert!(c.is_null(3)); | ||
assert!(c.is_null(4)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see any tests for the fallable path (aka that throws an error). I want to have another crack at using the iterators so I'll add those tests in a follow up
Benchmark runs are scheduled for baseline = ed9fc56 and contender = 9a4b1c9. 9a4b1c9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #2053 and helps apache/datafusion#2883.
Rationale for this change
N/A
What changes are included in this PR?
Implements cast operations following precedence of existing implementations.
Are there any user-facing changes?
The
cast
API now supports string -> Time32 and Time64 transformations.