-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix issue with "to_date" failing to process dates later than year 2262 #12227
Conversation
Due to using nanoseconds as an intermediate state, the values for dates processed via "to_date" cannot be later than the year 2262. The Arrow datatype for Date32 and Date64 supports much larger values. The statements in some areas in the code that the usage of nanoseconds is imposed by Arrow is simply wrong. The Date32 type stores the number of days since epoch. The Date64 type stores the milliseconds (NOT nanoseconds) since epoch. Both Date32 and Date64 can therefore massively exceed the year 2262. See: https://arrow.apache.org/docs/cpp/api/datatype.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @MartinKolbAtWork for your contribution
I just checked we do have an issue with to_date
function whereas the cast
works correctly
> select extract(day from cast('9999-12-31' as date));
+-------------------------------------------+
| date_part(Utf8("DAY"),Utf8("9999-12-31")) |
+-------------------------------------------+
| 31.0 |
+-------------------------------------------+
1 row(s) fetched.
Elapsed 0.005 seconds.
> select to_date('9999-12-31');
Arrow error: Parser error: The dates that can be represented as nanoseconds have to be between 1677-09-21T00:12:44.0 and 2262-04-11T23:47:16.854775804
Perhaps we can use a the same implementation like CAST does(reuse it) instead of creating a new impl
The main thing of "to_date" is that is supports supplying a format string. As CAST does not support this, I don't think merging or re-using these two implementations is an approach that should be pursued. However with the latest commit I reduced the implementation effort by calling the parsing implementations from "arrow_cast". |
@jayzhan211 can you please trigger the build? |
@MartinKolbAtWork please check the build results.
it might be the expected value was wrong in this case |
Can you please update the docs?
|
Hi @findepi I fixed the test failures and updated the docs. You the build be re-triggered to check if this was successful? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the update! couple minor comments
Hi @findepi , Now that we found out that adding support for negative years does add several nitty-gritty details, that need to be carefully thought through, I suggest that we split this up and handle the negative years topic in a separate PR. |
fyi @Omega359 |
I wonder what the cost is for using string_to_datetime_formatted vs a more limited string_to_date_formatted. There should be slightly less parsing involved for that. |
Hi @Omega359 ,
And, as a brief reminder… I just intended to fix the issue that the date calculation has an upper limit of 2262. Both, the full timestamp parsing as well as the UTC handling are already in the code-base. I introduced neither of them. |
Good points. I am unsure if I introduced that or just inherited it when I added format parsing and didn't think closely enough about it. Glad to see this being resolved |
I will check this PR out shortly |
@alamb , take your time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really nice (and well tested to me). THank you very much for the contribution @MartinKolbAtWork
Also, thank you @findepi and @comphead for the review
cc @Omega359
|
||
use super::ToDateFunc; | ||
|
||
#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😍
Very nice
"2020-09-08 12:13:29", | ||
]; | ||
for date_str in test_cases { | ||
let formatted_date_scalar = ScalarValue::Utf8(Some(date_str.into())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW I think you can write the same thing more concisely like
let formatted_date_scalar = ScalarValue::Utf8(Some(date_str.into())); | |
let formatted_date_scalar = ScalarValue::from(date_str); |
let to_date_result = | ||
ToDateFunc::new().invoke(&[ColumnarValue::Scalar(formatted_date_scalar)]); | ||
|
||
match to_date_result { | ||
Ok(ColumnarValue::Scalar(ScalarValue::Date32(date_val))) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The machinery to call the function is also a but clunky, I wonder if it would make the tests easier to read (as there would be less conversion machinery) if we added a function like
/// Invokes ToDate on the input string, and returns the resulting Date32 value
fn run_to_date(input: &str) -> Result<i32> {
..
}
I believe the CI failure https://github.com/apache/datafusion/actions/runs/10725200169/job/29798407173?pr=12227 was fixed on main by @mbrobbel in #12346 I took the liberty of merging the branch up from main to try and get a clean CI run |
We can consider potential changes to make the test more "readable" as part of a follow on PR. Thanks agian @MartinKolbAtWork @Omega359 @findepi and @comphead |
Which issue does this PR close?
Closes #12226
Rationale for this change
The "to_date" function (https://github.com/apache/datafusion/blob/main/datafusion/functions/src/datetime/to_date.rs) fails to process dates that are later than year 2262.
This is caused by the implementation detail that the conversion process uses nano-seconds based on epoch.
The Arrow datatype for Date32 and Date64 support much larger values. The statements in some areas in the code state that the usage of nanoseconds is imposed by the Date types of Arrow. This is simply wrong. The Date32 type stores the number of days since epoch. The Date64 type stores the milliseconds (NOT nanoseconds) since epoch. Both Date32 and Date64 can therefore massively exceed the year 2262.
See: https://arrow.apache.org/docs/cpp/api/datatype.html
NOTE:
Processing dates later than 2262 is not a theoretical issue. In widely used business software systems, unbounded dates (e.g. an "expiry_date" that is set to never expire) are set to the 31st of December of the year 9999 (i.e. "9999-12-31").
Processing such data with datafusion will fail if the current "to_date" implementation touches this data.
What changes are included in this PR?
Are these changes tested?
Unit tests are provided in https://github.com/apache/datafusion/blob/main/datafusion/functions/src/datetime/to_date.rs
Are there any user-facing changes?
n/a