Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"to_date" fails to process dates later than year 2262 #12226

Closed
MartinKolbAtWork opened this issue Aug 29, 2024 · 0 comments · Fixed by #12227
Closed

"to_date" fails to process dates later than year 2262 #12226

MartinKolbAtWork opened this issue Aug 29, 2024 · 0 comments · Fixed by #12227
Labels
bug Something isn't working

Comments

@MartinKolbAtWork
Copy link
Contributor

Describe the bug

The "to_date" function (https://github.com/apache/datafusion/blob/main/datafusion/functions/src/datetime/to_date.rs) fails to process dates that are later than year 2262.
This is caused by the implementation detail that the conversion process uses nano-seconds based on epoch.
The Arrow datatype for Date32 and Date64 support much larger values. The statements in some areas in the code state that the usage of nanoseconds is imposed by the Date types of Arrow. This is simply wrong. The Date32 type stores the number of days since epoch. The Date64 type stores the milliseconds (NOT nanoseconds) since epoch. Both Date32 and Date64 can therefore massively exceed the year 2262.
See: https://arrow.apache.org/docs/cpp/api/datatype.html

NOTE:
Processing dates later than 2262 is not a theoretical issue. In widely used business software systems, unbounded dates (e.g. an "expiry_date" that is set to never expire) are set to the 31st of December of the year 9999 (i.e. "9999-12-31").
Processing such data with datafusion will fail if the current "to_date" implementation touches this data.

To Reproduce

Add these unit tests to https://github.com/apache/datafusion/blob/main/datafusion/functions/src/datetime/to_date.rs

mod tests {
    use arrow::{compute::kernels::cast_utils::Parser, datatypes::Date32Type};
    use datafusion_common::ScalarValue;
    use datafusion_expr::{ColumnarValue, ScalarUDFImpl};

    use super::ToDateFunc;

    #[test]
    fn test_year_9999() {
        let date_str = "9999-12-31";
        let date_scalar = ScalarValue::Utf8(Some(date_str.to_string()));

        let res = ToDateFunc::new().invoke(&[ColumnarValue::Scalar(date_scalar)]);

        match res {
            Ok(ColumnarValue::Scalar(ScalarValue::Date32(date_val))) => {
                let expected = Date32Type::parse("9999-12-31");
                assert_eq!(date_val, expected, "to_date created wrong value");
            }
            _ => panic!("Could not convert '{}' to Date", date_str),
        }
    }

    #[test]
    fn test_year_9999_formatted() {
        let date_str = "99991231";
        let format_str = "%Y%m%d";
        let date_scalar = ScalarValue::Utf8(Some(date_str.to_string()));
        let format_scalar = ScalarValue::Utf8(Some(format_str.to_string()));

        let res = ToDateFunc::new().invoke(&[
            ColumnarValue::Scalar(date_scalar),
            ColumnarValue::Scalar(format_scalar),
        ]);

        match res {
            Ok(ColumnarValue::Scalar(ScalarValue::Date32(date_val))) => {
                let expected = Date32Type::parse("9999-12-31");
                assert_eq!(date_val, expected, "to_date created wrong value");
            }
            _ => panic!(
                "Could not convert '{}' with format string '{}'to Date",
                date_str, format_str
            ),
        }
    }
}

Expected behavior

"to_date" must be able to process date values in the whole range of values that are supported by Arrow's Date32 and Date64, e.g. the date "9999-12-31"

Additional context

No response

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant