Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing issues with for timestamp literals #8193

Merged
merged 9 commits into from
Nov 26, 2023
Merged

Conversation

comphead
Copy link
Contributor

@comphead comphead commented Nov 15, 2023

Which issue does this PR close?

Closes partially #7958

Rationale for this change

During #7958 investigation it was revealed that timestamp conversion has bunch of inconsistencies with PG and DF itself

What changes are included in this PR?

This PR is intended to fix:

  • to_timestamp, ::timestamp, cast(n as timestamp) should be consistent
  • functions above should work for negative integer values
  • fuctions above should treat integers input as seconds, to be in sync with PG

Are these changes tested?

Oh yes

Are there any user-facing changes?

@comphead comphead marked this pull request as draft November 15, 2023 17:05
@github-actions github-actions bot added sql SQL Planner physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels Nov 15, 2023
@comphead comphead changed the title timestamp fixes Fixing issues with timestamp cast Nov 15, 2023
@comphead comphead marked this pull request as ready for review November 15, 2023 19:21
@comphead comphead requested review from viirya and alamb November 15, 2023 19:21
@comphead
Copy link
Contributor Author

@waitingkuo cc

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this item @comphead

It seems to me there is a potential problem with treating cast specially from other places with timestamps. I think it would be better if we treated all timestamps the same (either switch them to Timestamp::Seconds or Timestamp::Nanoseconds` for consisteny

datafusion/common/src/scalar.rs Show resolved Hide resolved
datafusion/sql/src/expr/mod.rs Outdated Show resolved Hide resolved
@comphead
Copy link
Contributor Author

Thanks @alamb I agree any kind of cast to timestamp is supposed to have consistent signature. Currently its not.
Major engines have timestamps with microsecond precision, I can try to do it within current PR, however tbh it might be difficult to review.

@comphead comphead marked this pull request as draft November 18, 2023 20:45
@github-actions github-actions bot added logical-expr Logical plan and expressions core Core DataFusion crate labels Nov 20, 2023
@comphead
Copy link
Contributor Author

not sure, why clippy is complaining.... locally its okay

@comphead comphead marked this pull request as ready for review November 20, 2023 19:17
@comphead comphead requested a review from alamb November 20, 2023 19:18
@comphead
Copy link
Contributor Author

@alamb I have aligned signatures, the solution is still suboptimal(using 2 casts).

In followup PR I would get rid of 2 casts and we need to support extreme low and high integer values, currently it overflows because Timestamp backed by i64 which is supposed to be microsecond rather than nanosecond.

I see 2 options to handle this:

  • Have a default precision as a microsecond(now its nanosecond), imho its preferred as PG or other major engines have a microsecond precision
  • Modify underlying value type for Timestamp from i64 to i128

@alamb let me know your thoughts.

PS. Not sure why clippy failed...

Timestamp(Nanosecond, None) Timestamp(Nanosecond, None) Timestamp(Nanosecond, None)


# verify extreme values (expects default precision to be microsecond instead of nanoseconds. Work pending)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ive intentionally added this test to be uncommented once we figure out the overflow solution

@comphead
Copy link
Contributor Author

@alamb please re review once you have some free time

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @comphead -- I am worried about this PR potentially messing up downstream users of datafusion (e.g. IOx) and makes the timestamp type inconsistent within datafusion (I realize currently the timestamp type is inconsitent with what is done with postgres, etc).

If the goal is to be able to read data from Parquet that is stored as INT96 correctly as a timestamp, I am not sure why we are changing timetamp handling in DataFusion)

Perhaps a better approach would be to investigate updating the arrow-rs reader to read whatever spark generates as a proper timestamp type (@avantgardnerio hints at that in #7958 (comment)) 🤔

datafusion/expr/src/built_in_function.rs Show resolved Hide resolved
let expected = "Projection: person.id, person.first_name, person.last_name\
\n Filter: person.state = Utf8(\"CO\") AND person.age >= Int64(21) AND person.age <= Int64(65)\
\n TableScan: person";
let expected = "Projection: person.id, person.first_name, person.last_name\n Filter: person.state = Utf8(\"CO\") AND person.age >= Int64(21) AND person.age <= Int64(65)\n TableScan: person";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these tests actually different? They look the same to me except the formatting changed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to revert these changes or reformat them if there is a difference, to make the diff clearer

datafusion/sql/src/expr/mod.rs Outdated Show resolved Hide resolved
@comphead
Copy link
Contributor Author

comphead commented Nov 21, 2023

Thanks @alamb this is my fault, I referenced too much problems and brought up the confusion. This particular PR has nothing to do with INT96 and Parquet, it relates to timestamp literals only.

It fixes next problems

  • to handle integer inputs the same way as PG, DuckDB, Spark, etc
  • negative ops extended for timestamps
  • align all timestamp syntaxes to return consistent result and same datatypes.

I have returned the signature to be NanoSecond like it was before, and now everything timestamp related in DF is consistent.
We still have problems with overflows on extreme high/low values which can be fixed with either:

  • change default timestamp precision to microsecond instead of nanosecond, this might be huge work involved, as all timestamp arithmetics refers to NanoSeconds
  • change Timestamp underlying storage to i128 instead of i64

@comphead comphead requested a review from alamb November 21, 2023 20:56
@comphead comphead changed the title Fixing issues with timestamp cast Fixing issues with for timestamp literals Nov 21, 2023
@alamb
Copy link
Contributor

alamb commented Nov 21, 2023

Seems reasonable to me -- thank you @comphead. Does this revert the behavior added to close #2979 (to align with postgres behavior?) Or does it just change to_timestamp to return a type of TimestampNanos (though the argument is still interpreted as seconds)?

-- @waitingkuo do you have a few moments to give this PR a look?

@comphead
Copy link
Contributor Author

Seems reasonable to me -- thank you @comphead. Does this revert the behavior added to close #2979 (to align with postgres behavior?) Or does it just change to_timestamp to return a type of TimestampNanos (though the argument is still interpreted as seconds)?

-- @waitingkuo do you have a few moments to give this PR a look?

It doesn't revert the behavior, the input still treated as seconds, but the return type is nanoseconds to keep DataFusion consistent with all timestamp arithmetics and functions.

@alamb
Copy link
Contributor

alamb commented Nov 22, 2023

It doesn't revert the behavior, the input still treated as seconds, but the return type is nanoseconds to keep DataFusion consistent with all timestamp arithmetics and functions.

Ah, sorry for my confusion

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @comphead for bearing with me. I took another look and pass through this PR and it makes sense to me. Thank you for your focus on detail

I left some suggestions for improvements, but I don't think they are necessary prior to merging this pR

datafusion/sql/src/expr/mod.rs Outdated Show resolved Hide resolved
datafusion/sql/src/expr/mod.rs Outdated Show resolved Hide resolved
datafusion/sqllogictest/test_files/timestamps.slt Outdated Show resolved Hide resolved
let expected = "Projection: person.id, person.first_name, person.last_name\
\n Filter: person.state = Utf8(\"CO\") AND person.age >= Int64(21) AND person.age <= Int64(65)\
\n TableScan: person";
let expected = "Projection: person.id, person.first_name, person.last_name\n Filter: person.state = Utf8(\"CO\") AND person.age >= Int64(21) AND person.age <= Int64(65)\n TableScan: person";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to revert these changes or reformat them if there is a difference, to make the diff clearer

@comphead
Copy link
Contributor Author

Addressed all comments

@comphead
Copy link
Contributor Author

Should I merge this PR @alamb ?

@alamb
Copy link
Contributor

alamb commented Nov 26, 2023

Should I merge this PR @alamb ?

Yes, please do so!

It is one of my goals for the year to grow the capacity for review / merging PRs (so as to avoid single-person bottlenecks, ahem...)

As long as the guidelines described here are followed any committer should feel free to merge approved PRs:
https://arrow.apache.org/datafusion/contributor-guide/index.html#pull-request-overview

@comphead comphead merged commit 2071259 into apache:main Nov 26, 2023
22 checks passed
@alamb
Copy link
Contributor

alamb commented Nov 27, 2023

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants