Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: include precision parameter in timestamp types #594

Merged
merged 14 commits into from
Feb 22, 2024

Conversation

richtia
Copy link
Member

@richtia richtia commented Jan 25, 2024

This PR introduces a new PrecisionTimestamp type that accepts an optional precision parameter to specify fractional second precision.

Closes #592

Copy link

ACTION NEEDED

Substrait follows the Conventional Commits
specification
for
release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@richtia
Copy link
Member Author

richtia commented Jan 25, 2024

Keeping this as a draft and only making the changes for Timestamp as of right now. Once we agree on what changes need to be made, I'll make the updates for Timestamp_tz as well.

@richtia richtia changed the title [DRAFT] feat: include precision parameter in timestamp types feat: include precision parameter in timestamp types Jan 25, 2024
@richtia richtia force-pushed the parametrized_timestamp branch from c641f87 to 6dd360a Compare January 25, 2024 21:26
@@ -32,7 +32,8 @@ scalar_functions:
* SECOND Return the second (0-59).
* MILLISECOND Return number of milliseconds since the last full second.
* MICROSECOND Return number of microseconds since the last full millisecond.
* SUBSECOND Return number of microseconds since the last full second of the given timestamp.
* NANOSECOND Return number of nanoseconds since the last full microsecond.
* SUBSECOND Return number of nanoseconds since the last full second of the given timestamp.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we changing the behavior of SUBSECOND here? Would it be safer to add a new option and deprecate SUBSECOND?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, deprecating might be a better. Should we have a process to be followed for deprecating an option value?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the standard definition uses millisecond for SUBSECOND then we might need to leave this version permanently.

@@ -796,7 +796,8 @@ message Expression {
string string = 12;
bytes binary = 13;
// Timestamp in units of microseconds since the UNIX epoch.
int64 timestamp = 14;
// Deprecated in favor of `Timestamp timestamp`
// int64 timestamp = 14 [deprecated = true];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically the number would become reserved to prevent reuse. But usually one goes through a deprecation period to allow for interoperability. That means we need to have both around for a while. We could say that consumers always should use matched sets but I don't think that's always possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means we need to have both around for a while. We could say that consumers always should use matched sets but I don't think that's always possible.

For migration purposes we need a period of time where both of these timestamps types coexist. To migrate this properly we need to be able to do something like:

  1. Update consumers to handle the new timestamp type.
  2. Update producers to generate plans with the new timestamp type and remove the old type.
  3. Remove code for handling the old timestamp type in consumers.

If we remove this entirely we have to do 1 and 2 simultaneously, which doesn't work super well if they're entirely different systems 😢

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. So we'll need a different name for the new timestamp. Maybe something like ParametrizedTimestamp?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since parameterized_types.proto already uses Parameterized as a prefix in all the names, it'll be less confusing to use something like PrecisionTimestamp instead

@@ -845,6 +848,11 @@ message Expression {
int32 scale = 3;
}

message Timestamp {
// the maximum precision is 9.
int32 precision = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also have a value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly sure why value would be necessary?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a literal (the type is named ParameterizedPrecisionTimestamp?. In other words, if my SQL filter is event_time < '1999-01-08 04:05:06' and event_time is a "precision timestamp" column then I need to make a Substrait plan that looks like...

less_than(input_field(0), ???) where ??? is my timestamp literal. So we do need a value (and not a precision) here.

We could do something like:

      // If the precision is 6 or less then this is the microseconds since the UNIX epoch
      // If the precision is more than 6 then this is the nanoseconds since the UNIX epoch
      uint64 precision_timestamp = 34;

@@ -845,6 +848,11 @@ message Expression {
int32 scale = 3;
}

message Timestamp {
// the maximum precision is 9.
int32 precision = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default value is likely 6 for backwards compatibility (although this is a completely different message type so I'm not sure that matters).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the type, there's no need, but it looked like this was replacing the value version of int timestamp.

proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/type.proto Outdated Show resolved Hide resolved
site/docs/types/type_classes.md Outdated Show resolved Hide resolved
proto/substrait/type.proto Outdated Show resolved Hide resolved
proto/substrait/type_expressions.proto Outdated Show resolved Hide resolved
| MAP&lt;K, V&gt; | An unordered list of type K keys with type V values. Keys may be repeated. While the key type could be nullable, keys may not be null. | `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V
| LIST&lt;T&gt; | A list of values of type T. The list can be between [0..2,147,483,647] values in length. | `repeated Literal`, all types matching T
| MAP&lt;K, V&gt; | An unordered list of type K keys with type V values. Keys may be repeated. While the key type could be nullable, keys may not be null. | `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V
| TIMESTAMP&lt;P&gt; | A naive timestamp within [1000-01-01 00:00:00.000000000..9999-12-31 23:59:59.999999999], with precision (P, number of digits) <= 9. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. | `int64` nanoseconds since 1970-01-01 00:00:00.000000000 (in an unspecified timezone)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should have both the old and new types listed here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the proto it looks like we are deprecating the old type? Are you saying we should list the old type and mark it deprecated?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not work. If you store int64 nanoseconds since the epoch then the maximum value you can get is 2262-04-12 01:47:16.854775807 +0200 (slightly larger if you store unsigned but still not large enough). However, the first sentence claims within [1000-01-01 00:00:00.000000000..9999-12-31 23:59:59.999999999].

So either the literal representation needs to be larger than 8 bytes or the allowable range needs to depend on the precision.

Note that making literals larger than 8 bytes is tempting, but it could lead to system incompatibility. For example, you could create a Substrait filter my_date < year_4000 but if you tried to evaluate that on the consumer it might try and convert that literal to its internal representation and return an error (e.g. this is what would happen if you tried to use pandas, datafusion, or pyarrow to evaluate that).

So I think we have to state that the allowable range depends on the precision.

@richtia richtia force-pushed the parametrized_timestamp branch from 7b97dce to 990aa9b Compare January 30, 2024 19:34
extensions/functions_datetime.yaml Show resolved Hide resolved
@@ -845,6 +848,11 @@ message Expression {
int32 scale = 3;
}

message Timestamp {
// the maximum precision is 9.
int32 precision = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a literal (the type is named ParameterizedPrecisionTimestamp?. In other words, if my SQL filter is event_time < '1999-01-08 04:05:06' and event_time is a "precision timestamp" column then I need to make a Substrait plan that looks like...

less_than(input_field(0), ???) where ??? is my timestamp literal. So we do need a value (and not a precision) here.

We could do something like:

      // If the precision is 6 or less then this is the microseconds since the UNIX epoch
      // If the precision is more than 6 then this is the nanoseconds since the UNIX epoch
      uint64 precision_timestamp = 34;

| MAP&lt;K, V&gt; | An unordered list of type K keys with type V values. Keys may be repeated. While the key type could be nullable, keys may not be null. | `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V
| LIST&lt;T&gt; | A list of values of type T. The list can be between [0..2,147,483,647] values in length. | `repeated Literal`, all types matching T
| MAP&lt;K, V&gt; | An unordered list of type K keys with type V values. Keys may be repeated. While the key type could be nullable, keys may not be null. | `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V
| TIMESTAMP&lt;P&gt; | A naive timestamp within [1000-01-01 00:00:00.000000000..9999-12-31 23:59:59.999999999], with precision (P, number of digits) <= 9. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. | `int64` nanoseconds since 1970-01-01 00:00:00.000000000 (in an unspecified timezone)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the proto it looks like we are deprecating the old type? Are you saying we should list the old type and mark it deprecated?

| MAP&lt;K, V&gt; | An unordered list of type K keys with type V values. Keys may be repeated. While the key type could be nullable, keys may not be null. | `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V
| LIST&lt;T&gt; | A list of values of type T. The list can be between [0..2,147,483,647] values in length. | `repeated Literal`, all types matching T
| MAP&lt;K, V&gt; | An unordered list of type K keys with type V values. Keys may be repeated. While the key type could be nullable, keys may not be null. | `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V
| TIMESTAMP&lt;P&gt; | A naive timestamp within [1000-01-01 00:00:00.000000000..9999-12-31 23:59:59.999999999], with precision (P, number of digits) <= 9. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. | `int64` nanoseconds since 1970-01-01 00:00:00.000000000 (in an unspecified timezone)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not work. If you store int64 nanoseconds since the epoch then the maximum value you can get is 2262-04-12 01:47:16.854775807 +0200 (slightly larger if you store unsigned but still not large enough). However, the first sentence claims within [1000-01-01 00:00:00.000000000..9999-12-31 23:59:59.999999999].

So either the literal representation needs to be larger than 8 bytes or the allowable range needs to depend on the precision.

Note that making literals larger than 8 bytes is tempting, but it could lead to system incompatibility. For example, you could create a Substrait filter my_date < year_4000 but if you tried to evaluate that on the consumer it might try and convert that literal to its internal representation and return an error (e.g. this is what would happen if you tried to use pandas, datafusion, or pyarrow to evaluate that).

So I think we have to state that the allowable range depends on the precision.

@richtia richtia marked this pull request as ready for review February 9, 2024 18:54
@richtia richtia requested review from vbarua and westonpace February 9, 2024 18:55
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again! Some suggestions and cleanup.

proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
@@ -159,6 +163,18 @@ message Type {
Nullability nullability = 4;
}

message PrecisionTimestamp {
optional int32 precision = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know that we've used optional much elsewhere. I seem to recall it being largely a no-op as far as protobuf is concerned. However, I don't see much harm in it. Maybe document the default (6 I assume) again?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional is from the version 2 of the protobuf format. These days most of the outside world is using version 3 (as is this file) where optional is assumed.

Comment on lines 44 to 45
| PRECISIONTIMESTAMP&lt;P&gt; | A timestamp with fractional second precision (P, number of digits) <= 9. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. | `uint64` nanoseconds since 1970-01-01 00:00:00.000000000 (in an unspecified timezone)
| PRECISIONTIMESTAMPTZ&lt;P&gt; | A timezone-aware timestamp, with fractional second precision (P, number of digits) <= 9. Similar to aware datetime in Python. | `int64` microseconds since 1970-01-01 00:00:00.000000 UTC
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might go without saying but should we mention that P >= 0 as well?

| NSTRUCT&lt;N:T1,...,N:Tn&gt; | **Pseudo-type**: A struct that maps unique names to value types. Each name is a UTF-8-encoded string. Each value can have a distinct type. Note that NSTRUCT is actually a pseudo-type, because Substrait's core type system is based entirely on ordinal positions, not named fields. Nonetheless, when working with systems outside Substrait, names are important. | n/a
| LIST&lt;T&gt; | A list of values of type T. The list can be between [0..2,147,483,647] values in length. | `repeated Literal`, all types matching T
| MAP&lt;K, V&gt; | An unordered list of type K keys with type V values. Keys may be repeated. While the key type could be nullable, keys may not be null. | `repeated KeyValue` (in turn two `Literal`s), all key types matching K and all value types matching V
| PRECISIONTIMESTAMP&lt;P&gt; | A timestamp with fractional second precision (P, number of digits) <= 9. Does not include timezone information and can thus not be unambiguously mapped to a moment on the timeline without context. Similar to naive datetime in Python. | `uint64` nanoseconds since 1970-01-01 00:00:00.000000000 (in an unspecified timezone)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like PRECISIONTIMESTAMP has a literal representation of uint64 nanoseconds and PRECISIONTIMESTAMPTZ has a literal representation of uint64 microseconds.

However, if I'm reading the proto correctly, they both have a literal representation of "either uint64 microseconds or nanoseconds since ..."

Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As part of this, do we need add signature names for the new types as well?

| timestamp | ts |
| timestamp_tz | tstz |

@richtia
Copy link
Member Author

richtia commented Feb 10, 2024

As part of this, do we need add signature names for the new types as well?

| timestamp | ts |
| timestamp_tz | tstz |

Thanks for the catch! Added new signature names

@richtia richtia force-pushed the parametrized_timestamp branch from 16c5cce to 7c5da20 Compare February 10, 2024 02:46
@richtia richtia requested a review from EpsilonPrime February 13, 2024 19:27
@richtia richtia requested review from vbarua and westonpace February 13, 2024 19:27
westonpace
westonpace previously approved these changes Feb 17, 2024
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small typo

site/docs/types/type_classes.md Outdated Show resolved Hide resolved
EpsilonPrime
EpsilonPrime previously approved these changes Feb 22, 2024
@@ -159,6 +163,18 @@ message Type {
Nullability nullability = 4;
}

message PrecisionTimestamp {
optional int32 precision = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional is from the version 2 of the protobuf format. These days most of the outside world is using version 3 (as is this file) where optional is assumed.

@richtia richtia dismissed stale reviews from EpsilonPrime and westonpace via 4d349d3 February 22, 2024 01:30
@richtia richtia requested a review from EpsilonPrime February 22, 2024 01:30
@richtia richtia requested a review from westonpace February 22, 2024 01:41
@richtia richtia merged commit 087f87c into substrait-io:main Feb 22, 2024
13 checks passed
// If the precision is 6 or less then this is the microseconds since the UNIX epoch
// If the precision is more than 6 then this is the nanoseconds since the UNIX epoch
uint64 precision_timestamp = 34;
uint64 precision_timestamp_tz = 35;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these have been custom types that'd include the precision? Like Decimal/VarChar below. Or how is the consumer mean to infer the precision? The uint64 presumably means the timestamp value itself

westonpace pushed a commit that referenced this pull request Jul 11, 2024
…nTimestampTZ literals (#659)

BREAKING CHANGE: changes the message type for Literal PrecisionTimestamp
and PrecisionTimestampTZ

The PrecisionTimestamp and PrecisionTimestampTZ literals were introduced
in #594, and there was some discussion about their contents in
#594 (comment).
In their current form, they only contain a i64 value, which I believe is
meant to be a precision-dependent number of fractional seconds since
epoch. However, the literals don't contain the precision itself, so it's
impossible to interpret a PrecisionTimestamp or PrecisionTimestampTZ
literal without other information. This is in contrast to e.g. varchar,
whose literal does specify the length, or decimal, whose literal
specifies scale and precision. @westonpace curious for your thoughts
since you were part of that original discussion - am I missing something
or is this a bug?

This PR changes the Literal types for PrecisionTimestamp and
PrecisionTimestampTZ to contain a PrecisionTimestamp message instead of
a i64. That message then contains the i64 value as well as the i32 type.
This is a breaking change, but given in their current form these
literals aren't usable (unless I'm missing something), would that be
okay?

Currently I used the same message for both PrecisionTimestamp and
PrecisionTimestampTZ, but I can also make a copy for PTTZ if that'd be
better for forwards-compatibility.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Nanosecond precision in timestamps
5 participants