Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spec: add nanosecond timestamp types #8683

Merged
merged 8 commits into from
Oct 31, 2023

Conversation

jacobmarble
Copy link
Contributor

Helps #8657

This change embodies this design doc:
https://docs.google.com/document/d/1bE1DcEGNzZAMiVJSZ0X1wElKLNkT9kRkk0hDlfkXzvU/edit

My IDE reformatted the Markdown tables edited. I'm happy to revert those if needed.

I also resorted the tables of types, so that date/time types land after other primitive types, and before nested types.

@Fokko Fokko added the Specification Issues that may introduce spec changes. label Oct 2, 2023
@jacobmarble
Copy link
Contributor Author

@Fokko - friendly reminder to review

@rdblue
Copy link
Contributor

rdblue commented Oct 5, 2023

@jacobmarble, I was talking with @danielcweeks a couple days ago about v3 additions and we came up with a good reason to add timestamp_ms and timestamptz_ms. Currently, type promotion is limited to very specific changes, but we think that we will need to open that up in v3. As long as we are open to promoting types, it makes sense to let people promote long to timestamp of some kind.

Now that we are adding timestamps with different units, this gives us a way to fix a long-standing problem: older data that used millisecond timestamps as longs couldn't be read as timestamps because we had no way to know the unit of the long (millis or micros). Now that we have separate types, we can use the type to carry that information, so that you can promote a long to timestamp_ms or timestamp and we know how to interpret the value.

What do you think? If you agree, then we should probably add ms types at the same time.

@jacobmarble
Copy link
Contributor Author

What do you think? If you agree, then we should probably add ms types at the same time.

Sounds good to me. I'll add milliseconds to this PR.

@jacobmarble jacobmarble requested a review from rdblue October 9, 2023 16:08
@jacobmarble jacobmarble changed the title Spec: add types timstamp_ns and timestamptz_ns Spec: add nanosecond and millisecond timestamp types Oct 9, 2023
@jacobmarble jacobmarble force-pushed the jgm-timestamp-nanos-spec branch 2 times, most recently from 449c5ff to ed36cfb Compare October 11, 2023 17:13
@jacobmarble jacobmarble changed the title Spec: add nanosecond and millisecond timestamp types Spec: add nanosecond timestamp types Oct 11, 2023
@jacobmarble
Copy link
Contributor Author

In the Iceberg community sync today, it was decided that we will not be adding millisecond timestamps after all.

Current status of this PR:

  • millisecond timestamps removed
  • note added to v3 Appendix
  • rebased and squashed
  • type promotion not included
    • someone with better context will follow up in a separate issue

@Fokko do you think we're ready to merge?

@Fokko
Copy link
Contributor

Fokko commented Oct 11, 2023

@jacobmarble Yes, this looks good to me. Thanks for joining the community sync. Could you avoid reformatting the table? History is especially important in the spec.

@jacobmarble jacobmarble force-pushed the jgm-timestamp-nanos-spec branch from ed36cfb to f6fbf04 Compare October 11, 2023 19:51
@jacobmarble
Copy link
Contributor Author

@jacobmarble Yes, this looks good to me. Thanks for joining the community sync. Could you avoid reformatting the table? History is especially important in the spec.

Done.

Notes:

1. Avro type annotation `adjust-to-utc` is an Iceberg convention; default value is `false` if not present.
2. Avro logical type `timestamp-nanos` is an Iceberg convention; the Avro specification does not define this type.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary? Shouldn't we just go add this type to Avro, or use an Avro version?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the Google doc I already suggested adding this to the Avro spec: https://docs.google.com/document/d/1bE1DcEGNzZAMiVJSZ0X1wElKLNkT9kRkk0hDlfkXzvU/edit?disco=AAAA5YbF34Y

Let me sent out an email on the Avro devlist

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fokko please drop a link to Avro devlist post when you have a moment. The mailing list link on the Avro community page is just an email address, can't find the web UI.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we should update that page, that's not helping anyone. You can easily see the devlist on ponymail: https://lists.apache.org/[email protected]

On the bottom-left you can subscribe to the devlist. You can find my message here: https://lists.apache.org/thread/s985tc2m6bj5lvck017q06zkt3dg1f31 or the PR apache/avro#2554. Feel free to chime in anywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue @Fokko I agree that we should ask this type be added to Avro.

It will be months before this change is added to Avro, then supported by dependent projects. Do you propose that:

  • We wait for the Avro spec to change before modifying the Iceberg spec?
  • Adjust the Iceberg spec language to reflect the intention to switch to the Avro type when it becomes available?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since most Iceberg implementations have their own Avro readers, I think we're fine with running this in parallel. I don't see any huge issues on the Avro side (maybe some discussion around the physical type), but I would not consider that blocking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fokko I understand; do I need to change this PR, then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on what comes out of the Avro PR, but I think we're good for now.

format/spec.md Outdated
@@ -948,6 +961,7 @@ Lists must use the [3-level representation](https://github.com/apache/parquet-fo
Notes:

1. ORC's [TimestampColumnVector](https://orc.apache.org/api/hive-storage-api/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.html) consists of a time field (milliseconds since epoch) and a nanos field (nanoseconds within the second). Hence the milliseconds within the second are reported twice; once in the time field and again in the nanos field. The read adapter should only use milliseconds within the second from one of these fields. The write adapter should also report milliseconds within the second twice; once in the time field and again in the nanos field. ORC writer is expected to correctly consider millis information from one of the fields. More details at https://issues.apache.org/jira/browse/ORC-546
2. ORC `timestamp` and `timestamp_instant` values store nanosecond precision. Iceberg ORC writers for Iceberg types `timestamp` and `timestamptz` truncate nanoseconds to microseconds.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this should state that Iceberg ORC writers MUST truncate nanoseconds to microseconds. Otherwise someone may store nanosecond values, which would break equality and have implications when casting from timestamp to timestamp_ns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. PTAL 150b6a9

| **`timestamp`** | `hashLong(microsecsFromUnixEpoch(v))` | `2017-11-16T22:31:08` → `-2047944441`<br />`2017-11-16T22:31:08.000001` → `-1207196810` |
| **`timestamptz`** | `hashLong(microsecsFromUnixEpoch(v))` | `2017-11-16T14:31:08-08:00` → `-2047944441`<br />`2017-11-16T14:31:08.000001-08:00` → `-1207196810` |
| **`timestamp_ns`** | `hashLong(nanosecsFromUnixEpoch(v))` | `2017-11-16T22:31:08` → `-737750069`<br />`2017-11-16T22:31:08.000001` → `-976603392`<br />`2017-11-16T22:31:08.000000001` → `-160215926` |
| **`timestamptz_ns`** | `hashLong(nanosecsFromUnixEpoch(v))` | `2017-11-16T14:31:08-08:00` → `-737750069`<br />`2017-11-16T14:31:08.000001-08:00` → `-976603392`<br />`2017-11-16T14:31:08.000000001-08:00` → `-160215926` |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add tests for these cases? We need to make sure that the actual code produces the given values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I did generate these values with code tests, which I did not add to this PR. This change is just a spec change, following this comment.

Would you like some code to change in this PR? If so, I could remove the milliseconds details in #8658 then add those commits to this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added the unit tests to #8658 : bc50195

format/spec.md Outdated
| **`timestamp`** | Timestamp, microsecond precision, without timezone | [2] |
| **`timestamptz`** | Timestamp, microsecond precision, with timezone | [2] |
| **`timestamp_ns`** | Timestamp, nanosecond precision, without timezone | [2] |
| **`timestamptz_ns`** | Timestamp, nanosecond precision, with timezone | [2] |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we need to update this table to show that timestamp_ns and timestamptz_ns are only allowed in v3 tables.

Copy link
Contributor Author

@jacobmarble jacobmarble Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. PTAL 8f75adb

@jacobmarble jacobmarble requested review from Fokko and rdblue October 16, 2023 17:24
@jacobmarble
Copy link
Contributor Author

@rdblue @Fokko what concerns remain regarding this change?

format/spec.md Outdated
3. Character strings must be stored as UTF-8 encoded byte arrays.
4. `timestamp_ns` and `timstamptz_ns` are only supported in v3 tables.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this needs to be a column in the table rather than a footnote. It just isn't very obvious that you can't use the ns types in v2 otherwise. That's my only concern with this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like Ryan's suggestion of having a full new column. We can expect more types to be added so that a column would make sense to me, instead of adding it to the requirements column.

Copy link
Contributor Author

@jacobmarble jacobmarble Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I missed that detail. How's this? e81df55

@jacobmarble jacobmarble requested a review from rdblue October 27, 2023 16:22
@findepi
Copy link
Member

findepi commented Oct 27, 2023

@jacobmarble indeed, thanks!
do we expect any type promotions being allowed around the new types being added here?

@jacobmarble
Copy link
Contributor Author

@jacobmarble indeed, thanks! do we expect any type promotions being allowed around the new types being added here?

@findepi when we discussed in the last community meeting, that was discussed. We agreed to scope this PR, and #8657 to just adding the type, and let type promotion be a separate issue.

@rdblue
Copy link
Contributor

rdblue commented Oct 28, 2023

@findepi, for type promotion we decided to always interpret a promoted long as a millisecond value and the convert to either micros or nanos depending on the Iceberg type. Since we don't have a way to know what a long represents, there are problems no matter what we do and that's the best way to get predictable behavior. Otherwise, type promotion (i.e. timestamp_ms to a higher precision value) could corrupt data values that were previously working.

format/spec.md Outdated
| **`timestamp`** | Timestamp, microsecond precision, without timezone | [2] |
| **`timestamptz`** | Timestamp, microsecond precision, with timezone | [2] |
| **`timestamp_ns`** | Timestamp, nanosecond precision, without timezone | [2] | [v3](#version-3) |
| **`timestamptz_ns`** | Timestamp, nanosecond precision, with timezone | [2] | [v3](#version-3) |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is clear enough. I think the version column should come first and be "Valid from" or similar. "Iceberg version" could be confused with the library version and we need to make it clear what this version means: that you are not allowed to use it in tables until that version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL. I could also fill in the other rows with v1.

@rdblue rdblue merged commit da55503 into apache:main Oct 31, 2023
@rdblue
Copy link
Contributor

rdblue commented Oct 31, 2023

@jacobmarble, I updated the table to be more clear about when types can be used and merge this. Thanks!

@jacobmarble jacobmarble deleted the jgm-timestamp-nanos-spec branch October 31, 2023 16:23
jacobmarble added a commit to jacobmarble/apache-iceberg that referenced this pull request Oct 31, 2023
Helps apache#8657

This change adds field `ChronoUnit unit` to `TimestampType`, such that
`TimestampType` now represents four specified types:
- `timestamp` (existing)
- `timestamptz` (existing)
- `timestamp_ns` (new apache#8683)
- `timestamptz_ns` (new apache#8683)
jacobmarble added a commit to jacobmarble/apache-iceberg that referenced this pull request Nov 1, 2023
Helps apache#8657

This change adds field `ChronoUnit unit` to `TimestampType`, such that
`TimestampType` now represents four specified types:
- `timestamp` (existing)
- `timestamptz` (existing)
- `timestamp_ns` (new apache#8683)
- `timestamptz_ns` (new apache#8683)
jacobmarble added a commit to jacobmarble/apache-iceberg that referenced this pull request Nov 1, 2023
Helps apache#8657

This change adds field `ChronoUnit unit` to `TimestampType`, such that
`TimestampType` now represents four specified types:
- `timestamp` (existing)
- `timestamptz` (existing)
- `timestamp_ns` (new apache#8683)
- `timestamptz_ns` (new apache#8683)
jacobmarble added a commit to jacobmarble/apache-iceberg that referenced this pull request Nov 8, 2023
Helps apache#8657

This change adds field `ChronoUnit unit` to `TimestampType`, such that
`TimestampType` now represents four specified types:
- `timestamp` (existing)
- `timestamptz` (existing)
- `timestamp_ns` (new apache#8683)
- `timestamptz_ns` (new apache#8683)

Note that TimestampType.with[out]Zone() are marked as deprecated in this
change. In future PRs, I'll remove usage of these static methods.
epgif pushed a commit to jacobmarble/apache-iceberg that referenced this pull request Jan 29, 2024
Helps apache#8657

This change adds field `ChronoUnit unit` to `TimestampType`, such that
`TimestampType` now represents four specified types:
- `timestamp` (existing)
- `timestamptz` (existing)
- `timestamp_ns` (new apache#8683)
- `timestamptz_ns` (new apache#8683)

Note that TimestampType.with[out]Zone() are marked as deprecated in this
change. In future PRs, I'll remove usage of these static methods.
epgif pushed a commit to jacobmarble/apache-iceberg that referenced this pull request Jan 30, 2024
Helps apache#8657

This change adds field `ChronoUnit unit` to `TimestampType`, such that
`TimestampType` now represents four specified types:
- `timestamp` (existing)
- `timestamptz` (existing)
- `timestamp_ns` (new apache#8683)
- `timestamptz_ns` (new apache#8683)

Note that TimestampType.with[out]Zone() are marked as deprecated in this
change. In future PRs, I'll remove usage of these static methods.
epgif pushed a commit to jacobmarble/apache-iceberg that referenced this pull request Jan 30, 2024
Helps apache#8657

This change adds field `ChronoUnit unit` to `TimestampType`, such that
`TimestampType` now represents four specified types:
- `timestamp` (existing)
- `timestamptz` (existing)
- `timestamp_ns` (new apache#8683)
- `timestamptz_ns` (new apache#8683)

Note that TimestampType.with[out]Zone() are marked as deprecated in this
change. In future PRs, I'll remove usage of these static methods.
epgif pushed a commit to jacobmarble/apache-iceberg that referenced this pull request Jan 30, 2024
Helps apache#8657

This change adds field `ChronoUnit unit` to `TimestampType`, such that
`TimestampType` now represents four specified types:
- `timestamp` (existing)
- `timestamptz` (existing)
- `timestamp_ns` (new apache#8683)
- `timestamptz_ns` (new apache#8683)

Note that TimestampType.with[out]Zone() are marked as deprecated in this
change. In future PRs, I'll remove usage of these static methods.
epgif pushed a commit to jacobmarble/apache-iceberg that referenced this pull request Feb 14, 2024
Helps apache#8657

This change adds field `TimestampType.Unit unit` to `TimestampType`,
such that `TimestampType` now represents four specified types:
- `timestamp` (existing)
- `timestamptz` (existing)
- `timestamp_ns` (new apache#8683)
- `timestamptz_ns` (new apache#8683)

Note that TimestampType.with[out]Zone() are marked as deprecated in this
change. In future PRs, I'll remove usage of these static methods.
epgif pushed a commit to jacobmarble/apache-iceberg that referenced this pull request Feb 14, 2024
Helps apache#8657

This change adds field `TimestampType.Unit unit` to `TimestampType`,
such that `TimestampType` now represents four specified types:
- `timestamp` (existing)
- `timestamptz` (existing)
- `timestamp_ns` (new apache#8683)
- `timestamptz_ns` (new apache#8683)

Note that TimestampType.with[out]Zone() are marked as deprecated in this
change. In future PRs, I'll remove usage of these static methods.
epgif pushed a commit to jacobmarble/apache-iceberg that referenced this pull request Feb 20, 2024
Helps apache#8657

This change adds field `TimestampType.Unit unit` to `TimestampType`,
such that `TimestampType` now represents four specified types:
- `timestamp` (existing)
- `timestamptz` (existing)
- `timestamp_ns` (new apache#8683)
- `timestamptz_ns` (new apache#8683)

Note that TimestampType.with[out]Zone() are marked as deprecated in this
change. In future PRs, I'll remove usage of these static methods.
epgif pushed a commit to jacobmarble/apache-iceberg that referenced this pull request Feb 20, 2024
Helps apache#8657

This change adds field `TimestampType.Unit unit` to `TimestampType`,
such that `TimestampType` now represents four specified types:
- `timestamp` (existing)
- `timestamptz` (existing)
- `timestamp_ns` (new apache#8683)
- `timestamptz_ns` (new apache#8683)

Note that TimestampType.with[out]Zone() are marked as deprecated in this
change. In future PRs, I'll remove usage of these static methods.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Specification Issues that may introduce spec changes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants