-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spec: add nanosecond timestamp types #8683
Conversation
@Fokko - friendly reminder to review |
@jacobmarble, I was talking with @danielcweeks a couple days ago about v3 additions and we came up with a good reason to add Now that we are adding timestamps with different units, this gives us a way to fix a long-standing problem: older data that used millisecond timestamps as longs couldn't be read as timestamps because we had no way to know the unit of the What do you think? If you agree, then we should probably add |
Sounds good to me. I'll add milliseconds to this PR. |
449c5ff
to
ed36cfb
Compare
In the Iceberg community sync today, it was decided that we will not be adding millisecond timestamps after all. Current status of this PR:
@Fokko do you think we're ready to merge? |
@jacobmarble Yes, this looks good to me. Thanks for joining the community sync. Could you avoid reformatting the table? History is especially important in the spec. |
Helps apache#8657 This change embodies this design doc: https://docs.google.com/document/d/1bE1DcEGNzZAMiVJSZ0X1wElKLNkT9kRkk0hDlfkXzvU/edit
ed36cfb
to
f6fbf04
Compare
Done. |
Notes: | ||
|
||
1. Avro type annotation `adjust-to-utc` is an Iceberg convention; default value is `false` if not present. | ||
2. Avro logical type `timestamp-nanos` is an Iceberg convention; the Avro specification does not define this type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary? Shouldn't we just go add this type to Avro, or use an Avro version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the Google doc I already suggested adding this to the Avro spec: https://docs.google.com/document/d/1bE1DcEGNzZAMiVJSZ0X1wElKLNkT9kRkk0hDlfkXzvU/edit?disco=AAAA5YbF34Y
Let me sent out an email on the Avro devlist
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fokko please drop a link to Avro devlist post when you have a moment. The mailing list link on the Avro community page is just an email address, can't find the web UI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, we should update that page, that's not helping anyone. You can easily see the devlist on ponymail: https://lists.apache.org/[email protected]
On the bottom-left you can subscribe to the devlist. You can find my message here: https://lists.apache.org/thread/s985tc2m6bj5lvck017q06zkt3dg1f31 or the PR apache/avro#2554. Feel free to chime in anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdblue @Fokko I agree that we should ask this type be added to Avro.
It will be months before this change is added to Avro, then supported by dependent projects. Do you propose that:
- We wait for the Avro spec to change before modifying the Iceberg spec?
- Adjust the Iceberg spec language to reflect the intention to switch to the Avro type when it becomes available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since most Iceberg implementations have their own Avro readers, I think we're fine with running this in parallel. I don't see any huge issues on the Avro side (maybe some discussion around the physical type), but I would not consider that blocking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fokko I understand; do I need to change this PR, then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends on what comes out of the Avro PR, but I think we're good for now.
format/spec.md
Outdated
@@ -948,6 +961,7 @@ Lists must use the [3-level representation](https://github.com/apache/parquet-fo | |||
Notes: | |||
|
|||
1. ORC's [TimestampColumnVector](https://orc.apache.org/api/hive-storage-api/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.html) consists of a time field (milliseconds since epoch) and a nanos field (nanoseconds within the second). Hence the milliseconds within the second are reported twice; once in the time field and again in the nanos field. The read adapter should only use milliseconds within the second from one of these fields. The write adapter should also report milliseconds within the second twice; once in the time field and again in the nanos field. ORC writer is expected to correctly consider millis information from one of the fields. More details at https://issues.apache.org/jira/browse/ORC-546 | |||
2. ORC `timestamp` and `timestamp_instant` values store nanosecond precision. Iceberg ORC writers for Iceberg types `timestamp` and `timestamptz` truncate nanoseconds to microseconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this should state that Iceberg ORC writers MUST truncate nanoseconds to microseconds. Otherwise someone may store nanosecond values, which would break equality and have implications when casting from timestamp to timestamp_ns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. PTAL 150b6a9
| **`timestamp`** | `hashLong(microsecsFromUnixEpoch(v))` | `2017-11-16T22:31:08` → `-2047944441`<br />`2017-11-16T22:31:08.000001` → `-1207196810` | | ||
| **`timestamptz`** | `hashLong(microsecsFromUnixEpoch(v))` | `2017-11-16T14:31:08-08:00` → `-2047944441`<br />`2017-11-16T14:31:08.000001-08:00` → `-1207196810` | | ||
| **`timestamp_ns`** | `hashLong(nanosecsFromUnixEpoch(v))` | `2017-11-16T22:31:08` → `-737750069`<br />`2017-11-16T22:31:08.000001` → `-976603392`<br />`2017-11-16T22:31:08.000000001` → `-160215926` | | ||
| **`timestamptz_ns`** | `hashLong(nanosecsFromUnixEpoch(v))` | `2017-11-16T14:31:08-08:00` → `-737750069`<br />`2017-11-16T14:31:08.000001-08:00` → `-976603392`<br />`2017-11-16T14:31:08.000000001-08:00` → `-160215926` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add tests for these cases? We need to make sure that the actual code produces the given values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I did generate these values with code tests, which I did not add to this PR. This change is just a spec change, following this comment.
Would you like some code to change in this PR? If so, I could remove the milliseconds details in #8658 then add those commits to this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
format/spec.md
Outdated
| **`timestamp`** | Timestamp, microsecond precision, without timezone | [2] | | ||
| **`timestamptz`** | Timestamp, microsecond precision, with timezone | [2] | | ||
| **`timestamp_ns`** | Timestamp, nanosecond precision, without timezone | [2] | | ||
| **`timestamptz_ns`** | Timestamp, nanosecond precision, with timezone | [2] | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we need to update this table to show that timestamp_ns and timestamptz_ns are only allowed in v3 tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. PTAL 8f75adb
format/spec.md
Outdated
3. Character strings must be stored as UTF-8 encoded byte arrays. | ||
4. `timestamp_ns` and `timstamptz_ns` are only supported in v3 tables. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this needs to be a column in the table rather than a footnote. It just isn't very obvious that you can't use the ns types in v2 otherwise. That's my only concern with this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like Ryan's suggestion of having a full new column. We can expect more types to be added so that a column would make sense to me, instead of adding it to the requirements
column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I missed that detail. How's this? e81df55
@jacobmarble indeed, thanks! |
@findepi when we discussed in the last community meeting, that was discussed. We agreed to scope this PR, and #8657 to just adding the type, and let type promotion be a separate issue. |
@findepi, for type promotion we decided to always interpret a promoted long as a millisecond value and the convert to either micros or nanos depending on the Iceberg type. Since we don't have a way to know what a long represents, there are problems no matter what we do and that's the best way to get predictable behavior. Otherwise, type promotion (i.e. timestamp_ms to a higher precision value) could corrupt data values that were previously working. |
format/spec.md
Outdated
| **`timestamp`** | Timestamp, microsecond precision, without timezone | [2] | | ||
| **`timestamptz`** | Timestamp, microsecond precision, with timezone | [2] | | ||
| **`timestamp_ns`** | Timestamp, nanosecond precision, without timezone | [2] | [v3](#version-3) | | ||
| **`timestamptz_ns`** | Timestamp, nanosecond precision, with timezone | [2] | [v3](#version-3) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is clear enough. I think the version column should come first and be "Valid from" or similar. "Iceberg version" could be confused with the library version and we need to make it clear what this version means: that you are not allowed to use it in tables until that version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL. I could also fill in the other rows with v1.
@jacobmarble, I updated the table to be more clear about when types can be used and merge this. Thanks! |
Helps apache#8657 This change adds field `ChronoUnit unit` to `TimestampType`, such that `TimestampType` now represents four specified types: - `timestamp` (existing) - `timestamptz` (existing) - `timestamp_ns` (new apache#8683) - `timestamptz_ns` (new apache#8683)
Helps apache#8657 This change adds field `ChronoUnit unit` to `TimestampType`, such that `TimestampType` now represents four specified types: - `timestamp` (existing) - `timestamptz` (existing) - `timestamp_ns` (new apache#8683) - `timestamptz_ns` (new apache#8683)
Helps apache#8657 This change adds field `ChronoUnit unit` to `TimestampType`, such that `TimestampType` now represents four specified types: - `timestamp` (existing) - `timestamptz` (existing) - `timestamp_ns` (new apache#8683) - `timestamptz_ns` (new apache#8683)
Helps apache#8657 This change adds field `ChronoUnit unit` to `TimestampType`, such that `TimestampType` now represents four specified types: - `timestamp` (existing) - `timestamptz` (existing) - `timestamp_ns` (new apache#8683) - `timestamptz_ns` (new apache#8683) Note that TimestampType.with[out]Zone() are marked as deprecated in this change. In future PRs, I'll remove usage of these static methods.
Helps apache#8657 This change adds field `ChronoUnit unit` to `TimestampType`, such that `TimestampType` now represents four specified types: - `timestamp` (existing) - `timestamptz` (existing) - `timestamp_ns` (new apache#8683) - `timestamptz_ns` (new apache#8683) Note that TimestampType.with[out]Zone() are marked as deprecated in this change. In future PRs, I'll remove usage of these static methods.
Helps apache#8657 This change adds field `ChronoUnit unit` to `TimestampType`, such that `TimestampType` now represents four specified types: - `timestamp` (existing) - `timestamptz` (existing) - `timestamp_ns` (new apache#8683) - `timestamptz_ns` (new apache#8683) Note that TimestampType.with[out]Zone() are marked as deprecated in this change. In future PRs, I'll remove usage of these static methods.
Helps apache#8657 This change adds field `ChronoUnit unit` to `TimestampType`, such that `TimestampType` now represents four specified types: - `timestamp` (existing) - `timestamptz` (existing) - `timestamp_ns` (new apache#8683) - `timestamptz_ns` (new apache#8683) Note that TimestampType.with[out]Zone() are marked as deprecated in this change. In future PRs, I'll remove usage of these static methods.
Helps apache#8657 This change adds field `ChronoUnit unit` to `TimestampType`, such that `TimestampType` now represents four specified types: - `timestamp` (existing) - `timestamptz` (existing) - `timestamp_ns` (new apache#8683) - `timestamptz_ns` (new apache#8683) Note that TimestampType.with[out]Zone() are marked as deprecated in this change. In future PRs, I'll remove usage of these static methods.
Helps apache#8657 This change adds field `TimestampType.Unit unit` to `TimestampType`, such that `TimestampType` now represents four specified types: - `timestamp` (existing) - `timestamptz` (existing) - `timestamp_ns` (new apache#8683) - `timestamptz_ns` (new apache#8683) Note that TimestampType.with[out]Zone() are marked as deprecated in this change. In future PRs, I'll remove usage of these static methods.
Helps apache#8657 This change adds field `TimestampType.Unit unit` to `TimestampType`, such that `TimestampType` now represents four specified types: - `timestamp` (existing) - `timestamptz` (existing) - `timestamp_ns` (new apache#8683) - `timestamptz_ns` (new apache#8683) Note that TimestampType.with[out]Zone() are marked as deprecated in this change. In future PRs, I'll remove usage of these static methods.
Helps apache#8657 This change adds field `TimestampType.Unit unit` to `TimestampType`, such that `TimestampType` now represents four specified types: - `timestamp` (existing) - `timestamptz` (existing) - `timestamp_ns` (new apache#8683) - `timestamptz_ns` (new apache#8683) Note that TimestampType.with[out]Zone() are marked as deprecated in this change. In future PRs, I'll remove usage of these static methods.
Helps apache#8657 This change adds field `TimestampType.Unit unit` to `TimestampType`, such that `TimestampType` now represents four specified types: - `timestamp` (existing) - `timestamptz` (existing) - `timestamp_ns` (new apache#8683) - `timestamptz_ns` (new apache#8683) Note that TimestampType.with[out]Zone() are marked as deprecated in this change. In future PRs, I'll remove usage of these static methods.
Helps #8657
This change embodies this design doc:
https://docs.google.com/document/d/1bE1DcEGNzZAMiVJSZ0X1wElKLNkT9kRkk0hDlfkXzvU/edit
My IDE reformatted the Markdown tables edited. I'm happy to revert those if needed.
I also resorted the tables of types, so that date/time types land after other primitive types, and before nested types.