From f6fbf04b802c151f694c7726b2444d733f5e55e0 Mon Sep 17 00:00:00 2001 From: Jacob Marble Date: Fri, 29 Sep 2023 11:32:54 -0700 Subject: [PATCH 1/8] Spec: add types timstamp_ns and timestamptz_ns Helps #8657 This change embodies this design doc: https://docs.google.com/document/d/1bE1DcEGNzZAMiVJSZ0X1wElKLNkT9kRkk0hDlfkXzvU/edit --- format/spec.md | 74 ++++++++++++++++++++++++++++++++++---------------- 1 file changed, 50 insertions(+), 24 deletions(-) diff --git a/format/spec.md b/format/spec.md index 60c0f99c3f90..0b3c4ebac070 100644 --- a/format/spec.md +++ b/format/spec.md @@ -177,8 +177,10 @@ A **`map`** is a collection of key-value pairs with a key type and a value type. | **`decimal(P,S)`** | Fixed-point decimal; precision P, scale S | Scale is fixed [1], precision must be 38 or less | | **`date`** | Calendar date without timezone or time | | | **`time`** | Time of day without date, timezone | Microsecond precision [2] | -| **`timestamp`** | Timestamp without timezone | Microsecond precision [2] | -| **`timestamptz`** | Timestamp with timezone | Stored as UTC [2] | +| **`timestamp`** | Timestamp, with microsecond precision, without timezone | [4], [6] | +| **`timestamptz`** | Timestamp, with microsecond precision, with timezone | [4], [7] | +| **`timestamp_ns`** | Timestamp, with nanosecond precision, without timezone | [5], [6] | +| **`timestamptz_ns`** | Timestamp, with nanosecond precision, with timezone | [5], [7] | | **`string`** | Arbitrary-length character sequences | Encoded with UTF-8 [3] | | **`uuid`** | Universally unique identifiers | Should use 16-byte fixed | | **`fixed(L)`** | Fixed-length byte array of length L | | @@ -186,11 +188,13 @@ A **`map`** is a collection of key-value pairs with a key type and a value type. Notes: -1. Decimal scale is fixed and cannot be changed by schema evolution. Precision can only be widened. -2. All time and timestamp values are stored with microsecond precision. 
- - Timestamps _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical). - - Timestamps _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`). Timestamp values are stored as a long that encodes microseconds from the unix epoch. -3. Character strings must be stored as UTF-8 encoded byte arrays. +1. `decimal(P,S)` scale (`S`) is fixed, and cannot be changed by schema evolution. Precision (`P`) can only be widened by schema evolution. +2. `string` values must be stored as UTF-8 encoded byte arrays. +3. `date` and `time` values represent date and time of day, respectfully, _without time zone_. +4. `time`, `timestamp`, and `timestamptz` values are represented with _microsecond precision_. Storage formats must retain at least this precision, but may retain higher. +5. `timestamp_ns` and `timstamptz_ns` values are represented with _nanosecond precision_. Storage formats must retain this precision. +6. Timestamp values _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`). +7. Timestamp values _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical). For details on how to serialize a schema to JSON, see Appendix C. @@ -307,12 +311,12 @@ Partition specs capture the transform from table data to partition values. 
This | Transform name | Description | Source types | Result type | |-------------------|--------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-------------| | **`identity`** | Source value, unmodified | Any | Source type | -| **`bucket[N]`** | Hash of value, mod `N` (see below) | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `string`, `uuid`, `fixed`, `binary` | `int` | +| **`bucket[N]`** | Hash of value, mod `N` (see below) | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` | | **`truncate[W]`** | Value truncated to width `W` (see below) | `int`, `long`, `decimal`, `string` | Source type | -| **`year`** | Extract a date or timestamp year, as years from 1970 | `date`, `timestamp`, `timestamptz` | `int` | -| **`month`** | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp`, `timestamptz` | `int` | -| **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp`, `timestamptz` | `int` | -| **`hour`** | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | `timestamp`, `timestamptz` | `int` | +| **`year`** | Extract a date or timestamp year, as years from 1970 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` | +| **`month`** | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` | +| **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` | +| **`hour`** | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` | | **`void`** | Always produces `null` | Any | Source type or `int` | All 
transforms must return `null` for a `null` input value. @@ -862,10 +866,12 @@ Maps with non-string keys must use an array representation with the `map` logica |**`float`**|`float`|| |**`double`**|`double`|| |**`decimal(P,S)`**|`{ "type": "fixed",`
  `"size": minBytesRequired(P),`
  `"logicalType": "decimal",`
  `"precision": P,`
  `"scale": S }`|Stored as fixed using the minimum number of bytes for the given precision.| -|**`date`**|`{ "type": "int",`
  `"logicalType": "date" }`|Stores days from the 1970-01-01.| +|**`date`**|`{ "type": "int",`
  `"logicalType": "date" }`|Stores days from 1970-01-01.| |**`time`**|`{ "type": "long",`
  `"logicalType": "time-micros" }`|Stores microseconds from midnight.| -|**`timestamp`**|`{ "type": "long",`
  `"logicalType": "timestamp-micros",`
  `"adjust-to-utc": false }`|Stores microseconds from 1970-01-01 00:00:00.000000.| -|**`timestamptz`**|`{ "type": "long",`
  `"logicalType": "timestamp-micros",`
  `"adjust-to-utc": true }`|Stores microseconds from 1970-01-01 00:00:00.000000 UTC.| +|**`timestamp`** | `{ "type": "long",`
  `"logicalType": "timestamp-micros",`
  `"adjust-to-utc": false }` | Stores microseconds from 1970-01-01 00:00:00.000000. [1] | +|**`timestamptz`** | `{ "type": "long",`
  `"logicalType": "timestamp-micros",`
  `"adjust-to-utc": true }` | Stores microseconds from 1970-01-01 00:00:00.000000 UTC. [1] | +|**`timestamp_ns`** | `{ "type": "long",`
  `"logicalType": "timestamp-nanos",`
  `"adjust-to-utc": false }` | Stores nanoseconds from 1970-01-01 00:00:00.000000000. [1], [2] | +|**`timestamptz_ns`** | `{ "type": "long",`
  `"logicalType": "timestamp-nanos",`
  `"adjust-to-utc": true }` | Stores nanoseconds from 1970-01-01 00:00:00.000000000 UTC. [1], [2] | |**`string`**|`string`|| |**`uuid`**|`{ "type": "fixed",`
  `"size": 16,`
  `"logicalType": "uuid" }`|| |**`fixed(L)`**|`{ "type": "fixed",`
  `"size": L }`|| @@ -874,6 +880,11 @@ Maps with non-string keys must use an array representation with the `map` logica |**`list`**|`array`|| |**`map`**|`array` of key-value records, or `map` when keys are strings (optional).|Array storage must use logical type name `map` and must store elements that are 2-field records. The first field is a non-null key and the second field is the value.| +Notes: + +1. Avro type annotation `adjust-to-utc` is an Iceberg convention; default value is `false` if not present. +2. Avro logical type `timestamp-nanos` is an Iceberg convention; the Avro specification does not define this type. + **Field IDs** @@ -908,10 +919,12 @@ Lists must use the [3-level representation](https://github.com/apache/parquet-fo | **`float`** | `float` | | | | **`double`** | `double` | | | | **`decimal(P,S)`** | `P <= 9`: `int32`,
`P <= 18`: `int64`,
`fixed` otherwise | `DECIMAL(P,S)` | Fixed must use the minimum number of bytes that can store `P`. | -| **`date`** | `int32` | `DATE` | Stores days from the 1970-01-01. | +| **`date`** | `int32` | `DATE` | Stores days from 1970-01-01. | | **`time`** | `int64` | `TIME_MICROS` with `adjustToUtc=false` | Stores microseconds from midnight. | | **`timestamp`** | `int64` | `TIMESTAMP_MICROS` with `adjustToUtc=false` | Stores microseconds from 1970-01-01 00:00:00.000000. | | **`timestamptz`** | `int64` | `TIMESTAMP_MICROS` with `adjustToUtc=true` | Stores microseconds from 1970-01-01 00:00:00.000000 UTC. | +| **`timestamp_ns`** | `int64` | `TIMESTAMP_NANOS` with `adjustToUtc=false` | Stores nanoseconds from 1970-01-01 00:00:00.000000000. | +| **`timestamptz_ns`** | `int64` | `TIMESTAMP_NANOS` with `adjustToUtc=true` | Stores nanoseconds from 1970-01-01 00:00:00.000000000 UTC. | | **`string`** | `binary` | `UTF8` | Encoding must be UTF-8. | | **`uuid`** | `fixed_len_byte_array[16]` | `UUID` | | | **`fixed(L)`** | `fixed_len_byte_array[L]` | | | @@ -935,8 +948,10 @@ Lists must use the [3-level representation](https://github.com/apache/parquet-fo | **`decimal(P,S)`** | `decimal` | | | | **`date`** | `date` | | | | **`time`** | `long` | `iceberg.long-type`=`TIME` | Stores microseconds from midnight. | -| **`timestamp`** | `timestamp` | | [1] | -| **`timestamptz`** | `timestamp_instant` | | [1] | +| **`timestamp`** | `timestamp` | | Stores microseconds from 2015-01-01 00:00:00.000000. [1], [2] | +| **`timestamptz`** | `timestamp_instant` | | Stores microseconds from 2015-01-01 00:00:00.000000 UTC. [1], [2] | +| **`timestamp_ns`** | `timestamp` | | Stores nanoseconds from 2015-01-01 00:00:00.000000000. [1] | +| **`timestamptz_ns`** | `timestamp_instant` | | Stores nanoseconds from 2015-01-01 00:00:00.000000000 UTC. [1] | | **`string`** | `string` | | ORC `varchar` and `char` would also map to **`string`**. 
| | **`uuid`** | `binary` | `iceberg.binary-type`=`UUID` | | | **`fixed(L)`** | `binary` | `iceberg.binary-type`=`FIXED` & `iceberg.length`=`L` | The length would not be checked by the ORC reader and should be checked by the adapter. | @@ -948,6 +963,7 @@ Lists must use the [3-level representation](https://github.com/apache/parquet-fo Notes: 1. ORC's [TimestampColumnVector](https://orc.apache.org/api/hive-storage-api/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.html) consists of a time field (milliseconds since epoch) and a nanos field (nanoseconds within the second). Hence the milliseconds within the second are reported twice; once in the time field and again in the nanos field. The read adapter should only use milliseconds within the second from one of these fields. The write adapter should also report milliseconds within the second twice; once in the time field and again in the nanos field. ORC writer is expected to correctly consider millis information from one of the fields. More details at https://issues.apache.org/jira/browse/ORC-546 +2. ORC `timestamp` and `timestamp_instant` values store nanosecond precision. Iceberg ORC writers for Iceberg types `timestamp` and `timestamptz` truncate nanoseconds to microseconds. One of the interesting challenges with this is how to map Iceberg’s schema evolution (id based) on to ORC’s (name based). In theory, we could use Iceberg’s column ids as the column and field names, but that would be inconvenient. 
@@ -971,8 +987,10 @@ The 32-bit hash implementation is 32-bit Murmur3 hash, x86 variant, seeded with | **`decimal(P,S)`** | `hashBytes(minBigEndian(unscaled(v)))`[2] | `14.20` → `-500754589` | | **`date`** | `hashInt(daysFromUnixEpoch(v))` | `2017-11-16` → `-653330422` | | **`time`** | `hashLong(microsecsFromMidnight(v))` | `22:31:08` → `-662762989` | -| **`timestamp`** | `hashLong(microsecsFromUnixEpoch(v))` | `2017-11-16T22:31:08` → `-2047944441` | -| **`timestamptz`** | `hashLong(microsecsFromUnixEpoch(v))` | `2017-11-16T14:31:08-08:00`→ `-2047944441` | +| **`timestamp`** | `hashLong(microsecsFromUnixEpoch(v))` | `2017-11-16T22:31:08` → `-2047944441`
`2017-11-16T22:31:08.000001` → `-1207196810` | +| **`timestamptz`** | `hashLong(microsecsFromUnixEpoch(v))` | `2017-11-16T14:31:08-08:00` → `-2047944441`
`2017-11-16T14:31:08.000001-08:00` → `-1207196810` | +| **`timestamp_ns`** | `hashLong(nanosecsFromUnixEpoch(v))` | `2017-11-16T22:31:08` → `-737750069`
`2017-11-16T22:31:08.000001` → `-976603392`
`2017-11-16T22:31:08.000000001` → `-160215926` | +| **`timestamptz_ns`** | `hashLong(nanosecsFromUnixEpoch(v))` | `2017-11-16T14:31:08-08:00` → `-737750069`
`2017-11-16T14:31:08.000001-08:00` → `-976603392`
`2017-11-16T14:31:08.000000001-08:00` → `-160215926` | | **`string`** | `hashBytes(utf8Bytes(v))` | `iceberg` → `1210000089` | | **`uuid`** | `hashBytes(uuidBytes(v))` [3] | `f79c3e09-677c-4bbd-a479-3f349cb785e7` → `1488055340` | | **`fixed(L)`** | `hashBytes(v)` | `00 01 02 03` → `-188683207` | @@ -1018,8 +1036,10 @@ Types are serialized according to this table: |**`double`**|`JSON string: "double"`|`"double"`| |**`date`**|`JSON string: "date"`|`"date"`| |**`time`**|`JSON string: "time"`|`"time"`| -|**`timestamp without zone`**|`JSON string: "timestamp"`|`"timestamp"`| -|**`timestamp with zone`**|`JSON string: "timestamptz"`|`"timestamptz"`| +|**`timestamp, microseconds, without zone`**|`JSON string: "timestamp"`|`"timestamp"`| +|**`timestamp, microseconds, with zone`**|`JSON string: "timestamptz"`|`"timestamptz"`| +|**`timestamp, nanoseconds, without zone`**|`JSON string: "timestamp_ns"`|`"timestamp_ns"`| +|**`timestamp, nanoseconds, with zone`**|`JSON string: "timestamptz_ns"`|`"timestamptz_ns"`| |**`string`**|`JSON string: "string"`|`"string"`| |**`uuid`**|`JSON string: "uuid"`|`"uuid"`| |**`fixed(L)`**|`JSON string: "fixed[]"`|`"fixed[16]"`| @@ -1179,8 +1199,10 @@ This serialization scheme is for storing single values as individual binary valu | **`double`** | Stored as 8-byte little-endian | | **`date`** | Stores days from the 1970-01-01 in an 4-byte little-endian int | | **`time`** | Stores microseconds from midnight in an 8-byte little-endian long | -| **`timestamp without zone`** | Stores microseconds from 1970-01-01 00:00:00.000000 in an 8-byte little-endian long | -| **`timestamp with zone`** | Stores microseconds from 1970-01-01 00:00:00.000000 UTC in an 8-byte little-endian long | +| **`timestamp`** | Stores microseconds from 1970-01-01 00:00:00.000000 in an 8-byte little-endian long | +| **`timestamptz`** | Stores microseconds from 1970-01-01 00:00:00.000000 UTC in an 8-byte little-endian long | +| **`timestamp_ns`** | Stores nanoseconds from 
1970-01-01 00:00:00.000000000 in an 8-byte little-endian long | +| **`timestamptz_ns`** | Stores nanoseconds from 1970-01-01 00:00:00.000000000 UTC in an 8-byte little-endian long | | **`string`** | UTF-8 bytes (without length) | | **`uuid`** | 16-byte big-endian value, see example in Appendix B | | **`fixed(L)`** | Binary value | @@ -1206,6 +1228,8 @@ This serialization scheme is for storing single values as individual binary valu | **`time`** | **`JSON string`** | `"22:31:08.123456"` | Stores ISO-8601 standard time with microsecond precision | | **`timestamp`** | **`JSON string`** | `"2017-11-16T22:31:08.123456"` | Stores ISO-8601 standard timestamp with microsecond precision; must not include a zone offset | | **`timestamptz`** | **`JSON string`** | `"2017-11-16T22:31:08.123456+00:00"` | Stores ISO-8601 standard timestamp with microsecond precision; must include a zone offset and it must be '+00:00' | +| **`timestamp_ns`** | **`JSON string`** | `"2017-11-16T22:31:08.123456789"` | Stores ISO-8601 standard timestamp with nanosecond precision; must not include a zone offset | +| **`timestamptz_ns`** | **`JSON string`** | `"2017-11-16T22:31:08.123456789+00:00"` | Stores ISO-8601 standard timestamp with nanosecond precision; must include a zone offset and it must be '+00:00' | | **`string`** | **`JSON string`** | `"iceberg"` | | | **`uuid`** | **`JSON string`** | `"f79c3e09-677c-4bbd-a479-3f349cb785e7"` | Stores the lowercase uuid string | | **`fixed(L)`** | **`JSON string`** | `"000102ff"` | Stored as a hexadecimal string | @@ -1223,6 +1247,8 @@ Default values are added to struct fields in v3. * The `write-default` is a forward-compatible change because it is only used at write time. Old writers will fail because the field is missing. * Tables with `initial-default` will be read correctly by older readers if `initial-default` is always null for optional fields. Otherwise, old readers will default optional columns with null. 
Old readers will fail to read required fields which are populated by `initial-default` because that default is not supported. +Types `timestamp_ns` and `timestamptz_ns` are added in v3. + ### Version 2 Writing v1 metadata: From 6e515ec3cf02632f9f292f4ac0c3422bb82ad2db Mon Sep 17 00:00:00 2001 From: Jacob Marble Date: Wed, 11 Oct 2023 13:59:13 -0700 Subject: [PATCH 2/8] chore: integrate review feedback --- format/spec.md | 20 +++++++++----------- 1 file changed, 9 insertions(+), 11 deletions(-) diff --git a/format/spec.md b/format/spec.md index 0b3c4ebac070..5c281fc816ca 100644 --- a/format/spec.md +++ b/format/spec.md @@ -177,10 +177,10 @@ A **`map`** is a collection of key-value pairs with a key type and a value type. | **`decimal(P,S)`** | Fixed-point decimal; precision P, scale S | Scale is fixed [1], precision must be 38 or less | | **`date`** | Calendar date without timezone or time | | | **`time`** | Time of day without date, timezone | Microsecond precision [2] | -| **`timestamp`** | Timestamp, with microsecond precision, without timezone | [4], [6] | -| **`timestamptz`** | Timestamp, with microsecond precision, with timezone | [4], [7] | -| **`timestamp_ns`** | Timestamp, with nanosecond precision, without timezone | [5], [6] | -| **`timestamptz_ns`** | Timestamp, with nanosecond precision, with timezone | [5], [7] | +| **`timestamp`** | Timestamp, microsecond precision, without timezone | [2] | +| **`timestamptz`** | Timestamp, microsecond precision, with timezone | [2] | +| **`timestamp_ns`** | Timestamp, nanosecond precision, without timezone | [2] | +| **`timestamptz_ns`** | Timestamp, nanosecond precision, with timezone | [2] | | **`string`** | Arbitrary-length character sequences | Encoded with UTF-8 [3] | | **`uuid`** | Universally unique identifiers | Should use 16-byte fixed | | **`fixed(L)`** | Fixed-length byte array of length L | | @@ -188,13 +188,11 @@ A **`map`** is a collection of key-value pairs with a key type and a value type. Notes: -1. 
`decimal(P,S)` scale (`S`) is fixed, and cannot be changed by schema evolution. Precision (`P`) can only be widened by schema evolution. -2. `string` values must be stored as UTF-8 encoded byte arrays. -3. `date` and `time` values represent date and time of day, respectfully, _without time zone_. -4. `time`, `timestamp`, and `timestamptz` values are represented with _microsecond precision_. Storage formats must retain at least this precision, but may retain higher. -5. `timestamp_ns` and `timstamptz_ns` values are represented with _nanosecond precision_. Storage formats must retain this precision. -6. Timestamp values _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`). -7. Timestamp values _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical). +1. Decimal scale is fixed and cannot be changed by schema evolution. Precision can only be widened. +2. `time`, `timestamp`, and `timestamptz` values are represented with _microsecond precision_. `timestamp_ns` and `timestamptz_ns` values are represented with _nanosecond precision_. + - Timestamp values _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical). + - Timestamp values _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`). +3. Character strings must be stored as UTF-8 encoded byte arrays. For details on how to serialize a schema to JSON, see Appendix C. 
From 150b6a9e19a45c0741c3d1cd563e92708983bf2d Mon Sep 17 00:00:00 2001 From: Jacob Marble Date: Mon, 16 Oct 2023 10:09:08 -0700 Subject: [PATCH 3/8] spec: ORC must truncate microsecond timestamps --- format/spec.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/format/spec.md b/format/spec.md index 5c281fc816ca..e7672855340c 100644 --- a/format/spec.md +++ b/format/spec.md @@ -961,7 +961,7 @@ Lists must use the [3-level representation](https://github.com/apache/parquet-fo Notes: 1. ORC's [TimestampColumnVector](https://orc.apache.org/api/hive-storage-api/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.html) consists of a time field (milliseconds since epoch) and a nanos field (nanoseconds within the second). Hence the milliseconds within the second are reported twice; once in the time field and again in the nanos field. The read adapter should only use milliseconds within the second from one of these fields. The write adapter should also report milliseconds within the second twice; once in the time field and again in the nanos field. ORC writer is expected to correctly consider millis information from one of the fields. More details at https://issues.apache.org/jira/browse/ORC-546 -2. ORC `timestamp` and `timestamp_instant` values store nanosecond precision. Iceberg ORC writers for Iceberg types `timestamp` and `timestamptz` truncate nanoseconds to microseconds. +2. ORC `timestamp` and `timestamp_instant` values store nanosecond precision. Iceberg ORC writers for Iceberg types `timestamp` and `timestamptz` **must** truncate nanoseconds to microseconds. One of the interesting challenges with this is how to map Iceberg’s schema evolution (id based) on to ORC’s (name based). In theory, we could use Iceberg’s column ids as the column and field names, but that would be inconvenient. 
From 8f75adb9f787afa17385d3ac39d7de4a558fca69 Mon Sep 17 00:00:00 2001 From: Jacob Marble Date: Mon, 16 Oct 2023 10:11:58 -0700 Subject: [PATCH 4/8] spec: more clarity that ns timestamps are v3 only --- format/spec.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/format/spec.md b/format/spec.md index e7672855340c..bff5b646c0e8 100644 --- a/format/spec.md +++ b/format/spec.md @@ -179,8 +179,8 @@ A **`map`** is a collection of key-value pairs with a key type and a value type. | **`time`** | Time of day without date, timezone | Microsecond precision [2] | | **`timestamp`** | Timestamp, microsecond precision, without timezone | [2] | | **`timestamptz`** | Timestamp, microsecond precision, with timezone | [2] | -| **`timestamp_ns`** | Timestamp, nanosecond precision, without timezone | [2] | -| **`timestamptz_ns`** | Timestamp, nanosecond precision, with timezone | [2] | +| **`timestamp_ns`** | Timestamp, nanosecond precision, without timezone | [2], [4] | +| **`timestamptz_ns`** | Timestamp, nanosecond precision, with timezone | [2], [4] | | **`string`** | Arbitrary-length character sequences | Encoded with UTF-8 [3] | | **`uuid`** | Universally unique identifiers | Should use 16-byte fixed | | **`fixed(L)`** | Fixed-length byte array of length L | | @@ -193,6 +193,7 @@ Notes: - Timestamp values _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical). - Timestamp values _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`). 3. Character strings must be stored as UTF-8 encoded byte arrays. +4. `timestamp_ns` and `timstamptz_ns` are only supported in v3 tables. For details on how to serialize a schema to JSON, see Appendix C. 
From 1abd02aba37dddedfa826c6c255cf6bd069ec6ed Mon Sep 17 00:00:00 2001 From: Jacob Marble Date: Fri, 27 Oct 2023 09:20:04 -0700 Subject: [PATCH 5/8] spec: even more clarity about ns timestamps in v3 only --- format/spec.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/format/spec.md b/format/spec.md index bff5b646c0e8..9e4667294d35 100644 --- a/format/spec.md +++ b/format/spec.md @@ -179,8 +179,8 @@ A **`map`** is a collection of key-value pairs with a key type and a value type. | **`time`** | Time of day without date, timezone | Microsecond precision [2] | | **`timestamp`** | Timestamp, microsecond precision, without timezone | [2] | | **`timestamptz`** | Timestamp, microsecond precision, with timezone | [2] | -| **`timestamp_ns`** | Timestamp, nanosecond precision, without timezone | [2], [4] | -| **`timestamptz_ns`** | Timestamp, nanosecond precision, with timezone | [2], [4] | +| **`timestamp_ns`** | Timestamp, nanosecond precision, without timezone | only supported in v3 tables [2] | +| **`timestamptz_ns`** | Timestamp, nanosecond precision, with timezone | only supported in v3 tables [2] | | **`string`** | Arbitrary-length character sequences | Encoded with UTF-8 [3] | | **`uuid`** | Universally unique identifiers | Should use 16-byte fixed | | **`fixed(L)`** | Fixed-length byte array of length L | | @@ -193,7 +193,6 @@ Notes: - Timestamp values _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical). - Timestamp values _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`). 3. Character strings must be stored as UTF-8 encoded byte arrays. -4. `timestamp_ns` and `timstamptz_ns` are only supported in v3 tables. 
For details on how to serialize a schema to JSON, see Appendix C. From e81df5543b82ac30dd4bb7f38b22e58f43f71a51 Mon Sep 17 00:00:00 2001 From: Jacob Marble Date: Fri, 27 Oct 2023 16:56:54 -0700 Subject: [PATCH 6/8] chore: add version column to markdown table --- format/spec.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/format/spec.md b/format/spec.md index 9e4667294d35..8bdea9bcaa18 100644 --- a/format/spec.md +++ b/format/spec.md @@ -167,8 +167,8 @@ A **`map`** is a collection of key-value pairs with a key type and a value type. #### Primitive Types -| Primitive type | Description | Requirements | -|--------------------|--------------------------------------------------------------------------|--------------------------------------------------| +| Primitive type | Description | Requirements | Iceberg Version | +|--------------------|--------------------------------------------------------------------------|--------------------------------------------------|-----------------| | **`boolean`** | True or false | | | **`int`** | 32-bit signed integers | Can promote to `long` | | **`long`** | 64-bit signed integers | | @@ -179,8 +179,8 @@ A **`map`** is a collection of key-value pairs with a key type and a value type. 
| **`time`** | Time of day without date, timezone | Microsecond precision [2] | | **`timestamp`** | Timestamp, microsecond precision, without timezone | [2] | | **`timestamptz`** | Timestamp, microsecond precision, with timezone | [2] | -| **`timestamp_ns`** | Timestamp, nanosecond precision, without timezone | only supported in v3 tables [2] | -| **`timestamptz_ns`** | Timestamp, nanosecond precision, with timezone | only supported in v3 tables [2] | +| **`timestamp_ns`** | Timestamp, nanosecond precision, without timezone | [2] | [v3](#version-3) | +| **`timestamptz_ns`** | Timestamp, nanosecond precision, with timezone | [2] | [v3](#version-3) | | **`string`** | Arbitrary-length character sequences | Encoded with UTF-8 [3] | | **`uuid`** | Universally unique identifiers | Should use 16-byte fixed | | **`fixed(L)`** | Fixed-length byte array of length L | | From 34e06fd81f24e87f35f7fcdcbbc226b3fdc20157 Mon Sep 17 00:00:00 2001 From: Jacob Marble Date: Sat, 28 Oct 2023 16:49:44 -0700 Subject: [PATCH 7/8] chore: clarify minimum, not only, iceberg version --- format/spec.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/format/spec.md b/format/spec.md index 8bdea9bcaa18..08bac7cb5f4b 100644 --- a/format/spec.md +++ b/format/spec.md @@ -167,7 +167,7 @@ A **`map`** is a collection of key-value pairs with a key type and a value type. #### Primitive Types -| Primitive type | Description | Requirements | Iceberg Version | +| Primitive type | Description | Requirements | Valid From | |--------------------|--------------------------------------------------------------------------|--------------------------------------------------|-----------------| | **`boolean`** | True or false | | | **`int`** | 32-bit signed integers | Can promote to `long` | From 0caa0e8e8019c62b394be3f8e560876183c12f38 Mon Sep 17 00:00:00 2001 From: Ryan Blue Date: Tue, 31 Oct 2023 09:14:26 -0700 Subject: [PATCH 8/8] Clarify version where types were added. 
--- format/spec.md | 38 ++++++++++++++++++++------------------ 1 file changed, 20 insertions(+), 18 deletions(-) diff --git a/format/spec.md b/format/spec.md index 08bac7cb5f4b..855db29f569b 100644 --- a/format/spec.md +++ b/format/spec.md @@ -167,24 +167,26 @@ A **`map`** is a collection of key-value pairs with a key type and a value type. #### Primitive Types -| Primitive type | Description | Requirements | Valid From | -|--------------------|--------------------------------------------------------------------------|--------------------------------------------------|-----------------| -| **`boolean`** | True or false | | -| **`int`** | 32-bit signed integers | Can promote to `long` | -| **`long`** | 64-bit signed integers | | -| **`float`** | [32-bit IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) floating point | Can promote to double | -| **`double`** | [64-bit IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) floating point | | -| **`decimal(P,S)`** | Fixed-point decimal; precision P, scale S | Scale is fixed [1], precision must be 38 or less | -| **`date`** | Calendar date without timezone or time | | -| **`time`** | Time of day without date, timezone | Microsecond precision [2] | -| **`timestamp`** | Timestamp, microsecond precision, without timezone | [2] | -| **`timestamptz`** | Timestamp, microsecond precision, with timezone | [2] | -| **`timestamp_ns`** | Timestamp, nanosecond precision, without timezone | [2] | [v3](#version-3) | -| **`timestamptz_ns`** | Timestamp, nanosecond precision, with timezone | [2] | [v3](#version-3) | -| **`string`** | Arbitrary-length character sequences | Encoded with UTF-8 [3] | -| **`uuid`** | Universally unique identifiers | Should use 16-byte fixed | -| **`fixed(L)`** | Fixed-length byte array of length L | | -| **`binary`** | Arbitrary-length byte array | | +Supported primitive types are defined in the table below. 
Primitive types added after v1 have an "added by" version that is the first spec version in which the type is allowed. For example, nanosecond-precision timestamps are part of the v3 spec; using v3 types in v1 or v2 tables can break forward compatibility. + +| Added by version | Primitive type | Description | Requirements | +|------------------|--------------------|--------------------------------------------------------------------------|--------------------------------------------------| +| | **`boolean`** | True or false | | +| | **`int`** | 32-bit signed integers | Can promote to `long` | +| | **`long`** | 64-bit signed integers | | +| | **`float`** | [32-bit IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) floating point | Can promote to double | +| | **`double`** | [64-bit IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) floating point | | +| | **`decimal(P,S)`** | Fixed-point decimal; precision P, scale S | Scale is fixed [1], precision must be 38 or less | +| | **`date`** | Calendar date without timezone or time | | +| | **`time`** | Time of day without date, timezone | Microsecond precision [2] | +| | **`timestamp`** | Timestamp, microsecond precision, without timezone | [2] | +| | **`timestamptz`** | Timestamp, microsecond precision, with timezone | [2] | +| [v3](#version-3) | **`timestamp_ns`** | Timestamp, nanosecond precision, without timezone | [2] | +| [v3](#version-3) | **`timestamptz_ns`** | Timestamp, nanosecond precision, with timezone | [2] | +| | **`string`** | Arbitrary-length character sequences | Encoded with UTF-8 [3] | +| | **`uuid`** | Universally unique identifiers | Should use 16-byte fixed | +| | **`fixed(L)`** | Fixed-length byte array of length L | | +| | **`binary`** | Arbitrary-length byte array | | Notes: