From dff0b3e6f02ed28e6b0753e921d53de53e63f506 Mon Sep 17 00:00:00 2001 From: Gene Pang <77996944+gene-db@users.noreply.github.com> Date: Wed, 6 Nov 2024 09:59:24 -0800 Subject: [PATCH] GH-459: Add Variant logical type annotation (#460) * Add Variant as logical type * Update shredding to use * Revert changes to the spec * Update logical types Variant details * Clarify BYTE_ARRAY and remove underscore fields --- LogicalTypes.md | 36 ++++++++++++++++++++++++++++++++++ src/main/thrift/parquet.thrift | 8 ++++++++ 2 files changed, 44 insertions(+) diff --git a/LogicalTypes.md b/LogicalTypes.md index 1b7d5c262..3aa5ceb96 100644 --- a/LogicalTypes.md +++ b/LogicalTypes.md @@ -563,6 +563,42 @@ defined by the [BSON specification][bson-spec]. The sort order used for `BSON` is unsigned byte-wise comparison. +### VARIANT + +`VARIANT` is used for a Variant value. It must annotate a group. The group must +contain a field named `metadata` and a field named `value`. Both fields must have +type `binary`, which is also called `BYTE_ARRAY` in the Parquet thrift definition. +The `VARIANT` annotated group can be used to store either an unshredded Variant +value, or a shredded Variant value. + +* The Variant group must be annotated with the `VARIANT` logical type. +* Both fields `value` and `metadata` must be of type `binary` (called `BYTE_ARRAY` + in the Parquet thrift definition). +* The `metadata` field is required and must be a valid Variant metadata component, + as defined by the [Variant binary encoding specification](VariantEncoding.md). +* When present, the `value` field must be a valid Variant value component, + as defined by the [Variant binary encoding specification](VariantEncoding.md). +* The `value` field is required for unshredded Variant values. +* The `value` field is optional and may be null only when parts of the Variant + value are shredded according to the [Variant shredding specification](VariantShredding.md). + +This is the expected representation of an unshredded Variant in Parquet: +``` +optional group variant_unshredded (VARIANT) { + required binary metadata; + required binary value; +} +``` + +This is an example representation of a shredded Variant in Parquet: +``` +optional group variant_shredded (VARIANT) { + required binary metadata; + optional binary value; + optional int64 typed_value; +} +``` + ## Nested Types This section specifies how `LIST` and `MAP` can be used to encode nested types diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift index 83457fe29..5d4431d9e 100644 --- a/src/main/thrift/parquet.thrift +++ b/src/main/thrift/parquet.thrift @@ -380,6 +380,12 @@ struct JsonType { struct BsonType { } +/** + * Embedded Variant logical type annotation + */ +struct VariantType { +} + /** * LogicalType annotations to replace ConvertedType. * @@ -410,6 +416,7 @@ union LogicalType { 13: BsonType BSON // use ConvertedType BSON 14: UUIDType UUID // no compatible ConvertedType 15: Float16Type FLOAT16 // no compatible ConvertedType + 16: VariantType VARIANT // no compatible ConvertedType } /** @@ -980,6 +987,7 @@ union ColumnOrder { * ENUM - unsigned byte-wise comparison * LIST - undefined * MAP - undefined + * VARIANT - undefined * * In the absence of logical types, the sort order is determined by the physical type: * BOOLEAN - false, true