[FOLLOWUP] Clarify Variant specification details #457

Merged: 3 commits, Nov 6, 2024
17 changes: 12 additions & 5 deletions VariantEncoding.md
@@ -93,6 +93,7 @@ Next, is an `offset` list, which contains `dictionary_size + 1` values.
Each `offset` is a little-endian value of `offset_size` bytes, and represents the starting byte offset of the i-th string in `bytes`.
The first `offset` value will always be `0`, and the last `offset` value will always be the total length of `bytes`.
The last part of the metadata is `bytes`, which stores all the string values in the dictionary.
All string values must be UTF-8 encoded strings.
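
The dictionary layout described above (a size, then `dictionary_size + 1` little-endian offsets, then UTF-8 `bytes`) can be sketched in Python. This is an illustrative sketch, not the reference implementation: the function name `decode_dictionary` is hypothetical, and it assumes the caller has already read the header byte and passes in the resulting `offset_size`, with `buf` starting at `dictionary_size`.

```python
def decode_dictionary(buf: bytes, offset_size: int) -> list[str]:
    """Decode the metadata dictionary that follows the header byte.

    Assumptions (not from the spec text): `buf` starts at `dictionary_size`,
    and `offset_size` was already extracted from the header.
    """
    def read_uint(pos: int) -> int:
        # Every size/offset is a little-endian value of `offset_size` bytes.
        return int.from_bytes(buf[pos:pos + offset_size], "little")

    dictionary_size = read_uint(0)
    offsets_start = offset_size
    # The offset list holds dictionary_size + 1 values.
    offsets = [read_uint(offsets_start + i * offset_size)
               for i in range(dictionary_size + 1)]
    if offsets[0] != 0:
        raise ValueError("first offset must be 0")

    bytes_start = offsets_start + (dictionary_size + 1) * offset_size
    # The last offset is the total length of `bytes`.
    data = buf[bytes_start:bytes_start + offsets[-1]]
    if len(data) != offsets[-1]:
        raise ValueError("buffer is shorter than the last offset")

    # bytes.decode raises UnicodeDecodeError if a string is not valid UTF-8.
    return [data[offsets[i]:offsets[i + 1]].decode("utf-8")
            for i in range(dictionary_size)]
```

For example, with `offset_size = 1`, the buffer `bytes([2, 0, 1, 3]) + b"abc"` describes two strings, `"a"` and `"bc"`.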

## Metadata encoding grammar

@@ -107,7 +108,7 @@ offset_size_minus_one: 2-bit value providing the number of bytes per dictionary
dictionary_size: `offset_size` bytes. little-endian value indicating the number of strings in the dictionary
dictionary: <offset>* <bytes>
offset: `offset_size` bytes. little-endian value indicating the starting position of the ith string in `bytes`. The list should contain `dictionary_size + 1` values, where the last value is the total length of `bytes`.
bytes: dictionary string values
bytes: UTF-8 encoded dictionary string values
```

Notes:
@@ -209,7 +210,7 @@ The [primitive types table](#encoding-types) shows the encoding format for each

### Value Data for Short string (`basic_type`=1)

When `basic_type` is `1`, `value_data` is the sequence of bytes that represents the string.
When `basic_type` is `1`, `value_data` is the sequence of UTF-8 encoded bytes that represents the string.

### Value Data for Object (`basic_type`=2)

@@ -337,7 +338,7 @@ object_header: (is_large << 4 | field_id_size_minus_one << 2 | field_offset_size
array_header: (is_large << 2 | field_offset_size_minus_one)
value_data: <primitive_val> | <short_string_val> | <object_val> | <array_val>
primitive_val: see table for binary representation
short_string_val: bytes
short_string_val: UTF-8 encoded bytes
object_val: <num_elements> <field_id>* <field_offset>* <fields>
array_val: <num_elements> <field_offset>* <fields>
num_elements: a 1 or 4 byte little-endian value (depending on is_large in <object_header>/<array_header>)
@@ -403,11 +404,17 @@ The *Logical Type* column indicates logical equivalence of physically encoded ty
For example, a user expression operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding.
Similarly, user expressions operating on an *int8* value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.

# Field ID order and uniqueness
# String values must be UTF-8 encoded

All strings within the Variant binary format must be UTF-8 encoded.
This includes the dictionary key string values, the "short string" values, and the "long string" values.

# Object field ID order and uniqueness

For objects, field IDs and offsets must be listed in the order of the corresponding field names, sorted lexicographically.
Note that the fields themselves are not required to follow this order.
Note that the field values themselves are not required to follow this order.
As a result, offsets will not necessarily be listed in ascending order.
The field values are not required to be in the same order as the field IDs, to enable flexibility when constructing Variant values.

An implementation may rely on this field ID order in searching for field names.
E.g. a binary search on field IDs (combined with metadata lookups) may be used to find a field with a given field name.
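
As a rough illustration of the binary search mentioned above, the following Python sketch looks up a field name in an object's field ID list. The function name `find_field` and the in-memory list inputs are hypothetical, and comparing names by their UTF-8 bytes is an assumption about what "lexicographically" means here.

```python
def find_field(field_ids: list[int], dictionary: list[str], name: str) -> int:
    """Return the position of `name` in the object's field ID list, or -1.

    `field_ids` is assumed to be ordered by the field names the IDs refer to;
    `dictionary` is the decoded metadata dictionary.
    """
    target = name.encode("utf-8")
    lo, hi = 0, len(field_ids) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        # Metadata lookup: resolve the field ID to its name.
        key = dictionary[field_ids[mid]].encode("utf-8")
        if key == target:
            return mid
        if key < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```

The returned position indexes the offset list as well, so the field value can be located even though the values themselves are not stored in name order.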
20 changes: 17 additions & 3 deletions VariantShredding.md
@@ -91,7 +91,7 @@ optional group variant_col {
# Parquet Layout

The `array` and `object` fields represent Variant array and object types, respectively.
Arrays must use the three-level list structure described in https://github.com/apache/parquet-format/blob/master/LogicalTypes.md.
Arrays must use the three-level list structure described in [LogicalTypes.md](LogicalTypes.md).

An `object` field must be a group.
Each field name of this inner group corresponds to the Variant value's object field name.
@@ -143,6 +143,17 @@ There are two main motivations for including the `variant_value` column:
1) In a case where there are rare type mismatches (for example, a numeric field with rare strings like “n/a”), we allow the field to be shredded, which could still be a significant performance benefit compared to fetching and decoding the full value/metadata binary.
2) Since there is a single schema per file, there would be no easy way to recover from a type mismatch encountered late in a file write. Parquet files can be large, and buffering all file data before starting to write could be expensive. Including a variant column for every field guarantees we can adhere to the requested shredding schema.

# Top-level metadata

Any values stored in a shredded `variant_value` field may have dictionary IDs referring to the metadata.
There is one metadata value for the entire Variant record, and that is stored in the top-level `metadata` field.
This means that each `variant_value` in the shredded representation contains only the "value" portion of the [Variant Binary Encoding](VariantEncoding.md).

The metadata is kept at the top level, instead of being shredded alongside the shredded variant values, because:
* Simplified shredding scheme and specification. No need for additional struct-of-binary values, or custom concatenated binary scheme for `variant_value`.
* Simplified and good performance for write shredding. No need to rebuild the metadata, or re-encode IDs for `variant_value`.
* Simplified and good performance for Variant reconstruction. No need to re-encode IDs for `variant_value`.

# Data Skipping

Shredded columns are expected to store statistics in the same format as a normal Parquet column.
@@ -154,11 +165,14 @@ This specification is not strict about what values may be stored in `variant_val
# Shredding Semantics

Reconstruction of a Variant value from a shredded representation is not expected to produce a bit-for-bit identical binary to the original unshredded value.
For example, the order of fields in the binary may change, as may the physical representation of scalar values.
For example, in a reconstructed Variant value, the order of object field values may be different from the original binary.
This is allowed since the [Variant Binary Encoding](VariantEncoding.md#object-field-id-order-and-uniqueness) does not require an ordering of the field values, but the field IDs will still be ordered lexicographically according to the corresponding field names.

The physical representation of scalar values may also be different in the reconstructed Variant binary.
In particular, the [Variant Binary Encoding](VariantEncoding.md) considers all integer and decimal representations to represent a single logical type.
This flexibility enables shredding to be applicable in more scenarios, while maintaining all information and values losslessly.
As a result, it is valid to shred a decimal into a decimal column with a different scale, or to shred an integer as a decimal, as long as no numeric precision is lost.
For example, it would be valid to write the value 123 to a Decimal(9, 2) column, but the value 1.234 would need to be written to the **variant_value** column.
For example, it would be valid to write the value 123 to a Decimal(9, 2) column, but the value 1.234 would need to be written to the `variant_value` column.
When reconstructing, it would be valid for a reader to reconstruct 123 as an integer, or as a Decimal(9, 2).
Engines should not depend on the physical type of a Variant value, only the logical type.
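
The "no numeric precision is lost" rule can be illustrated with a small Python check. This is only a sketch under stated assumptions: the helper name `unscaled_if_lossless` is hypothetical, values are modelled with Python's `decimal.Decimal`, and inputs are assumed to fit within the default decimal context precision.

```python
from decimal import Decimal
from typing import Optional


def unscaled_if_lossless(value: Decimal, precision: int, scale: int) -> Optional[int]:
    """Return the unscaled integer for Decimal(precision, scale), or None.

    None means the value cannot be stored in that column without losing
    information, so a writer would fall back to `variant_value`.
    """
    shifted = value.scaleb(scale)  # value * 10**scale
    if shifted != shifted.to_integral_value():
        return None  # digits remain beyond the column's scale
    unscaled = int(shifted)
    if len(str(abs(unscaled))) > precision:
        return None  # too many digits for the column's precision
    return unscaled
```

With this check, 123 fits a Decimal(9, 2) column as the unscaled value 12300, while 1.234 does not fit and would be written to `variant_value`.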
