[VARIANT] Accept variantType RFC #4096
Conversation
Is it safe to finalize this RFC when the parquet spec is not finalized yet?
> **Important**
> This specification is still under active development, and has not been formally adopted.

I guess this RFC actually references the Spark variant spec, not the Parquet variant spec -- even though it has a whole section about "Variant data in Parquet"?
Do we need commentary in the Delta spec about how it relates to those two different specifications?
PROTOCOL.md (outdated):

```
## Writer Requirements for Variant Data Type

When Variant type is supported (`writerFeatures` field of a table's `protocol` action contains `variantType`), writers:
- must write a column of type `variant` to parquet as a struct containing the fields `value` and `metadata` and storing values that conform to the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md)
- must not write additional, non-ignorable parquet struct fields. Writing additional struct fields with names starting with `_` (underscore) is allowed.
- must not write additional parquet struct fields.
```
Is there a particular reason we need to forbid extra fields in this specific case?
Normally the Delta spec just says readers should ignore any unknown fields they might encounter, because the lack of a table feature to protect such fields means they do not affect correctness.
I don't think there's any particular reason. I think we can follow other features, i.e. modify this to say that readers can ignore fields that aren't `metadata` or `value`.
@gene-db or @cashmand do you see any issues with this? I figure this should be ok - the shredding table feature later will specify the "important" columns for shredding
One reason to fail is that shredding adds a `typed_value` column. If a shredding-unaware reader doesn't fail when it encounters a column other than `value` and `metadata`, it could incorrectly read `value` and `metadata`, which might look valid, but would not contain the full value.
@cashmand, in this case, the `VariantShredding` table feature would be enabled on the table though, right? So a shredding-unaware reader won't be able to read the table in the first place.
Maybe I'm missing something...
That's true. Is it possible that a customer could create a delta table using external parquet files, and we wouldn't know that we need the shredding feature? It seems safer to me for a reader to fail.
@scovich There's a large PR to change the shredding spec that is removing both of these: apache/parquet-format#461.
The spec has simplified the fields, renaming `untyped_value` to just `value` in order to make unshredded variant a special case of shredded variant; the PR also removes the wording about underscores.
@bart-samwel's suggestion to specifically call out `typed_value` sounds reasonable to me.
Agree -- if we know of a common failure mode where invalid data could be silently accepted, it's probably worth calling that out here. Especially if `value` no longer disappears to make it obvious when shredding is in use.
Alternatively -- should we take it up a level of abstraction? Require the Delta client to check for shredded files, and to reject them unless the `variantShredding` feature is also enabled?
I think the simplest way to clarify this spec is to call out the `typed_value` field.
As for checking for shredded files, that would require opening every file footer. That is too high of an overhead for this requirement, right?
Agree that requiring clients to check every footer is too much. Enough to say what the rules are, and the reader can decide on their own how to handle the possibility of non-conforming writes.
Makes sense - I updated the wording to specifically call out the `typed_value` field in light of shredding:

"- must not write a parquet struct field named `typed_value`, to avoid confusion with the field of the same name required by Variant shredding."

I chose `must` rather than `should` because I figure it's safer to have stronger wording here. But let me know if anyone thinks there should be any tweaks!
```
# Variant Data Type

This feature enables support for the Variant data type, for storing semi-structured data.
The schema serialization method is described in [Schema Serialization Format](#schema-serialization-format).
```
It seems a bit odd to specifically mention the schema serialization format here? AFAIK it didn't change with this table feature?
Meanwhile, it seems like we should specifically state that the new data type is called `variant`, rather than relying on the example below to show it?
Ah -- the schema serialization section adds a new entry for variant, which explains that the type name is `variant`.
"elementType" : { | ||
"type" : "variant" | ||
}, | ||
"containsNull" : false |
According to the variant spec, a variant column could store a JSON `null` value even if the variant field is non-nullable, as in this example.
How is that handled? Does the Delta spec need to say anything about it, or is that the query engine's problem, independent of Delta?
Hmm, yeah, I think this is less a storage layer issue and more of an engine issue. That should be addressed in the actual binary layout of variant (which is considered in the parquet spec).
PROTOCOL.md (outdated):

```
The parquet struct must include the two struct fields `value` and `metadata`.
Supported writers must write the two binary fields, and supported readers must read the two binary fields.

Variant shredding will be introduced in a separate `variantShredding` table feature.
```
Should this reference to variant shredding link to the parquet variant shredding spec? Or is that overkill?
I think it makes sense to keep it separate for now, because I imagine the parquet variant shredding spec may contain more information than is necessary for Delta (i.e. the parquet spec currently contains information for the binary encoding, which we don't include here).
I'd let @gene-db make the call here, though. Gene, do you have an opinion?
Maybe a better way to put it -- should we give some indication here of what "shredding" even is? (a hyperlink to the spec being perhaps the least intrusive approach)?
I think it would make sense to link to the parquet shredding spec.
```diff
- Variant shredding will be introduced in a separate `variantShredding` table feature.
+ [Variant shredding](https://github.com/apache/parquet-format/blob/master/VariantShredding.md) will be introduced in a separate `variantShredding` table feature.
```
With the link, people can refer to it to get more details about the shredding, but we don't have to put those details in the text here.
LGTM
Which Delta project/connector is this regarding?

Description

Moves the variant type RFC into the accepted folder and inlines it into `PROTOCOL.md`. Variant has been used in production systems for many months now and has also been added as a logical type in parquet here. Additionally, Spark has a robust implementation of the variant type.
Additionally, specifies that variant shredding will be a different table feature, and removes the portion stating that struct fields starting with `_` are ignored.

How was this patch tested?

Manually tested that links work.

Does this PR introduce any user-facing changes?