Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handle nested struct in schemaString #257

Merged
merged 2 commits into from
Jun 21, 2024

Conversation

nicklan
Copy link
Collaborator

@nicklan nicklan commented Jun 11, 2024

Handle parsing nested structs in schemaString. Setting iceberg compat causes this. Also enable the iceberg_compat_v1 test for dat, since we do now support column mapping.

Closes #253 .

Copy link
Collaborator

@hntd187 hntd187 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I never like these catch all serialization things, but seems like in this case it's somewhat of a necessary thing.

@nicklan nicklan requested a review from zachschuermann June 17, 2024 22:10
@@ -62,9 +62,7 @@ pub fn sort_record_batch(batch: RecordBatch) -> DeltaResult<RecordBatch> {
Ok(RecordBatch::try_new(batch.schema(), columns)?)
}

static SKIPPED_TESTS: &[&str; 2] = &[
// iceberg compat requires column mapping
"iceberg_compat_v1",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yay!

// }
// TODO: Perhaps make this more restrictive, the PROTOCOL seems to indicate that this should
// always be a `String->Int` map
Object(serde_json::Map<String, serde_json::Value>),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So then IIUC this is letting column/structfield metadata also include arbitrary string->json? Is there any API here other than the ability to just ask for the metadata back?

Not that i'm against it, just thinking of some implications: this makes serde_json::Map/Value part of the public API right?

We could also take the extreme approach and do Json(serde_json::Value)? or do the most specific of Map<Int, String>. or keep the in-between currently in the PR.

This comment is mostly questions/ideas I've thought of - I think most would be reasonable here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does indeed make it part of the public API. I don't think this is exactly correct, but I think there's some work to be done to spec out in the protocol exactly what can go here.

This basically allows:

{
  "key": [number],
  "key": [string],
  "key": [bool],
  "key": {
      [pretty much any object]
   }
}

Allowing serde_json::Map<String, serde_json::Value> is basically equivalent to "anything in a json object" because json objects are just string -> Value maps. So what we're disallowing here is:

{
  "key": [1,2,3], // array
  "key": null
}

and that's it. I can't find use for that in the spec so I slightly prefer the more restrictive form. We can relax this later without breaking things, but it's harder to go from more loose to tighter.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we disallowing array/null? Isn't there serde_json::Value::Null and serde_json::Value::Array?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it just means the top level of enum there has to be an object, what is contained in the map can technically be anything serde_json::Value can take

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we disallowing array/null? Isn't there serde_json::Value::Null and serde_json::Value::Array?

We are only disallowing it at the top level. Inside the Object variant you could have anything, but at the top we only deser into the types I listed above.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec only says metadata is a "JSON map" and AFAIK delta-spark (spark, really) represents it as a String -> Any map (see Metadata.scala). So technically even Null and Array values would be allowed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, I've updated this to now just have an Other variant that will catch anything we don't understand, plus a comment explaining why that's okay.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i misunderstood yay thanks lgtm :)

@nicklan nicklan requested a review from zachschuermann June 18, 2024 17:57
@nicklan nicklan force-pushed the handle-schema-from-icebergv2compat branch from 633ba5c to 4f43d3b Compare June 19, 2024 20:29
@nicklan nicklan requested a review from scovich June 19, 2024 20:30
@nicklan nicklan merged commit be69dcf into delta-io:main Jun 21, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MetadataValue schema doesn't support nested values, used by IcebergCompatV2 protocol
4 participants