
[Kernel] Add utility to filter for columns in a schema #4151

Merged: 4 commits into delta-io:master on Feb 14, 2025

Conversation

vkorukanti (Collaborator)

Description

This utility method can be used to filter for columns with invariants or a certain data type (timestamp_ntz, variant, etc.) to implicitly identify the table features that should be enabled.
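For illustration, a minimal sketch of how a caller might use this (the filterRecursively name, parameter order, and package path here are inferred from the review thread below; treat them as assumptions rather than the confirmed API):

import io.delta.kernel.internal.util.SchemaUtils;
import io.delta.kernel.types.StructType;
import io.delta.kernel.types.TimestampNTZType;

// Sketch: decide whether the timestampNtz table feature is needed by
// checking for a timestamp_ntz column at any nesting level.
static boolean needsTimestampNtzFeature(StructType schema) {
  return !SchemaUtils.filterRecursively(
          schema,
          true /* recurseIntoMapOrArrayElements */,
          false /* stopOnFirstMatch */,
          field -> field.getDataType() instanceof TimestampNTZType)
      .isEmpty();
}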

How was this patch tested?

Unit tests.

@scottsand-db (Collaborator) left a comment:


I left a suggestion that I think would simplify the test logic. Besides that, LGTM!

assert(results.size() === expectedColumns.size)

// Helper function to get the expected `StructField` based on a column path
def expectedStructField(columnPath: String): StructField = {

scottsand-db (Collaborator):

So, I did find expectedStructField and findStructFieldInResults to be a bit confusing.

I have a different idea: either in SchemaUtils or just in this test suite, it seems like we could create a function that takes a schema and flattens it to a list of path -> StructField. I'm flexible on whether the path is a list of strings, or just the column names escaped with backticks as needed and joined by ".".

This would be useful for doing lookups in a schema. We would create it by calling recurseIntoComplexTypes with recurseIntoMapOrArrayElements = true and stopOnFirstMatch = false, and an f that always returns true.

So flattenedSchema = <invoke this method>, and then we have our val results = filterRecursively. Then, for each expected column, we can look it up in flattenedSchema, assert that the actual results contain it, and compare the two.

I think this would really simplify the test code. wdyt?
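
A rough sketch of the helper this comment proposes (flattenSchema is a hypothetical name; the filterRecursively parameters and the Tuple2 result shape are taken from this thread, not from the confirmed API):

import java.util.Map;
import java.util.stream.Collectors;
import io.delta.kernel.internal.util.SchemaUtils;
import io.delta.kernel.types.StructField;
import io.delta.kernel.types.StructType;

// Hypothetical helper: flatten a schema into dotted-path -> StructField
// entries by accepting every field during the recursive walk.
static Map<String, StructField> flattenSchema(StructType schema) {
  return SchemaUtils.filterRecursively(
          schema,
          true /* recurseIntoMapOrArrayElements */,
          false /* stopOnFirstMatch */,
          field -> true) // f always returns true, so every field is kept
      .stream()
      .collect(Collectors.toMap(
          entry -> String.join(".", entry._1), // path segments joined by "."
          entry -> entry._2));
}

Each expected column in the test could then be looked up by its dotted path and compared against the corresponding entry in the actual results.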

vkorukanti (Collaborator Author):

Good idea. Simplified even further. Also changed the flattenNestedColumnNames to use the new generic method.

@@ -191,36 +216,89 @@ public static int findColIndex(StructType schema, String colName) {
* will get flattened to: "a", "a.1", "a.2", "b", "c", "c.nest", "c.nest.3"
* </pre>
*/
private static List<String> flattenNestedFieldNames(StructType schema) {
vkorukanti (Collaborator Author):

This will now use the newly added generic method.
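
Roughly, the delegation could look like this (a sketch under the same inferred filterRecursively signature; the real method may additionally escape field names, which is omitted here):

import java.util.List;
import java.util.stream.Collectors;

// Sketch: express flattenNestedFieldNames via the generic recursive filter,
// producing paths like "a", "a.1", "c.nest.3" as in the Javadoc example above.
private static List<String> flattenNestedFieldNames(StructType schema) {
  return SchemaUtils.filterRecursively(
          schema,
          true /* recurseIntoMapOrArrayElements */,
          false /* stopOnFirstMatch */,
          field -> true)
      .stream()
      .map(entry -> String.join(".", entry._1))
      .collect(Collectors.toList());
}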

.asScala.map(f => (f._1.asScala.mkString("."), f._2)).toMap

// Assert that the number of results matches the expected columns
assert(results.size === expectedColumns.size)
scottsand-db (Collaborator):

awesome :)

@scottsand-db (Collaborator) left a comment:

LGTM!

vkorukanti merged commit dfc50d6 into delta-io:master on Feb 14, 2025. 19 checks passed.