-
Notifications
You must be signed in to change notification settings - Fork 179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve strict schema string match with smarter matching #621
base: main
Are you sure you want to change the base?
Conversation
bb68ecb
to
ef53c76
Compare
val newSchemaFieldNames = newSchemaFields.keySet | ||
val currentSchemaFieldNames = currentSchemaFields.keySet | ||
|
||
// Ensure that all the new schema field names are a subset of current schema field names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is subset good enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel we could follow this Delta check in general. Right now I see few differences:
- For the Delta check, they ensure current schema fields are a subset new schema fields, but we're doing the opposite.
- They have checks on the nullability too.
I'm not familiar with the rationale behind the check either. Might worth to ask Ryan or Delta folks to make sure the change is safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Reversed the check
- Nullability is checked within DataType.equalsStructurally
field.name -> field.dataType | ||
}).toMap | ||
|
||
val schemaChangedException = new SparkException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: do we want to include the info of culprit field in the error message?
val newSchemaFieldNames = newSchemaFields.keySet | ||
val currentSchemaFieldNames = currentSchemaFields.keySet | ||
|
||
// Ensure that all the new schema field names are a subset of current schema field names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel we could follow this Delta check in general. Right now I see few differences:
- For the Delta check, they ensure current schema fields are a subset new schema fields, but we're doing the opposite.
- They have checks on the nullability too.
I'm not familiar with the rationale behind the check either. Might worth to ask Ryan or Delta folks to make sure the change is safe.
ef53c76
to
781b9c3
Compare
781b9c3
to
0384302
Compare
This PR improves the strict schema string match with smarter matching i.e. subset matching
Added unit test
Tested against DBR: