-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix generate_non_canonical_map_case, fix MapArray
equality
#1476
Fix generate_non_canonical_map_case, fix MapArray
equality
#1476
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -107,6 +109,47 @@ fn arrow_to_json(arrow_name: &str, json_name: &str, verbose: bool) -> Result<()> | |||
Ok(()) | |||
} | |||
|
|||
fn canonicalize_schema(schema: &Schema) -> Schema { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the issue that other arrow implementations create different names for the "key" and "value" fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The generated json has fields named other than "key" and 'value". I think this is why it is non canonical map case. But I'm not sure which arrow implementations can actually produce such map arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this is better just pushed into the notion of DataType equality, as opposed to being special cased at the edges. Perhaps we shouldn't even be storing Field inside these DataType? 🤔
I also wonder if a similar problem arises with List (and possibly Union) which also have a field name, for no obvious reason?
Perhaps worth a discussion ticket, this design does seem a little counter-intuitive?
arrow/src/array/equal/utils.rs
Outdated
if l_fields.len() == 2 && r_fields.len() == 2 => | ||
{ | ||
// We don't enforce the equality of field names | ||
l_fields.get(0).unwrap().data_type() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this also check nullability and metadata? I also wonder if this logic is better contained within the equality logic for DataType itself 🤔
lhs_len == rhs_len && { | ||
match data_type { | ||
DataType::Map(_, _) => { | ||
// Don't use `equal_range` which calls `utils::base_equal` that checks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm probably missing something but as far as I can tell you have modified base_equal to not check the names, and so I'm not sure why this special case is still needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because here we are compare the nested struct between two map arrays. So at utils::base_equal
, it compares two struct types, not the map types. And we shouldn't let utils::base_equal
ignore field name difference for struct types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But why not just move this special case of DataType::Map(_, _)
into data type? So this code would just do a standard DataType::eq
on the DataType::Map
and it would internally special case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, we don't implement Eq
for DataType
, do you mean we should do it? Even we implement it, here we are comparing child data of the map array, so its data type is Struct(vec!["key", "value"])
. So we still need to specially treat map type here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aah I see the problem, darn... 👍
) && { | ||
match lhs.data_type() { | ||
DataType::Map(_, _) => { | ||
// Don't use `equal_range` which calls `utils::base_equal` that checks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
@@ -121,7 +164,7 @@ fn validate(arrow_name: &str, json_name: &str, verbose: bool) -> Result<()> { | |||
let arrow_schema = arrow_reader.schema().as_ref().to_owned(); | |||
|
|||
// compare schemas | |||
if json_file.schema != arrow_schema { | |||
if canonicalize_schema(&json_file.schema) != canonicalize_schema(&arrow_schema) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something feels a bit off about this logic living at the edges, i.e. the interface. If the field names of a DataType::Map really don't matter, I would expect this to either not be stored, or at least stored as an implementation detail...
MapArray
equality
Which issue does this PR close?
Closes #1475.
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?