Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix generate_non_canonical_map_case, fix MapArray equality #1476

Merged
merged 3 commits into from
Mar 27, 2022

Conversation

viirya
Copy link
Member

@viirya viirya commented Mar 23, 2022

Which issue does this PR close?

Closes #1475.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 23, 2022
Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -107,6 +109,47 @@ fn arrow_to_json(arrow_name: &str, json_name: &str, verbose: bool) -> Result<()>
Ok(())
}

fn canonicalize_schema(schema: &Schema) -> Schema {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the issue that other arrow implementations create different names for the "key" and "value" fields?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generated json has fields named other than "key" and 'value". I think this is why it is non canonical map case. But I'm not sure which arrow implementations can actually produce such map arrays.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is better just pushed into the notion of DataType equality, as opposed to being special cased at the edges. Perhaps we shouldn't even be storing Field inside these DataType? 🤔

I also wonder if a similar problem arises with List (and possibly Union) which also have a field name, for no obvious reason?

Perhaps worth a discussion ticket, this design does seem a little counter-intuitive?

if l_fields.len() == 2 && r_fields.len() == 2 =>
{
// We don't enforce the equality of field names
l_fields.get(0).unwrap().data_type()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also check nullability and metadata? I also wonder if this logic is better contained within the equality logic for DataType itself 🤔

lhs_len == rhs_len && {
match data_type {
DataType::Map(_, _) => {
// Don't use `equal_range` which calls `utils::base_equal` that checks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably missing something but as far as I can tell you have modified base_equal to not check the names, and so I'm not sure why this special case is still needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because here we are compare the nested struct between two map arrays. So at utils::base_equal, it compares two struct types, not the map types. And we shouldn't let utils::base_equal ignore field name difference for struct types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why not just move this special case of DataType::Map(_, _) into data type? So this code would just do a standard DataType::eq on the DataType::Map and it would internally special case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, we don't implement Eq for DataType, do you mean we should do it? Even we implement it, here we are comparing child data of the map array, so its data type is Struct(vec!["key", "value"]). So we still need to specially treat map type here...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aah I see the problem, darn... 👍

) && {
match lhs.data_type() {
DataType::Map(_, _) => {
// Don't use `equal_range` which calls `utils::base_equal` that checks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

@@ -121,7 +164,7 @@ fn validate(arrow_name: &str, json_name: &str, verbose: bool) -> Result<()> {
let arrow_schema = arrow_reader.schema().as_ref().to_owned();

// compare schemas
if json_file.schema != arrow_schema {
if canonicalize_schema(&json_file.schema) != canonicalize_schema(&arrow_schema) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something feels a bit off about this logic living at the edges, i.e. the interface. If the field names of a DataType::Map really don't matter, I would expect this to either not be stored, or at least stored as an implementation detail...

@alamb alamb changed the title Fix generate_non_canonical_map_case Fix generate_non_canonical_map_case, fix MapArray equality Mar 24, 2022
@alamb alamb merged commit c5442cf into apache:master Mar 27, 2022
@alamb alamb added the development-process Related to development process of arrow-rs label Mar 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate development-process Related to development process of arrow-rs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix generate_non_canonical_map_case integration test failure
4 participants