Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVRO-3920: Serialize custom attribute in RecordField #2650

Merged
merged 7 commits into from
Jan 4, 2024

Conversation

ZENOTME
Copy link
Contributor

@ZENOTME ZENOTME commented Dec 25, 2023

What is the purpose of the change

(For example: This pull request improves file read performance by buffering data, fixing AVRO-XXXX.)
support to serialize the custom attributes in RecordField ,fixing AVRO-3920

Verifying this change

(Please pick one of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Extended interop tests to verify consistent valid schema names between SDKs
  • Added test that validates that Java throws an AvroRuntimeException on invalid binary data
  • Manually verified the change by building the website and checking the new redirect

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@github-actions github-actions bot added the Rust label Dec 25, 2023
@ZENOTME
Copy link
Contributor Author

ZENOTME commented Dec 25, 2023

Find a interesting case:

let schema = Schema::parse_str(
            r#"
            {
              "type" : "record",
              "name" : "record",
              "fields" : [
                 {
                    "type" : "enum",
                    "name" : "enum",
                    "symbols": ["one", "two", "three"]
                 },
                 { "name" : "next", "type" : "enum" }
             ]
            }
        "#,
        )?;
// schema 
RecordField {
                name: "enum",
                doc: None,
                aliases: None,
                default: None,
                schema: Enum(
                    EnumSchema {
                        name: Name {
                            name: "enum",
                            namespace: None,
                        },
                        aliases: None,
                        doc: None,
                        symbols: [
                            "one",
                            "two",
                            "three",
                        ],
                        default: None,
                        attributes: {},
                    },
                ),
                order: Ascending,
                position: 0,
                custom_attributes: {
                    "symbols": Array [
                        String("one"),
                        String("two"),
                        String("three"),
                    ],
                },
            },

"symbols": ["one", "two", "three"] will be treated as custom attributes and symbols, seems there are some problem here:

  1. Is that str valid? Should it write as
r#"
            {
              "type" : "record",
              "name" : "record",
              "fields" : [
                 {
                    "name": "enum"
                    "type" : {
                          "type": "enum",
                          "name" : "enum",
                           "symbols": ["one", "two", "three"]
                    }
                 },
                 { "name" : "next", "type" : "enum" }
             ]
            }
        "#,

I tried this one and the symbols will not be treated as custom attributes.
2. If the above 1 is right, then we should treat symbols in following case as attributes as custom attributes rather than symbols, right?🤔

   "fields" :[{
                    "type" : "enum",
                    "name" : "enum",
                    "symbols": ["one", "two", "three"]
     }]

cc @martin-g

@martin-g
Copy link
Member

The symbols should not be stored as attributes! It is a bug!

@martin-g
Copy link
Member

But symbols must be excluded according to

attributes: self.get_custom_attributes(complex, vec!["symbols"]),

@martin-g
Copy link
Member

it seems it is parsed here -

"type" | "name" | "doc" | "default" | "order" | "position" | "aliases" => continue,

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Dec 25, 2023

But symbols must be excluded according to

attributes: self.get_custom_attributes(complex, vec!["symbols"]),

If we solve it like this, it means that the user should not use "symbols" as attribute names even the type is not an enum, like the following, right? 🤔

fields: [{
  "type": "long",
  "name": "field1"
  "symbols":  [..]
}]

@martin-g
Copy link
Member

If we solve it like this, it means that the user should not use "symbols" as attribute names even the type is not an enum, like the following, right?

See the surrounding code. symbols is excluded here only if this is an enum schema

@martin-g
Copy link
Member

I'll be able to help more after the holidays!

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Dec 25, 2023

I'll be able to help more after the holidays!

Thanks! Enjoy your holiday!

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Dec 26, 2023

I send a commit to fix it. But I'm not sure whether it is a good way to solve it.
IMO, I think the better way to fix it may be to let the parse_complex(&mut self, complex:&mut Map<>) remove some fields used in this function.🤔 But this would involve too many modifications and I'm not sure it's acceptable.
I'm glad to revert the current fix way and using a better way to fix it.

@@ -3305,16 +3314,16 @@ mod tests {

let schema = Schema::parse_str(raw_schema)?;
assert_eq!(
"abf662f831715ff78f88545a05a9262af75d6406b54e1a8a174ff1d2b75affc4",
"b18ddbf029afaa876b54cb4de997506ef67e202bee13d8b54cb5698288c25318",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a regression.
According to https://avro.apache.org/docs/1.11.1/specification/#transforming-into-parsing-canonical-form we should STRIP everything but type, name, fields, symbols, items, values, size. I.e. the custom attributes should be stripped and the fingerprint should stay the same.

@martin-g
Copy link
Member

martin-g commented Jan 4, 2024

@ZENOTME I think this PR is good to go!
Please let me know if you have any concerns!

@martin-g
Copy link
Member

martin-g commented Jan 4, 2024

I have the feeling that the way I ignored logicalType to be stored in the custom attributes would break your expectations about Fixed+uuid,

// uuid logical type is not supported for SchemaKind::Fixed, so it is parsed as Schema::Fixed
// and the `logicalType` is preserved as an attribute.

The test still passes because the test depends on Schema's attributes, while the custom attribute is ignored for a record field. But those should behave the same way!

There is a discussion about making Fixed(size=16, logicalType=uuid) a valid type! It should be available in the Rust SDK soon!
But I wonder whether keeping the logicalType as a custom attribute is good in general.

@martin-g martin-g force-pushed the fix_field_custom_attr branch from 49e33e5 to 2628de4 Compare January 4, 2024 11:16
@martin-g martin-g force-pushed the fix_field_custom_attr branch from 2628de4 to e1b2f76 Compare January 4, 2024 11:20
Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>
@martin-g
Copy link
Member

martin-g commented Jan 4, 2024

I am going to merge this PR.
If there is more work we can do it in a separate PR!

@martin-g martin-g merged commit 5184f83 into apache:main Jan 4, 2024
15 checks passed
martin-g pushed a commit that referenced this pull request Jan 4, 2024
* AVRO-3920: [Rust] * serialize custom attr in record field

* AVRO-3920: [Rust] Do not parse "logicalType" as a custom RecordField attribute

Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>
(cherry picked from commit 5184f83)
@martin-g
Copy link
Member

martin-g commented Jan 4, 2024

Thank you, @ZENOTME !

@ZENOTME ZENOTME deleted the fix_field_custom_attr branch January 7, 2024 04:24
@ZENOTME
Copy link
Contributor Author

ZENOTME commented Jan 8, 2024

@ZENOTME I think this PR is good to go!
Please let me know if you have any concerns!

Sorry for being late. Thanks for your help!

I have the feeling that the way I ignored logicalType to be stored in the custom attributes would break your expectations about Fixed+uuid.
But I wonder whether keeping the logicalType as a custom attribute is good in general.

Yes, ignored logicalType here may couse unexpectation behaviour. I think keeping keeping the logicalType as a custom attribute is reasonable behaviour because it let the user customiaze their own logical type even it's not supported by RustSDK.

I guess the behaviour we ignore logicalType or other field here is to avoid duplication. So should we only ignore the logicalType when the schema in record field used it? e.g Uuid, Date, TimeMills,...

@martin-g
Copy link
Member

martin-g commented Jan 8, 2024

OK! I'll see how to improve it!

RanbirK pushed a commit to RanbirK/avro that referenced this pull request May 13, 2024
* AVRO-3920: [Rust] * serialize custom attr in record field

* AVRO-3920: [Rust] Do not parse "logicalType" as a custom RecordField attribute

Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants