feat: substantially improve avro deserialization performance #6201
Description
fixes #6189
fixes #6190
Wooohoo!! I was able to figure out a way to get this gain without introducing a cache 😂 That means we get even better performance than what I suggested in #6106, because we don't need to hash `Schema` objects as cache keys.
Previously, we were creating a case-insensitive field map per record and always calling `toUpperCase` on every field, even if the case-sensitive version would have matched. The performance profile (detailed in #6106) revealed that these two points together were taking up a large chunk of time.
There are two schemas in play: the ksql schema and the value schema (taken from schema registry). The original code created a mapping from expected field name (case-insensitive) to the value field. Then, for each ksql field, it looked in this map by name to see if it could find a match. If it did, it would place the value from that field into the output record.
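As a rough illustration of the original pattern described above, here is a minimal sketch using plain Java collections rather than the real Avro or Connect types. The class name `PerRecordMapLookup`, the `translate` method, and the use of a `Map<String, Object>` to stand in for a record are all assumptions made for the example, not the actual ksqlDB code.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the original per-record translation; not the actual ksqlDB code.
final class PerRecordMapLookup {

  // ksqlFields: field names of the ksql schema, in output order (assumed representation).
  // record: field name -> value, standing in for the deserialized Avro value.
  static Map<String, Object> translate(List<String> ksqlFields, Map<String, Object> record) {
    // This map is rebuilt for every record, upper-casing every field name.
    Map<String, Object> caseInsensitive = new HashMap<>();
    for (Map.Entry<String, Object> e : record.entrySet()) {
      caseInsensitive.put(e.getKey().toUpperCase(), e.getValue());
    }

    Map<String, Object> out = new LinkedHashMap<>();
    for (String name : ksqlFields) {
      // toUpperCase is called again per ksql field, even when the names match exactly.
      out.put(name, caseInsensitive.get(name.toUpperCase()));
    }
    return out;
  }
}
```

In this sketch both costs named in the description show up directly: one `HashMap` allocation per record and an unconditional `toUpperCase` call per field.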
The new code reverses this lookup pattern: it looks for each field in the actual record to see if there's a match in the ksql schema. If there isn't, it checks if the upper-cased version of the field name exists in the ksql record.
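The reversed lookup can be sketched the same way. This is only an illustration under stated assumptions: `ReversedLookup`, `translate`, and the `Set<String>` standing in for the ksql schema's field index are hypothetical names, and the real code works on schema and record types rather than plain collections.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the reversed lookup; not the actual ksqlDB code.
final class ReversedLookup {

  // ksqlFields: field names of the ksql schema (assumed to be available as a set).
  // record: field name -> value, standing in for the deserialized Avro value.
  static Map<String, Object> translate(Set<String> ksqlFields, Map<String, Object> record) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (Map.Entry<String, Object> e : record.entrySet()) {
      String name = e.getKey();
      if (ksqlFields.contains(name)) {
        // Case-sensitive hit: no toUpperCase call and no per-record map.
        out.put(name, e.getValue());
        continue;
      }
      // Fall back to the upper-cased name only when the exact name is absent.
      String upper = name.toUpperCase();
      if (ksqlFields.contains(upper)) {
        out.put(upper, e.getValue());
      }
    }
    return out;
  }
}
```

Record fields that match no ksql field are simply skipped here, and ksql fields missing from the record are left out of the result; how the real translator fills those in is not covered by this sketch.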
This improves performance on two fronts:
- We no longer create a case-insensitive field map for every record.
- We no longer call `toUpperCase` on fields if the case-sensitive version matches. This further improves performance in the case-sensitive path (this is key because we used to spend 23% of our time in deserialization just calling `String.toUpperCase`). It trades this for an extra map lookup, which is usually nearly free because Java strings cache their hash code.
Testing done
Running the benchmark locally shows an improvement of about 40% for the `metrics` schema (the gain is less impressive on the smaller `impressions` schema).
Profiling results now show fewer easy-to-fix hot spots. Before the optimization, we spent nearly twice as much time in `AvroDataTranslator` as in actually deserializing the Avro data.
Afterwards (funnily, the percentages exactly flipped!) we spend half as much time in `AvroDataTranslator` as we do deserializing the data.
Reviewer checklist