-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve validate_utf8
performance
#2048
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without is_char_boundary this is not correct, it will not detect a multi-byte character split across two consecutive strings.
Sounds like a hole in our tests if all the tests pass but there is a bug 🤔 |
Added a test in #2068 |
Sorry for the late follow-up, I see its purpose now and added it. |
Codecov Report
@@ Coverage Diff @@
## master #2048 +/- ##
=======================================
Coverage 83.74% 83.74%
=======================================
Files 225 225
Lines 59422 59434 +12
=======================================
+ Hits 49764 49775 +11
- Misses 9658 9659 +1
|
if !values_str.is_char_boundary(range.start) | ||
|| !values_str.is_char_boundary(range.end) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can remove values_str.is_char_boundary(range.end)
, as the end offset is the start offset of the next string slice and will therefore be checked by that. We do not need to check the final offset as if the end of the string is not a valid char boundary, the string as a whole would fail validation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realized an additional subtlety with this, if the offsets buffer is sliced, you need to validate that this slicing is at a valid boundary, which a naive implementation of the above might miss.
Theoretically you only need to validate the range of values covered by the offsets, which might be another possible optimisation 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thank you. Left a minor comment that should make this even faster 😄
Benchmark runs are scheduled for baseline = 9c70e4a and contender = 0c64054. 0c64054 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #1815 .
Rationale for this change
What changes are included in this PR?
Added a benchmark for utf8 validation,
Followed the suggestions in #1815 with a few notes:
validate_each_offset()
checks if the offsets are sorted, which I don't think we want to lose by switching tois_char_boundary()
,checking the bench I get a ~4x speedup 👍
Are there any user-facing changes?
Nope