
Improve validate_utf8 performance #2048

Merged (6 commits, Jul 26, 2022)
Conversation

@tfeda (Contributor) commented Jul 12, 2022

Which issue does this PR close?

Closes #1815.

Rationale for this change

What changes are included in this PR?

Added a benchmark for UTF-8 validation and followed the suggestions in #1815, with a few notes:

  1. If UTF-8 validation fails, the new function falls back to validating each string slice individually to give a more informed error message.
  2. @tustvold: validate_each_offset() checks that the offsets are sorted, which I don't think we want to lose by switching to is_char_boundary().

Checking the bench, I get a ~4x speedup 👍
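A minimal sketch of the strategy described above (function name, signature, and error handling are assumptions for illustration, not the actual arrow-rs code): validate the whole values buffer once, check sortedness and each offset with `is_char_boundary`, and fall back to per-slice validation on failure to report which array index is invalid.

```rust
// Hypothetical sketch, not the arrow-rs implementation: fast-path UTF-8
// validation of a string array's values buffer plus per-offset boundary
// checks, with a per-slice fallback for precise error reporting.
fn validate_utf8(values: &[u8], offsets: &[usize]) -> Result<(), String> {
    match std::str::from_utf8(values) {
        // Fast path: the whole buffer is valid UTF-8, so each slice is
        // valid iff the offsets are sorted and land on char boundaries.
        Ok(s) => {
            for pair in offsets.windows(2) {
                if pair[1] < pair[0] {
                    return Err("offsets are not monotonically increasing".to_string());
                }
            }
            for &o in offsets {
                // is_char_boundary also returns false for o > values.len()
                if !s.is_char_boundary(o) {
                    return Err(format!("offset {o} is not a char boundary"));
                }
            }
            Ok(())
        }
        // Slow path: validate each slice individually so the error names
        // the offending array index.
        Err(_) => {
            for (i, pair) in offsets.windows(2).enumerate() {
                let (start, end) = (pair[0], pair[1]);
                if end < start || end > values.len() {
                    return Err(format!("invalid offsets {start}..{end} at index {i}"));
                }
                if std::str::from_utf8(&values[start..end]).is_err() {
                    return Err(format!("invalid UTF-8 at array index {i}"));
                }
            }
            Ok(())
        }
    }
}
```

Note that the slow path only validates the byte ranges actually covered by the offsets, which matches the observation later in this thread that bytes outside those ranges need not be valid.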

Are there any user-facing changes?

Nope

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 12, 2022
@tustvold (Contributor) left a comment

Without is_char_boundary this is not correct, it will not detect a multi-byte character split across two consecutive strings.
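To see why: a buffer can be valid UTF-8 as a whole while an offset still falls in the middle of a multi-byte character, making both neighbouring slices invalid. A small illustration (the helper name is hypothetical, not from the PR):

```rust
// Illustration only: `splits_a_char` is a hypothetical helper, not PR code.
// Returns true if `offset` falls inside a multi-byte character of `values`,
// i.e. the case a whole-buffer UTF-8 check alone cannot detect.
fn splits_a_char(values: &[u8], offset: usize) -> bool {
    match std::str::from_utf8(values) {
        // Whole buffer is valid UTF-8; the offset is bad iff it is not a
        // char boundary.
        Ok(s) => !s.is_char_boundary(offset),
        // The buffer itself is invalid, which whole-buffer validation
        // already catches.
        Err(_) => true,
    }
}
```

For example, 'é' encodes as two bytes (0xC3, 0xA9): the buffer passes `from_utf8`, yet an offset of 1 would split the character across two consecutive strings.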

@alamb (Contributor) commented Jul 13, 2022

> Without is_char_boundary this is not correct, it will not detect a multi-byte character split across two consecutive strings.

Sounds like a hole in our tests if all the tests pass but there is a bug 🤔

@tustvold (Contributor)
Added a test in #2068

@tfeda (Contributor, Author) commented Jul 19, 2022

Sorry for the late follow-up, I see its purpose now and added it.

@codecov-commenter commented

Codecov Report

Merging #2048 (a17f650) into master (b2cf02c) will increase coverage by 0.00%.
The diff coverage is 77.77%.

❗ Current head a17f650 differs from pull request most recent head 17678fa. Consider uploading reports for the commit 17678fa to get more accurate results

@@           Coverage Diff           @@
##           master    #2048   +/-   ##
=======================================
  Coverage   83.74%   83.74%           
=======================================
  Files         225      225           
  Lines       59422    59434   +12     
=======================================
+ Hits        49764    49775   +11     
- Misses       9658     9659    +1     
Impacted Files                       Coverage Δ
arrow/src/array/data.rs              84.93% <77.77%> (-0.02%) ⬇️
parquet/src/encodings/encoding.rs    93.62% <0.00%> (+0.19%) ⬆️

Comment on lines +1133 to +1134
if !values_str.is_char_boundary(range.start)
|| !values_str.is_char_boundary(range.end)
@tustvold (Contributor) commented Jul 23, 2022

I think you can remove values_str.is_char_boundary(range.end), as the end offset is the start offset of the next string slice and will therefore be checked by that. We also do not need to check the final offset, because if the end of the string were not a valid char boundary, the string as a whole would fail validation.

@tustvold (Contributor)

I realized an additional subtlety with this: if the offsets buffer is sliced, you need to validate that the slicing is at a valid boundary, which a naive implementation of the above might miss.

Theoretically you only need to validate the range of values covered by the offsets, which might be another possible optimisation 🤔
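Putting these two observations together, a sketch of the boundary check (assumed names and signature, not the PR code): one `is_char_boundary` call per offset covers each slice's start and, by the same token, the previous slice's end, and it also validates a nonzero first offset from a sliced offsets buffer.

```rust
// Sketch only (assumed signature, not the PR code). Each interior offset is
// simultaneously one slice's end and the next slice's start, so a single
// is_char_boundary check per offset covers both roles; the first offset may
// be nonzero when the offsets buffer is sliced and is checked the same way.
fn check_offset_boundaries(values: &str, offsets: &[usize]) -> bool {
    offsets.iter().all(|&o| values.is_char_boundary(o))
}
```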

@tustvold (Contributor) left a comment

Looks good to me, thank you. Left a minor comment that should make this even faster 😄

@tustvold tustvold merged commit 0c64054 into apache:master Jul 26, 2022
@ursabot commented Jul 26, 2022

Benchmark runs are scheduled for baseline = 9c70e4a and contender = 0c64054. 0c64054 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@alamb (Contributor) commented Jul 26, 2022

🎉 Thanks for this contribution @tfeda and to @tustvold for helping get it over the line

Labels: arrow (Changes to the arrow crate)

Successfully merging this pull request may close these issues: Faster ArrayData::validate_utf8

5 participants