-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add utf-8 validation checking for substring
#1577
Add utf-8 validation checking for substring
#1577
Conversation
update doc add a test for invalid array type Signed-off-by: remzi <[email protected]>
clean up Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #1577 +/- ##
==========================================
+ Coverage 82.93% 82.95% +0.01%
==========================================
Files 193 193
Lines 55350 55373 +23
==========================================
+ Hits 45907 45934 +27
+ Misses 9443 9439 -4
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, left some comments
Box::new(|old_start: OffsetSize, end: OffsetSize| (end + start).max(old_start)) | ||
// check the `idx` is a valid char boundary or not | ||
// If yes, return `idx`, else return error | ||
let check_char_boundary = |idx: OffsetSize| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could instead interpret data
as &str
with std::str::from_utf8_unchecked
, which should be valid if this is a valid string array, and then use the standard library is_char_boundary instead of implementing our own
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
// If yes, return `idx`, else return error | ||
let check_char_boundary = |idx: OffsetSize| { | ||
let idx_usize = idx.to_usize().unwrap(); | ||
if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 { | |
// A valid code-point iff it does not start with 0b10xxxxxx | |
// Bit-magic taken from `std::str::is_char_boundary` | |
if idx_usize == data.len() || (data[idx_usize] as i8) < -0x40 { |
There is a bit-magic trick to save an operator, although it is possible LLVM is smart enough to notice this. I also think it would be better to just use the standard library version directly if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
#[test] | ||
fn check_invalid_array_type() { | ||
let array = Int32Array::from(vec![Some(1), Some(2), Some(3)]); | ||
assert_eq!(substring(&array, 0, None).is_err(), true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would make these tests easier to read, and also potentially less fragile if they checked the error message, e.g. something like
let err = substring(&array, 0, None).unwrap_err().to_string();
assert!(err.contains("foo"), "{}", err);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
Box::new(|start: OffsetSize, end: OffsetSize| (*length).min(end - start)) | ||
} else { | ||
Box::new(|start: OffsetSize, end: OffsetSize| end - start) | ||
// function for calculating the start index of each string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// function for calculating the start index of each string | |
// Function for calculating the start index of each string | |
// | |
// Takes the start and end offsets of the original string, and returns the byte offset | |
// to start copying data from the source values array | |
// | |
// Returns an error if this offset is not at a valid UTF-8 char boundary |
Or something to that effect, it wasn't clear to me whether this was the offset in the new array or the old
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
Ordering::Greater => Box::new(|old_start, end| { | ||
check_char_boundary((old_start + start).min(end)) | ||
}), | ||
Ordering::Equal => Box::new(|old_start, _| Ok(old_start)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👌
let new_length = cal_new_length(new_start, pair[1]); | ||
len_so_far += new_length; | ||
new_starts_ends.push((new_start, new_start + new_length)); | ||
offsets.windows(2).try_for_each(|pair| -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious if the dyn dispatch functions perform better than letting LLVM perform loop unswitching, I would have thought eliminating the dyn-dispatch would be both clearer and faster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reasons that I use dyn Fn
are:
- We should do the comparison for
start
and the destruction forlength
outside the for loop. And using closure is a convenient way, otherwise there would be lots of duplicate code. - Closures have different sizes because they capture different environment.
- The dynamic dispatching doesn't impact the performance in this function because they are not hotspots.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point was that modern optimising compilers will automatically lift conditionals out of loops, in a process called loop unswitching, and that letting it do this can be faster not just from eliminating dynamic dispatch, but also letting it further optimise the code.
I don't feel strongly about it, but thought I'd mention that you might be able to make the code clearer and faster by letting the compiler handle this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your explanation. Let me have a try!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
}), | ||
}; | ||
|
||
// function for calculating the end index of each string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some documentation of this functions inputs and output would be helpful I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
arrow/benches/string_kernels.rs
Outdated
@@ -39,10 +39,10 @@ fn add_benchmark(c: &mut Criterion) { | |||
let str_len = 1000; | |||
|
|||
let arr_string = create_string_array_with_len::<i32>(size, 0.0, str_len); | |||
let start = 0; | |||
let start = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could benchmark the various permutations of no start offset and no length?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! And I have updated the benchmark result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The validation checking is only enabled when start !=0
or length != None
. So there is no performance cost for the first micro-bench
/// ``` | ||
/// | ||
/// # Error | ||
/// this function errors when the passed array is not a \[Large\]String array. | ||
/// - The function errors when the passed array is not a \[Large\]String array. | ||
/// - The function errors when you try to create a substring in the middle of a multibyte character. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, multibyte character sounds a bit vague, maybe just say invalid utf-8 format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 { | ||
Ok(idx) | ||
} else { | ||
Err(ArrowError::ComputeError(format!("The index {} is invalid because {:#010b} is not the head byte of a char.", idx_usize, data[idx_usize]))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message may not be understood by end-users. Maybe just say "The substring begins at index {} is an invalid utf8 char because it is not the first byte in a utf8 code point sequence"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
update doc Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me - thank you 👍
I'll leave this open for a bit longer to give others the chance to review, otherwise I'll merge this in this afternoon (GMT)
Co-authored-by: Raphael Taylor-Davies <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, thanks!
Which issue does this PR close?
Closes #1575
What changes are included in this PR?
Are there any user-facing changes?
No.
Benchmark
We only check the char boundary validation when
start != 0
orlength != None
.