fix nextind bug for invalid UTF-8 #25214

StefanKarpinski · 2017-12-20T21:40:58Z

Reported by @bkamins in #24420. More comprehensive tests to come next week.

bkamins · 2017-12-20T22:11:47Z

I use this code to test it:

srand(1)
errors = []
for i in 1:10^6, j in 1:5
    s = String(rand(UInt8, 5))
    if next(s, j)[2] != nextind(s, j)
        push!(errors, (s, j, isvalid(s, j), next(s,j)[2], nextind(s, j)))
    end
end

and length(errors) is 71119, but the problem is only when next is used with invalid index any(getindex.(errors, 3)) is false - I guess this is intended for performance reasons. Right?

StefanKarpinski · 2017-12-20T22:16:26Z

I decided that it made sense for next to be allowed to assume that its input index was valid since it's generally used via the iteration protocol or in relatively advanced generic code. It still checks for bounds errors (unless you annotate with @inbounds), so you can't accidentally access invalid memory. For String this helps since next doesn't need to check if it's in the middle of a character.

nalimilan · 2017-12-21T07:36:33Z

I decided that it made sense for next to be allowed to assume that its input index was valid since it's generally used via the iteration protocol or in relatively advanced generic code. It still checks for bounds errors (unless you annotate with @inbounds), so you can't accidentally access invalid memory. For String this helps since next doesn't need to check if it's in the middle of a character.

Is this really useful, given that you can use @inbounds when you know the index is valid? It's kind of weird to check that the index is in bounds, but not that it's valid.

StefanKarpinski · 2017-12-21T08:00:29Z

@inbounds does not tell you that the index is valid, only that it's in bounds. It's unclear to me that it's a good idea to conflate those two. Functions like thisind, nextind and prevind are designed to allow invalid index inputs, but @inbounds still applies to them. Do you want that same @inbounds annotation to tell those functions that they should assume that the input index is valid? If not, then we'd be in a situation where @inbounds tells some of these string functions that they're inbounds but not others – in a seemingly haphazard fashion.

nalimilan · 2017-12-21T08:04:20Z

Yeah, @inbounds was designed for arrays where indices are always valid when they are in bounds, so it's not completely clear what it should mean for strings where the situation is more complex. But I can't find a situation where you'd want to assert that an index is valid, but that you don't know whether it's inside bounds. That makes no sense to me. So I'd be inclined to throw an error with invalid indices with all functions, and have @inbounds disable all checks. Also, it would be less breaking if we decided to stop throwing errors after 1.0 than the contrary.

StefanKarpinski · 2017-12-21T08:41:59Z

So I'd be inclined to throw an error with invalid indices with all functions

The whole point of thisind is to synchronize to a valid index. Similarly for nextind and prevind although they are also useful for valid indices, unlike thisind which has no other use.

But I can't find a situation where you'd want to assert that an index is valid, but that you don't know whether it's inside bounds.

Sure, but the opposite does make sense: knowing that an index is in bounds but not knowing if it's valid. That's why these aren't perfectly coupled.

nalimilan · 2017-12-21T08:50:06Z

The whole point of thisind is to synchronize to a valid index. Similarly for nextind and prevind although they are also useful for valid indices, unlike thisind which has no other use.

Right.

Sure, but the opposite does make sense: knowing that an index is in bounds but not knowing if it's valid. That's why these aren't perfectly coupled.

I'm not sure. In what situations would you be sure the index is in bounds but not whether it's valid?

StefanKarpinski · 2017-12-21T09:28:01Z

Split a string in half by types at the closest character: i = thisind(s, ncodeunits(s) >> 1).

bkamins · 2017-12-21T09:33:13Z

After thinking about it I would support @nalimilan, i.e.:

make next throw an error on every invalid index (except 0 and n+1)
@inbounds should disable all checks

if we were sure that in performance sensitive code next is actually annotated with @inbounds in Base.

Checking the sources next is often not annotated with @inbounds. Therefore it is probably up to @StefanKarpinski to decide on a global design:

we leave code base as is and leave current implementation of next;
we make next more strict (per @nalimilan proposal) and review functions using next in Base to properly annotate them with @inbounds where needed.

StefanKarpinski · 2017-12-21T15:04:39Z

I think this is not the right design. Checking for out-of-bounds errors versus index error have very different threat models: if you accidentally access out of bounds memory, your program may crash; if you look for a character at an in-bounds invalid index, you get a garbled character value. We need to be very vigilant against the former, we do not need to be nearly so vigilant about the latter.

We don't generally automatically add @inbounds on iteration since it's unsafe for custom iterable types (unsafe in the "crash" sense, not in the "code may not do what you want" sense). With the proposed change, when someone iterates the characters of a string, they will not get automatic @inbounds annotations and every character access will have to check both that it is in bounds and that the index is the start of a character. Compilers are very good at eliding bounds checks, so the compiler probably can eliminate the first check automatically. On the other hand, there's no way any compiler is ever going to be able to figure out that next doesn't need to check index validity, so we're going to be stuck with the overhead of guarding against that on every character access when iterating strings.

I also think the proposal doesn't actually protect against any realistic user error. Here are the possible modes of failure in the current design:

The iteration protocol is incorrectly implemented. The user will get garbled characters but no memory errors since we still do bounds checks. They investigate and find that the string type is broken and report an error which gets fixed.
The user doesn't understand string indexing and tries to do s[i += 1] or something similar. This will throw a clear error as soon as they encounter a string with invalid indices.
The user understands string indexing and iteration well enough to use next but also uses bare index arithmetic and passes an incorrectly derived index back to the next function. They get garbled character values as a result.

The first scenario we can't really do anything about – bugs in string type implementations will lead to garbled characters no matter what we do. The second scenario seems likely and we defend against it. The third scenario just seems pretty unlikely. If the user is advanced enough to use next on strings, they're advanced enough to know how to manipulate string indices correctly. Note that by far the most common and likely usage of next on strings is to keep doing c, i = next(s, i) in which case as long as next is correctly implemented, it will also never get an invalid index. They might mess up a done check, but if so they'll get a clear bounds error.

bkamins · 2017-12-21T16:19:08Z

Thank you for a detailed explanation (actually it is very good to have it put down in writing 👍).
This actually defines @inbounds as checking memory bounds only. I am convinced.

fix nextind bug for invalid UTF-8 reported by bkamins in #24420

582c664

StefanKarpinski added strings "Strings!" bugfix This change fixes an existing bug labels Dec 20, 2017

StefanKarpinski changed the title ~~fix nextind bug for invalid UTF-8 reported by bkamins in #24420~~ fix nextind bug for invalid UTF-8 Dec 20, 2017

StefanKarpinski merged commit 1a6462e into master Dec 20, 2017

StefanKarpinski deleted the sk/fixnextind branch December 20, 2017 23:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix nextind bug for invalid UTF-8 #25214

fix nextind bug for invalid UTF-8 #25214

StefanKarpinski commented Dec 20, 2017 •

edited

Loading

bkamins commented Dec 20, 2017

StefanKarpinski commented Dec 20, 2017 •

edited

Loading

nalimilan commented Dec 21, 2017

StefanKarpinski commented Dec 21, 2017

nalimilan commented Dec 21, 2017

StefanKarpinski commented Dec 21, 2017

nalimilan commented Dec 21, 2017

StefanKarpinski commented Dec 21, 2017

bkamins commented Dec 21, 2017

StefanKarpinski commented Dec 21, 2017 •

edited

Loading

bkamins commented Dec 21, 2017

fix nextind bug for invalid UTF-8 #25214

fix nextind bug for invalid UTF-8 #25214

Conversation

StefanKarpinski commented Dec 20, 2017 • edited Loading

bkamins commented Dec 20, 2017

StefanKarpinski commented Dec 20, 2017 • edited Loading

nalimilan commented Dec 21, 2017

StefanKarpinski commented Dec 21, 2017

nalimilan commented Dec 21, 2017

StefanKarpinski commented Dec 21, 2017

nalimilan commented Dec 21, 2017

StefanKarpinski commented Dec 21, 2017

bkamins commented Dec 21, 2017

StefanKarpinski commented Dec 21, 2017 • edited Loading

bkamins commented Dec 21, 2017

StefanKarpinski commented Dec 20, 2017 •

edited

Loading

StefanKarpinski commented Dec 20, 2017 •

edited

Loading

StefanKarpinski commented Dec 21, 2017 •

edited

Loading