Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix nextind bug for invalid UTF-8 #25214

Merged
merged 1 commit into from
Dec 20, 2017
Merged

fix nextind bug for invalid UTF-8 #25214

merged 1 commit into from
Dec 20, 2017

Conversation

StefanKarpinski
Copy link
Member

@StefanKarpinski StefanKarpinski commented Dec 20, 2017

Reported by @bkamins in #24420. More comprehensive tests to come next week.

@StefanKarpinski StefanKarpinski added strings "Strings!" bugfix This change fixes an existing bug labels Dec 20, 2017
@StefanKarpinski StefanKarpinski changed the title fix nextind bug for invalid UTF-8 reported by bkamins in #24420 fix nextind bug for invalid UTF-8 Dec 20, 2017
@bkamins
Copy link
Member

bkamins commented Dec 20, 2017

I use this code to test it:

srand(1)
errors = []
for i in 1:10^6, j in 1:5
    s = String(rand(UInt8, 5))
    if next(s, j)[2] != nextind(s, j)
        push!(errors, (s, j, isvalid(s, j), next(s,j)[2], nextind(s, j)))
    end
end

and length(errors) is 71119, but the problem is only when next is used with invalid index any(getindex.(errors, 3)) is false - I guess this is intended for performance reasons. Right?

@StefanKarpinski
Copy link
Member Author

StefanKarpinski commented Dec 20, 2017

I decided that it made sense for next to be allowed to assume that its input index was valid since it's generally used via the iteration protocol or in relatively advanced generic code. It still checks for bounds errors (unless you annotate with @inbounds), so you can't accidentally access invalid memory. For String this helps since next doesn't need to check if it's in the middle of a character.

@StefanKarpinski StefanKarpinski merged commit 1a6462e into master Dec 20, 2017
@StefanKarpinski StefanKarpinski deleted the sk/fixnextind branch December 20, 2017 23:35
@nalimilan
Copy link
Member

I decided that it made sense for next to be allowed to assume that its input index was valid since it's generally used via the iteration protocol or in relatively advanced generic code. It still checks for bounds errors (unless you annotate with @inbounds), so you can't accidentally access invalid memory. For String this helps since next doesn't need to check if it's in the middle of a character.

Is this really useful, given that you can use @inbounds when you know the index is valid? It's kind of weird to check that the index is in bounds, but not that it's valid.

@StefanKarpinski
Copy link
Member Author

@inbounds does not tell you that the index is valid, only that it's in bounds. It's unclear to me that it's a good idea to conflate those two. Functions like thisind, nextind and prevind are designed to allow invalid index inputs, but @inbounds still applies to them. Do you want that same @inbounds annotation to tell those functions that they should assume that the input index is valid? If not, then we'd be in a situation where @inbounds tells some of these string functions that they're inbounds but not others – in a seemingly haphazard fashion.

@nalimilan
Copy link
Member

Yeah, @inbounds was designed for arrays where indices are always valid when they are in bounds, so it's not completely clear what it should mean for strings where the situation is more complex. But I can't find a situation where you'd want to assert that an index is valid, but that you don't know whether it's inside bounds. That makes no sense to me. So I'd be inclined to throw an error with invalid indices with all functions, and have @inbounds disable all checks. Also, it would be less breaking if we decided to stop throwing errors after 1.0 than the contrary.

@StefanKarpinski
Copy link
Member Author

So I'd be inclined to throw an error with invalid indices with all functions

The whole point of thisind is to synchronize to a valid index. Similarly for nextind and prevind although they are also useful for valid indices, unlike thisind which has no other use.

But I can't find a situation where you'd want to assert that an index is valid, but that you don't know whether it's inside bounds.

Sure, but the opposite does make sense: knowing that an index is in bounds but not knowing if it's valid. That's why these aren't perfectly coupled.

@nalimilan
Copy link
Member

The whole point of thisind is to synchronize to a valid index. Similarly for nextind and prevind although they are also useful for valid indices, unlike thisind which has no other use.

Right.

Sure, but the opposite does make sense: knowing that an index is in bounds but not knowing if it's valid. That's why these aren't perfectly coupled.

I'm not sure. In what situations would you be sure the index is in bounds but not whether it's valid?

@StefanKarpinski
Copy link
Member Author

Split a string in half by types at the closest character: i = thisind(s, ncodeunits(s) >> 1).

@bkamins
Copy link
Member

bkamins commented Dec 21, 2017

After thinking about it I would support @nalimilan, i.e.:

  • make next throw an error on every invalid index (except 0 and n+1)
  • @inbounds should disable all checks

if we were sure that in performance sensitive code next is actually annotated with @inbounds in Base.

Checking the sources next is often not annotated with @inbounds. Therefore it is probably up to @StefanKarpinski to decide on a global design:

  • we leave code base as is and leave current implementation of next;
  • we make next more strict (per @nalimilan proposal) and review functions using next in Base to properly annotate them with @inbounds where needed.

@StefanKarpinski
Copy link
Member Author

StefanKarpinski commented Dec 21, 2017

I think this is not the right design. Checking for out-of-bounds errors versus index error have very different threat models: if you accidentally access out of bounds memory, your program may crash; if you look for a character at an in-bounds invalid index, you get a garbled character value. We need to be very vigilant against the former, we do not need to be nearly so vigilant about the latter.

We don't generally automatically add @inbounds on iteration since it's unsafe for custom iterable types (unsafe in the "crash" sense, not in the "code may not do what you want" sense). With the proposed change, when someone iterates the characters of a string, they will not get automatic @inbounds annotations and every character access will have to check both that it is in bounds and that the index is the start of a character. Compilers are very good at eliding bounds checks, so the compiler probably can eliminate the first check automatically. On the other hand, there's no way any compiler is ever going to be able to figure out that next doesn't need to check index validity, so we're going to be stuck with the overhead of guarding against that on every character access when iterating strings.

I also think the proposal doesn't actually protect against any realistic user error. Here are the possible modes of failure in the current design:

  1. The iteration protocol is incorrectly implemented. The user will get garbled characters but no memory errors since we still do bounds checks. They investigate and find that the string type is broken and report an error which gets fixed.

  2. The user doesn't understand string indexing and tries to do s[i += 1] or something similar. This will throw a clear error as soon as they encounter a string with invalid indices.

  3. The user understands string indexing and iteration well enough to use next but also uses bare index arithmetic and passes an incorrectly derived index back to the next function. They get garbled character values as a result.

The first scenario we can't really do anything about – bugs in string type implementations will lead to garbled characters no matter what we do. The second scenario seems likely and we defend against it. The third scenario just seems pretty unlikely. If the user is advanced enough to use next on strings, they're advanced enough to know how to manipulate string indices correctly. Note that by far the most common and likely usage of next on strings is to keep doing c, i = next(s, i) in which case as long as next is correctly implemented, it will also never get an invalid index. They might mess up a done check, but if so they'll get a clear bounds error.

@bkamins
Copy link
Member

bkamins commented Dec 21, 2017

Thank you for a detailed explanation (actually it is very good to have it put down in writing 👍).
This actually defines @inbounds as checking memory bounds only. I am convinced.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix This change fixes an existing bug strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants