make sure string function index bounds are correctly documented #25478

StefanKarpinski · 2018-01-09T21:44:54Z

https://discourse.julialang.org/t/documentation-for-thisind-doesnt-match-behavior/8239

bkamins · 2018-01-11T17:21:26Z

The same with nextind and prevind.

In particular the part of docs of nextind:

If i is out of bounds in s return i + 1.

and related part of docs of prevind is not working as documented, e.g.:

julia> nextind("x", 2)
ERROR: BoundsError: attempt to access "x"
  at index [2]
Stacktrace:
 [1] nextind(::String, ::Int64) at .\strings\string.jl:121
 [2] top-level scope

julia> nextind("x", -1)
ERROR: BoundsError: attempt to access "x"
  at index [-1]
Stacktrace:
 [1] nextind(::String, ::Int64) at .\strings\string.jl:121
 [2] top-level scope

Also see #24255 (I did not update that PR because I do not know what is the expected behavior).

StefanKarpinski · 2018-01-11T17:38:30Z

The docs are what I initially thought the bounds should be, what they actually ended up being is:

thisind(s, i): 0 ≤ i ≤ ncodeunits(s)+1
nextind(s, i): 0 ≤ i < ncodeunits(s)+1
prevind(s, i): 0 < i ≤ ncodeunits(s)+1

Reasoning: it's useful and common to want to pass a just-out-of-bounds index and it makes sense to do so as long as the operation isn't going to go even more out of bounds. So the full in-bounds-or-just-out-of-bounds range of indices is ok for thisind since it never goes further out of bounds than it already is. For nextind you can only pass in the lower just-out-of-bounds value and move up. Similarly, for prevind you can only pass in the upper just-out-of-bounds value and move down. If you limit these bounds any further a lot of generic code is forced to add annoying and pointless boundary condition handling cases whereas with these bounds code works "just works" generically and correctly. In other words, this is a strict as we can empirically make these bounds.

bkamins · 2018-01-11T17:53:33Z

Thanks. The only missing piece of specification is how:

nextind(s,i,n) should work when i+n>ncodeunits(s)+1
prevind(s,i,n) should work when i+n<0

This is actually what is stopping me with #24255. Or maybe it is unspecified (i.e. it should return any value that is not within bounds of the string - then the current implementation is OK - only performance should be improved)

StefanKarpinski · 2018-01-12T01:52:54Z

Does it matter? You can't do any computations with those out-of-bounds indices anyway. The only thing you can really do with such an index is check if it is out of bounds or not and compare it to other indices. So I think the only constraints need to be that the functions are consistent when called with the same string object and that they should be strictly monotonic in n.

bkamins · 2018-01-12T09:16:04Z

The issue is about notion of:

the same string object

If we define it as identical under === then current implementation is OK I think.

If we define it as two strings identical under == and of the same type then we have the following problem:

julia> nextind(chop("∀∀"), 1, 2)
7

julia> nextind(chop("∀a"), 1, 2)
5

strings chop("∀a") and chop("∀∀") have the same type and are identical under ==. They are not egal (===).

It is OK for me to leave it as is, but I want to have a clear documenation. The reason is that people develop custom AbstractStrings and we need to be precise. Otherwise we will have discussions what is a bug and what is an implementation detail (this could also affect what we see as breaking change in the future).

bkamins · 2018-01-12T11:56:03Z

And PR #25525 shows that it might matter (I have spotted that problem because of inconsistent behavior of nextind 😄).

StefanKarpinski · 2018-01-12T17:25:58Z

chop("∀a") === chop("∀∀") is false though.

StefanKarpinski · 2018-01-12T17:31:19Z

I would be fine with defining imaginary out-of-bounds characters to all be exactly 1 code unit. The SubString type would need to be updated not to just pass through to its underlying string object.

bkamins · 2018-01-12T18:08:40Z

All you write is exactly what I want to say.

I think (I will check it to be 100% sure) that we can simply remove methods nextind/prevind(s::SubString, i, n) from substring.jl and make them fall back to generic methods without any penalty right now and it should be consistent with:

I would be fine with defining imaginary out-of-bounds characters to all be exactly 1 code unit.

Currently it should not have a performance penalty I think because nextind/prevind(s, i, n) are generic anyway.

The only issue is that in the future probably those generic methods should be specialized for String type to improve performance (e.g to call next in nextind once we know we are at a valid index which should be faster than calling isvalid on each index) and then also additional SubString{String} methods could be implemented in a similar fashion also.

So In short - I believe there is a simple way to make nextind/prevind for SubString behave now as you have proposed and in the future it is also possible to keep this contract.

I could make an appropriate PR showing this approach - should I (I am asking because you are assigned to this issue).

StefanKarpinski · 2018-01-12T19:10:26Z

The only issue is that in the future probably those generic methods should be specialized for String type to improve performance

We can add SubString method specializations at that point.

I could make an appropriate PR showing this approach - should I?

Please do! Assignment just means I'm responsible to make sure it gets done, not that I have to actually do. Thank you!

bkamins · 2018-01-15T14:48:02Z

Apart from PRs I have made going back to the code of this issue I want to make sure this is intended:

julia> nextind("1", 0, 0)
0

julia> prevind("1", 2, 0)
2

and also reverseind requires update of the documentation as:

julia> reverseind("1", 0)
2

julia> reverseind("1", 2)
0

which is not covered by the contract given in the docs (I do not know if it was intended or not).

Those three cases make me realize that actually those the only three places in the whole Base where thisind can be called with 0 or ncodeunits(n)+1 value.

I fully agree with:

nextind(s, i): 0 ≤ i < ncodeunits(s)+1
prevind(s, i): 0 < i ≤ ncodeunits(s)+1

but then the question is in what situations it is good that thisind would accept 0 or ncodeunits(s)+1 as a second argument?

In Base those are only those three places I have indicated, and you may decide that:

this is a desired behavior (then we leave it as is);
actually we do not want this behavior (then restricting thisind(s, i) to 0 < i ≤ ncodeunits(s) is a natural choice);
we might want this behavior but decide to restrict thisind(s, i) to 0 < i ≤ ncodeunits(s) anyway and in those three places add some custom handling.

Fortunately thisind is a new function (introduced in Sep 2017 if I remember correctly) and its current range of accepted second argument is even fresher so I think we have some flexibility here.

StefanKarpinski · 2018-01-16T01:26:24Z

but then the question is in what situations it is good that thisind would accept 0 or ncodeunits(s)+1 as a second argument?

You may want to try changing the bounds to 1 ≤ i ≤ ncodeunits(s) and see what breaks. It's quite a bit. In particular, I recall that it made the generic definition of reverseind break. One bit of insight into the problem is that for zero-length strings, that index range is empty so you can't actually call thisind at all on a zero length string, which seems fishy. Another issue is that one can (as reverseind does), do a computation that results in a just-out-of-bounds index, then do a transformation of that which should be within a character, and then want to find the start of that character. If you allow 0 and ncodeunits(s)+1 then this all just works. If you don't then you end up having a lot of special cases to deal with specially.

bkamins · 2018-01-16T09:59:09Z

Thanks. Actually this led me to finding some more bugs:

correct nextind/prevind/thisind for SubString{String} #25531 (comment)) - I will fix it there;
Consistency of search methods with start position #12906 (comment) - probably in the discussion there we will have a decision what to do.

I did the check what you ask and here are my conclusions:

only definition of endof would have to be changed (but the additional check we would have to do here for ncodeunits(s)==0 would be removed from thisind implementation to this cancels out);
reverseind should work without changes.

everything else should not be impacted (fortunately thisind is new and not much depends on it).

So in short the narrow thisind would mean:

benefits:
- we are sure that thisind returns a valid index if it does not throw an error;
- we can always change the definition to wide approach (with 0 and ncodeunits(s)+1) in the future which will not be breaking;
- shorter body of thisind which might simplify inlining it;
negatives:
- thisind for zero-length string will always throw an error as zero-length string has no valid index (now it accepts 0 and 1)

I am indifferent, i.e. I can live with both definitions given they are properly documented as both approaches have their benefits and costs.

bkamins · 2018-01-16T10:32:24Z

@nalimilan - you were working with those find* functions I guess. Do you have any preference here?

nalimilan · 2018-01-16T13:39:05Z

In the initial PR I wasn't a fan of nextind accepting/returning out-of-bounds indices, but I've kind of lost track of all these discussions so I'm not sure anymore...

bkamins · 2018-01-16T14:09:55Z

One additional benefit of wide thisind is that it is guaranteed not to throw an error on nextind(s,i) and prevind(s,i) result. Here the only problem is that this property does not hold for nextind(s,i,n) for n>1 as discussed above.

In short: @StefanKarpinski - I think we have all cards about thisind on the table. The difference is minimal. I go ahead with PRs using current definition of thisind (accepting 0 and ncodeunits(s)+1), but it is easy to change.

bkamins · 2018-01-16T14:11:24Z

@StefanKarpinski: please just indicate your final preference as there are bugs whose fixing depends on the decision 😄.

StefanKarpinski · 2018-01-16T14:43:36Z

I need some time to review all of these issues and changes. I'll be able to look at it some time today.

samoconnor · 2018-01-26T06:50:43Z

While we're here, why abbreviate index to ind ?
thisindex?
nextindex?
...

bkamins · 2018-02-04T10:52:00Z

We are done with my PRs cleaning up the errors in the implementations of the related functions I could find. They now all work assuming thisind accepts 0 and ncodeunits(s)+1.

So we can leave it as is and start writing precise documentations the functions or reconsider range of accepted values in thisind.

@StefanKarpinski I know this is a busy period in Julia development but this change will be seriously breaking for strings ecosystem if we want to change it in the future. I can live with both definitions. You have worked on this more and longer so I leave it for you to decide 😄.

bkamins · 2018-02-10T20:30:41Z

I do not know if it is the right place too keep track of this, but somewhere in the documentation we should note that length is not additive for Strings, e.g.:

julia> a, b = "\xe2\x88", "\x80"
("\xe2\x88", "\x80")

julia> length(a), length(b)
(1, 1)

julia> a*b, length(a*b)
("∀", 1)

as this is a natural property people could assume in their code.

StefanKarpinski · 2018-02-11T19:37:10Z

It is additive for valid UTF-8 strings, however, which could be noted at the same time.

StefanKarpinski added docs This change adds or pertains to documentation strings "Strings!" labels Jan 9, 2018

StefanKarpinski added this to the 1.0 milestone Jan 9, 2018

StefanKarpinski self-assigned this Jan 9, 2018

bkamins mentioned this issue Jan 12, 2018

correct nextind/prevind/thisind for SubString{String} #25531

Merged

bkamins mentioned this issue Jan 16, 2018

Consistency of search methods with start position #12906

Open

JeffBezanson modified the milestones: 1.0, 1.0.x Jan 23, 2018

bkamins mentioned this issue Mar 20, 2018

String nextind is inconsistent about what to do with out-of-bounds indices #26511

Closed

bkamins mentioned this issue May 1, 2018

Document invalid UTF-8 indexing and concatenation #26952

Merged

DilumAluthge removed this from the 1.x.y milestone Mar 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

make sure string function index bounds are correctly documented #25478

make sure string function index bounds are correctly documented #25478

StefanKarpinski commented Jan 9, 2018

bkamins commented Jan 11, 2018

StefanKarpinski commented Jan 11, 2018 •

edited

Loading

bkamins commented Jan 11, 2018

StefanKarpinski commented Jan 12, 2018

bkamins commented Jan 12, 2018

bkamins commented Jan 12, 2018

StefanKarpinski commented Jan 12, 2018

StefanKarpinski commented Jan 12, 2018

bkamins commented Jan 12, 2018

StefanKarpinski commented Jan 12, 2018

bkamins commented Jan 15, 2018

StefanKarpinski commented Jan 16, 2018

bkamins commented Jan 16, 2018

bkamins commented Jan 16, 2018

nalimilan commented Jan 16, 2018

bkamins commented Jan 16, 2018

bkamins commented Jan 16, 2018

StefanKarpinski commented Jan 16, 2018

samoconnor commented Jan 26, 2018

bkamins commented Feb 4, 2018

bkamins commented Feb 10, 2018

StefanKarpinski commented Feb 11, 2018 •

edited

Loading

make sure string function index bounds are correctly documented #25478

make sure string function index bounds are correctly documented #25478

Comments

StefanKarpinski commented Jan 9, 2018

bkamins commented Jan 11, 2018

StefanKarpinski commented Jan 11, 2018 • edited Loading

bkamins commented Jan 11, 2018

StefanKarpinski commented Jan 12, 2018

bkamins commented Jan 12, 2018

bkamins commented Jan 12, 2018

StefanKarpinski commented Jan 12, 2018

StefanKarpinski commented Jan 12, 2018

bkamins commented Jan 12, 2018

StefanKarpinski commented Jan 12, 2018

bkamins commented Jan 15, 2018

StefanKarpinski commented Jan 16, 2018

bkamins commented Jan 16, 2018

bkamins commented Jan 16, 2018

nalimilan commented Jan 16, 2018

bkamins commented Jan 16, 2018

bkamins commented Jan 16, 2018

StefanKarpinski commented Jan 16, 2018

samoconnor commented Jan 26, 2018

bkamins commented Feb 4, 2018

bkamins commented Feb 10, 2018

StefanKarpinski commented Feb 11, 2018 • edited Loading

StefanKarpinski commented Jan 11, 2018 •

edited

Loading

StefanKarpinski commented Feb 11, 2018 •

edited

Loading