Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make sure string function index bounds are correctly documented #25478

Open
StefanKarpinski opened this issue Jan 9, 2018 · 22 comments
Open
Assignees
Labels
docs This change adds or pertains to documentation strings "Strings!"

Comments

@StefanKarpinski
Copy link
Member

https://discourse.julialang.org/t/documentation-for-thisind-doesnt-match-behavior/8239

@StefanKarpinski StefanKarpinski added docs This change adds or pertains to documentation strings "Strings!" labels Jan 9, 2018
@StefanKarpinski StefanKarpinski added this to the 1.0 milestone Jan 9, 2018
@StefanKarpinski StefanKarpinski self-assigned this Jan 9, 2018
@bkamins
Copy link
Member

bkamins commented Jan 11, 2018

The same with nextind and prevind.

In particular the part of docs of nextind:

If i is out of bounds in s return i + 1.

and related part of docs of prevind is not working as documented, e.g.:

julia> nextind("x", 2)
ERROR: BoundsError: attempt to access "x"
  at index [2]
Stacktrace:
 [1] nextind(::String, ::Int64) at .\strings\string.jl:121
 [2] top-level scope

julia> nextind("x", -1)
ERROR: BoundsError: attempt to access "x"
  at index [-1]
Stacktrace:
 [1] nextind(::String, ::Int64) at .\strings\string.jl:121
 [2] top-level scope

Also see #24255 (I did not update that PR because I do not know what is the expected behavior).

@StefanKarpinski
Copy link
Member Author

StefanKarpinski commented Jan 11, 2018

The docs are what I initially thought the bounds should be, what they actually ended up being is:

  • thisind(s, i): 0 ≤ i ≤ ncodeunits(s)+1
  • nextind(s, i): 0 ≤ i < ncodeunits(s)+1
  • prevind(s, i): 0 < i ≤ ncodeunits(s)+1

Reasoning: it's useful and common to want to pass a just-out-of-bounds index and it makes sense to do so as long as the operation isn't going to go even more out of bounds. So the full in-bounds-or-just-out-of-bounds range of indices is ok for thisind since it never goes further out of bounds than it already is. For nextind you can only pass in the lower just-out-of-bounds value and move up. Similarly, for prevind you can only pass in the upper just-out-of-bounds value and move down. If you limit these bounds any further a lot of generic code is forced to add annoying and pointless boundary condition handling cases whereas with these bounds code works "just works" generically and correctly. In other words, this is a strict as we can empirically make these bounds.

@bkamins
Copy link
Member

bkamins commented Jan 11, 2018

Thanks. The only missing piece of specification is how:

  • nextind(s,i,n) should work when i+n>ncodeunits(s)+1
  • prevind(s,i,n) should work when i+n<0

This is actually what is stopping me with #24255. Or maybe it is unspecified (i.e. it should return any value that is not within bounds of the string - then the current implementation is OK - only performance should be improved)

@StefanKarpinski
Copy link
Member Author

Does it matter? You can't do any computations with those out-of-bounds indices anyway. The only thing you can really do with such an index is check if it is out of bounds or not and compare it to other indices. So I think the only constraints need to be that the functions are consistent when called with the same string object and that they should be strictly monotonic in n.

@bkamins
Copy link
Member

bkamins commented Jan 12, 2018

The issue is about notion of:

the same string object

If we define it as identical under === then current implementation is OK I think.

If we define it as two strings identical under == and of the same type then we have the following problem:

julia> nextind(chop("∀∀"), 1, 2)
7

julia> nextind(chop("∀a"), 1, 2)
5

strings chop("∀a") and chop("∀∀") have the same type and are identical under ==. They are not egal (===).

It is OK for me to leave it as is, but I want to have a clear documenation. The reason is that people develop custom AbstractStrings and we need to be precise. Otherwise we will have discussions what is a bug and what is an implementation detail (this could also affect what we see as breaking change in the future).

@bkamins
Copy link
Member

bkamins commented Jan 12, 2018

And PR #25525 shows that it might matter (I have spotted that problem because of inconsistent behavior of nextind 😄).

@StefanKarpinski
Copy link
Member Author

chop("∀a") === chop("∀∀") is false though.

@StefanKarpinski
Copy link
Member Author

I would be fine with defining imaginary out-of-bounds characters to all be exactly 1 code unit. The SubString type would need to be updated not to just pass through to its underlying string object.

@bkamins
Copy link
Member

bkamins commented Jan 12, 2018

All you write is exactly what I want to say.

I think (I will check it to be 100% sure) that we can simply remove methods nextind/prevind(s::SubString, i, n) from substring.jl and make them fall back to generic methods without any penalty right now and it should be consistent with:

I would be fine with defining imaginary out-of-bounds characters to all be exactly 1 code unit.

Currently it should not have a performance penalty I think because nextind/prevind(s, i, n) are generic anyway.

The only issue is that in the future probably those generic methods should be specialized for String type to improve performance (e.g to call next in nextind once we know we are at a valid index which should be faster than calling isvalid on each index) and then also additional SubString{String} methods could be implemented in a similar fashion also.

So In short - I believe there is a simple way to make nextind/prevind for SubString behave now as you have proposed and in the future it is also possible to keep this contract.

I could make an appropriate PR showing this approach - should I (I am asking because you are assigned to this issue).

@StefanKarpinski
Copy link
Member Author

The only issue is that in the future probably those generic methods should be specialized for String type to improve performance

We can add SubString method specializations at that point.

I could make an appropriate PR showing this approach - should I?

Please do! Assignment just means I'm responsible to make sure it gets done, not that I have to actually do. Thank you!

@bkamins
Copy link
Member

bkamins commented Jan 15, 2018

Apart from PRs I have made going back to the code of this issue I want to make sure this is intended:

julia> nextind("1", 0, 0)
0

julia> prevind("1", 2, 0)
2

and also reverseind requires update of the documentation as:

julia> reverseind("1", 0)
2

julia> reverseind("1", 2)
0

which is not covered by the contract given in the docs (I do not know if it was intended or not).

Those three cases make me realize that actually those the only three places in the whole Base where thisind can be called with 0 or ncodeunits(n)+1 value.

I fully agree with:

nextind(s, i): 0 ≤ i < ncodeunits(s)+1
prevind(s, i): 0 < i ≤ ncodeunits(s)+1

but then the question is in what situations it is good that thisind would accept 0 or ncodeunits(s)+1 as a second argument?

In Base those are only those three places I have indicated, and you may decide that:

  • this is a desired behavior (then we leave it as is);
  • actually we do not want this behavior (then restricting thisind(s, i) to 0 < i ≤ ncodeunits(s) is a natural choice);
  • we might want this behavior but decide to restrict thisind(s, i) to 0 < i ≤ ncodeunits(s) anyway and in those three places add some custom handling.

Fortunately thisind is a new function (introduced in Sep 2017 if I remember correctly) and its current range of accepted second argument is even fresher so I think we have some flexibility here.

@StefanKarpinski
Copy link
Member Author

but then the question is in what situations it is good that thisind would accept 0 or ncodeunits(s)+1 as a second argument?

You may want to try changing the bounds to 1 ≤ i ≤ ncodeunits(s) and see what breaks. It's quite a bit. In particular, I recall that it made the generic definition of reverseind break. One bit of insight into the problem is that for zero-length strings, that index range is empty so you can't actually call thisind at all on a zero length string, which seems fishy. Another issue is that one can (as reverseind does), do a computation that results in a just-out-of-bounds index, then do a transformation of that which should be within a character, and then want to find the start of that character. If you allow 0 and ncodeunits(s)+1 then this all just works. If you don't then you end up having a lot of special cases to deal with specially.

@bkamins
Copy link
Member

bkamins commented Jan 16, 2018

Thanks. Actually this led me to finding some more bugs:

  1. correct nextind/prevind/thisind for SubString{String} #25531 (comment)) - I will fix it there;
  2. Consistency of search methods with start position #12906 (comment) - probably in the discussion there we will have a decision what to do.

I did the check what you ask and here are my conclusions:

  1. only definition of endof would have to be changed (but the additional check we would have to do here for ncodeunits(s)==0 would be removed from thisind implementation to this cancels out);
  2. reverseind should work without changes.

everything else should not be impacted (fortunately thisind is new and not much depends on it).

So in short the narrow thisind would mean:

  1. benefits:
    • we are sure that thisind returns a valid index if it does not throw an error;
    • we can always change the definition to wide approach (with 0 and ncodeunits(s)+1) in the future which will not be breaking;
    • shorter body of thisind which might simplify inlining it;
  2. negatives:
    • thisind for zero-length string will always throw an error as zero-length string has no valid index (now it accepts 0 and 1)

I am indifferent, i.e. I can live with both definitions given they are properly documented as both approaches have their benefits and costs.

@bkamins
Copy link
Member

bkamins commented Jan 16, 2018

@nalimilan - you were working with those find* functions I guess. Do you have any preference here?

@nalimilan
Copy link
Member

In the initial PR I wasn't a fan of nextind accepting/returning out-of-bounds indices, but I've kind of lost track of all these discussions so I'm not sure anymore...

@bkamins
Copy link
Member

bkamins commented Jan 16, 2018

One additional benefit of wide thisind is that it is guaranteed not to throw an error on nextind(s,i) and prevind(s,i) result. Here the only problem is that this property does not hold for nextind(s,i,n) for n>1 as discussed above.

In short: @StefanKarpinski - I think we have all cards about thisind on the table. The difference is minimal. I go ahead with PRs using current definition of thisind (accepting 0 and ncodeunits(s)+1), but it is easy to change.

@bkamins
Copy link
Member

bkamins commented Jan 16, 2018

@StefanKarpinski: please just indicate your final preference as there are bugs whose fixing depends on the decision 😄.

@StefanKarpinski
Copy link
Member Author

I need some time to review all of these issues and changes. I'll be able to look at it some time today.

@JeffBezanson JeffBezanson modified the milestones: 1.0, 1.0.x Jan 23, 2018
@samoconnor
Copy link
Contributor

While we're here, why abbreviate index to ind ?
thisindex?
nextindex?
...

@bkamins
Copy link
Member

bkamins commented Feb 4, 2018

We are done with my PRs cleaning up the errors in the implementations of the related functions I could find. They now all work assuming thisind accepts 0 and ncodeunits(s)+1.

So we can leave it as is and start writing precise documentations the functions or reconsider range of accepted values in thisind.

@StefanKarpinski I know this is a busy period in Julia development but this change will be seriously breaking for strings ecosystem if we want to change it in the future. I can live with both definitions. You have worked on this more and longer so I leave it for you to decide 😄.

@bkamins
Copy link
Member

bkamins commented Feb 10, 2018

I do not know if it is the right place too keep track of this, but somewhere in the documentation we should note that length is not additive for Strings, e.g.:

julia> a, b = "\xe2\x88", "\x80"
("\xe2\x88", "\x80")

julia> length(a), length(b)
(1, 1)

julia> a*b, length(a*b)
("∀", 1)

as this is a natural property people could assume in their code.

@StefanKarpinski
Copy link
Member Author

StefanKarpinski commented Feb 11, 2018

It is additive for valid UTF-8 strings, however, which could be noted at the same time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs This change adds or pertains to documentation strings "Strings!"
Projects
None yet
Development

No branches or pull requests

6 participants