RFC: Unicode strings and characters #13
Conversation
Great! Let's do it!
Also, the write-up is especially nice.
I suggest an explicit mention, somewhere, that the design omits any "byte access" into `string`. This is on purpose, because we want to encourage Dafny users to work at the higher level of `seq<char>`, not dealing with the underlying representation. In places where the bytes are of importance, it is possible to write a `string`-to-byte-sequence function in Dafny.
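For concreteness, here is a minimal sketch of such a `string`-to-byte-sequence function (not from the RFC; the names and the `byte` subset type are illustrative), assuming the proposed semantics where every `char` is a Unicode scalar value:

```dafny
type byte = n: int | 0 <= n < 256

// Encode one Unicode scalar value as 1-4 UTF-8 bytes.
function EncodeChar(c: char): seq<byte> {
  var n := c as int;
  if n < 0x80 then [n]
  else if n < 0x800 then
    [0xC0 + n / 0x40, 0x80 + n % 0x40]
  else if n < 0x1_0000 then
    [0xE0 + n / 0x1000, 0x80 + (n / 0x40) % 0x40, 0x80 + n % 0x40]
  else
    [0xF0 + n / 0x4_0000, 0x80 + (n / 0x1000) % 0x40,
     0x80 + (n / 0x40) % 0x40, 0x80 + n % 0x40]
}

// The whole string, one char at a time.
function ToUTF8(s: string): seq<byte> {
  if |s| == 0 then [] else EncodeChar(s[0]) + ToUTF8(s[1..])
}
```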
0012-unicode-strings.md (outdated):
```dafny
method Main() {
  var s := "Unicode is just so \ud83d\ude0e";
```
This line ought to give an error with `/unicodeChar:1`, since `\ud83d` is not a legal character.
Turns out I'm on board with that after all. :)
0012-unicode-strings.md (outdated):
> string and character literals could only contain printable and white-space ASCII characters,
> due to a limitation of the Coco/R parser generator the toolchain uses.
> This has been fixed, and both standard form and verbatim form string literals now allow any Unicode characters.
> A second form of escape sequence accepting a hexadecimal number with up to six digits, `\u{XXXXXX}`,
Is the "up to" just for char
literals? If it were to apply also to string
, then does "\ua0"
mean the 1-character string containing the ASCII character 160, or does it mean the 2-character string containing the ASCII characters 16 and 48?
Oh, are the curly braces required?
"\ua0"
is currently not legal, and would remain not legal, because uXXXX
escapes require exactly 4 characters, to avoid this exact ambiguity.
"\u{a0}"
would mean the 1-char
string containing the ASCII character 160.
(And as I type this I am leaning closer to making it "\U{a0}"
instead :) )
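A quick illustration of that intended semantics, sketched with the `\U{...}` spelling that revision 2 of the RFC settled on:

```dafny
method EscapeExample() {
  var s := "\U{a0}";          // a one-char string holding U+00A0
  assert |s| == 1;
  assert s[0] as int == 160;  // 0xA0
}
```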
0012-unicode-strings.md (outdated):
```dafny
// Several different ways to express the same string literal
var s1 := "Unicode is just so \ud83d\ude0e";
```
I propose we forbid this under `/unicodeChar:1`. After all, neither `\ud83d` nor `\ude0e` falls into the numeric ranges of the new `char`.
I didn't see a reason to reject valid string literals using the existing escape pattern, personally. I can make this clearer in the proposal, but we can keep both forms with these semantics:
- `\uXXXX` - a UTF-16 code unit; incorrect use of surrogates rejected by the parser.
- `\u{X}`...`\u{XXXXXX}` - a Unicode scalar value; surrogate range values rejected by the parser.

There is precedent for supporting both, as Go allows both `\uXXXX` and `\UXXXXXXXX`. I personally liked the variable-length syntax of `\u{X..X}` (also used by Swift) better, as requiring exactly eight digits just means the first two will always be `00` :)
I'm open to using `U` for the second form if it helps reduce confusion though.
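For concreteness, a sketch of what those dual semantics would mean for one literal (note that revision 2 later dropped the `\uXXXX` form entirely):

```dafny
method DualForms() {
  // A correctly paired surrogate escape: one scalar value, U+1F60E.
  var viaCodeUnits := "\ud83d\ude0e";
  // The same string written with a single scalar-value escape.
  var viaScalar := "\u{1F60E}";
  assert viaCodeUnits == viaScalar;
}
```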
Note we should avoid any implication that a single escape sequence produces a single "character", since neither form can produce 🇨🇦 that way. We will only know that a single `\u{X..X}` sequence will produce a single `char` (and that's not ultimately that meaningful anyway).
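For instance, 🇨🇦 is two regional-indicator scalar values (U+1F1E8, U+1F1E6), so it takes two escape sequences and occupies two `char`s; a sketch using the revision 2 `\U{...}` form:

```dafny
method FlagExample() {
  var flag := "\U{1F1E8}\U{1F1E6}";  // 🇨🇦
  assert |flag| == 2;
}
```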
Revision 2 now says we only have `\U{XXXXXX}` with `/unicodeChar:1`.
Also I realized that in Go `\uXXXX` still specifies Unicode code points rather than UTF-16 code units, that is, it rejects surrogate code points (and hence dafny-lang/dafny#1980 :)
0012-unicode-strings.md (outdated):
> The exact representation of strings at runtime, including the particular encoding,
> is an implementation detail of a particular backend, and will often be optimized for the idioms and support
> of the target environment. Enabling Unicode characters will change the target language types used to
Yes!
0012-unicode-strings.md (outdated):
```dafny
(0 <= n && n <= 55295 /* 0xD7FF */) ||
(57344 /* 0xE000 */ <= n && n <= 1114111 /* 0x10FFFF */ )
```
Use half-open intervals.
The only reason I used closed intervals throughout is because that's what the Unicode standard itself does. It's always irked me, but as a result the constant `0xDFFF` tends to be much more well-known than `0xE000`.
I will absolutely use half-open intervals here as it will probably verify more efficiently, and I'm happy to change the proposal accordingly if it would help readability.
Ended up going with half-open intervals for consistency.
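For reference, a sketch of the scalar-value constraint rewritten with half-open intervals (the predicate name is illustrative):

```dafny
// char values are Unicode scalar values: all code points minus the surrogate range.
predicate IsUnicodeScalarValue(n: int) {
  (0 <= n < 0xD800) || (0xE000 <= n < 0x11_0000)
}
```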
0012-unicode-strings.md (outdated):
> representing individual characters. `char` values can be converted to and from `int` values (using `as char` and `as int` expressions),
> and an `int` value corresponding to a `char` value is currently required to be a valid UTF-16 code unit, i.e. in the range
> `[0, 65536)`. This range includes the so-called ["surrogate" code points](https://unicode.org/faq/utf_bom.html#utf16-2),
> i.e. values in the range `[0xD800, 0xDFFF]`,
Continue using half-open intervals: `[0xD800, 0xE000)`.
0012-unicode-strings.md (outdated):
> and are not assignable Unicode code points themselves.
>
> I propose a breaking change in Dafny 4.0, to make `char` represent any Unicode code point, independent of the encoding used.
> This means that the corresponding `int` value for a `char` must always be a [Unicode scalar value](https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf#G7404), meaning any value in the range `[0, 0x10FFFF]` but excluding the surrogate code points from `[0xD800, 0xDFFF]`.
Use half-open intervals, here and throughout: `[0, 0x11_0000)`.
> to work with the new semantics of Unicode characters.
> In the C# runtime, for example, the `Sequence.FromString(string)` method converts a native string
> to an equivalent `Dafny.ISequence<char>` copy.
> A parallel method named something similar to `Sequence.UnicodeFromString(string)` could be added
We should document and officially support the way to get to these conversion routines from Dafny. More precisely, we should give the `:extern` declarations one needs to write in the Dafny source to get to these routines. For C#, that may be:

```dafny
type {:extern} NativeString
function {:extern "Dafny.ConvertStringToNative"} StringToNative(s: string): NativeString
function {:extern "Dafny.ConvertNativeToString"} NativeToString(s': NativeString): string
```
These functions are not total: not all C# or Javascript strings will be representable in `/unicodeChar:1`.
I would specifically recommend against having any `NativeString` type, because it won't behave consistently across runtimes. I'm recommending instead having these adaptors defined in each runtime independently, so they can either trust the underlying native string semantics or check for and reject invalid values as appropriate. I'll expand the detail on those adaptors and make this clearer.
Great work. Some high-level comments / worries. I would like to see more details about the following:
- Interop: since the new string type won't be able to represent all UTF-16 code unit sequences, it would be nice to think ahead about how we will deal with ill-formed input (I'm especially thinking of Javascript strings coming from user inputs on a webpage, or Windows file system paths). In particular, Rustan's proposed native string conversion functions are partial. Will we have them return `Result` values? (See the sketch after this list.)
- String literals: are we going to keep using the current UTF-16 based model? The `\u` codes in Dafny assume UTF-16 AFAICT.
- Conversions to/from sequences: I'm not sure it's best to keep `string` as a `seq<char>`. In particular, it prevents us from defining member functions on the type `string`. We could support the `as` operator for casting to a `seq<char>` cheaply.
- I'm not sure caching decoding to UTF-32 is efficient. We could simply cache a few `(string, index) → offset` pairs… even possibly as a global map (not attached to every string object).
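A minimal sketch of the `Result`-returning option from the first bullet; the datatype shape and extern names here are hypothetical, not part of the RFC:

```dafny
datatype Result<T> = Success(value: T) | Failure(error: string)

// An opaque handle for a target-language string (whether such a type should
// exist at all is debated elsewhere in this thread).
type {:extern} NativeString

// Partial conversion: a per-backend implementation would return Failure for
// input containing unpaired surrogates instead of trapping.
function {:extern "Dafny.Helpers", "TryFromNative"}
  TryFromNative(s: NativeString): Result<string>
```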
0012-unicode-strings.md (outdated):
> ## Correctness
>
> The current definition of these types means that the Dafny `string` type allows data that is not a valid Unicode string.
> The value `"\uD800"`, for example, is not a valid Unicode string and has no valid encoding in UTF-8, UTF-16,
I think it would be worth trying to align this document's terminology with the standard Unicode terms. IIRC "Unicode string" refers to a sequence of code units (i.e. after encoding), so the first part is valid (it's not a valid Unicode string), but the second part is unclear ("has no valid encoding").
You're right, and unfortunately I can't find any standard terminology for the abstract "sequence of scalar values" concept that is ultimately what I want `string` to represent.
0012-unicode-strings.md (outdated):
> ## Compatibility
>
> The current definitions of `string` and `char` are biased towards using a UTF-16 encoding at runtime.
> This aligns well with some compilation target languages which also use UTF-16, such as Java, C#, and JavaScript,
But C# and Javascript don't use UTF-16, since they don't enforce well-formedness.
Fine :) Qualified a bit here but will address this more concretely in the Runtime section.
0012-unicode-strings.md (outdated):
```dafny
method Main() {
  var s := "Unicode is just so \ud83d\ude0e";
```
Up to this point the document hasn't addressed the representation of string literals, so I'm having trouble parsing the example. What does this string represent with `/unicodeChar:1`?
I pulled a fair bit of content specifically on string literals to the early parts of this section, hopefully it helps!
Hopefully not supporting `\uXXXX` with `/unicodeChar:1` any longer helps clear this up.
0012-unicode-strings.md (outdated):
> ## Go:
>
> In Go a string is a read-only slice of bytes, which generally contains UTF-8 encoded bytes
> but may contain arbitrary bytes. The `rune` type is an alias for `int32`.
How do the `rune`-based APIs behave when given an int outside of the range of valid scalar values?
0012-unicode-strings.md (outdated):
> and two new corresponding `std::u16string` and `std::u32string` collection classes.
> It also provides three new kinds of string literals,
> `u8"..."`, `u"..."`, and `U"..."`,
> which produce binary values encoded with UTF-8, UTF-16, and UTF-32 respectively.
What is a "binary value"?
0012-unicode-strings.md (outdated):
> ## C++:
>
> The `char` type represents bytes, and the `std::string` class from the standard library
> operates on bytes as character, and generally does not produce correct results if used
I don't understand this part: "operates on bytes as character".
0012-unicode-strings.md (outdated):
> [unresolved-questions]: #unresolved-questions
>
> Is there anything more we can do to make migration easier and safer for users?
> Chances are very good that all Dafny code in existence to date either will not change behavior
This seems overly optimistic to me WRT string literals, but I'm not sure how we plan to handle them. At the moment they are sequences of characters mixed with UTF-16 code units (possibly ill-formed), right? It would be good to specify what happens to them in `/unicodeChar:1`.
I think this is moot now but let me know otherwise.
There will be code that we won't be able to port (code using string literals with unpaired surrogates), but that's fine
Yup, the worst case scenario will be code using `string` to carry such data, which the conversion utility (or verification with `/unicodeChar:1`) will reject, and the code will have to be rewritten to use `seq<uint16>` instead. I'm definitely okay with that, especially as I think it's much more likely such code will be unaware of the issue with unpaired surrogates rather than intentionally using the current definition of `string` to allow them.
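A hedged sketch of what that rewrite might look like (the `uint16` newtype is illustrative, not defined by the RFC):

```dafny
newtype uint16 = n: int | 0 <= n < 0x1_0000

// Data with an unpaired surrogate: rejected as a /unicodeChar:1 string
// literal, but representable as a plain sequence of UTF-16 code units.
const illFormed: seq<uint16> := [0x0048, 0xD800, 0x0069]
```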
0012-unicode-strings.md (outdated):
> Chances are very good that all Dafny code in existence to date either will not change behavior
> across this change, or will slightly improve because of the improved handling of surrogate code points.
> I have been unable to think of anything that would provide more value than the verifier will already provide,
> but I am open to suggestion as always!
Automatically converting old-style `\u` strings to new-style `\u{}` strings would be useful :)
That's a fair point, and although I've clarified I don't intend to drop `\u` (just reject invalid sequences of them), even so that conversion utility would be very easy to implement.
Now that I'm dropping `\u` I agree and will definitely provide a conversion utility.
> This new type could alternatively be introduced with a different name, such as `unicode` as in Python 2,
> while keeping the alias of `string` for `seq<char>`.
> This would only increase the confusion and cognitive burden for Dafny users in the future, though.
Can we discuss how easy it will be for external code to wrap a native string type to give it Dafny's string API? It would be nice if one could implement `ISequence<char>` using C# strings (plus some sanity checks), to save on conversions.
I'm going to rewrite the Compilation/Runtime section to be more clear about this (as I intended that to address exactly this point, but I don't think it's clear enough yet).
I THINK this should have enough detail but let me know otherwise.
> effort to achieve verification, or would have to be abandoned entirely in favour of the helper methods
> provided by the shared library.
>
> ## Change the definition of the `string` type
The section describes downsides of having a standalone `string` type, but what are the upsides?
I think one aspect to consider is managing expectations. If you have `type string = seq<char>`, then what performance expectations will the user have of accessing individual characters of that string, and do those align with what we're providing?
With a custom `string` type, you can define a fresh performance contract, such as that accessing a specific character may be linear time. With that, you would be free to encode the Dafny type `nativestring` using native strings. I don't think that's the right thing to do, but it seems like a benefit of this approach that isn't mentioned. Or you could have a type `nativestring` that doesn't have an API for accessing individual characters, so that users who process strings only in a simple fashion can keep using native strings instead of Dafny strings, which are memory-heavy and have to be converted when using external code.
Expanding on the comment about Windows paths and Javascript strings: this is relevant to the discussion of adding arguments to …
0012-unicode-strings.md (outdated):
> but in previous versions of Dafny,
> string and character literals could only contain printable and white-space ASCII characters,
> due to a limitation of the Coco/R parser generator the toolchain uses.
> This has been fixed, and both standard form and verbatim form string literals now allow any Unicode characters.
What has been fixed? The Coco/R parser?
This section is intended to describe the desired end state for Dafny users, so it's describing a future state. :)
I'm saying I can see where the issue is (specifically in our copy of `Scanner.frame`, which is creating a `Buffer` when it should be creating a `UTF8Buffer`) so I'm proposing we fix that.
FYI this part has been actually fixed now: dafny-lang/dafny#2717 :)
> The exact representation of strings at runtime, including the particular encoding,
> is an implementation detail of a particular backend, and will often be optimized for the idioms and support
> of the target environment.
> This also applies to string literals in Dafny source code:
This is better but still ambiguous. Maybe it's enough to say that string literals are as in Dafny 3, except for the fact that unpaired surrogates are disallowed?
That's only true w.r.t. what literals are permitted in source text though (with the added `\U{X..X}` escape sequence), not how they are represented at runtime. I do want to say that, unlike in Dafny 3, users shouldn't assume a string literal becomes a sequence where every ASCII character or `\u` escape sequence literally becomes an individual element of that sequence. I want to call out that if you compile your program to Go, for example, your literal will likely become a Go `string` value wrapped as a `Dafny.Sequence<rune>`.
Again I'm hoping that dropping support for `\u` makes this clearer.
> Migrating an existing codebase should reduce to a simple find-and-replace operation.
>
> Note that this aspect of the foreign function interface in Dafny is best handled on a per-backend basis.
> It isn't generally tractable to define a single `NativeString` type with consistent semantics,
Could you specify in more detail what the limitations of a `NativeString` type would be?
I imagine it would be a subset of a collection type with linear time access, like

```dafny
type NativeString = x: LinkedListTrait<char> | ValidInAllRuntimes(x)
```

The linked-list trait would warn users against doing random access on individual characters, since that may take linear time depending on the runtime.
There would be some native methods on `NativeString` like `Concat(right: NativeString): NativeString` (linear in total length) and `Take(n: nat): NativeString` (linear in n). Something like `SubString(start: nat, length: nat)` might not exist, because that would be linear not just in `length` but also in `start`, which is surprising.
Implementation of the design from dafny-lang/rfcs#13. Resolves #413. Fixes #2928. Fixes #818. Fixes #3058. Fixes #1293. Fixes #3001. Depends on #2976 to be fixed for the tests to pass consistently across platforms. I've documented some of the less obvious compilation strategy decisions in `docs/Compilation/StringsAndChars.md`.
Resolves #12. Resolves dafny-lang/dafny#413.
Tagging those I suspect will be interested: @dafny-lang/dafny-core @mschlaipfer @txiang61 @seanmcl @indolering @alex-chew @seebees
Revision 2: Besides general clarification, I decided to drop support for the old `\uXXXX` UTF-16 code unit escape sequence with `/unicodeChar:1`, as keeping the support was creating confusion even though it was technically possible.