Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Unicode strings and characters #13

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

robin-aws
Copy link
Member

@robin-aws robin-aws commented Jul 14, 2022

Resolves #12. Resolves dafny-lang/dafny#413.

Tagging those I suspect will be interested: @dafny-lang/dafny-core @mschlaipfer @txiang61 @seanmcl @indolering @alex-chew @seebees

Revision 2: Besides general clarification, I decided to drop support for the old \uXXXX UTF-16 code unit escape sequence with /unicodeChar:1, as keeping the support was creating confusion even though it was technically possible.

Copy link
Collaborator

@RustanLeino RustanLeino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Let's do it!

Also, the write-up is especially nice.

I suggest an explicit mention, somewhere, that the design omits any "byte access" into string. This is on purpose, because we want to encourage Dafny users to work at the higher level of seq<char>, not dealing with the underlying representation. In places where the bytes are of importance, it is possible to write a string-to-byte-sequence function in Dafny.


```dafny
method Main() {
var s := "Unicode is just so \ud83d\ude0e";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line ought to give an error with /unicodeChar:1, since \ud83d is not a legal character.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out I'm on board with that after all. :)

string and character literals could only contain printable and white-space ASCII characters,
due to a limitation of the Coco/R parser generator the toolchain uses.
This has been fixed, and both standard form and verbatim form string literals now allow any Unicode characters.
A second form of escape sequence accepting a hexadecimal number with up to six digits, `\u{XXXXXX}`,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the "up to" just for char literals? If it were to apply also to string, then does "\ua0" mean the 1-character string containing the ASCII character 160, or does it mean the 2-character string containing the ASCII characters 16 and 48?

Oh, are the curly braces required?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"\ua0" is currently not legal, and would remain not legal, because uXXXX escapes require exactly 4 characters, to avoid this exact ambiguity.

"\u{a0}" would mean the 1-char string containing the ASCII character 160.

(And as I type this I am leaning closer to making it "\U{a0}" instead :) )


```dafny
// Several different ways to express the same string literal
var s1 := "Unicode is just so \ud83d\ude0e";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose we forbid this under /unicodeChar:1. After all, neither \ud83d nor \ude0e falls into the numeric ranges of the new char.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see a reason to reject valid string literals using the existing escape pattern, personally. I can make this clearer in the proposal, but we can keep both forms with these semantics:

  1. \uXXXX - UTF-16 code unit, incorrect use of surrogates rejected by the parser.
  2. \u{X}...\u{XXXXXX} - Unicode scalar value, surrogate range values rejected by the parser.

There is precedent for supporting both, as Go allows both \uXXXX and \UXXXXXXXX. I liked the variable-length syntax of \u{X..X} (also used by Swift) better personally, as exactly requiring exactly eight digits just means the first two will always be 00 :)

I'm open to using U for the second form if it helps reduce confusion though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note we should avoid any implication that a single escape sequence produces a single "character", since neither form can produce 🇨🇦 that way. We will only know that a single u{X..X} sequence will produce a single char (and that's not ultimately that meaningful anyway).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revision 2 now says we only have \U{XXXXXX} with /unicodeChar:1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I realized that in Go \uXXXX still specifies Unicode code points rather than UTF-16 code units, that is, it rejects surrogate code points (and hence dafny-lang/dafny#1980 :)

Comment on lines 94 to 96
The exact representation of strings at runtime, including the particular encoding,
is an implementation detail of a particular backend, and will often be optimized for the idioms and support
of the target environment. Enabling Unicode characters will change the target language types used to
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes!

0012-unicode-strings.md Outdated Show resolved Hide resolved
Comment on lines 119 to 120
(0 <= n && n <= 55295 /* 0xD7FF */) ||
(57344 /* 0xE000 */ <= n && n <= 1114111 /* 0x10FFFF */ )
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use half-open intervals.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason I used closed intervals throughout is because that's what the Unicode standard itself does. It's always irked me, but as a result the constant 0xDFFF tends to be much more well-known than 0xE000.

I will absolutely use half-open intervals here as it will probably verify more efficiently, and I'm happy to change the proposal accordingly if it would help readability.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ended up going with half-open intervals for consistency.

representing individual characters. `char` values can be converted to and from `int` values (using `as char` and `as int` expressions),
and an `int` value corresponding to a `char` value is currently required to be a valid UTF-16 code unit, i.e. in the range
`[0, 65536)`. This range includes the so-called ["surrogate" code points](https://unicode.org/faq/utf_bom.html#utf16-2),
i.e. values in the range `[0xD800, 0xDFFF]`,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continue using half-open intervals: [0xD800, 0xE000).

and are not assignable Unicode code points themselves.

I propose a breaking change in Dafny 4.0, to make `char` represent any Unicode code point, independent of the encoding used.
This means that the corresponding `int` value for a `char` must always be a [Unicode scalar value](https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf#G7404), meaning any value in the range `[0, 0x10FFFF]` but excluding the surrogate code points from `[0xD800, 0xDFFF]`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use half-open intervals, here and throughout: [0, 0x11_0000).

0012-unicode-strings.md Outdated Show resolved Hide resolved
to work with the new semantics of Unicode characters.
In the C# runtime, for example, the `Sequence.FromString(string)` method converts a native string
to a equivalent `Dafny.ISequence<char>` copy.
A parallel method named something similar to `Sequence.UnicodeFromString(string)` could be added
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should document and officially support the way to get to these conversion routines from Dafny. More precisely, we should give the :extern declarations one needs to write in the Dafny source to get to these routines. For C#, that may be:

type {:extern} NativeString
function {:extern "Dafny.ConvertStringToNative"} StringToNative(s: string): NativeString
function {:extern "Dafny.ConvertNativeToString"} NativeToString(s': NativeString): string

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions are not total: not all C# or Javascript strings will be representable in /unicodeChar:1.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would specifically recommend against having any NativeString type, because it won't behave consistently across runtimes. I'm recommending instead having these adaptors defined in each runtime independently, so they can either trust the underlying native string semantics or check for and reject invalid values as appropriate. I'll expand the detail on those adaptors and make this clearer.

Copy link
Member

@cpitclaudel cpitclaudel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. Some high-level comments / worries

I would like to see more details about the following:

  • Interop: since the new string type won't be able to represent all UTF-16 code unit sequences, it would be nice to think ahead of how we will deal with ill-formed input (I'm especially thinking of Javascript strings coming from user inputs on a webpage, or windows file system paths). In particular, Rustan's proposed native string conversion functions are partial. Will we have them return Result values?
  • String literals. Are we going to keep using the current UTF-16 based model? The \u codes in Dafny assume UTF-16 AFAICT.
  • Conversions to/from sequences. I'm not sure it's the best to keep string as a seq<char>. In particular, it prevents us from defining member functions on the type string. We could support the as operator for casting to a seq<char> cheaply.
  • I'm not sure caching decoding to UTF-32 is efficient. We could simply cache a few (string, index) → offset pairs… even possibly as a global map (not attached to every string object).

## Correctness

The current definition of these types means that the Dafny `string` type allows data that is not a valid Unicode string.
The value `"\uD800"`, for example, is not a valid Unicode string and has no valid encoding in UTF-8, UTF-16,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be worth trying to align this document's terminology with the standard Unicode terms. IIRC Unicode string refers to a sequence of code units (ie after encoding), so the first part is valid (it's not a valid unicode string), but the second part is unclear ("has no valid encoding")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, and unfortunately I can't find any standard terminology for the abstract "sequence of scalar values" concept that is ultimately what I want string to represent.

## Compatibility

The current definitions of `string` and `char` are biased towards using a UTF-16 encoding at runtime.
This aligns well with some compilation target languages which also use UTF-16, such as Java, C#, and JavaScript,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But C# and Javascript don't use UTF-16, since they don't enforce well-formedness.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine :) Qualified a bit here but will address this more concretely in the Runtime section.


```dafny
method Main() {
var s := "Unicode is just so \ud83d\ude0e";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to this point the document hasn't addressed the representation of string literals, so I'm having trouble parsing the example. What does this string represent with unicodeChar:1?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pulled a fair bit of content specifically on string literals to the early parts of this section, hopefully it helps!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully not supporting \uXXXX with \unicodeChar:1 any longer helps clear this up.

0012-unicode-strings.md Outdated Show resolved Hide resolved
to work with the new semantics of Unicode characters.
In the C# runtime, for example, the `Sequence.FromString(string)` method converts a native string
to a equivalent `Dafny.ISequence<char>` copy.
A parallel method named something similar to `Sequence.UnicodeFromString(string)` could be added
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions are not total: not all C# or Javascript strings will be representable in /unicodeChar:1.

## Go:

In Go a string is a read-only slice of bytes, which generally contains UTF-8 encoded bytes
but may contain arbitrary bytes. The `rune` type is an alias for `int32`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do the rune-based API behave when given an int outside of the range of valid scalar values?

and two new corresponding `std::u16string` and `std::u32string` collection classes.
It also provides three new kinds of string literals,
`u8"..."`, `u"..."`, and `U"..."`,
which produce binary values encoded with UTF-8, UTF-16, and UTF-32 respectively.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a binary value?

## C++:

The `char` type represents bytes, and the `std::string` class from the standard library
operates on bytes as character, and generally does not produce correct results if used
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this part:

operates on bytes as character

[unresolved-questions]: #unresolved-questions

Is there anything more we can do to make migration easier and safer for users?
Chance are very good that all Dafny code in existence to date either will not change behavior
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems overly optimistic to me WRT string literals, but I'm not sure how we plan to handle them. At the moment they are sequences of characters mixed with UTF16 code units (possibly ill-formed), right? It would be good to specify what happens to them in /unicodeChar:1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is moot now but let me know otherwise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be code that we won't be able to port (code using string literals with unpaired surrogates), but that's fine

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, the worst case scenario will be code using string to carry such data, which the conversion utility (or verification with /unicodeChar:1) will reject, and the code will have to be rewritten to use seq<uint16> instead. I'm definitely okay with that, especially as I think it's much more likely such code will be unaware of the issue with unpaired surrogates rather than intentionally using the current definition of string to allow them.

Chance are very good that all Dafny code in existence to date either will not change behavior
across this change, or will slightly improve because of the improved handling of surrogate code points.
I have been unable to think of anything that would provide more value than the verifier will already provide,
but I am open to suggestion as always!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automatically converting old-style \u strings to new-style \u{} strings would be useful :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair point, and although I've clarified I don't intend to drop \u (just reject invalid sequences of them), even so that conversion utility would be very easy to implement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I'm dropping \u I agree and will definitely provide a conversion utility.


This new type could alternatively be introduced with a different name, such as `unicode` as in Python 2,
while keeping the alias of `string` for `seq<char>`.
This would only increase the confusion and cognitive burden for Dafny users in the future, though.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we discuss how easy it will be for external code to wrap a native string type to give it Dafny's string API? It would be nice if one could implement ISequence<char> using C# strings (plus some sanity checks), to save on conversions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to rewrite the Compilation/Runtime section to be more clear about this (as I intended that to address exactly this point, but I don't think it's clear enough yet)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I THINK this should have enough detail but let me know otherwise.

@cpitclaudel
Copy link
Member

cpitclaudel commented Jul 25, 2022

As one last high-level comment: The debate on whether string should be seq<char> or its own type would be moot if seq<char> was a trait instead of a type. Then string could be a separate type that implements seq<char>. I wonder how easy that change would be with @RustanLeino's new type system (part of the issue is knowing how to deal with empty sequences).

@robin-aws robin-aws self-assigned this Jul 25, 2022
effort to achieve verification, or would have to be abandoned entirely in favour of the helper methods
provided by the shared library.

## Change the definition of the `string` type
Copy link
Member

@keyboardDrummer keyboardDrummer Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The section describes downsides of having a standalone string type, but what are the upsides?

I think one aspect to consider is managing expectations. If you have type string = seq<char>, then what performance expectations will the user have of accessing individual characters of that string, and do those align with what we're providing?

With a custom string type, you can define a fresh performance contract, such as that accessing a specific character may be linear time. With that, you would be free to encode the Dafny type nativestring using native strings. I don't think that's the right thing to do, but it seems like a benefit of this approach that isn't mentioned. Or you could have a type nativestring that doesn't have an API for accessing individual characters, that way users that are processing strings but only in a simple fashion, can keep using native strings instead of the memory heavy, have to be converted when using external code, Dafny strings.

@cpitclaudel
Copy link
Member

Expanding on the comment about Windows path and Javascript strings:

This is relevant to the discussion of adding arguments to main, too: we'll have to consider whether main takes strings or OS strings.

but in previous versions of Dafny,
string and character literals could only contain printable and white-space ASCII characters,
due to a limitation of the Coco/R parser generator the toolchain uses.
This has been fixed, and both standard form and verbatim form string literals now allow any Unicode characters.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What has been fixed? The Coco/R parser?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section is intended to describe the desired end state for Dafny users, so it's describing a future state. :)

I'm saying I can see where the issue is (specifically in our copy of Scanner.frame, which is creating a Buffer when it should be creating a UTF8Buffer) so I'm proposing we fix that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI this part has been actually fixed now: dafny-lang/dafny#2717 :)

@robin-aws robin-aws changed the title Unicode strings and characters RFC: Unicode strings and characters Jul 26, 2022
The exact representation of strings at runtime, including the particular encoding,
is an implementation detail of a particular backend, and will often be optimized for the idioms and support
of the target environment.
This also applies to string literals in Dafny source code:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is better but still ambiguous. Maybe it's enough to say that strings literals are as in Dafny 3, except for the fact that unpaired surrogates are disallowed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's only true w.r.t. what literals are permitted in source text though (with the added \U{X..X} escape sequence), not how they are represented at runtime. I do want to say that, unlike in Dafny 3, users shouldn't assume a string literal becomes a sequence where every ASCII character or \u escape sequence literally becomes an individual element of that sequence. I want to call out that if you compile your program to Go, for example, your literal will likely become a Go string value wrapped as a Dafny.Sequence<rune>.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I'm hoping that dropping support for \u makes this clearer.

0012-unicode-strings.md Outdated Show resolved Hide resolved
0012-unicode-strings.md Outdated Show resolved Hide resolved
0012-unicode-strings.md Outdated Show resolved Hide resolved
0012-unicode-strings.md Outdated Show resolved Hide resolved
0012-unicode-strings.md Outdated Show resolved Hide resolved
0012-unicode-strings.md Outdated Show resolved Hide resolved
Migrating an existing codebase should reduce to a simple find-and-replace operation.

Note that this aspect of the foreign function interface in Dafny is best handled on a per-backend basis.
It isn't generally tractable to define a single `NativeString` type with consistent semantics,
Copy link
Member

@keyboardDrummer keyboardDrummer Sep 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you specify in more details what the limitations of a NativeString type would be?

I imagine it would be a subset of a collection type with linear time access, like

type NativeString = x: LinkedListTrait<char> | ValidInAllRuntimes(x)

The LinkedList would warn users against doing random access on individual characters, since that may take linear time depending on the runtime.

There would be some native methods on NativeString like Concat(right: NativeString): NativeString (linear in total length) and Take(n: nat): NativeString (linear in n). Something like SubString(start: nat, length: nat) might not exist because that would be linear not just in length but also in start, which is surprising.

robin-aws added a commit to dafny-lang/dafny that referenced this pull request Nov 24, 2022
Implementation of the design from
dafny-lang/rfcs#13.

Resolves #413. Fixes #2928. Fixes #818. Fixes #3058. Fixes #1293. Fixes
#3001.

Depends on #2976 to be fixed for the tests to pass consistently across
platforms.

I've documented some of the less obvious compilation strategy decisions
in `docs/Compilation/StringsAndChars.md`.

<small>By submitting this pull request, I confirm that my contribution
is made under the terms of the [MIT
license](https://github.com/dafny-lang/dafny/blob/master/LICENSE.txt).</small>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Any change that will cause existing Dafny codebases to break (excluding verification instability)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RFC pending: Unicode string support Revisit definition of Dafny strings (RFC #12)
5 participants