Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revisit definition of Dafny strings (RFC #12) #413

Closed
robin-aws opened this issue Nov 5, 2019 · 10 comments · Fixed by #3016 · May be fixed by dafny-lang/rfcs#13
Closed

Revisit definition of Dafny strings (RFC #12) #413

robin-aws opened this issue Nov 5, 2019 · 10 comments · Fixed by #3016 · May be fixed by dafny-lang/rfcs#13
Assignees
Labels
breaking-change Any change that will cause existing Dafny codebases to break (excluding verification instability) kind: enhancement Enhancements, feature requests, etc. These are NOT bugs, but ways we can improve Dafny part: language definition Relating to the Dafny language definition itself
Milestone

Comments

@robin-aws
Copy link
Member

robin-aws commented Nov 5, 2019

(was: "Explicitly document that Dafny strings are UTF-16")

I found this just a bit too subtle when reading the reference manual. There are a couple references to "Unicode letters" and "Unicode characters" that should really be "Unicode 16-bit code units". The grammar does spell out that a Unicode character is four hex digits, but that's a little bit buried.

Even better, I would love it if Dafny characters were actually sequences of Unicode code points, so that the built-in sequence operations could be used to perform robust per-character string manipulation.

@robin-aws
Copy link
Member Author

Going further than this, even if the canonical representation of strings in Dafny is using UTF16, another issue with just making string an alias for seq<char> means that it allows invalid UTF16 sequences (those that use surrogate code points incorrectly for example).

Given that Unicode is so often misunderstood and often the source of subtle bugs, I think there's a great opportunity to have a verifiably-correct implementation of Unicode in Dafny. Languages like Java and C# will often allow invalid sequences and can be lax with encoding/decoding, but I think it would be much better to be strict in Dafny.

@robin-aws robin-aws changed the title Explicitly document that Dafny strings are UTF-16 Revisit definition of Dafny strings May 14, 2020
@robin-aws
Copy link
Member Author

Making string an alias for seq<char> also has implications for interacting with external code: it is necessary to convert from the native representation in the target language to the runtime implementation of Dafny sequences. This can have performance consequences and also complicates the "foreign function interface": because sequences are value types in Dafny you can't refer to a possibly-null string value, which happens all the time in external code.

@robin-aws
Copy link
Member Author

My current plan here is to keep string as an alias for seq<char>, but to redefine char to mean "Unicode scalar value" ala Rust, and to therefore need 32 bits instead of 16. This means the direct value of a string is the "true" Unicode sequence. Within each target runtime, it should be possible to implement this type by wrapping the native representation of strings (as Java already does with the StringDafnySequence class), therefore allowing more efficient encoding of these values. This will help with both performance and the FFI - yes, doing per-character operations will be slow because the native wrappers will not be random access w.r.t. the actual Unicode scalar values, but it should be possible to optimize all the operations Unicode-compliant code will actually need to do.

@robin-aws robin-aws added the kind: enhancement Enhancements, feature requests, etc. These are NOT bugs, but ways we can improve Dafny label Jun 23, 2020
@acioc acioc added this to the Dafny 3.0 milestone Aug 12, 2020
@indo-dev-0
Copy link

Hi! I'm modelling bits of Unicode in Dafny.

I think you will need to offer a fairly rich string API. Users who are modelling abstract systems to ensure correctness of an algorithm probably don't want to spend their compute budget proving anything about strings. There are also plenty of specifications that limit input to printable ASCII, Base64, alphanumerics, etc.

Users who want to prove complex things about strings will also need to worry about low level details like normal forms, stream safety, Unicode versions, and platform-specific or even broken behaviour. At that point, Unicode Scalar Value won't be good enough, as you will need to reason about non-character code points.

I think it would be interesting to see a functional base class that works across all string types, with iterators for low-level per-codepoint/per-scalar access but emphasizing higher level operations like split, contains, replace, pad, etc. A richer API would result in fewer programmer errors, as there are lots of sharp corner cases in the Unicode standard, especially with regard to comparison functions.

Did you find any high-assurance implementations of Unicode? I found some stuff on Liquid Haskell verifying a subset of Haskell's text handling library, but Unicode is not a very specific search query.

@robin-aws
Copy link
Member Author

Hi there! I think we have compatible but fairly different goals here. My main intention with this issue was to make the Dafny built-in string type represent the abstract, semantic value of a string instead of the implementation. The current definition is definitely too tightly coupled to the latter since it refers to 16-bit code units, which is how strings are implemented in .NET and Java, for example, but certainly not in all target languages.

As I mentioned in my last comment, the immediate plan is only to defer to existing string implementations in the target languages. This will be the right choice for most programs for efficiency and compatibility with external code.

What you're describing should definitely be possible too, however: a Dafny-native implementation of Unicode based on actual byte sequences, which would use ghost string values to prove correctness. I'd suggest you define your own Unicode scalar value type for now (something like newtype myChar := c: int | ...) so that you can use seq<myChar> for a similar purpose and not be thrown off by the current definition of string.

@indolering
Copy link

What you're describing should definitely be possible too, however: a Dafny-native implementation of Unicode based on actual byte sequences, which would use ghost string values to prove correctness. I'd suggest you define your own Unicode scalar value type for now (something like newtype myChar := c: int | ...) so that you can use seq for a similar purpose and not be thrown off by the current definition of string.

This is what I have done, although using nats as the base value. I've also created parallel OO versions of each type, primarily as a pedagogical exercise. I'm assuming it will be easier to verify statements about sets/maps of the primitive type than it will be to verify transforms of the OO versions.

@davidcok davidcok modified the milestones: Dafny 3.0, Dafny 3.1 Dec 23, 2020
@seanmcl
Copy link
Collaborator

seanmcl commented Jun 9, 2021

I think implementing UTF-8 and UTF-16 in Dafny, rather than in C#, is a good approach. Like @indolering mentioned, you could get all sorts of nice theorems about the relationship between the underlying byte array and the string. Keeping the Dafny implementation as small as we can is also a benefit. It's interesting to think about how to make Dafny code that uses strings agnostic in its unicode encoding, which would be something we need if we want to compile to languages with different encodings.

@indolering
Copy link

I'm planning on doing this, I just need to get some design feedback from the Unicode mailing list and apply for funding.

@atomb atomb removed this from the Dafny 3.1 milestone Apr 21, 2022
@robin-aws robin-aws added part: language definition Relating to the Dafny language definition itself breaking-change Any change that will cause existing Dafny codebases to break (excluding verification instability) labels Jun 29, 2022
@robin-aws robin-aws self-assigned this Jun 29, 2022
@robin-aws
Copy link
Member Author

I've got a concrete design in mind for Dafny 4.0, and will write it up in a short RFC.

@robin-aws
Copy link
Member Author

dafny-lang/rfcs#12

@robin-aws robin-aws added this to the Dafny 4.0 milestone Aug 12, 2022
@robin-aws robin-aws changed the title Revisit definition of Dafny strings Revisit definition of Dafny strings (RFC #12) Oct 4, 2022
robin-aws added a commit that referenced this issue Nov 24, 2022
Implementation of the design from
dafny-lang/rfcs#13.

Resolves #413. Fixes #2928. Fixes #818. Fixes #3058. Fixes #1293. Fixes
#3001.

Depends on #2976 to be fixed for the tests to pass consistently across
platforms.

I've documented some of the less obvious compilation strategy decisions
in `docs/Compilation/StringsAndChars.md`.

<small>By submitting this pull request, I confirm that my contribution
is made under the terms of the [MIT
license](https://github.com/dafny-lang/dafny/blob/master/LICENSE.txt).</small>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Any change that will cause existing Dafny codebases to break (excluding verification instability) kind: enhancement Enhancements, feature requests, etc. These are NOT bugs, but ways we can improve Dafny part: language definition Relating to the Dafny language definition itself
Projects
None yet
7 participants