-
Notifications
You must be signed in to change notification settings - Fork 266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revisit definition of Dafny strings (RFC #12) #413
Revisit definition of Dafny strings (RFC #12) #413
Comments
Going further than this, even if the canonical representation of strings in Dafny is using UTF16, another issue with just making Given that Unicode is so often misunderstood and often the source of subtle bugs, I think there's a great opportunity to have a verifiably-correct implementation of Unicode in Dafny. Languages like Java and C# will often allow invalid sequences and can be lax with encoding/decoding, but I think it would be much better to be strict in Dafny. |
Making |
My current plan here is to keep |
Hi! I'm modelling bits of Unicode in Dafny. I think you will need to offer a fairly rich string API. Users who are modelling abstract systems to ensure correctness of an algorithm probably don't want to spend their compute budget proving anything about strings. There are also plenty of specifications that limit input to printable ASCII, Base64, alphanumerics, etc. Users who want to prove complex things about strings will also need to worry about low level details like normal forms, stream safety, Unicode versions, and platform-specific or even broken behaviour. At that point, Unicode Scalar Value won't be good enough, as you will need to reason about non-character code points. I think it would be interesting to see a functional base class that works across all string types, with iterators for low-level per-codepoint/per-scalar access but emphasizing higher level operations like split, contains, replace, pad, etc. A richer API would result in fewer programmer errors, as there are lots of sharp corner cases in the Unicode standard, especially with regard to comparison functions. Did you find any high-assurance implementations of Unicode? I found some stuff on Liquid Haskell verifying a subset of Haskell's text handling library, but Unicode is not a very specific search query. |
Hi there! I think we have compatible but fairly different goals here. My main intention with this issue was to make the Dafny built-in As I mentioned in my last comment, the immediate plan is only to defer to existing string implementations in the target languages. This will be the right choice for most programs for efficiency and compatibility with external code. What you're describing should definitely be possible too, however: a Dafny-native implementation of Unicode based on actual byte sequences, which would use ghost |
This is what I have done, although using nats as the base value. I've also created parallel OO versions of each type, primarily as a pedagogical exercise. I'm assuming it will be easier to verify statements about sets/maps of the primitive type than it will be to verify transforms of the OO versions. |
I think implementing UTF-8 and UTF-16 in Dafny, rather than in C#, is a good approach. Like @indolering mentioned, you could get all sorts of nice theorems about the relationship between the underlying byte array and the string. Keeping the Dafny implementation as small as we can is also a benefit. It's interesting to think about how to make Dafny code that uses strings agnostic in its unicode encoding, which would be something we need if we want to compile to languages with different encodings. |
I'm planning on doing this, I just need to get some design feedback from the Unicode mailing list and apply for funding. |
I've got a concrete design in mind for Dafny 4.0, and will write it up in a short RFC. |
Implementation of the design from dafny-lang/rfcs#13. Resolves #413. Fixes #2928. Fixes #818. Fixes #3058. Fixes #1293. Fixes #3001. Depends on #2976 to be fixed for the tests to pass consistently across platforms. I've documented some of the less obvious compilation strategy decisions in `docs/Compilation/StringsAndChars.md`. <small>By submitting this pull request, I confirm that my contribution is made under the terms of the [MIT license](https://github.com/dafny-lang/dafny/blob/master/LICENSE.txt).</small>
(was: "Explicitly document that Dafny strings are UTF-16")
I found this just a bit too subtle when reading the reference manual. There are a couple references to "Unicode letters" and "Unicode characters" that should really be "Unicode 16-bit code units". The grammar does spell out that a Unicode character is four hex digits, but that's a little bit buried.
Even better, I would love it if Dafny characters were actually sequences of Unicode code points, so that the built-in sequence operations could be used to perform robust per-character string manipulation.
The text was updated successfully, but these errors were encountered: