From f31895da30cbb0ca9be19a35a4f67bd9cf4607b4 Mon Sep 17 00:00:00 2001
From: Simon Sapin
Date: Tue, 6 May 2014 00:52:30 +0100
Subject: [PATCH 1/3] RFC: Add byte and byte string literals

---
 active/0000-ascii-literals.md | 109 ++++++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)
 create mode 100644 active/0000-ascii-literals.md

diff --git a/active/0000-ascii-literals.md b/active/0000-ascii-literals.md
new file mode 100644
index 00000000000..7955d813e9b
--- /dev/null
+++ b/active/0000-ascii-literals.md
@@ -0,0 +1,109 @@
+- Start Date: 2014-05-05
+- RFC PR #:
+- Rust Issue #:
+
+# Summary
+
+Add ASCII byte literals and ASCII byte string literals to the language,
+similar to the existing (Unicode) character and string literals.
+Before the RFC process was in place, this was discussed in mozilla/rust#4334.
+
+
+# Motivation
+
+Programs dealing with text usually should use Unicode,
+represented in Rust by the `str` and `char` types.
+In some cases however,
+a program may be dealing with bytes that can not be interpreted as Unicode as a whole,
+but still contain ASCII compatible bits.
+
+For example, the HTTP protocol was originally defined as Latin-1,
+but in practice different pieces of the same request or response
+can use different encodings.
+The PDF file format is mostly ASCII,
+but can contain UTF-16 strings and raw binary data.
+
+There is a precedent at least in Python, which has both Unicode and byte strings.
+
+
+# Drawbacks
+
+The language becomes slightly more complex,
+although that complexity should be limited to the parser.
+
+
+# Detailed design
+
+Using terminology from [the Reference Manual](http://static.rust-lang.org/doc/master/rust.html#character-and-string-literals):
+
+Extend the syntax of expressions and patterns to add
+byte literals of type `u8` and
+byte string literals of type `&'static [u8]` (or `[u8]`, post-DST).
+They are identical to the existing character and string literals, except that:
+
+* They are prefixed with a `b` (for "binary"), to distinguish them
+* Unescaped code points in the body must be in the ASCII range: U+0000 to U+007F.
+* `'\x5c' 'u' hex_digit 4` and `'\x5c' 'U' hex_digit 8` escapes are not allowed.
+* `'\x5c' 'x' hex_digit 2` escapes represent a single byte rather than a code point.
+  (They are the only way to express a non-ASCII byte.)
+
+Examples: `b'A' == 65u8`, `b'\t' == 9u8`, `b'\xFF' == 0xFFu8`,
+`b"A\t\xFF" == [65u8, 9, 0xFF]`
+
+Assuming `buffer` of type `&[u8]`
+```rust
+match buffer[i] {
+    b'a' .. b'z' => { /* ... */ }
+    c => { /* ... */ }
+}
+```
+
+
+# Alternatives
+
+Status quo: patterns must use numeric literals for ASCII values,
+or (for a single byte, not a byte string) cast to char
+
+```rust
+match buffer[i] {
+    c @ 0x61 .. 0x7A => { /* ... */ }
+    c => { /* ... */ }
+}
+match buffer[i] as char {
+    // `c` is of the wrong type!
+    c @ 'a' .. 'z' => { /* ... */ }
+    c => { /* ... */ }
+}
+```
+
+Another option is to change the syntax so that macros such as
+[`bytes!()`](http://static.rust-lang.org/doc/master/std/macros/builtin/macro.bytes.html)
+can be used in patterns, and add a `byte!()` macro:
+
+```rust
+match buffer[i] {
+    c @ byte!('a') .. byte!('z') => { /* ... */ }
+    c => { /* ... */ }
+}
+```
+
+This RFC was written to align the syntax with Python,
+but there could be many variations such as using a different prefix (maybe `a` for ASCII),
+or using a suffix instead (maybe `u8`, as in integer literals).
+
+The code points from syntax could be encoded as UTF-8
+rather than being mapped to bytes of the same value,
+but assuming UTF-8 is not always appropriate when working with bytes.
+
+See also previous discussion in mozilla/rust#4334.
+
+
+# Unresolved questions
+
+Should there be "raw byte string" literals?
+E.g. `pdf_file.write(rb"<< /Title (FizzBuzz \(Part one\)) >>")`
+
+Should control characters (U+0000 to U+001F) be disallowed in syntax?
+This should be consistent across all kinds of literals.
+
+Should the `bytes!()` macro be removed in favor of this?

From 4ea0ec9b4eb952d56d58e0e8c33358ba99638db4 Mon Sep 17 00:00:00 2001
From: Simon Sapin
Date: Tue, 6 May 2014 00:58:04 +0100
Subject: [PATCH 2/3] (Byte literals RFC) Fix lack of Markdown magic
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Apparently, GitHub’s auto-linking does not apply when rendering in-repo Markdown files.
---
 active/0000-ascii-literals.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/active/0000-ascii-literals.md b/active/0000-ascii-literals.md
index 7955d813e9b..7f558bdd28b 100644
--- a/active/0000-ascii-literals.md
+++ b/active/0000-ascii-literals.md
@@ -6,7 +6,8 @@
 
 Add ASCII byte literals and ASCII byte string literals to the language,
 similar to the existing (Unicode) character and string literals.
-Before the RFC process was in place, this was discussed in mozilla/rust#4334.
+Before the RFC process was in place,
+this was discussed in [#4334](https://github.com/mozilla/rust/issues/4334).
 
 
 # Motivation
@@ -95,7 +96,7 @@ The code points from syntax could be encoded as UTF-8
 rather than being mapped to bytes of the same value,
 but assuming UTF-8 is not always appropriate when working with bytes.
 
-See also previous discussion in mozilla/rust#4334.
+See also previous discussion in [#4334](https://github.com/mozilla/rust/issues/4334).
 
 
 # Unresolved questions

From 471fbe84b2873c9f9bab82e2ca84d5b311458bdb Mon Sep 17 00:00:00 2001
From: Simon Sapin
Date: Tue, 6 May 2014 01:22:52 +0100
Subject: [PATCH 3/3] (Byte literals RFC) Raw string prefix precedent

---
 active/0000-ascii-literals.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/active/0000-ascii-literals.md b/active/0000-ascii-literals.md
index 7f558bdd28b..668ed4ceb15 100644
--- a/active/0000-ascii-literals.md
+++ b/active/0000-ascii-literals.md
@@ -42,7 +42,8 @@ byte literals of type `u8` and
 byte string literals of type `&'static [u8]` (or `[u8]`, post-DST).
 They are identical to the existing character and string literals, except that:
 
-* They are prefixed with a `b` (for "binary"), to distinguish them
+* They are prefixed with a `b` (for "binary"), to distinguish them.
+  This is similar to the `r` prefix for raw strings.
 * Unescaped code points in the body must be in the ASCII range: U+0000 to U+007F.
 * `'\x5c' 'u' hex_digit 4` and `'\x5c' 'U' hex_digit 8` escapes are not allowed.
 * `'\x5c' 'x' hex_digit 2` escapes represent a single byte rather than a code point.
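
Reviewer note: a minimal sketch of how the literals proposed in this series behave, written against current stable Rust rather than the 2014 compiler the patches target. Two points differ from the RFC text and belong to this sketch only, not to the patches: inclusive range patterns are spelled `..=` instead of `..`, and a byte string literal has type `&'static [u8; N]`, which coerces to the `&[u8]` used below. `count_lowercase` is a hypothetical helper introduced purely for illustration.

```rust
/// Hypothetical helper (not part of the RFC): counts the ASCII lowercase
/// bytes in a buffer by matching on byte literals.
fn count_lowercase(buffer: &[u8]) -> usize {
    let mut n = 0;
    for &byte in buffer {
        match byte {
            // Byte literals are plain `u8` values, so they work in range patterns.
            b'a'..=b'z' => n += 1,
            _ => {}
        }
    }
    n
}

fn main() {
    // The examples from the "Detailed design" section.
    assert_eq!(b'A', 65u8);
    assert_eq!(b'\t', 9u8);
    assert_eq!(b'\xFF', 0xFFu8);

    // `\xFF` is one raw byte here, not the code point U+00FF.
    let bytes: &[u8] = b"A\t\xFF";
    assert_eq!(bytes, &[65u8, 9, 0xFF][..]);

    // "Hello, PDF!" contains four lowercase ASCII bytes: e, l, l, o.
    assert_eq!(count_lowercase(b"Hello, PDF!"), 4);
}
```

The match arm keeps the scrutinee typed as `u8`, which is the property the "Alternatives" section notes is lost when casting to `char`.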