RFC: Add byte and byte string literals #69
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,111 @@ | ||
- Start Date: 2014-05-05 | ||
- RFC PR #: | ||
- Rust Issue #: | ||
|
||
# Summary | ||
|
||
Add ASCII byte literals and ASCII byte string literals to the language, | ||
similar to the existing (Unicode) character and string literals. | ||
Before the RFC process was in place, | ||
this was discussed in [#4334](https://github.com/mozilla/rust/issues/4334). | ||
|
||
|
||
# Motivation | ||
|
||
Programs dealing with text usually should use Unicode, | ||
represented in Rust by the `str` and `char` types. | ||
In some cases, however, | ||
a program may be dealing with bytes that cannot be interpreted as Unicode as a whole, | ||
but still contain ASCII-compatible parts. | ||
|
||
For example, the HTTP protocol was originally defined as Latin-1, | ||
but in practice different pieces of the same request or response | ||
can use different encodings. | ||
The PDF file format is mostly ASCII, | ||
but can contain UTF-16 strings and raw binary data. | ||
|
||
There is a precedent at least in Python, which has both Unicode and byte strings. | ||
|
||
|
||
# Drawbacks | ||
|
||
The language becomes slightly more complex, | ||
although that complexity should be limited to the parser. | ||
|
||
|
||
# Detailed design | ||
|
||
Using terminology from [the Reference Manual](http://static.rust-lang.org/doc/master/rust.html#character-and-string-literals): | ||
|
||
Extend the syntax of expressions and patterns to add | ||
byte literals of type `u8` and | ||
byte string literals of type `&'static [u8]` (or `[u8]`, post-DST). | ||
They are identical to the existing character and string literals, except that: | ||
|
||
* They are prefixed with a `b` (for "binary"), to distinguish them. | ||
This is similar to the `r` prefix for raw strings. | ||
* Unescaped code points in the body must be in the ASCII range: U+0000 to U+007F. | ||
* `'\x5c' 'u' hex_digit 4` and `'\x5c' 'U' hex_digit 8` escapes are not allowed. | ||
* `'\x5c' 'x' hex_digit 2` escapes represent a single byte rather than a code point. | ||
(They are the only way to express a non-ASCII byte.) | ||
|
||
Examples: `b'A' == 65u8`, `b'\t' == 9u8`, `b'\xFF' == 0xFFu8`, | ||
`b"A\t\xFF" == [65u8, 9, 0xFF]` | ||
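As a minimal sketch, the equalities above can be checked in present-day Rust, where this syntax was eventually adopted. One difference to flag: in current Rust a byte string literal has type `&'static [u8; N]` (a reference to a fixed-size array) rather than the `&'static [u8]` proposed here, so the sketch relies on coercion to a slice. The function names are hypothetical and exist only for illustration.

```rust
/// Returns the three byte literals from the examples above as plain `u8` values.
fn byte_literal_examples() -> [u8; 3] {
    [b'A', b'\t', b'\xFF']
}

/// Returns the byte string literal from the examples as a slice.
/// In current Rust the literal itself has type `&'static [u8; 3]`;
/// the declared return type coerces it to `&'static [u8]`.
fn byte_string_example() -> &'static [u8] {
    b"A\t\xFF"
}
```

Comparing the results against `[65u8, 9, 0xFF]` confirms that `\xFF` stands for one byte with that value, not for a code point.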
|
||
Assuming `buffer` of type `&[u8]`: | ||
```rust | ||
match buffer[i] { | ||
b'a' .. b'z' => { /* ... */ } | ||
c => { /* ... */ } | ||
} | ||
``` | ||
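A fuller, hedged version of this match in present-day Rust might look like the following. It assumes the current spelling of inclusive range patterns (`..=` rather than the `..` used in this RFC), and the `ByteClass` enum and `classify` function are hypothetical names introduced only for this sketch.

```rust
/// Hypothetical classification of a single byte, for illustration only.
#[derive(Debug, PartialEq)]
enum ByteClass {
    Lower,
    Upper,
    Digit,
    Other,
}

/// Classifies `buffer[i]`, or returns `None` if `i` is out of bounds.
fn classify(buffer: &[u8], i: usize) -> Option<ByteClass> {
    // `get` avoids the panic that `buffer[i]` would raise for an invalid index.
    let byte = *buffer.get(i)?;
    Some(match byte {
        b'a'..=b'z' => ByteClass::Lower,
        b'A'..=b'Z' => ByteClass::Upper,
        b'0'..=b'9' => ByteClass::Digit,
        _ => ByteClass::Other,
    })
}
```

The patterns read as characters, yet every value involved stays a `u8`, which is the point of the proposal.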
|
||
|
||
# Alternatives | ||
|
||
Status quo: patterns must use numeric literals for ASCII values, | ||
or (for a single byte, not a byte string) cast to `char`: | ||
|
||
```rust | ||
match buffer[i] { | ||
c @ 0x61 .. 0x7A => { /* ... */ } | ||
c => { /* ... */ } | ||
} | ||
match buffer[i] as char { | ||
// `c` is of the wrong type! | ||
c @ 'a' .. 'z' => { /* ... */ } | ||
c => { /* ... */ } | ||
} | ||
``` | ||
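To make the type problem concrete, here is a hedged sketch of both status quo approaches in present-day Rust (which writes inclusive range patterns as `..=`). The numeric version keeps `c` as a `u8` but hides the intent behind hex values; the `as char` version reads well but binds a `char` that has to be cast back. Both function names are hypothetical.

```rust
/// Status quo with numeric literals: `c` stays a `u8`, but the reader has to
/// recognize 0x61 and 0x7A as 'a' and 'z'.
fn upper_numeric(byte: u8) -> u8 {
    match byte {
        c @ 0x61..=0x7A => c - 0x20,
        c => c,
    }
}

/// Status quo with a cast: the pattern is readable, but `c` is a `char`,
/// so it has to be cast back to `u8` before it can be returned.
fn upper_via_char(byte: u8) -> u8 {
    match byte as char {
        c @ 'a'..='z' => (c as u8) - 0x20,
        c => c as u8,
    }
}
```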
|
||
Another option is to change the syntax so that macros such as | ||
[`bytes!()`](http://static.rust-lang.org/doc/master/std/macros/builtin/macro.bytes.html) | ||
can be used in patterns, and add a `byte!()` macro: | ||
|
||
```rust | ||
match buffer[i] { | ||
c @ byte!('a') .. byte!('z') => { /* ... */ } | ||
c => { /* ... */ } | ||
} | ||
``` | ||
|
||
This RFC was written to align the syntax with Python, | ||
but there could be many variations such as using a different prefix (maybe `a` for ASCII), | ||
or using a suffix instead (maybe `u8`, as in integer literals). | ||
Review comment: If we don't go the macro route, I would prefer a suffix to a prefix to keep us in line with numeric literals. If we could reuse
Review comment: If it used `"foo"u8`, then `"foo"u16` could become a UCS-2 literal and `"foo"u32` would be a UCS-4 (UTF-32 with native endianness) literal. However, I'm not sure that we want these...
Review comment: I don’t think we do. I believe that most programs will manipulate Unicode (which in Rust happens to be represented as UTF-8, though that’s mostly hidden), and bytes. UCS-2 or UCS-4 would only ever be used because of very specific constraints (such as Servo interacting with SpiderMonkey) and do not warrant custom syntax built into the language. |
||
|
||
The code points written in a literal could be encoded as UTF-8 | ||
rather than being mapped to bytes of the same value, | ||
but assuming UTF-8 is not always appropriate when working with bytes. | ||
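The difference between the two mappings shows up as soon as a value is above U+007F. As a minimal sketch in present-day Rust, with hypothetical function names, the same-value mapping turns `\xE9` into one byte, while UTF-8 encoding of the code point U+00E9 produces two:

```rust
/// The mapping this RFC proposes: `\xE9` in a byte string is the single byte 0xE9.
fn same_value_mapping() -> &'static [u8] {
    b"\xE9"
}

/// The alternative: encode the code point U+00E9 as UTF-8, which takes two bytes.
fn utf8_mapping() -> &'static [u8] {
    "\u{E9}".as_bytes()
}
```

A Latin-1 protocol field, for instance, would expect the first result, which is why assuming UTF-8 is not always appropriate.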
|
||
See also previous discussion in [#4334](https://github.com/mozilla/rust/issues/4334). | ||
|
||
|
||
# Unresolved questions | ||
|
||
Should there be "raw byte string" literals? | ||
E.g. `pdf_file.write(rb"<< /Title (FizzBuzz \(Part one\)) >>")` | ||
Review comment: Python precedent is for allowing |
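For what it is worth, present-day Rust does have raw byte strings, spelled with a `br` prefix rather than the `rb` shown in the example above. A minimal sketch of the PDF example under that assumption follows; the function names are hypothetical.

```rust
/// Raw byte string: backslashes are kept literally, so `\(` stays two bytes.
fn pdf_title_raw() -> &'static [u8] {
    br"<< /Title (FizzBuzz \(Part one\)) >>"
}

/// The same bytes written as a non-raw byte string, where each backslash
/// has to be escaped as `\\`.
fn pdf_title_escaped() -> &'static [u8] {
    b"<< /Title (FizzBuzz \\(Part one\\)) >>"
}
```

Both functions return identical bytes; the raw form simply avoids doubling every backslash.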
||
|
||
Should control characters (U+0000 to U+001F) be disallowed in syntax? | ||
This should be consistent across all kinds of literals. | ||
|
||
Should the `bytes!()` macro be removed in favor of this? | ||
Review comment: Yes. |
Review comment: I think I prefer having a macro rather than a prefix for this, maybe `ascii!` rather than `byte`. I kind of want to make it ugly so it is clear this is the exceptional situation and you should be using UTF-8 unless you have a good reason not to. So, `ascii!('t')` for `b't'` and `ascii!("foo")` for `b"foo"`.

Review comment: I’d be OK with that, if macros can be used in patterns. I don’t know what this entails.

Review comment: Also, I believe that macros cannot both disallow unescaped non-ASCII code points and allow non-ASCII escapes like `\xFF`, unless the tokenizer cooperates. See rust-lang/rust#13955.

Review comment: I am strongly opposed to using a macro for this. There is sufficient pain in using byte types over string types in general that people will only use byte literals where appropriate. And where they are appropriate, using `ascii!('t')` all the time could easily be extremely painful. The `b` prefix makes it all nice and clear you’re dealing with byte characters or byte strings. Trust the user.