RFC: Add byte and byte string literals #69

Merged
merged 3 commits into from
Jun 4, 2014
111 changes: 111 additions & 0 deletions active/0000-ascii-literals.md
@@ -0,0 +1,111 @@
- Start Date: 2014-05-05
- RFC PR #:
- Rust Issue #:

# Summary

Add ASCII byte literals and ASCII byte string literals to the language,
similar to the existing (Unicode) character and string literals.
Before the RFC process was in place,
this was discussed in [#4334](https://github.com/mozilla/rust/issues/4334).


# Motivation

Programs dealing with text usually should use Unicode,
represented in Rust by the `str` and `char` types.
In some cases however,
a program may be dealing with bytes that cannot be interpreted as Unicode as a whole,
but that still contain ASCII-compatible parts.

For example, the HTTP protocol was originally defined as Latin-1,
but in practice different pieces of the same request or response
can use different encodings.
The PDF file format is mostly ASCII,
but can contain UTF-16 strings and raw binary data.

There is a precedent at least in Python, which has both Unicode and byte strings.


# Drawbacks

The language becomes slightly more complex,
although that complexity should be limited to the parser.


# Detailed design

Using terminology from [the Reference Manual](http://static.rust-lang.org/doc/master/rust.html#character-and-string-literals):

Extend the syntax of expressions and patterns to add
byte literals of type `u8` and
byte string literals of type `&'static [u8]` (or `[u8]`, post-DST).
They are identical to the existing character and string literals, except that:

* They are prefixed with a `b` (for "binary"), to distinguish them.
This is similar to the `r` prefix for raw strings.
* Unescaped code points in the body must be in the ASCII range: U+0000 to U+007F.
* `'\x5c' 'u' hex_digit 4` and `'\x5c' 'U' hex_digit 8` escapes are not allowed.
* `'\x5c' 'x' hex_digit 2` escapes represent a single byte rather than a code point.
(They are the only way to express a non-ASCII byte.)

Examples: `b'A' == 65u8`, `b'\t' == 9u8`, `b'\xFF' == 0xFFu8`,
`b"A\t\xFF" == [65u8, 9, 0xFF]`
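The examples above can be checked with a minimal sketch in current Rust, which is an assumption relative to this 2014 text. The function names `example_bytes` and `example_byte_string` are hypothetical and exist only for illustration. Note that in current Rust a byte string literal has the array-reference type `&'static [u8; N]`, which coerces to the slice type discussed here.

```rust
/// Hypothetical helper: the three byte literals used as examples in the RFC.
fn example_bytes() -> (u8, u8, u8) {
    (b'A', b'\t', b'\xFF')
}

/// Hypothetical helper: the example byte string, returned as a slice.
fn example_byte_string() -> &'static [u8] {
    // In current Rust the literal is a `&'static [u8; 3]`;
    // it coerces to `&'static [u8]` at the return site.
    b"A\t\xFF"
}

fn main() {
    // Each byte literal is a plain `u8` value.
    assert_eq!(example_bytes(), (65u8, 9u8, 0xFFu8));
    // The byte string holds exactly those three bytes, with no UTF-8 encoding step.
    assert_eq!(example_byte_string(), &[65u8, 9, 0xFF][..]);
    println!("{:?}", example_byte_string());
}
```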

Assuming `buffer` of type `&[u8]`:
```rust
match buffer[i] {
    b'a' .. b'z' => { /* ... */ }
    c => { /* ... */ }
}
```
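For a runnable variant of the pattern above, here is a sketch written against current Rust, where inclusive range patterns are spelled `..=` rather than the `..` used in this text. The helpers `classify` and `count_lowercase` are hypothetical names introduced for illustration only.

```rust
/// Hypothetical classifier: matches a byte against ASCII ranges
/// written as byte literals instead of numeric constants.
fn classify(byte: u8) -> &'static str {
    match byte {
        b'a'..=b'z' => "lowercase",
        b'A'..=b'Z' => "uppercase",
        b'0'..=b'9' => "digit",
        _ => "other",
    }
}

/// Hypothetical helper: counts the ASCII lowercase letters in a byte buffer.
fn count_lowercase(buffer: &[u8]) -> usize {
    buffer.iter().filter(|&&b| classify(b) == "lowercase").count()
}

fn main() {
    // The buffer ends with a non-ASCII byte, so it is not valid UTF-8,
    // but its ASCII parts can still be matched byte by byte.
    let buffer: &[u8] = b"Hi there\xFF";
    println!("{}", classify(buffer[0]));
    println!("{}", count_lowercase(buffer));
}
```

The bound variable stays a `u8` throughout, so no cast to `char` is needed.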


# Alternatives

Status quo: patterns must use numeric literals for ASCII values,
or (for a single byte, not a byte string) cast to `char`:

```rust
match buffer[i] {
    c @ 0x61 .. 0x7A => { /* ... */ }
    c => { /* ... */ }
}
match buffer[i] as char {
    // `c` is of the wrong type!
    c @ 'a' .. 'z' => { /* ... */ }
    c => { /* ... */ }
}
```

Another option is to change the syntax so that macros such as
[`bytes!()`](http://static.rust-lang.org/doc/master/std/macros/builtin/macro.bytes.html)
can be used in patterns, and add a `byte!()` macro:
Member

I think I prefer having a macro rather than a prefix for this, maybe ascii! rather than byte. I kind of want to make it ugly so it is clear this is the exceptional situation and you should be using utf8 unless you have a good reason not to.

So, ascii!('t') for b't' and ascii!("foo") for b"foo".

Contributor Author

I’d be ok with that, if macros can be used in patterns. I don’t know what this entails.

Contributor Author

Also, I believe that macros can not both disallow unescaped non-ASCII code points, and allow non-ASCII escapes like \xFF, unless the tokenizer cooperates. See rust-lang/rust#13955

Member

I am strongly opposed to using a macro for this. There is sufficient pain in using byte types over string types in general that people will only use byte literals where appropriate. And where they are appropriate, using ascii!('t') all the time could easily be extremely painful.

The b prefix makes it all nice and clear you’re dealing with byte characters or byte strings. Trust the user.


```rust
match buffer[i] {
    c @ byte!('a') .. byte!('z') => { /* ... */ }
    c => { /* ... */ }
}
```

This RFC was written to align the syntax with Python,
but there could be many variations such as using a different prefix (maybe `a` for ASCII),
or using a suffix instead (maybe `u8`, as in integer literals).
Member

If we don't go the macro route, I would prefer a suffix to a prefix to keep us in line with numeric literals. If we could reuse u8, that would be great. So 't'u8 and "foo"u8.

Contributor Author

't'u8 is OK but I think "foo"u8 doesn’t really work, since [u8] is not especially "more 8-bit" than the UTF-8 based str.


If it used "foo"u8, then "foo"u16 could become a UCS2 literal and "foo"u32 would be a UCS4 (UTF-32 with native endian) literal. However, I'm not sure that we want these...

Contributor Author

I don’t think we do. I believe that most programs will manipulate Unicode (which in Rust happens to be represented as UTF-8, though that’s mostly hidden), and bytes. UCS-2 or UCS-4 would only ever be used because of very specific constraints (such as Servo interacting with SpiderMonkey) and do not warrant custom syntax built into the language.


The code points written in the literal's source text could be encoded as UTF-8
rather than being mapped to bytes of the same value,
but assuming UTF-8 is not always appropriate when working with bytes.
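The difference between the two mappings can be seen in a small sketch, assuming current Rust. It compares the UTF-8 encoding of U+00E9 in an ordinary string with a byte string holding the single byte of the same value. The function names `utf8_encoding` and `single_byte` are hypothetical.

```rust
/// Hypothetical helper: U+00E9 as stored in a `str`, that is, encoded as UTF-8.
fn utf8_encoding() -> &'static [u8] {
    "\u{E9}".as_bytes()
}

/// Hypothetical helper: the single byte 0xE9, written with a `\x` escape
/// in a byte string literal.
fn single_byte() -> &'static [u8] {
    b"\xE9"
}

fn main() {
    // UTF-8 needs two bytes for U+00E9.
    assert_eq!(utf8_encoding(), &[0xC3u8, 0xA9][..]);
    // The byte string maps the escape to one byte of the same value,
    // which is what a Latin-1 peer would expect on the wire.
    assert_eq!(single_byte(), &[0xE9u8][..]);
    println!("{:?} vs {:?}", utf8_encoding(), single_byte());
}
```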

See also previous discussion in [#4334](https://github.com/mozilla/rust/issues/4334).


# Unresolved questions

Should there be "raw byte string" literals?
E.g. `pdf_file.write(rb"<< /Title (FizzBuzz \(Part one\)) >>")`
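As a sketch of what such a literal looks like in practice: current Rust, which postdates this text, spells raw byte strings with a `br` prefix rather than the `rb` shown here. The function name `pdf_title` is hypothetical, and the PDF fragment is the one from the example above.

```rust
/// Hypothetical helper: a PDF dictionary fragment written as a raw byte string.
/// With the `br` prefix, backslashes are kept literally and need no doubling.
fn pdf_title() -> &'static [u8] {
    br"<< /Title (FizzBuzz \(Part one\)) >>"
}

fn main() {
    // The same bytes written as a non-raw byte string need `\\` for each backslash.
    let escaped = b"<< /Title (FizzBuzz \\(Part one\\)) >>";
    assert_eq!(pdf_title(), &escaped[..]);
    println!("{}", pdf_title().len());
}
```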
Member

Python precedent is for allowing br and forbidding rb (syntax error). Also: yes.


Should control characters (U+0000 to U+001F) be disallowed in syntax?
This should be consistent across all kinds of literals.

Should the `bytes!()` macro be removed in favor of this?
Member

Yes.