Merge pull request #1459 from mattheww/2024-01_input_format
Input format
ehuss authored Mar 6, 2024
2 parents 5440070 + 8ba3c49 commit 5afb503
Showing 4 changed files with 90 additions and 65 deletions.
7 changes: 4 additions & 3 deletions src/comments.md
@@ -30,7 +30,7 @@
>    | INNER_BLOCK_DOC
>
> _IsolatedCR_ :\
>    _A `\r` not followed by a `\n`_
>    \\r
## Non-doc comments

@@ -53,8 +53,9 @@ that follows. That is, they are equivalent to writing `#![doc="..."]` around
the body of the comment. `//!` comments are usually used to document
modules that occupy a source file.

Isolated CRs (`\r`), i.e. not followed by LF (`\n`), are not allowed in doc
comments.
The character `U+000D` (CR) is not allowed in doc comments.

> **Note**: The sequence `U+000D` (CR) immediately followed by `U+000A` (LF) would have been previously transformed into a single `U+000A` (LF).
## Examples

42 changes: 3 additions & 39 deletions src/crates-and-source-files.md
@@ -2,16 +2,9 @@

> **<sup>Syntax</sup>**\
> _Crate_ :\
> &nbsp;&nbsp; UTF8BOM<sup>?</sup>\
> &nbsp;&nbsp; SHEBANG<sup>?</sup>\
> &nbsp;&nbsp; [_InnerAttribute_]<sup>\*</sup>\
> &nbsp;&nbsp; [_Item_]<sup>\*</sup>
> **<sup>Lexer</sup>**\
> UTF8BOM : `\uFEFF`\
> SHEBANG : `#!` \~`\n`<sup>\+</sup>[](#shebang)

> Note: Although Rust, like any other language, can be implemented by an
> interpreter as well as a compiler, the only existing implementation is a
> compiler, and the language has always been designed to be compiled. For these
@@ -53,6 +46,8 @@ that apply to the containing module, most of which influence the behavior of
the compiler. The anonymous crate module can have additional attributes that
apply to the crate as a whole.

> **Note**: The file's contents may be preceded by a [shebang].
```rust
// Specify the crate name.
#![crate_name = "projx"]
@@ -65,34 +60,6 @@ apply to the crate as a whole.
#![warn(non_camel_case_types)]
```

## Byte order mark

The optional [_UTF8 byte order mark_] (UTF8BOM production) indicates that the
file is encoded in UTF8. It can only occur at the beginning of the file and
is ignored by the compiler.

## Shebang

A source file can have a [_shebang_] (SHEBANG production), which indicates
to the operating system what program to use to execute this file. It serves
essentially to treat the source file as an executable script. The shebang
can only occur at the beginning of the file (but after the optional
_UTF8BOM_). It is ignored by the compiler. For example:

<!-- ignore: tests don't like shebang -->
```rust,ignore
#!/usr/bin/env rustx
fn main() {
println!("Hello!");
}
```

A restriction is imposed on the shebang syntax to avoid confusion with an
[attribute]. The `#!` characters must not be followed by a `[` token, ignoring
intervening [comments] or [whitespace]. If this restriction fails, then it is
not treated as a shebang, but instead as the start of an attribute.

## Preludes and `no_std`

This section has been moved to the [Preludes chapter](names/preludes.md).
@@ -161,20 +128,17 @@ or `_` (U+005F) characters.
[_InnerAttribute_]: attributes.md
[_Item_]: items.md
[_MetaNameValueStr_]: attributes.md#meta-item-attribute-syntax
[_shebang_]: https://en.wikipedia.org/wiki/Shebang_(Unix)
[_utf8 byte order mark_]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
[`ExitCode`]: ../std/process/struct.ExitCode.html
[`Infallible`]: ../std/convert/enum.Infallible.html
[`Termination`]: ../std/process/trait.Termination.html
[attribute]: attributes.md
[attributes]: attributes.md
[comments]: comments.md
[function]: items/functions.md
[module]: items/modules.md
[module path]: paths.md
[shebang]: input-format.md#shebang-removal
[trait or lifetime bounds]: trait-bounds.md
[where clauses]: items/generics.md#where-clauses
[whitespace]: whitespace.md

<script>
(function() {
54 changes: 53 additions & 1 deletion src/input-format.md
@@ -1,3 +1,55 @@
# Input format

Rust input is interpreted as a sequence of Unicode code points encoded in UTF-8.
This chapter describes how a source file is interpreted as a sequence of tokens.

See [Crates and source files] for a description of how programs are organised into files.

## Source encoding

Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8.
It is an error if the file is not valid UTF-8.
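
As a rough illustration of the encoding check, the sketch below decodes raw file bytes and rejects input that is not valid UTF-8. The function name `decode_source` is hypothetical and is not part of any compiler API; it only shows the rule with the standard library.

```rust
use std::string::FromUtf8Error;

/// Hypothetical helper: interpret raw file bytes as source text.
/// Returns an error if the bytes are not valid UTF-8.
fn decode_source(bytes: Vec<u8>) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes)
}
```

With this sketch, the bytes of `fn` decode successfully, while a lone `0xFF` byte is rejected.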

## Byte order mark removal

If the first character in the sequence is `U+FEFF` ([BYTE ORDER MARK]), it is removed.
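
A minimal sketch of this step, assuming the input has already been decoded to a `&str`. The name `strip_bom` is illustrative only.

```rust
/// Hypothetical helper: drop a single leading U+FEFF, if present.
/// Any later U+FEFF characters are left untouched.
fn strip_bom(src: &str) -> &str {
    src.strip_prefix('\u{FEFF}').unwrap_or(src)
}
```

Only the first character is considered, so a second `U+FEFF` immediately after the first would remain in the sequence.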

## CRLF normalization

Each pair of characters `U+000D` (CR) immediately followed by `U+000A` (LF) is replaced by a single `U+000A` (LF).

Other occurrences of the character `U+000D` (CR) are left in place (they are treated as [whitespace]).
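
The normalization rule can be sketched with a plain string replacement. The name `normalize_crlf` is hypothetical; a real implementation may well work differently.

```rust
/// Hypothetical helper: replace each CR LF pair with a single LF.
/// A CR that is not immediately followed by LF is left in place.
fn normalize_crlf(src: &str) -> String {
    src.replace("\r\n", "\n")
}
```

For instance, `"a\r\nb\rc\n"` becomes `"a\nb\rc\n"`: the pair is collapsed, while the lone CR between `b` and `c` survives.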

## Shebang removal

If the remaining sequence begins with the characters `#!`, the characters up to and including the first `U+000A` (LF) are removed from the sequence.

For example, the first line of the following file would be ignored:

<!-- ignore: tests don't like shebang -->
```rust,ignore
#!/usr/bin/env rustx
fn main() {
println!("Hello!");
}
```

As an exception, if the `#!` characters are followed (ignoring intervening [comments] or [whitespace]) by a `[` token, nothing is removed.
This prevents an [inner attribute] at the start of a source file from being removed.
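
The rule and its exception might be sketched as follows. This is a simplified illustration, not the compiler's implementation: `strip_shebang` and `skip_trivia` are hypothetical names, whitespace is approximated with `str::trim_start`, every `//` and `/* */` comment is treated as skippable, and a shebang line with no terminating LF is assumed to consume the whole input.

```rust
/// Hypothetical helper: skip leading whitespace and comments.
fn skip_trivia(mut s: &str) -> &str {
    loop {
        let trimmed = s.trim_start();
        if let Some(rest) = trimmed.strip_prefix("//") {
            // Line comment: skip to the end of the line (or of the input).
            s = match rest.find('\n') {
                Some(i) => &rest[i + 1..],
                None => "",
            };
        } else if let Some(rest) = trimmed.strip_prefix("/*") {
            // Block comments nest in Rust, so track the nesting depth.
            let bytes = rest.as_bytes();
            let mut depth = 1usize;
            let mut i = 0;
            while depth > 0 && i < bytes.len() {
                if bytes[i..].starts_with(b"/*") {
                    depth += 1;
                    i += 2;
                } else if bytes[i..].starts_with(b"*/") {
                    depth -= 1;
                    i += 2;
                } else {
                    i += 1;
                }
            }
            // `i` is either just past an ASCII `*/` or at the end of input,
            // so it is always a valid char boundary.
            s = &rest[i..];
        } else {
            return trimmed;
        }
    }
}

/// Hypothetical helper: remove a leading shebang line, unless the `#!`
/// turns out to be the start of an inner attribute.
fn strip_shebang(src: &str) -> &str {
    let after = match src.strip_prefix("#!") {
        Some(after) => after,
        None => return src,
    };
    // Exception: `#!` followed by `[` (ignoring comments and whitespace).
    if skip_trivia(after).starts_with('[') {
        return src;
    }
    // Remove everything up to and including the first LF.
    match src.find('\n') {
        Some(i) => &src[i + 1..],
        None => "", // assumption: no LF means the whole input is the shebang
    }
}
```

Under these assumptions, `#!/usr/bin/env rustx` on the first line is dropped, while a file starting with `#![allow(unused)]` is returned unchanged.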

> **Note**: The standard library [`include!`] macro applies byte order mark removal, CRLF normalization, and shebang removal to the file it reads. The [`include_str!`] and [`include_bytes!`] macros do not.
## Tokenization

The resulting sequence of characters is then converted into tokens as described in the remainder of this chapter.


[`include!`]: ../std/macro.include.md
[`include_bytes!`]: ../std/macro.include_bytes.md
[`include_str!`]: ../std/macro.include_str.md
[inner attribute]: attributes.md
[BYTE ORDER MARK]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
[comments]: comments.md
[Crates and source files]: crates-and-source-files.md
[_shebang_]: https://en.wikipedia.org/wiki/Shebang_(Unix)
[whitespace]: whitespace.md
52 changes: 30 additions & 22 deletions src/tokens.md
@@ -37,6 +37,8 @@ Literals are tokens used in [literal expressions].

[^nsets]: The number of `#`s on each side of the same literal must be equivalent.

> **Note**: Character and string literal tokens never include the sequence of `U+000D` (CR) immediately followed by `U+000A` (LF): this pair would have been previously transformed into a single `U+000A` (LF).
#### ASCII escapes

| | Name |
@@ -156,13 +158,10 @@ A _string literal_ is a sequence of any Unicode characters enclosed within two
`U+0022` (double-quote) characters, with the exception of `U+0022` itself,
which must be _escaped_ by a preceding `U+005C` character (`\`).

Line-breaks are allowed in string literals.
A line-break is either a newline (`U+000A`) or a pair of carriage return and newline (`U+000D`, `U+000A`).
Both byte sequences are translated to `U+000A`.

Line-breaks, represented by the character `U+000A` (LF), are allowed in string literals.
When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
See [String continuation escapes] for details.

The character `U+000D` (CR) may not appear in a string literal other than as part of such a string continuation escape.
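
A short illustration of the two cases, using hypothetical function names. The first literal contains a real line break; the second uses a string continuation escape, which removes the line break and the whitespace that follows it.

```rust
/// A literal line break inside a string is kept as U+000A (LF).
fn multiline() -> &'static str {
    "foo
bar"
}

/// A backslash immediately before the line break starts a string
/// continuation: the break and the following whitespace are skipped.
fn continued() -> &'static str {
    "foo\
     bar"
}
```

Here `multiline()` yields `"foo\nbar"` and `continued()` yields `"foobar"`.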

#### Character escapes

@@ -198,10 +197,10 @@ following forms:
Raw string literals do not process any escapes. They start with the character
`U+0072` (`r`), followed by fewer than 256 of the character `U+0023` (`#`) and a
`U+0022` (double-quote) character. The _raw string body_ can contain any sequence
of Unicode characters and is terminated only by another `U+0022` (double-quote)
character, followed by the same number of `U+0023` (`#`) characters that preceded
the opening `U+0022` (double-quote) character.
`U+0022` (double-quote) character.

The _raw string body_ can contain any sequence of Unicode characters other than `U+000D` (CR).
It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.
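
For illustration, the sketch below returns two raw string literals next to their escaped equivalents. The function name is hypothetical.

```rust
/// Returns (raw literal, equivalent non-raw literal) pairs.
fn raw_examples() -> [(&'static str, &'static str); 2] {
    [
        // Backslashes are not escapes in a raw string.
        (r"C:\temp\new", "C:\\temp\\new"),
        // With one `#`, a bare `"` does not end the body; only `"#` does.
        (r#"say "hi""#, "say \"hi\""),
    ]
}
```

Each pair compares equal, which shows that the raw body is taken character for character.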

All Unicode characters contained in the raw string body represent themselves,
the characters `U+0022` (double-quote) (except when followed by at least as
@@ -259,6 +258,11 @@ the literal, it must be _escaped_ by a preceding `U+005C` (`\`) character.
Alternatively, a byte string literal can be a _raw byte string literal_, defined
below.

Line-breaks, represented by the character `U+000A` (LF), are allowed in byte string literals.
When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
See [String continuation escapes] for details.
The character `U+000D` (CR) may not appear in a byte string literal other than as part of such a string continuation escape.

Some additional _escapes_ are available in either byte or non-raw byte string
literals. An escape starts with a `U+005C` (`\`) and continues with one of the
following forms:
@@ -281,19 +285,19 @@ following forms:
> &nbsp;&nbsp; `br` RAW_BYTE_STRING_CONTENT SUFFIX<sup>?</sup>
>
> RAW_BYTE_STRING_CONTENT :\
> &nbsp;&nbsp; &nbsp;&nbsp; `"` ASCII<sup>* (non-greedy)</sup> `"`\
> &nbsp;&nbsp; &nbsp;&nbsp; `"` ASCII_FOR_RAW<sup>* (non-greedy)</sup> `"`\
> &nbsp;&nbsp; | `#` RAW_BYTE_STRING_CONTENT `#`
>
> ASCII :\
> &nbsp;&nbsp; _any ASCII (i.e. 0x00 to 0x7F)_
> ASCII_FOR_RAW :\
> &nbsp;&nbsp; _any ASCII (i.e. 0x00 to 0x7F) except IsolatedCR_
Raw byte string literals do not process any escapes. They start with the
character `U+0062` (`b`), followed by `U+0072` (`r`), followed by fewer than 256
of the character `U+0023` (`#`), and a `U+0022` (double-quote) character. The
_raw string body_ can contain any sequence of ASCII characters and is terminated
only by another `U+0022` (double-quote) character, followed by the same number of
`U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote)
character. A raw byte string literal can not contain any non-ASCII byte.
of the character `U+0023` (`#`), and a `U+0022` (double-quote) character.

The _raw string body_ can contain any sequence of ASCII characters other than `U+000D` (CR).
It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.
A raw byte string literal cannot contain any non-ASCII byte.
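
As a small illustration with hypothetical function names, the same characters produce different bytes in a raw and a non-raw byte string literal, because only the non-raw form processes the `\n` escape.

```rust
/// Raw byte string: `\` and `n` are two separate bytes.
fn raw_bytes() -> &'static [u8] {
    br#"a\n"#
}

/// Non-raw byte string: `\n` is a single LF byte.
fn escaped_bytes() -> &'static [u8] {
    b"a\n"
}
```

The raw literal has three bytes (`a`, `\`, `n`), while the escaped literal has two (`a`, LF).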

All characters contained in the raw string body represent their ASCII encoding,
the characters `U+0022` (double-quote) (except when followed by at least as
@@ -339,6 +343,11 @@ C strings are implicitly terminated by byte `0x00`, so the C string literal
literal `b"\x00"`. Other than the implicit terminator, byte `0x00` is not
permitted within a C string.

Line-breaks, represented by the character `U+000A` (LF), are allowed in C string literals.
When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
See [String continuation escapes] for details.
The character `U+000D` (CR) may not appear in a C string literal other than as part of such a string continuation escape.

Some additional _escapes_ are available in non-raw C string literals. An escape
starts with a `U+005C` (`\`) and continues with one of the following forms:

@@ -381,11 +390,10 @@ c"\xC3\xA6";
Raw C string literals do not process any escapes. They start with the
character `U+0063` (`c`), followed by `U+0072` (`r`), followed by fewer than 256
of the character `U+0023` (`#`), and a `U+0022` (double-quote) character. The
_raw C string body_ can contain any sequence of Unicode characters (other than
`U+0000`) and is terminated only by another `U+0022` (double-quote) character,
followed by the same number of `U+0023` (`#`) characters that preceded the
opening `U+0022` (double-quote) character.
of the character `U+0023` (`#`), and a `U+0022` (double-quote) character.

The _raw C string body_ can contain any sequence of Unicode characters other than `U+0000` (NUL) and `U+000D` (CR).
It is terminated only by another `U+0022` (double-quote) character, followed by the same number of `U+0023` (`#`) characters that preceded the opening `U+0022` (double-quote) character.

All characters contained in the raw C string body represent themselves in UTF-8
encoding. The characters `U+0022` (double-quote) (except when followed by at
