Update the-parser.md #1933
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,68 +1,74 @@ | ||
# Lexing and Parsing | ||
|
||
The very first thing the compiler does is take the program (in Unicode | ||
characters) and turn it into something the compiler can work with more | ||
conveniently than strings. This happens in two stages: Lexing and Parsing. | ||
The very first thing the compiler does is take the program (in Unicode) and | ||
transmute it into a data format the compiler can work with more conveniently | ||
than strings. This happens in two stages: Lexing and Parsing. | ||
|
||
Lexing takes strings and turns them into streams of [tokens]. For example, | ||
`foo.bar + buz` would be turned into the tokens `foo`, `.`, | ||
`bar`, `+`, and `buz`. The lexer lives in [`rustc_lexer`][lexer]. | ||
1. _Lexing_ takes strings and turns them into streams of [tokens]. For | ||
example, `foo.bar + buz` would be turned into the tokens `foo`, `.`, `bar`, | ||
`+`, and `buz`. | ||
|
||
[tokens]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/token/index.html | ||
[lexer]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_lexer/index.html | ||
|
||
Parsing then takes streams of tokens and turns them into a structured | ||
form which is easier for the compiler to work with, usually called an [*Abstract | ||
Syntax Tree*][ast] (AST). An AST mirrors the structure of a Rust program in memory, | ||
using a `Span` to link a particular AST node back to its source text. | ||
2. _Parsing_ takes streams of tokens and turns them into a structured form | ||
which is easier for the compiler to work with, usually called an [*Abstract | ||
Syntax Tree* (`AST`)][ast] . | ||
|
||
|
||
An `AST` mirrors the structure of a Rust program in memory, using a `Span` to | ||
link a particular `AST` node back to its source text. The `AST` is defined in | ||
[`rustc_ast`][rustc_ast], along with some definitions for tokens and token | ||
streams, data structures/`trait`s for mutating `AST`s, and shared definitions for | ||
other `AST`-related parts of the compiler (like the lexer and | ||
`macro`-expansion). | ||
|
||
The AST is defined in [`rustc_ast`][rustc_ast], along with some definitions for | ||
tokens and token streams, data structures/traits for mutating ASTs, and shared | ||
definitions for other AST-related parts of the compiler (like the lexer and | ||
macro-expansion). | ||
The lexer is developed in [`rustc_lexer`][lexer]. | ||
|
||
The parser is defined in [`rustc_parse`][rustc_parse], along with a | ||
high-level interface to the lexer and some validation routines that run after | ||
macro expansion. In particular, the [`rustc_parse::parser`][parser] contains | ||
`macro` expansion. In particular, the [`rustc_parse::parser`][parser] contains | ||
the parser implementation. | ||
|
||
The main entrypoint to the parser is via the various `parse_*` functions and others in the | ||
[parser crate][parser_lib]. They let you do things like turn a [`SourceFile`][sourcefile] | ||
The main entrypoint to the parser is via the various `parse_*` functions and others in | ||
[rustc_parse][rustc_parse]. They let you do things like turn a [`SourceFile`][sourcefile] | ||
(e.g. the source in a single file) into a token stream, create a parser from | ||
the token stream, and then execute the parser to get a `Crate` (the root AST | ||
the token stream, and then execute the parser to get a [`Crate`] (the root `AST` | ||
node). | ||
|
||
To minimize the amount of copying that is done, | ||
both [`StringReader`] and [`Parser`] have lifetimes which bind them to the parent `ParseSess`. | ||
This contains all the information needed while parsing, | ||
as well as the [`SourceMap`] itself. | ||
To minimize the amount of copying that is done, both [`StringReader`] and | ||
[`Parser`] have lifetimes which bind them to the parent [`ParseSess`]. This | ||
contains all the information needed while parsing, as well as the [`SourceMap`] | ||
itself. | ||
|
||
Note that while parsing, we may encounter macro definitions or invocations. We | ||
set these aside to be expanded (see [this chapter](./macro-expansion.md)). | ||
Expansion may itself require parsing the output of the macro, which may reveal | ||
more macros to be expanded, and so on. | ||
Note that while parsing, we may encounter `macro` definitions or invocations. We | ||
set these aside to be expanded (see [Macro Expansion](./macro-expansion.md)). | ||
Expansion itself may require parsing the output of a `macro`, which may reveal | ||
more `macro`s to be expanded, and so on. | ||
|
||
## More on Lexical Analysis | ||
|
||
Code for lexical analysis is split between two crates: | ||
|
||
- `rustc_lexer` crate is responsible for breaking a `&str` into chunks | ||
- [`rustc_lexer`] crate is responsible for breaking a `&str` into chunks | ||
constituting tokens. Although it is popular to implement lexers as generated | ||
finite state machines, the lexer in `rustc_lexer` is hand-written. | ||
finite state machines, the lexer in [`rustc_lexer`] is hand-written. | ||
|
||
- [`StringReader`] integrates `rustc_lexer` with data structures specific to `rustc`. | ||
Specifically, | ||
it adds `Span` information to tokens returned by `rustc_lexer` and interns identifiers. | ||
- [`StringReader`] integrates [`rustc_lexer`] with data structures specific to | ||
`rustc`. Specifically, it adds `Span` information to tokens returned by | ||
[`rustc_lexer`] and interns identifiers. | ||
|
||
[rustc_ast]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/index.html | ||
[rustc_errors]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_errors/index.html | ||
[ast]: https://en.wikipedia.org/wiki/Abstract_syntax_tree | ||
[`Crate`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/ast/struct.Crate.html | ||
[`Parser`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/parser/struct.Parser.html | ||
[`ParseSess`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_session/parse/struct.ParseSess.html | ||
[`rustc_lexer`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_lexer/index.html | ||
[`SourceMap`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_span/source_map/struct.SourceMap.html | ||
[`StringReader`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/lexer/struct.StringReader.html | ||
[ast module]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/ast/index.html | ||
[rustc_parse]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html | ||
[parser_lib]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html | ||
[ast]: ./ast-validation.md | ||
[parser]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/parser/index.html | ||
[`Parser`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/parser/struct.Parser.html | ||
[`StringReader`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/lexer/struct.StringReader.html | ||
[visit module]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/visit/index.html | ||
[rustc_ast]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/index.html | ||
[rustc_errors]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_errors/index.html | ||
[rustc_parse]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_parse/index.html | ||
[sourcefile]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_span/struct.SourceFile.html | ||
[visit module]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/visit/index.html |
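
As an illustration of the lexing step described in the diff above, here is a minimal sketch of a hand-written lexer that splits `foo.bar + buz` into tokens and records a byte range for each. The names `TokenKind`, `Span`, `Token`, `tokenize`, and `significant_texts` are hypothetical and defined only for this example; they are not the actual `rustc_lexer` or `StringReader` API. For brevity the sketch tracks positions inside the lexer itself, whereas the text says `StringReader` is what adds `Span` information.

```rust
/// Hypothetical token kinds for this sketch (not the real `rustc_lexer` types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    Ident,
    Dot,
    Plus,
    Whitespace,
    Unknown,
}

/// Byte range into the source text, standing in for rustc's `Span`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Span {
    lo: usize,
    hi: usize,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Token {
    kind: TokenKind,
    span: Span,
}

/// Hand-written lexer: walk the `&str` once and emit one token per chunk.
fn tokenize(src: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = src.char_indices().peekable();
    while let Some((lo, c)) = chars.next() {
        let kind = if c.is_alphabetic() || c == '_' {
            // Consume the rest of the identifier.
            while let Some(&(_, next)) = chars.peek() {
                if next.is_alphanumeric() || next == '_' {
                    chars.next();
                } else {
                    break;
                }
            }
            TokenKind::Ident
        } else if c.is_whitespace() {
            // Collapse a run of whitespace into a single token.
            while let Some(&(_, next)) = chars.peek() {
                if next.is_whitespace() {
                    chars.next();
                } else {
                    break;
                }
            }
            TokenKind::Whitespace
        } else {
            match c {
                '.' => TokenKind::Dot,
                '+' => TokenKind::Plus,
                _ => TokenKind::Unknown,
            }
        };
        // The token ends where the next unread character begins.
        let hi = chars.peek().map_or(src.len(), |&(i, _)| i);
        tokens.push(Token { kind, span: Span { lo, hi } });
    }
    tokens
}

/// Source text of each non-whitespace token, recovered through its span.
fn significant_texts(src: &str) -> Vec<&str> {
    tokenize(src)
        .into_iter()
        .filter(|t| t.kind != TokenKind::Whitespace)
        .map(|t| &src[t.span.lo..t.span.hi])
        .collect()
}

fn main() {
    let src = "foo.bar + buz";
    for t in tokenize(src) {
        println!("{:?} {:?}", t.kind, &src[t.span.lo..t.span.hi]);
    }
}
```

Keeping only a byte range per token means the lexer never copies source text; any later stage can slice the original string through the span, which is the same idea as linking an `AST` node back to its source.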
Please don't convert nice semantic line breaks into hard line breaks like this; it makes the text harder to read and harder to diff.