Replace LALRPOP parser with hand-written parser #10036

dhruvmanila · 2024-02-19T04:13:35Z

(Supersedes #9152, authored by @LaBatata101)

Summary

This PR replaces the current parser generated from LALRPOP to a hand-written recursive descent parser.

It also updates the grammar for PEP 646 so that the parser outputs the correct AST. For example, in data[*x], the index expression is now a tuple with a single starred expression instead of just a starred expression.

Beyond the performance improvements, the parser is also error resilient and can provide better error messages. The behavior as seen by any downstream tools isn't changed. That is, the linter and formatter can still assume that the parser will stop at the first syntax error. This will be updated in the following months.

For more details about the change here, refer to the PR corresponding to the individual commits and the release blog post.

Test Plan

Write lots and lots of tests for both valid and invalid syntax and verify the output.

Acknowledgements

@MichaReiser for reviewing 100+ parser PRs and continuously providing guidance throughout the project
@LaBatata101 for initiating the transition to a hand-written parser in Replace autogenerated parser with hand-written parser #9152
@addisoncrump for implementing the fuzzer which helped catch a lot of bugs

codspeed-hq · 2024-03-07T10:42:55Z

CodSpeed Performance Report

Merging #10036 will improve performances by ×2.4

_{Comparing dhruv/parser (98a7669) with main (e09180b)}

Summary

⚡ 7 improvements
✅ 23 untouched benchmarks

Benchmarks breakdown

	Benchmark	`main`	`dhruv/parser`	Change
⚡	`linter/all-rules[unicode/pypinyin.py]`	9.6 ms	8.6 ms	+11.63%
⚡	`linter/all-with-preview-rules[unicode/pypinyin.py]`	11.1 ms	10.1 ms	+9.95%
⚡	`parser[large/dataset.py]`	63.6 ms	26.6 ms	×2.4
⚡	`parser[numpy/ctypeslib.py]`	10.8 ms	5 ms	×2.2
⚡	`parser[numpy/globals.py]`	964.6 µs	424.9 µs	×2.3
⚡	`parser[pydantic/types.py]`	24.4 ms	10.9 ms	×2.2
⚡	`parser[unicode/pypinyin.py]`	3.8 ms	1.7 ms	×2.2

github-actions · 2024-03-12T12:52:56Z

`ruff-ecosystem` results

Linter (stable)

ℹ️ ecosystem check detected linter changes. (+1 -1 violations, +0 -0 fixes in 1 projects; 43 projects unchanged)

demisto/content (+1 -1 violations, +0 -0 fixes)

+ Packs/ThreatQ/Integrations/ThreatQ/ThreatQ.py:106:12: E999 SyntaxError: Multiple exception types must be parenthesized
- Packs/ThreatQ/Integrations/ThreatQ/ThreatQ.py:106:21: E999 SyntaxError: Unexpected token ','

Changes by rule (1 rules affected)

code	total	+ violation	- violation	+ fix	- fix
E999	2	1	1	0	0

Linter (preview)

ℹ️ ecosystem check detected linter changes. (+1 -1 violations, +0 -0 fixes in 1 projects; 43 projects unchanged)

demisto/content (+1 -1 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview

+ Packs/ThreatQ/Integrations/ThreatQ/ThreatQ.py:106:12: E999 SyntaxError: Multiple exception types must be parenthesized
- Packs/ThreatQ/Integrations/ThreatQ/ThreatQ.py:106:21: E999 SyntaxError: Unexpected token ','

Changes by rule (1 rules affected)

code	total	+ violation	- violation	+ fix	- fix
E999	2	1	1	0	0

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

## Summary This PR reviews the logic for IPython escape command parsing and makes the required changes to almost match the old parsing logic. ~One change I've not incorporated is the `unparse_expr`. The new parser can use `src_text` directly but this means that any whitespace is preserved~ Migrated the `unparse_expr` logic ## Test plan Use the new parser for testing escape commands and verify that there are no snapshot changes.

## Summary This PR updates the `ruff_new_parser_equiv` fuzzer target to reject the corpus which produces an empty range for the AST from LALRPOP. If there are no nodes, then LALRPOP produces an empty range while the new parser uses the correct range of the source itself.

Co-authored-by: Micha Reiser <[email protected]>

## Summary This PR updates the LALRPOP to use the complete source range for module by looking at the tokens before the parsing begins. This means the module range now corresponds to the actual source even if it only contains trivia tokens. This is mainly so that the fuzzer doesn't crash because the AST doesn't match.

> `simple` is a boolean integer set to `True` for a [Name](https://docs.python.org/3/library/ast.html#ast.Name) node in `target` that do not appear in between parenthesis and are hence pure names and not expressions. Reference: https://docs.python.org/3/library/ast.html#ast.AnnAssign

## Summary Dotted names aren't allowed in `from ... import` statement. They're only allowed in `import` statement. ### Alternative Another solution would be to parse the dotted name, check if there's a `.` in the parsed string and raise an error. I choose not to do this because it didn't make sense to do `contains` for every import name. ## Test Plan Add invalid syntax test cases to verify the logic.

## Summary This PR fixes a bug in parenthesized with-items parsing where the parser should expect a comma but it didn't. This is the case in the speculative parsing loop of parenthesized with-items. Once the parser finds any token other than comma, it breaks out of the loop. Then, once the parser has determined the with item kind, it needs to expect a comma if it's not the end of parenthesized with-items. ## Test Plan Add inline test cases, verify the snapshots.

## Summary This PR avoids the panic when trying to recover from an invalid pattern. The panic occurs because there's no way to represent `x as y` syntax as a Python expression. There were few cases which were not accounted for when working on this. The cases where it would panic are the ones where only a subset of pattern is valid in a position but all patterns are valid in a nested position of the mentioned top-level pattern. For example, consider a mapping pattern and the following points: * Key position can only be a `MatchValue` and `MatchSingleton` * `MatchClass` can contain a `MatchAs` as an argument * `MatchAs` pattern with both pattern and name (`a as b`) cannot be converted to a valid expression Considering this, if a mapping pattern has a class pattern with an `as` pattern as one of it's argument, the parser would panic. ```py match subject: case {Foo(a as b): 1}: ... # ^^^^^^^^^^^^^^^^ mapping pattern # ^^^^^^^^^^^ mapping pattern key, class pattern # ^^^^^^ as pattern with both pattern and name ``` The same case is for complex literal patterns as well. I'm not sure what the recovery should look like but for now it's better to avoid the panic and create an invalid expression node in it's place. ## Test Plan Add new test cases and update the snapshots.

## Summary This fixes a bug in the parser when `Mode::Expression` is used where the parser didn't recover from unexpected token after the expression. The current solution isn't ideal as the parser will drop all the remaining tokens after the initial expression has been parsed. This is to avoid panicking especially in the context of forward annotation like: ```python x: Literal["foo", "bar baz"] ``` Here, the parser is being given the `bar baz` as the source to be parsed using `Mode::Expression`. The parser will parse `bar` as a name expression and then stop because the expression ends there. But, it seems like there's more tokens remaining. This specific case could potentially be recovered by using a `ExprTuple` and raising a "missing comma" error but I've avoided that complexity for now. ## Test Plan Add new test cases for `Mode::Expression` and verified the snapshots.

## Summary This PR fixes the bug where the parser would fail to parse the following code: ```python for (x in y)[0] in z: ... ``` This is because when parsing the target of a `for` statement, the context flag for the parser is set to `FOR_TARGET` and this is used by binary expression parsing to stop the expression parsing when it sees the `in` keyword. But, in parenthesized context, this is allowed. The solution implemented in this PR is to reset the parser context when parsing a parenthesized expression and set it back to the original context later. An alternative solution would be to introduce [`ExpressionContext`](https://github.com/biomejs/biome/blob/838ccb442370e72197e625a7d0e4456e6e77e498/crates/biome_js_parser/src/syntax/expr.rs#L42-L43) but that would require a lot of updates as almost all of expression parsing function would need to be updated to take this context. A benefit of this would be that both `AllowStarredExpression` and `AllowNamedExpression` can be merged in the context and everything around `ParserCtxFlags` can be removed. It could possibly also simplify certain parsing functions. ## Test Plan Add test cases and verified the snapshots.

## Summary This PR removes a test case from the formatter which uses a named expression as `del` target. This is an invalid syntax in the new parser.

## Summary This fixes a bug for code like: ```python match subject: case a, b if expr: ... ``` Previously, the `:` was the only terminator token so the list parsing wouldn't stop after `b`. The `if` token has been added to the terminator list to fix this. ## Test Plan Add test cases and verified the snapshots.

## Summary There have been some grammar changes in [PEP 646] which were not accounted for in the old parser. The new parser has been updated with the correct AST. This is the case when there's a starred expression inside a subscript expression like the following example: ```python data[*x] ``` This gives us the AST where the slice element is actually a tuple expression with one element (starred expression) instead of just a starred expression. Now, the formatter's current behavior is to always add a trailing comma in a tuple with a single element. This PR updates the formatter to use the "preserve" behavior in trailing comma as well. So, trailing comma will not be added in the above example and if there's a trailing comma in the above example, it'll be preserved. This retains the current behavior without the AST change. [PEP 646]: https://peps.python.org/pep-0646/#change-1-star-expressions-in-indexes ## Test Plan Run `cargo insta test -p ruff_python_formatter` and make sure there are no changes.

crates/ruff_python_ast/src/comparable.rs

MichaReiser · 2024-04-18T10:45:03Z

One obligatory comment :P Excellent work. Looking forward to my next git pull

addisoncrump · 2024-04-18T11:29:23Z

🚀 Let me know if I can help with further testing.

dhruvmanila added the parser Related to the parser label Feb 19, 2024

dhruvmanila mentioned this pull request Feb 19, 2024

Replace autogenerated parser with hand-written parser #9152

Closed

dhruvmanila marked this pull request as draft February 19, 2024 04:16

dhruvmanila force-pushed the dhruv/parser branch from 3209b85 to d247242 Compare February 19, 2024 08:09

dhruvmanila force-pushed the dhruv/parser branch from d247242 to a695672 Compare March 7, 2024 08:49

dhruvmanila temporarily deployed to release March 7, 2024 10:51 — with GitHub Actions Inactive

dhruvmanila force-pushed the dhruv/parser branch from 96addbf to b5cc384 Compare March 7, 2024 12:53

dhruvmanila temporarily deployed to release March 7, 2024 12:54 — with GitHub Actions Inactive

dhruvmanila temporarily deployed to release March 7, 2024 13:07 — with GitHub Actions Inactive

dhruvmanila temporarily deployed to release March 7, 2024 16:19 — with GitHub Actions Inactive

dhruvmanila temporarily deployed to release March 8, 2024 02:03 — with GitHub Actions Inactive

dhruvmanila force-pushed the dhruv/parser branch from b07c008 to 70731cb Compare March 12, 2024 12:32

dhruvmanila temporarily deployed to release March 12, 2024 12:33 — with GitHub Actions Inactive

dhruvmanila force-pushed the dhruv/parser branch from cf89468 to b803d4f Compare March 13, 2024 16:08

dhruvmanila temporarily deployed to release March 13, 2024 16:09 — with GitHub Actions Inactive

dhruvmanila had a problem deploying to release March 14, 2024 03:26 — with GitHub Actions Error

dhruvmanila temporarily deployed to release March 14, 2024 03:28 — with GitHub Actions Inactive

dhruvmanila had a problem deploying to release March 14, 2024 04:05 — with GitHub Actions Error

dhruvmanila had a problem deploying to release March 14, 2024 04:10 — with GitHub Actions Error

dhruvmanila temporarily deployed to release March 14, 2024 04:13 — with GitHub Actions Inactive

dhruvmanila force-pushed the dhruv/parser branch from 81e8050 to 4f5604f Compare March 14, 2024 09:11

dhruvmanila temporarily deployed to release March 14, 2024 09:11 — with GitHub Actions Inactive

zanieb mentioned this pull request Mar 15, 2024

Publishing Internal Crates to crates.io #10417

Closed

dhruvmanila force-pushed the dhruv/parser branch 2 times, most recently from 236aa4a to c16f7b7 Compare March 26, 2024 10:00

dhruvmanila mentioned this pull request Mar 29, 2024

☂️ Error recovery improvements for the new parser #10653

Open

17 tasks

dhruvmanila force-pushed the dhruv/parser branch 2 times, most recently from 48ce841 to 6d26ec6 Compare March 31, 2024 03:12

dhruvmanila and others added 20 commits April 18, 2024 15:40

Fix parameter range if it includes *, ** (#10875)

52aed8d

Expect for after async instead of bump (#10877)

02e6345

Co-authored-by: Micha Reiser <[email protected]>

Move expr token set, docs improvements (#10887)

8ef7d36

Consider exactly _ for wildcard pattern (#10896)

b156d0b

Fix clippy

0780f78

fixup! Pre-compute the module range for LALRPOP parser (#10879)

d61c689

Remove invalid named expr test from formatter (#10995)

f464a42

## Summary This PR removes a test case from the formatter which uses a named expression as `del` target. This is an invalid syntax in the new parser.

Remove LALRPOP based parser (#10998)

e5e3692

Remove leftover fuzz target

924ed5e

dhruvmanila force-pushed the dhruv/parser branch from e7b19dd to 924ed5e Compare April 18, 2024 10:10

dhruvmanila marked this pull request as ready for review April 18, 2024 10:26

dhruvmanila requested a review from MichaReiser as a code owner April 18, 2024 10:26

MichaReiser approved these changes Apr 18, 2024

View reviewed changes

crates/ruff_python_ast/src/comparable.rs Outdated Show resolved Hide resolved

Remove leftover Invalid enum variants

98a7669

dhruvmanila merged commit 13ffb5b into main Apr 18, 2024
17 checks passed

dhruvmanila deleted the dhruv/parser branch April 18, 2024 12:27

BrewTestBot mentioned this pull request Apr 18, 2024

ruff 0.4.0 Homebrew/homebrew-core#169465

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace LALRPOP parser with hand-written parser #10036

Replace LALRPOP parser with hand-written parser #10036

dhruvmanila commented Feb 19, 2024 •

edited

Loading

codspeed-hq bot commented Mar 7, 2024 •

edited

Loading

github-actions bot commented Mar 12, 2024 •

edited

Loading

MichaReiser commented Apr 18, 2024

addisoncrump commented Apr 18, 2024

Replace LALRPOP parser with hand-written parser #10036

Replace LALRPOP parser with hand-written parser #10036

Conversation

dhruvmanila commented Feb 19, 2024 • edited Loading

Summary

Test Plan

Acknowledgements

codspeed-hq bot commented Mar 7, 2024 • edited Loading

CodSpeed Performance Report

Merging #10036 will improve performances by ×2.4

Summary

Benchmarks breakdown

github-actions bot commented Mar 12, 2024 • edited Loading

ruff-ecosystem results

Linter (stable)

Linter (preview)

Formatter (stable)

Formatter (preview)

MichaReiser commented Apr 18, 2024

addisoncrump commented Apr 18, 2024

dhruvmanila commented Feb 19, 2024 •

edited

Loading

codspeed-hq bot commented Mar 7, 2024 •

edited

Loading

github-actions bot commented Mar 12, 2024 •

edited

Loading

`ruff-ecosystem` results