Improve error recovery for strings and f-strings #11916

dhruvmanila · 2024-06-18T04:50:46Z

Currently, the parser drops all unterminated strings or more precisely the lexer doesn't emit the string token for unterminated strings. This is for both normal and formatted string literals.

This can be solved by always emitting the string / f-string tokens even if it's unterminated. We would need to introduce an Unterminated token flag which will signal the parser about the same.

For example:

"hello world

def foo():
    pass

Even though the parser recovers from the above unterminated string, the AST doesn't include the string itself. Here, the new tokens would be:

TokenKind::String (with TokenFlags::Unterminated)
TokenKind::Newline
TokenKind::NonLogicalNewline
TokenKind::Def
...

This will help the parser create the correct AST. We'd also need to move the error handling from the lexer to the parser.

Note that this will only work for single-quoted strings as triple-quoted strings will just consume the remaining source code completely if there's no closing quote.

F-strings

There's more to f-strings than a normal string literal because it can contain an expression element. This means that there could be unterminated f-string or the closing curly brace could be missing or an error in the expression itself.

For example:

f"hello {foo
# or, f"hello {foo}
# or, f"hello {foo} world
# or, f"hello

def bar():
    pass

Here, there's a NonLogicalNewline token after the "foo" variable because of an unclosed {. This would require a modified re-lexing logic to (1) recover from an unclosed { and (2) recover from an unterminated f-string.

This also requires moving the error handling from the lexer to the parser.

But, here's another problem: the parser doesn't use existing list parsing logic for parsing comma-separated expressions aka tuple expression which means it doesn't benefit from any error recovery logic. Thus, if there's an error in the f-string expression itself, there's a high change it'll not recovery correctly.

f'hello {foo bar

def func():
    pass

For the above code snippet, should "foo bar" be parsed as a tuple expression with a missing comma (?) because currently the "bar" would be parsed as a name node in the global scope.

`Unknown` token

If we were to move the error handling for strings to the parser, it would mean that there won't be any Unknown token. This means that the change cannot be made without support for #11915. The main reason is that these rules rely on the fact that there'll be an Unknown token in the token stream where there's a lexial error. Maybe there's a way around that but I'm not exactly sure now.

The text was updated successfully, but these errors were encountered:

dhruvmanila · 2024-06-19T04:36:04Z

After #11915, if we do start emitting the tokens for unterminated strings, we should avoid running any token-based rules on them. This could be done by checking whether the Unterminated token flag is set or not.

dhruvmanila added the parser Related to the parser label Jun 18, 2024

This was referenced Jun 19, 2024

Rule PYI010 cause panic #11937

Closed

Improve f-string error recovery within list parsing #11946

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve error recovery for strings and f-strings #11916

Improve error recovery for strings and f-strings #11916

dhruvmanila commented Jun 18, 2024

dhruvmanila commented Jun 19, 2024

Improve error recovery for strings and f-strings #11916

Improve error recovery for strings and f-strings #11916

Comments

dhruvmanila commented Jun 18, 2024

F-strings

Unknown token

dhruvmanila commented Jun 19, 2024

`Unknown` token