-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Apply NFKC normalization to unicode identifiers in the lexer #10412
Apply NFKC normalization to unicode identifiers in the lexer #10412
Conversation
|
1eed840
to
a8a9863
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great with such a minimal change. Can you please write / update documentation around this change? Maybe the lex_identifier
or the Name
token?
It's good to go from my side but I'll let @MichaReiser give the final approval.
Thanks! I had a go at writing some docs in d785be5 -- how does that look to you? |
Looks good. Thank you! |
@@ -16,6 +16,9 @@ pub enum Tok { | |||
/// Token value for a name, commonly known as an identifier. | |||
Name { | |||
/// The name value. | |||
/// | |||
/// Unicode names are NFKC-normalized by the lexer, | |||
/// matching [the behaviour of Python's lexer](https://docs.python.org/3/reference/lexical_analysis.html#identifiers) | |||
name: Box<str>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was eventually hoping to remove all owned data from the lexer tokens (e.g., prior to this change, we could've conceivably removed this field altogether; if we remove more similar fields from other tokens, we can eventually reduce the size of the Tok
enum, which could be great for performance). This now precludes us from doing so. But I don't have enough context on the future design of the lexer-parser to know if it matters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can still accomplish this and Name
isn't the only token that must return the parsed value (e.g. FStrings etc).
What I have in mind is to:
- Replace
Tok
withTokenKind
(that holds no data) - Add a
take_value()
method onLexer
that "takes" the current value (enum
overName
,Int
etc.).
This design also fits better into our new parser that already does exactly this internally (except that it "takes" the value from Tok
). The advantage of this is that we only pay the overhead of reading or writting a value if it is a value token (and we're interested in the value).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point. We could potentially move this to the parse_identifier
method in the parser as that's the final destination for where this value needs to be. I could come back to this once the new parser is merged and I'm looking into the feedback loop between the lexer and parser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable to me, I'll defer to Micha and Dhruv to approve.
@@ -16,6 +16,9 @@ pub enum Tok { | |||
/// Token value for a name, commonly known as an identifier. | |||
Name { | |||
/// The name value. | |||
/// | |||
/// Unicode names are NFKC-normalized by the lexer, | |||
/// matching [the behaviour of Python's lexer](https://docs.python.org/3/reference/lexical_analysis.html#identifiers) | |||
name: Box<str>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can still accomplish this and Name
isn't the only token that must return the parsed value (e.g. FStrings etc).
What I have in mind is to:
- Replace
Tok
withTokenKind
(that holds no data) - Add a
take_value()
method onLexer
that "takes" the current value (enum
overName
,Int
etc.).
This design also fits better into our new parser that already does exactly this internally (except that it "takes" the value from Tok
). The advantage of this is that we only pay the overhead of reading or writting a value if it is a value token (and we're interested in the value).
Thanks all! |
Summary
A second attempt to fix #5003, hopefully without the performance problems that #10381 suffered from.
Python applies NFKC normalization to identifiers that use unicode characters. That means that F821 should not be emitted if ruff encounters the following snippet (but on
main
, it is), as from Python's perspective, these are all the same identifier:This PR fixes that false positive by NFKC-normalizing identifers as they are encountered in the lexer.
Test Plan
cargo test