uncommon_codepoints: lint against 00B7 MIDDLE DOT in final position #120695
00B7 MIDDLE DOT '·' is considered by [UTS 39](https://unicode.org/reports/tr39/#Identifier_Status_and_Type) to be "Allowed - Inclusion" due to its presence in [UTR 31 Table 3a - Optional Characters for Medial](https://www.unicode.org/reports/tr31/#Table_Optional_Medial); as such, it is unsuitable for appearing in the final position of identifiers.
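For illustration only, the condition described above could be sketched as a standalone check. This is not the code from this PR or from rustc; `MIDDLE_DOT` and `ends_with_middle_dot` are hypothetical names chosen for the example.

```rust
/// U+00B7 MIDDLE DOT, the character discussed in this PR.
const MIDDLE_DOT: char = '\u{00B7}';

/// Hypothetical helper (not a rustc API): returns `true` when the
/// identifier's last character is U+00B7 MIDDLE DOT.
fn ends_with_middle_dot(ident: &str) -> bool {
    ident.ends_with(MIDDLE_DOT)
}
```

Under this sketch, a Catalan word such as `col·lecció` passes because the dot is medial, while `foo·` would be flagged.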
r? @pnkfelix (rustbot has picked a reviewer for you, use r? to override) |
Do we want to add special cases here? I would be surprised if middle dot was the only special case |
MIDDLE DOT is the only Unicode character with this particular set of special circumstances:
|
Another special case that could be added to the lint is proper handling of ZWJ/ZWNJ, as recommended by UTS 55. |
What are we trying to accomplish by allowing an interpunct (·, aka middle dot) with no linting whatsoever, in the middle of identifiers but not at the end? I understand that the grammar of UAX31 indicates that medial characters are not meant to ever occur at the end of an identifier. But we also explicitly have adopted the profile from UAX31 that leaves <Medial> empty. (And, I think, have implicitly put 00B7 MIDDLE DOT '·' into
In any case, I am not yet sure why we wouldn't just lint against this character (as an "uncommon unicode character") regardless of whether it occurs at the beginning, middle, or end of a symbol. From my point of view, either its presence is confusing (and will be confusing regardless of context), or its presence is not confusing (regardless of context).
|
Maybe I misunderstand what you are proposing; I thought Rust already rejects ZWJ/ZWNJ in identifiers (reference). Are you talking about other contexts where we need to lint that? |
Yes, and this is more strict than what the standard recommends. The recommendation is to allow it in a few special cases, and lint against it everywhere else. |
ZWJ is indeed already not accepted as part of identifiers, it is considered whitespace by the lexer. Could you expand on why 00B7 needs to be considered an uncommon codepoint only if it appears at the end of an identifier? |
No, it is not valid whitespace either (does not have the
It doesn't matter to me either way, I guess it depends on how strongly we value supporting Catalan. I can change the PR to lint against it in all positions if there's consensus in that direction. |
I don't think this is correct:
I'm not sure what you mean by
I don't think that's due to that table at all. I think it's directly because it's used in Catalan.
Yeah, this was a deliberate choice: I didn't want to bog down the original RFC with the careful handling of ZW(N)J, so I left it an open question for the future. The lack of ZW(N)J should not be a signal about whether other things should or shouldn't be allowed, since we could totally allow them in the future. I definitely think we should take a more principled approach here. There's #120228 and it already proposes having a separate sub-lint that deals with confusables with syntax. While that issue doesn't itself ask for splitting lints, this would be a good reason to split the lint. I think middle dot counting as "uncommon codepoints" goes against the spirit of the lint, which is about non-linguistic content. I would rather not single out Catalan on this until we have a principled approach. If people feel strongly about linting against this, the lint should then catch the middle dot in all positions where it is not surrounded by the letter l on both sides, since in Catalan the only legal place for the interpunct is in <l·l>. But my preference is to first fix #120228 (without changing what is linted about) and then we can get better at expanding sub-lints. |
Were that the case, it would be
How do you even define what is an "l"? What about case, diacritics, mathematical style variants, fullwidth etc… |
l or L, Catalan doesn't have diacritics on its ls, and the rest are not linguistic content. |
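As a rough sketch of the stricter rule floated in this exchange (a middle dot is acceptable only between two `l`/`L` letters), something like the following could work. It is an assumption-laden illustration, not a proposed implementation: the function names are hypothetical, and it deliberately ignores diacritics, fullwidth forms, and other variants, per the reply above.

```rust
/// U+00B7 MIDDLE DOT.
const MIDDLE_DOT: char = '\u{00B7}';

/// Hypothetical definition of "an l" for this sketch: ASCII `l` or `L` only.
fn is_latin_l(c: char) -> bool {
    matches!(c, 'l' | 'L')
}

/// Hypothetical helper (not a rustc API): returns `true` when every
/// U+00B7 in `ident` has an `l`/`L` immediately before and after it.
/// Identifiers without any middle dot trivially pass.
fn middle_dots_are_flanked_by_l(ident: &str) -> bool {
    let chars: Vec<char> = ident.chars().collect();
    chars.iter().enumerate().all(|(i, &c)| {
        if c != MIDDLE_DOT {
            return true;
        }
        // A dot at index 0 has no predecessor, so it can never be flanked.
        let prev_ok = i > 0 && is_latin_l(chars[i - 1]);
        // A dot in final position has no successor, so `get` yields `None`.
        let next_ok = chars.get(i + 1).copied().is_some_and(is_latin_l);
        prev_ok && next_ok
    })
}
```

Note that this sketch accepts mixed case such as `l·L`; whether that should be allowed is not settled in the discussion.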
Ah, forgot about that.
Ah, fair. The background for this is actually probably that a lot of this standard is driven by the needs of domain name stuff, where the interpunct in particular is a bit of a problem. I think it's less of a big deal for us. I wouldn't take Inclusion as too much of a signal here. That ontology needs a bit more work on defining what people should use these sub-properties for, it's a bit of a mess right now. We've got plans to improve that but that'll take time. |
By the way, @Jules-Bertholet , I do want to thank you for bringing this to our attention. I don't know where this conversation will end up. I'm planning to tag this as nominated for T-lang design since that team has the final call when it comes to potentially contentious lint decisions. But no matter where it ends up, it is super useful to know about the issue. |
Actually, the more I reflect on this, I think @Manishearth has the best guidance here: We shouldn't do anything about 00B7 until we have the more principled approach in place, i.e. we should resolve #120228 first, and then land any changes here on top of the framework introduced there. In that spirit, I am going to close this PR, but I want to thank @Jules-Bertholet again for bringing this to our attention. And I will also open up an issue, so that we can track the follow-up work item once #120228 is resolved. |
@rustbot label T-compiler A-diagnostics A-unicode