-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not consider #
an emoji in the lexer
#109754
Conversation
r? @TaKO8Ki (rustbot has picked a reviewer for you, use r? to override) |
Can you link to open-i18n/rust-unic#280 ? I think that may be the root cause of the issue. r=me afterwards |
Actually, rust-unic doesn't seem to have been updated since 2019(!). I'm actually feeling like we may want to unland the original fix and fix the issue with edit: That is to say, I'm not confident that rust-unic will fix this any time soon so maybe it's best to approach this fix from another direction 🤔 |
Well, this is beta nominated anyways, so perhaps let's use that as an opportunity to discuss this in the compiler meeting tomorrow morning. r=me if we decide to go forward with this fix during that time. |
We've been using rust-unic in the lexer for a while now to recover identifiers with emojis in general, the PR with the regression only added a path to recover lifetimes in particular. Other codepaths were already excluding ASCII codepoints from consideration. It is disappointing to see such a useful project fall into disrepair, but maybe we can (as a separate effort) try to contact the maintainers (I see @CAD97 was involved a few years back and might be amenable to doing code reviews for fixes?). Otherwise, we could move to another crate like emojis for the checks. |
I can do code review, but I don't think I have publish rights (would need to double check), so it'd have to be published as a fork. Nowadays I'd recommend building on icu4x if possible since that has a real backing rather than relying on the bandwidth of just one person for management. A big advantage of icu4x is that the data provider can be separated from the algorithm/API provider, so the former doesn't have to wait on the latter for updates, and you get to choose exactly what data you pack. Here's the relevant loader. (I also fell out of favor with unic's project structure around the time of my most recent commit landing there, and again nowadays tend to prefer icu4x. I can try to email @behnam over the weekend to see if I can get publishing permissions to keep what unic does have updated.) |
Why the emoji detection logic even exists in the lexer is beyond me. Yeah, some person thought it would improve diagnostics, with which I disagree, and it would be tolerable if it were harmless. |
Checking for the Emoji character property is not going to be the right approach, as a number of characters (as you've found) have that property despite not being what most users would consider emojis (filtering ASCII might be enough, although a number of other things that don't really look like emoji, such as I think usually you'd look for full emoji sequences (or perhaps RGI_Emoji) to do this accurately, but I'm not sure. @Manishearth might know, and there might be a reasonable way to do this in icu4x. |
So an annoying thing is that what is and isn't an emoji is kinda font-dependent too. There are a couple properties that may be useful here:
I suspect what you want is the first one. It's not perfect, but it's probably better. Here's what it excludes from Yes, ICU4X has APIs for all of these properties, e.g. |
t-compiler decided to revert the original PR |
To the specific case of this diagnostic, the "best" solution would probably be to somehow gate the "likely bad identifier" codepath to only characters which would otherwise get the "unknown start of token" error. |
Yeah though I think switching to Emoji_Presentation will also be a win. |
Also, the latest UAX 31 draft does have an extension for emoji in identifiers which we could include if someone is up to writing an RFC (I'm happy to let them know what would be required and help them along). |
Ooh, that's cool! Even if Rust doesn't use it for allowed identifiers (without a technical limitation, I don't see a reason not to, given we already support all sorts of "bad" identifiers via the normal XID sets), using that specification for the recognizing of bad identifiers for nicer errors makes a lot of sense. The only thing that worries me a little is that part of it is
which unless that codepoint is in the RGI emoji set added to both Start and Continue, makes the change make some strings which are currently identifiers not. (I understand the purpose: to prevent the use of an allowed identifier emoji in text presentation. And there's no simple way to say "allowed, but only after a codepoint which isn't part of an emoji sequence.") Though the use of this codepoint in a valid identifier should be rare enough that this should be fine? I guess that's for the eventual RFC to hash out. |
It's the opposite: the spec allows it after certain emoji-capable codepoints. But yes, this complicates things. For backcompat it probably makes sense to not touch VS and maybe just exclude emoji that contain characters that are current Rust syntax tokens (it's a very small handful) |
Well, I've confused myself sufficiently for today. VS15 = text presentation, VS16 = emoji presentation. The emoji profile removes VS15 from the identifier Continue, uses VS15 for marking the set of characters which are both pattern syntax and emoji presentation as syntax instead of identifiers, removes VS16 from the operator Continue, and has an example of using VS16(?) to mark ☕ as not an identifier, but syntax instead. I think that because Rust doesn't use Pattern_Syntax for defining the set of syntactically meaningful characters, instead using a small list of ASCII (though IIRC all are in Pattern_Syntax), none of the characters in Or perhaps you're saying to exclude those characters from identifiers, since following them with VS15 should by the standard emoji profile make them behave as syntax instead of identifier. Or perhaps you're just referring to cases like *️⃣, which is It'd be nice if the UAX could make the handling of VS15 a bit clearer without having to read UTS51. I thought I understood2 without knowing the intricacies of emoji sequences, just the basics, but now I'm not sure I do anymore. The maybe-typo of using U+FE0F VS16 in the operator example instead of U+FE0E VS15 like discussed in the following prose certainly doesn't help. Rephrasing, if I understood the spec correctly, the handling of VS15 is just to handle the emoji which are considered syntax. An alternative handling would be to instead just forbid the use of those characters as syntax (this is already the case in Rust), not adding them-plus-VS15 to the set of characters with syntactic use. Thus we can leave VS15 in identifiers without any issue. I'm obviously interested in being clued in if/when an RFC is posted to adopt the emoji profile for Rust identifiers. But until then this doesn't matter too much. Footnotes
|
Yeah, and also I certainly am not planning on proposing this I may propose an RFC for ZWNJ in the identifier profile, and I may propose some lint improvements as PRs. I would help someone doing a mathematical symbols thing and would sketch out what needs to be done for someone wanting to do emoji, but I don't think we have anyone who wants to do that. |
…avidtwco Revert "Don't recover lifetimes/labels containing emojis as character literals" Reverts PR rust-lang#108031 per rust-lang#109754 (comment) Fixes (doesnt close until beta backported) rust-lang#109746 This reverts commit e3f9db5. This reverts commit 98b82ae. This reverts commit 380fa26.
…avidtwco Revert "Don't recover lifetimes/labels containing emojis as character literals" Reverts PR rust-lang#108031 per rust-lang#109754 (comment) Fixes (doesnt close until beta backported) rust-lang#109746 This reverts commit e3f9db5. This reverts commit 98b82ae. This reverts commit 380fa26.
…avidtwco Revert "Don't recover lifetimes/labels containing emojis as character literals" Reverts PR rust-lang#108031 per rust-lang#109754 (comment) Fixes (doesnt close until beta backported) rust-lang#109746 This reverts commit e3f9db5. This reverts commit 98b82ae. This reverts commit 380fa26.
…avidtwco Revert "Don't recover lifetimes/labels containing emojis as character literals" Reverts PR rust-lang#108031 per rust-lang#109754 (comment) Fixes (doesnt close until beta backported) rust-lang#109746 This reverts commit e3f9db5. This reverts commit 98b82ae. This reverts commit 380fa26.
…avidtwco Revert "Don't recover lifetimes/labels containing emojis as character literals" Reverts PR rust-lang#108031 per rust-lang#109754 (comment) Fixes (doesnt close until beta backported) rust-lang#109746 This reverts commit e3f9db5. This reverts commit 98b82ae. This reverts commit 380fa26.
Fix #109746. Some unexpected ASCII chars are considered emoji by
rust-unic
, so we exclude them ourselves.