Tracking Issue for aligning Unicode version to 15.0 (year 2022) #101840

Closed
30 tasks done
crlf0710 opened this issue Sep 15, 2022 · 12 comments · Fixed by #118229
Labels
A-Unicode Area: Unicode C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC

Comments

@crlf0710
Member

crlf0710 commented Sep 15, 2022

Unicode is released on a yearly basis, so in the Rust project we update the data files we use after each Unicode release. (Keep in mind that new dependencies might be added over time.)

About tracking issues

Tracking issues are used to record the overall progress of implementation.
They are also used as hubs connecting to other relevant issues, e.g., bugs or open design questions.
A tracking issue is however not meant for large scale discussion, questions, or bug reports about a feature.
Instead, open a dedicated issue for the specific matter and add the relevant feature gate label.

Steps

Goal: Unicode 15.0 (year 2022)

Unicode version dependent crates:

Libraries
Compiler
Language integrated:
  • unicode-xid (Decide whether it's a valid identifier)
    Current: 0.2.2 (Unicode 13)
    Goal: 0.2.4 (Unicode 15)
  • unicode-normalization (Preprocess identifiers for equality)
    Current: 0.1.13 (Unicode 9)
    Goal: 0.1.22 (Unicode 15)
  • unicode-security (Decide whether lints against unwanted usages should be triggered)
    Current: 0.0.5 (Unicode 13)
    Goal: 0.1.0 (Unicode 15)
  • unicode-script (Used by unicode-security for script detection)
    Current: 0.5.3 (Unicode 13)
    Goal: 0.5.5 (Unicode 15)
Diagnostics:
  • unicode-width (used by rustc-parse, rustc-errors and many others)
    Current: 0.1.8 (Unicode 13)
    Goal: 0.1.10 (Unicode 15)
  • unicode-properties (used by rustc-lexer)
    Current: 0.1.0 (Unicode 15)
    Goal: 0.1.0 (Unicode 15)
  • Removed: unic-char-property (used by unic-emoji-char, then rustc-lexer)
    Current: 0.9.0 (Unclear, No release in 2 years)
    Goal: Need a new release (Will be replaced by unicode-properties in Update lexer emoji diagnostics to Unicode 15.0 #114193)
  • Removed: unic-char-range (used by unic-emoji-char, then rustc-lexer)
    Current: 0.9.0 (Unclear, No release in 2 years)
    Goal: Need a new release (Will be replaced by unicode-properties in Update lexer emoji diagnostics to Unicode 15.0 #114193)
  • Removed: unic-common (used by unic-emoji-char, then rustc-lexer)
    Current: 0.9.0 (Unclear, No release in 2 years)
    Goal: Need a new release (Will be replaced by unicode-properties in Update lexer emoji diagnostics to Unicode 15.0 #114193)
  • Removed: unic-ucd-version (used by unic-emoji-char, then rustc-lexer)
    Current: 0.9.0 (Unclear, No release in 2 years)
    Goal: Need a new release (Will be replaced by unicode-properties in Update lexer emoji diagnostics to Unicode 15.0 #114193)
  • Removed: unic-emoji-char (used by rustc-lexer)
    Current: 0.9.0 (Unclear, No release in 2 years)
    Goal: Need a new release (Will be replaced by unicode-properties in Update lexer emoji diagnostics to Unicode 15.0 #114193)
Dev-Tools:
Dependency crates:
  • unicode-bidi (used by idna then url then [ammonia, cargo, cargo-test-support, clippy_lints, crates-io, git2, git2-curl, rustc-workspace-hack])
    Previously: 0.3.4 (Unicode 10)
    Goal: >=0.3.10 (Unicode 15)
  • unicode-segmentation (used by rustfmt)
    Previously: 1.9.0 (Unicode 14)
    Goal: >=1.10.0 (Unicode 15)
  • unicode-properties (used by rustfmt)
    Mentioned above in compiler diagnostics section
  • Removed: unicode_categories (used by rustfmt)
    Current: 0.1.1 (Unclear, No release in 6 years)
    Goal: Need a new release (Will be replaced by unicode-properties in Update Unicode data to 15.0 rustfmt#5864)
  • unicase (used by pulldown-cmark, then [rustdoc, clippy-lints, mdbook])
    Current: 2.6.0 (Unclear, No release in 3 years)
    Goal: >=2.7.0 (Unicode 15)

Unicode version independent crates (ignorable for now, just for future reference):

  • unicode-bdd (in-tree maintenance tool): Unicode version independent
  • ucd-parse: Unicode version independent (used by unicode-bdd tool)
  • ucd-trie: Unicode version independent (used by handlebars, then mdbook)
  • unic-langid, unic-langid-impl, unic-langid-macros, unic-langid-macros-impl: Not really Unicode version independent but we only use Unicode version independent part. They're outdated, current: 0.9.1 (CLDR 37, Spring 2020, ~= Unicode 13), would be nice if a new release is used.
@crlf0710 crlf0710 added A-Unicode Area: Unicode C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC labels Sep 15, 2022
@thomcc
Member

thomcc commented Sep 15, 2022

unicode-bdd is src/tools/unicode-table-generator, which is updated with the libcore update (or more specifically, this year it needed no update).

ucd-* is handled by BurntSushi/ucd-generate#53.

@thomcc
Member

thomcc commented Nov 3, 2022

I happened to notice this in the UCD spec: https://www.unicode.org/reports/tr44/#Missing_Conventions. It seems that as of 15.0.0 we have to start parsing some of the comments to look for @missing decls now.

Starting with Version 15.0, some data files in the UCD may contain multiple @missing lines defined for the same property. When multiple @missing lines are defined this way, they are to be interpreted as follows: Each successive @missing line specifies an overriding range value for all previous @missing definitions. This convention allows a generic default value to be specified first for the entire Unicode code point range, followed by other specific default values for more constrained, specific sub-ranges. This enables an easy-to-understand and easy-to-maintain way of handling complex default values, as for the Bidi_Class or Line_Break properties. (See Complex Default Values.) The following simple example for East_Asian_Width, extracted from DerivedEastAsianWidth.txt, illustrates this mechanism:

# @missing: 0000..10FFFF; Neutral
# @missing: 3400..4DBF; Wide
# @missing: 4E00..9FFF; Wide
# @missing: F900..FAFF; Wide
# @missing: 20000..2FFFD; Wide
# @missing: 30000..3FFFD; Wide

Implementation of parsing for multiple @missing lines for a single property is straightforward. Each time an @missing line is encountered, simply assign the given default value to the specified range. With this strategy, each successive @missing line will automatically override any prior assigned values for a given sub-range.
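The override rule quoted above can be sketched in a few lines of Rust. This is only an illustration and not code from ucd-parse or any other crate mentioned in this thread: the names `Missing`, `parse_missing`, `collect_missing`, and `default_for` are hypothetical, and the sketch assumes the two-field form (`range; value`) shown in the East_Asian_Width example. Files that use a three-field form (`range; property; value`) would need an extra field.

```rust
/// One `# @missing: START..END; VALUE` line, parsed.
/// (Hypothetical type for illustration only.)
#[derive(Debug, Clone, PartialEq)]
pub struct Missing {
    pub start: u32,
    pub end: u32, // inclusive
    pub value: String,
}

/// Parses a single line. Returns `None` for anything that is not an
/// `@missing` comment in the two-field form.
pub fn parse_missing(line: &str) -> Option<Missing> {
    let rest = line.trim().strip_prefix('#')?.trim_start();
    let rest = rest.strip_prefix("@missing:")?;
    let mut fields = rest.split(';').map(str::trim);
    let range = fields.next()?;
    let value = fields.next()?;
    // A range is either `START..END` or a single code point.
    let (start, end) = match range.split_once("..") {
        Some((s, e)) => (s, e),
        None => (range, range),
    };
    Some(Missing {
        start: u32::from_str_radix(start, 16).ok()?,
        end: u32::from_str_radix(end, 16).ok()?,
        value: value.to_string(),
    })
}

/// Collects every `@missing` line, preserving file order.
pub fn collect_missing(text: &str) -> Vec<Missing> {
    text.lines().filter_map(parse_missing).collect()
}

/// Later lines override earlier ones, so the last matching range wins.
pub fn default_for(defaults: &[Missing], cp: u32) -> Option<&str> {
    defaults
        .iter()
        .rev()
        .find(|m| m.start <= cp && cp <= m.end)
        .map(|m| m.value.as_str())
}
```

Searching the list in reverse is equivalent to "assign each range in order and let later assignments overwrite earlier ones", but it avoids materializing a table for all 1,114,112 code points when only a few lookups are needed.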

This is pretty annoying and will likely lead to some bugs (likely minor). I'm certain I've written code that would be buggy under this. (Hell, I'm not sure I've ever seen code that handled these, and I've read a good number of UCD wrangler scripts -- usually it's just to say "assign the obvious default value to anything that doesn't get specified", as in the # @missing: 0000..10FFFF; Neutral line above)

It looks like the only files in UCD that take advantage of this (e.g. use it several times to define a full range as having the obvious default case) are DerivedBidiClass.txt, DerivedEastAsianWidth.txt, and DerivedLineBreak.txt (note: this is just the UCD, and I haven't checked stuff like the UTS39 security data, though). This doesn't include anything we ship in the stdlib AFAIK, but almost certainly impacts some stuff in the compiler (in probably minor ways, if it's limited to the above).

I suspect the argument is that folks were always doing this wrong if they ignored it, though? Oof.

CC @Manishearth (since I suspect you know more details here, and even if not, I believe unicode-width is yours), @BurntSushi (since I noticed it while adding something to ucd-generate, presumably there are a few things in regex that might be impacted if you update its tables)

@Manishearth
Member

This seems like another reason for us to potentially switch over to ICU4X (maintained by Unicode), potentially with unicode-rs (or somewhere) maintaining crates of docs.rs/databake-format data output.

Unfortunately it's unlikely url will move to something heavier than the unicode-rs crates, which have always stayed zero-dep since they're at the bottom of many deptrees.

I don't have time to fixup unicode-width but would be happy to accept a patch.

@thomcc
Member

thomcc commented Nov 3, 2022

I think rustc doing that is probably very good. It's hard for me to imagine anybody fighting you on it.

For the core::unicode tables... presumably you mean inside the table generator (definitely we couldn't add a dep to libcore on icu4x). We could use it to query the data we build the tables with, instead of ucd-parse. That doesn't sound like a problem eventually. (I'll look into it soonish since I'm messing around with Unicode table generation again).

For the ecosystem at large, I dunno. I suspect it will be a tough sell, due to compile time, binary size, and overall heavyweight dependency trees already being in the most common complaints.

@Manishearth
Member

Yeah, for the core unicode stuff it'd probably be best done as a table generation. Worth noting though: you could still use the efficient code point trie impl that ICU4X uses for this, and not take on too many dependencies. Note that ICU4X's model also involves running a data generation tool, so it might not be an improvement over ucd-parse for us. Eventually, ICU4X will have a copy of various slices of data on a CDN, and then it would be an improvement over ucd-parse. The only difference right now would be that ICU4X is much more likely to update quickly to the latest unicode stuff than ucd-parse, but that may not be a huge deal since ucd-parse doesn't need much work.

I wasn't really thinking about the ecosystem at large. I'm fine continuing to maintain unicode-rs, I just don't have that much time to devote to it. Overall rustc and other users tend to make PRs when they need them and that works out fine.

FWIW, ICU4X is very much optimized for binary size. Perhaps not compile times, but it does tend to produce small binary sizes (and it's shipping on embedded devices already).

I'm not really pushing on us moving to ICU4X yet (I think @crlf0710 is investigating a bit though!), I think the status quo is manageable, but it's definitely an option on the table!

@BurntSushi
Member

I don't think I have any strong opinions. It seems like ucd-parse could be updated to handle these missing declarations without too much trouble though? And then things would be fine? Or is there something else I'm missing?

Also worth pointing out that the whole ucd-generate stuff is something I built for lack of a better alternative at the time. My goal was to get rid of an ever increasing pile of bespoke Python scripts. But I don't have much vision beyond that, other than keeping it "simplistic" and framing it as a "kitchen sink" of sorts. I've also updated ucd-generate's README to link to ICU4X (wording changes very welcome): https://github.com/BurntSushi/ucd-generate/#alternatives

@Manishearth
Member

Oh yeah to be clear, no shade on ucd-parse and ucd-generate! They're great tools and I don't necessarily mean to say ICU4X is much better or anything.

Rustc already needs ICU4X for list formatting and @crlf0710 was investigating using it for more things like the XID and security properties (which rely on those manual python scripts).

@BurntSushi
Member

@thomcc I can't make heads or tails of the @missing declarations. From what I can tell, they're mainly being used to apply property values to... unassigned codepoints? Just looking at the diff between Unicode 14 and 15 for extracted/DerivedBidiClass.txt, this text was removed from 14 and replaced with @missing declarations in 15:

# reserved for right-to-left scripts are given either types R or AL.
#
# The unassigned code points that default to AL are in the ranges:
#     [\u0600-\u07BF \u0860-\u08FF \uFB50-\uFDCF \uFDF0-\uFDFF \uFE70-\uFEFF
#      \U00010D00-\U00010D3F \U00010F30-\U00010F6F
#      \U0001EC70-\U0001ECBF \U0001ED00-\U0001ED4F \U0001EE00-\U0001EEFF]
#
#     This includes code points in the Arabic, Syriac, and Thaana blocks, among others.
#
# The unassigned code points that default to R are in the ranges:
#     [\u0590-\u05FF \u07C0-\u085F \uFB1D-\uFB4F
#      \U00010800-\U00010CFF \U00010D40-\U00010F2F \U00010F70-\U00010FFF
#      \U0001E800-\U0001EC6F \U0001ECC0-\U0001ECFF \U0001ED50-\U0001EDFF \U0001EF00-\U0001EFFF]
#
#     This includes code points in the Hebrew, NKo, and Phoenician blocks, among others.
#
# The unassigned code points that default to ET are in the range:
#     [\u20A0-\u20CF]
#
#     This consists of code points in the Currency Symbols block.

So it seems to me like parsers of UCD that ignore @missing in Unicode 15 are probably still just as wrong as they were with Unicode 14?

The Unicode docs do a really terrible job of even explaining what @missing is supposed to mean and why it even exists at all. The only explanation of what they actually are seems to be the first sentence of the section you linked:

Specially-formatted comment lines with the keyword "@missing" are used to define default property values for ranges of code points not explicitly listed in a data file.

Most of the rest of that section is documenting the format. The key thing that's missing from the docs is what @missing actually is and why it exists. Like... why does there need to be a mechanism for defining default values for ranges of codepoints when the files themselves are already defining values for ranges of codepoints? Something is missing from my conceptual understanding because I don't understand why the @missing declarations aren't just converted into normal lines in the data files.

@Manishearth
Member

I have some guesses as to what the answer is but I don't want to say something incorrect, so I've just asked these questions to the Unicode people.

@thomcc
Member

thomcc commented Nov 5, 2022

I'd like to hear back from @Manishearth's contacts in case I'm mistaken (or missing some detail), but I'm confident I know what they're for, and have thoughts on them in the context of a low level library like ucd-parse.

These exist because several of the UCD files do already require special handling for missing values, but it's not usable in a programmatic manner and needs hard-coding. I believe the goal is that for properties that have @missing lines, you can now determine the property values for all codepoints using just the data programmatically readable from the UCD, without reading the documentation of that property.

Here are most of the interesting cases I've found for this (some of which are more useful than others), which should give you a good idea of why it exists:

  1. Properties with very complex defaults, like Bidi_Class. I believe this is intended to replace things like ucd-generate's hard-coded list of Bidi_Class defaults.

    Presumably this list in ucd-generate's source should have been getting updated to synchronize with the comments in DerivedBidiClass.txt's header on each Unicode version update, but that comment didn't really indicate that it was going to change, so that was not happening. Now, these are in @missing lines, which allows that to happen without manual work.

    I don't know if it changed before now (some documentation hints that it did), but I'd weakly recommend to the Unicode folks that when it changes, the BidiTest.txt and BidiCharacterTest.txt files are also changed to contain tests that fail if the updates have not happened. That may already happen, but that's really the only way many people know if their parsing code is wrong for something like this.

  2. Properties with a single default value. In most of these, the default is listed in that property's documentation (or prose in comments in the file) and is kind of the "obvious fallback" (often None, if that's a value for that property). Now you can read the @missing line to determine that the default for Canonical_Combining_Class is Not_Reordered, that Numeric_Type defaults to None, or that Age defaults to Unassigned, and so on.

    An odd case here is that the empty string is spelled <none>, but given that that's only used for a property whose value is a codepoint (and not an enumeration), it's unsurprising that it has a special case.

  3. Similarly, boolean properties which default to true (Yes) and only contain UCD lines when they're false (No). There are a few of these in the normalization properties, and it is probably only useful for code that wants to handle things super generically, but that's something.

  4. Properties that default to the codepoint itself, such as NFKC_Casefold. Previously, you just had to know this (it's kind of obvious if you think about it, though), but now a # @missing: 0000..10FFFF; NFKC_CF; <code point> line can be used.

  5. Properties that default to the value of another property. For example, the ScriptExtensions.txt has # @missing: 0000..10FFFF; <script>. This codifies that the default value for the Script_Extensions property is the value that codepoint has for the Script property.
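As a rough illustration of cases 2 through 5, the value field of an @missing line can be classified before it is applied. The sketch below is an assumption-laden example and not the API of any existing crate: `DefaultValue`, `classify`, and `resolve` are hypothetical names, and `script_of` stands in for whatever Script lookup the caller already has. Only the three placeholders mentioned above (`<none>`, `<code point>`, `<script>`) are recognized; anything else is treated as an ordinary property value.

```rust
/// The kinds of default an `@missing` value field can express.
/// (Hypothetical type for illustration only.)
#[derive(Debug, Clone, PartialEq)]
pub enum DefaultValue {
    /// `<none>`: the empty string.
    Empty,
    /// `<code point>`: the code point maps to itself.
    CodePoint,
    /// `<script>`: fall back to the code point's Script value.
    Script,
    /// Anything else: an ordinary property value such as `Not_Reordered`.
    Literal(String),
}

/// Classifies the raw value field of an `@missing` line.
pub fn classify(raw: &str) -> DefaultValue {
    match raw.trim() {
        "<none>" => DefaultValue::Empty,
        "<code point>" => DefaultValue::CodePoint,
        "<script>" => DefaultValue::Script,
        other => DefaultValue::Literal(other.to_string()),
    }
}

/// Resolves a default to a concrete string for one code point.
/// `script_of` is a caller-supplied (hypothetical) Script lookup.
pub fn resolve(
    default: &DefaultValue,
    cp: u32,
    script_of: impl Fn(u32) -> String,
) -> String {
    match default {
        DefaultValue::Empty => String::new(),
        // Rendered as uppercase hex, padded to at least four digits.
        DefaultValue::CodePoint => format!("{:04X}", cp),
        DefaultValue::Script => script_of(cp),
        DefaultValue::Literal(v) => v.clone(),
    }
}
```

Keeping the placeholder as an enum variant, instead of resolving it eagerly, lets a low-level parser hand the decision to its caller.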


There are some annoying things about this:

  1. As a user of a library like ucd-parse (but definitely not for something high level like icu4x), I might want to distinguish between "this is a value explicitly from the file" and "this is the default from @missing".

    This is generally true for several of these, but especially true for a few cases like Script_Extensions, where I may want to be handling Script values entirely separately.

    Obviously a high-level library for Unicode properties would absolutely not expose this distinction, but for something which is explicitly for parsing the UCD files, I suspect including @missing values should be optional (even if it's done by default).

  2. Right now, the @missing property value will often use a different property alias than is used elsewhere in the file.

    For example, a @missing line for EAW uses Neutral, whereas the file uses N. Similarly, Bidi_Class uses Left_To_Right/Right_To_Left in @missing but L and R in the file. Most of these use the "long" form of the property value in the @missing line, and the "short" version in the file itself. I can see the logic here, but it's annoying, as to handle this fully programmatically, you need to read PropertyValueAliases.txt.

  3. It's incomplete and completing it feels like it might be a bit annoying.

    For example, CaseFolding.txt still has "All code points not explicitly listed for Case_Folding have the value C for the status field, and the code point itself for the mapping field" which is not reflected in a @missing line. Given that this is not a single property, it will probably require a special case for the @missing lines.

  4. It's also not clear which of these are "stable" values you can rely on, but I presume the answer is "technically none of them, but in practice it's probably safe to assume something like # @missing: 0000..10FFFF; None on a derived property is unlikely to change".
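Points 1 and 2 above could be handled along the following lines. This is a sketch under stated assumptions, not how ucd-parse works: `loose_key`, `alias_table`, `Origin`, `Assigned`, and `canonical` are hypothetical names. It assumes PropertyValueAliases.txt lines of the form `property ; short ; long` (possibly with further aliases), and it uses a simplified loose match that ignores case, underscores, hyphens, and spaces. Properties whose first value field is numeric, such as Canonical_Combining_Class, would need extra handling.

```rust
use std::collections::HashMap;

/// Simplified loose-matching key: lowercase, with `_`, `-`, and spaces removed.
fn loose_key(s: &str) -> String {
    s.chars()
        .filter(|&c| !matches!(c, '_' | '-' | ' '))
        .flat_map(char::to_lowercase)
        .collect()
}

/// Maps every alias of `property`'s values to the first (short) alias.
pub fn alias_table(aliases_txt: &str, property: &str) -> HashMap<String, String> {
    let mut table = HashMap::new();
    for line in aliases_txt.lines() {
        // Drop trailing comments, then split the `;`-separated fields.
        let data = line.split('#').next().unwrap_or("");
        let fields: Vec<&str> = data.split(';').map(str::trim).collect();
        if fields.len() < 3 || fields[0] != property {
            continue;
        }
        let short = fields[1];
        for alias in &fields[1..] {
            table.insert(loose_key(alias), short.to_string());
        }
    }
    table
}

/// Where a value came from, so callers can tell the two apart.
#[derive(Debug, Clone, PartialEq)]
pub enum Origin {
    /// Listed on a normal data line.
    Explicit,
    /// Supplied by an `@missing` line.
    FromMissing,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Assigned {
    pub value: String,
    pub origin: Origin,
}

/// Normalizes `raw` to its short alias and tags it with its origin.
pub fn canonical(
    table: &HashMap<String, String>,
    raw: &str,
    origin: Origin,
) -> Option<Assigned> {
    table
        .get(&loose_key(raw))
        .map(|v| Assigned { value: v.clone(), origin })
}
```

With this shape, `Neutral` from an @missing line and `N` from a data line both normalize to `N`, while the `origin` field still lets a caller skip defaults when it wants only explicitly listed values.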

That said, overall it's probably a good thing, but the release notes for this definitely ignore many of the reasons why this might be a bit of a pain in the ass to handle.

@Manishearth
Member

TLDR: they make the files more maintainable and easier to read comments for

They are there to set values for unassigned code points, and to document them.

For example, in DerivedBidiClass.txt, we used to have a manually (painfully) maintained header documenting the intent of which block had which Bidi_Class default, and a piece of code that defined those ranges in the tool (which had to be in sync with the comments), supporting multiple versions of Unicode, and the tool was then printing the values for unassigned code points among those for assigned characters.

In several Unicode versions, we poked bc=AL holes in larger bc=R ranges, and had to carefully adjust both the comments and the tool defaults. Not really "hard", but annoyingly tedious, and for the listing by value, the interspersed unassigned code points added to clutter and length.

Now we have a nicer piece of code that defines the ranges (still by Unicode version), and prints them as @missing lines near the top of the file (replacing hand-edited comments), and we no longer need to print values for unassigned code points among assigned ones.

The Unicode Character Database files are not pure listings of data structures for machine communication. They are intended to also serve human maintainers and readers.
Of course there are trade-offs. For some purposes, one might prefer a listing of each single code point and its value, or ranges of code points in code point order, etc.

@Manishearth
Member

Some more background from Asmus, including some explicit steps on how to handle them. They brought up an interesting point that the order of parsing them matters a lot.

These values are usually the default value(s) for a property.

For some properties we've treated the default values like any other value in listing them. This was done for various reasons that emerged out of the process of first developing the property files for these properties.

The preferred mechanism, however, is to not do that, but to list ranges using an @missing statement (where possible).

You may find that we provide a "Derived..." property file with our preferred file format, so that you can build a parser that doesn't have to understand the "quirky" format of some of the original files.

Independent of all of these considerations, the way you actually use an @missing statement is as follows:

  1. Parse the @missing statement and extract the property name, range and value
  2. The listing of range and property value always matches the usage in the remainder of the file, modulo the choice of long/short property value alias (in other words, use loose matching)
  3. Apply the value to that property for that range, overriding any values assigned to that range
  4. Repeat for all @missing statements in order
  5. Then apply all explicitly listed properties, overriding any values already assigned

In other words, each assignment of a property value for a code point/range, whether in an @missing statement, or in a normal data line for the property always sets the value for that code point/range whether or not that means overriding an existing value from an earlier assignment.

The order of the @missing lines is very important. Usually the first one would cover the entire code point range and set a general default, and any later ones will override that for some specific range and set the particular default for that range, and so on. Finally, any characters that are assigned, would have a specific property value that then overrides any default.

Default values can be of different kinds: there's the N/A value for cases where it's not really meaningful to apply a property; for mappings, the lack of a mapping is often expressed as a default using the "identity" mapping (because it makes using such properties more regular). For booleans, the "false" value is often similarly a "does not apply". In all of these cases, not listing such defaults for each code point makes it easier to focus on those code points that have actual values. Therefore, the listing would omit such values even for assigned characters.

However, there are a few properties, where the default is one of the ordinary values of that property. For example, bidi "L", while the most common value and the default for large ranges of unassigned code points, is not at all a "does not apply" kind of value. So here it makes sense to use the @missing mechanism only for unassigned code points.

None of this background detailing how @missing statements are chosen and listed in the file affects the way @missing statements should be interpreted (whether by a human reader or software).

If there is any file where we currently have both an @missing statement and 1,114,112 actual values the @missing statement would obviously have no effect, because all values it assigns would be overridden. If you are aware of such a file, you could use the contact form to write a bug report and then it can be taken care of.
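The five steps quoted in this comment can be sketched end to end as follows. This is a minimal illustration under several assumptions, not the implementation used by Unicode's tools or by any crate named in this thread: `PropertyTable`, `build`, and `parse_range` are hypothetical names, `canon` is a caller-supplied function that performs the long/short alias matching from step 2, and the sketch only understands two-field `# @missing: range; value` lines and `range ; value # comment` data lines.

```rust
pub const CODE_POINTS: usize = 0x110000;
const NO_VALUE: u16 = u16::MAX;

/// Parses `START..END` or a single code point, in hex.
fn parse_range(s: &str) -> Option<(usize, usize)> {
    let (a, b) = s.split_once("..").unwrap_or((s, s));
    let start = usize::from_str_radix(a.trim(), 16).ok()?;
    let end = usize::from_str_radix(b.trim(), 16).ok()?;
    (start <= end && end < CODE_POINTS).then_some((start, end))
}

/// Per-code-point values, stored as indices into an interned name list.
pub struct PropertyTable {
    names: Vec<String>,
    by_cp: Vec<u16>,
}

impl PropertyTable {
    fn new() -> Self {
        PropertyTable { names: Vec::new(), by_cp: vec![NO_VALUE; CODE_POINTS] }
    }

    fn intern(&mut self, value: String) -> u16 {
        match self.names.iter().position(|n| *n == value) {
            Some(i) => i as u16,
            None => {
                self.names.push(value);
                (self.names.len() - 1) as u16
            }
        }
    }

    /// Sets `value` for the whole range, overriding earlier assignments.
    fn assign(&mut self, range: &str, value: String) {
        if let Some((start, end)) = parse_range(range) {
            let id = self.intern(value);
            self.by_cp[start..=end].fill(id);
        }
    }

    pub fn get(&self, cp: u32) -> Option<&str> {
        let id = *self.by_cp.get(cp as usize)?;
        if id == NO_VALUE {
            return None;
        }
        self.names.get(id as usize).map(String::as_str)
    }
}

pub fn build(text: &str, canon: impl Fn(&str) -> String) -> PropertyTable {
    let mut table = PropertyTable::new();
    // Steps 1-4: apply every @missing line, in file order.
    for line in text.lines() {
        if let Some(rest) = line.trim().strip_prefix("# @missing:") {
            if let Some((range, value)) = rest.split_once(';') {
                table.assign(range, canon(value.trim()));
            }
        }
    }
    // Step 5: explicit data lines override any default.
    for line in text.lines() {
        let data = line.split('#').next().unwrap_or("").trim();
        if let Some((range, value)) = data.split_once(';') {
            table.assign(range, canon(value.trim()));
        }
    }
    table
}
```

Running two passes keeps step 5 true even if an @missing line were to appear after a data line in the file. A single pass would only be equivalent when all @missing lines come first.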

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Jul 31, 2023
…earth

Update lexer emoji diagnostics to Unicode 15.0

This replaces the `unic-emoji-char` dep tree (which hasn't been updated for a while) with `unicode-properties` crate which contains Unicode 15.0 data.

Improves diagnostics for added emoji characters in recent years. (See tests).

cc rust-lang#101840

cc ``@Manishearth``
@bors bors closed this as completed in 0f696e5 Nov 24, 2023