The spec says "code point" where it likely means "unicode scalar value" #207

CAD97 · 2021-09-27T05:32:20Z

Code Point

(1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding.) Not all code points are assigned to encoded characters. See code point type. (2) A value, or position, for a character, in any coded character set.

Unicode Scalar Value

Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive. (See definition D76 in Section 3.9, Unicode Encoding Forms.)

The first requirement of the spec is that "KDL documents should be UTF-8". Similarly, strings "MUST be represented as UTF-8 values." Note that a well-formed UTF-8 stream MUST NOT contain surrogate code points; it encodes a sequence of Unicode scalar values (USVs).
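To make the distinction concrete, here is a minimal Python sketch of the two definitions quoted above, plus a check that a strict UTF-8 encoder accepts exactly the scalar values. The function names are made up for illustration and are not taken from the spec or from any KDL implementation.

```python
MAX_CODE_POINT = 0x10FFFF
SURROGATE_MIN = 0xD800  # first high-surrogate code point
SURROGATE_MAX = 0xDFFF  # last low-surrogate code point


def is_code_point(value: int) -> bool:
    """Any integer in the Unicode codespace, 0 to 10FFFF hex."""
    return 0 <= value <= MAX_CODE_POINT


def is_scalar_value(value: int) -> bool:
    """Any code point except the surrogate range D800 to DFFF hex."""
    return is_code_point(value) and not (SURROGATE_MIN <= value <= SURROGATE_MAX)


def encodes_as_utf8(value: int) -> bool:
    """True if Python's strict UTF-8 codec can encode this code point."""
    if not is_code_point(value):
        return False
    try:
        chr(value).encode("utf-8")
    except UnicodeEncodeError:
        # The strict codec refuses lone surrogates.
        return False
    return True
```

Under this sketch, `is_code_point(0xD800)` holds but `is_scalar_value(0xD800)` and `encodes_as_utf8(0xD800)` do not, which is the gap between the two terms.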

However, the formal grammar just says unicode (presumably, either [\0-\u{10FFFF}] or [\0-\u{D7FF}\u{E000}-\u{10FFFF}] — unclear (#191, #192)), and the prose spec allows "literal code points" and escaped "Code point described by hex characters". This allows the presence of surrogates, at a minimum as escaped code points, even if their literal inclusion is precluded by the higher-level requirement that the document be well-formed UTF-8.
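As a rough illustration of the problem, the Python sketch below decodes escapes by taking "code point" literally. The `\u{...}` syntax with 1 to 6 hex digits is an assumption about the escape form, and `naive_unescape` is a hypothetical helper, not code from any KDL parser.

```python
import re

# Assumed escape form: backslash, "u", then 1 to 6 hex digits in braces.
ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]{1,6})\}")


def naive_unescape(text: str) -> str:
    # Replace each escape with the code point it names.
    # Deliberately performs no surrogate check, mirroring the "code point" wording.
    def substitute(match: re.Match) -> str:
        value = int(match.group(1), 16)
        if value > 0x10FFFF:
            raise ValueError(f"escape out of range: {match.group(0)}")
        return chr(value)

    return ESCAPE.sub(substitute, text)
```

With this reading, an escaped `D800` is accepted and produces a one-element string holding a lone surrogate, but calling `.encode("utf-8")` on that result raises `UnicodeEncodeError`. The value is a valid code point yet cannot be represented as the UTF-8 value the spec requires for strings.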

At least one implementation documents this as a potential spec non-compliance. Given the requirement of UTF-8, I expect that this is just a terminology oversight, and the spec should say Unicode Scalar Value (or non-surrogate code point) in the two locations where it currently just says "code point".


tabatkins · 2021-10-05T22:56:53Z

The formal grammar referring to unicode as the set of all codepoints is fine, since valid UTF-8 documents can't encode the surrogate code points in the first place.

It's not technically necessary to restrict it from the unicode escape sequence either, tho it would complicate implementations by requiring them to store the values in something other than a UTF-8 string, and always output them escaped. So, I do recommend disallowing those code points, either by removing them from the escape's valid values, or allowing them but having those codepoints transform into U+FFFD REPLACEMENT CHARACTER.

The latter is, for example, what CSS does, but CSS has a requirement that all possible byte streams decode into something, while KDL is happy to reject entire documents that have a parse error. I don't have a strong opinion either way; it's generally an error to write such escapes in the first place.
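The two options can be sketched side by side in Python. This is only an illustration under assumptions: `decode_escape` and its `on_surrogate` parameter are hypothetical names, and the input is taken to be the hex digits from inside an escape that has already been matched.

```python
REPLACEMENT_CHARACTER = "\uFFFD"


def decode_escape(hex_digits: str, on_surrogate: str = "reject") -> str:
    """Turn the hex digits of an escape into a one-character string.

    on_surrogate="reject": treat a surrogate escape as a parse error.
    on_surrogate="replace": map it to U+FFFD, the behavior described for CSS.
    """
    value = int(hex_digits, 16)
    if value > 0x10FFFF:
        raise ValueError(f"escape out of range: {hex_digits}")
    if 0xD800 <= value <= 0xDFFF:
        if on_surrogate == "replace":
            return REPLACEMENT_CHARACTER
        raise ValueError(f"surrogate code point in escape: {hex_digits}")
    return chr(value)
```

Either way, every string the function returns is a Unicode scalar value, so the result can always be stored in and written back out as UTF-8.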

Fixes: #207

zkat · 2023-12-13T06:11:13Z

This change is now included in the kdl-v2 branch.

This was referenced Sep 27, 2021

Fix and simplify identifier character grammar #192

Closed

Document divergences from the spec Lucretiel/kaydle#8

Open

zkat added the documentation, enhancement, and breaking labels Nov 28, 2023

zkat added a commit that referenced this issue Dec 13, 2023

Constrain code points to unicode scalar values

5a7b339

Fixes: #207

zkat closed this as completed Dec 13, 2023
