Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The spec says "code point" where it likely means "unicode scalar value" #207

Closed
CAD97 opened this issue Sep 27, 2021 · 2 comments
Closed
Labels
breaking This can only be done for the next major version of KDL documentation Improvements or additions to documentation enhancement New feature or request

Comments

@CAD97
Copy link
Contributor

CAD97 commented Sep 27, 2021

From the Unicode Glossary:

Code Point
(1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF16. (See definition D10 in Section 3.4, Characters and Encoding.) Not all code points are assigned to encoded characters. See code point type. (2) A value, or position, for a character, in any coded character set.
Unicode Scalar Value
Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF16 and E00016 to 10FFFF16 inclusive. (See definition D76 in Section 3.9, Unicode Encoding Forms.)

The first requirement of the spec is that "KDL documents should be UTF-8". Similarly, strings "MUST be represented as UTF-8 values." Note that a well-formed UTF-8 stream MUST NOT contain surrogate code points; it encodes a sequence of USV.

However, the formal grammar just says unicode (presumably, either [\0-\u{10FFFF}] or [\0-\u{D7FF}\u{E000}-\u{10FFFF}] — unclear (#191, #192)), and the prose spec allows "literal code points" and escaped "Code point described by hex characters". This allows the presence of surrogates, at a minimum as escaped code points, even if their literal inclusion is precluded by the higher-level requirement that the document be well-formed UTF-8.

At least one implementation documents this as a potential spec non-compliance. Given the requirement of UTF-8, I expect that this is just a terminology oversight, and the spec should say Unicode Scalar Value (or non-surrogate code point) in the two locations where it currently just says "code point".

@tabatkins
Copy link
Contributor

tabatkins commented Oct 5, 2021

The formal grammar referring to unicode as the set of all codepoints is fine, since valid UTF-8 documents can't encode the surrogate code points in the first place.

It's not technically necessary to restrict it from the unicode escape sequence either, tho it would complicated implementations by requiring them to store the values in something other than a UTF-8 string, and always output them escaped. So, I do recommend disallowing those code points, either by removing them from the escape's valid values, or allowing them but having those codepoints transform into U+FFFD REPLACEMENT CHARACTER.

The latter is, for example, what CSS does, but CSS has a requirement that all possible byte streams decode into something, while KDL is happy to reject entire documents that have a parse error. I don't have a strong opinion either way; it's generally an error to write such escapes in the first place.

@zkat zkat added documentation Improvements or additions to documentation enhancement New feature or request breaking This can only be done for the next major version of KDL labels Nov 28, 2023
zkat added a commit that referenced this issue Dec 13, 2023
@zkat
Copy link
Member

zkat commented Dec 13, 2023

This change is now included in the kdl-v2 branch.

@zkat zkat closed this as completed Dec 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking This can only be done for the next major version of KDL documentation Improvements or additions to documentation enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants