Exclude surrogate codepoints #84
spec/index.html
Outdated
<p>
Unicode <a data-cite="I18N-GLOSSARY#dfn-surrogate" class="lint-ignore">surrogate code points</a>,
the range <code class="codepoint">U+D800</code> to <code class="codepoint">U+DFFF</code>
are excluded.
This prevents paired surrogates from being used (see #83). It sounds like a breaking change from Turtle 1.1, which allowed paired surrogates to be escaped.
The change happened earlier when "RDF string" came in.
Surrogate pairs are translated only by UTF-16 decoding, so if a pair is introduced by escape sequences, I don't see how the 1.1 spec allows for that translation to be done; the two surrogates remain raw.
RDF 1.1 Turtle says "Unicode character", which is not a term the glossary defines. There is "character", which has "(2) Synonym for abstract character"
(there may have been a "Unicode character" entry back when the document was written; Unicode does change). A surrogate isn't what Unicode calls an "abstract character" (i.e. after decoding).
https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-2/#G14527
Surrogate
Permanently reserved for UTF-16; restricted interchange
Cannot be assigned to abstract character
So, for me, this can be treated as errata on RDF 1.1 Turtle.
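To make the point about decoding concrete, here is a minimal Python sketch (not taken from any of the specs). It assumes U+1F600 as an arbitrary supplementary code point; the variable names are illustrative only. Two escapes such as `\uD83D\uDE00` produce two separate surrogate code points, and only a UTF-16 decoder turns that pair into a single character.

```python
# Illustration only: U+1F600 is an arbitrary supplementary code point
# whose UTF-16 form is the surrogate pair D83D DE00.
high, low = 0xD83D, 0xDE00

# Taking each escape at face value yields two surrogate code points,
# not one character.
raw = chr(high) + chr(low)
assert len(raw) == 2

# Only a UTF-16 decoder combines the pair into the code point U+1F600.
utf16_bytes = high.to_bytes(2, "big") + low.to_bytes(2, "big")
combined = utf16_bytes.decode("utf-16-be")
assert combined == "\U0001F600"
assert len(combined) == 1
```

Nothing in the escape-processing step itself performs that combination, which is why the pair stays raw unless a UTF-16 decoding step is added.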
Indeed. Thank you for the clarification!
OT: The Wikipedia disambiguation page for "Down the Rabbit Hole" refers ("see also") to the disambiguation page for "Rabbit Hole" ... which in turn refers to the disambiguation page for "Down the Rabbit Hole". I hope someone did that on purpose.
inclusive, excluding the range <code class="codepoint">U+D800</code> to
<code class="codepoint">U+DFFF</code>
(<a data-cite="I18N-GLOSSARY#dfn-surrogate" class="lint-ignore">surrogate code points</a>),
are allowed.
I think this sentence is redundant with the UTF-8 definition, which states "The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF", but it may be a good idea to add it anyway to make things crystal clear.
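As a quick check of the UTF-8 prohibition, a small sketch using Python's strict UTF-8 encoder as a stand-in for a conforming encoder. The helper name `utf8_encodable` is hypothetical and not part of any spec.

```python
def utf8_encodable(code_point: int) -> bool:
    """Return True if Python's strict UTF-8 encoder accepts the code point.

    Hypothetical helper for illustration; surrogates are rejected by the
    strict error handler, which is the default.
    """
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# The boundaries of the surrogate block are rejected,
# and their immediate neighbours are accepted.
assert not utf8_encodable(0xD800)
assert not utf8_encodable(0xDFFF)
assert utf8_encodable(0xD7FF)
assert utf8_encodable(0xE000)
```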
Agreed. At one level "RDF string" is enough.
It would be better written as "the range U+0000 to U+D7FF and U+E000 to U+10FFFF inclusive" because Unicode writes it that way.
A next sentence saying "no surrogate code points" is then an optional extra for clarity.
(PS Change made)
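For what it's worth, the two wordings describe the same set of code points. A minimal Python sketch that checks this exhaustively; both function names are made up for the illustration.

```python
def in_two_ranges(cp: int) -> bool:
    """Wording 1: U+0000 to U+D7FF and U+E000 to U+10FFFF, inclusive."""
    return 0x0000 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def in_full_range_minus_surrogates(cp: int) -> bool:
    """Wording 2: U+0000 to U+10FFFF inclusive, excluding U+D800 to U+DFFF."""
    return 0x0000 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


# Compare the two definitions on every code point, plus one value just
# past the end of the Unicode code space.
assert all(
    in_two_ranges(cp) == in_full_range_minus_surrogates(cp)
    for cp in range(0x110000 + 1)
)
```

So the choice between them is editorial: the two-range form matches how Unicode writes it, and the explicit "no surrogate code points" sentence is purely a reading aid.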
We should probably put an entry in the changes section for this.
Here or in RDF Concepts? Or both? (added to this PR)
If we made grammar or processing changes that affected this, it’s worth noting. Otherwise, if the changes are not normative, it doesn’t need its own entry.
This second commit
The ids for the Not changed: "Encoding considerations:" in the Media Type uses "U+0000 to U+FFFF" etc.
Either is good for me, now that we see that test-38 was not in the implementation reports. Whether the escape sequences could be said to be "unclear, on close inspection" or an "erratum" is more a matter of taste. I have a slight preference for including it, because this has been quietly rumbling for a long time, but it is the editors' decision.
Fixes of punctuation and singular/plural disagreements
…e escape sequences list
Say that a Unicode escape sequence must not generate codepoints for surrogates.
I'm not sure this is the best way, and it is worth getting it right here before including it in other documents (NQ, NT, TriG).
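One way to read that rule, as a rough Python sketch rather than the spec's algorithm: decode `\uXXXX` and `\UXXXXXXXX` escapes and reject any that would yield a surrogate. The function name and error messages are hypothetical, and the sketch deliberately ignores the other escape kinds (for example an escaped backslash before `u`), so it is a simplification.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
# Simplification: does not account for a preceding escaped backslash.
_NUMERIC_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_numeric_escapes(text: str) -> str:
    """Replace numeric escapes with their code points, rejecting surrogates.

    Hypothetical helper for illustration only.
    """
    def replace(match: re.Match) -> str:
        cp = int(match.group(1) or match.group(2), 16)
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"escape yields surrogate code point U+{cp:04X}")
        if cp > 0x10FFFF:
            raise ValueError(f"escape is outside the Unicode range: {cp:#x}")
        return chr(cp)

    return _NUMERIC_ESCAPE.sub(replace, text)
```

With this reading, `decode_numeric_escapes(r"\U0001F600")` returns the single character U+1F600, while `decode_numeric_escapes(r"\uD83D\uDE00")` raises `ValueError` on the first escape instead of silently producing a pair.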
The section on the three kinds of escape sequences uses a bulleted list, where the list item has a table.
Now that there is more material in these list items, the display does not look so good. If this is changed, this too should be sorted out before propagating to other documents.