Exclude surrogate codepoints #84

afs · 2025-02-08T10:04:41Z

Say that Unicode escape sequences must not generate codepoints for surrogates.

I'm not sure this is the best way, and it is worth getting it right here before including it in other documents (NQ, NT, TriG).

The section on the three kinds of escape sequences uses a bulleted list, where each list item has a table.
Now that there is more material in these list items, the display does not look so good. If this is changed, this too should be sorted out before propagating to other documents.

Tpt · 2025-02-08T10:17:49Z

spec/index.html

+        <p>
+          Unicode <a data-cite="I18N-GLOSSARY#dfn-surrogate" class="lint-ignore">surrogate code points</a>,
+          the range <code class="codepoint">U+D800</code> to <code class="codepoint">U+DFFF</code>
+          are excluded.


This prevents paired surrogates from being used (see #83). It sounds like a breaking change from Turtle 1.1, which allowed paired surrogates to be escaped.
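As an illustration of the behaviour the added text asks for, here is a minimal Python sketch. The helper name `decode_uchar` is hypothetical and not taken from the spec or any implementation; it decodes a single numeric escape and rejects anything in the surrogate range.

```python
def decode_uchar(escape: str) -> str:
    """Decode one \\uXXXX or \\UXXXXXXXX escape, rejecting surrogate code points."""
    is_short = escape.startswith("\\u") and len(escape) == 6
    is_long = escape.startswith("\\U") and len(escape) == 10
    if not (is_short or is_long):
        raise ValueError(f"not a numeric escape: {escape!r}")
    cp = int(escape[2:], 16)
    # The rule under discussion: U+D800 to U+DFFF are excluded.
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"surrogate code point U+{cp:04X} is excluded")
    return chr(cp)


print(decode_uchar("\\U0001F600") == "\U0001F600")  # supplementary code point written directly
try:
    decode_uchar("\\uD83D")  # first half of a surrogate pair
except ValueError as err:
    print("rejected:", err)
```

Under this reading, a supplementary code point has to be written with the eight-digit form; each half of a pair written as two four-digit escapes is rejected on its own.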

The change happened earlier when "RDF string" came in.

Given that surrogate pairs are translated by UTF-16 only, if a pair is introduced by escape sequences, I don't see how the 1.1 spec allows for a translation to be done; so they remain raw.
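To make the point concrete, here is a small Python sketch. Both function names are hypothetical, and the one-escape-to-one-code-point reading of 1.1 is an assumption. Each numeric escape is mapped to a code point independently, so a pair of surrogate escapes stays as two raw surrogate code points; only a separate UTF-16 step would combine them.

```python
import re
from typing import List

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_NUMERIC_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_numeric_escapes(text: str) -> List[int]:
    """Return the code points of `text`, mapping each numeric escape to one code point."""
    code_points = []
    pos = 0
    for m in _NUMERIC_ESCAPE.finditer(text):
        code_points.extend(ord(c) for c in text[pos:m.start()])
        code_points.append(int(m.group(1) or m.group(2), 16))
        pos = m.end()
    code_points.extend(ord(c) for c in text[pos:])
    return code_points


def combine_utf16_pair(high: int, low: int) -> int:
    """UTF-16 arithmetic that turns a surrogate pair into a supplementary code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


raw = decode_numeric_escapes(r"\uD83D\uDE00")
# Two separate surrogate code points; nothing has paired them up.
print([f"U+{cp:04X}" for cp in raw])
# Only the UTF-16 step yields the supplementary code point.
print(f"U+{combine_utf16_pair(*raw):04X}")
```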

RDF 1.1 Turtle says "Unicode character", which isn't a defined term in the glossary. There is "character", which has "(2) Synonym for abstract character".

(There may have been "Unicode character" back when the document was written; Unicode does change.) A surrogate isn't what Unicode calls an "abstract character" (i.e. after decoding).

https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-2/#G14527

Surrogate
Permanently reserved for UTF-16; restricted interchange
Cannot be assigned to abstract character

So, for me, this can be treated as errata on RDF 1.1 Turtle.

Indeed. Thank you for the clarification!

OT: The Wikipedia disambiguation page for "Down the Rabbit Hole" refers ("see also") to the disambiguation page for "Rabbit Hole" ... which in turn refers to the disambiguation page for "Down the Rabbit Hole". I hope someone did that on purpose.

Tpt · 2025-02-08T10:19:01Z

spec/index.html

+    inclusive, excluding the range <code class="codepoint">U+D800</code> to 
+    <code class="codepoint">U+DFFF</code> 
+    (<a data-cite="I18N-GLOSSARY#dfn-surrogate" class="lint-ignore">surrogate code points</a>),
+    are allowed.


I think this sentence is redundant with the UTF-8 definition, which states "The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF", but it may be a good idea to add it anyway to make things crystal clear.

Agreed. At one level "RDF string" is enough.

It would be better written as "the range U+0000 to U+D7FF and U+E000 to U+10FFFF inclusive" because Unicode writes it that way.

A next sentence saying "no surrogate code points" is then an optional extra for clarity.

(PS Change made)
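The two-range wording maps directly onto a check. A minimal Python sketch, with hypothetical helper names, that also shows the redundancy noted above: Python's strict UTF-8 encoder refuses exactly the code points that the two-range wording leaves out.

```python
def is_allowed_code_point(cp: int) -> bool:
    """True for U+0000 to U+D7FF and U+E000 to U+10FFFF, inclusive."""
    return 0x0000 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def utf8_accepts(cp: int) -> bool:
    """True if Python's strict UTF-8 encoder will encode the code point."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Boundary values on either side of the surrogate block.
for cp in (0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF):
    print(f"U+{cp:04X}", is_allowed_code_point(cp), utf8_accepts(cp))
```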

gkellogg

We should probably put an entry in the changes section for this.

afs · 2025-02-08T15:21:21Z

> We should probably put an entry in the changes section for this.

Here or in RDF Concepts? Or both?

(added to this PR)

gkellogg · 2025-02-08T16:11:26Z

> > We should probably put an entry in the changes section for this.
>
> Here or in RDF Concepts? Or both?

If we made grammar or processing changes that affected this, it's worth noting. Otherwise, if the changes are not normative, it doesn't need its own entry.

afs · 2025-02-09T09:21:03Z

This second commit:

Use <dfn> for the three kinds of escape sequence
Codepoint range wording changed to "U+0000 to U+D7FF and U+E000 to U+10FFFF" from "whole range except ..."
Tables have borders
There is a change note entry - it can easily be removed if desired

The ids for the <dfn> are not preserved. These seem to be internal links for the tables in grammar processing. One was "#string", and the string-to-bnode map used it when it meant a normal string (bnode labels are not limited to the two chars of a \t escape!). The other two were "#reserved" and "#numeric", neither of which is distinctive enough.

Not changed: "Encoding considerations:" in the Media Type uses "U+0000 to U+FFFF" etc.

afs · 2025-02-09T09:27:39Z

> If we made grammar or processing changes that affected this, it's worth noting. Otherwise, if the changes are not normative, it doesn't need its own entry.

Either is good for me, now that we see that test-38 was not in the implementation reports.

Whether the escape sequences could be said to be "unclear, on close inspection" or an "erratum" is more a matter of taste.

Slight, mild preference to include it because this has been quietly rumbling for a long time, but it is the editors' decision.

TallTed

Fixes of punctuation and singular/plural disagreements

afs requested review from gkellogg and domel February 8, 2025 10:04

afs mentioned this pull request Feb 8, 2025

Escape sequences that encode a supplemental code point using a Unicode surrogate pair #83

Open

Tpt reviewed Feb 8, 2025

Tpt mentioned this pull request Feb 8, 2025

SPARQL String. Unicode escapes exclude surrogates. w3c/sparql-query#190

Merged

afs force-pushed the no-surrogates branch from d4adaea to 03e2e07 Compare February 8, 2025 12:25

gkellogg approved these changes Feb 8, 2025

afs force-pushed the no-surrogates branch from 03e2e07 to 1548469 Compare February 8, 2025 15:21

afs marked this pull request as ready for review February 9, 2025 09:01

gkellogg reviewed Feb 9, 2025

afs force-pushed the no-surrogates branch from 93fd68a to aa62999 Compare February 9, 2025 16:49

domel approved these changes Feb 9, 2025

TallTed suggested changes Feb 9, 2025

afs force-pushed the no-surrogates branch from 123bffe to 74031dd Compare February 12, 2025 18:31

Exclude surrogate codepoints. <dfn> for escape sequences. Reformat the escape sequences list

1c6deff

afs force-pushed the no-surrogates branch from 74031dd to 1c6deff Compare February 15, 2025 15:33

afs requested a review from TallTed February 18, 2025 21:51
