Exclude surrogate codepoints #84

Open
wants to merge 1 commit into
base: main

Conversation

Contributor

@afs afs commented Feb 8, 2025

Say that Unicode escape sequences must not generate surrogate code points.

I'm not sure this is the best way, and it is worth getting it right here before including it in other documents (NQ, NT, TriG).

The section on the three kinds of escape sequences uses a bulleted list, where each list item has a table.
Now that there is more material in these list items, the display does not look so good. If this is changed, this too should be sorted out before propagating to other documents.


spec/index.html Outdated
<p>
Unicode <a data-cite="I18N-GLOSSARY#dfn-surrogate" class="lint-ignore">surrogate code points</a>,
the range <code class="codepoint">U+D800</code> to <code class="codepoint">U+DFFF</code>
are excluded.

This prevents paired surrogates from being used (see #83). It sounds like a breaking change from Turtle 1.1, which allowed paired surrogates to be escaped.

Contributor Author

@afs afs Feb 8, 2025

The change happened earlier when "RDF string" came in.

Given that surrogate pairs are combined only by UTF-16 decoding, if a pair is introduced by escape sequences, I don't see how the 1.1 spec allows for that translation to be done; so they remain raw.

RDF 1.1 Turtle says "Unicode character", which isn't a term in the Unicode glossary. There is "character", whose definition includes "(2) Synonym for abstract character".

(There may have been "Unicode character" back when the document was written; Unicode does change.) A surrogate isn't what Unicode calls an "abstract character" (i.e. after decoding).

https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-2/#G14527

Surrogate
Permanently reserved for UTF-16; restricted interchange
Cannot be assigned to abstract character

So, for me, this can be treated as errata on RDF 1.1 Turtle.
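
As an illustration of the "they remain raw" point, here is a minimal Python sketch. It is not part of the PR; Python is used only because its strings are sequences of code points, and `pair_via_utf16` is a hypothetical helper name. It shows that two `\u` escapes naming a surrogate pair yield two separate code points, and that only a UTF-16 decoding step combines them.

```python
def pair_via_utf16(text: str) -> str:
    """Combine surrogate pairs by round-tripping through UTF-16 (hypothetical helper)."""
    # "surrogatepass" lets the lone surrogates through the encoder;
    # the UTF-16 decoder is what actually pairs them up.
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


# Two \u escapes naming a surrogate pair decode to two separate code points.
raw = "\ud83d\ude00"
code_points = [f"U+{ord(c):04X}" for c in raw]
print(code_points)  # ['U+D83D', 'U+DE00']

# Only the UTF-16 decoding step turns the pair into a single code point.
paired = pair_via_utf16(raw)
print(f"U+{ord(paired):04X}")  # U+1F600
```

Nothing in an escape-processing step that maps each escape to one code point performs that second step, which is the gap described above.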

Indeed. Thank you for the clarification!

Contributor Author

OT: The Wikipedia disambiguation page for "Down the Rabbit Hole" refers ("see also") to the disambiguation page for "Rabbit Hole" ... which in turn refers to the disambiguation page for "Down the Rabbit Hole". I hope someone did that on purpose.

inclusive, excluding the range <code class="codepoint">U+D800</code> to
<code class="codepoint">U+DFFF</code>
(<a data-cite="I18N-GLOSSARY#dfn-surrogate" class="lint-ignore">surrogate code points</a>),
are allowed.

I think this sentence is redundant with the UTF-8 definition, which states "The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF", but it's maybe a good idea to add it anyway to make things crystal clear.
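
A small Python sketch of that redundancy, assuming a strict UTF-8 codec such as Python's built-in one (the function names are hypothetical): a surrogate code point can neither be encoded to UTF-8 nor decoded from the three-byte form it would have had.

```python
def is_utf8_encodable(text: str) -> bool:
    """Return True if every code point in `text` can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` is well-formed UTF-8 under a strict decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_utf8_encodable("\u00e9"))     # True
print(is_utf8_encodable("\ud800"))     # False: lone surrogate
print(is_valid_utf8(b"\xed\xa0\x80"))  # False: would-be encoding of U+D800
```

So a document that is valid UTF-8 cannot contain a literal surrogate; the explicit sentence only matters for code points produced by escape sequences.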

Contributor Author

@afs afs Feb 8, 2025

Agreed. At one level "RDF string" is enough.

It would be better written as "the range U+0000 to U+D7FF and U+E000 to U+10FFFF inclusive" because Unicode writes it that way.

A next sentence saying "no surrogate code points" is then an optional extra for clarity.

(PS Change made)
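
For implementers, the two-range wording maps directly onto a check in escape processing. The following is a minimal Python sketch under stated assumptions, not the spec's algorithm: the helper names are hypothetical, and it handles a single `\uXXXX` or `\UXXXXXXXX` escape in isolation rather than a full Turtle string.

```python
import re

# Matches exactly one numeric escape: \uXXXX or \UXXXXXXXX (hex digits).
_NUMERIC_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def is_allowed_code_point(cp: int) -> bool:
    """True for U+0000..U+D7FF and U+E000..U+10FFFF, inclusive."""
    return 0x0000 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def decode_numeric_escape(escape: str) -> str:
    """Decode one numeric escape, rejecting surrogates and out-of-range values."""
    match = _NUMERIC_ESCAPE.fullmatch(escape)
    if match is None:
        raise ValueError(f"not a numeric escape: {escape!r}")
    cp = int(match.group(1) or match.group(2), 16)
    if not is_allowed_code_point(cp):
        raise ValueError(f"code point U+{cp:04X} is not allowed")
    return chr(cp)
```

With this sketch, `decode_numeric_escape(r"\u00E9")` returns `"é"`, while `decode_numeric_escape(r"\uD83D")` raises `ValueError`, so a surrogate pair written as two escapes is rejected at the first half.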

Member

@gkellogg gkellogg left a comment

We should probably put an entry in the changes section for this.

Contributor Author

afs commented Feb 8, 2025

We should probably put an entry in the changes section for this.

Here or in RDF Concepts? Or both?

(added to this PR)

Member

gkellogg commented Feb 8, 2025

We should probably put an entry in the changes section for this.

Here or in RDF Concepts? Or both?

If we made grammar or processing changes that affected this, it’s worth noting. Otherwise, if the changes are not normative, it doesn’t need its own entry.

@afs afs marked this pull request as ready for review February 9, 2025 09:01
Contributor Author

afs commented Feb 9, 2025

This second commit

  • Use <dfn> for the three kinds of escape sequence
  • Codepoint range wording changed to "U+0000 to U+D7FF and U+E000 to U+10FFFF" from "whole range except ..."
  • Tables have borders
  • There is a change note entry - it can easily be removed if desired

The ids for the <dfn> elements are not preserved. These seem to be internal links for the tables in grammar processing. One was "#string", and the map from string to bnode used it when it meant a normal string (bnode labels are not limited to the two characters of a \t escape!). The other two were "#reserved" and "#numeric", neither of which is distinctive enough.

Not changed: "Encoding considerations:" in the Media Type uses "U+0000 to U+FFFF" etc.

Contributor Author

afs commented Feb 9, 2025

if we made grammar or processing changes that affected this, it’s worth noting. Otherwise, if the changes are not normative, it doesn’t need its own entry.

Either is good for me, now that we see that test-38 was not in the implementation reports.

Whether the escape sequences could be said to be "unclear, on close inspection" or an "erratum" is more a matter of taste.

Slight, mild preference to include it, because this has been quietly rumbling for a long time, but it is the editors' decision.

Member

@TallTed TallTed left a comment

Fixes of punctuation and singular/plural disagreements

@afs afs requested a review from TallTed February 18, 2025 21:51
5 participants