Allow surrogates in content, issue unicode-org#895
mihnita committed Oct 20, 2024
1 parent 4ea56e4 commit 461555f
Showing 3 changed files with 15 additions and 11 deletions.
12 changes: 8 additions & 4 deletions spec/appendices.md
@@ -14,17 +14,21 @@ host environments, their serializations and resource formats,
that might be sufficient to prevent most problems.
However, MessageFormat itself does not supply such a restriction.

-MessageFormat _messages_ permit nearly all Unicode code points,
-with the exception of surrogates,
+MessageFormat _messages_ permit nearly all Unicode code points
to appear in _literals_, including the text portions of a _pattern_.
This means that it can be possible for a _message_ to contain invisible characters
-(such as bidirectional controls,
-ASCII control characters in the range U+0000 to U+001F,
+(such as bidirectional controls, ASCII control characters in the range U+0000 to U+001F,
or characters that might be interpreted as escapes or syntax in the host format)
that abnormally affect the display of the _message_
when viewed as source code, or in resource formats or translation tools,
but do not generate errors from MessageFormat parsers or processing APIs.

+The localizable elements of a message (text and string literals) allow the presence of
+unpaired surrogates (U+D800 to U+DFFF). This is for compatibility with existing formats
+that are agnostic about them. \
+But their presence of unpaired surrogates is likely an indication of mistakes or bad tooling.
+Their use is not recommended, and linting (if present) can be used to prevent them.
+
Bidirectional text containing right-to-left characters (such as used for Arabic or Hebrew)
also poses a potential source of confusion for users.
Since MessageFormat 2.0's syntax makes use of
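
The added appendix text notes that linting, where present, can be used to prevent unpaired surrogates. A minimal sketch of such a check follows, in Python. It assumes the message has already been decoded into a `str` (a sequence of code points), so any code point in U+D800 through U+DFFF is a surrogate that is not part of a well-formed pair. The name `find_surrogates` is hypothetical and not part of any MessageFormat implementation.

```python
def find_surrogates(message: str) -> list[tuple[int, str]]:
    """Return (index, "U+XXXX") for every surrogate code point in message.

    Assumption: message is a decoded Python str, so a surrogate code point
    found here was not combined into a supplementary character.
    """
    hits = []
    for index, ch in enumerate(message):
        # Surrogate code points occupy U+D800 through U+DFFF inclusive.
        if 0xD800 <= ord(ch) <= 0xDFFF:
            hits.append((index, f"U+{ord(ch):04X}"))
    return hits
```

A linter could report each hit as a warning rather than an error, since the change makes such messages syntactically valid while discouraging them.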
3 changes: 1 addition & 2 deletions spec/message.abnf
@@ -76,8 +76,7 @@ content-char = %x01-08 ; omit NULL (%x00), HTAB (%x09) and LF (%x0A)
/ %x41-5B ; omit \ (%x5C)
/ %x5D-7A ; omit { | } (%x7B-7D)
/ %x7E-2FFF ; omit IDEOGRAPHIC SPACE (%x3000)
-/ %x3001-D7FF ; omit surrogates
-/ %xE000-10FFFF
+/ %x3001-10FFFF ; allowing surrogates is intentional

; Character escapes
escaped-char = backslash ( backslash / "{" / "|" / "}" )
11 changes: 6 additions & 5 deletions spec/syntax.md
@@ -60,7 +60,8 @@ The syntax specification takes into account the following design restrictions:
control characters such as U+0000 NULL and U+0009 TAB, permanently reserved noncharacters
(U+FDD0 through U+FDEF and U+<i>n</i>FFFE and U+<i>n</i>FFFF where <i>n</i> is 0x0 through 0x10),
private-use code points (U+E000 through U+F8FF, U+F0000 through U+FFFFD, and
-U+100000 through U+10FFFD), unassigned code points, and other potentially confusing content.
+U+100000 through U+10FFFD), unassigned code points, unpaired surrogates in messages and literals
+only (U+D800 through U+DFFF), and other potentially confusing content.

## Messages and their Syntax

@@ -274,8 +275,9 @@ A _quoted pattern_ MAY be empty.
### Text
**_<dfn>text</dfn>_** is the translateable content of a _pattern_.
-Any Unicode code point is allowed, except for U+0000 NULL
-and the surrogate code points U+D800 through U+DFFF inclusive.
+Any Unicode code point is allowed, except for U+0000 NULL.
+Unpaired surrogates code points (U+D800 through U+DFFF inclusive) are allowed
+in localizable elements, but using them is likely a mistake and not recommended.
The characters U+005C REVERSE SOLIDUS `\`,
U+007B LEFT CURLY BRACKET `{`, and U+007D RIGHT CURLY BRACKET `}`
MUST be escaped as `\\`, `\{`, and `\}` respectively.
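
The escaping rule for _text_ quoted in this hunk can be illustrated with a short Python sketch. It covers only what the excerpt states: U+0000 NULL is not allowed, and `\`, `{`, and `}` are written as `\\`, `\{`, and `\}`. The helper name `escape_text` is hypothetical, and a full serializer would need to handle parts of the syntax that this excerpt does not show.

```python
def escape_text(text: str) -> str:
    """Escape the three characters that text must not contain unescaped."""
    if "\x00" in text:
        # The spec excerpt excludes U+0000 NULL from text.
        raise ValueError("U+0000 NULL is not allowed in text")
    # Prefix backslash, left brace, and right brace with a backslash.
    return "".join("\\" + ch if ch in "\\{}" else ch for ch in text)
```

Surrogate code points pass through this sketch unchanged, which matches the change made by this commit.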
@@ -691,8 +693,7 @@ A _literal_ can appear
as a _key_ value,
as the _operand_ of a _literal-expression_,
or in the value of an _option_.
-A _literal_ MAY include any Unicode code point
-except for U+0000 NULL or the surrogate code points U+D800 through U+DFFF.
+A _literal_ MAY include any Unicode code point except for U+0000 NULL.

All code points are preserved.
