disallow more code points and outright ban certain ones from KDL documents altogether (#353)

Fixes: #250
zkat authored Dec 13, 2023
1 parent ea3ca8c commit ba11ffc
Showing 2 changed files with 36 additions and 7 deletions.
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -16,6 +16,13 @@
characters).
* `,`, `<`, and `>` are now legal identifier characters. They were previously
reserved for KQL but this is no longer necessary.
* Code points under `0x20`, code points above `0x10FFFF`, Delete control
character (`0x7F`), and the [unicode "direction control"
characters](https://www.w3.org/International/questions/qa-bidi-unicode-controls)
are now completely banned from appearing literally in KDL documents. They
can now only be represented in regular strings, and there's no facilities to
represent them in raw strings. This should be considered a security
improvement.

### KQL

36 changes: 29 additions & 7 deletions SPEC.md
@@ -94,7 +94,7 @@ foo 1 key="val" 3 {

A bare Identifier is composed of any Unicode codepoint other than [non-initial
characters](#non-initial-characters), followed by any number of Unicode
codepoints other than [non-identifier characters](#non-identifier-characters),
code points other than [non-identifier characters](#non-identifier-characters),
so long as this doesn't produce something confusable for a [Number](#number),
[Boolean](#boolean), or [Null](#null). For example, both a [Number](#number)
and an Identifier can start with `-`, but when an Identifier starts with `-`
@@ -122,9 +122,9 @@ of having an identifier look like a negative number.
The following characters cannot be used anywhere in a bare
[Identifier](#identifier):

* Any codepoint with hexadecimal value `0x20` or below.
* Any codepoint with hexadecimal value higher than `0x10FFFF`.
* Any of `\/(){};[]="`
* Any [disallowed literal code points](#disallowed-literal-code-points) in KDL
documents.

### Line Continuation

@@ -137,6 +137,7 @@ characters and an optional single-line comment. It must be terminated by a
Following a line continuation, processing of a Node can continue as usual.

#### Example

```kdl
my-node 1 2 \ // comments are ok after \
3 4 // This is the actual end of the Node.
@@ -309,6 +310,10 @@ String Value can encompass multiple lines without behaving like a Newline for

Strings _MUST_ be represented as UTF-8 values.

Strings _MUST NOT_ include the code points for [disallowed literal
code points](#disallowed-literal-code-points) directly. If needed, they can be
specified with their corresponding `\u{}` escape.
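
As an illustration only (this is not part of the commit or of any KDL library), the escaping rule above could be sketched in Python as follows. The function names `is_disallowed_literal` and `escape_string_body` are hypothetical. The sketch assumes the CHANGELOG's "under `0x20`" reading for control characters, and it leaves out the "above `0x10FFFF`" case because a Python `str` cannot hold such a value.

```python
def is_disallowed_literal(code_point: int) -> bool:
    """Hypothetical helper: True for code points the spec bans as literals.

    Assumption: control characters are those below 0x20 (CHANGELOG wording).
    """
    return (
        code_point < 0x20
        or code_point == 0x7F                # Delete
        or 0x202A <= code_point <= 0x202E    # direction control characters
        or 0x2066 <= code_point <= 0x2069    # direction control characters
    )


def escape_string_body(text: str) -> str:
    """Return text as the body of an escaped string (without the outer quotes)."""
    out = []
    for ch in text:
        if ch in ('\\', '"'):
            # Backslash and double quote always need a backslash escape.
            out.append('\\' + ch)
        elif is_disallowed_literal(ord(ch)):
            # Banned literals are rewritten as \u{...} with 1 to 6 hex digits.
            out.append('\\u{%x}' % ord(ch))
        else:
            out.append(ch)
    return ''.join(out)
```

Under these assumptions a tab or newline is also emitted as a `\u{}` escape rather than as `\t` or `\n`, which is conservative but still matches the `escape` production in the grammar.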

#### Escapes

In addition to literal code points, a number of "escapes" are supported.
@@ -368,6 +373,11 @@ closed by a `"` followed by a _matching_ number of `#` characters. This means
that the string sequence `"` or `"#` and such must not match the closing `"`
with the same or more `#` characters as the opening `r`.

Like Strings, Raw Strings _MUST NOT_ include any of the [disallowed literal
code-points](#disallowed-literal-code-points) as code points in their body.
Unlike with Strings, these cannot simply be escaped, and are thus
unrepresentable when using Raw Strings.

#### Example

```kdl
@@ -470,6 +480,16 @@ lines](https://www.unicode.org/versions/Unicode13.0.0/ch05.pdf):

Note that for the purpose of new lines, CRLF is considered _a single newline_.

### Disallowed Literal Code Points

The following code points may not appear literally anywhere in the document.
They may be represented in Strings (but not Raw Strings) using `\u{}`.

* Any codepoint with hexadecimal value `0x20` or below (various control characters).
* `0x7F` (the Delete control character).
* Any codepoint with hexadecimal value higher than `0x10FFFF`.
* `0x2066-2069` and `0x202A-202E`, the [unicode "direction control" characters](https://www.w3.org/International/questions/qa-bidi-unicode-controls)
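
The list above can be read as a small range table. The following Python sketch is illustrative only and is not taken from the commit; `DISALLOWED_RANGES`, `MAX_CODE_POINT`, and `is_disallowed_literal` are hypothetical names. It assumes the CHANGELOG's "under `0x20`" wording for the first entry, whereas the list above says "`0x20` or below", which would also cover the space character. It applies the ranges literally and does not exempt tab or newline characters.

```python
# Illustrative sketch only; not part of the KDL spec or of any KDL library.
# Assumption: the first range stops at 0x1F (CHANGELOG wording, "under 0x20").
DISALLOWED_RANGES = [
    (0x00, 0x1F),      # control characters below 0x20 (see assumption above)
    (0x7F, 0x7F),      # Delete control character
    (0x202A, 0x202E),  # direction control characters
    (0x2066, 0x2069),  # direction control characters
]
MAX_CODE_POINT = 0x10FFFF


def is_disallowed_literal(code_point: int) -> bool:
    """Return True if code_point may not appear literally in a document."""
    if code_point > MAX_CODE_POINT:
        return True
    return any(low <= code_point <= high for low, high in DISALLOWED_RANGES)
```

The function takes an integer rather than a character so that values above `0x10FFFF` can be checked at all, for example when validating the number inside a `\u{}` escape or a raw decoder output.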

## Full Grammar

This is the full official grammar for KDL and should be considered
@@ -498,15 +518,15 @@ bare-identifier := (unambiguous-ident | numberish-ident | stringish-ident) - key
unambiguous-ident := (identifier-char - digit - sign - "r") identifier-char*
numberish-ident := sign ((identifier-char - digit) identifier-char*)?
stringish-ident := "r" ((identifier-char - "#") identifier-char*)?
identifier-char := unicode - line-space - [\\/(){};\[\]="]
identifier-char := unicode - line-space - [\\/(){};\[\]="] - disallowed-literal-code-points
keyword := boolean | 'null'
prop := identifier '=' value
prop := identifier '=' valuel
value := type? (string | number | keyword)
type := '(' identifier ')'
string := raw-string | escaped-string
escaped-string := '"' character* '"'
character := '\' escape | [^\"]
escaped-string := '"' string-character* '"'
string-character := '\' escape | [^\"] - disallowed-literal-code-points
escape := ["\\bfnrt] | 'u{' hex-digit{1, 6} '}' | (unicode-space | newline)+
hex-digit := [0-9a-fA-F]
@@ -536,6 +556,8 @@ ws := bom | unicode-space | multi-line-comment
bom := '\u{FEFF}'
disallowed-literal-code-points := See Table (Disallowed Literal Code Points)
unicode-space := See Table (All White_Space unicode characters which are not `newline`)
single-line-comment := '//' ^newline* (newline | eof)