Document divergences from the spec #8

Lucretiel · 2021-09-21T19:11:51Z

There are a few cases where we're making a conscious choice to diverge from the KDL spec. These should be documented near the top. Currently this includes:

Many entities in KDL are defined in terms of code points (for instance, KDL identifiers are made up of "any code point except for ..."). Rust strings and char are sequences of Unicode Scalar Values, rather than Code Points. The set of Scalar Values is the set of Code Points minus the high and low surrogates (U+D800 through U+DFFF). In practice we don't expect this to cause any issues.
KDL calls for duplicate property keys to be last-key-wins, with the earlier duplicates ignored. We will instead use the ordinary serde map handling for these cases (i.e., next_key_seed will always return the next key, without any consideration for duplicates). We prefer the flexibility this offers, since Deserialize types have the opportunity to define their own behavior when receiving duplicate keys. HashMap, for instance, uses the last-key-wins strategy, while structs with derive(Deserialize) will fail with an error on a duplicate key.
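
To make the two duplicate-key strategies concrete, here is a minimal std-only sketch. It does not use serde or kaydle; `collect_last_wins` and `collect_strict` are hypothetical names that only mimic what a HashMap-like receiver and a strict, struct-like receiver would each do when handed every key in document order.

```rust
use std::collections::HashMap;

/// Last-key-wins: a later duplicate silently overwrites the earlier value,
/// which is simply what repeated `HashMap::insert` calls do.
fn collect_last_wins<'a>(props: &[(&'a str, i64)]) -> HashMap<&'a str, i64> {
    let mut map = HashMap::new();
    for &(key, value) in props {
        map.insert(key, value);
    }
    map
}

/// Strict: the first repeated key is reported as an error instead of
/// being overwritten (hypothetical stand-in for a strict receiver).
fn collect_strict<'a>(props: &[(&'a str, i64)]) -> Result<HashMap<&'a str, i64>, String> {
    let mut map = HashMap::new();
    for &(key, value) in props {
        // `insert` returns the previous value if the key was already present.
        if map.insert(key, value).is_some() {
            return Err(format!("duplicate key `{}`", key));
        }
    }
    Ok(map)
}
```

Both functions see the same unfiltered sequence of pairs; only the receiving side decides what a duplicate means, which is the flexibility described above.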


tbmreza · 2021-09-23T04:56:39Z

What is the main selling point of the kaydle implementation? (Or does it need one?)

Lucretiel · 2021-09-23T20:35:10Z

serde is definitely the main selling point.

CAD97 · 2021-09-27T05:42:15Z

> Many entities in KDL are defined in terms of code points ... Rust strings and char are sequences of Unicode Scalar Values

See also kdl-org/kdl#207

I argue that while the spec refers to "code points," the top-level requirement for the document to be UTF-8 encoded eliminates the possibility for surrogates to show up, as well-formed UTF-8 is an encoded sequence of USV and MUST NOT include surrogates, unpaired nor paired. I think the only location surrogate code points may actually show up per the spec is in \u{...} escapes, and the requirement that "Strings MUST be represented as UTF-8 values" may also prevent codepoints which are not USV in that location as well.

Or IOW, I think this one may formally be a non-issue.
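
A quick way to check this from the Rust side, using only the standard library: `char::from_u32` rejects surrogate code points, and `std::str::from_utf8` rejects byte sequences that would encode them. The helper names below are made up for illustration.

```rust
/// True if `cp` is a Unicode scalar value: at most U+10FFFF and outside
/// the surrogate range U+D800..=U+DFFF. `char` can hold nothing else.
fn is_scalar_value(cp: u32) -> bool {
    char::from_u32(cp).is_some()
}

/// True if `bytes` is well-formed UTF-8 as validated by the standard library.
fn is_well_formed_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```

For example, `is_scalar_value(0xD800)` is false, and the three bytes `ED A0 80` (what U+D800 would look like if encoded with the UTF-8 bit layout) fail `is_well_formed_utf8`, while `ED 9F BF` (U+D7FF, the last code point before the surrogate range) passes.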

Lucretiel · 2021-09-29T06:18:00Z

> well-formed UTF-8 is an encoded sequence of USV and MUST NOT include surrogates, unpaired nor paired

Oh no kidding? This would actually be news to me, that's interesting.

Lucretiel · 2021-09-29T06:46:14Z

Thinking more about the duplicate property keys thing. While I like the flexibility offered by leaning into the serde model, I'm somewhat unhappy that this could cause valid KDL documents that intentionally contain duplicate keys to be rejected (for instance, a configuration dumping tool could deliberately make use of the last-key-wins behavior). I've had an idea for how to implement this in the parser (by adding a lookahead to NodeProcessor::next_event), so I'll probably make it runtime configurable, since opting into the conforming behavior incurs a performance penalty due to the lookahead.
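
As a rough illustration of why a lookahead costs something, here is a std-only sketch that filters a node's properties down to the spec-conforming last-key-wins set. It is not kaydle's implementation and does not model NodeProcessor::next_event; `last_key_wins` is a hypothetical helper that assumes the properties are already available as a slice of key/value pairs.

```rust
/// Hypothetical helper, not part of kaydle: keep a property only if no
/// later property in the same node uses the same key. Each property has
/// to scan everything after it, which is where the extra cost comes from.
fn last_key_wins<'a>(props: &[(&'a str, &'a str)]) -> Vec<(&'a str, &'a str)> {
    props
        .iter()
        .enumerate()
        .filter(|&(i, &(key, _))| {
            // Look ahead past position `i` for a repeat of this key.
            !props[i + 1..].iter().any(|&(later, _)| later == key)
        })
        .map(|(_, &prop)| prop)
        .collect()
}
```

Given `a=1 b=2 a=3`, this keeps `b=2` and `a=3`. The scan is quadratic in the number of properties on a node, which is one plausible reason to leave the conforming behavior as an opt-in.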

Lucretiel added the documentation label Sep 21, 2021

CAD97 mentioned this issue Sep 27, 2021

The spec says "code point" where it likely means "unicode scalar value" kdl-org/kdl#207

Closed
