self-expression: Add KDL grammar in KDL #475

Closed
eugenesvk opened this issue Jan 12, 2025 · 16 comments

@eugenesvk
Contributor

eugenesvk commented Jan 12, 2025

Currently the grammar is only available as an unformatted blob of text in the markdown document.

In addition it would be great if KDL could express itself and you could read a properly formatted/highlighted grammar document

So that instead of this gray uniformity
Image

you could have something like this
Image

@tabatkins
Contributor

It's not a "blob of text", it's a grammar description in a well-known grammar description language. Writing the grammar "in KDL" would require us to define a new grammar description language using KDL as the syntax base, which would be a project on its own. I don't see great value in that.

It would also require anyone reading the grammar that defines KDL to already know the grammar of KDL. (Which, tbf, is in broad strokes very simple, but still.)

(I'll also note that what you have written there, while technically a valid KDL document, isn't a meaningful KDL document. It treats section comments as nodes, and requires you to examine the exact order of properties and arguments, since a grammar term is a property and the following arguments up to the next property; for example, "keyword" is defined by its value (boolean) and the following two arguments (|, an ident string; and #null). No one would actually define a KDL language like that.)

@eugenesvk
Contributor Author

eugenesvk commented Jan 12, 2025

It's not a "blob of text", it's a grammar description in a well-known grammar description language

Which one, and why isn't it marked as such in spec.md? The description implies ABNF, but when I used an ABNF syntax highlighter, it failed, since apparently // isn't a comment there. It also failed on char ranges and some other stuff. So I doubt the current description is really specced to a "well-known language".

(the bigger issue is that it's not really a formal grammar despite the superficial syntax familiarity, so you can't use it as a standalone description of the language, but have to constantly cross-reference description and tests, but that's a separate ignored issue #64)

I don't see great value in that.

The great value is that it's not a grey blob of text, but a syntax-highlighted doc, which improves readability

KDL to already know the grammar of KDL

Not really, you don't need to understand that \ is line continuation, just read in the grammar that \ is not part of grammar, but part of the extra syntax. And this isn't that different from reading in the current Grammar language that explains the basics such as

Single quotes (') are used to denote literal text

It treats section comments as nodes

it treats sections as nodes, comments are args, but more importantly

, and requires you to examine the exact order of properties and arguments,

yes, it requires (and is designed for) READING, not programmatic parsing

meaningful KDL document.

so it is meaningful for reading, hence instead of

  • rule1= "rule2 | rule3" which is a neat programmatic 1 prop=value element, but hinders readability due to the extra ", you have
  • rule1 = rule2 | rule3 which is 3 elements

@eugenesvk
Contributor Author

"keyword" is defined by its value (boolean) and the following two arguments (|, an ident string; and #null)

By the way, it's also possible to have boolean+|+#null as a single value (as I had it before adding bare strings to the syntax), but it serves no readability purpose, quite the opposite: the extra highlight you get from a real #null that separates it from | is worth more than the unspecified benefit of the alternative

Image

@zkat
Member

zkat commented Jan 12, 2025

If the goal is syntax highlighting, I would much prefer we just go ahead and rewrite as ABNF, which would make it more idiomatic for #461 anyway.

While this self-referential grammar looks, on the surface, like a regular grammar, it has no good semantic meaning in KDL at all. It is very superficially using KDL syntax, while yielding a completely nonsensical document if parsed (properties and arguments have no mutual ordering requirement so all the “productions” exist on completely different planes than subsequent arguments).

I’m also not sure I like how, even if it did produce a valid KDL doc, the presentation/formatting is taking advantage of some kinda clever things rather than formatting the text idiomatically.

@eugenesvk
Contributor Author

If the goal is syntax highlighting,

The other goal was compactness (also structure/vis alignment)

while yielding a completely nonsensical document if parsed

I don't get it, you both bring it up as an issue without explaining what the issue is. Are you parsing "ABNF" as a data document for use anywhere? No

Then what are you planning to do with a parsed grammar that you need it to make "formal sense"?

the presentation/formatting is taking advantage of some kinda clever things rather than formatting the text idiomatically.

It should be viewed as a testament to KDL's expressive power that it can generate clean docs using such tricks

@zkat
Member

zkat commented Jan 13, 2025

The whole problem is that it’s NOT generating a sensible doc. The data is semantically lossy. Some parsers might preserve order, but they would be doing so outside the bounds of KDL proper.

And, as a matter of fact, ABNF can be used for parser generators.

@eugenesvk
Contributor Author

But you still haven't said what the problem is! What are your use cases for a "sensible doc" and why is it important that the better formatted doc must conform to them while the current non-formatted doc should not?

And, as a matter of fact, ABNF can be used for parser generators.

In theory. In practice this grammar can't and isn't. And likely won't ever be (and I've actually tried it for syntax generation way back with v1 since to me it's the most sensible way, but it didn't work due to both deficiencies in the grammar and the generator).

But also in theory the KDL version could be restructured into a "sensible doc", but it would likely come at the cost of worse readability, so with no value identified it doesn't make sense to bear that cost.

After all, this issue doesn't prescribe any specific format; the screenshots/linked PR are just an example of what I used for myself, because working while referencing a poorly structured, poorly formatted spec was too painful

@zkat
Member

zkat commented Jan 13, 2025

It’s confusing to me to read a KDL document which, superficially, shares syntax with KDL, but is semantically meaningless as a KDL document.

If we were to define the grammar in terms of KDL itself, I would like to see the resulting document both formatted idiomatically, and for the document itself to meaningfully, semantically describe the grammar using idiomatic structuring of the productions

@tabatkins
Contributor

Yeah, given your grammar fragment:

P "Keywords and Booleans" \
keyword       = boolean | #null \
float_keyword = #inf | #-inf | #nan \
boolean       = #true | #false

This is a P node, containing several arguments and properties. While the KDL data model treats the relative order of arguments as meaningful, it treats properties as unordered, and doesn't define a canonical ordering for arguments and properties relative to each other. It also treats all the string syntaxes as equivalent, so ident strings and quoted strings aren't required to be distinguished.

That is, it would be valid, per the data model, to reformat that node to:

P "Keywords and Booleans" | #null | #-inf | #nan | #false keyword="boolean" float_keyword=#inf boolean=#true

And this is, clearly, meaningless. This is what Kat means by your example being "meaningless as a KDL document" - it's a set of KDL constructs that happen to syntax-highlight well, but don't actually form a meaningful KDL document. It would be like a document written "in JSON" that interpreted [] differently based on whether it was formatted across several lines or compactly on one line.
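
The equivalence between the two formattings can be checked with a toy model. This is a minimal Python sketch, not a real KDL parser: `toy_node_model` is a hypothetical helper that handles a single childless node, splits it on whitespace, treats every `key=value` token as a property and everything else as an argument, and ignores types and escapes. Under those assumptions both formattings reduce to the same node name, ordered argument list, and unordered property map.

```python
import re
import shlex

# The fragment as originally formatted (raw string keeps the "\" continuations).
ORIGINAL = r"""P "Keywords and Booleans" \
keyword       = boolean | #null \
float_keyword = #inf | #-inf | #nan \
boolean       = #true | #false"""

# The reformatted single-line version from the comment.
REFORMATTED = (
    'P "Keywords and Booleans" | #null | #-inf | #nan | #false '
    'keyword="boolean" float_keyword=#inf boolean=#true'
)


def toy_node_model(text):
    """Reduce one childless node to (name, args, props).

    Assumption: a token containing "=" is a property; this toy would
    misread a quoted argument that itself contains "=".
    """
    joined = re.sub(r"\\\s*\n", " ", text)      # fold line continuations
    joined = re.sub(r"\s*=\s*", "=", joined)    # allow spaces around "="
    tokens = shlex.split(joined)                # strips quotes: "boolean" == boolean
    name, rest = tokens[0], tokens[1:]
    args = [tok for tok in rest if "=" not in tok]            # order kept
    props = dict(tok.split("=", 1) for tok in rest if "=" in tok)  # order ignored
    return name, args, props


if __name__ == "__main__":
    print(toy_node_model(ORIGINAL))
    print(toy_node_model(ORIGINAL) == toy_node_model(REFORMATTED))
```

The information about which arguments follow which property is exactly what this reduction discards, which is the sense in which the grouping into productions is not part of the data.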

If we did want to do something like this, it would be with meaningful node structures, something like:

section "Keywords and Booleans" {
  prod keyword {
    nt boolean
    literal "#null"
  }
  prod float_keyword {
    literal "#inf" 
    literal "#-inf" 
    literal "#nan"
  }
  prod boolean {
    literal "#true" 
    literal "#false"
  }
}

Here, sections are nodes with child nodes defining productions. The production's children are the alternatives, with node names defining the type of each value. In less trivial examples that require, say, sequences of productions, or modifiers, they'd be their own nodes with children, like:

current spec:
unambiguous-ident := ((identifier-char - digit - sign - '.') identifier-char*) - disallowed-keyword-strings

in grammar-kdl:
prod unambiguous-ident {
  minus {
    seq {
      minus { nt identifier-char; nt digit; nt sign; literal "." }
      zero-or-more { nt identifier-char }
    }
    nt disallowed-keyword-strings
  }
}

This sort of structure meaningfully follows the KDL data model; reformatting won't change its meaning. It's also just the ABNF's parse-tree, so it's more verbose and often harder to read (but is potentially more clear in complex situations, which carries its own value).
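
The "ABNF's parse-tree" point can be illustrated in reverse: a structured tree like the one above can be rendered back into the spec's one-line notation. This is a minimal Python sketch, assuming a hypothetical tuple encoding of the nodes (`nt`, `literal`, `seq`, `minus`, `zero-or-more`, plus an assumed `alt` for alternatives) and a hypothetical `render` helper; none of these names come from the KDL spec or from any KDL library.

```python
# Separator used when rendering each multi-child node kind (assumed encoding).
COMPOSITE = {"seq": " ", "minus": " - ", "alt": " | "}


def render(node, nested=False):
    """Render a (kind, body) tree into the spec's text notation."""
    kind, body = node
    if kind == "nt":                 # reference to another production
        return body
    if kind == "literal":            # the spec marks literal text with single quotes
        return f"'{body}'"
    if kind == "zero-or-more":       # single child followed by "*"
        return render(body, nested=True) + "*"
    text = COMPOSITE[kind].join(render(child, nested=True) for child in body)
    # Parenthesize a composite only when it sits inside another expression.
    return f"({text})" if nested else text


# Tuple form of the `prod unambiguous-ident` node tree.
UNAMBIGUOUS_IDENT = ("minus", [
    ("seq", [
        ("minus", [
            ("nt", "identifier-char"),
            ("nt", "digit"),
            ("nt", "sign"),
            ("literal", "."),
        ]),
        ("zero-or-more", ("nt", "identifier-char")),
    ]),
    ("nt", "disallowed-keyword-strings"),
])

if __name__ == "__main__":
    print("unambiguous-ident := " + render(UNAMBIGUOUS_IDENT))
```

Because the tree carries its own structure, the rendering does not depend on how the source document was laid out, at the price of the extra verbosity noted above.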

@eugenesvk
Contributor Author

so it's more verbose and often harder to read

Indeed it is, which runs counter to my main goal with this suggestion - to have a single structured and easily readable reference doc (and in your extreme example of having to name every literal as literal I'm not sure the highlighted result will be any better vs the grey/broken highlight of ABNF since there is too much extra text)

And since you still haven't identified a single use case for the data model being "valid" (this doc is meant to be read as a reference by humans creating their parsers/documents), it's a pure downside

(but is potentially more clear in complex situations, which carries its own value).

Hard to say for certain without an example situation, but potentially this is also achieved via extra verbosity/comments, not necessarily via a valid data model, which I think won't help you see that the pre-closing dedent is the primary one (do you need variables in grammar for this besides terms and nts?) and that equality is strict re the composition of spaces

@larsgw
Contributor

larsgw commented Jan 15, 2025

Indeed it is, which runs counter to my main goal with this suggestion - to have a single structured and easily readable reference doc (and in your extreme example of having to name every literal as literal I'm not sure the highlighted result will be any better vs the grey/broken highlight of ABNF since there is too much extra text)

If it is just about highlighting, we could bake the highlighting into the HTML (or indeed use ABNF). I don't see how it's relevant that we could format the grammar in a way that happens to be parsable as KDL.

@eugenesvk
Contributor Author

eugenesvk commented Jan 15, 2025

HTML is adding a huge unergonomic (e.g., for basic things like search, also no edit) dependency - a browser.

ABNF adds a little extra dependency outside the KDL ecosystem. The relevant part here is self-containment. And there is a tiny benefit that you could use grammar as a test document itself (helped me catch a couple of bugs). Maybe you could also have a cleaner doc (like without ' on the screenshots)

Although ABNF in theory could give you LSP-like convenience of in-place rule popups/jumping to rule definition, I don't think these exist in practice. So its only practical benefit currently is that it can highlight ? and * etc differently

@larsgw
Contributor

larsgw commented Jan 15, 2025

I could see something like this perhaps.

// Keywords and booleans
keyword       (rule)boolean | "#null"
float_keyword "#inf" | "#-inf" | "#nan"
boolean       "#true" | "#false"

Note that you should distinguish between strings that refer to other rules and literal strings, especially since the grammar is normative. Technically | is also a string but I feel like that is clear enough. Note also that this example is relatively straightforward. For more complicated rules it becomes more difficult, there are several features of the grammar where I do not see a clear way to encode it succinctly.

  • Parentheses
  • ? and * (when applied to rules they could be included in the string, but not for literals and parenthesis groups)
  • Character ranges
  • Exclusions from character ranges

Hence I think a specialized format is preferable.

@eugenesvk
Contributor Author

Note that you should distinguish between strings that refer to other rules and literal strings, especially since the grammar is normative

Or you can just add a grammar rule: if it's not obvious from the bare string's value or grammar's description, bare string refers to another rule if such a rule exists left of = anywhere
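
The lookup rule proposed here can be sketched in a few lines. This is a minimal Python illustration under loud assumptions: the grammar is given as a plain mapping from each rule name (the left side of `=`) to its right-side tokens, every token is treated as a string, `|` is the only operator, and `classify_bare_strings` is a hypothetical helper, not part of any KDL tooling.

```python
# Assumed input shape: rule name -> right-hand-side tokens, all as strings.
GRAMMAR = {
    "keyword": ["boolean", "|", "#null"],
    "float_keyword": ["#inf", "|", "#-inf", "|", "#nan"],
    "boolean": ["#true", "|", "#false"],
}

OPERATORS = {"|"}  # assumption: the only operator in this fragment


def classify_bare_strings(grammar):
    """Label each token as 'operator', 'rule' or 'literal'.

    A token counts as a rule reference when a rule with that name is
    defined anywhere in the grammar; otherwise it is taken as a literal.
    """
    rule_names = set(grammar)
    labelled = {}
    for name, tokens in grammar.items():
        labelled[name] = [
            (tok, "operator" if tok in OPERATORS
                  else "rule" if tok in rule_names
                  else "literal")
            for tok in tokens
        ]
    return labelled


if __name__ == "__main__":
    for rule, tokens in classify_bare_strings(GRAMMAR).items():
        print(rule, tokens)
```

Under this rule a literal whose text happens to match a rule name would be read as a reference, which is the case the explicit `(rule)` annotation is meant to rule out.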

do not see a clear way to encode it succinctly.

You can check the example in the linked PR, which encodes everything in a more succinct way than the current one, but it should be viewed with syntax highlighter so that you can see you don't need to dirty every rule with (rule) as in (rule)boolean because it's already visibly different

@tabatkins
Contributor

I think this discussion has run its course. We are not interested in formatting our grammar as a meaningless KDL document solely because KDL syntax highlighting happens to look reasonable on it. If you want a highlighted grammar instead of plain text, that is something we'd be willing to take, with HTML colorizing, and could put that into the spec document.

@eugenesvk
Contributor Author

with HTML colorizing

Why would you force everyone to use a browser to see colors???

And you actually get HTML colorizing from the grammar being a valid KDL document as far as I understand, at least the https://kdl.dev/play/ playground highlights different elements differently. So the only thing preventing you from having your HTML colorizing is you being stuck on a meaningless requirement of a "meaningful" data structure

@kdl-org kdl-org locked as resolved and limited conversation to collaborators Jan 16, 2025