Remove '#' from legal characters in a bare identifier #204

hkolbeck · 2021-09-26T23:46:30Z

Based on discussion here: #200

* Use same order from grammar definition

/ is not a valid identifier character. This updates unusual_chars_in_bare_id.kdl to only include unusual but valid identifier characters, and adds separate invalid input tests for non-identifier characters. Fixes kdl-org#166

… id test (kdl-org#181) Co-authored-by: Hannah Kolbeck <[email protected]>

zkat · 2021-09-26T23:56:36Z

So KQL current picks out two more symbols that are currently legal identifiers: ~ and +

See https://github.com/kdl-org/kdl/blob/main/QUERY-SPEC.md#selectors

Should we use those, or reserve something else?

hkolbeck · 2021-09-27T00:04:19Z

Do you for-see adding general regex support to string matchers? The natural operator seems like ~= which is perhaps another argument for removing ~. That said, I think it depends entirely on the formal grammar for KQL. If whitespace is required between identifiers and operators, then I don't think ~/+ being legal is an issue, since as far as I can tell the token in that position can only be an operator.

EDIT: Haha just kidding, I forgot a b was a legal (and common) selector. I suppose we do have the option of breaking the css similarity and moving descendent selectors to something like a >> b.

hkolbeck · 2021-09-27T00:39:42Z

+ isn't actually an issue, I don't think, since it's already not legal as the first char in a bare id.

zkat · 2021-09-27T01:34:08Z

+ isn't actually an issue, I don't think, since it's already not legal as the first char in a bare id.

oh I didn't realize that. That's right!

I suppose we do have the option of breaking the css similarity and moving descendent selectors to something like a >> b.

I like this a lot, but thinking about verbosity, I think it's must more straightforward to say a b than a >> b, esp with "real" node names? I think the >> will become cumbersome? thoughts?

I'm also pretty in favor of removing ~ from identifiers and reserving it. If we do that, I can't think of anything else we might want to reserve that's important to.

hkolbeck · 2021-09-27T02:44:03Z

I don't think >> is too cumbersome, and as someone who doesn't do much complex css, it better communicates the meaning, imo.

I'm in favor of removing ~ as well.

zkat · 2021-09-28T00:07:23Z

I'm also kinda wanting to bring back /?

zkat · 2021-09-28T00:09:42Z

if we do that, URLs would become legal identifiers in most cases? I'm trying to think of when they wouldn't be, but maybe "URL encodable" might be the target we want fo KDL 2.0?

zkat · 2021-09-28T00:11:35Z

https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid/1547940#1547940

Or maybe not. That means we would need to include []();=', among others :(

hkolbeck · 2021-09-28T00:12:44Z

I feel like / is even more fraught than any of the others we've discussed. It sucks to make them force-quoted, but take

// "foo"

Is that a commented line or a node named //?

zkat · 2021-09-28T00:14:25Z

oh right I forgot about comments haha

zkat · 2021-09-28T00:23:08Z

Should we make a new branch and make this PR against that, btw? Like a 2.0 tracking branch that has 1.0-breaking changes in it?

zkat · 2021-09-28T00:38:38Z

boom, done :)

you might need to rebase onto it tho I think I did something bad.

tabatkins · 2021-10-05T16:20:00Z

+ isn't actually an issue, I don't think, since it's already not legal as the first char in a bare id.

Yes it is, so long as the second character isn't a digit; the grammar just prevents confusion with numbers. It's allowed by itself as a bare-ident.

tabatkins · 2021-10-05T16:22:15Z

I don't think >> is too cumbersome, and as someone who doesn't do much complex css, it better communicates the meaning, imo.

Fwiw, the CSSWG (aka, me, in this case) tried to add >> as an alternate way to spell the descendant combinator. It failed due to lack of impl interest, but I would have liked it (and ++ as the way to spell ~).

zkat · 2021-10-05T17:18:21Z

So how about making the final list of removed characters just # and +?

tabatkins · 2021-10-05T18:13:43Z

I'm fine with that. (We could keep + as an ident char but just remove it from initial ident chars if that would still be useful, so KQL could still tell apart + from an ident. I don't have a strong opinion either way.)

zkat · 2021-10-05T18:23:38Z

ugh but + is so useful, esp as a first char. I wonder if there's something else we could use for "sibling"/"next"?

Also, should we pre-emptively ban |? I feel like it smells like a bracket identifier to me. And it means we could use it for OR syntaxes?

zkat · 2021-10-06T03:49:47Z

after thinking about it, I don't know if we should remove any extra characters beyond #. We're removing # because it potentially really complicates some kinds of implementations, but I don't think we need to reserve anything else for the sake of the query language as long as we make one change: queries must be <matcher> <operator> [<matcher> <operator>]. That is, we no longer do descendants with spaces, and use >> for "descendants", as @tabatkins described.

I think that'll take care of it, and we can just merge this branch. What do y'all think?

tabatkins · 2021-10-06T05:41:28Z

Yeah, that sounds really good. It leaves KQL's evolution much more open without tying it so intimately to KDL's precise syntax model, and avoids restricting KDL itself unnecessarily.

(Recall that CSS's implicit descendant combinator comes from the very first version, when there weren't any combinators at all. Child/next-sibling/general-sibling are all later inventions layered on top. We don't need to repeat that.)

tabatkins · 2021-10-06T06:23:48Z

Re: # in particular, a final possibility I don't think has been brought up is just disallowing an ident from starting with r# specifically. This appears to be kdl4j's current behavior, per the conversation in #200.

It's certainly slightly clumsier than a simple blanket prohibition, but it also means we maintain maximum ident syntax without requiring excessive lookahead.

hkolbeck · 2021-10-06T20:39:04Z

I'm wavering. It's definitely clearer and probably easier to implement with parser generators to simply ban #, but I can absolutely see people wanting to use hashtag-like identifiers and it feels worthwhile to make that clean.

tabatkins · 2021-10-06T22:00:49Z

Yeah, so just outlawing the numberish or rawstringish starts are easy for hand-rolled parsers, but for the grammar itself, let's see...

// current
bare-identifier := ((identifier-char - digit - sign) identifier-char* | sign ((identifier-char - digit) identifier-char*)?) - keyword

// proposed new grammar
bare-identifier := (bare-ident-start identifier-char*) - keyword
bare-ident-start := ((identifier-char - digit) identifier-char?) - (sign digit) - "r#"

// or possibly
bare-identifier := (unambiguous-ident | numberish-ident | stringish-ident) - keyword
unambiguous-ident := (identifier-char - digit - sign - "r") identifier-char*
numberish-ident := sign ((identifier-char - digit) identifier-char*)?
stringish-ident := "r" ((identifier-char - "#") identifier-char*)?

I suspect the latter is the most compatible with parser gens, since the subtractions are single-character set subtractions, rather than sequence subtractions like the - keyword.

Rather than banning `#` in idents altogether, as #204 does, this takes the same approach as our number-like handling, and just bans idents from looking like a raw string *in their first two characters*: just as an ident can't start with `digit` or `sign digit`, it now can't start with `r#` or `r"` either. (That last isn't an ident, but it prevents a naive parser not using first-wins collision resolution from reading it as an ident `r` followed by junk.) It also refactors the bare-ident production a bit to make the structure clearer, as it was already getting a bit messy just trying to ban number-likes.

…, closes #241

zkat · 2023-12-11T16:50:22Z

Note: With #354, and potentially with whatever happens with #350, I think we can be much less aggressive here and just ban # as the first character in identifiers, like we do with signs and digits already. I'll keep this open for now, but I'm thinking of just resolving things that way.

zkat · 2023-12-13T07:20:51Z

This is done in kdl-v2 now, so I'm closing this one. # is outright illegal now.

jihchi and others added 10 commits September 15, 2021 07:37

Fix some typos (kdl-org#161)

a2e5399

Prevent markdown from parsing []() as link (kdl-org#162)

16269d4

* Use same order from grammar definition

Preserve duplicate nodes (kdl-org#167)

8ae4ae6

Fixes: kdl-org#165

tests: ensure output of illegal ident is quoted (kdl-org#178)

8ca18fe

Add type annotation tests from kdl4j and tweak unusual char in quoted…

f59ca24

… id test (kdl-org#181) Co-authored-by: Hannah Kolbeck <[email protected]>

Update line continuation wording (kdl-org#184)

5c4f83b

tests: add more coverage for keywords (kdl-org#189)

24faa6b

Added xml2kdl to the list of implementations. (kdl-org#201)

99d9168

Remove # from legal characters in a bare identifier

de6e6e4

zkat added the breaking This can only be done for the next major version of KDL label Sep 28, 2021

zkat changed the base branch from main to kdl-v2 September 28, 2021 00:37

zkat mentioned this pull request Oct 6, 2021

KQL: require operator and change operator grammar a bit #221

Merged

tabatkins mentioned this pull request Oct 17, 2021

Minimal fix to #200, banning string-like idents #241

Closed

zkat mentioned this pull request Aug 28, 2022

Merge KDL v2 #286

Merged

tabatkins added a commit that referenced this pull request Oct 6, 2023

Restrict idents from looking like raw strings. Closes #200, closes #204…

0836df1

…, closes #241

zkat closed this Dec 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove '#' from legal characters in a bare identifier #204

Remove '#' from legal characters in a bare identifier #204

hkolbeck commented Sep 26, 2021

zkat commented Sep 26, 2021

hkolbeck commented Sep 27, 2021 •

edited

Loading

hkolbeck commented Sep 27, 2021

zkat commented Sep 27, 2021

hkolbeck commented Sep 27, 2021

zkat commented Sep 28, 2021

zkat commented Sep 28, 2021

zkat commented Sep 28, 2021

hkolbeck commented Sep 28, 2021

zkat commented Sep 28, 2021

zkat commented Sep 28, 2021

zkat commented Sep 28, 2021

tabatkins commented Oct 5, 2021 •

edited

Loading

tabatkins commented Oct 5, 2021

zkat commented Oct 5, 2021

tabatkins commented Oct 5, 2021

zkat commented Oct 5, 2021

zkat commented Oct 6, 2021

tabatkins commented Oct 6, 2021

tabatkins commented Oct 6, 2021

hkolbeck commented Oct 6, 2021

tabatkins commented Oct 6, 2021

zkat commented Dec 11, 2023 •

edited

Loading

zkat commented Dec 13, 2023

Remove '#' from legal characters in a bare identifier #204

Remove '#' from legal characters in a bare identifier #204

Conversation

hkolbeck commented Sep 26, 2021

zkat commented Sep 26, 2021

hkolbeck commented Sep 27, 2021 • edited Loading

hkolbeck commented Sep 27, 2021

zkat commented Sep 27, 2021

hkolbeck commented Sep 27, 2021

zkat commented Sep 28, 2021

zkat commented Sep 28, 2021

zkat commented Sep 28, 2021

hkolbeck commented Sep 28, 2021

zkat commented Sep 28, 2021

zkat commented Sep 28, 2021

zkat commented Sep 28, 2021

tabatkins commented Oct 5, 2021 • edited Loading

tabatkins commented Oct 5, 2021

zkat commented Oct 5, 2021

tabatkins commented Oct 5, 2021

zkat commented Oct 5, 2021

zkat commented Oct 6, 2021

tabatkins commented Oct 6, 2021

tabatkins commented Oct 6, 2021

hkolbeck commented Oct 6, 2021

tabatkins commented Oct 6, 2021

zkat commented Dec 11, 2023 • edited Loading

zkat commented Dec 13, 2023

hkolbeck commented Sep 27, 2021 •

edited

Loading

tabatkins commented Oct 5, 2021 •

edited

Loading

zkat commented Dec 11, 2023 •

edited

Loading