Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove '#' from legal characters in a bare identifier #204

Closed
wants to merge 10 commits into from

Conversation

hkolbeck
Copy link
Member

Based on discussion here: #200

@zkat
Copy link
Member

zkat commented Sep 26, 2021

So KQL current picks out two more symbols that are currently legal identifiers: ~ and +

See https://github.com/kdl-org/kdl/blob/main/QUERY-SPEC.md#selectors

Should we use those, or reserve something else?

@hkolbeck
Copy link
Member Author

hkolbeck commented Sep 27, 2021

Do you for-see adding general regex support to string matchers? The natural operator seems like ~= which is perhaps another argument for removing ~. That said, I think it depends entirely on the formal grammar for KQL. If whitespace is required between identifiers and operators, then I don't think ~/+ being legal is an issue, since as far as I can tell the token in that position can only be an operator.

EDIT: Haha just kidding, I forgot a b was a legal (and common) selector. I suppose we do have the option of breaking the css similarity and moving descendent selectors to something like a >> b.

@hkolbeck
Copy link
Member Author

+ isn't actually an issue, I don't think, since it's already not legal as the first char in a bare id.

@zkat
Copy link
Member

zkat commented Sep 27, 2021

+ isn't actually an issue, I don't think, since it's already not legal as the first char in a bare id.

oh I didn't realize that. That's right!

I suppose we do have the option of breaking the css similarity and moving descendent selectors to something like a >> b.

I like this a lot, but thinking about verbosity, I think it's must more straightforward to say a b than a >> b, esp with "real" node names? I think the >> will become cumbersome? thoughts?

I'm also pretty in favor of removing ~ from identifiers and reserving it. If we do that, I can't think of anything else we might want to reserve that's important to.

@hkolbeck
Copy link
Member Author

I don't think >> is too cumbersome, and as someone who doesn't do much complex css, it better communicates the meaning, imo.

I'm in favor of removing ~ as well.

@zkat
Copy link
Member

zkat commented Sep 28, 2021

I'm also kinda wanting to bring back /?

@zkat
Copy link
Member

zkat commented Sep 28, 2021

if we do that, URLs would become legal identifiers in most cases? I'm trying to think of when they wouldn't be, but maybe "URL encodable" might be the target we want fo KDL 2.0?

@zkat
Copy link
Member

zkat commented Sep 28, 2021

https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid/1547940#1547940

Or maybe not. That means we would need to include []();=', among others :(

@hkolbeck
Copy link
Member Author

I feel like / is even more fraught than any of the others we've discussed. It sucks to make them force-quoted, but take

// "foo"

Is that a commented line or a node named //?

@zkat
Copy link
Member

zkat commented Sep 28, 2021

oh right I forgot about comments haha

@zkat
Copy link
Member

zkat commented Sep 28, 2021

Should we make a new branch and make this PR against that, btw? Like a 2.0 tracking branch that has 1.0-breaking changes in it?

@zkat zkat added the breaking This can only be done for the next major version of KDL label Sep 28, 2021
@zkat zkat changed the base branch from main to kdl-v2 September 28, 2021 00:37
@zkat
Copy link
Member

zkat commented Sep 28, 2021

boom, done :)

you might need to rebase onto it tho I think I did something bad.

@tabatkins
Copy link
Contributor

tabatkins commented Oct 5, 2021

+ isn't actually an issue, I don't think, since it's already not legal as the first char in a bare id.

Yes it is, so long as the second character isn't a digit; the grammar just prevents confusion with numbers. It's allowed by itself as a bare-ident.

@tabatkins
Copy link
Contributor

I don't think >> is too cumbersome, and as someone who doesn't do much complex css, it better communicates the meaning, imo.

Fwiw, the CSSWG (aka, me, in this case) tried to add >> as an alternate way to spell the descendant combinator. It failed due to lack of impl interest, but I would have liked it (and ++ as the way to spell ~).

@zkat
Copy link
Member

zkat commented Oct 5, 2021

So how about making the final list of removed characters just # and +?

@tabatkins
Copy link
Contributor

I'm fine with that. (We could keep + as an ident char but just remove it from initial ident chars if that would still be useful, so KQL could still tell apart + from an ident. I don't have a strong opinion either way.)

@zkat
Copy link
Member

zkat commented Oct 5, 2021

ugh but + is so useful, esp as a first char. I wonder if there's something else we could use for "sibling"/"next"?

Also, should we pre-emptively ban |? I feel like it smells like a bracket identifier to me. And it means we could use it for OR syntaxes?

@zkat
Copy link
Member

zkat commented Oct 6, 2021

after thinking about it, I don't know if we should remove any extra characters beyond #. We're removing # because it potentially really complicates some kinds of implementations, but I don't think we need to reserve anything else for the sake of the query language as long as we make one change: queries must be <matcher> <operator> [<matcher> <operator>]. That is, we no longer do descendants with spaces, and use >> for "descendants", as @tabatkins described.

I think that'll take care of it, and we can just merge this branch. What do y'all think?

@tabatkins
Copy link
Contributor

Yeah, that sounds really good. It leaves KQL's evolution much more open without tying it so intimately to KDL's precise syntax model, and avoids restricting KDL itself unnecessarily.

(Recall that CSS's implicit descendant combinator comes from the very first version, when there weren't any combinators at all. Child/next-sibling/general-sibling are all later inventions layered on top. We don't need to repeat that.)

@tabatkins
Copy link
Contributor

Re: # in particular, a final possibility I don't think has been brought up is just disallowing an ident from starting with r# specifically. This appears to be kdl4j's current behavior, per the conversation in #200.

It's certainly slightly clumsier than a simple blanket prohibition, but it also means we maintain maximum ident syntax without requiring excessive lookahead.

@hkolbeck
Copy link
Member Author

hkolbeck commented Oct 6, 2021

I'm wavering. It's definitely clearer and probably easier to implement with parser generators to simply ban #, but I can absolutely see people wanting to use hashtag-like identifiers and it feels worthwhile to make that clean.

@tabatkins
Copy link
Contributor

Yeah, so just outlawing the numberish or rawstringish starts are easy for hand-rolled parsers, but for the grammar itself, let's see...

// current
bare-identifier := ((identifier-char - digit - sign) identifier-char* | sign ((identifier-char - digit) identifier-char*)?) - keyword

// proposed new grammar
bare-identifier := (bare-ident-start identifier-char*) - keyword
bare-ident-start := ((identifier-char - digit) identifier-char?) - (sign digit) - "r#"

// or possibly
bare-identifier := (unambiguous-ident | numberish-ident | stringish-ident) - keyword
unambiguous-ident := (identifier-char - digit - sign - "r") identifier-char*
numberish-ident := sign ((identifier-char - digit) identifier-char*)?
stringish-ident := "r" ((identifier-char - "#") identifier-char*)?

I suspect the latter is the most compatible with parser gens, since the subtractions are single-character set subtractions, rather than sequence subtractions like the - keyword.

tabatkins added a commit that referenced this pull request Oct 17, 2021
Rather than banning `#` in idents altogether, as #204 does, this takes the same approach as our number-like handling, and just bans idents from looking like a raw string *in their first two characters*: just as an ident can't start with `digit` or `sign digit`, it now can't start with `r#` or `r"` either.  (That last isn't an ident, but it prevents a naive parser not using first-wins collision resolution from reading it as an ident `r` followed by junk.)

It also refactors the bare-ident production a bit to make the structure clearer, as it was already getting a bit messy just trying to ban number-likes.
@zkat zkat mentioned this pull request Aug 28, 2022
tabatkins added a commit that referenced this pull request Oct 6, 2023
@zkat
Copy link
Member

zkat commented Dec 11, 2023

Note: With #354, and potentially with whatever happens with #350, I think we can be much less aggressive here and just ban # as the first character in identifiers, like we do with signs and digits already. I'll keep this open for now, but I'm thinking of just resolving things that way.

@zkat
Copy link
Member

zkat commented Dec 13, 2023

This is done in kdl-v2 now, so I'm closing this one. # is outright illegal now.

@zkat zkat closed this Dec 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking This can only be done for the next major version of KDL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants