Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Swift) Incorrect highlighting in operator functions #2895

Closed
svanimpe opened this issue Nov 28, 2020 · 22 comments · Fixed by #2930
Closed

(Swift) Incorrect highlighting in operator functions #2895

svanimpe opened this issue Nov 28, 2020 · 22 comments · Fixed by #2930
Labels
bug help welcome Could use help from community language

Comments

@svanimpe
Copy link
Contributor

What exactly is going on here?

Screen Shot 2020-11-28 at 06 34 26

Screen Shot 2020-11-28 at 06 35 06

@joshgoebel I’ll try to improve the Swift grammar a bit during the holidays. I'll start by looking into the JS regex syntax, then reading the HLJS developer documentation Can you recommend some grammars I should look at for inspiration?

@svanimpe svanimpe added bug help welcome Could use help from community language labels Nov 28, 2020
@joshgoebel
Copy link
Member

I'd try to tell you but you gave me screenshots instead of actual code I could copy and paste. :-p :~(

@svanimpe
Copy link
Contributor Author

svanimpe commented Dec 2, 2020

Sorry!

protocol Comparable: Equatable {

  static func < (lhs: Self, rhs: Self) -> Bool
  static func <= (lhs: Self, rhs: Self) -> Bool
  static func > (lhs: Self, rhs: Self) -> Bool
  static func >= (lhs: Self, rhs: Self) -> Bool
}

Now that I'm more familiar with the grammar, the problem seems to be that the grammar doesn't support protocol requirements. It assumes that the requirement static func < (lhs: Self, rhs: Self) -> Bool is a function declaration, so it keeps looking for an opening brace, which is never found, so the remainder of the code is parsed as a function declaration:

Screen Shot 2020-12-02 at 18 25 46

@joshgoebel
Copy link
Member

joshgoebel commented Dec 2, 2020

Yes, quite right. A few thoughts:

  • The TITLE_MODE rule also will need to be expanded to include the list of valid operators that Swift allows overloading. (I assume in this context we'd highlight them as titles like we do function names?)
  • If you wanted to tackle that we'd want the operators in a separate array (one per line) and then we'd join then using utility functions and merge them with the existing /[A-Za-z$][0-9A-Za-z$]*/ title rule.
  • If you wanted to start by just fixing the actual BUG here then I think we may simply need a contains rule that matches -> and then endsParent... or the end function rule could be updated.
  • Detecting the difference between a function/declaration/protocol requirement might be a bit out of scope.

@joshgoebel
Copy link
Member

joshgoebel commented Dec 2, 2020

Ah you're also going to fall afoul the templating(?) rule begin: /</, end: />/ so I think to deal with this properly we'll have to fix the TITLE rule first. And I'm worried we're quickly going to get into ambiguous territory here. :-)

@svanimpe
Copy link
Contributor Author

svanimpe commented Dec 3, 2020

Is there a way to express a specific ordering of patterns?

For example, in a function declaration, the function's name must come before the generic parameters, so in this example:

func < <T>()

The first < is the name of the function (< is a valid operator name), whereas the second < starts the generic parameter list. As you can see, GitHub's syntax highlighter matches this correctly. Can we express this in HLJS as well?

@svanimpe
Copy link
Contributor Author

svanimpe commented Dec 3, 2020

W.r.t. the TITLE rule, I was looking at the grammar for identifiers in Swift:
https://docs.swift.org/swift-book/ReferenceManual/LexicalStructure.html#grammar_identifier

This seems doable to implement. However, I'm wondering if the 5-hex-digit Unicode characters (such as U+9FFFD) are supported? I'm finding contradicting information regarding \uhhhhh. Is this supported?

@joshgoebel
Copy link
Member

joshgoebel commented Dec 4, 2020

Is there a way to express a specific ordering of patterns?

We have starts but that can get gnarly fast. I think in this case it might make sense to treat func < as a special case and handle it with a slightly different rule. (variants can do this) IE something like:

begin: regex.concat(/func /, regex.either(...OPERATORS)),
keywords: { title: OPERATORS.join(" ") }

It might even be simpler to just handle all the operators this way and avoid even touching TITLE. Of course I'm not sure what other syntax Swift has that might get in the way of this or not.

This seems doable to implement. However, I'm wondering if the 5-hex-digit Unicode characters (such as U+9FFFD) are supported? I'm finding contradicting information regarding \uhhhhh. Is this supported?

Regarding TITLE I was only talking about fixing it to match the operators... all those unicode characters might be a bit out of scope and better solved with #2756.

There are one or two places we use Unicode stuff in grammars now; ie, java:

var JAVA_IDENT_RE = '[\u00C0-\u02B8a-zA-Z_$][\u00C0-\u02B8a-zA-Z_$0-9]*';

But as you can see that's very minimal and clean usage. If a grammar instead needs 50 different unicode match rules (as Swift seems to) then it's time to consider a different approach.

@joshgoebel
Copy link
Member

I think it's important when working on grammars to remember the desired goal is to be useful not necessarily correct. Often correctness is impossible (or too hard to accomplish) so we just do the best we can to provide the best highlighting in the most circumstances.

@svanimpe
Copy link
Contributor Author

svanimpe commented Dec 4, 2020

Is there a (performance) downside to writing it correctly?

I've noticed the TextMate grammar uses the following rule for identifiers: (?<q>`?)[\p{L}_][\p{L}_\p{N}\p{M}]*(\k<q>).
So basically "starts with a letter or an underscore, followed by zero or more letters, marks, or numbers, and optionally wrapped in backquotes. I don't think this is 100% the same as the official grammar, but it may be close.

I assume that's acceptable?

@svanimpe
Copy link
Contributor Author

svanimpe commented Dec 4, 2020

The TextMate grammar also seems to spell out all possible Unicode characters for operators:

Screen Shot 2020-12-04 at 10 01 17

@joshgoebel
Copy link
Member

The TextMate grammar also seems to spell out all possible Unicode characters for operators:

That seems a bit much. I'd consider a smaller subset or simplifying the rule to make it "wider" such that is still mostly works even if not 100% correct. This being a case of size/complexity vs correctness. Also anytime I see this UTF-8 stuff I wonder if it's better solved by some of the new UTF-8 stuff we haven't gotten into yet.

Is there a (performance) downside to writing it correctly?

Well, usually I'm more concerned about size/complexity since we don't a benchmark suite, but yes performance is also a consideration. More complex grammars take longer to execute than less complex grammars.

Some of the TextMate grammars are like 1mb of rules... that's simply not what we're aspiring to do here, so sometimes we look at a thing and go "we're going to have to take a pass on that".

I've noticed the TextMate grammar uses the following rule for identifiers: (?`?)[\p{L}][\p{L}\p{N}\p{M}]*(\k).
So basically "starts with a letter or an underscore, followed by zero or more letters, marks, or numbers, and optionally wrapped in backquotes. I don't think this is 100% the same as the official grammar, but it may be close.

Pretty sure none of this is supported at all unless we're using the new ES2015 /u regex flag, which we don't yet. Again see #2756. For many languages using the new Unicode stuff might ultimately be the "best" solution, but so far no one has been interested in looking into the performance considerations.

@svanimpe
Copy link
Contributor Author

svanimpe commented Dec 5, 2020

Some of the TextMate grammars are like 1mb of rules... that's simply not what we're aspiring to do here, so sometimes we look at a thing and go "we're going to have to take a pass on that".

I'm also trying to find some middle ground. I just want to get rid of incorrect highlighting that may confuse readers. My current plan is try to to implement the grammar for types — especially function types — and their prerequisites, such as operators, identifiers, and attributes. I'm hoping that will allow me to detect the end of a function declaration, even if there's no body, which will fix this issue.

@joshgoebel
Copy link
Member

joshgoebel commented Dec 5, 2020

I just want to get rid of incorrect highlighting that may confuse readers.

Sure, just it can be hard to balance sometimes. Historically we err on the side of highlighting "more things" with "good" accuracy vs highlighting less but with 100% accuracy. For example see our regex support in many languages which is often cantfix because it's simply too complex to figure out what is and isn't a regex without a full parser aware of context. Some people absolutely think this is the wrong choice, but 99% of the time the outcome is very good.

Simply never highlighting regex because we get it VERY wrong 1% of the time would be a huge loss.

My current plan is try to to implement the grammar for types — especially function types — and their prerequisites, such as operators, identifiers, and attributes. I'm hoping that will allow me to detect the end of a function declaration, even if there's no body, which will fix this issue.

Often newer contributors over-engineer their initial solutions. :-) Trying to build language parsers instead of "do the simplest thing". If the regexs for all this is straight forward that might be ok, but in many cases an equivalent thing can often be achieved with a "one line hack"... such as adding additional "short-circuit" characters to the end mode... and often those hacks are much, much easier to maintain long-term than whatever the far more complex "correct" solution would be.

So I always first look for the "simple" solution to solve the "issue at hand" vs trying to assemble a 100% accurate Swift parser.

@svanimpe
Copy link
Contributor Author

Quick question here: does the end rule take precendence over the contains rule? I'm wondering if I can detect the end of a function declaration (even in the absence of an opening brace) by providing subrules for all the parts (generic parameters, parameters, return type, and where clause) and then just ending the function mode if there's no more submodes to match.

@svanimpe
Copy link
Contributor Author

I have a quick fix for this ready :)
I'll upload it in my next PR and try to fix #2857 too, as that's also function related.

@joshgoebel
Copy link
Member

Could we make these separate PRs? It's super hard to constantly review a huge PR that keeps growing. It would be much simpler to freeze it, get it solid... and then add these things as smaller and much easier to review PRs that come after.

@joshgoebel
Copy link
Member

Quick question here: does the end rule take precendence over the contains rule?

I'm not sure what you mean. When in a given mode the contains list + the end expression (plus illegal) is constantly scanned for. First found wins. Finding end of course ends the mode.

I'm wondering if I can detect the end of a function declaration (even in the absence of an opening brace) by providing subrules for all the parts (generic parameters, parameters, return type, and where clause) and then just ending the function mode if there's no more submodes to match.

This can work in some cases if it doesn't get too complex and if the rules are distinct enough that lack of sequencing is not an issue - as we do not provide great tools for X, then Y, then Z type expressions.

@svanimpe
Copy link
Contributor Author

I suggested combining these issues into a PR because they both involve rewriting the FUNCTION mode.

W.r.t. the end/contains, I was wondering in what order these are attempted. First end, then contains? First contains, then end? Or undefined? I was just trying to think of a way to match rules like "may be followed by ..." when there is no clear end character, just the lack of ... following it.

@joshgoebel
Copy link
Member

Yes I was only saying not to mix them with the existing PR... it's already too large.

W.r.t. the end/contains, I was wondering in what order these are attempted. First end, then contains?

No. contains + end + illegal. End or illegal only matches when no contains first match.

I was just trying to think of a way to match rules like "may be followed by ..." when there is no clear end character, just the lack of ... following it.

Huh? I dont' follow.

@svanimpe
Copy link
Contributor Author

Yes I was only saying not to mix them with the existing PR... it's already too large.

Ah yes, the current PR is definitely too large 😅

Huh? I dont' follow.

Currently, the function params mode uses endParent. However, a function declaration may also have trailing keywords (throw or throws), a return type (-> Type) if it's not Void, and a list of generic constraints (where T: Constraint), in that order. I'm contemplating adding modes to match those, so I can properly detect those.

However, now that I can distinguish a function title of < from the start of the generic parameter clause (also <), that may not be necessary anymore, and I may be able to match whatever comes after the params as "just code", and hope that it just works.

@joshgoebel
Copy link
Member

However, a function declaration may also have trailing keywords (throw or throws), a return type (-> Type) if it's not Void, and a list of generic constraints (where T: Constraint), in that order. I'm contemplating adding modes to match those, so I can properly detect those.

Those things don't necessarily need to be handed inside the scope of "function" though. Context-free grammars are better (much easier to debug and maintain) than context-filled grammars, when possible.

and I may be able to match whatever comes after the params as "just code", and hope that it just works.

And it just may. I think you're saying the same thing I just said.

@svanimpe
Copy link
Contributor Author

WIP:

Screen Shot 2020-12-15 at 14 31 23

Getting there!

joshgoebel pushed a commit that referenced this issue Dec 28, 2020
…ipts (#2930)

This PR improves highlighting for functions, initializers, and subscripts. This includes detailed highlighting for generic parameters as well as function parameters.

To differentiate between tuple elements and function params, I've also added a mode to match tuples, whose element names should not be highlighted as params.

- Fixes #2895
- Fixes #2857
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug help welcome Could use help from community language
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants