(Swift) Incorrect highlighting in operator functions #2895

svanimpe · 2020-11-28T05:41:21Z

What exactly is going on here?

@joshgoebel I’ll try to improve the Swift grammar a bit during the holidays. I'll start by looking into the JS regex syntax, then reading the HLJS developer documentation Can you recommend some grammars I should look at for inspiration?

joshgoebel · 2020-12-02T17:19:36Z

I'd try to tell you but you gave me screenshots instead of actual code I could copy and paste. :-p :~(

svanimpe · 2020-12-02T17:26:10Z

Sorry!

protocol Comparable: Equatable {

  static func < (lhs: Self, rhs: Self) -> Bool
  static func <= (lhs: Self, rhs: Self) -> Bool
  static func > (lhs: Self, rhs: Self) -> Bool
  static func >= (lhs: Self, rhs: Self) -> Bool
}

Now that I'm more familiar with the grammar, the problem seems to be that the grammar doesn't support protocol requirements. It assumes that the requirement static func < (lhs: Self, rhs: Self) -> Bool is a function declaration, so it keeps looking for an opening brace, which is never found, so the remainder of the code is parsed as a function declaration:

joshgoebel · 2020-12-02T17:49:31Z

Yes, quite right. A few thoughts:

The TITLE_MODE rule also will need to be expanded to include the list of valid operators that Swift allows overloading. (I assume in this context we'd highlight them as titles like we do function names?)
If you wanted to tackle that we'd want the operators in a separate array (one per line) and then we'd join then using utility functions and merge them with the existing /[A-Za-z$][0-9A-Za-z$]*/ title rule.
If you wanted to start by just fixing the actual BUG here then I think we may simply need a contains rule that matches -> and then endsParent... or the end function rule could be updated.
Detecting the difference between a function/declaration/protocol requirement might be a bit out of scope.

joshgoebel · 2020-12-02T17:53:35Z

Ah you're also going to fall afoul the templating(?) rule begin: /</, end: />/ so I think to deal with this properly we'll have to fix the TITLE rule first. And I'm worried we're quickly going to get into ambiguous territory here. :-)

svanimpe · 2020-12-03T18:57:25Z

Is there a way to express a specific ordering of patterns?

For example, in a function declaration, the function's name must come before the generic parameters, so in this example:

func < <T>()

The first < is the name of the function (< is a valid operator name), whereas the second < starts the generic parameter list. As you can see, GitHub's syntax highlighter matches this correctly. Can we express this in HLJS as well?

svanimpe · 2020-12-03T19:06:30Z

W.r.t. the TITLE rule, I was looking at the grammar for identifiers in Swift:
https://docs.swift.org/swift-book/ReferenceManual/LexicalStructure.html#grammar_identifier

This seems doable to implement. However, I'm wondering if the 5-hex-digit Unicode characters (such as U+9FFFD) are supported? I'm finding contradicting information regarding \uhhhhh. Is this supported?

joshgoebel · 2020-12-04T06:27:07Z

Is there a way to express a specific ordering of patterns?

We have starts but that can get gnarly fast. I think in this case it might make sense to treat func < as a special case and handle it with a slightly different rule. (variants can do this) IE something like:

begin: regex.concat(/func /, regex.either(...OPERATORS)),
keywords: { title: OPERATORS.join(" ") }

It might even be simpler to just handle all the operators this way and avoid even touching TITLE. Of course I'm not sure what other syntax Swift has that might get in the way of this or not.

This seems doable to implement. However, I'm wondering if the 5-hex-digit Unicode characters (such as U+9FFFD) are supported? I'm finding contradicting information regarding \uhhhhh. Is this supported?

Regarding TITLE I was only talking about fixing it to match the operators... all those unicode characters might be a bit out of scope and better solved with #2756.

There are one or two places we use Unicode stuff in grammars now; ie, java:

var JAVA_IDENT_RE = '[\u00C0-\u02B8a-zA-Z_$][\u00C0-\u02B8a-zA-Z_$0-9]*';

But as you can see that's very minimal and clean usage. If a grammar instead needs 50 different unicode match rules (as Swift seems to) then it's time to consider a different approach.

joshgoebel · 2020-12-04T06:29:13Z

I think it's important when working on grammars to remember the desired goal is to be useful not necessarily correct. Often correctness is impossible (or too hard to accomplish) so we just do the best we can to provide the best highlighting in the most circumstances.

svanimpe · 2020-12-04T07:08:51Z

Is there a (performance) downside to writing it correctly?

I've noticed the TextMate grammar uses the following rule for identifiers: (?<q>`?)[\p{L}_][\p{L}_\p{N}\p{M}]*(\k<q>).
So basically "starts with a letter or an underscore, followed by zero or more letters, marks, or numbers, and optionally wrapped in backquotes. I don't think this is 100% the same as the official grammar, but it may be close.

I assume that's acceptable?

svanimpe · 2020-12-04T09:06:50Z

The TextMate grammar also seems to spell out all possible Unicode characters for operators:

joshgoebel · 2020-12-05T15:49:25Z

The TextMate grammar also seems to spell out all possible Unicode characters for operators:

That seems a bit much. I'd consider a smaller subset or simplifying the rule to make it "wider" such that is still mostly works even if not 100% correct. This being a case of size/complexity vs correctness. Also anytime I see this UTF-8 stuff I wonder if it's better solved by some of the new UTF-8 stuff we haven't gotten into yet.

Is there a (performance) downside to writing it correctly?

Well, usually I'm more concerned about size/complexity since we don't a benchmark suite, but yes performance is also a consideration. More complex grammars take longer to execute than less complex grammars.

Some of the TextMate grammars are like 1mb of rules... that's simply not what we're aspiring to do here, so sometimes we look at a thing and go "we're going to have to take a pass on that".

I've noticed the TextMate grammar uses the following rule for identifiers: (?`?)[\p{L}][\p{L}\p{N}\p{M}]*(\k).
So basically "starts with a letter or an underscore, followed by zero or more letters, marks, or numbers, and optionally wrapped in backquotes. I don't think this is 100% the same as the official grammar, but it may be close.

Pretty sure none of this is supported at all unless we're using the new ES2015 /u regex flag, which we don't yet. Again see #2756. For many languages using the new Unicode stuff might ultimately be the "best" solution, but so far no one has been interested in looking into the performance considerations.

svanimpe · 2020-12-05T16:36:23Z

Some of the TextMate grammars are like 1mb of rules... that's simply not what we're aspiring to do here, so sometimes we look at a thing and go "we're going to have to take a pass on that".

I'm also trying to find some middle ground. I just want to get rid of incorrect highlighting that may confuse readers. My current plan is try to to implement the grammar for types — especially function types — and their prerequisites, such as operators, identifiers, and attributes. I'm hoping that will allow me to detect the end of a function declaration, even if there's no body, which will fix this issue.

joshgoebel · 2020-12-05T16:47:39Z

I just want to get rid of incorrect highlighting that may confuse readers.

Sure, just it can be hard to balance sometimes. Historically we err on the side of highlighting "more things" with "good" accuracy vs highlighting less but with 100% accuracy. For example see our regex support in many languages which is often cantfix because it's simply too complex to figure out what is and isn't a regex without a full parser aware of context. Some people absolutely think this is the wrong choice, but 99% of the time the outcome is very good.

Simply never highlighting regex because we get it VERY wrong 1% of the time would be a huge loss.

My current plan is try to to implement the grammar for types — especially function types — and their prerequisites, such as operators, identifiers, and attributes. I'm hoping that will allow me to detect the end of a function declaration, even if there's no body, which will fix this issue.

Often newer contributors over-engineer their initial solutions. :-) Trying to build language parsers instead of "do the simplest thing". If the regexs for all this is straight forward that might be ok, but in many cases an equivalent thing can often be achieved with a "one line hack"... such as adding additional "short-circuit" characters to the end mode... and often those hacks are much, much easier to maintain long-term than whatever the far more complex "correct" solution would be.

So I always first look for the "simple" solution to solve the "issue at hand" vs trying to assemble a 100% accurate Swift parser.

svanimpe · 2020-12-12T16:24:22Z

Quick question here: does the end rule take precendence over the contains rule? I'm wondering if I can detect the end of a function declaration (even in the absence of an opening brace) by providing subrules for all the parts (generic parameters, parameters, return type, and where clause) and then just ending the function mode if there's no more submodes to match.

svanimpe · 2020-12-12T17:52:38Z

I have a quick fix for this ready :)
I'll upload it in my next PR and try to fix #2857 too, as that's also function related.

joshgoebel · 2020-12-12T18:45:00Z

Could we make these separate PRs? It's super hard to constantly review a huge PR that keeps growing. It would be much simpler to freeze it, get it solid... and then add these things as smaller and much easier to review PRs that come after.

joshgoebel · 2020-12-12T18:50:26Z

Quick question here: does the end rule take precendence over the contains rule?

I'm not sure what you mean. When in a given mode the contains list + the end expression (plus illegal) is constantly scanned for. First found wins. Finding end of course ends the mode.

I'm wondering if I can detect the end of a function declaration (even in the absence of an opening brace) by providing subrules for all the parts (generic parameters, parameters, return type, and where clause) and then just ending the function mode if there's no more submodes to match.

This can work in some cases if it doesn't get too complex and if the rules are distinct enough that lack of sequencing is not an issue - as we do not provide great tools for X, then Y, then Z type expressions.

svanimpe · 2020-12-12T19:00:58Z

I suggested combining these issues into a PR because they both involve rewriting the FUNCTION mode.

W.r.t. the end/contains, I was wondering in what order these are attempted. First end, then contains? First contains, then end? Or undefined? I was just trying to think of a way to match rules like "may be followed by ..." when there is no clear end character, just the lack of ... following it.

joshgoebel · 2020-12-12T19:32:15Z

Yes I was only saying not to mix them with the existing PR... it's already too large.

W.r.t. the end/contains, I was wondering in what order these are attempted. First end, then contains?

No. contains + end + illegal. End or illegal only matches when no contains first match.

I was just trying to think of a way to match rules like "may be followed by ..." when there is no clear end character, just the lack of ... following it.

Huh? I dont' follow.

svanimpe · 2020-12-12T19:41:50Z

Yes I was only saying not to mix them with the existing PR... it's already too large.

Ah yes, the current PR is definitely too large 😅

Huh? I dont' follow.

Currently, the function params mode uses endParent. However, a function declaration may also have trailing keywords (throw or throws), a return type (-> Type) if it's not Void, and a list of generic constraints (where T: Constraint), in that order. I'm contemplating adding modes to match those, so I can properly detect those.

However, now that I can distinguish a function title of < from the start of the generic parameter clause (also <), that may not be necessary anymore, and I may be able to match whatever comes after the params as "just code", and hope that it just works.

joshgoebel · 2020-12-12T19:44:52Z

However, a function declaration may also have trailing keywords (throw or throws), a return type (-> Type) if it's not Void, and a list of generic constraints (where T: Constraint), in that order. I'm contemplating adding modes to match those, so I can properly detect those.

Those things don't necessarily need to be handed inside the scope of "function" though. Context-free grammars are better (much easier to debug and maintain) than context-filled grammars, when possible.

and I may be able to match whatever comes after the params as "just code", and hope that it just works.

And it just may. I think you're saying the same thing I just said.

svanimpe · 2020-12-15T13:32:25Z

WIP:

Getting there!

…ipts (#2930) This PR improves highlighting for functions, initializers, and subscripts. This includes detailed highlighting for generic parameters as well as function parameters. To differentiate between tuple elements and function params, I've also added a mode to match tuples, whose element names should not be highlighted as params. - Fixes #2895 - Fixes #2857

svanimpe added bug help welcome Could use help from community language labels Nov 28, 2020

This was referenced Dec 4, 2020

enh(swift) Add support for closure arguments (#2871) #2907

Closed

(swift) Revamped keywords and added support for operators #2908

Merged

This was referenced Dec 16, 2020

enh(swift) Improved highlighting for types and generic arguments (#2920) #2920

Merged

(swift) Improved highlighting for functions, initializers, and subscripts #2930

Merged

joshgoebel closed this as completed in #2930 Dec 28, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(Swift) Incorrect highlighting in operator functions #2895

(Swift) Incorrect highlighting in operator functions #2895

svanimpe commented Nov 28, 2020

joshgoebel commented Dec 2, 2020

svanimpe commented Dec 2, 2020

joshgoebel commented Dec 2, 2020 •

edited

Loading

joshgoebel commented Dec 2, 2020 •

edited

Loading

svanimpe commented Dec 3, 2020 •

edited by joshgoebel

Loading

svanimpe commented Dec 3, 2020

joshgoebel commented Dec 4, 2020 •

edited

Loading

joshgoebel commented Dec 4, 2020

svanimpe commented Dec 4, 2020 •

edited

Loading

svanimpe commented Dec 4, 2020

joshgoebel commented Dec 5, 2020

svanimpe commented Dec 5, 2020

joshgoebel commented Dec 5, 2020 •

edited

Loading

svanimpe commented Dec 12, 2020

svanimpe commented Dec 12, 2020

joshgoebel commented Dec 12, 2020

joshgoebel commented Dec 12, 2020

svanimpe commented Dec 12, 2020

joshgoebel commented Dec 12, 2020

svanimpe commented Dec 12, 2020

joshgoebel commented Dec 12, 2020

svanimpe commented Dec 15, 2020

(Swift) Incorrect highlighting in operator functions #2895

(Swift) Incorrect highlighting in operator functions #2895

Comments

svanimpe commented Nov 28, 2020

joshgoebel commented Dec 2, 2020

svanimpe commented Dec 2, 2020

joshgoebel commented Dec 2, 2020 • edited Loading

joshgoebel commented Dec 2, 2020 • edited Loading

svanimpe commented Dec 3, 2020 • edited by joshgoebel Loading

svanimpe commented Dec 3, 2020

joshgoebel commented Dec 4, 2020 • edited Loading

joshgoebel commented Dec 4, 2020

svanimpe commented Dec 4, 2020 • edited Loading

svanimpe commented Dec 4, 2020

joshgoebel commented Dec 5, 2020

svanimpe commented Dec 5, 2020

joshgoebel commented Dec 5, 2020 • edited Loading

svanimpe commented Dec 12, 2020

svanimpe commented Dec 12, 2020

joshgoebel commented Dec 12, 2020

joshgoebel commented Dec 12, 2020

svanimpe commented Dec 12, 2020

joshgoebel commented Dec 12, 2020

svanimpe commented Dec 12, 2020

joshgoebel commented Dec 12, 2020

svanimpe commented Dec 15, 2020

joshgoebel commented Dec 2, 2020 •

edited

Loading

joshgoebel commented Dec 2, 2020 •

edited

Loading

svanimpe commented Dec 3, 2020 •

edited by joshgoebel

Loading

joshgoebel commented Dec 4, 2020 •

edited

Loading

svanimpe commented Dec 4, 2020 •

edited

Loading

joshgoebel commented Dec 5, 2020 •

edited

Loading