[semantic] proposals for new standard semantic token types #97063

Open
aeschli opened this issue May 6, 2020 · 31 comments
Labels
feature-request Request for new features or functionality semantic-tokens Semantic tokens issues

@aeschli
Contributor

aeschli commented May 6, 2020

The new semantic token provider API comes with a list of standard token types and modifiers.
https://code.visualstudio.com/api/language-extensions/semantic-highlight-guide#semantic-token-classification

These types serve as a common base across languages. If all or most providers use them, it becomes easier to write theming rules that work across languages.

That said, semantic token providers are not forced to stick to the standard, but can add new types/modifiers, or extend existing types as seen in the doc.
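
To make the "extend existing types" idea concrete, here is a minimal sketch of how a theme rule for a standard type could still apply to a custom type that names it as its supertype. The maps, the colors, and the `resolveColor` helper are hypothetical and only model the lookup; this is not VS Code's actual implementation.

```typescript
// Hypothetical model: custom token types from this proposal, each naming a standard supertype.
const superTypes = new Map<string, string>([
  ["importKeyword", "keyword"],
  ["modifierKeyword", "keyword"],
  ["docComment", "comment"],
]);

// A toy theme that only styles standard types (the colors are made up).
const themeRules = new Map<string, string>([
  ["keyword", "#569cd6"],
  ["comment", "#6a9955"],
]);

// Walk from the most specific type up the supertype chain until a theme rule matches.
function resolveColor(type: string): string | undefined {
  const seen = new Set<string>(); // guards against cyclic supertype declarations
  let current: string | undefined = type;
  while (current !== undefined && !seen.has(current)) {
    const color = themeRules.get(current);
    if (color !== undefined) return color;
    seen.add(current);
    current = superTypes.get(current);
  }
  return undefined; // no rule applies; the editor would use some other fallback
}
```

Under these assumptions, a theme that has never heard of importKeyword would still color it like any other keyword, which is the main reason to extend a standard type rather than invent an unrelated one.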

This issue is to collect proposals for new types and modifiers. When making a suggestion, please add a description and a small code sample. If it exists, name the corresponding TextMate scope.

The standard token types should be applicable across multiple languages and be useful for theming. We want to keep the set of standard tokens consistent and coherent.

Proposed types:

| Identifier (extends) | Description | Ref | Sample | TextMate scope |
| --- | --- | --- | --- | --- |
| importKeyword (keyword) | keywords related to imports/includes | (1) | `import * from x` | |
| modifierKeyword (keyword) | keywords describing a modifier | (1) | `private void foo();` | |
| docComment (comment) | documentation comments | #96712 | `/** */` | |

Proposed modifiers:

| Identifier | Description | Ref | Sample |
| --- | --- | --- | --- |
| unused | annotates all unused symbols | (2) | `let unusedVariable;` |

References:
(1) microsoft/language-server-protocol#968
(2) microsoft/vscode-languageserver-node#604

@kjeremy

kjeremy commented May 7, 2020

@matklad

@matklad

matklad commented May 7, 2020

Types

| Identifier (extends) | Description | Sample |
| --- | --- | --- |
| attribute/annotation | syntax for attributes and annotations | `#[test] fn smoke() {}`, `@Test void method() {}` |
| builtinType (type) | non user-defined types | `i32`, `long long int` |
| typeAlias (type) | name of type aliases and typedefs | `type Unit = ();`, `typedef int int_alias` |
| union (type) | name of C-style untagged unions | `union U { a: i32, b: f32}`, `union u { int a; float b; };` |

Modifiers

| Identifier (extends) | Description | Sample |
| --- | --- | --- |
| unresolved | for all unresolved symbols (such symbols should also have a corresponding diagnostic) | `let my_resolved = 92; my_un_resolved;` |

All extra tokens&types defined by rust-analyzer: https://github.com/rust-analyzer/rust-analyzer/blob/9cb55966fe0fee791072f275ac55b90b8ee13e32/editors/code/package.json#L522-L572

Hm, actually, unresolved might want to be a type rather than a modifier. It feels similar to unused, but if something is unresolved you, by definition, can't say which type it is. It is a type in rust-analyzer.
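
The type-versus-modifier question has a practical side: in the LSP encoding every token carries exactly one type index plus a bitset of modifiers, so a modifier can never appear without some type. The sketch below illustrates that constraint. The legend contents and the `encodeToken` helper are assumptions made up for illustration; a real server would also emit relative line and character positions, which are omitted here.

```typescript
// Hypothetical legend. "unresolved" is listed both ways only to compare the two options.
const legend = {
  tokenTypes: ["variable", "function", "unresolved"],
  tokenModifiers: ["declaration", "unused", "unresolved"],
};

interface SemanticToken {
  type: string;
  modifiers: string[];
}

// Encode a token as [typeIndex, modifierBits]; bit i is set when legend modifier i applies.
function encodeToken(token: SemanticToken): [number, number] {
  const typeIndex = legend.tokenTypes.indexOf(token.type);
  if (typeIndex < 0) throw new Error(`unknown token type: ${token.type}`);
  let bits = 0;
  for (const modifier of token.modifiers) {
    const i = legend.tokenModifiers.indexOf(modifier);
    if (i < 0) throw new Error(`unknown modifier: ${modifier}`);
    bits |= 1 << i;
  }
  return [typeIndex, bits];
}

// As a modifier, "unresolved" must be attached to a guessed type such as variable.
const asModifier = encodeToken({ type: "variable", modifiers: ["unresolved"] }); // [0, 4]
// As a type, it stands alone and makes no claim about what the symbol is.
const asType = encodeToken({ type: "unresolved", modifiers: [] }); // [2, 0]
```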

@ghost

ghost commented May 7, 2020

While I think it makes sense to add in things like typeAlias and union since a larger subset of system level languages is for sure likely to use this, these still sound specific enough that I'm wondering if a theme author may not give a different shade to all of these or have trouble figuring out what shade to give: imagine somebody who just did some Python scripting and wants to make a new theme, they'll probably have a hard time judging if something like union needs a separate color and to what it should be close in shade. It's probably enough to color type, but I'm not sure how immediately obvious that is to a theme author...?

As a result, I'm wondering: should the resulting LSP semantic tokens formal specification also include some guidance on which token types and modifier types should be considered as important to have different colors? To sort of establish a baseline on what a theme is expected to cover. Or would the expectation really be that all theme authors pick something for a relatively specific thing like a union in particular, or that they know that it's not required?

While this doesn't directly matter to the protocol implementation on either side of course, I feel like it could probably be pretty relevant for how it all plays together in the end to give some guidance for theme authors here.

@woody77

woody77 commented May 11, 2020

| Identifier | Description | Ref | Sample |
| --- | --- | --- | --- |
| documentation | for tokens that are part of documentation | (1) | javadoc, rust docs, doxygen |
| disabled | for tokens that are turned off by compilation flags | | `#ifdef(foo) ....... #endif` in c/c++, `#[cfg(foo)]` in rust |
| example/sample | sample code in comments (doc or code) | | https://doc.rust-lang.org/src/std/time.rs.html#175 |
| markdown | these types are markdown (in e.g. comments) | | see above |

Note that "documentation" exists today, but without any documentation as to when it's to be used, and what semantic meaning it has, so this is a proposal that comment.documentation would apply to:

  • javadoc
  • rust doc comments
  • doxygen in c++
  • other languages that specifically call out "doc comments" separate from "code comments"

disabled is something that I see VSCode do with C++ and #ifdefs, but doesn't seem to be via the semantic types. I see #ifdef'd out code with a muted set of colors (50% transparency?) but the type inspector says it should be the same as not-disabled code.

example or sample: sample code in doc comments is parsed, semantically highlighted, and flagged for correctness in some IDEs, but is rendered by the themes in a mix of comment and normal formatting (say, normal colors but in italics).

markdown may be best handled in other ways. For example, Rust uses Markdown-style headings and links in its doc comments, so maybe the better way to handle that is by marking them as markdown types (heading, links, etc.) and applying documentation as the modifier.

References:
(1) https://code.visualstudio.com/api/language-extensions/semantic-highlight-guide#semantic-token-classification

@DanTup
Contributor

DanTup commented Aug 13, 2020

Should there be a token type for things like TypeScript's decorators (Dart has similar called annotations):

// TypeScript

function foo() {}

@foo
function bar() {}

// Dart

@mustCallSuper
void foo() {}

I don't think any of the existing ones fit?

@aeschli
Contributor Author

aeschli commented Aug 14, 2020

Yes, I agree that a token type annotation would be useful.

@TylerLeonhardt
Member

We have been talking about annotation in microsoft/language-server-protocol#1067

@woody77

woody77 commented Aug 17, 2020

rust-analyzer also provides something similar (called attribute there), as both a token type and a modifier, since there can be functions within them:

[Screenshot: Screen Shot 2020-08-17 at 1 14 43 PM]

The derive is a function.attribute, and the rest of the item has attribute token type. Although the Debug should maybe be marked as an interface ("trait" in Rust).

@DanTup
Contributor

DanTup commented Aug 25, 2020

There are types for string and number so should there also be one for boolean?

@dannymcgee

Hate to resurrect a stale comment thread, but hey, how bout that decorator/attribute/annotation token. :)

I love semantic highlighting but it's killing me that the @ symbol is the only thing distinguishing my function decorators from the functions they're decorating — it makes the code quite a bit less legible.

@dbaeumer
Member

@aeschli do you have any plans to extend this in VS Code?

@0dinD
Contributor

0dinD commented Jan 26, 2021

What's the status on the modifierKeyword token type? Was a bit confused about it a while ago when implementing some additions to semantic highlighting in the Java language server, since a modifier token type already seems to be part of the official LSP spec. From reading microsoft/language-server-protocol#968 however, it becomes even more unclear whether or not modifier or modifierKeyword is or will be a standard token type. After some discussion, we decided to use the modifier token type as it seems to be more standardized. But I ended up having to treat it as a custom token type in the vscode-java extension anyway (declaring scope mapping etc.), since it doesn't seem to be part of the standard token types in VS Code.

I think some coordination is required between LSP and VS Code here, to make sure that standard LSP token types are also standard in VS Code, as well as agreeing on a name (modifier vs modifierKeyword). At the very least, some scope mapping for modifier would be nice, so that extensions don't have to define it themselves.

@sam-mccall

annotation, builtinType, typeAlias, union mentioned above would all be useful for clangd (C++).

unresolved or maybe "unknown" too, and I think it should be a type rather than a modifier. (For those familiar with C++ templates, dependent names could be modeled as a modifier, and their tokens would be either Type+DependentName or Unknown+DependentName)

@sam-mccall

What do people think of modifiers for scope? Maybe function/class/module/global

int x; // variable+globalScope
static int x; // variable+moduleScope
class C {
  int x; // property+classScope
  static int x; // variable+classScope
};
void F() {
  int x; // variable+functionScope
}

These are loose, but distinguishing global variables from function-locals at a glance seems pretty useful!
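
As a rough illustration of how a server might derive such scope modifiers, the sketch below maps a simplified description of a variable declaration to the type+modifier pairs from the C++ sample. The `VarDeclaration` shape and the `classifyScope` helper are hypothetical; a real implementation would read this information from the compiler's symbol table.

```typescript
// Hypothetical, simplified view of a variable declaration.
interface VarDeclaration {
  container: "file" | "class" | "function";
  isStatic: boolean;
}

interface ScopedClassification {
  type: "variable" | "property";
  modifier: "globalScope" | "moduleScope" | "classScope" | "functionScope";
}

function classifyScope(decl: VarDeclaration): ScopedClassification {
  switch (decl.container) {
    case "file":
      // A file-level `static` is only visible in its translation unit, hence moduleScope.
      return { type: "variable", modifier: decl.isStatic ? "moduleScope" : "globalScope" };
    case "class":
      // Non-static members are per-instance properties; static members are class-level variables.
      return { type: decl.isStatic ? "variable" : "property", modifier: "classScope" };
    case "function":
      return { type: "variable", modifier: "functionScope" };
  }
}
```

For example, `classifyScope({ container: "class", isStatic: false })` yields property+classScope, matching the `int x;` member in the sample. Static locals are lumped in with other function-scope variables here, which is one of the places where the categories stay loose.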

@woody77

woody77 commented Jan 29, 2021

modifiers for scope would be useful. RustAnalyzer has some custom types that somewhat work along those lines:

  • fields of structures
  • function params
  • bare stack variables
  • static variables (well, constants)

Rust doesn't have global in the same way, but the same spectrum of types applies.

@stamblerre

From #125448: A token type to represent string placeholders. For example, the %s in "Hello, my name is %s" in Go. Per @aeschli's suggestion, it could be called stringPlaceholder.
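
A minimal sketch of what a provider could do to locate such placeholders inside a string literal follows. The regular expression is a deliberately simplified approximation of printf-style verbs, not Go's full fmt grammar, and `findPlaceholders` and `PlaceholderRange` are hypothetical names.

```typescript
interface PlaceholderRange {
  start: number;  // offset of the '%' within the string contents
  length: number; // length of the whole verb, e.g. 2 for "%s"
}

// Find printf-style placeholders; "%%" is a literal percent sign and is skipped.
function findPlaceholders(text: string): PlaceholderRange[] {
  // Simplified verb: '%', optional flags, width, precision, then a letter.
  const verb = /%(?:%|[-+# 0]*\d*(?:\.\d+)?[a-zA-Z])/g;
  const ranges: PlaceholderRange[] = [];
  let match: RegExpExecArray | null;
  while ((match = verb.exec(text)) !== null) {
    if (match[0] === "%%") continue;
    ranges.push({ start: match.index, length: match[0].length });
  }
  return ranges;
}
```

Each returned range could then be reported as a stringPlaceholder token layered on top of the surrounding string token.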

@DanTup
Contributor

DanTup commented Jun 9, 2021

> A token type to represent string placeholders

Slightly related (though not sure if these should be types or modifiers):

  • Interpolation markers. They're not strictly placeholders, but should be coloured. Eg. the $, {, } in "a $foo b ${foo}".
  • Escaped characters. These exist in the textmate grammars (constant.character.escape) but not in semantic tokens so I had to make my own. This allows the \n in "foo\nfbar" to be coloured.
  • A reset (again this exists in the textmate grammar as meta.embedded) to allow semantic tokens to remove colours added by the textmate grammar. For example if the textmate grammar doesn't do string interpolation and just colours an entire string but the semantic tokens then want to layer colours on top, they might want to have some "uncoloured" sections (for example the interpolated expression contains some operators that are usually uncoloured). I'm currently also handling this myself (I made a "source" type and mapped it to "meta.embedded" in package.json), but since these types/modifiers are shared with LSP and other LSP clients won't have this package.json, it would be better to support natively.

@aeschli
Contributor Author

aeschli commented Oct 21, 2021

I added a new type decorator to be used for decorators and annotations (see #114082).
The current TextMate fallback is meta.decorator, entity.name.function. If someone has a better fallback, let me know.
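
For readers unfamiliar with the fallback mechanism, here is a hedged sketch of the lookup: a semantic rule wins if the theme has one, otherwise the listed TextMate scopes are tried in order. The `ToyTheme` shape and the `colorFor` helper are hypothetical, scope matching is reduced to exact string equality (real TextMate matching is prefix-based), and only the decorator entry comes from the comment above.

```typescript
// Fallback scopes per semantic token type. Only the decorator entry is taken from this thread.
const fallbackScopes = new Map<string, string[]>([
  ["decorator", ["meta.decorator", "entity.name.function"]],
]);

// A toy theme: semantic rules keyed by token type, TextMate rules keyed by scope name.
interface ToyTheme {
  semanticRules: Map<string, string>;
  textMateRules: Map<string, string>;
}

function colorFor(type: string, theme: ToyTheme): string | undefined {
  const semantic = theme.semanticRules.get(type);
  if (semantic !== undefined) return semantic;
  // No semantic rule: probe the fallback scopes in their listed order.
  for (const scope of fallbackScopes.get(type) ?? []) {
    const color = theme.textMateRules.get(scope);
    if (color !== undefined) return color;
  }
  return undefined;
}

// A theme that predates the decorator type but styles function names (color is made up).
const olderTheme: ToyTheme = {
  semanticRules: new Map<string, string>(),
  textMateRules: new Map<string, string>([["entity.name.function", "#dcdcaa"]]),
};
const decoratorColor = colorFor("decorator", olderTheme); // "#dcdcaa"
```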

@lnicola
Contributor

lnicola commented Oct 21, 2021

@aeschli should decorator and label also be added to LSP?

@dbaeumer
Member

Added it.

@DanTup
Contributor

DanTup commented Apr 28, 2022

@aeschli I had a request for additional modifiers so that a theme author can customise colours of some keywords specifically:

Dart-Code/Dart-Code#3926

It feels awkward to provide a modifier for each language keyword. Are there any guidelines on how fine-grained these should be? Would it be a reasonable/feasible VS Code feature request to allow the text content of a token to be used by theme authors (e.g. keyword['void'])?

@dannymcgee

dannymcgee commented Apr 29, 2022

> @aeschli I had a request for additional modifiers so that a theme author can customise colours of some keywords specifically:
>
> Dart-Code/Dart-Code#3926
>
> It feels awkward to provide a modifier for each language keyword. Are there any guidelines on how fine-grained these should be? Would it be a reasonable/feasible VS Code feature request to allow the text content of a token to be used by theme authors (e.g. keyword['void'])?

@DanTup What I generally try to do is use the TextMate grammar to map out most of the syntax, and only use semantic tokens to give semantic meaning to identifiers (e.g., to distinguish between a class, an interface, and a type alias — something that you can't really do without parsing the source code). Keywords are trivial to catch with a regular expression, and then you can just use a back-reference to insert the matched text into the TM scope:

{
  "match": "\\b(if|else|switch|case|for|while|break)\\b",
  "name": "keyword.control.$1.languageid"
}

Then a theme author could use, e.g., "keyword.control.for" to make that specific keyword its own color if they really wanted to.
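
To show what the back-reference does, the sketch below applies a rule of that shape to one line and substitutes the captured keyword into the scope name. It uses JavaScript's RegExp as a stand-in for the Oniguruma engine that TextMate grammars actually run on, and `applyRule`, `MatchRule`, and `ScopedMatch` are hypothetical names for illustration.

```typescript
interface MatchRule {
  match: string; // regular expression source
  name: string;  // scope name, where $N refers to capture group N
}

interface ScopedMatch {
  text: string;
  index: number;
  scope: string;
}

const keywordRule: MatchRule = {
  match: "\\b(if|else|switch|case|for|while|break)\\b",
  name: "keyword.control.$1.languageid",
};

// Apply one rule to a line, expanding $N in the scope name from the capture groups.
function applyRule(line: string, rule: MatchRule): ScopedMatch[] {
  const re = new RegExp(rule.match, "g");
  const results: ScopedMatch[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(line)) !== null) {
    const groups = m; // non-null copy for use inside the callback
    const scope = rule.name.replace(/\$(\d+)/g, (_, n: string) => groups[Number(n)] ?? "");
    results.push({ text: m[0], index: m.index, scope });
  }
  return results;
}
```

Applied to `for (;;) { if (x) break; }`, this produces the scopes keyword.control.for.languageid, keyword.control.if.languageid, and keyword.control.break.languageid, so each keyword gets its own targetable scope without the server listing them.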

Marking up the entire syntax with semantic tokens is something I would try to avoid personally (or hide behind a configuration flag if you need to provide those tokens for editors other than VS Code), because VS Code treats semantic tokens a bit like an ID selector in CSS, which does really limit the flexibility of theme authors and end users to customize the syntax colors in a granular way.

@DanTup
Contributor

DanTup commented Apr 29, 2022

@dannymcgee I don't think adding configuration to the server to produce a reduced set of tokens would be a good fit here. It would mean the server has to have some knowledge of the specific client and its textmate grammar (which may change over time). I'd prefer to add additional modifiers than that, but I was hoping there could be a better way (themes are the sort of things people really like to make their own, so being able to customise some specific tokens without the servers needing to mark them all up individually seems like a powerful feature).

@aeschli
Contributor Author

aeschli commented Apr 29, 2022

@DanTup Currently we need all semantic token types and modifiers to be known beforehand. So yes, there's no alternative to listing them.

@dannymcgee

> @dannymcgee It would mean the server has to have some knowledge of the specific client and its textmate grammar

Couldn't you just use a simple toggle that either a) tokenizes everything or b) tokenizes only identifiers? (The latter is the option I would personally prefer as an end user.) It doesn't require any knowledge of the specific grammar (or even the specific client), just a general idea that certain clients may be supplementing the semantic tokens with some other tokenizer, so they only need specification of semantic (as opposed to syntactic) information.

For what it's worth, it wouldn't be without precedent — that's how the TypeScript implementation works, and Rust Analyzer has an option to skip tokenizing strings. (But no pressure, obviously, it is your project. 🙂)

@HighCommander4

> Couldn't you just use a simple toggle that either a) tokenizes everything or b) tokenizes only identifiers? (The latter is the option I would personally prefer as an end user.) It doesn't require any knowledge of the specific grammar (or even the specific client), just a general idea that certain clients may be supplementing the semantic tokens with some other tokenizer, so they only need specification of semantic (as opposed to syntactic) information.

If I'm understanding you correctly, the augmentsSyntaxTokens capability in the upcoming 3.17 version of the spec is precisely such a toggle.
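
A server honoring that capability might look roughly like the sketch below: when the client reports augmentsSyntaxTokens, purely syntactic tokens are dropped and only the ones that need semantic analysis are sent. Which types count as "syntactic" is an assumption of this sketch, and `selectTokens` and `TypedToken` are hypothetical names.

```typescript
interface TypedToken {
  text: string;
  type: string;
}

// Assumption: these types can be left to the client's syntax (e.g. TextMate) tokenizer.
const syntacticTypes = new Set<string>(["keyword", "comment", "string", "number", "operator"]);

// With augmentsSyntaxTokens the client layers semantic tokens over syntax tokens,
// so only semantically derived tokens are sent; otherwise everything is sent.
function selectTokens(tokens: TypedToken[], augmentsSyntaxTokens: boolean): TypedToken[] {
  return augmentsSyntaxTokens ? tokens.filter(t => !syntacticTypes.has(t.type)) : tokens;
}

const sampleTokens: TypedToken[] = [
  { text: "class", type: "keyword" },
  { text: "Foo", type: "class" },
  { text: "// note", type: "comment" },
  { text: "bar", type: "method" },
];
const augmenting = selectTokens(sampleTokens, true); // Foo and bar only
```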

@DanTup
Contributor

DanTup commented May 1, 2022

> Couldn't you just use a simple toggle that either a) tokenizes everything or b) tokenizes only identifiers?

I don't think so - the semantic tokens are adding more value than just identifiers. There are a lot of things that are complicated to handle 100% accurately in the textmate grammar (expressions in string interpolation can include keywords, for example, and documentation comments can include full code blocks).

Even with a built-in toggle, it seems like assumptions would have to be made about what the client is otherwise colouring, and unless/until LSP allowed us to provide the textmate grammar to the client, that's something I'd prefer not to make assumptions about (at least, not for something minor like a small number of users wanting to customise colours of a few specific keywords).

My real question is about how fine-grained these tokens/modifiers can/should be. I can easily handle this by just adding a custom modifier for every keyword (we already have a lot of custom modifiers to help theming), and that feels better to me than producing a restricted set of tokens. But I don't think it's as good as VS Code having more flexible built-in theming (since anything I do specifically for my language will not necessarily be consistent with other languages).

@Jamesernator

Jamesernator commented Sep 1, 2022

So I have my own custom theme that uses a mapping from semantic token -> textmate tokens so that I can write my theme entirely semantically and have it work on non-semantic languages automatically. For the most part the semantic tokens cover most things I've come across however there are a few semantic token types that would be helpful as quite a few tokens simply have no corresponding semantic token to denote them.

Of note is a lack of semantic tokens for HTML/XML-like tokens (semantically I don't feel the existing tokens cover any of these, even if some could be contrived, like class<->tag):

  • tag
    • Corresponding to <tag> in HTML/XML etc
  • text
    • Corresponding to text in HTML/XML etc, NOTE this differs from string literals in that attributes would generally be colored as string literals, but text content would differ, this can be seen in a sample on github like:
    <tag attr="value">Some text</tag>
    In this example the text semantic token would refer to Some text, but the existing string token would be used for "value" (in the attribute)
  • attribute
    • Like the other two HTML/XML tokens suggested, attribute would refer to HTML attribute names (not their values)

From adding rules for JS, I found that distinguishing the following would be particularly helpful:

  • boolean
    • Other literal types like number, regexp, string exist, but not boolean which is supported by many languages
  • constant
    • Would cover literal types without a more specific type like number/regexp/string/etc
  • .operator modifier for keyword
    • Some keywords like new are more semantically like operators than other keywords
  • .expression modifier for keyword
    • Some keywords are semantically more like values than "keywords", for example this
  • .storage modifier for keyword
    • Some keywords specifically denote kinds of storage, for example const/let/var/readonly/private etc
  • .control modifier for keyword
    • For keywords that declare control structures like if/for/while/etc
  • null
    • For the null literal (similar to number/string/etc), very common literal in languages
  • .assignment modifier for operator
    • Would cover operators like =, +=, etc, generally want to visually distinguish these from expression operators
  • .comparison modifier for operator
    • Would cover operators like ==/</>/etc
  • .logical modifier for operator
    • Would cover operators like &&/!/not/and/etc
  • .arithmetic modifier for operator
    • Would cover operators like +/-/*/etc
  • punctuation
    • There should be a semantic way to refer to punctuation, like ., {, (, etc etc, modifiers would probably be desirable here (though personally I just color them all grey)
  • .characterClass modifier for regexp
    • This would target [a-z] and similar inside regexps
  • .escape modifier for string and regexp
    • This would target \n, \u2202 and similar
  • .delimiter for string and regexp
    • This would allow targeting the quotes and slashes for strings and regexps
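
As one illustration of the operator modifiers in the list above, a provider could classify operator text with a simple lookup. This is only a sketch: the operator groups are a small, language-agnostic sample chosen for illustration, and `classifyOperator` is a hypothetical helper that returns a type.modifier style selector.

```typescript
type OperatorModifier = "assignment" | "comparison" | "logical" | "arithmetic";

// A small sample of operators per proposed modifier; a real provider would use its own grammar.
const operatorGroups: [OperatorModifier, string[]][] = [
  ["assignment", ["=", "+=", "-=", "*=", "/="]],
  ["comparison", ["==", "!=", "<", ">", "<=", ">="]],
  ["logical", ["&&", "||", "!", "not", "and", "or"]],
  ["arithmetic", ["+", "-", "*", "/", "%"]],
];

const operatorModifiers = new Map<string, OperatorModifier>();
for (const [modifier, operators] of operatorGroups) {
  for (const op of operators) operatorModifiers.set(op, modifier);
}

// Return "operator.<modifier>" when the operator is known, plain "operator" otherwise.
function classifyOperator(op: string): string {
  const modifier = operatorModifiers.get(op);
  return modifier ? `operator.${modifier}` : "operator";
}
```

A theme could then style operator.assignment differently from operator.arithmetic, while unlisted operators still match a plain operator rule.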

@aeschli
Contributor Author

aeschli commented Sep 1, 2022

@Jamesernator Thanks a lot for sharing!

@iDad5

iDad5 commented Sep 13, 2022

I do not have a clear idea of what kind of modifier to add. Something like what @DanTup suggested here seems like an option to me, but I'm far from having a deep understanding. Trying to create a theme, though, I found that the scope of variable.defaultLibrary in JS and TS is very broad and overrides quite a lot; other *.defaultLibrary scopes in various languages probably do too. I'd guess that I'm not the only one who would like to give visual preference to certain built-in constructs over others.

I came upon this when I tried to give special emphasis to console, which by nature has (for me) a very different scope and use than built-in constants like Math.

@MartinGC94

How about Command Arguments (alternative names could be bare quote strings or generic tokens)?
Command line languages like PowerShell, Batch, Bash, etc. allow you to run commands like: command -parameter argument where the argument can either be a quoted or unquoted string value. Whether or not the string is quoted is important info because it affects how Bash handles wildcard characters and PowerShell includes similar logic when calling native programs.
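
A rough sketch of how a provider might separate quoted from bare arguments is shown below. The splitting rules are heavily simplified (no escapes, no nested quoting, no pipelines), and the token kinds and the `tokenizeCommand` helper are hypothetical names rather than proposed identifiers.

```typescript
interface CommandToken {
  text: string;
  kind: "command" | "parameter" | "quotedArgument" | "bareArgument";
}

// Split a command line into words and label each one.
function tokenizeCommand(line: string): CommandToken[] {
  // A word is a double-quoted run, a single-quoted run, or a run of non-whitespace.
  const words: string[] = line.match(/"[^"]*"|'[^']*'|\S+/g) ?? [];
  return words.map((text, i): CommandToken => {
    if (i === 0) return { text, kind: "command" };
    if (text.charAt(0) === "-") return { text, kind: "parameter" };
    const quoted = text.length >= 2 && /^(["']).*\1$/.test(text);
    return { text, kind: quoted ? "quotedArgument" : "bareArgument" };
  });
}
```

For `grep -e "*.txt" *.log`, the quoted `"*.txt"` and the bare `*.log` end up with different kinds, which is the distinction that matters for wildcard handling.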
