[semantic] proposals for new standard semantic token types #97063

aeschli · 2020-05-06T10:01:18Z

The new semantic token provider API comes with a list of standard token types and modifiers.
https://code.visualstudio.com/api/language-extensions/semantic-highlight-guide#semantic-token-classification

These types serve as a base across languages, and having all/most providers use them will make it easier to write theming rules across languages.

That said, semantic token providers are not forced to stick to the standard, but can add new types/modifiers, or extend existing types as seen in the doc.

This issue is to collect proposals for new types and modifiers. When making a suggestion, please add a description and a small code sample. If it exists, name the corresponding TextMate scope.

The standard token types should be applicable across multiple languages and be useful for theming. We want to keep the set of standard tokens consistent and coherent.

Proposed types:

| Identifier (extends) | Description | Ref | Sample |
| --- | --- | --- | --- |
| importKeyword (keyword) | keywords related to imports/includes | (1) | `import * from x` |
| modifierKeyword (keyword) | keywords describing a modifier | (1) | `private void foo();` |
| docComment (comment) | documentation comments | #96712 | `/** */` |

Proposed modifiers:

| Identifier | Description | Ref | Sample |
| --- | --- | --- | --- |
| unused | annotates all unused symbols | (2) | `let unusedVariable;` |

References:
(1) microsoft/language-server-protocol#968
(2) microsoft/vscode-languageserver-node#604
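
The "(extends)" column in the proposed types table can be read as a fallback chain: if a theme has no rule for `importKeyword`, the rule for `keyword` would apply. Below is a minimal sketch of that lookup. The `superTypes` and `themeRules` tables, the colors, and the function names are hypothetical and only illustrate the idea; they are not VS Code API.

```typescript
// Hypothetical table: each proposed type names the standard type it extends.
const superTypes: Record<string, string> = {
  importKeyword: "keyword",
  modifierKeyword: "keyword",
  docComment: "comment",
};

// Hypothetical theme that only styles the standard types.
const themeRules: Record<string, string> = {
  keyword: "#569CD6",
  comment: "#6A9955",
};

// Read an own property, ignoring anything inherited from Object.prototype.
function ownValue(table: Record<string, string>, key: string): string | undefined {
  return Object.prototype.hasOwnProperty.call(table, key) ? table[key] : undefined;
}

// Walk up the supertype chain until a theme rule matches.
function resolveColor(tokenType: string): string | undefined {
  let current: string | undefined = tokenType;
  while (current !== undefined) {
    const color = ownValue(themeRules, current);
    if (color !== undefined) return color;
    current = ownValue(superTypes, current);
  }
  return undefined;
}
```

With these tables, `resolveColor("importKeyword")` falls back to the `keyword` color, while a type with no rule anywhere in its chain stays unstyled.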

kjeremy · 2020-05-07T09:09:34Z

@matklad

matklad · 2020-05-07T10:57:11Z

Types

| Identifier (extends) | Description | Sample |
| --- | --- | --- |
| attribute/annotation | syntax for attributes and annotations | `#[test] fn smoke() {}`, `@Test void method() {}` |
| builtinType (type) | non user-defined types | `i32`, `long long int` |
| typeAlias (type) | name of type aliases and typedefs | `type Unit = ();`, `typedef int int_alias;` |
| union (type) | name of C-style untagged unions | `union U { a: i32, b: f32 }`, `union u { int a; float b; };` |

Modifiers

| Identifier (extends) | Description | Sample |
| --- | --- | --- |
| unresolved | for all unresolved symbols (such symbols should also have a corresponding diagnostic) | `let my_resolved = 92; my_un_resolved;` |

All extra tokens&types defined by rust-analyzer: https://github.com/rust-analyzer/rust-analyzer/blob/9cb55966fe0fee791072f275ac55b90b8ee13e32/editors/code/package.json#L522-L572

Hm, actually, unresolved might want to be a type rather than a modifier. It feels similar to unused, but if something is unresolved you, by definition, can't say which type it is. It is a type in rust-analyzer.
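
One practical difference between the two options: in the LSP wire format each token carries exactly one type index plus a bit set of modifiers, so a modifier can never be sent without a type. Here is a rough sketch of that encoding. The legend contents are hypothetical (they list `unresolved` both ways purely for comparison and are not rust-analyzer's actual legend), and the helper names are made up.

```typescript
// Hypothetical legend; only the indices are sent over the wire.
const tokenTypes = ["variable", "function", "unresolved"];
const tokenModifiers = ["declaration", "unused", "unresolved"];

interface PlainToken {
  line: number;
  startChar: number;
  length: number;
  type: string;
  modifiers: string[];
}

// Encode tokens the way LSP does: five integers per token, positions relative
// to the previous token, modifiers packed into a bit set.
// Tokens are assumed to be sorted by position already.
function encode(tokens: PlainToken[]): number[] {
  const data: number[] = [];
  let prevLine = 0;
  let prevChar = 0;
  for (const t of tokens) {
    const deltaLine = t.line - prevLine;
    const deltaStart = deltaLine === 0 ? t.startChar - prevChar : t.startChar;
    const typeIndex = tokenTypes.indexOf(t.type);
    if (typeIndex < 0) throw new Error("type not in legend: " + t.type);
    let modifierBits = 0;
    for (const m of t.modifiers) {
      const bit = tokenModifiers.indexOf(m);
      if (bit < 0) throw new Error("modifier not in legend: " + m);
      modifierBits |= 1 << bit;
    }
    data.push(deltaLine, deltaStart, t.length, typeIndex, modifierBits);
    prevLine = t.line;
    prevChar = t.startChar;
  }
  return data;
}

// `my_un_resolved` from the sample above, reported as a type ...
const asType = encode([
  { line: 0, startChar: 22, length: 14, type: "unresolved", modifiers: [] },
]);
// ... and as a modifier, where the server still has to guess a type.
const asModifier = encode([
  { line: 0, startChar: 22, length: 14, type: "variable", modifiers: ["unresolved"] },
]);
```

In the modifier variant the server has to pick `variable` (or something else) for a symbol it could not resolve, which is the awkwardness described above.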

ghost · 2020-05-07T16:57:05Z

While I think it makes sense to add in things like typeAlias and union since a larger subset of system level languages is for sure likely to use this, these still sound specific enough that I'm wondering if a theme author may not give a different shade to all of these or have trouble figuring out what shade to give: imagine somebody who just did some Python scripting and wants to make a new theme, they'll probably have a hard time judging if something like union needs a separate color and to what it should be close in shade. It's probably enough to color type, but I'm not sure how immediately obvious that is to a theme author...?

As a result, I'm wondering: should the resulting LSP semantic tokens formal specification also include some guidance on which token types and modifier types should be considered as important to have different colors? To sort of establish a baseline on what a theme is expected to cover. Or would the expectation really be that all theme authors pick something for a relatively specific thing like a union in particular, or that they know that it's not required?

While this doesn't directly matter to the protocol implementation on either side of course, I feel like it could probably be pretty relevant for how it all plays together in the end to give some guidance for theme authors here.

woody77 · 2020-05-11T21:20:14Z

| Identifier | Description | Ref | Sample |
| --- | --- | --- | --- |
| documentation | for tokens that are part of documentation | (1) | javadoc, rust docs, doxygen |
| disabled | for tokens that are turned off by compilation flags | | `#ifdef(foo) ....... #endif` in C/C++, `#[cfg(foo)]` in Rust |
| example/sample | sample code in comments (doc or code) | | https://doc.rust-lang.org/src/std/time.rs.html#175 |
| markdown | these types are markdown (in e.g. comments) | | see above |

Note that "documentation" exists today, but without any documentation as to when it's to be used, and what semantic meaning it has, so this is a proposal that comment.documentation would apply to:

- javadoc
- rust doc comments
- doxygen in c++
- other languages that specifically call out "doc comments" separate from "code comments"

disabled is something that I see VSCode do with C++ and #ifdefs, but it doesn't seem to be done via the semantic types. I see #ifdef'd out code with a muted set of colors (50% transparency?) but the type inspector says it should be the same as not-disabled code.

example or sample: sample code in doc comments is parsed, semantically highlighted, and flagged for correctness in some IDEs, but is rendered by the themes in a mix of comment and normal formatting (say normal colors, but in italics).

markdown may be best handled in other ways. Rust, for example, uses Markdown headings and links in its doc comments, so maybe the better way to handle that is by marking them as markdown types (heading, links, etc.) and applying documentation as the modifier.

References:
(1) https://code.visualstudio.com/api/language-extensions/semantic-highlight-guide#semantic-token-classification

DanTup · 2020-08-13T15:42:04Z

Should there be a token type for things like TypeScript's decorators (Dart has similar called annotations):

// TypeScript

function foo() {}

@foo
function bar() {}

// Dart

@mustCallSuper
void foo() {}

I don't think any of the existing ones fit?

aeschli · 2020-08-14T08:06:23Z

Yes, I agree that a token type annotation would be useful.

TylerLeonhardt · 2020-08-14T20:54:37Z

We have been talking about annotation in microsoft/language-server-protocol#1067

woody77 · 2020-08-17T20:14:31Z

rust-analyzer also provides something similar (called attribute there), as both a token type and a modifier, since there can be functions within them:

The derive is a function.attribute, and the rest of the item has the attribute token type. Although the Debug should maybe be marked as an interface ("trait" in Rust).

DanTup · 2020-08-25T16:01:17Z

There are types for string and number so should there also be one for boolean?

dannymcgee · 2020-11-06T13:08:27Z

Hate to resurrect a stale comment thread, but hey, how bout that decorator/attribute/annotation token. :)

I love semantic highlighting but it's killing me that the @ symbol is the only thing distinguishing my function decorators from the functions they're decorating — it makes the code quite a bit less legible.

dbaeumer · 2020-11-17T12:46:28Z

@aeschli do you have any plans to extend this in VS Code?

0dinD · 2021-01-26T21:54:05Z

What's the status on the modifierKeyword token type? Was a bit confused about it a while ago when implementing some additions to semantic highlighting in the Java language server, since a modifier token type already seems to be part of the official LSP spec. From reading microsoft/language-server-protocol#968 however, it becomes even more unclear whether or not modifier or modifierKeyword is or will be a standard token type. After some discussion, we decided to use the modifier token type as it seems to be more standardized. But I ended up having to treat it as a custom token type in the vscode-java extension anyway (declaring scope mapping etc.), since it doesn't seem to be part of the standard token types in VS Code.

I think some coordination is required between LSP and VS Code here, to make sure that standard LSP token types are also standard in VS Code, as well as agreeing on a name (modifier vs modifierKeyword). At the very least, some scope mapping for modifier would be nice, so that extensions don't have to define it themselves.

sam-mccall · 2021-01-29T17:11:43Z

annotation, builtinType, typeAlias, union mentioned above would all be useful for clangd (C++).

unresolved or maybe "unknown" too, and I think it should be a type rather than a modifier. (For those familiar with C++ templates, dependent names could be modeled as a modifier, and their tokens would be either Type+DependentName or Unknown+DependentName)

sam-mccall · 2021-01-29T17:15:09Z

What do people think of modifiers for scope? Maybe function/class/module/global

int x; // variable+globalScope
static int x; // variable+moduleScope
class C {
  int x; // property+classScope
  static int x; // variable+classScope
};
void F() {
  int x; // variable+functionScope
}

These are loose, but distinguishing global variables from function-locals at a glance seems pretty useful!

woody77 · 2021-01-29T17:43:07Z

modifiers for scope would be useful. RustAnalyzer has some custom types that somewhat work along those lines:

- fields of structures
- function params
- bare stack variables
- static variables (well, constants)

Rust doesn't have global in the same way, but the same spectrum of types applies.

stamblerre · 2021-06-09T04:02:20Z

From #125448: A token type to represent string placeholders. For example, the %s in "Hello, my name is %s" in Go. Per @aeschli's suggestion, it could be called stringPlaceholder.

DanTup · 2021-06-09T09:10:41Z

> A token type to represent string placeholders

Slightly related (though not sure if these should be types or modifiers):

- Interpolation markers. They're not strictly placeholders, but should be coloured. Eg. the $, {, } in "a $foo b ${foo}".
- Escaped characters. These exist in the textmate grammars (constant.character.escape) but not in semantic tokens so I had to make my own. This allows the \n in "foo\nfbar" to be coloured.
- A reset (again this exists in the textmate grammar as meta.embedded) to allow semantic tokens to remove colours added by the textmate grammar. For example if the textmate grammar doesn't do string interpolation and just colours an entire string but the semantic tokens then want to layer colours on top, they might want to have some "uncoloured" sections (for example the interpolated expression contains some operators that are usually uncoloured). I'm currently also handling this myself (I made a "source" type and mapped it to "meta.embedded" in package.json), but since these types/modifiers are shared with LSP and other LSP clients won't have this package.json, it would be better to support natively.
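
To make the package.json workaround concrete, here is a sketch written as a TypeScript object that mirrors the shape of the `semanticTokenTypes` and `semanticTokenScopes` contribution points. The `source` to `meta.embedded` mapping and the `constant.character.escape` scope come from the comment above; the `escapeSequence` id, the descriptions, and the `fallbackScopes` helper are assumptions made for illustration, and the lookup is a simplification of what an editor would actually do.

```typescript
// Mirrors the shape of the contribution points in an extension's package.json.
interface Contributes {
  semanticTokenTypes: { id: string; description: string }[];
  semanticTokenScopes: { language?: string; scopes: Record<string, string[]> }[];
}

const contributes: Contributes = {
  semanticTokenTypes: [
    { id: "source", description: "Resets colours applied by the grammar" },
    // Hypothetical id; the comment above does not name this custom type.
    { id: "escapeSequence", description: "Escaped characters in strings" },
  ],
  semanticTokenScopes: [
    {
      language: "dart",
      scopes: {
        source: ["meta.embedded"],
        escapeSequence: ["constant.character.escape"],
      },
    },
  ],
};

// Simplified fallback lookup: which TextMate scopes stand in for a semantic
// token type when the theme has no semantic rule for it.
function fallbackScopes(c: Contributes, language: string, tokenType: string): string[] {
  for (const entry of c.semanticTokenScopes) {
    if (entry.language !== undefined && entry.language !== language) continue;
    if (Object.prototype.hasOwnProperty.call(entry.scopes, tokenType)) {
      return entry.scopes[tokenType];
    }
  }
  return [];
}
```

The limitation raised above is visible here: the mapping lives in one extension's manifest, so another LSP client talking to the same server never sees it.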

aeschli · 2021-10-21T12:27:04Z

I added a new type decorator to be used for decorators and annotations (see #114082).
The current TextMate fallback is meta.decorator, entity.name.function. If someone has a better fallback, let me know.

lnicola · 2021-10-21T12:41:35Z

@aeschli should decorator and label also be added to LSP?

dbaeumer · 2021-10-21T14:51:31Z

Added it.

DanTup · 2022-04-28T14:42:01Z

@aeschli I had a request for additional modifiers so that a theme author can customise colours of some keywords specifically:

Dart-Code/Dart-Code#3926

It feels awkward to provide a modifier for each language keyword. Are there any guidelines on how fine-grained these should be? Would it be a reasonable/feasible VS Code feature request to allow the text content of a token to be used by theme authors (for example, keyword['void'])?

dannymcgee · 2022-04-29T03:08:38Z

> @aeschli I had a request for additional modifiers so that a theme author can customise colours of some keywords specifically:
>
> Dart-Code/Dart-Code#3926
>
> It feels awkward to provide a modifier for each language keyword. Are there any guidelines on how fine-grained these should be? Would it be a reasonable/feasible VS Code feature request to allow the text content of a token to be used by theme authors (for example, keyword['void'])?

@DanTup What I generally try to do is use the TextMate grammar to map out most of the syntax, and only use semantic tokens to give semantic meaning to identifiers (e.g., to distinguish between a class, an interface, and a type alias — something that you can't really do without parsing the source code). Keywords are trivial to catch with a regular expression, and then you can just use a back-reference to insert the matched text into the TM scope:

{
  "match": "\\b(if|else|switch|case|for|while|break)\\b",
  "name": "keyword.control.$1.languageid"
}

Then a theme author could use, e.g., "keyword.control.for" to make that specific keyword its own color if they really wanted to.
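
As a rough illustration of what the back-reference does, the sketch below applies the same pattern with a JavaScript RegExp and substitutes `$1` in the scope name. This only approximates the behaviour: real TextMate grammars are evaluated by the editor's own regex engine, and the `scopeFor` and `tokenize` helpers are hypothetical names.

```typescript
// The rule from the snippet above, with the pattern as a JavaScript RegExp.
const keywordRule = {
  match: /\b(if|else|switch|case|for|while|break)\b/g,
  name: "keyword.control.$1.languageid",
};

// Replace $1, $2, ... in the scope name with the captured groups.
function scopeFor(scopeName: string, groups: string[]): string {
  return scopeName.replace(/\$(\d+)/g, (whole: string, n: string) => {
    const g = groups[Number(n)];
    return g !== undefined ? g : whole;
  });
}

// Return one [text, scope] pair per keyword found in the line.
function tokenize(line: string): [string, string][] {
  const out: [string, string][] = [];
  keywordRule.match.lastIndex = 0; // reset state kept by the global flag
  let m: RegExpExecArray | null;
  while ((m = keywordRule.match.exec(line)) !== null) {
    out.push([m[0], scopeFor(keywordRule.name, m.slice(0))]);
  }
  return out;
}
```

For a line such as `for (;;) { if (x) break; }` this yields the scopes `keyword.control.for.languageid`, `keyword.control.if.languageid`, and `keyword.control.break.languageid`, each of which a theme could target separately.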

Marking up the entire syntax with semantic tokens is something I would try to avoid personally (or hide behind a configuration flag if you need to provide those tokens for editors other than VS Code), because VS Code treats semantic tokens a bit like an ID selector in CSS, which does really limit the flexibility of theme authors and end users to customize the syntax colors in a granular way.

DanTup · 2022-04-29T10:58:24Z

@dannymcgee I don't think adding configuration to the server to produce a reduced set of tokens would be a good fit here. It would mean the server has to have some knowledge of the specific client and its textmate grammar (which may change over time). I'd prefer to add additional modifiers than that, but I was hoping there could be a better way (themes are the sort of things people really like to make their own, so being able to customise some specific tokens without the servers needing to mark them all up individually seems like a powerful feature).

aeschli · 2022-04-29T13:24:08Z

@DanTup Currently we need all semantic token types and modifiers to be known beforehand. So yes, there's no alternative to list them.

dannymcgee · 2022-04-29T23:44:57Z

> @dannymcgee It would mean the server has to have some knowledge of the specific client and its textmate grammar

Couldn't you just use a simple toggle that either a) tokenizes everything or b) tokenizes only identifiers? (The latter is the option I would personally prefer as an end user.) It doesn't require any knowledge of the specific grammar (or even the specific client), just a general idea that certain clients may be supplementing the semantic tokens with some other tokenizer, so they only need specification of semantic (as opposed to syntactic) information.

For what it's worth, it wouldn't be without precedent — that's how the TypeScript implementation works, and Rust Analyzer has an option to skip tokenizing strings. (But no pressure, obviously, it is your project. 🙂)

HighCommander4 · 2022-04-30T00:01:49Z

> Couldn't you just use a simple toggle that either a) tokenizes everything or b) tokenizes only identifiers? (The latter is the option I would personally prefer as an end user.) It doesn't require any knowledge of the specific grammar (or even the specific client), just a general idea that certain clients may be supplementing the semantic tokens with some other tokenizer, so they only need specification of semantic (as opposed to syntactic) information.

If I'm understanding you correctly, the augmentsSyntaxTokens capability in the upcoming 3.17 version of the spec is precisely such a toggle.
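
A server-side use of that capability might look roughly like the sketch below. The `ClientCapabilities` interface is a minimal local copy of the relevant part of the 3.17 client capabilities; the `syntacticTypes` list and the `tokensToSend` function are a hypothetical policy, not something the spec prescribes.

```typescript
// Minimal local copy of the relevant part of the LSP 3.17 client capabilities.
interface ClientCapabilities {
  textDocument?: {
    semanticTokens?: {
      augmentsSyntaxTokens?: boolean;
    };
  };
}

interface SemanticToken {
  text: string;
  type: string;
}

// Hypothetical policy: token types a client-side grammar can usually handle.
const syntacticTypes = ["keyword", "comment", "string", "number", "operator"];

// Send everything unless the client says it layers semantic tokens on top of
// its own syntax tokens; in that case skip the purely syntactic ones.
function tokensToSend(all: SemanticToken[], caps: ClientCapabilities): SemanticToken[] {
  const td = caps.textDocument;
  const st = td !== undefined ? td.semanticTokens : undefined;
  const augments = st !== undefined && st.augmentsSyntaxTokens === true;
  if (!augments) return all; // client relies on semantic tokens alone, or did not say
  return all.filter((t) => syntacticTypes.indexOf(t.type) < 0);
}
```

Which token types count as "syntactic" is a judgment call for each server; the list here is only a placeholder.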

DanTup · 2022-05-01T14:48:57Z

> Couldn't you just use a simple toggle that either a) tokenizes everything or b) tokenizes only identifiers?

I don't think so - the semantic tokens are adding more value than just identifiers. There are a lot of things that are complicated to handle 100% accurately in the textmate grammar (expressions in string interpolation can include keywords, for example, and documentation comments can include full code blocks).

Even with a built-in toggle, it seems like assumptions would have to be made about what the client is otherwise colouring, and unless/until LSP allowed us to provide the textmate grammar to the client, that's something I'd prefer not to make assumptions about (at least, not for something minor like a small number of users wanting to customise colours of a few specific keywords).

My question is really about how fine-grained these tokens/modifiers can/should be. I can easily handle this by just adding a custom modifier for every keyword (we already have a lot of custom modifiers to help theming), and that feels better to me than producing a restricted set of tokens. But I don't think it's as good as VS Code having more flexible built-in theming (since anything I do specifically for my language will not necessarily be consistent with other languages).

Jamesernator · 2022-09-01T13:00:47Z

So I have my own custom theme that uses a mapping from semantic token -> textmate tokens so that I can write my theme entirely semantically and have it work on non-semantic languages automatically. For the most part the semantic tokens cover most things I've come across however there are a few semantic token types that would be helpful as quite a few tokens simply have no corresponding semantic token to denote them.

Of note is a lack of semantic tokens for HTML/XML like tokens (semantically I don't feel the existing tokens cover any of these even if some could be contrived like class<->tag):

tag
- Corresponding to <tag> in HTML/XML etc
text
- Corresponding to text in HTML/XML etc, NOTE this differs from string literals in that attributes would generally be colored as string literals, but text content would differ, this can be seen in a sample on github like:
```
<tag attr="value">Some text</tag>
```
In this example the text semantic token would refer to Some text, but the existing string token would be used for "value" (in the attribute)
attribute
- Like the other two HTML/XML tokens suggested, attribute would refer to HTML attribute names (not their values)

From adding rules for JS, I found of particular help distinguishing would be:

boolean
- Other literal types like number, regexp, string exist, but not boolean which is supported by many languages
constant
- Would cover literal types without a more specific type like number/regexp/string/etc
.operator modifier for keyword
- Some keywords like new are more semantically like operators than other keywords
.expression modifier for keyword
- Some keywords are semantically more like values than "keywords", for example this
.storage modifier for keyword
- Some keywords specifically denote kinds of storage, for example const/let/var/readonly/private etc
.control modifier for keyword
- For keywords that declare control structures like if/for/while/etc
null
- For the null literal (similar to number/string/etc), very common literal in languages
.assignment modifier for operator
- Would cover operators like =, +=, etc, generally want to visually distinguish these from expression operators
.comparison modifier for operator
- Would cover operators like ==/</>/etc
.logical modifier for operator
- Would cover operators like &&/!/not/and/etc
.arithmetic modifier for operator
- Would cover operators like +/-/*/etc
punctuation
- There should be a semantic way to refer to punctuation, like ., {, (, etc etc, modifiers would probably be desirable here (though personally I just color them all grey)
.characterClass modifier for regexp
- This would target [a-z] and similar inside regexps
.escape modifier for string and regexp
- This would target \n, \u2202 and similar
.delimiter for string and regexp
- This would allow targeting the quotes and slashes for strings and regexps
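
The `.modifier for type` notation in this list lines up with the `type.modifier` selectors that themes use in `semanticTokenColors`. As a sketch of how such selectors could be matched, assuming the proposed modifiers existed: the rules, colors, and helper names below are hypothetical, the optional `:language` suffix of real selectors is ignored, and the "most parts wins" ranking is a simplification rather than VS Code's actual scoring.

```typescript
interface StyledToken {
  type: string;
  modifiers: string[];
}

// Match a selector of the form `type.modifier1.modifier2` (`*` means any type).
function selectorMatches(selector: string, token: StyledToken): boolean {
  const parts = selector.split(".");
  if (parts[0] !== "*" && parts[0] !== token.type) return false;
  for (const mod of parts.slice(1)) {
    if (token.modifiers.indexOf(mod) < 0) return false;
  }
  return true;
}

// Hypothetical theme rules using modifiers proposed in the list above.
const selectorRules: [string, string][] = [
  ["keyword", "#C586C0"],
  ["keyword.control", "#D16969"],
  ["operator.assignment", "#D4D4D4"],
];

// Simplified ranking: the matching selector with the most parts wins.
function colorOf(token: StyledToken): string | undefined {
  let best: string | undefined;
  let bestParts = 0;
  for (const [selector, color] of selectorRules) {
    const n = selector.split(".").length;
    if (selectorMatches(selector, token) && n >= bestParts) {
      best = color;
      bestParts = n;
    }
  }
  return best;
}
```

A `keyword` token carrying the `control` modifier picks up the more specific rule, while a `keyword` with only `storage` falls back to the plain `keyword` rule.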

aeschli · 2022-09-01T14:35:28Z

@Jamesernator Thanks a lot for sharing!

iDad5 · 2022-09-13T22:13:33Z

I do not have a clear idea of what kind of modifier to add. Something like what @DanTup suggested here seems an option to me, but I'm far from having a deep understanding. While trying to create a theme, though, I found that the scope of variable.defaultLibrary in JS and TS is very broad and overrides quite a lot; other *.defaultLibrary selectors in various languages probably do too. I'd guess that I'm not the only one who would like to give visual preference to certain built-in constructs over others.

I came upon this when I tried to give special emphasis to console, which by nature has (for me) a very different scope and use than built-in constants like Math.

MartinGC94 · 2024-07-19T23:10:58Z

How about Command Arguments (alternative names could be bare quote strings or generic tokens)?
Command line languages like PowerShell, Batch, Bash, etc. allow you to run commands like: command -parameter argument where the argument can either be a quoted or unquoted string value. Whether or not the string is quoted is important info because it affects how Bash handles wildcard characters and PowerShell includes similar logic when calling native programs.

aeschli self-assigned this May 6, 2020

aeschli added semantic-tokens Semantic tokens issues feature-request Request for new features or functionality labels May 6, 2020

aeschli added this to the Backlog milestone May 6, 2020

aeschli mentioned this issue May 6, 2020

"comment" semantic token type is overly broad vs the syntax types that it overrides when themeing #96712

Closed

rcjsuen mentioned this issue May 6, 2020

Semantic highlighting API draft should incorporate "importKeyword" and "modifierKeyword" token types microsoft/language-server-protocol#968

Open

aeschli mentioned this issue May 7, 2020

SemanticTokensModifier: Add unused token modifier microsoft/vscode-languageserver-node#604

Open

woody77 mentioned this issue May 21, 2020

Name resolution works poorly with #[cfg(target)]? rust-lang/rust-analyzer#4041

Open

aeschli mentioned this issue Aug 11, 2020

Support specifying language for semantic token types and modifiers #103097

Closed

rcjsuen mentioned this issue Aug 14, 2020

Additional Semantic Token Type microsoft/language-server-protocol#1067

Open

Eskibear mentioned this issue Sep 30, 2020

Improve semantic token modifiers eclipse-jdtls/eclipse.jdt.ls#1539

Merged

matklad mentioned this issue Nov 17, 2020

Collect new SymbolKinds microsoft/language-server-protocol#344

Open

0dinD mentioned this issue Jan 26, 2021

Declare semantic token modifiers redhat-developer/vscode-java#1760

Merged

aeschli mentioned this issue Jun 7, 2021

Semantic token modifiers do not appear to affect colors #125448

Closed

DanTup mentioned this issue Apr 28, 2022

Specific semantic tokens for various "keyword" types not provided Dart-Code/Dart-Code#3926

Open

iDad5 mentioned this issue Dec 6, 2022

Inconsistencies, Bugs and Generally Confusing Documentation/Guides for Theming and Syntax-Highlighting microsoft/vscode-docs#5831

Open

DanTup mentioned this issue Nov 10, 2022

Change the String enclosing quotes color Dart-Code/Dart-Code#4258

Open

benrbray mentioned this issue Jun 17, 2024

Semantic Tokens for TypeScript Template Literal Strings microsoft/TypeScript#58900

Open

MartinGC94 mentioned this issue Jul 22, 2024

Update semantic tokens PowerShell/PowerShellEditorServices#2168

Merged
