Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement (contextual) keywords and use their versioning from v2 #723

Merged
merged 39 commits into from
Jan 8, 2024

Conversation

Xanewok
Copy link
Contributor

@Xanewok Xanewok commented Dec 27, 2023

Closes #568

There is still one outstanding issue where we return a Vec<TokenKind> from next_token; it'd like to return a more specialized type and ideally pass it on stack (2x2 bytes), rather than on-heap (extra 3x8 bytes for the Vec handle + indirection). We should name it better and properly show that we can return at most 2 token kinds (single token kind or identifier + kw combo).

To do:

  • Return tokens from next_token via stack

Apart from that, I think this is a more correct approach than #598, especially accounting for the new keyword definition format in DSL v2.

The main change is that we only check the keyword trie and additionally the (newly introduced) compound keyword scanners only after the token has been lexed as an identifier. For each context, we collect Identifier scanners used by the keywords and attempt promotion there.

The existing lexing performance is not impacted from what I've seen when running the sanctuary tests and I can verify (incl. CST tests) that we now properly parse source that uses contextual keywords (e.g. from) and that the compound keywords (e.g. ufixedMxN) are properly versioned.

This adapts the existing codegen_grammar interface that's a leftover from DSLv1; I did that to work on finishing #638; once this is merged and we now properly parse contextual keywords, I'll move to clean it up and reduce the parser codegen indirection (right now we go from v2 -> v1 model -> code generator -> Tera templates; it'd like to at least cut out the v1 model and/or simplify visiting v2 from the existing CodeGenerator).

Please excuse the WIP comments in the middle; the first and the last ones should make sense when reviewing. I can simplify this a bit for review, if needed.

@Xanewok Xanewok requested a review from a team as a code owner December 27, 2023 18:06
Copy link

changeset-bot bot commented Dec 27, 2023

🦋 Changeset detected

Latest commit: 4fb350e

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@nomicfoundation/slang Minor

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@Xanewok
Copy link
Contributor Author

Xanewok commented Jan 2, 2024

Instead of next_token returning Vec (first pass), we now introduce the following type:

pub enum ScannedToken {
    Single(TokenKind),
    IdentifierOrKeyword {
        identifier: TokenKind,
        kw: KeywordScan,
    },
}

Not only we don't go through heap for the often-called next_token, we now explicitly have to handle the case of identifiers/keywords and whether a keyword is strictly reserved or "present", so usable in both keyword and identifier position.

To keep the existing parser flow mostly the same, a helper fn ScannedToken::unambiguous(self) -> TokenKind is introduced, that tries to map back to a single token kind if possible (for the "present" keyword it falls back to a general "identifier" token kind).

This also introduces another helper fn ScannedToken::accepted_as(self, expected: TokenKind) -> bool that is used in the parser functions as a convenience, which returns whether the scanned token is accepted in a given position, streamlining the underlying additional kw reservation checks for identifier token kinds.

$(
{
if let result @ (KeywordScan::Present(..) | KeywordScan::Reserved(..)) = ($scanner) {
if $ident.len() == $stream.position().utf8 - save.utf8 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not perfect, I admit; this should probably be handled with an automaton for the keywords and executed/matched on the string till it's exhausted but let's punt on this for now

Copy link
Contributor

@OmarTawfik OmarTawfik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few suggestions.

@Xanewok Xanewok added this pull request to the merge queue Jan 8, 2024
Merged via the queue into NomicFoundation:main with commit b3dc6bc Jan 8, 2024
1 check passed
@Xanewok Xanewok deleted the keyword-idents-take-2 branch January 8, 2024 17:27
@github-actions github-actions bot mentioned this pull request Jan 8, 2024
github-merge-queue bot pushed a commit that referenced this pull request Jan 11, 2024
This PR was opened by the [Changesets
release](https://github.com/changesets/action) GitHub action. When
you're ready to do a release, you can merge this and publish to npm
yourself or [setup this action to publish
automatically](https://github.com/changesets/action#with-publishing). If
you're not ready to do a release yet, that's fine, whenever you add more
changesets to main, this PR will be updated.


# Releases
## @nomicfoundation/[email protected]

### Minor Changes

- [#710](#710)
[`2025b6cb`](2025b6c)
Thanks [@Xanewok](https://github.com/Xanewok)! - CST children nodes are
now named

- [#723](#723)
[`b3dc6bcd`](b3dc6bc)
Thanks [@Xanewok](https://github.com/Xanewok)! - Properly parse
unreserved keywords in an identifier position, i.e. `from`, `emit`,
`global` etc.

- [#728](#728)
[`662a672c`](662a672)
Thanks [@Xanewok](https://github.com/Xanewok)! - Remove Language#scan
API; use the parser API instead

- [#719](#719)
[`1ad6bb37`](1ad6bb3)
Thanks [@OmarTawfik](https://github.com/OmarTawfik)! - introduce strong
types for all Solidity non terminals in the TypeScript API.

### Patch Changes

- [#719](#719)
[`1ad6bb37`](1ad6bb3)
Thanks [@OmarTawfik](https://github.com/OmarTawfik)! - unify
Rust/TypeScript node helpers: `*_with_kind()`, `*_with_kinds()`,
`*_is_kind()`), ...

- [#731](#731)
[`3deaea2e`](3deaea2)
Thanks [@OmarTawfik](https://github.com/OmarTawfik)! - add
`RuleNode.unparse()` to the TypeScript API

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow keywords to overlap identifiers
3 participants