Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement goto definition with low CPU and memory overhead #332

Merged
merged 30 commits into from
Oct 17, 2018

Conversation

olafurpg
Copy link
Member

@olafurpg olafurpg commented Oct 8, 2018

This PR replaces the existing navigation solution with a new implementation that is fast and uses little memory. To avoid making this PR larger, only goto definition is implemented.

The fast indexing is enabled by a new Scala parser that extract only toplevel classes from Scala sources at ~800k loc/s according to JMH "single shot" average time. This parser is implemented on top of the LegacyScanner, which is a fork of the Scala compiler scanner and is used by the Scalameta tokenizer. Java source files don't need indexing since their toplevel classes are inferred from their filename and can be computed on-demand.

The low-memory is achieved by storing only "non-trivial" toplevel classes in the index, that is toplevel members that don't match their enclosing file path. An index of the combined sources for akka+spark+finch+scalameta+linkerd+kafka+grpc uses ~17MB of memory, at ~1.4MB per 100k lines of Scala code. The index uses Scala stdlib Map with String keys so it might be possible to compress this further with a smarter data structure like prefix trees.

This PR removes pretty much everything from the original Metals codebase, including functional parts like the Main entrypoint, LSP functionality, sbt server integration and all tests suites. Some of those parts may get revived in the future but I think first we should first write more comprehensive tests for those components. In particular, the old implementation suffered from lack of e2e integration tests making it easy to silently introduce regressions.

This PR introduces a new testing infrastructure based on expect tests that optimizes ease of writing tests, code reviewing tests and updating existing tests once the expected behavior changes. Some test like DefinitionSuite stress a range of important high-level properties like "every identifier token should resolve to a definition and every non-identifier token should not resolve to a definition". These property check have so far been very effective at catching bugs.

This commit removes pretty much everything in Metals, including the
test suites.  The current implementation severals from several problems

- it is tied to Scalameta v2.1.7 and upgrading to v4.0.0 is difficult
- it consumes a lot of memory, up to ~2-3GB for medium sized codebases.
- it is slow at indexing dependency sources, importing a medium/large sized
  project can take several minutes.
- it uses complex persistent caching to make up for slow indexing,
  storing data on the users disk. This implementation detail leaked into
  the user experience through the "Clear index cache" command.
- tests are difficult to write, read and update to changing
  requirements.

There are some good parts that are removed in this commit but may get
revived in the future:

- token edit distance works great for stale SemanticDB files
- sbt server integration works and has an integration test
- MetalsServices + Main + Configuration
- upgrade scalameta
- remove unnecessary projects
- use resource generators to feed input project
  classpath/sourceDirectories/sourceJars into test suite.
We're going to implement a parser with the LegacyScanner (tokenizer) and
this basic utilty will come in handy.
These methods were spread out through different objects and now they are
in one place only.
This commit implements a high-performance Scala parser that only
exctracts top-level symbols from Scala source files. This parser
needs to be fast because it is on the critical path when importing
a new project and we index the depenency sources.
This commit introduces a new module mtags that implements goto
definition with a new design that achieves the following goals:

- source indexing is fast, Scala sources are indexed at ~800k loc/s
  and Java sources don't require any indexing altogether.
- the in-memory index uses little ram, an index for ~1.5 million lines
  of Scala takes ~15MB.

The new symbol indexer is able to handle different kinds of sources:
source files, sources.jar and SemanticDB files.  Source and  SemanticDB
can complement each other, the indexer falls back source indexing when
the SemanticDB file is either missing or is incomplete due to for
example macro annotations.

An important difference from the old indexer is that the new indexer is
not responsible for computing the range position inside the source file.
The final range position is left for a separate module that handles
other concerns such as stale SemanticDBs using edit distance.
These are copy-pasted from the Scalameta test suite.
- BaseSuite: replaces utest DSL with ScalaTest FunSuite syntax.
- BaseExpectSuite: replaces utest DSL with ScalaTest FunSuite syntax.
- SingleFileExpectSuite: expect test with a single output file
- DirectoryExpectSuite: expect test with a directory containing multiple
  expected output files.
- InputProperties: metadata about the input project.
Test cases are plain Scala code, no magic assertions.
This set of files stress a range of interesting Scala and Java language
features.
This test suite is the most advanced of the current test suites,
it tests a range of functionality:
- synthetic symbols from case classes fallback to non-synthetic symbols
- source + semanticdb indexing can be combined to support navigation to
  macro annotations.
We care equally about memory useage and CPU time to index sources.
Rough summary of the results:

- indexing 7 million loc Java and 1.5 million loc Scala takes ~15M ram
- Scala topelevel parser handles ~800k loc/s

The numbers are so good that I'm not too stressed about
micro-optimization as the moment. We run the full indexer for a
medium-sized classpath on every unit test without annoying me,
total runtime of `unit/test` is ~4 seconds.
```
> bench/jmh:run -i 10 -wi 10 -f1 -t1
[info] Benchmark                   Mode  Cnt  Score   Error  Units
[info] MetalsBench.indexSources      ss   10  0.620 ± 0.058   s/op
[info] MetalsBench.javaMtags         ss   10  7.233 ± 0.017   s/op
[info] MetalsBench.scalaMtags        ss   10  4.617 ± 0.034   s/op
[info] MetalsBench.scalaToplevels    ss   10  0.361 ± 0.005   s/op
```
@olafurpg olafurpg changed the title Implement high-performance goto definition Implement goto definition with low CPU and memory overhead Oct 8, 2018
- Only deal with definitions, no references.
- No SemanticDB indexing, only mtags.
- Document OnDemandSymbolIndex.
We needed to add the enclosing source directory to the API to handle
this case. Added detailed documentation to the methods.
@olafurpg
Copy link
Member Author

olafurpg commented Oct 9, 2018

Updated the PR with more docs and simplified the GlobalSymbolIndex interface to only be about finding definitions of global symbols from plains source files, no SemanticDB needed.

@olafurpg olafurpg requested a review from gabro October 9, 2018 21:37
Copy link
Member

@gabro gabro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass of comments, I haven't got to the bottom of this yet.

loadFromSourceJars(trivialPaths(toplevel)).foreach(addSemanticdb)
}
}
if (!definitions.contains(symbol)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this repeated because the above can mutate definitions?

build.sbt Outdated
"org.scalameta" %% "testkit" % V.scalameta % Test,
"org.scalameta" %% "scalameta" % V.scalameta,
"org.scalameta" %% "symtab" % V.scalameta,
"org.scalameta" % "interactive" % V.scalameta cross CrossVersion.full,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still in use?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, ls4ps is not strictly used either because the metals project has 2 source files at the moment. I will clean up the metals project once we add language server stuff.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant the interactive part specifically, but there's no rush to clean this up now, I agree

2.12.6 are not supported.
* Java 8. Note that Java 9 or higher has not been tested.
* macOS or Linux. Note that there have been reported issues on Windows.
- Scala 2.12.7, down to the exact PATCH version. Note that Scala versions 2.11.x
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe make the Scala version dynamic using V.scala212?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs don't build actually at the moment because the Configuration object is now gone. We'll need to update the docs before there is a release.

/**
* Utility to load relative paths from a classpath.
*
* Provides similar functionality as URLClassLoader but does no caching.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the no-caching a feature or a limitation? Maybe it's worth explaining why this class is needed over URLClassLoader

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most important bit is that it uses AbsolutePath and Option[T] instead of java.net.URL and nulls. Updated docstring to reflect that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lack of caching was actually out-dated, the underlying implementation uses URLClassLoader at the moment.


}

case class SymbolDefinition(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe SymbolDefinitionPath or SymbolDefinitionFile?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ending with Path makes the querySymbol/definitionSymbol fields a bit confusing IMO

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the same time, this is not really a symbol definition. It's the file where the symbol is defined, so that was confusing at a first read

import scala.meta.internal.mtags.Enrichments._

/**
* Utility to generate method symbol disambiguators according to SemanticDB spec.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double space between symbol and disambiguators

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed, only if we had a formatting tool that could reformat our comments 🤔

// java.lang.NullPointerException: null
// at com.thoughtworks.qdox.builder.impl.ModelBuilder.createTypeVariable(ModelBuilder.java:503)
// at com.thoughtworks.qdox.builder.impl.ModelBuilder.endMethod(ModelBuilder.java:470)
// TODO(olafur) report bug to qdox.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this been reported? 😇

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, didn't manage to minimize the error. Replaced the comment with explanation why we ignore parse errors.

While removing the unused `DefinitionAlternatives.methodToVal` I
realized that DefinitionSuite didn't properly document which cases
relied on fallbacks.
Copy link
Member Author

@olafurpg olafurpg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed review, a lot of great questions @gabro !

build.sbt Outdated
"org.scalameta" %% "testkit" % V.scalameta % Test,
"org.scalameta" %% "scalameta" % V.scalameta,
"org.scalameta" %% "symtab" % V.scalameta,
"org.scalameta" % "interactive" % V.scalameta cross CrossVersion.full,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, ls4ps is not strictly used either because the metals project has 2 source files at the moment. I will clean up the metals project once we add language server stuff.

2.12.6 are not supported.
* Java 8. Note that Java 9 or higher has not been tested.
* macOS or Linux. Note that there have been reported issues on Windows.
- Scala 2.12.7, down to the exact PATCH version. Note that Scala versions 2.11.x
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done.

2.12.6 are not supported.
* Java 8. Note that Java 9 or higher has not been tested.
* macOS or Linux. Note that there have been reported issues on Windows.
- Scala 2.12.7, down to the exact PATCH version. Note that Scala versions 2.11.x
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs don't build actually at the moment because the Configuration object is now gone. We'll need to update the docs before there is a release.

/**
* Utility to load relative paths from a classpath.
*
* Provides similar functionality as URLClassLoader but does no caching.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most important bit is that it uses AbsolutePath and Option[T] instead of java.net.URL and nulls. Updated docstring to reflect that.

/**
* Utility to load relative paths from a classpath.
*
* Provides similar functionality as URLClassLoader but does no caching.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lack of caching was actually out-dated, the underlying implementation uses URLClassLoader at the moment.

}

/** Fallback to the val term for a def with multiple params */
private def methodToVal(symbol: String): Option[String] =
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not used, removed.

*
* Returns the path of the file that defines the symbol but does not include
* the exact position of the definition. Computing the range position of the
* definition is left for the client and can be done using the mtags module.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clarified with

the exact position of the definition. Computing the range position of the
definition is not handled by this method, it is left for the user and can
be done using the mtags module.

def addSourceFile(
file: AbsolutePath,
sourceDirectory: Option[AbsolutePath]
): Unit
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently this throws an exception for errors like tokenization failures. I'm not sure yet what the best error handling method so I'd prefer to refine it down the road once we have a clearer picture of how it will be used in the language server. Maybe the underlying implementation can accept a Reporter 🤔


}

case class SymbolDefinition(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ending with Path makes the querySymbol/definitionSymbol fields a bit confusing IMO

Copy link
Member

@gabro gabro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added two other minor comments

*
* Fast indexing is enabled by ScalaToplevelMtags, a custom parser that extracts
* only toplevel symbols from a Scala source file. Java source files don't need indexing
* because their file location can be inferred from the symbol itself.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a reference to the note from above, where you list the limitations of this approach.

) extends GlobalSymbolIndex {
val mtags = new Mtags
private val sourceJars = new ClasspathLoader(Classpath(Nil))
var indexedSources = 0L
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be accessed concurrently? Maybe using an AtomicInteger would be safer.

@gabro
Copy link
Member

gabro commented Oct 15, 2018

An overall comment that I have is about the pervasive use of String and extensions methods on it.

It feels a bit brittle and I suspect it makes it difficult to refactor some of the code.

Especially the use of String for symbols makes me a bit uneasy and I wished it were a distinct type (even through the use of a newtype).

Biggest change was wrapping all strings in a `Symbol` class. There's
currently no standard symbol class in scalameta that we can reuse so we
copy-paste what exists in scalafix and adjust to our needs.
@olafurpg
Copy link
Member Author

An overall comment that I have is about the pervasive use of String and extensions methods on it.

That's a good point. I've updated the PR to use a wrapper class Symbol(val value: String). Let's see how that work, I'd love to get a data structure like this into org.scalameta:semanticdb that we can reuse but so far it doesn't exist so we live with copy-paste.

Thank you @gabro for the review, a lot of great comments 🙏

I will merge this now after CI is green, next step is to setup a bsp client 💪

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants