Implement in-memory symbol index with fast writes and low memory footprint #292

olafurpg · 2018-04-21T16:26:21Z

I'm opening this ticket to see if someone would be interested in implementing a mutable associative array with a compact memory footprint and fast writes. Read performance is not so important. It does not have to support concurrent access.

Currently, the symbol index uses quite a bit of memory. I would like to explore keeping the index in memory before trying on-disk solutions such as SQL. Reading SemanticDB protobuf files is fast, so if building the index in memory is fast and compact enough, we can rebuild the index as often as needed and keep the *.semanticdb files on disk as the single source of truth.

Keys are hierarchical strings, so a HAT trie sounds like a good fit:

paper http://crpit.com/confpapers/CRPITV62Askitis.pdf
blog post https://tessil.github.io/2017/06/22/hat-trie.html

But any other data structure that does the job is fine.
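To make the idea of sharing hierarchical prefixes concrete, here is a minimal sketch of a mutable trie keyed by symbol segments. It is not a HAT trie and not the metals implementation; the names `SegmentTrie`, `put`, `get` and `size` are hypothetical. Whether the per-node hash maps actually use less memory than one flat hash map on the JVM is exactly what a benchmark would have to show.

```scala
import scala.collection.mutable

/** Hypothetical sketch: a single-threaded trie keyed by symbol segments. Not a HAT trie. */
final class SegmentTrie[V] {
  private final class Node {
    var value: Option[V] = None
    val children = mutable.HashMap.empty[String, Node]
  }
  private val root = new Node
  private var count = 0

  // Splits "scala.collection.List#" into "scala.", "collection.", "List#",
  // keeping each delimiter attached to the segment it terminates.
  private def segments(key: String): Iterator[String] = new Iterator[String] {
    private var start = 0
    def hasNext: Boolean = start < key.length
    def next(): String = {
      var end = start
      while (end < key.length && key.charAt(end) != '.' && key.charAt(end) != '#') end += 1
      if (end < key.length) end += 1 // include the delimiter
      val seg = key.substring(start, end)
      start = end
      seg
    }
  }

  /** Inserts or overwrites the value stored under `key`. */
  def put(key: String, v: V): Unit = {
    var node = root
    segments(key).foreach { seg =>
      node = node.children.getOrElseUpdate(seg, new Node)
    }
    if (node.value.isEmpty) count += 1
    node.value = Some(v)
  }

  /** Looks up `key`; returns None if nothing was stored under exactly that key. */
  def get(key: String): Option[V] = {
    var node = root
    val it = segments(key)
    while (it.hasNext) {
      node.children.get(it.next()) match {
        case Some(child) => node = child
        case None => return None
      }
    }
    node.value
  }

  /** Number of keys that currently hold a value. */
  def size: Int = count
}
```

With this sketch, `scala.collection.List#` and `scala.collection.List.` share the nodes for `scala.` and `collection.` and differ only in the last segment, so the common prefix is stored once.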

The objective would be to replace this map here (and make things single-threaded)

metals/metals/src/main/scala/scala/meta/metals/search/InMemorySymbolIndexer.scala

Line 16 in fcac1cc

symbols: collection.concurrent.Map[String, AtomicReference[SymbolData]] =

Reference implementations:

The metrics to measure in benchmarks are

write speed for ~1M string keys that are ~4-50 characters long and have a huge amount of redundancy in the prefix.
memory footprint of computed trie

The string keys look something like this

scala.
scala.collection.
scala.collection.List#
scala.collection.List.
scala.collection.mutable.ArrayBuffer.
scala.collection.mutable.ArrayBuffer#
scala.concurrent.Future#
scala.concurrent.Future.
...
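A rough harness for the two metrics above could look like the following sketch. It measures a plain `mutable.HashMap` as a baseline; `IndexBenchmark`, `syntheticKeys` and `measure` are hypothetical names, the generated keys only imitate the shape of real symbols, and the heap reading relies on `System.gc()`, which is only a hint to the JVM. A serious comparison would more likely use a dedicated tool such as JMH.

```scala
import scala.collection.mutable

object IndexBenchmark {
  // Hypothetical key generator: unique synthetic symbols with heavily shared prefixes.
  def syntheticKeys(n: Int): Iterator[String] =
    Iterator.tabulate(n) { i =>
      val pkg = s"scala.pkg${i % 100}.sub${i % 1000}."
      if (i % 2 == 0) s"${pkg}Class$i#" else s"${pkg}Class$i."
    }

  // Approximate used heap; System.gc() is only a request, so treat the result as a rough figure.
  private def usedHeapBytes(): Long = {
    val rt = Runtime.getRuntime
    System.gc()
    rt.totalMemory() - rt.freeMemory()
  }

  /** Returns (elapsed nanoseconds for the writes, approximate heap growth in bytes, filled map). */
  def measure(n: Int): (Long, Long, mutable.Map[String, Int]) = {
    val before = usedHeapBytes()
    // Materialize the keys first so that only the writes are timed.
    val keys = syntheticKeys(n).toArray
    val index = mutable.HashMap.empty[String, Int]
    val start = System.nanoTime()
    var i = 0
    while (i < keys.length) {
      index.put(keys(i), i)
      i += 1
    }
    val elapsed = System.nanoTime() - start
    // The growth includes the key strings and possibly the temporary key array.
    val grown = usedHeapBytes() - before
    (elapsed, grown, index)
  }
}
```

Calling `IndexBenchmark.measure(1000000)` gives one write-time and heap-growth sample for the baseline; swapping the `mutable.HashMap` for a candidate structure and repeating the run several times would give comparable numbers.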


longshorej · 2018-08-21T02:32:29Z

Interesting! Are there some existing datasets to reference for benchmarks?

I prepared the following program to generate a ~1M dataset. Admittedly it would be better with real world data but this might be close enough to get started.

object Hat {
  def main(args: Array[String]): Unit = {
    // Building blocks for synthetic, package-like names with shared prefixes.
    val prefix = Seq("scala", "java", "spark", "lightbend", "farmco", "rust", "elm", "react", "cats", "dogs", "cars", "boats")
    val tlds = Seq("com", "info", "org", "net", "io", "edu", "gov")
    val packages = Seq("util", "logging", "collections", "io", "impl", "inter")
    val classes = Seq("FooBarBazzer", "BazBarFooer", "Reader", "Writer", "Adder", "Multiplier", "Printer", "Loader", "Notifier", "Remover")
    val versions = 200

    val obj = 42

    // Cartesian product: 12 * 7 * 6 * 10 * 200 = 1,008,000 names.
    val names =
      for {
        p <- prefix
        t <- tlds
        pa <- packages
        c <- classes
        v <- 1 to versions
      } yield s"$t.$p.$pa.${c}Version$v"

    println(s"names: ${names.length}")

    // Keep the JVM alive for two minutes (presumably to inspect the heap with an external tool).
    Thread.sleep(120000)
  }
}

olafurpg · 2018-08-21T13:28:07Z

You might be able to achieve the same effect with a dictionary list http://www-01.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt

We have a decently sized semanticdb corpus here https://drive.google.com/file/d/0B5BBsTPBxGcMYnlXSk5KUk40Q0k/view but it was compiled with a somewhat old version of semanticdb-scalac (I can't remember exactly what version). Once scalameta v4.0 final is out (before September) I plan to re-build the corpus. Until then I think your random generator is an OK benchmark :)

olafurpg · 2018-08-21T13:29:17Z

The problem with a dictionary list is that it doesn't have the same degree of prefix redundancy as symbols, so I suspect your random generator is more representative.

olafurpg · 2018-10-17T09:31:52Z

Fixed in #332. It turns out that if you only store "non-trivial" top-level symbols and use nio.file.Path, the memory footprint isn't that big (<30 MB) even for huge classpaths. I'm sure we will need more optimization work, but it's not clear to me that the proposal in this ticket is relevant anymore.

olafurpg added performance labels Apr 21, 2018

olafurpg changed the title ~~Implement symbol index with memory and write optimized key/value store~~ Implement in-memory symbol index with fast writes and low memory footprint Apr 21, 2018

olafurpg added the navigation label (related to goto definition, find references, open symbol) and removed the help wanted label Sep 27, 2018

olafurpg closed this as completed Oct 17, 2018

This was referenced Oct 17, 2018

Emit method overload symbol in Scala mtags #282

Closed

Emit method overload symbol in Java mtags #281

Closed

Remove noise from logs #227

Closed
