Implement in-memory symbol index with fast writes and low memory footprint #292
Comments
Interesting! Are there some existing datasets to reference for benchmarks? I prepared the following program to generate a dataset of ~1M names. Admittedly it would be better with real-world data, but this might be close enough to get started.
object Hat {
  def main(args: Array[String]): Unit = {
    val prefix = Seq("scala", "java", "spark", "lightbend", "farmco", "rust", "elm", "react", "cats", "dogs", "cars", "boats")
    val tlds = Seq("com", "info", "org", "net", "io", "edu", "gov")
    val packages = Seq("util", "logging", "collections", "io", "impl", "inter")
    val classes = Seq("FooBarBazzer", "BazBarFooer", "Reader", "Writer", "Adder", "Multiplier", "Printer", "Loader", "Notifier", "Remover")
    val versions = 200
    val obj = 42
    // 12 prefixes * 7 TLDs * 6 packages * 10 classes * 200 versions = 1,008,000 names
    val names =
      for {
        p <- prefix
        t <- tlds
        pa <- packages
        c <- classes
        v <- 1 to versions
      } yield s"$t.$p.$pa.${c}Version$v"
    println(s"names: ${names.length}")
    // Keep the JVM alive for two minutes after the names are built
    Thread.sleep(120000)
  }
}
You might be able to achieve the same effect with a dictionary list: http://www-01.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt. We have a decently sized semanticdb corpus here: https://drive.google.com/file/d/0B5BBsTPBxGcMYnlXSk5KUk40Q0k/view, but it was compiled with a somewhat old version of semanticdb-scalac (I can't remember exactly which version). Once scalameta v4.0 final is out (before September) I plan to rebuild the corpus. Until then I think your random generator is an OK benchmark :)
The problem with a dictionary list is that it doesn't have the same degree of prefix redundancy as symbols do, so I suspect your random generator is more representative.
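A rough sketch of how generated names like these could be used as a write benchmark. Everything here is an assumption for illustration: the `IndexBench` object, the shortened name lists, and the choice of wall-clock insert time and used heap as the measured quantities (taken from the "fast writes and low memory footprint" wording of the issue title). A plain `mutable.HashMap` stands in as the baseline.

```scala
import scala.collection.mutable

// Hypothetical harness, not part of metals.
object IndexBench {
  // Same shape as the generator in this thread, with shortened lists.
  def names(versions: Int): Seq[String] = {
    val prefix = Seq("scala", "java")
    val tlds = Seq("com", "org")
    val packages = Seq("util", "io")
    val classes = Seq("Reader", "Writer")
    for {
      p <- prefix
      t <- tlds
      pa <- packages
      c <- classes
      v <- 1 to versions
    } yield s"$t.$p.$pa.${c}Version$v"
  }

  // Approximate used heap; System.gc() is only a hint to the JVM,
  // so the number is indicative rather than exact.
  def usedHeapBytes(): Long = {
    val rt = Runtime.getRuntime
    System.gc()
    rt.totalMemory() - rt.freeMemory()
  }

  // Returns (elapsed milliseconds, approximate heap growth in bytes)
  // for inserting every key into a baseline mutable.HashMap.
  def measure(keys: Seq[String]): (Long, Long) = {
    val before = usedHeapBytes()
    val start = System.nanoTime()
    val index = mutable.HashMap.empty[String, Int]
    var i = 0
    keys.foreach { k =>
      index.put(k, i)
      i += 1
    }
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    val after = usedHeapBytes()
    // Touch the map after measuring so it is still reachable at that point.
    require(index.nonEmpty || keys.isEmpty)
    (elapsedMs, after - before)
  }
}
```

Calling `IndexBench.measure(IndexBench.names(200))` gives a baseline to compare a candidate data structure against; a proper benchmark harness with warm-up runs would give more trustworthy timings than this single pass.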
Fixed in #332, it turns out if you only store "non-trivial" top level symbols and use
I'm opening this ticket to see if someone would be interested in implementing a mutable associative array with compact memory footprint and fast writes. Read performance is not so important. It does not have to support concurrent access.
Currently, the symbol index uses quite a bit of memory. I would like to explore the possibility of keeping the index in-memory before trying on-disk solutions such as SQL. Reading SemanticDB protobuf files is fast, so if building the index in-memory is fast and compact enough, we can rebuild the index all the time and keep the *.semanticdb files on disk as the single source of truth.
Keys are hierarchical String values, so it sounds like it would be a good fit for a HAT trie. But any other data structure that does the job is fine.
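To make the idea of exploiting hierarchical keys concrete, here is a minimal sketch of a mutable, single-threaded map that stores each shared key segment only once. It is not a HAT trie and not a proposal for the actual implementation: `SegmentTrie` is a hypothetical name, splitting keys on `.` is an assumption about the key format, and a `HashMap` per node is not memory-efficient. It only illustrates the interface and the prefix sharing.

```scala
import scala.collection.mutable

// Hypothetical sketch: a mutable associative array over dotted keys.
// Not thread-safe, matching the "no concurrent access" requirement.
final class SegmentTrie[V] {
  private final class Node {
    val children = mutable.HashMap.empty[String, Node]
    var value: Option[V] = None
  }

  private val root = new Node
  private var count = 0

  // Inserts or overwrites the value stored under `key`.
  def put(key: String, value: V): Unit = {
    var node = root
    // Assumption: '.' separates the hierarchical segments of a key.
    key.split('.').foreach { segment =>
      node = node.children.getOrElseUpdate(segment, new Node)
    }
    if (node.value.isEmpty) count += 1
    node.value = Some(value)
  }

  // Returns the value stored under exactly `key`, if any.
  def get(key: String): Option[V] = {
    var node = root
    val segments = key.split('.')
    var i = 0
    while (i < segments.length) {
      node.children.get(segments(i)) match {
        case Some(child) => node = child
        case None => return None
      }
      i += 1
    }
    node.value
  }

  // Number of keys that currently have a value.
  def size: Int = count
}
```

With keys such as `com.scala.util.Reader` and `com.scala.util.Writer`, the nodes for `com`, `scala` and `util` are created once and shared by both entries, which is the kind of redundancy a trie-based index would exploit.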
The objective would be to replace this map here (and make things single-threaded)
metals/metals/src/main/scala/scala/meta/metals/search/InMemorySymbolIndexer.scala
Line 16 in fcac1cc
Reference implementations:
The metrics to measure in benchmarks are
The string keys look something like this