-
-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Indexing features for modules #347
Conversation
@zachallaun ☝️ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't quite finish reading through all of this, but made a few comments on things that I noticed.
apps/remote_control/lib/lexical/remote_control/search/indexer/extractors/module.ex
Show resolved
Hide resolved
apps/remote_control/lib/lexical/remote_control/search/indexer/module.ex
Outdated
Show resolved
Hide resolved
apps/remote_control/lib/lexical/remote_control/search/indexer/source.ex
Outdated
Show resolved
Hide resolved
apps/remote_control/lib/lexical/remote_control/search/indexer/extractors/module.ex
Show resolved
Hide resolved
apps/remote_control/lib/lexical/remote_control/search/indexer/source/reducer.ex
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the AST stuff is lost on me (I need to brush up on that) but the overall approach looks good.
I noticed the indexer doesn't appear to use any concurrency at the moment; is that on purpose? Unless I'm mistaken, parallelizing around files should be transparent to the rest of the indexing process?
There's obviously the issue of choosing how many concurrent processes to use, which would ideally depend on the available cores. Probably not worth doing now to keep this PR focused on the core algorithm, was just wondering if you'd given it some thought.
Yes, that's intentional.
Still working on #1. |
BTW, this is a draft PR, I'd like feedback on the overall approach |
apps/remote_control/lib/lexical/remote_control/search/indexer.ex
Outdated
Show resolved
Hide resolved
apps/remote_control/lib/lexical/remote_control/search/indexer/source/reducer.ex
Show resolved
Hide resolved
9c109ae
to
9758268
Compare
f9149dd
to
3852f75
Compare
apps/remote_control/lib/lexical/remote_control/search/indexer.ex
Outdated
Show resolved
Hide resolved
apps/remote_control/test/lexical/remote_control/search/indexer/source_test.exs
Outdated
Show resolved
Hide resolved
apps/remote_control/test/lexical/remote_control/search/indexer/source_test.exs
Show resolved
Hide resolved
apps/remote_control/test/lexical/remote_control/search/indexer/source_test.exs
Outdated
Show resolved
Hide resolved
apps/remote_control/test/lexical/remote_control/search/indexer/module_test.exs
Outdated
Show resolved
Hide resolved
apps/remote_control/lib/lexical/remote_control/search/indexer/source/reducer.ex
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've only read the fuzzy and scorer modules so far, but I think they're definitely on the right track! This will be a very useful feature for lexical.
apps/remote_control/test/lexical/remote_control/search/fuzzy/scorer_test.exs
Outdated
Show resolved
Hide resolved
apps/remote_control/test/lexical/remote_control/search/fuzzy_test.exs
Outdated
Show resolved
Hide resolved
@Blond11516 please re-review; the unit tests for scorer were actually backwards. |
8142e3d
to
66b2aeb
Compare
This commit introduces indexing for module references. It currently will only work for files to which we have the source code, turns them into AST and then applies one or more "extractors" on it. An extractor runs over AST and indetifies things it's looking for, and then emits entries, which will be used to build a searchable index. Currently, there is one extractor implemented: Module references. This does a fairly exhaustive search for module references, wherever they're hiding. It then uses the aliases module to find the original source module. Note that most of the higher level functions are just placeholders. An indexing strategy needs to find things inside of .exs files as well as .ex files, and we currently have no way to index core elixir (this will be highly dependent on how elixir was installed and if we can get the source code to the version we're running on).
Turns out `Aliases.at()` was super slow --especially since we called it so much when resolving aliases during indexing. This was due to it having to re-parse the document multiple times for each alias reference. Moving to sourceror and allowing the AST to be sent in cut the indexing time for lexical and all its deps from 60 seconds to 10.
This commit has ETS based file storage and a fuzzy matching system. I was having a lot of trouble coming up with an efficient way to perform fuzzy matching against a corpous of subjects efficiently using the data stores. ETS is... ets, and it appears that it's impossible to use it for any type of detailed string matching. SQLite had similar problems. While it had a lot more complicated and helpful utilities, it just couldn't do what I wanted when it comes to fuzziness. Then I thought, hey, why not just store the subjects in memory with all the places that they appear? Then you just keep that up to date, and you remove a lot of duplication, which improves performance a lot. Prior to this, ETS was taking around 300ms to do simple-ish matching, and SQLite was taking around 180ms. Fuzzy can do it in 14ms. Better yet, building a fuzzy from scratch only takes 40ms for lexical and all its deps. This includes doing much more interesting matching. I think we have a winner.
Prior, we were looking at all the modules in a project and indexing them, which was somewhat problematic, as we were gobbling up erlang modules as well as elixir ones. To start simpler, this just looks at your project's .ex and .exs files (including deps) and indexes them. It might be a bit overzealous, as it'll pick up _anything_ in the project root, but this can be tamed by using Mix and asking it where your sources and deps live and just indexing those files. TODO: We likely need to index both elixir and erlang's standard library, but this is a bit fraught, as it's not always available.
This allowed me to tighten up some of the types and interfaces. I'm still not super happy with exposing sync_to, but I couldn't figure out a way to do that efficiently with ets.
Fuzzy was pretty coupled naming wise to references and paths. This changes it around a bit to use the nomenclature of keys, values an subjects. I also played around with fuzzy matching and found that the results were quite poor when using all of the discovered modules. This was due to not taking the amount of match into account and not sorting in descending order. I added unit tests and moved things around.
* Renamed traverse_until to prewalk_until * Changed entries to hold ranges rather than start and finish keys * Changed naming in fuzzy.ex
Tests were written with hard-coded range tests, which are difficult to understand and fragile. I extracted RangeSupport into an include-able module that decorates a document's range with angle quotes (extracted from EntityTest) so we can make assertions about where the range begins and ends.
1a307ef
to
e5bd366
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still have to go through the ETS backend and store state but the updates to fuzzy and scorer look good!
apps/remote_control/lib/lexical/remote_control/search/store/ets.ex
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't look closely at the fuzzy part; I like the indexer part.
Apparently patching File is a no-no. I think I've done this before with poor results. We were getting an exit in the test process, with a process being killed. Not informative.
apps/remote_control/lib/lexical/remote_control/search/fuzzy/scorer.ex
Outdated
Show resolved
Hide resolved
apps/remote_control/lib/lexical/remote_control/search/fuzzy/scorer.ex
Outdated
Show resolved
Hide resolved
apps/remote_control/lib/lexical/remote_control/search/indexer/entry.ex
Outdated
Show resolved
Hide resolved
apps/remote_control/lib/lexical/remote_control/search/indexer/location.ex
Outdated
Show resolved
Hide resolved
apps/remote_control/lib/lexical/remote_control/search/indexer/extractors/module.ex
Outdated
Show resolved
Hide resolved
apps/remote_control/lib/lexical/remote_control/search/indexer/extractors/module.ex
Show resolved
Hide resolved
…orer.ex Co-authored-by: Zach Allaun <[email protected]>
This commit introduces indexing for module references. It currently will only work for files to which we have the source code, turns them into AST and then applies one or more "extractors" on it. An extractor runs over AST and indetifies things it's looking for, and then emits entries, which will be used to build a searchable index.
Currently, there is one extractor implemented: Module references. This does a fairly exhaustive search for module references, wherever they're hiding. It then uses the aliases module to find the original source module.
Note that most of the higher level functions are just placeholders. An indexing strategy needs to find things inside of .exs files as well as .ex files, and we currently have no way to index core elixir (this will be highly dependent on how elixir was installed and if we can get the source code to the version we're running on).