history based reindex #3951
Merged
- refactor - use getDocument() - collect all files from Git regardless of their nature
- store and grab last revision - fix IndexDatabaseTest - refactor
- fix paths when collecting the files - change args counting in indexDown() - set configuration to make repositories visible in RuntimeEnvironment
avoids '/dev/null' entries
This is necessary to allow for forced reindex from scratch.
renamed file detection was broken
This is preparation for the history traversal that generates the history cache and collects the changed files at once.
- by moving IndexDownArgs away and introducing factory - also fix bug in index traversal for history based reindex - add parametrized test for IndexDatabase update and file changes
on Windows this avoids Git detecting the files as modified
copyDirectory needs to copy sub-directories as well
the index is now created for each test separately
The ELF analyzer uses RandomAccessFile, which has trouble closing the file on Windows. As a result, main.o cannot be deleted, which causes testGetIndexDownArgs to fail.
fixes the test on Windows
This change reworks the indexer so that when the history cache is present, the list of files to be processed by the indexer is acquired using the SCM, rather than by traversing the file tree and comparing file presence/timestamps against the documents in the index (which incurs lots of syscalls and potentially also I/O churn).
Works for Mercurial and Git.
This requires these features to be enabled: projects, history, history cache. Otherwise, the file collection will fall back to the file traversal method. Further, a project will be eligible for history based reindex only if all its repositories can perform the extraction of files by traversing the history. There is a set of tunables (on global/project/repository level) that can set this behavior. The `--historyBased` option was added to the indexer to turn this on and off. The default is on.
Initial indexing is still done using classic file tree traversal, as using SCM history would likely be counterproductive. Also, I like the idea of having two different implementations so that tests can verify there is no difference. Further, when someone needs to index files not tracked by the SCM in question, the above mentioned tunables are handy to alter the behavior (obviously, if a file change is not tracked with SCM operations, it cannot be detected using history based reindex).
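The eligibility rules above can be summarized in a small decision function. This is only a sketch under stated assumptions: `HistoryBasedEligibility`, `Settings`, and `useHistoryBasedReindex` are hypothetical names made up for illustration and do not come from the OpenGrok code base, and the per-repository capability is reduced to a plain boolean. Treating a project without repositories as ineligible is also an assumption.

```java
import java.util.List;

// Hypothetical sketch; none of these names are taken from the OpenGrok sources.
final class HistoryBasedEligibility {

    /** The switches named in the description, modelled as plain booleans. */
    static final class Settings {
        final boolean projectsEnabled;
        final boolean historyEnabled;
        final boolean historyCacheEnabled;
        final boolean historyBasedReindex; // stands in for the --historyBased option

        Settings(boolean projectsEnabled, boolean historyEnabled,
                 boolean historyCacheEnabled, boolean historyBasedReindex) {
            this.projectsEnabled = projectsEnabled;
            this.historyEnabled = historyEnabled;
            this.historyCacheEnabled = historyCacheEnabled;
            this.historyBasedReindex = historyBasedReindex;
        }
    }

    /**
     * Returns true if the file list should come from SCM history,
     * false if the indexer should fall back to file tree traversal.
     *
     * @param repositoryCapable one flag per repository of the project:
     *        can it extract changed files by traversing its history?
     */
    static boolean useHistoryBasedReindex(Settings settings, boolean indexExists,
                                          List<Boolean> repositoryCapable) {
        // Initial indexing always uses the classic file tree traversal.
        if (!indexExists) {
            return false;
        }
        // All required features must be enabled.
        if (!(settings.projectsEnabled && settings.historyEnabled
                && settings.historyCacheEnabled && settings.historyBasedReindex)) {
            return false;
        }
        // Assumption: a project without repositories has no history to consult.
        if (repositoryCapable.isEmpty()) {
            return false;
        }
        // Every repository of the project must support the extraction.
        for (boolean capable : repositoryCapable) {
            if (!capable) {
                return false;
            }
        }
        return true;
    }
}
```

With this shape, a single repository that cannot traverse its history is enough to send the whole project back to file tree traversal, which matches the "only if all its repositories" wording above.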
Using the Firefox Mercurial repository with almost 1 million files, with one day (May 20th) of changes stripped for benchmarking, on my Lenovo laptop with an Intel Core i7 and built-in SSD, the file collection time went down from 18 seconds to 8 seconds and the overall reindexing time (using economy mode) went from 2:08 to 1:53 minutes.
The file collection is done during the history cache refresh, using the visitor pattern. While there, I refactored `IndexDatabase` significantly.
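To illustrate the idea of collecting files during the history cache refresh, here is a minimal sketch of a single history traversal feeding several visitors. All names (`HistoryTraversalSketch`, `Changeset`, `ChangesetVisitor`, `FileCollector`, `CacheBuilder`) are hypothetical and chosen for illustration; the actual classes in this change may be organized differently, and the "cache" here is just a list of revision identifiers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch of one traversal serving two consumers.
final class HistoryTraversalSketch {

    /** One history entry: a revision and the paths it touched. */
    static final class Changeset {
        final String revision;
        final List<String> files;

        Changeset(String revision, List<String> files) {
            this.revision = revision;
            this.files = files;
        }
    }

    /** Called once per changeset while the history is traversed. */
    interface ChangesetVisitor {
        void visit(Changeset changeset);
    }

    /** Collects the paths changed across all visited changesets, without duplicates. */
    static final class FileCollector implements ChangesetVisitor {
        private final SortedSet<String> files = new TreeSet<>();

        @Override
        public void visit(Changeset changeset) {
            files.addAll(changeset.files);
        }

        SortedSet<String> getFiles() {
            return files;
        }
    }

    /** Stand-in for history cache generation; it only records revision ids. */
    static final class CacheBuilder implements ChangesetVisitor {
        private final List<String> revisions = new ArrayList<>();

        @Override
        public void visit(Changeset changeset) {
            revisions.add(changeset.revision);
        }

        List<String> getRevisions() {
            return revisions;
        }
    }

    /** Walks the history once and hands every changeset to every visitor. */
    static void traverse(List<Changeset> history, List<ChangesetVisitor> visitors) {
        for (Changeset changeset : history) {
            for (ChangesetVisitor visitor : visitors) {
                visitor.visit(changeset);
            }
        }
    }
}
```

The point of this arrangement is that the history is read from the SCM only once: the same pass that refreshes the cache also produces the list of files for the indexer.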