history based reindex #3951

Merged on Jun 8, 2022 (88 commits)

@vladak vladak commented May 24, 2022

This change reworks the indexer so that, when the history cache is present, the list of files to process is obtained from the SCM rather than by traversing the file tree and comparing file presence and timestamps against the documents in the index (which incurs many syscalls and potentially also I/O churn).

Works for Mercurial and Git.

This requires the following features to be enabled: projects, history, and history cache. Otherwise, file collection falls back to the file tree traversal method. Further, a project is eligible for history based reindex only if all of its repositories can extract the list of files by traversing the history. A set of tunables (on the global, project, and repository level) controls this behavior. The --historyBased option was added to the indexer to turn this on and off; the default is on.
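The eligibility rule described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual OpenGrok code: the class, field, and method names (`HistoryBasedEligibility`, `GlobalSettings`, `canTraverseHistory`, and so on) are all hypothetical, and the real tunables may be combined differently.

```java
import java.util.List;

/** Hypothetical sketch; none of these names are taken from the OpenGrok code base. */
final class HistoryBasedEligibility {

    /** Global switches: the three required features plus the global tunable. */
    static final class GlobalSettings {
        final boolean projectsEnabled;
        final boolean historyEnabled;
        final boolean historyCacheEnabled;
        final boolean historyBased; // assumed to mirror the --historyBased option

        GlobalSettings(boolean projectsEnabled, boolean historyEnabled,
                       boolean historyCacheEnabled, boolean historyBased) {
            this.projectsEnabled = projectsEnabled;
            this.historyEnabled = historyEnabled;
            this.historyCacheEnabled = historyCacheEnabled;
            this.historyBased = historyBased;
        }
    }

    /** A repository with its own tunable and a capability flag. */
    static final class Repo {
        final boolean historyBased;
        final boolean canTraverseHistory; // e.g. true for Mercurial and Git

        Repo(boolean historyBased, boolean canTraverseHistory) {
            this.historyBased = historyBased;
            this.canTraverseHistory = canTraverseHistory;
        }
    }

    /** A project with a project-level tunable and its repositories. */
    static final class Project {
        final boolean historyBased;
        final List<Repo> repos;

        Project(boolean historyBased, List<Repo> repos) {
            this.historyBased = historyBased;
            this.repos = repos;
        }
    }

    /** Returns true if the project may use history based reindex; false means fall back to traversal. */
    static boolean isEligible(GlobalSettings global, Project project) {
        if (!(global.projectsEnabled && global.historyEnabled
                && global.historyCacheEnabled && global.historyBased)) {
            return false;
        }
        if (!project.historyBased || project.repos.isEmpty()) {
            return false;
        }
        // Every repository of the project must be able to extract files from history.
        return project.repos.stream().allMatch(r -> r.historyBased && r.canTraverseHistory);
    }

    public static void main(String[] args) {
        GlobalSettings global = new GlobalSettings(true, true, true, true);
        Project project = new Project(true, List.of(new Repo(true, true), new Repo(true, false)));
        System.out.println("eligible: " + isEligible(global, project));
    }
}
```

In this sketch a single repository that cannot traverse its history makes the whole project fall back to file tree traversal, which matches the "all its repositories" condition stated above.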

Initial indexing is still done using the classic file tree traversal, as using SCM history would likely be counterproductive. Also, I like the idea of having two different implementations so that tests can verify there is no difference between them. Further, when someone needs to index files not tracked by the SCM in question, the above-mentioned tunables are handy for altering the behavior (obviously, if a file change is not tracked by SCM operations, it cannot be detected using history based reindex).

Using the Firefox Mercurial repository, with almost 1 million files and a day (May 20th) of changes stripped, for benchmarking on my Lenovo laptop with an Intel Core i7 and built-in SSD, the file collection time went down from 18 seconds to 8 seconds and the overall reindexing time (using economy mode) went from 2:08 to 1:53 minutes.

The file collection is done during the history cache refresh, using the visitor pattern. While there, I refactored IndexDatabase significantly.
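To illustrate the idea of collecting changed files during the history cache refresh, here is a minimal sketch of a single history traversal that feeds several visitors. It is an assumption-laden illustration only: `Changeset`, `ChangesetVisitor`, `HistoryCollector`, and `FileCollector` are hypothetical names and do not reflect the actual classes in this change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of visitor-based traversal; names are not from the OpenGrok code base. */
final class HistoryTraversalSketch {

    /** One changeset: a revision identifier plus the paths it touched. */
    static final class Changeset {
        final String revision;
        final List<String> files;

        Changeset(String revision, List<String> files) {
            this.revision = revision;
            this.files = files;
        }
    }

    /** Visitor invoked once for each changeset seen during the traversal. */
    interface ChangesetVisitor {
        void visit(Changeset changeset);
    }

    /** Stands in for history cache generation; here it only records revision ids. */
    static final class HistoryCollector implements ChangesetVisitor {
        final List<String> revisions = new ArrayList<>();

        @Override
        public void visit(Changeset changeset) {
            revisions.add(changeset.revision);
        }
    }

    /** Gathers the set of paths the indexer would need to process. */
    static final class FileCollector implements ChangesetVisitor {
        final Set<String> files = new TreeSet<>();

        @Override
        public void visit(Changeset changeset) {
            files.addAll(changeset.files);
        }
    }

    /** Walks the new changesets once, handing each one to every visitor. */
    static void traverse(List<Changeset> newChangesets, List<ChangesetVisitor> visitors) {
        for (Changeset changeset : newChangesets) {
            for (ChangesetVisitor visitor : visitors) {
                visitor.visit(changeset);
            }
        }
    }

    public static void main(String[] args) {
        List<Changeset> sinceLastIndexed = List.of(
                new Changeset("r101", List.of("src/a.c", "src/b.c")),
                new Changeset("r102", List.of("src/b.c", "README")));
        HistoryCollector history = new HistoryCollector();
        FileCollector fileCollector = new FileCollector();
        traverse(sinceLastIndexed, List.of(history, fileCollector));
        System.out.println(history.revisions + " -> " + fileCollector.files);
    }
}
```

The point of the design is that the SCM history is walked only once: the same pass that would refresh the history cache also yields the deduplicated list of files to reindex, so no separate file tree scan is needed.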

Vladimir Kotal and others added 30 commits May 24, 2022 18:20
- refactor
- use getDocument()
- collect all files from Git regardless of their nature
- store and grab last revision
- fix IndexDatabaseTest
- refactor
- fix paths when collecting the files
- change args counting in indexDown()
- set configuration to make repositories visible in RuntimeEnvironment
avoids '/dev/null' entries
This is necessary to allow for forced reindex from scratch.
renamed file detection was broken
This is preparation for the history traversal that generates the history
cache and collects the changed files at once.
- by moving IndexDownArgs away and introducing factory
- also fix bug in index traversal for history based reindex
- add parametrized test for IndexDatabase update and file changes
@vladak vladak added the indexer label May 24, 2022
@vladak vladak merged commit 059d25e into oracle:master Jun 8, 2022
@vladak vladak deleted the truly_incremental_reindex branch June 8, 2022 17:24