history based reindex #3951

vladak · 2022-05-24T16:49:50Z

This change reworks the indexer so that when history cache is present, the list of files to process by the indexer is acquired using the SCM, rather than traversing the file tree and comparing the file presence/timestamps against the documents in the index (which incurs lots of syscalls and potentially also I/O churn).

Works for Mercurial and Git.

This requires these features to be enabled: projects, history, history cache. Otherwise, file collection falls back to the file tree traversal method. Further, a project is eligible for history based reindex only if all its repositories can extract the list of files by traversing the history. A set of tunables (on the global, project and repository level) controls this behavior. The --historyBased indexer option was added to turn this on and off. The default is on.
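To make the eligibility rule above concrete, here is a minimal sketch. All types and names in it (`GlobalSettings`, `Project`, `Repo`, `isEligible`) are hypothetical and only illustrate the rule; they are not the classes used in this change. Treating a project without repositories as ineligible is an assumption.

```java
import java.util.List;

// Illustrative sketch only: these types and names are hypothetical, not OpenGrok classes.
final class HistoryBasedEligibility {

    /** Global switches: the three required features plus the global tunable. */
    record GlobalSettings(boolean projectsEnabled, boolean historyEnabled,
                          boolean historyCacheEnabled, boolean historyBasedReindex) {}

    /** Per-repository view: its own tunable and whether its SCM can list files from history. */
    record Repo(String name, boolean historyBasedReindex, boolean canTraverseHistory) {}

    /** Per-project view: its own tunable and the repositories it contains. */
    record Project(String name, boolean historyBasedReindex, List<Repo> repos) {}

    /** Returns true if the project may use history based reindex, false to fall back to tree traversal. */
    static boolean isEligible(GlobalSettings global, Project project) {
        boolean featuresOn = global.projectsEnabled() && global.historyEnabled()
                && global.historyCacheEnabled() && global.historyBasedReindex();
        if (!featuresOn || !project.historyBasedReindex()) {
            return false;
        }
        // Assumption: a project without repositories has no history to consult.
        if (project.repos().isEmpty()) {
            return false;
        }
        // Every repository must qualify; one ineligible repository disables it for the whole project.
        return project.repos().stream()
                .allMatch(r -> r.historyBasedReindex() && r.canTraverseHistory());
    }

    public static void main(String[] args) {
        GlobalSettings global = new GlobalSettings(true, true, true, true);
        Repo git = new Repo("git-repo", true, true);
        Repo other = new Repo("other-repo", true, false); // an SCM without history traversal
        System.out.println(isEligible(global, new Project("a", true, List.of(git))));
        System.out.println(isEligible(global, new Project("b", true, List.of(git, other))));
    }
}
```

The all-or-nothing check per project mirrors the description: a single repository that cannot traverse its history sends the whole project back to file tree traversal.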

Initial indexing is still done using classic file tree traversal, as using SCM history would likely be counterproductive. Also, I like the idea of having two different implementations so that tests can verify there is no difference. Further, when someone needs to index files not tracked by the SCM in question, the above-mentioned tunables are handy to alter the behavior (obviously, if a file change is not tracked by SCM operations, it cannot be detected using history based reindex).

Using the Firefox Mercurial repository with almost 1 million files, with a day (May 20th) of changes stripped for benchmarking, on my Lenovo laptop with Intel Core i7 and built-in SSD, the file collection went down from 18 seconds to 8 seconds and the overall re-indexing time (using economy mode) went from 2:08 to 1:53 minutes.

The file collection is done during the history cache refresh, using the visitor pattern. While there, I refactored IndexDatabase significantly.
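The idea of collecting files during the history cache refresh with a list of visitors can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this change: `Changeset`, `ChangesetVisitor`, `FileCollector` and `RevisionRecorder` are hypothetical names, the history is assumed to be walked oldest first, and the place where '/dev/null' entries get filtered is a guess.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: hypothetical types, not the actual OpenGrok classes.
final class HistoryTraversalSketch {

    /** One changeset as reported by the SCM: revision id plus the paths it touched or deleted. */
    record Changeset(String revision, List<String> changedPaths, List<String> deletedPaths) {}

    /** Visitor invoked once per changeset during a single history traversal. */
    interface ChangesetVisitor {
        void visit(Changeset changeset);
    }

    /** Collects the files the indexer has to add/update or remove. */
    static final class FileCollector implements ChangesetVisitor {
        final Set<String> changed = new LinkedHashSet<>();
        final Set<String> deleted = new LinkedHashSet<>();

        @Override
        public void visit(Changeset changeset) {
            for (String path : changeset.changedPaths()) {
                if (path.equals("/dev/null")) {
                    continue; // assumption: drop the placeholder for a missing side of a change
                }
                changed.add(path);
                deleted.remove(path); // re-added after an earlier deletion
            }
            for (String path : changeset.deletedPaths()) {
                deleted.add(path);
                changed.remove(path); // changed earlier, but gone by the end of the range
            }
        }
    }

    /** Stand-in for the history cache writer: here it only records the revisions it saw. */
    static final class RevisionRecorder implements ChangesetVisitor {
        final List<String> revisions = new ArrayList<>();

        @Override
        public void visit(Changeset changeset) {
            revisions.add(changeset.revision());
        }
    }

    /** Walks the history once (assumed oldest first) and feeds every visitor. */
    static void traverse(List<Changeset> history, List<ChangesetVisitor> visitors) {
        for (Changeset changeset : history) {
            for (ChangesetVisitor visitor : visitors) {
                visitor.visit(changeset);
            }
        }
    }

    public static void main(String[] args) {
        List<Changeset> history = List.of(
                new Changeset("c1", List.of("a.c", "b.c"), List.of()),
                new Changeset("c2", List.of("/dev/null", "c.c"), List.of("b.c")),
                new Changeset("c3", List.of("b.c"), List.of("a.c")));
        FileCollector collector = new FileCollector();
        RevisionRecorder recorder = new RevisionRecorder();
        traverse(history, List.of(collector, recorder));
        System.out.println("changed: " + collector.changed);
        System.out.println("deleted: " + collector.deleted);
        System.out.println("revisions: " + recorder.revisions);
    }
}
```

Passing a list of visitors means a single pass over the history can serve both consumers, so the SCM is not queried twice.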

Vladimir Kotal and others added 30 commits May 24, 2022 18:20

truly incremental reindex

fc53bae

next stage

1665873

- refactor - use getDocument() - collect all files from Git regardless of their nature

next chunk of changes

32b4cd6

- store and grab last revision - fix IndexDatabaseTest - refactor

fix some nits

f65b382

IndexDatabase grew too long

6b9b964

fix more style nits

e18d9cb

add missing whitespace

390dbae

fix testXrefGeneration()

b2fc531

avoid NPE, fix test to be consistent

7e740a8

add global tunable

12365fb

add notes/comments

37bbf5c

make it work in the basic mode

b8ad142

- fix paths when collecting the files - change args counting in indexDown() - set configuration to make repositories visible in RuntimeEnvironment

add per project property

7a438ac

fix deleted files harvesting

cab9624

avoids '/dev/null' entries

even for truly incremental reindex the whole index has to be traversed

af69832

This is necessary to allow for forced reindex from scratch.

fix FileHistoryCacheTest

660da71

renamed file detection was broken

renamed parts should be part of the changed files in HistoryEntry

1a0b523

remove debug-only code

99642fe

remove unused import

4328c1b

handle trailing terms properly for history based reindex

f4192d9

check if repository has history enabled

ca2549f

refactor truly incremental check for repository

2cf4698

convert visitor pattern (use list of visitors)

8b9069f

This is preparation for the history traversal that generates the history cache and collects the changed files at once.

remove trailing space

72882ae

move the CommitInfo construction

f5a0be4

make indexDown*() testable

2d8cba2

- by moving IndexDownArgs away and introducing factory - also fix bug in index traversal for history based reindex - add parametrized test for IndexDatabase update and file changes

truly incremental -> history based

f457197

remove redundant public modifier

4fdf134

remove the VisibleForTesting annotation

4cad065

fix nits

29312e6

vladak added 25 commits May 24, 2022 18:20

add checks for history related tunables

3372675

use single Statistics instance when reporting file collection

c54bd05

add project-less based test for history based reindex

68ee168

unwrap the line for better readability

a96f032

add check for numCommits argument value

516b9eb

convert Mercurial to RepositoryWithHistoryTraversal

794d0b7

add Override annotation

1799a97

limit the visibility

001ca5d

remove unused imports

5559579

fix style

2e99b75

fix style

ea354d5

do not consider history vs. history based reindex as configuration problem

202e6da

move configuration check to Configuration class

61dce4c

reuse already existing copyDirectory()

f542b68

bump year

f682e5b

copy files preserving attributes

6824825

on Windows this avoids Git detecting the files as modified

re-clone the Git repository in setup

38dee2a

make sure the move does not fail on Windows

9cfb33d

add asserts for Git operations

a4a222e

close the Git object

08db34c

fix the test

6dc8614

copyDirectory needs to copy sub-directories as well

remove obsolete comment

7eea590

the index is now created for each test separately

do not use main.o for Git tests

6f227d5

ELF analyzer uses RandomAccessFile which has troubles with closing the file on Windows. As a result main.o cannot be deleted which leads to failure of the testGetIndexDownArgs.

fix Windows path

b0a8246

use native path separator

855e7d6

fixes the test on Windows

vladak added the indexer label May 24, 2022

vladak merged commit 059d25e into oracle:master Jun 8, 2022

vladak deleted the truly_incremental_reindex branch June 8, 2022 17:24

vladak mentioned this pull request Jun 8, 2022

history based incremental reindex for changeset based SCMs #3077

Closed

vladak mentioned this pull request Mar 9, 2023

Duplicate Results while searching in Opengrok #4180

Closed
