
history based incremental reindex for changeset based SCMs #3077

Closed
vladak opened this issue Mar 18, 2020 · 5 comments

vladak commented Mar 18, 2020

Is your feature request related to a problem? Please describe.
As mentioned in #3071 (comment), the indexer could take the list of files to process from the changesets that were used to update a given project. This way the reindex could be made truly incremental (currently only the history cache reindex is incremental) for repositories based on SCMs that operate on changesets. This would significantly reduce the time spent in indexDown() for repositories with a large number of files (depending on the number of files impacted by the added changesets).

Describe the solution you'd like
The opengrok-mirror script could generate the list of files (after all, it already performs the "incoming" check) and pass that to the indexer. This would be handy especially for per-project reindex.

Describe alternatives you've considered
The indexer would figure out the list of files itself. After all, the history cache stores the latest indexed changeset ID, and the repository classes already contain code for retrieving history entries and parsing the output of various SCM commands.
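
A rough sketch of what that alternative could look like (a hypothetical helper; none of the type or method names below are existing OpenGrok API):

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch only: collect the files touched by changesets newer
// than the latest changeset ID stored in the history cache.
class ChangedFileCollector {

    /** Minimal stand-in for a repository that can report its history. */
    interface HistorySource {
        Iterable<Changeset> changesetsSince(String sinceRevision);
    }

    /** Minimal stand-in for one changeset and the files it touches. */
    interface Changeset {
        Set<String> files();
    }

    /**
     * @param repository    one repository of the project (hypothetical abstraction)
     * @param sinceRevision latest changeset ID recorded by the history cache
     * @return project-relative paths changed after sinceRevision
     */
    Set<String> collectChangedFiles(HistorySource repository, String sinceRevision) {
        Set<String> paths = new TreeSet<>();
        for (Changeset changeset : repository.changesetsSince(sinceRevision)) {
            paths.addAll(changeset.files());
        }
        return paths;
    }
}
```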


vladak commented Mar 18, 2020

#3033 should also be considered, as it might share some code with the solution for this issue.


vladak commented Aug 21, 2020

Ruminating on the possible solutions:

The list of files produced by the opengrok-mirror script would have to be passed to the indexer somehow, via a text file on disk or possibly via a pipe (when running from opengrok-sync). With huge changes there could be some non-trivial toil involved.

The history cache stores the currently indexed revision in the OpenGroklatestRev file for each known repository of a given project. The code for parsing history (down to individual files) is already present in the indexer. The Python code that checks for incoming changes merely runs a command.

So, it makes more sense to implement this wholly in the indexer.
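
For illustration, reading the stored revision could be as simple as the sketch below (assuming the OpenGroklatestRev file contains just the changeset ID as text; the reading code itself is illustrative, not the actual history cache implementation):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class LatestRevisionReader {
    /**
     * Illustrative only: read the changeset ID that the history cache recorded
     * for one repository, to be used as the "since" revision for the next reindex.
     */
    static String readLatestIndexedRevision(Path repositoryCacheDir) throws IOException {
        Path latestRevFile = repositoryCacheDir.resolve("OpenGroklatestRev");
        return Files.readString(latestRevFile).trim();
    }
}
```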

vladak changed the title from "truly incremental reindex" to "truly incremental reindex for changeset based SCMs" on Mar 29, 2021

vladak commented Nov 4, 2021

Thinking about this a bit more: I wanted to hijack getHistory(File file, String sinceRevision, String tillRevision) of the RepositoryWithPerPartesHistory classes to extract the changed/deleted files (and symlinks!) in order to populate IndexDownArgs in IndexDatabase#update(). However, I realized that this would re-introduce #3243 unless the same approach (splitting history into chunks) were used, which I am not thrilled to do in IndexDatabase as it would complicate it needlessly.

Instead, getHistory(File file, String sinceRevision, String tillRevision) should be converted into a consumer of a generic method that traverses the history in a given range and invokes a callback for each commit with the lists of files. The IndexDownArgs collector would use that traversal method too.
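
A rough sketch of that traversal contract (interface and method names are made up for illustration and are not what eventually landed):

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical callback invoked once per changeset during history traversal. */
interface ChangesetVisitor {
    void visit(String revision, Set<String> changedFiles, Set<String> deletedFiles);
}

/** Hypothetical traversal contract shared by getHistory() and the IndexDownArgs collector. */
interface TraversableRepository {
    void traverseHistory(String sinceRevision, String tillRevision, ChangesetVisitor visitor);
}

/** Example consumer that accumulates the file lists needed to build IndexDownArgs. */
class FileListCollector implements ChangesetVisitor {
    final Set<String> changed = new TreeSet<>();
    final Set<String> deleted = new TreeSet<>();

    @Override
    public void visit(String revision, Set<String> changedFiles, Set<String> deletedFiles) {
        changed.addAll(changedFiles);
        deleted.addAll(deletedFiles);
    }
}
```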


vladak commented Nov 4, 2021

It may be worth noting that this approach will not avoid the syscall (I/O) churn completely, especially if there are a lot of renames: the files coming from the history traversal method would have to be checked for existence on the file system before being turned into IndexFileWork objects. Otherwise, if file A was renamed to B in one changeset and then from B to C in a subsequent changeset, the list would contain both B and C. IndexDatabase#addFile() would then choke on B later when indexParallel() runs, because the B file would no longer exist. Hence it is necessary to perform the existence check and reduce the list to files present on the file system, which leads to some syscall load. Still makes sense, I think.
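
A sketch of that existence filter (simplified path handling, illustrative names), which drops the intermediate B from the A -> B -> C rename chain:

```java
import java.io.File;
import java.util.Set;
import java.util.TreeSet;

class ExistenceFilter {
    /**
     * Keep only the candidate paths that still exist under the source root,
     * so files renamed away in later changesets never become index work items.
     */
    static Set<String> onlyExisting(File sourceRoot, Set<String> candidatePaths) {
        Set<String> present = new TreeSet<>();
        for (String relativePath : candidatePaths) {
            // One existence check (syscall) per candidate -- the residual I/O
            // churn mentioned above.
            if (new File(sourceRoot, relativePath).exists()) {
                present.add(relativePath);
            }
        }
        return present;
    }
}
```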

It is also a question whether this makes sense for the initial indexing or, in general, for incremental reindex of sizable changesets (in terms of the number of changed files); however, this is impossible to decide without actually traversing the history, getting the list of changed files, and comparing that to the number of files in the given repository.
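
If such a decision were ever attempted, it would boil down to a simple ratio check along these lines (the cutoff is an arbitrary placeholder, not a measured value):

```java
class ReindexStrategy {
    /**
     * Illustrative heuristic only: prefer the history-based file list when the
     * changesets touch a small fraction of the repository, otherwise fall back
     * to the full indexDown() walk. The 0.5 cutoff is an arbitrary example.
     */
    static boolean preferHistoryBasedReindex(int changedFileCount, int totalFileCount) {
        if (totalFileCount == 0) {
            return false;
        }
        return (double) changedFileCount / totalFileCount < 0.5;
    }
}
```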

Also, this general approach would only work if all repositories of a given project supported the history traversal. That is, if GitRepository were converted to support the file list extraction but MercurialRepository was not, and the project had a combination of these repository types, IndexDatabase#update() would have to fall back to the indexDown() method. Even if a project consisted entirely of GitRepository repositories (assuming this conversion is done), this approach assumes there are no files untracked by Git in the project.
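
The fallback condition could be expressed along these lines (supportsFileListExtraction() is a hypothetical capability check, not existing API):

```java
import java.util.List;

class ReindexPlanner {

    /** Hypothetical capability check on a repository. */
    interface RepositoryCapability {
        boolean supportsFileListExtraction();
    }

    /**
     * History-based reindex is only usable when every repository of the project
     * can produce per-changeset file lists; otherwise IndexDatabase#update()
     * would have to fall back to the classic indexDown() tree walk.
     */
    static boolean canUseHistoryBasedReindex(List<RepositoryCapability> projectRepositories) {
        return projectRepositories.stream()
                .allMatch(RepositoryCapability::supportsFileListExtraction);
    }
}
```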

vladak self-assigned this on May 5, 2022
vladak changed the title from "truly incremental reindex for changeset based SCMs" to "history based incremental reindex for changeset based SCMs" on May 20, 2022

vladak commented Jun 8, 2022

Fixed in #3951.

vladak closed this as completed on Jun 8, 2022