
history based incremental reindex for changeset based SCMs #3077

Closed
vladak opened this issue Mar 18, 2020 · 5 comments

vladak commented Mar 18, 2020

Is your feature request related to a problem? Please describe.
As mentioned in #3071 (comment), the indexer could take the list of files to process from the changesets that were used to update a given project. This way the reindex could be made truly incremental (currently only the history cache reindex is incremental) for repositories based on SCMs that operate on changesets. This would significantly reduce the time spent in indexDown() for repositories with a large number of files (depending on the number of files impacted by the added changesets).

Describe the solution you'd like
The opengrok-mirror script could generate the list of files (after all, it already performs the "incoming" check) and pass that to the indexer. This would be handy especially for per-project reindex.

Describe alternatives you've considered
The indexer would figure out the list of files itself. After all, the history cache stores the latest indexed changeset ID, and the repository classes already contain code for retrieving history entries and parsing the output of various SCM commands.
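
A rough sketch of what that alternative could look like (a hypothetical helper; none of the type or method names below are existing OpenGrok API):

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch only: collect the files touched by changesets newer
// than the latest changeset ID stored in the history cache.
class ChangedFileCollector {

    /** Minimal stand-in for a repository that can report its history. */
    interface HistorySource {
        Iterable<Changeset> changesetsSince(String sinceRevision);
    }

    /** Minimal stand-in for one changeset and the files it touches. */
    interface Changeset {
        Set<String> files();
    }

    /**
     * @param repository    one repository of the project (hypothetical abstraction)
     * @param sinceRevision latest changeset ID recorded by the history cache
     * @return project-relative paths changed after sinceRevision
     */
    Set<String> collectChangedFiles(HistorySource repository, String sinceRevision) {
        Set<String> paths = new TreeSet<>();
        for (Changeset changeset : repository.changesetsSince(sinceRevision)) {
            paths.addAll(changeset.files());
        }
        return paths;
    }
}
```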


vladak commented Mar 18, 2020

#3033 should also be considered, as it might share some code with the solution for this issue.


vladak commented Aug 21, 2020

Ruminating on the possible solutions:

The list of files produced by the opengrok-mirror script would have to be passed to the indexer somehow, via a text file on disk or possibly via a pipe (when running from opengrok-sync). With huge changes there could be some non-trivial toil involved.

The history cache stores the currently indexed revision in the OpenGroklatestRev file for each known repository of a given project. The code for parsing history (down to individual files) is already present in the indexer. The Python code that checks for incoming changes merely runs a command.

So, it makes more sense to implement this wholly in the indexer.
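
For illustration, reading the stored revision could be as simple as the sketch below (assuming the OpenGroklatestRev file contains just the changeset ID as text; the reading code itself is illustrative, not the actual history cache implementation):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class LatestRevisionReader {
    /**
     * Illustrative only: read the changeset ID that the history cache recorded
     * for one repository, to be used as the "since" revision for the next reindex.
     */
    static String readLatestIndexedRevision(Path repositoryCacheDir) throws IOException {
        Path latestRevFile = repositoryCacheDir.resolve("OpenGroklatestRev");
        return Files.readString(latestRevFile).trim();
    }
}
```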

vladak changed the title from "truly incremental reindex" to "truly incremental reindex for changeset based SCMs" on Mar 29, 2021

vladak commented Nov 4, 2021

Thinking about this a bit more: I wanted to hijack getHistory(File file, String sinceRevision, String tillRevision) of the RepositoryWithPerPartesHistory classes to extract the changed/deleted files (and symlinks!) in order to populate IndexDownArgs in IndexDatabase#update(). However, I realized that this would re-introduce #3243 unless the same approach (splitting history into chunks) were used, which I am not thrilled to do in IndexDatabase as it would complicate it needlessly.

Instead, getHistory(File file, String sinceRevision, String tillRevision) should be converted into a consumer of a generic method that traverses the history in a given range and invokes a callback for each commit with the lists of files. The IndexDownArgs collector would use that traversal method too.
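
A rough sketch of that traversal contract (interface and method names are made up for illustration and are not what eventually landed):

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical callback invoked once per changeset during history traversal. */
interface ChangesetVisitor {
    void visit(String revision, Set<String> changedFiles, Set<String> deletedFiles);
}

/** Hypothetical traversal contract shared by getHistory() and the IndexDownArgs collector. */
interface TraversableRepository {
    void traverseHistory(String sinceRevision, String tillRevision, ChangesetVisitor visitor);
}

/** Example consumer that accumulates the file lists needed to build IndexDownArgs. */
class FileListCollector implements ChangesetVisitor {
    final Set<String> changed = new TreeSet<>();
    final Set<String> deleted = new TreeSet<>();

    @Override
    public void visit(String revision, Set<String> changedFiles, Set<String> deletedFiles) {
        changed.addAll(changedFiles);
        deleted.addAll(deletedFiles);
    }
}
```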


vladak commented Nov 4, 2021

It may be worth noting that this approach will not avoid the syscall (I/O) churn completely, especially if there are a lot of renames: the files coming from the history traversal method would have to be checked for existence on the file system before being turned into IndexFileWork objects. Otherwise, if file A was renamed to B in one changeset and then from B to C in a subsequent changeset, the list would contain both B and C. IndexDatabase#addFile() would then choke on B later when indexParallel() runs, because the B file would no longer exist. Hence it is necessary to perform the existence check and reduce the list to files present on the file system, which leads to some syscall load. Still makes sense, I think.
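
A sketch of that existence filter (simplified path handling, illustrative names), which drops the intermediate B from the A -> B -> C rename chain:

```java
import java.io.File;
import java.util.Set;
import java.util.TreeSet;

class ExistenceFilter {
    /**
     * Keep only the candidate paths that still exist under the source root,
     * so files renamed away in later changesets never become index work items.
     */
    static Set<String> onlyExisting(File sourceRoot, Set<String> candidatePaths) {
        Set<String> present = new TreeSet<>();
        for (String relativePath : candidatePaths) {
            // One existence check (syscall) per candidate -- the residual I/O
            // churn mentioned above.
            if (new File(sourceRoot, relativePath).exists()) {
                present.add(relativePath);
            }
        }
        return present;
    }
}
```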

It is also a question whether this makes sense for the initial indexing or, in general, for incremental reindex of sizable changesets (in terms of the number of changed files); however, this is impossible to decide without actually traversing the history, getting the list of changed files, and comparing that to the number of files in the given repository.
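
If such a decision were ever attempted, it would boil down to a simple ratio check along these lines (the cutoff is an arbitrary placeholder, not a measured value):

```java
class ReindexStrategy {
    /**
     * Illustrative heuristic only: prefer the history-based file list when the
     * changesets touch a small fraction of the repository, otherwise fall back
     * to the full indexDown() walk. The 0.5 cutoff is an arbitrary example.
     */
    static boolean preferHistoryBasedReindex(int changedFileCount, int totalFileCount) {
        if (totalFileCount == 0) {
            return false;
        }
        return (double) changedFileCount / totalFileCount < 0.5;
    }
}
```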

Also, this general approach would only work if all repositories of a given project supported the history traversal. That is, if GitRepository were converted to support the file list extraction but MercurialRepository was not, and the project had a combination of these repository types, IndexDatabase#update() would have to fall back to the indexDown() method. Even if a project consisted entirely of GitRepository repositories (assuming this conversion is done), this approach assumes there are no files untracked by Git in the project.
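
The fallback condition could be expressed along these lines (supportsFileListExtraction() is a hypothetical capability check, not existing API):

```java
import java.util.List;

class ReindexPlanner {

    /** Hypothetical capability check on a repository. */
    interface RepositoryCapability {
        boolean supportsFileListExtraction();
    }

    /**
     * History-based reindex is only usable when every repository of the project
     * can produce per-changeset file lists; otherwise IndexDatabase#update()
     * would have to fall back to the classic indexDown() tree walk.
     */
    static boolean canUseHistoryBasedReindex(List<RepositoryCapability> projectRepositories) {
        return projectRepositories.stream()
                .allMatch(RepositoryCapability::supportsFileListExtraction);
    }
}
```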

vladak self-assigned this on May 5, 2022
vladak changed the title from "truly incremental reindex for changeset based SCMs" to "history based incremental reindex for changeset based SCMs" on May 20, 2022

vladak commented Jun 8, 2022

Fixed in #3951.

vladak closed this as completed on Jun 8, 2022