
This issue was moved to a discussion.


Can the indexed data be imported in an OpenGrok that another OpenGrok indexed? #3758

Closed
jetm opened this issue Nov 1, 2021 · 7 comments

Comments


jetm commented Nov 1, 2021

This is not a bug report; it is more of a question.

There are huge repositories (double-digit GB of code) where OpenGrok takes hours to finish indexing because reindexing always fails. Before OpenGrok indexes the new code fetched by Git, the indexed data is removed and the indexing must be done from scratch, so OpenGrok cannot be used for all those hours.

Is it possible to index the new code in an OpenGrok running on a different machine, copy the newly indexed data, and reuse it in the production OpenGrok? Aside from the source code and the indexed data, what else needs to be copied? The configuration file?

Contributor

ahornace commented Nov 1, 2021

I don't see a problem as long as you use the same major version and configuration (history generation etc.). You should also be able to add the new project without restart (using https://opengrok.docs.apiary.io/#reference/0/projects/add-project and https://opengrok.docs.apiary.io/#reference/0/project-metadata-management/marks-project-as-indexed).
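
For readers following along, the two REST calls linked above can be sketched roughly as follows. This is only an illustration, not an official client: the `/api/v1/projects` paths, the plain-text request body, and the optional bearer token are a reading of the linked API documentation and should be checked against the OpenGrok version in use. The base URI and the function name `build_add_project_requests` are hypothetical.

```python
import urllib.request
from urllib.parse import quote


def build_add_project_requests(base_uri, project, token=None):
    """Build (but do not send) the 'add project' and 'mark as indexed' requests.

    Assumption: the web application exposes its REST API under
    <base_uri>/api/v1, as described in the linked API documentation.
    """
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    api = base_uri.rstrip("/") + "/api/v1/projects"

    # Add the project: the project name is sent as a plain-text body.
    add = urllib.request.Request(
        api,
        data=project.encode("utf-8"),
        headers={**headers, "Content-Type": "text/plain"},
        method="POST",
    )

    # Mark the project as indexed; the name is URL-encoded in the path.
    mark = urllib.request.Request(
        f"{api}/{quote(project, safe='')}/indexed",
        data=b"",
        headers=headers,
        method="PUT",
    )
    return add, mark
```

Calling `build_add_project_requests("http://localhost:8080/source", "foo")` returns the two request objects; each could then be sent with `urllib.request.urlopen(req)` once the index data for `foo` has been copied into place.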

However, what do you mean by "because reindexing always fails"? Is there a problem with OpenGrok that should be fixed rather than worked around like this?

Author

jetm commented Nov 2, 2021

@ahornace, thank you for your quick response.

By "because reindexing always fails" I meant that OpenGrok reindexing of our huge repositories is unreliable, for the following reasons:

  • Sometimes the reindexing process is very slow, slower than indexing from scratch. The logs are not helpful because they show the same messages as the previous reindex, only more slowly. Indexing one repo can take a day when the previous reindex took one hour.

  • Sometimes it silently fails to reindex new code. The reindex finishes without errors, but later somebody reports that their most recent change is not showing up in OpenGrok.

  • Reindexing is slow in our huge repositories even when Git pulled a single file change. If only one file changed, I expect it to finish quickly, but it takes double-digit minutes to complete. Compared to indexing from scratch, there is not much difference.

Those problems continue with the most recent OpenGrok releases.

Member

vladak commented Nov 3, 2021

> @ahornace, thank you for your quick response.
>
> By "because reindexing always fails" I meant that OpenGrok reindexing of our huge repositories is unreliable, for the following reasons:
>
> * Sometimes the reindexing process is very slow, slower than indexing from scratch. The logs are not helpful because they show the same messages as the previous reindex, only more slowly. Indexing one repo can take a day when the previous reindex took one hour.

Time to bump the log level up to see what is going on. Maybe use --progress as well.

> * Sometimes it silently fails to reindex new code. The reindex finishes without errors, but later somebody reports that their most recent change is not showing up in OpenGrok.

These need to be debugged case by case. What exactly is missing? Xref? History? Something else?

> * Reindexing is slow in our huge repositories even when Git pulled a single file change. If only one file changed, I expect it to finish quickly, but it takes double-digit minutes to complete. Compared to indexing from scratch, there is not much difference.

This is caused by the directory traversal that happens for every reindex. The fix is tracked by #3077.

Author

jetm commented Nov 3, 2021

It's already using --progress, but that is not helpful either: I see the same output as in the previous run, just with slower progress.

Usually, it's missing the indexed data. Git fetches the new code change, OpenGrok reindexes the repo, and the indexer reports that it finished successfully. The OpenGrok web page shows the date when indexing finished, but the change does not show up in OpenGrok when you search or open the file directly. It's difficult to debug because the logs contain no errors related to the missing indexed data.

Yes, I am aware of #3077. Thank you for sharing. Because of that issue, I assumed that reindexing was, or might still be, broken, and that problems like the ones I experienced would be expected in big repositories. #3077 made me switch from reindexing to always indexing from scratch.

I have another question, because I tried different combinations without success and the documentation is not clear: what is the workflow to index a single project on its own? Is it supported? I mean, Git pulls the change for the foo repository and I then tell OpenGrok to index only the foo repository.

Member

vladak commented Nov 4, 2021

> It's already using --progress, but that is not helpful either: I see the same output as in the previous run, just with slower progress.

Grabbing the stack traces of the indexer process with jstack at a moment when no progress is reported in the logs might shed more light.
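
A minimal sketch of how such dumps could be collected is shown below. It assumes the JDK's `jstack` tool is on the PATH and that the indexer's process ID is already known; the helper names, the output file naming, and the default count and interval are all hypothetical choices for illustration.

```python
import subprocess
import time
from pathlib import Path


def jstack_command(pid):
    """Command line for a single thread dump of the given JVM process."""
    return ["jstack", str(pid)]


def capture_thread_dumps(pid, out_dir, count=3, interval_s=10.0, run=subprocess.run):
    """Write `count` thread dumps, `interval_s` seconds apart; return the file paths.

    `run` is injectable so the function can be exercised without a real JVM.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        result = run(jstack_command(pid), capture_output=True, text=True, check=False)
        path = out / f"jstack-{pid}-{i}.txt"
        path.write_text(result.stdout)
        paths.append(path)
        if i < count - 1:
            time.sleep(interval_s)  # space the dumps out so they can be compared
    return paths
```

Taking several dumps a few seconds apart makes it easier to tell whether the same threads sit in the same frames each time, which is usually more informative than a single snapshot.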

> Usually, it's missing the indexed data. Git fetches the new code change, OpenGrok reindexes the repo, and the indexer reports that it finished successfully. The OpenGrok web page shows the date when indexing finished, but the change does not show up in OpenGrok when you search or open the file directly. It's difficult to debug because the logs contain no errors related to the missing indexed data.

It would be nice to get to the bottom of this, because this is the first time I have heard of such a problem. I mean a functional problem, not a performance one. For each file reported in the logs with DefaultIndexChangedListener.fileAdd (logged at the FINE level), there should be a corresponding xref/index refresh. Were the files for which the problem happened reported in the logs?

The indexer traverses the whole directory tree of the given project (in IndexDatabase#indexDown()) and, for each file present, checks its last modified time stamp against the time stamp of the document corresponding to that file in the index. If the time stamp of the file on the file system is greater, the document is refreshed. So either there was a problem with identifying that the file had changed, or something failed during the document refresh. It also would not hurt to check that the file was indeed updated on the file system, especially its last modified time.
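
The time stamp comparison described above can be illustrated with a small standalone sketch. This is not OpenGrok's code: the `indexed_mtimes` dictionary is a hypothetical stand-in for the time stamps stored in the index, and treating files that are absent from it as needing a refresh is an assumption made for the example.

```python
import os


def files_needing_refresh(project_root, indexed_mtimes):
    """Yield relative paths whose on-disk mtime is newer than the recorded one.

    indexed_mtimes: dict mapping a path relative to project_root to the
    modification time recorded when the file was last indexed (hypothetical
    stand-in for the real index). Files missing from the dict are yielded too.
    """
    # The whole tree is walked on every run, regardless of how few files changed.
    for dirpath, _dirnames, filenames in os.walk(project_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, project_root)
            disk_mtime = os.stat(full).st_mtime
            recorded = indexed_mtimes.get(rel)
            if recorded is None or disk_mtime > recorded:
                yield rel
```

Under a scheme like this, a file whose content changed but whose modification time did not move forward would be skipped without any error, which is one reason checking the file's last modified time on disk is a useful first step. The full walk also shows why a one-file change can still take a long time on a very large tree.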

> Yes, I am aware of #3077. Thank you for sharing. Because of that issue, I assumed that reindexing was, or might still be, broken, and that problems like the ones I experienced would be expected in big repositories. #3077 made me switch from reindexing to always indexing from scratch.

#3077 is merely a performance enhancement. What is the structure of the repositories in your big project in terms of repository types?

> I have another question, because I tried different combinations without success and the documentation is not clear: what is the workflow to index a single project on its own? Is it supported? I mean, Git pulls the change for the foo repository and I then tell OpenGrok to index only the foo repository.

The indexing granularity is per project, i.e. it is not possible to index just one repository of a project.
There has been some discussion related to per project workflow in #3728 recently.

Author

jetm commented Nov 5, 2021

> It would be nice to get to the bottom of this, because this is the first time I have heard of such a problem. I mean a functional problem, not a performance one. For each file reported in the logs with DefaultIndexChangedListener.fileAdd (logged at the FINE level), there should be a corresponding xref/index refresh. Were the files for which the problem happened reported in the logs?

Sadly, I don't have the logs to show because I changed everything to build from scratch. I am making a new OpenGrok setup; I could switch it to reindexing, wait until the problem happens, look in the logs, and report back. It could take a while. Sorry.

> #3077 is merely a performance enhancement. What is the structure of the repositories in your big project in terms of repository types?

Let me try to reply with as much information as I am legally allowed to share.

OpenGrok runs in a dedicated Ubuntu 18.04 VM with 64 GB RAM; with less RAM than this, it fails with OOM. That gives an idea of how big the repositories are.

We have around six big Git repositories. Each one has around 20 GB of source code plus Git history. Each repo is treated as an OpenGrok project and is indexed so as to keep the Git history. The setup has a lot of OpenGrok indexing filters to reduce the indexing time and avoid Universal Ctags crashes.

It's an OpenGrok standalone setup, and it's not using Docker because the documentation says it should not be used for big repositories. It would need to be adjusted, but I don't want to experiment as this is critical for many devs.

Member

vladak commented Nov 5, 2021

Okay, this means that the changes for #3077 should help with lowering the indexing time in your environment.

vladak added the indexer label on Apr 25, 2022
oracle locked and limited conversation to collaborators on Jun 2, 2022
vladak converted this issue into discussion #3965 on Jun 2, 2022
