Search by History is not working for a Git repository when indexing with git 1.8 #2292
Comments
How slow is slow, exactly? Does the history search work when you let the indexer complete with just git 2.16?
Also, does this issue occur for all your Git repositories or just for some?
Hi, with git 2.16, OpenGrok takes 2 days to complete indexing 12 GB of code along with history. The issue occurs for all the Git repositories.
That's not too slow, I'd say. So, does the history search work if you let the indexer complete with just Git 2.16?
I will need to run an experiment for this. Our people are currently using it. If I am asked to add one more branch, indexing may take even more time with git 2.16.0. I am a little worried about running with git 2.16 alone.
I tried indexing with git 2.16 alone. It works fine. However, even the subsequent indexing runs take too long. I also added memory arguments when running the indexer, like below. Is there any way to improve indexing performance with git 2.16, apart from the increased memory arguments above?...
So how much time does an incremental reindex take? Do you reindex everything or do you perform per-project reindex? Is the history generation really the part that takes a long time? Check your indexer logs.
We added a new branch. All branches have multiple XML files. One git branch has around 5K XML, JS and CSS files. Indexing takes more time when it processes the XML files. Please let me know if you need any additional information.
How big are the updates to the repositories usually? It would be useful to see the history generation times from the indexer log (run the indexer with the -v option). I believe that in your case (at least for incremental reindex) the time goes to the I/O of the directory traversal when checking modified times rather than to git.
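To get a rough feel for how much the directory traversal alone costs, independent of OpenGrok and git, the walk-and-stat pattern can be timed directly. The following is a minimal sketch, not the indexer's actual code: the function name `time_traversal` is made up for illustration, and it only approximates the "check modified times" step by calling `os.stat` on every file.

```python
import os
import time


def time_traversal(root):
    """Walk root, stat every file entry, and report how long it took.

    Returns (file_count, newest_mtime, elapsed_seconds).
    Hypothetical helper: it mimics the I/O pattern of checking
    modification times, not OpenGrok's real traversal.
    """
    start = time.perf_counter()
    file_count = 0
    newest_mtime = 0.0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            file_count += 1
            newest_mtime = max(newest_mtime, st.st_mtime)
    elapsed = time.perf_counter() - start
    return file_count, newest_mtime, elapsed
```

Calling `time_traversal` on the source root of one project, once with a cold cache and once right afterwards, should show whether the file system (rather than git) dominates an incremental reindex.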
The total size of one branch is around 13 GB. One more observation: the extra time is spent only in one repository of the branch. The path looks like this: /branch19/speik/speik/*. Could the cause be that the project name and the repo name are the same, so the directory traversal descends into it? Thank you Vladak for all the support. Please let me know if you need any additional information.
One more question: if I use the opengrok-1.1-rc38 version, will the performance of this directory traversal during indexing improve? Just curious to know.
First, it would be useful to know where the time is spent. You can glean some information about where the indexer spends time by checking the logs for things like:
To have better insight into where the Indexer is spending time, the changes in #579 are needed.
In general (depending on indexer configuration) the indexer runs in 2 phases: first the history cache is generated, then individual files are indexed and xref files are created. In the first phase the history is basically read using the given SCM command (e.g. [...]
Currently it is not possible to tell how each project contributes to the second phase of indexing. I guess the calls to [...]
If you have a good I/O backend (good PCIe layout with SSDs for reading/writing data and/or an n-way mirror of disks for reading data) that cannot be easily saturated (meaning more in terms of IOPS than bandwidth) or that has very good caching mechanisms (think ZFS with ample RAM for ARC, e.g. on our production system ZFS caches occupy 70% of physical RAM), then you can try bumping the thread count using the [...]
There might be some areas where the Indexer could limit the number of syscalls (esp. [...]
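Since the first phase boils down to running an SCM command per repository, one way to see whether that phase is the slow one is to time the command by hand. This is only a sketch under assumptions: `time_command` is a hypothetical helper, and the exact options the indexer passes to git are not reproduced here, so a plain `git log` is just a stand-in.

```python
import subprocess
import time


def time_command(argv, cwd=None):
    """Run argv, discard its stdout, and measure the wall-clock time.

    Returns (returncode, stderr_text, elapsed_seconds). The stderr text
    is kept because a failing SCM command usually explains itself there.
    """
    start = time.perf_counter()
    result = subprocess.run(
        argv,
        cwd=cwd,
        stdout=subprocess.DEVNULL,  # history output can be large; do not buffer it
        stderr=subprocess.PIPE,
    )
    elapsed = time.perf_counter() - start
    return result.returncode, result.stderr.decode(errors="replace"), elapsed
```

For example, `time_command(["git", "log", "--name-only"], cwd="/path/to/repo")` (the path is a placeholder) gives a lower bound on how long reading the full history of one repository takes, and a non-zero return code points at a command the installed git rejects.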
Another thing to look at would be identifying weak points in the system using the USE methodology (http://www.brendangregg.com/usemethod.html), e.g. saturation/utilization of system resources during reindex.
Also, 1.1-rc20 has deferred Lucene operations that should make the indexer run faster (#1936).
Install 1.1-rc40 to get a better overview of where the time is spent. Anyhow, I think we should return to the original problem. I think the root of the problem is that when using git 1.8 the history is actually not generated at all. Could you check that after indexing from scratch with just git 1.8 the [...]
Could you actually share the indexer logs from a run with git 1.8 somewhere? (You can redact the file paths.)
My guess is that when running the indexer with git 1.8 no history cache is generated at all. This is because the history log executor uses the [...] Again, the indexer logs should tell.
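Because the behaviour differs between git 1.8 and git 2.16, it helps to record which git the indexing host actually resolves when comparing runs. The sketch below parses the output of `git --version`; the helper name `parse_git_version` is an assumption for illustration, and no particular minimum version is implied.

```python
import re


def parse_git_version(output):
    """Extract (major, minor, patch) from `git --version` output.

    Accepts strings such as "git version 2.16.0" or "git version 1.8.3.1";
    a missing patch component is treated as 0, extra components are ignored.
    """
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError("no version number found in: %r" % output)
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)
```

The input string can be obtained with `subprocess.check_output(["git", "--version"], text=True)`, and the resulting tuples compare naturally, so the value can be logged next to each indexer run.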
Hi Vladak, thanks for the response. Yes, with git 2.16 the second phase, in which all the source files are traversed, takes a long time. Yesterday we indexed one branch with git 2.16 alone. The time was spent writing the source files in the second phase of indexing. The history search now works as expected on that branch.
If the history cache is not generated, history will not be stored in the index and therefore cannot be searched. Basically you were bitten by #747 (and also by not reading the indexer logs properly): the whole indexing needs to be terminated if history cache generation fails (at least for SCMs capable of fetching history per directory). Enforcing a certain Git version would not work because of said issue and would also create an unnecessary maintenance burden. The performance enhancements were mentioned before: just try the most recent version and see for yourself. In the end it will depend on how powerful your hardware is. If you identify any weak spots (e.g. using the USE methodology), file a new issue with a detailed description. The traversal of all files always needs to happen. It does not depend on history cache generation and therefore does not depend on git or any SCM at all (unless it enters the behaviour described in the issue above). As for the Python scripts, unless you need to do per-project management and indexing, you can run the indexer with bare [...]
Hi Team,
We are using opengrok-1.1-rc18 and Java 1.8. The OS is Linux (version 3.10.0-693.21.1.el7.x86_64).
For indexing with history we first tried git 1.8. Indexing completes, but clicking the history link of any project or file gives Page not found. One thread advised using the latest git version, so we upgraded the git client to 2.16. History works, but indexing is very slow. We observed that indexing is very fast with git 1.8.3 but the history does not work; with git 2.16, indexing is very slow and the history is shown. So, as a trial, we indexed our entire code in 2 steps, with git 1.8 first and git 2.16 afterwards. The first step finishes very fast. In the second step, since the code is already indexed, the subsequent indexing also finishes very fast and the history appears.
It is working nicely.
But searching by History (git comment) gives no result, even though a git comment exists for the specific file.
Can you please help me with this? Please let me know if you need any additional information.