Indexer option to run historycache only #3434
Comments
The history cache update for Git is incremental, so the long historycache run has to be caused either by the sheer number of repositories and/or by the processing overhead. I wonder if replacing the executions of the …
Vlad, can you elaborate on "incremental", please? We have a huge GHE repo with over 50k commits. At the top level, the uncompressed OpenGrokDirHist.gz file is over 4 GB. Yesterday just one file was updated in the top-level dir and about 8 files changed in various subdirs. The history cache generation for just the top-level dir took 1 hour and 40 minutes, and this happens every night because they update the file every night. Incremental to me would be to run git log and stop when I hit the sha1 saved in the OpenGroklatestRev file.
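For illustration, the incremental fetch described above could look roughly like the sketch below; the saved-revision file path is a placeholder and this is not OpenGrok's actual implementation:

```sh
# Sketch of the incremental idea described above -- not OpenGrok's actual code.
# "lastRev" stands for the sha1 OpenGrok remembers from the previous run;
# the file location is a placeholder.
lastRev=$(cat /var/opengrok/data/latest-rev)
git log --name-only "${lastRev}..HEAD"
```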
When the history cache is being refreshed, the new history is appended to the existing history. In practice this means that for each …
I'd like to know what it is during the history cache refresh that takes so much time. PR #3438 provides logging (…)
@vladak Is this question for my runs? Because at the moment, no history indexing is done.
Yeah, I was wondering if there are any outliers in the history cache generation.
Ok. Will come back to this issue when the new one I have just opened is fixed on my side, since actually, even without history cache generation, I am not able to finish indexing my code.
Hi @vladak, I am coming back to this issue. The run is still ongoing and it is super long... Indexing seems stuck but it is not: all files are processed, just very slowly.
Some files take more than 1h40 to store their history cache. For each new file that is logged, this step takes at least ~10 minutes. With htop, I see that only the threads dealing with GC are around 100%. I don't know if that is normal. Could you help me? I have no idea how to improve the runtime.
If the GC is eating most of the CPU cycles, it is time to raise the heap size, I'd say, even though I'd expect … If you want to see what is happening with the heap during reindex, there is a way to do that via StatsD - see https://github.com/oracle/opengrok/wiki/Monitoring#indexer
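For reference, raising the indexer heap uses the standard JVM -Xmx option; the jar path and the 8g value below are only illustrative and need to be adapted to the deployment:

```sh
# Illustrative only: run the OpenGrok indexer with a larger maximum heap.
# The jar location and the 8g figure are assumptions, not a recommendation.
java -Xmx8g -jar /opt/opengrok/lib/opengrok.jar <indexer options...>
```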
What about the system health metrics? Free memory? Is the system swapping, perhaps? The heap size should be accounted for in the system memory (RAM + swap), esp. on systems that employ memory reservations.
No OOM exception.
Looking at the stack dump, the times of the GC threads are indeed "interesting", e.g.:
How many CPUs are in the VM? Maybe you can also try lowering the number of indexer+history threads. Also, you can get some basic heap statistics using …
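The exact tool referenced above was lost in this transcript; the standard JDK utilities below are one common way to get quick heap statistics from a running indexer (an assumption, not necessarily what was originally meant):

```sh
# Quick heap statistics from a running JVM; replace <pid> with the indexer's process id.
jcmd <pid> GC.heap_info            # current heap occupancy and sizes
jmap -histo:live <pid> | head -20  # live-object histogram, biggest classes first
```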
VM has 10 CPUs.
Good luck. We can recommend YourKit for heap analysis.
Thanks :-)
That's the complete history of the repository.
Ok, coming from a git log command, I guess. So, even if the history is very long, this file should be created quite quickly.
Yep. There is definitely some overhead associated with the processing of this file (I/O, XML encoding/decoding). I was contemplating whether to introduce a tunable to disable it.
When is it used? In the xref view at root level, with the History button?
Yes. Also when generating the history during reindex, I think.
Is this file supposed to have a size in the order of GB, for a repo with a huge history and a huge number of merge commits with long comments (diffs; because of the -m option which is used to generate it)? Is it possible to limit the number of commits you want to retrieve for any repo?
Why is -m used? I may have missed an option to get rid of renaming while getting history. Could you please remind me what it is?
Yes, for repositories with rich history. It is the complete history of the repository serialized as XML.
It might be; better to perform heap analysis first before drawing conclusions. However, the history still needs to be generated for the initial index.
These warnings come from Git I believe.
I'd rather not introduce a tunable that would set it as it could cause confusion.
#3243 tracks the general problem.
This is a Lucene-specific tunable. It might give index writing a boost, I believe.
Nope, this one.
If I understand correctly, this argument will provide the list of all files that were updated by a merge commit. In our case, the list can be long (hence the warning from git), so maybe it increases the amount of data to manage in memory for the XML generation.
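To illustrate the point above (a sketch, not the exact command OpenGrok runs): with -m, a merge commit is diffed against each parent, so every file it touched gets listed, which can be enormous for big merges.

```sh
# Sketch only -- not the exact invocation OpenGrok uses.
# With -m, the merge commit is shown per parent and lists every file it touched:
git log -m --name-only -1 <merge-sha>
# Without -m, git log shows no file list at all for the same merge commit:
git log --name-only -1 <merge-sha>
```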
The problems described in #3243 are indeed exactly the same as mine.
Hi @vladak
The first thing to look for is how much heap space is used and how much is eligible for garbage collection. Then try to find the biggest chunks of the used space and identify which objects they correspond to.
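One way to collect the data for that kind of analysis (an illustration, not the only workflow) is to take a heap dump of the running indexer with the standard JDK tools and open it in a heap analyzer such as MAT or YourKit:

```sh
# Dump the live heap of the indexer JVM for offline analysis; replace <pid>.
jmap -dump:live,format=b,file=/tmp/indexer-heap.hprof <pid>
# The resulting .hprof file can then be loaded into Eclipse MAT or YourKit.
```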
FYI.
Note that 1.6.5 has a tunable to disable Git merge commits. This should help with memory consumption.
Ok. Will test this on my STG environment. Will let you know.
Hi @vladak, many thanks for 1.6.5 and the new parameter to disable Git merge commits! I am now able to get the full history cache data generated :-) The MAT session showed that the biggest threads were processing the repos with huge history/merge commits/files. I am closing this issue since my original issue is fixed by 1.6.5. I still have some trouble (OOM killer) pushing the new configuration to the OpenGrok service at the end of the indexing job (sometimes it finishes, sometimes not). Will open a new issue if I cannot fix this myself. Again, many thanks for your help!
Is your feature request related to a problem? Please describe.
Let's say yes.
I am reindexing all of my company's Git repos (30k) with the latest OG 1.5.11 / Tomcat 9 / Java 11 / RHEL 8.3.
Historycache generation is extremely long...
Describe the solution you'd like
When reindexing, I would like to (see the sketch after this list):
1a. Run the indexer without historycache.
1b. Run another indexer process in parallel with historycache only, that is to say skipping file indexing, since that is done in a short time by the process of step 1a.
2. Each night, run the indexer without historycache.
3. When the historycache run started in step 1b finishes OK, then and only then, update the job run nightly in step 2 to add historycache too.
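A rough sketch of the requested workflow; the --no-history and --history-only flags below are placeholders for the options this issue asks for, not existing indexer options:

```sh
# Hypothetical workflow sketch -- the flags are placeholders, not real indexer options.
opengrok-indexer ... --no-history        # step 1a: index sources only, skip history cache
opengrok-indexer ... --history-only &    # step 1b: in parallel, build only the history cache
# step 2: the nightly job keeps running without history until step 1b completes;
# step 3: then the nightly job is switched to refresh the history cache as well.
```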
Describe alternatives you've considered
Optimizing the historycache generation? An issue to be opened depending on your answer to this one.
But in any case, because of the size of our code sources, the historycache process has always been very long.
Additional context
Some of our repos have a very long history, with many tags and also many files. This does not help to generate the history quickly.