Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexer option to run historycache only #3434

Closed
ChristopheBordieu opened this issue Feb 24, 2021 · 33 comments
Closed

Indexer option to run historycache only #3434

ChristopheBordieu opened this issue Feb 24, 2021 · 33 comments

Comments

@ChristopheBordieu
Copy link

Is your feature request related to a problem? Please describe.
Let's write yes.
I am reindexing all my company Git repos (30k) with latest OG 1.5.11 / Tomcat 9 / Java 11 / RHEL 8.3.
Historycache generation is extremely long...

Describe the solution you'd like
I would like when reindexing:
1a run indexer without historycache
1b run another indexer process in parallel with historycache only that's to say skipping file indexing since this is done in a short time by process of step 1a
2 each night run indexer without historycache
3 when historycache started in step 1b finishes ok, then and only then, update job run nightly in step 2 adding historycache too

Describe alternatives you've considered
Optimizing the historycache generation? Issue to be opened further to your answer to this one.
But in all case, because of the size of our code sources, historycache process has always been very long.

Additional context
Some of our repos have very long history, with many tags and also many files. This does not help generating quickly the history.

@vladak
Copy link
Member

vladak commented Feb 25, 2021

The history cache update for Git is incremental so the long historycache run has to be caused by either the sheer amount of repositories and/or the processing overhead. I wonder if replacing the executions of the git binary in the GitRepository class with pure Java implementation (say Eclipse JGit #1113) would help things (making it hopefully more I/O bound).

@louie0817
Copy link
Contributor

louie0817 commented Feb 26, 2021

The history cache update for Git is incremental so the long historycache run has to be caused by either the sheer amount of repositories and/or the processing overhead. I wonder if replacing the executions of the git binary in the GitRepository class with pure Java implementation (say Eclipse JGit) would help things (making it hopefully more I/O bound).

Vlad, can you elaborate on "incremental" please. We have a huge GHE repo that has over 50k commits. at the top level, the uncompressed OpenGrokDirHist.gz file is over 4 GB. yesterday just one file was updated in the top level dir and about 8 files change in various subdirs. The history cache generation for just the top level dir, took 1 hour and 40 minutes and this happens everynight as they update the file every night. Incremental to me would to be run git log and stop when I hit the sha1 saved in the OpenGroklatestRev file.

@vladak
Copy link
Member

vladak commented Feb 26, 2021

Vlad, can you elaborate on "incremental" please. We have a huge GHE repo that has over 50k commits. at the top level, the uncompressed OpenGrokDirHist.gz file is over 4 GB. yesterday just one file was updated in the top level dir and about 8 files change in various subdirs. The history cache generation for just the top level dir, took 1 hour and 40 minutes and this happens everynight as they update the file every night. Incremental to me would to be run git log and stop when I hit the sha1 saved in the OpenGroklatestRev file.

When the history cache is being refreshed, the new history is appended to the existing history. In practice this means that for each .gz file it needs to be decompressed, deserialized (from the XML form) into a History object, the new history is appended there in the form of HistoryEntry objects, then it is serialized into XML form and compressed again. This happens for each affected file hence for the top level history as well. What may be happening is that for the the big top level file that holds the complete history of the repository the (de)serialization/(de)compression (combined with the I/O) is taking too long. Maybe there should be a tunable to disable the top-level history. Or find another, more efficient way to represent it.

@vladak
Copy link
Member

vladak commented Feb 27, 2021

I'd like to know what is it during history cache refresh that takes so much time. PR #3438 provides logging (FINER log level) of the refresh times for individual files and the top level directory.

@ChristopheBordieu
Copy link
Author

@vladak Is this question for my runs? Because at the moment, no history indexing is done.

@vladak
Copy link
Member

vladak commented Mar 2, 2021

Yeah, I was wondering if there are any outliers in the history cache generation.

@ChristopheBordieu
Copy link
Author

Ok. Will come back to this issue when the new one I have just opened is fixed on my side. Since actually, even without history cache generation, I am not able to finish to index my code.

@ChristopheBordieu
Copy link
Author

ChristopheBordieu commented Mar 23, 2021

Hi @vladak

I am coming back to this issue.
I have started a new run (OG 1.6.3) allowing history (-P -S -G -H, -m 256 -Xmx20g -server).

Run is still ongoing. It is super long... Indexing seems stuck but it is not: all files are processed but very slowly.
I have logs like:

...
2021-03-23 10:01:20.028+0000 FINER t243645 Statistics.logIt: Done storing history cache for /PROJ/gcc/.../file1.c (took 0:52:00)
2021-03-23 10:22:44.594+0000 FINER t243645 Statistics.logIt: Done storing history cache for /PROJ/.../file2.c (took 0:20:43)
2021-03-23 10:36:29.517+0000 FINER t243645 Statistics.logIt: Done storing history cache for /PROJ/.../file3.adb (took 0:13:13)
... 

Some files are storing history cache in more than 1h40. For each new file that is logged, duration for this step is at least ~10 minnutes.

With htop, I see only threads dealing with GC that are around 100%. Don't know if it is normal.
I attach the threads dump here.

Could you help me?
Are my indexer options badly set? My Xmx too big? Too small? Is it an issue because of the size of history for some repos? Number of Git tags?

I have no idea how to improve runtime.

threads.dump.txt.gz

@vladak
Copy link
Member

vladak commented Mar 23, 2021

If the GC is eating most of the CPUs cycles, it is time to raise the heap size I'd say even though I'd expect OutOfMemoryException to be thrown. Does it take minutes to store history for all files or just for some ? How big is the history size for these files ?

If you want to see what is happening with the heap during reindex, there is a way how to do that via StatsD - see https://github.com/oracle/opengrok/wiki/Monitoring#indexer

@vladak
Copy link
Member

vladak commented Mar 23, 2021

What about the system health metrics ? Free memory ? Is the system swapping perhaps ? The heap size should be accounted in the system memory (RAM + swap), esp. on systems that employ memory reservations.

@ChristopheBordieu
Copy link
Author

No OOM exception.
All files take minutes now to store history. At the very beginning, it was quick.
The history files, once created, are not big. Example: text file with size ~500k took ~1h to write history file .gz with size ~4k.
Swap disabled on the VM. Total 32G. Free ~200M. Used 22G. Buffer/cache ~8G. Available ~8G.

@vladak
Copy link
Member

vladak commented Mar 23, 2021

Looking at the stack dump, the times of the GC threads are indeed "interesting", e.g.:

"GC Thread#4" os_prio=0 cpu=53736794.31ms elapsed=61344.07s tid=0x00007f7b04005000 nid=0x15358b runnable  

"GC Thread#5" os_prio=0 cpu=53767817.04ms elapsed=61344.07s tid=0x00007f7b04006800 nid=0x15358c runnable  

How many CPUs are in the VM ? Maybe you can also try lowering the number of indexer+history threads.

Also, you can get some basic heap statistics using jmap.

@ChristopheBordieu
Copy link
Author

VM has 10 CPUs.
I generated a heap dump (33G!). A colleague will try to look at it.
All long times are coming from 3 repos: I deleted them and will see if it helps.

@vladak
Copy link
Member

vladak commented Mar 23, 2021

Good luck. We can recommend YourKit for heap analysis.

@ChristopheBordieu
Copy link
Author

Thanks :-)
What is the file in each repo at root level OpenGrokDirHist.gz?
Is it an aggregation of all the history of all the files?

@vladak
Copy link
Member

vladak commented Mar 23, 2021

That's complete history of the repository.

@ChristopheBordieu
Copy link
Author

Ok. Coming from a git log command I guess. So, even if history is very long, this file should be created quite quickly.

@vladak
Copy link
Member

vladak commented Mar 23, 2021

Yep. There is definitely some overhead associated with the processing of this file (I/O, XML encoding/decoding). I was contemplating whether to introduce a tunable to disable it.

@ChristopheBordieu
Copy link
Author

When is it used? When in xref part at root level with button History?

@vladak
Copy link
Member

vladak commented Mar 23, 2021

Yes. Also when generating the history during reindex I think.

@ChristopheBordieu
Copy link
Author

ChristopheBordieu commented Mar 23, 2021

Is this file supposed to have size in the order of GB? For a repo with huge history and huge number of merge commits with long comment (diff; because of the -m option which is used to generate it).
I think this is the reason why heap memory is consumed and so GC threads always running slowing down the indexing.
History is so huge that it ends when I run the exact same comand as OG with warnings:
warning: inexact rename detection was skipped due to too many files.
warning: you may want to set your diff.renameLimit variable to at least 15568 and retry the command.

Is it possible you limit the number of commits you want to retrieve for any repo?
I am interested by a tunable to disable or limit this file.

@ChristopheBordieu
Copy link
Author

ChristopheBordieu commented Mar 23, 2021

Why is the -m used?

I have maybe missed an option to get rid of renaming while getting history. Could you please remind me what it is?
--renamedHistory on|off
Enable or disable generating history for renamed files.
If set to on, makes history indexing slower for repositories
with lots of renamed files. Default is off.
So it is ok. Renaming should not be done. So why the -m?

@vladak
Copy link
Member

vladak commented Mar 23, 2021

Is this file supposed to have size in the order of GB? For a repo with huge history and huge number of merge commits with long comment (diff; because of the -m option which is used to generate it).

Yes, for repositories with rich history. It is the complete history of the repository serialized as XML.

I think this is the reason why heap memory is consumed and so GC threads always running slowing down the indexing.

It might be, better to perform heap analysis first before drawing conclusions. However the history still needs to be generated for the initial index.

History is so huge that it ends when I run the exact same comand as OG with warnings:
warning: inexact rename detection was skipped due to too many files.
warning: you may want to set your diff.renameLimit variable to at least 15568 and retry the command.

These warnings come from Git I believe.

Is it possible you limit the number of commits you want to retrieve for any repo?

I'd rather not introduce a tunable that would set it as it could cause confusion.

I am interested by a tunable to disable or limit this file.

#3243 tracks the general problem.

@vladak
Copy link
Member

vladak commented Mar 23, 2021

Why is the -m used?

This is Lucene specific tunable. It might give index writing a boost I believe.

@ChristopheBordieu
Copy link
Author

This is Lucene specific tunable. It might give index writing a boost I believe.

Nop this one.
The -m which is used when indexer runs git log command for the repository:

2021-03-23 17:19:59.280+0000 FINE t243626 Executor.exec: Executing command [/usr/bin/git log --abbrev-commit --abbrev=8 --name-only --pretty=fuller --date=iso8601-strict -m -- /srv/opengrok-src/PROJ/repo]

If I understand well, this argument will provide the list of all files that were updated by a merge commit. In our case, the list can be long (and so the warning of git) and so maybe it increases the size of data to manage into memory for the XML generation.

@vladak
Copy link
Member

vladak commented Mar 24, 2021

Yes, the git -m option causes all kinds of problems for repositories with large history - #3243 as well as #3367. In the short term I plan to introduce a tunable to disable it.

@ChristopheBordieu
Copy link
Author

Problems described in #3243 are indeed exactly the same as mine.

@ChristopheBordieu
Copy link
Author

Hi @vladak
I have a MAT session ongoing. What would you like I look for?

@vladak
Copy link
Member

vladak commented Mar 25, 2021

First thing to look for is to find how much heap space is used and how much is eligible for garbage collection. Then try to find the biggest chunks of the used space, identify to which objects they correspond.

@ChristopheBordieu
Copy link
Author

ChristopheBordieu commented Mar 25, 2021

FYI.
In //, I have removed all huge repos (2 with their forks): history has been generated and indexing is close to finish.
My issue is these repos I have removed have to be indexed. So, very likely, I will come back to 1.3.16 for my final runs.

@vladak
Copy link
Member

vladak commented Mar 25, 2021

Note that 1.6.5 has a tunable to disable Git merge commits. This should help with memory consumption.

@ChristopheBordieu
Copy link
Author

Ok. Will test this on my STG environment. Will let you know.

@ChristopheBordieu
Copy link
Author

ChristopheBordieu commented Mar 29, 2021

Hi @vladak

Many thanks for 1.6.5 and new parameter to disable Git merge commits! I am now able to get full history cache data generated :-) The MAT session showed biggest threads were processing these repos with huge history/merge commits/files.

I close this issue since my original issue is fixed by 1.6.5.

I have still some troubles (OOM killer) pushing the new configuration to OpenGrok service (sometimes it finishes, sometime not) at the end of indexing job. Will open a new issue if I cannot fix this myself.

Again, many thanks for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants