Indexer option to run historycache only #3434

ChristopheBordieu · 2021-02-24T21:43:53Z

Is your feature request related to a problem? Please describe.
Let's write yes.
I am reindexing all my company Git repos (30k) with latest OG 1.5.11 / Tomcat 9 / Java 11 / RHEL 8.3.
Historycache generation is extremely long...

Describe the solution you'd like
I would like when reindexing:
1a run indexer without historycache
1b run another indexer process in parallel with historycache only that's to say skipping file indexing since this is done in a short time by process of step 1a
2 each night run indexer without historycache
3 when historycache started in step 1b finishes ok, then and only then, update job run nightly in step 2 adding historycache too

Describe alternatives you've considered
Optimizing the historycache generation? Issue to be opened further to your answer to this one.
But in all case, because of the size of our code sources, historycache process has always been very long.

Additional context
Some of our repos have very long history, with many tags and also many files. This does not help generating quickly the history.

vladak · 2021-02-25T08:36:37Z

The history cache update for Git is incremental so the long historycache run has to be caused by either the sheer amount of repositories and/or the processing overhead. I wonder if replacing the executions of the git binary in the GitRepository class with pure Java implementation (say Eclipse JGit #1113) would help things (making it hopefully more I/O bound).

louie0817 · 2021-02-26T01:28:38Z

The history cache update for Git is incremental so the long historycache run has to be caused by either the sheer amount of repositories and/or the processing overhead. I wonder if replacing the executions of the git binary in the GitRepository class with pure Java implementation (say Eclipse JGit) would help things (making it hopefully more I/O bound).

Vlad, can you elaborate on "incremental" please. We have a huge GHE repo that has over 50k commits. at the top level, the uncompressed OpenGrokDirHist.gz file is over 4 GB. yesterday just one file was updated in the top level dir and about 8 files change in various subdirs. The history cache generation for just the top level dir, took 1 hour and 40 minutes and this happens everynight as they update the file every night. Incremental to me would to be run git log and stop when I hit the sha1 saved in the OpenGroklatestRev file.

vladak · 2021-02-26T09:04:41Z

Vlad, can you elaborate on "incremental" please. We have a huge GHE repo that has over 50k commits. at the top level, the uncompressed OpenGrokDirHist.gz file is over 4 GB. yesterday just one file was updated in the top level dir and about 8 files change in various subdirs. The history cache generation for just the top level dir, took 1 hour and 40 minutes and this happens everynight as they update the file every night. Incremental to me would to be run git log and stop when I hit the sha1 saved in the OpenGroklatestRev file.

When the history cache is being refreshed, the new history is appended to the existing history. In practice this means that for each .gz file it needs to be decompressed, deserialized (from the XML form) into a History object, the new history is appended there in the form of HistoryEntry objects, then it is serialized into XML form and compressed again. This happens for each affected file hence for the top level history as well. What may be happening is that for the the big top level file that holds the complete history of the repository the (de)serialization/(de)compression (combined with the I/O) is taking too long. Maybe there should be a tunable to disable the top-level history. Or find another, more efficient way to represent it.

vladak · 2021-02-27T12:05:57Z

I'd like to know what is it during history cache refresh that takes so much time. PR #3438 provides logging (FINER log level) of the refresh times for individual files and the top level directory.

ChristopheBordieu · 2021-03-02T13:18:21Z

@vladak Is this question for my runs? Because at the moment, no history indexing is done.

vladak · 2021-03-02T13:20:18Z

Yeah, I was wondering if there are any outliers in the history cache generation.

ChristopheBordieu · 2021-03-02T14:15:13Z

Ok. Will come back to this issue when the new one I have just opened is fixed on my side. Since actually, even without history cache generation, I am not able to finish to index my code.

ChristopheBordieu · 2021-03-23T11:38:15Z

Hi @vladak

I am coming back to this issue.
I have started a new run (OG 1.6.3) allowing history (-P -S -G -H, -m 256 -Xmx20g -server).

Run is still ongoing. It is super long... Indexing seems stuck but it is not: all files are processed but very slowly.
I have logs like:

...
2021-03-23 10:01:20.028+0000 FINER t243645 Statistics.logIt: Done storing history cache for /PROJ/gcc/.../file1.c (took 0:52:00)
2021-03-23 10:22:44.594+0000 FINER t243645 Statistics.logIt: Done storing history cache for /PROJ/.../file2.c (took 0:20:43)
2021-03-23 10:36:29.517+0000 FINER t243645 Statistics.logIt: Done storing history cache for /PROJ/.../file3.adb (took 0:13:13)
...

Some files are storing history cache in more than 1h40. For each new file that is logged, duration for this step is at least ~10 minnutes.

With htop, I see only threads dealing with GC that are around 100%. Don't know if it is normal.
I attach the threads dump here.

Could you help me?
Are my indexer options badly set? My Xmx too big? Too small? Is it an issue because of the size of history for some repos? Number of Git tags?

I have no idea how to improve runtime.

threads.dump.txt.gz

vladak · 2021-03-23T14:07:02Z

If the GC is eating most of the CPUs cycles, it is time to raise the heap size I'd say even though I'd expect OutOfMemoryException to be thrown. Does it take minutes to store history for all files or just for some ? How big is the history size for these files ?

If you want to see what is happening with the heap during reindex, there is a way how to do that via StatsD - see https://github.com/oracle/opengrok/wiki/Monitoring#indexer

vladak · 2021-03-23T14:10:32Z

What about the system health metrics ? Free memory ? Is the system swapping perhaps ? The heap size should be accounted in the system memory (RAM + swap), esp. on systems that employ memory reservations.

ChristopheBordieu · 2021-03-23T14:43:15Z

No OOM exception.
All files take minutes now to store history. At the very beginning, it was quick.
The history files, once created, are not big. Example: text file with size ~500k took ~1h to write history file .gz with size ~4k.
Swap disabled on the VM. Total 32G. Free ~200M. Used 22G. Buffer/cache ~8G. Available ~8G.

vladak · 2021-03-23T15:05:05Z

Looking at the stack dump, the times of the GC threads are indeed "interesting", e.g.:

"GC Thread#4" os_prio=0 cpu=53736794.31ms elapsed=61344.07s tid=0x00007f7b04005000 nid=0x15358b runnable  

"GC Thread#5" os_prio=0 cpu=53767817.04ms elapsed=61344.07s tid=0x00007f7b04006800 nid=0x15358c runnable

How many CPUs are in the VM ? Maybe you can also try lowering the number of indexer+history threads.

Also, you can get some basic heap statistics using jmap.

ChristopheBordieu · 2021-03-23T15:17:56Z

VM has 10 CPUs.
I generated a heap dump (33G!). A colleague will try to look at it.
All long times are coming from 3 repos: I deleted them and will see if it helps.

vladak · 2021-03-23T15:25:58Z

Good luck. We can recommend YourKit for heap analysis.

ChristopheBordieu · 2021-03-23T15:29:01Z

Thanks :-)
What is the file in each repo at root level OpenGrokDirHist.gz?
Is it an aggregation of all the history of all the files?

vladak · 2021-03-23T15:30:44Z

That's complete history of the repository.

ChristopheBordieu · 2021-03-23T15:37:27Z

Ok. Coming from a git log command I guess. So, even if history is very long, this file should be created quite quickly.

vladak · 2021-03-23T15:44:59Z

Yep. There is definitely some overhead associated with the processing of this file (I/O, XML encoding/decoding). I was contemplating whether to introduce a tunable to disable it.

ChristopheBordieu · 2021-03-23T15:50:43Z

When is it used? When in xref part at root level with button History?

vladak · 2021-03-23T15:53:07Z

Yes. Also when generating the history during reindex I think.

ChristopheBordieu · 2021-03-23T17:24:04Z

Is this file supposed to have size in the order of GB? For a repo with huge history and huge number of merge commits with long comment (diff; because of the -m option which is used to generate it).
I think this is the reason why heap memory is consumed and so GC threads always running slowing down the indexing.
History is so huge that it ends when I run the exact same comand as OG with warnings:
warning: inexact rename detection was skipped due to too many files.
warning: you may want to set your diff.renameLimit variable to at least 15568 and retry the command.

Is it possible you limit the number of commits you want to retrieve for any repo?
I am interested by a tunable to disable or limit this file.

ChristopheBordieu · 2021-03-23T17:41:47Z

Why is the -m used?

I have maybe missed an option to get rid of renaming while getting history. Could you please remind me what it is?
--renamedHistory on|off
Enable or disable generating history for renamed files.
If set to on, makes history indexing slower for repositories
with lots of renamed files. Default is off.
So it is ok. Renaming should not be done. So why the -m?

vladak · 2021-03-23T19:28:17Z

Is this file supposed to have size in the order of GB? For a repo with huge history and huge number of merge commits with long comment (diff; because of the -m option which is used to generate it).

Yes, for repositories with rich history. It is the complete history of the repository serialized as XML.

I think this is the reason why heap memory is consumed and so GC threads always running slowing down the indexing.

It might be, better to perform heap analysis first before drawing conclusions. However the history still needs to be generated for the initial index.

History is so huge that it ends when I run the exact same comand as OG with warnings:
warning: inexact rename detection was skipped due to too many files.
warning: you may want to set your diff.renameLimit variable to at least 15568 and retry the command.

These warnings come from Git I believe.

Is it possible you limit the number of commits you want to retrieve for any repo?

I'd rather not introduce a tunable that would set it as it could cause confusion.

I am interested by a tunable to disable or limit this file.

#3243 tracks the general problem.

vladak · 2021-03-23T19:29:38Z

Why is the -m used?

This is Lucene specific tunable. It might give index writing a boost I believe.

ChristopheBordieu · 2021-03-23T19:54:06Z

This is Lucene specific tunable. It might give index writing a boost I believe.

Nop this one.
The -m which is used when indexer runs git log command for the repository:

2021-03-23 17:19:59.280+0000 FINE t243626 Executor.exec: Executing command [/usr/bin/git log --abbrev-commit --abbrev=8 --name-only --pretty=fuller --date=iso8601-strict -m -- /srv/opengrok-src/PROJ/repo]

If I understand well, this argument will provide the list of all files that were updated by a merge commit. In our case, the list can be long (and so the warning of git) and so maybe it increases the size of data to manage into memory for the XML generation.

vladak · 2021-03-24T07:18:54Z

Yes, the git -m option causes all kinds of problems for repositories with large history - #3243 as well as #3367. In the short term I plan to introduce a tunable to disable it.

ChristopheBordieu · 2021-03-24T08:25:58Z

Problems described in #3243 are indeed exactly the same as mine.

ChristopheBordieu · 2021-03-25T11:06:19Z

Hi @vladak
I have a MAT session ongoing. What would you like I look for?

vladak · 2021-03-25T12:53:34Z

First thing to look for is to find how much heap space is used and how much is eligible for garbage collection. Then try to find the biggest chunks of the used space, identify to which objects they correspond.

ChristopheBordieu · 2021-03-25T13:41:44Z

FYI.
In //, I have removed all huge repos (2 with their forks): history has been generated and indexing is close to finish.
My issue is these repos I have removed have to be indexed. So, very likely, I will come back to 1.3.16 for my final runs.

vladak · 2021-03-25T13:47:33Z

Note that 1.6.5 has a tunable to disable Git merge commits. This should help with memory consumption.

ChristopheBordieu · 2021-03-25T14:03:23Z

Ok. Will test this on my STG environment. Will let you know.

ChristopheBordieu · 2021-03-29T10:36:04Z

Hi @vladak

Many thanks for 1.6.5 and new parameter to disable Git merge commits! I am now able to get full history cache data generated :-) The MAT session showed biggest threads were processing these repos with huge history/merge commits/files.

I close this issue since my original issue is fixed by 1.6.5.

I have still some troubles (OOM killer) pushing the new configuration to OpenGrok service (sometimes it finishes, sometime not) at the end of indexing job. Will open a new issue if I cannot fix this myself.

Again, many thanks for your help!

ChristopheBordieu added the enhancement label Feb 24, 2021

vladak added the indexer label Feb 25, 2021

vladak mentioned this issue Feb 27, 2021

report duration for file history cache updates #3438

Merged

ChristopheBordieu closed this as completed Mar 29, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Indexer option to run historycache only #3434

Indexer option to run historycache only #3434

ChristopheBordieu commented Feb 24, 2021

vladak commented Feb 25, 2021 •

edited

Loading

louie0817 commented Feb 26, 2021 •

edited

Loading

vladak commented Feb 26, 2021 •

edited

Loading

vladak commented Feb 27, 2021

ChristopheBordieu commented Mar 2, 2021

vladak commented Mar 2, 2021

ChristopheBordieu commented Mar 2, 2021

ChristopheBordieu commented Mar 23, 2021 •

edited

Loading

vladak commented Mar 23, 2021

vladak commented Mar 23, 2021

ChristopheBordieu commented Mar 23, 2021

vladak commented Mar 23, 2021

ChristopheBordieu commented Mar 23, 2021

vladak commented Mar 23, 2021

ChristopheBordieu commented Mar 23, 2021

vladak commented Mar 23, 2021

ChristopheBordieu commented Mar 23, 2021

vladak commented Mar 23, 2021

ChristopheBordieu commented Mar 23, 2021

vladak commented Mar 23, 2021

ChristopheBordieu commented Mar 23, 2021 •

edited

Loading

ChristopheBordieu commented Mar 23, 2021 •

edited

Loading

vladak commented Mar 23, 2021

vladak commented Mar 23, 2021

ChristopheBordieu commented Mar 23, 2021

vladak commented Mar 24, 2021

ChristopheBordieu commented Mar 24, 2021

ChristopheBordieu commented Mar 25, 2021

vladak commented Mar 25, 2021

ChristopheBordieu commented Mar 25, 2021 •

edited

Loading

vladak commented Mar 25, 2021

ChristopheBordieu commented Mar 25, 2021

ChristopheBordieu commented Mar 29, 2021 •

edited

Loading

Indexer option to run historycache only #3434

Indexer option to run historycache only #3434

Comments

ChristopheBordieu commented Feb 24, 2021

vladak commented Feb 25, 2021 • edited Loading

louie0817 commented Feb 26, 2021 • edited Loading

vladak commented Feb 26, 2021 • edited Loading

vladak commented Feb 27, 2021

ChristopheBordieu commented Mar 2, 2021

vladak commented Mar 2, 2021

ChristopheBordieu commented Mar 2, 2021

ChristopheBordieu commented Mar 23, 2021 • edited Loading

vladak commented Mar 23, 2021

vladak commented Mar 23, 2021

ChristopheBordieu commented Mar 23, 2021

vladak commented Mar 23, 2021

ChristopheBordieu commented Mar 23, 2021

vladak commented Mar 23, 2021

ChristopheBordieu commented Mar 23, 2021

vladak commented Mar 23, 2021

ChristopheBordieu commented Mar 23, 2021

vladak commented Mar 23, 2021

ChristopheBordieu commented Mar 23, 2021

vladak commented Mar 23, 2021

ChristopheBordieu commented Mar 23, 2021 • edited Loading

ChristopheBordieu commented Mar 23, 2021 • edited Loading

vladak commented Mar 23, 2021

vladak commented Mar 23, 2021

ChristopheBordieu commented Mar 23, 2021

vladak commented Mar 24, 2021

ChristopheBordieu commented Mar 24, 2021

ChristopheBordieu commented Mar 25, 2021

vladak commented Mar 25, 2021

ChristopheBordieu commented Mar 25, 2021 • edited Loading

vladak commented Mar 25, 2021

ChristopheBordieu commented Mar 25, 2021

ChristopheBordieu commented Mar 29, 2021 • edited Loading

vladak commented Feb 25, 2021 •

edited

Loading

louie0817 commented Feb 26, 2021 •

edited

Loading

vladak commented Feb 26, 2021 •

edited

Loading

ChristopheBordieu commented Mar 23, 2021 •

edited

Loading

ChristopheBordieu commented Mar 23, 2021 •

edited

Loading

ChristopheBordieu commented Mar 23, 2021 •

edited

Loading

ChristopheBordieu commented Mar 25, 2021 •

edited

Loading

ChristopheBordieu commented Mar 29, 2021 •

edited

Loading