
Search by History is not working for a Git repository when indexing with git 1.8 #2292

Closed
naveen0256 opened this issue Aug 13, 2018 · 19 comments

@naveen0256

Hi Team,

We are using opengrok-1.1-rc18 and java 1.8. OS is Linux (version 3.10.0-693.21.1.el7.x86_64).

For indexing with history we first tried git 1.8. Indexing works, but clicking the history link of any project or file gives "Page not found". In another thread it was advised to use the latest git, so we upgraded the git client to 2.16. With git 2.16 the history works, but indexing is very slow; with git 1.8.3 indexing is very fast, but the history does not work. So as an experiment we indexed our entire code in two steps: first with git 1.8 and then with git 2.16. The first step finished very quickly, and in the second step, since the code was already indexed, the reindex was also fast and the history now appears.

That part works nicely.

However, searching by History (git commit message) returns no results, even though a commit message exists for the specific file.

Can you please help me with this? Please let me know if you need any additional information.

@vladak
Member

vladak commented Aug 15, 2018

How slow is slow, exactly?

Does the history search work when you let the indexer complete with just git 2.16?

@vladak vladak changed the title Search by History is not working Search by History is not working for a Git repository Aug 15, 2018
@vladak
Member

vladak commented Aug 15, 2018

Also, does this issue occur for all your Git repositories or just for some?

@naveen0256
Author

Hi,

With git 2.16, OpenGrok takes about 2 days to complete indexing of roughly 12 GB of code along with history.

The issue occurs for all of our Git repositories.

@vladak
Member

vladak commented Aug 28, 2018

That's not too slow, I'd say. So, does the history search work if you let the indexer complete with just Git 2.16?

@naveen0256
Author

I will need to experiment with this. The instance is currently in use by our people. If I am asked to add one more branch, indexing may take much longer with git 2.16.0, so I am a little worried about running with git 2.16 alone.

@naveen0256
Author

naveen0256 commented Sep 3, 2018

I tried indexing with git 2.16 alone. History works fine, but even the subsequent (incremental) indexing takes a long time. I also added memory arguments while indexing, like below:
OPENGROK_FLUSH_RAM_BUFFER_SIZE="-m 1024" JAVA_OPTS="$JAVA_OPTS -Xmx8192m -d64 -server" bin/OpenGrok index branches

Is there any way to improve indexing performance with git 2.16, apart from the increased memory settings above?

@vladak
Member

vladak commented Sep 3, 2018

So how much time does an incremental reindex take? Do you reindex everything, or do you perform a per-project reindex? Is the history generation really the part that takes a long time? Check your indexer logs.

@naveen0256
Author

naveen0256 commented Sep 3, 2018

We added a new branch. Every branch contains many XML files; one Git branch has around 5K XML, JS and CSS files, and indexing the XML files is what takes the most time.
An incremental reindex of all the code, when there are updates, takes about 30 minutes.
When adding a new branch, it takes around one second per XML file.
We don't perform per-project reindex; we reindex everything.
Below is an excerpt of the indexer log. As you can see, indexing the alert_style.css file took about 2 seconds:
2018-09-03 06:22:16.355-0700 INFO t5625 DefaultIndexChangedListener.fileAdd: Add: /branch19/alerts/ui.html/alert_style.css (PlainAnalyzer)
2018-09-03 06:22:18.382-0700 INFO t5625 DefaultIndexChangedListener.fileAdd: Add: /branch19/alerts/ui.html/scripts/tinyAlert.js (JavaScriptAnalyzer).

Please let me know for any additional information.

@vladak
Member

vladak commented Sep 3, 2018

How big are the updates to the repositories usually? It would be useful to see the history generation times in the indexer log (run the indexer with the -v option). I believe that in your case (at least for the incremental reindex) the bottleneck is the I/O of the directory traversal when checking modified times, rather than git itself.
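For example, once a verbose run has completed, the per-repository history generation times can be pulled out of the indexer log with something like this (the log file path is a placeholder; use your actual indexer log):

grep 'Done historycache' /opengrok/log/opengrok0.0.log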

@naveen0256
Author

naveen0256 commented Sep 4, 2018

The total size of one branch is around 13 GB.

"the bottleneck is the I/O of the directory traversal when checking modified times, rather than git itself"

Yes, it may well be; I believe the same. I would like to understand why it is this slow. Is the I/O of the directory traversal really the reason? If so, is there any way we can improve indexing performance for that directory traversal?

One more observation: the extra time is spent only in one repository of the branch, whose path looks like /branch19/speik/speik/*. Could that be the reason it takes longer, i.e. the project name and repository name being the same and the traversal descending into it? I wanted to share the complete observation.

Thank you Vladak for all the support. Please let me know if you need any additional information.

@naveen0256
Author

One more question: if I use opengrok-1.1-rc38, will the performance of this directory traversal be improved? Just curious to know.

@vladak
Member

vladak commented Sep 4, 2018

First, it would be useful to know where the time is spent. You can glean some information about where the indexer spends its time by checking the logs for entries like:

2018-09-04 14:43:28.174+0200 INFO t1 Indexer.prepareIndexer: Scanning for repositories...
...
2018-09-04 14:43:33.241+0200 INFO t1 Indexer.prepareIndexer: Done scanning for repositories (5s)
2018-09-04 14:43:33.242+0200 INFO t1 Indexer.prepareIndexer: Generating history cache for all repositories ...
2018-09-04 14:43:33.267+0200 INFO t1 HistoryGuru.createCacheReal: Creating historycache for 17 repositories
2018-09-04 14:43:33.609+0200 INFO t130 HistoryGuru.createCache: Creating historycache for /var/opengrok/src/OpenGrok (GitRepository) without renamed file handling
...
2018-09-04 14:43:33.640+0200 INFO t130 Statistics.report: Done historycache for /var/opengrok/src/OpenGrok (took 31 ms)
...
2018-09-04 14:43:33.721+0200 INFO t1 Statistics.report: Done historycache for all repositories (took 479 ms)
...
2018-09-04 14:43:33.723+0200 INFO t1 Indexer.doIndexerExecution: Starting indexing
...
2018-09-04 14:43:42.264+0200 INFO t1 Statistics.report: Done indexing data of all repositories (took 8.541 seconds)

To get better insight into where the indexer spends its time, the changes in #579 are needed.

In general (depending on indexer configuration) the indexer runs in two phases: first the history cache is generated, then individual files are indexed and xref files are created. In the first phase the history is read using the given SCM command (e.g. git log), parsed, and stored to disk. This is both I/O and CPU intensive. In the second phase all the source files are traversed and their modified time stamps are compared to what is in the index. If the files are newer, they are indexed afresh. This phase is also I/O and CPU intensive. It became much more efficient after better parallelism was introduced. One of those changes landed in changeset b92758c after 1.1-rc18, so upgrading might achieve better utilization (caveat: this may stress other parts of the system more than before, so an overall improvement is not guaranteed).

Currently it is not possible to tell how each project contributes to the second phase of indexing. I guess the calls to indexDown() and indexParallel() in IndexDatabase#update() can be wrapped inside Statistics to be able to see how much time they take.

If you have a good I/O backend (a good PCIe layout with SSDs for reading/writing data and/or an n-way mirror of disks for reading data) that cannot be easily saturated (in terms of IOPS rather than bandwidth), or one with very good caching (think ZFS with ample RAM for the ARC; e.g. on our production system the ZFS caches occupy 70% of physical RAM), then you can try bumping the thread count using the --threads indexer option and see what happens.
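A minimal sketch of such an invocation, calling the indexer jar directly (the jar location, heap size, thread count and paths are placeholders, not values taken from this issue):

java -Xmx8192m -jar lib/opengrok.jar --threads 16 -s /opengrok/src -d /opengrok/data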

There might be some areas where the Indexer could limit the number of syscalls (esp. stat) it makes. This needs to be thoroughly investigated.

vladak pushed a commit to vladak/OpenGrok that referenced this issue Sep 4, 2018
@vladak
Member

vladak commented Sep 4, 2018

Another thing to look at would be to identify weak points in the system using the USE methodology (http://www.brendangregg.com/usemethod.html), e.g. saturation/utilization of system resources during reindex.
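For example, on Linux a first pass during a reindex could be as simple as watching utilization and saturation with the usual tools (the intervals are arbitrary):

iostat -x 5   # per-device utilization and queue lengths
vmstat 5      # CPU, run queue, memory pressure, swapping
pidstat -d 5  # per-process I/O, to confirm the indexer JVM is the main consumer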

@vladak
Member

vladak commented Sep 4, 2018

Also, 1.1-rc20 has deferred Lucene operations that should make indexer run faster (#1936).

vladak pushed a commit that referenced this issue Sep 5, 2018
@vladak
Member

vladak commented Sep 5, 2018

Install 1.1-rc40 to get a better overview of where the time is spent.

Anyhow, I think we should return to the original problem. I believe the root of it is that with git 1.8 the history is actually not generated at all. Could you check whether, after indexing from scratch with just git 1.8, the historycache directory under the data root contains any entries? And what is the content of some sample entries in that case?
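For instance, assuming the default layout (the data root path below is a placeholder), something like this would show whether any entries exist and what one of them looks like; in this version the entries should be gzipped XML files mirroring the source tree:

find /opengrok/data/historycache -type f | head
zcat /opengrok/data/historycache/some/file.java.gz | head   # pick any entry found above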

@vladak vladak changed the title Search by History is not working for a Git repository Search by History is not working for a Git repository when indexing with git 1.8 Sep 5, 2018
@vladak
Member

vladak commented Sep 5, 2018

Could you actually share the indexer logs from a run with git 1.8 somewhere? (You can redact the file paths.)

@vladak
Member

vladak commented Sep 5, 2018

My guess is that when running the indexer with git 1.8, no history cache is generated at all. This is because the history log executor uses the --date=iso8601-strict option, which the old git log command does not support (compare https://git-scm.com/docs/git-log/1.8.5.2 with https://git-scm.com/docs/git-log/).

Again, the indexer logs should tell.
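A quick way to verify this on the indexing host, from any Git working copy:

git --version
git log -1 --date=iso8601-strict

On git 1.8.x the second command should fail with an "unknown date format" error, while on 2.x it prints the latest commit with a strict ISO 8601 date.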

@naveen0256
Author

Hi Vladak, thanks for the response.
As I mentioned at the beginning of this thread, we index the code with both git 1.8 and git 2.16 to make the indexing faster.
When using git 1.8, the history cache was not generated; as you suspected, errors were thrown because of --date=iso8601-strict. In the second phase, with git 1.8, all the source files were still traversed and written to disk.
After the second run with git 2.16, the history cache did get generated (the first phase, which had produced nothing with 1.8). The second phase was mostly skipped, since the source files had already been written to disk during the first run with git 1.8.

Yes, with git 2.16 it is the second phase, in which all the source files are traversed, that takes a long time. We indexed one branch with git 2.16 alone yesterday; writing out the source files in the second phase took a while, and the history search now works fine on that branch as expected.
So if we use opengrok-1.1-rc20, will the indexing be faster? I tried 1.1-rc40, but it requires some Python libraries, so for now we would like to stick with the bash scripts.
Please advise.

@vladak
Member

vladak commented Sep 6, 2018

If the history cache is not generated, history will not be stored in the index and therefore cannot be searched. Basically you were bitten by #747 (and also by not reading the indexer logs carefully): the whole indexing needs to be terminated if history cache generation fails (at least for SCMs capable of fetching history per directory).

Enforcing a certain Git version would not work because of that issue and would also create an unnecessary maintenance burden.

The performance enhancements were mentioned before; just try the most recent version and see for yourself. In the end it will depend on how powerful your hardware is. If you identify any weak spots (e.g. using the USE methodology), file a new issue with a detailed description.

The traversal of all files always needs to happen. It does not depend on history cache generation and therefore does not depend on git or any other SCM at all (unless it hits the behaviour described in the issue above).

As for the Python scripts, unless you need per-project management and indexing, you can run the indexer with bare java from the command line perfectly fine.
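For illustration only, a bare-java invocation could look roughly like the following; the jar location, ctags path and source/data/configuration paths are placeholders, and the exact option set depends on your deployment:

java -Xmx8192m -jar lib/opengrok.jar -v -c /usr/bin/ctags -s /opengrok/src -d /opengrok/data -W /opengrok/etc/configuration.xml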
