Subsequent (2nd/3rd/etc.) indexing taking time #3049

Open
ghost opened this issue Feb 20, 2020 · 5 comments

@ghost

ghost commented Feb 20, 2020

Hi OpenGrok dev team,

I think this is more of a question than a bug report.
I expected subsequent (2nd, 3rd, etc.) indexing runs to take much less time
than the initial (1st) run for git repository branches.
However, for all of our git repository branches, subsequent indexing takes about
half as long as the initial indexing, or worse. For example, our most
popular (and very large) branch takes about 42 hours for initial indexing, and
subsequent indexing takes about 20 hours. I was expecting it to be much shorter.
Is this expected and normal? I hope not, and that you can spot something to improve
in my configuration/settings.

Here is my environment:
H/W (vm): CPU Xeon 3.2 GHz 8 core, 96GB RAM
S/W: RHEL 7
tomcat-9.0.13, jdk-11.0.2, opengrok-1.3.8
Universal Ctags download and built as of 02/13/2020
git version 2.19.1
/etc/security/limits.conf: soft/hard nofile set as 65536

I ran the following sample process using the MongoDB source code and my sample scripts.
(Please see my attached sample scripts.)

steps:

1st run:
run "./prep.sh temp1" to set up the workspace (temp1) and deploy the initial war
cd /opt/pisces/workspace/temp1/src
git clone https://github.com/mongodb/mongo
run "./idx.sh temp1" for indexing
==> It took about 10 minutes.

2nd/3rd/etc. run:
cd /opt/pisces/workspace/temp1/src/mongo
git pull
run "./idx.sh temp1" for indexing
==> It also took about 10 minutes.

(I also tried omitting the "git pull" step above, with the same result.)
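For anyone reproducing the comparison above, here is a minimal sketch for timing each indexing run and expressing a later run as a fraction of the first. It assumes the `./idx.sh temp1` invocation from the steps; the helper names `time_command` and `later_to_first_ratio` are made up for illustration and are not part of OpenGrok.

```python
import subprocess
import sys
import time


def time_command(argv, cwd=None):
    """Run a command and return (elapsed_seconds, exit_code)."""
    start = time.monotonic()
    completed = subprocess.run(argv, cwd=cwd)
    return time.monotonic() - start, completed.returncode


def later_to_first_ratio(first_seconds, later_seconds):
    """Fraction of the first run's time that a later run took."""
    return later_seconds / first_seconds


if __name__ == "__main__":
    # Assumption: run from the directory that holds idx.sh, with the
    # workspace name as the first argument (defaults to "temp1").
    workspace = sys.argv[1] if len(sys.argv) > 1 else "temp1"
    elapsed, code = time_command(["./idx.sh", workspace])
    print(f"indexing took {elapsed:.1f}s (exit code {code})")
```

With the figures reported for the large branch, `later_to_first_ratio(42, 20)` is roughly 0.48, which is the "about half" described above.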

idx.sh.txt
prep.sh.txt

@idodeclare
Contributor

The optimize() step of OpenGrok is antiquated, and the Lucene forceMerge() it does is the source of considerable slowness in every run. For very large repos that can mean hours of merging for only a few commits of updates.

I have a branch that retired that merge handling (while accommodating the existence of deleted objects for a time in an index so they don’t appear in search results). The branch doesn’t work anymore after the recent Lucene upgrades, and I haven’t focused on fixing it.

I’m a little surprised that for your tests with just Mongo the timings do not change at all on subsequent runs. Can you see if the log shows a lot of entries on subsequent runs? Normally it would have relatively few entries, with just a longer-than-expected optimize() run.

@ghost
Author

ghost commented Feb 20, 2020

Thank you for your response. Yes, I see 5000 entries in the log files. I am attaching the last one, which is from the no-"git pull" case. It also ran for about 10 minutes.
opengrok0.0.log

@idodeclare
Contributor

Mongo in OpenGrok is afflicted by #2986. The log shows thousands of lines of tag processing.

@louie0817
Contributor

louie0817 commented Mar 9, 2020

> Mongo in OpenGrok is afflicted by #2986. The log shows thousands of lines of tag processing.

I downloaded Mongo; it only has 790 tags, so that means only 790 calls to "git log" to get info about tags on the initial index run. Any incremental run will do about 290, which is due to a different bug that I am also addressing in #2986. I confirmed this with strace on the initial and 2nd index runs.

The different bug is that, for any version with multiple tags, the current OpenGrok version only records the first one. Once I submit the PR, that will be fixed along with the intended fix of removing the one "git log" per tag. That being said, in the 1.3.6 Docker image, the initial index for the Mongo source took about 10 minutes and subsequent indexes took 50 seconds, so if you are seeing thousands of tags, perhaps you mean ctags processing and not git tag processing?
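To check how many commits in a clone carry more than one tag (the case described above), something like the following sketch may help. It reads `git for-each-ref` output and groups tag names by the commit they point at. The function names are hypothetical, and this is not how OpenGrok itself collects tags.

```python
import subprocess
from collections import defaultdict

# objectname is the ref's own object; *objectname is the object an annotated
# tag points at (empty for lightweight tags).
FORMAT = "%(objectname) %(*objectname) %(refname:short)"


def list_tag_lines(repo_dir):
    """Return one line per tag from `git for-each-ref` in the given clone."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "for-each-ref", f"--format={FORMAT}", "refs/tags"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()


def group_tags_by_commit(lines):
    """Map each target object id to the list of tag names pointing at it."""
    by_commit = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) == 3:      # annotated tag: use the dereferenced target
            _, target, name = fields
        elif len(fields) == 2:    # lightweight tag: points at the commit directly
            target, name = fields
        else:
            continue              # skip anything unexpected
        by_commit[target].append(name)
    return dict(by_commit)


def multi_tagged(by_commit):
    """Keep only targets that have more than one tag."""
    return {c: names for c, names in by_commit.items() if len(names) > 1}
```

For example, `multi_tagged(group_tags_by_commit(list_tag_lines("/path/to/mongo")))` would list the commits where only the first tag is currently recorded; the path is a placeholder.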

UPDATE: I was only looking at unique tags that were being passed to "git log" commands, which is the 790 I referred to. But looking back at my trace, there seem to be 3 calls per tag. That seems to be a function of the threading, as one strace line shows "unfinished", and apparently one of them is the initial exec and another is the resumption. I am not 100 percent sure about this.
So for the raw number of execve calls, I see 3434 in the initial run and 2865 in the incremental indexing. Still, execution times were about 10 minutes and 50 seconds, respectively.
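When strace follows multiple processes, a single execve can be split into an `<unfinished ...>` line and a later `<... execve resumed>` line, so counting every line that mentions execve overstates the number of calls. A rough sketch for separating the two, assuming ordinary strace text output (the function name is made up for illustration):

```python
import re

# A line that starts an execve call, e.g.
#   1234 execve("/usr/bin/git", ["git", "log", ...], ...) = 0
#   1234 execve("/usr/bin/git", ["git", "log", ...], ... <unfinished ...>
EXECVE_START = re.compile(r"\bexecve\(")
# The second half of a call that strace split across two lines, e.g.
#   1234 <... execve resumed>) = 0
EXECVE_RESUMED = re.compile(r"<\.\.\. execve resumed>")


def count_execve(lines):
    """Count execve calls and resumption lines in strace output."""
    started = 0
    resumed = 0
    for line in lines:
        if EXECVE_START.search(line):
            started += 1
        elif EXECVE_RESUMED.search(line):
            resumed += 1
    return {
        "calls": started,                  # each call counted once
        "resumed_lines": resumed,          # continuation lines only
        "raw_lines": started + resumed,    # what a naive line count reports
    }
```

Usage would be along the lines of `count_execve(open("trace.txt"))`, where `trace.txt` is a placeholder for the saved strace output. Comparing `calls` with `raw_lines` shows how much of the total comes from resumption lines rather than separate executions.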

UPDATE 2: Using the 1.3.9 Docker image, results were the same: about 10 minutes for the initial index and 46 seconds for a subsequent index.

@louie0817
Contributor

I believe closed issue #3067 could be a cause of slowness here.
