History is enabled in index.sh, but no git history in webapp when running indexer without -S #3071
On a side question: it currently takes each container ~25 minutes to re-index after a git pull. If I fix the history issue, am I risking that number increasing in the extreme? Re-indexing with the 1.5-year-old version consistently took about 7 hours, even if only one file had been changed. At the cost of having to tell the developers "if you want the history, go to the repo and look it up", I'd keep the history off if it lets me keep the short update time. 25 minutes is doable; 7 hours is not. Will this put me back in 7-hour territory?
You do not need to alter …
Any Ctags problem cannot be related to history generation, because it happens in the second stage of indexing, after the history was already generated (if enabled). Check the indexer logs (see https://github.com/oracle/opengrok/tree/master/docker#indexer-logs) to see what happened with the history. It should say something like this near the beginning of the indexing when run with …
If I run the indexer just with …
If I run the indexer without -S …
I.e. without scanning for repositories (or getting them from read-only configuration), there is no way the history cache can be generated. As for the Ctags failure, without seeing more context of the log it is hard to tell what happened. Possibly the …
The simplest way to find out is to try. With containers this is particularly easy. The history indexing for most modern SCMs happens incrementally, so it should not add too much time; it depends on how large each increment is, of course. Again, the indexer log contains some statistics showing how long each step took (indeed, the log messages say e.g. something like …)
The long time might be due to I/O (again assuming that the updates are on the small side), as the indexer traverses the whole source directory tree (be it source root, or project root in the case of per-project indexing; in the Docker environment it is the former). You can see the time spent on directory traversal by subtracting the times between … #3049 might actually track a similar problem even though it focuses on other metrics. Now, for SCMs based on changesets, the indexer could somehow take the information about changed files from the mirror script (by looking into the "stats" of the incoming changesets) and process just those files.
So I have to have -S on to have the history on? I have one repository in most containers... and 124 in one of them. I took out the -S flag and added the repositories to index.sh with the --repository flag so it could skip the "searching for repositories" step, which I have seen take up a whole hour, at least in the older version of OpenGrok. In the above-quoted index.sh I have one repository: …
Would that not be enough to trigger history? It has to actually search first? As to the mirroring script: I'm running the indexer via "docker exec /scripts/index.sh". I'm not using the scripts. Are they a must now?
The …
In the current Docker image version the mirroring is done using the mirror script (see line 17 at commit fd17fd1).
The …
I see. Mystery solved, then. Thank you.
Can you see the history link now? What was the problem then?
No, I'm afraid not. :( opengrok-mirror returns …
The help for …
Actually, using …
You can add the project and its repositories using the RESTful API (i.e. https://opengrok.docs.apiary.io/reference/0/projects/add-project) before running the indexer and after starting the webapp.
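For example, a minimal sketch of such a call (the host, the port, the /source context path, and the project name "myproject" are assumptions based on the default Docker deployment; adjust to yours):

```shell
# Assumed defaults: webapp reachable on localhost:8080 under /source.
WEBAPP='http://localhost:8080/source'

# Sanity check: list the projects the webapp currently knows about.
curl -s "$WEBAPP/api/v1/projects" || echo 'webapp not reachable'

# Add a hypothetical project named "myproject"; the project name is
# sent as the plain-text request body.
curl -s -X POST -H 'Content-Type: text/plain' \
     -d 'myproject' "$WEBAPP/api/v1/projects" || echo 'webapp not reachable'
```

Run this after the webapp has started but before the indexer, as suggested above.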
Possibly, we can make …
Perhaps using …
This actually requires a couple of steps: …
Anyhow, this is getting outside of the Docker image territory.
Also, it looks like the …
The 1.3.10 release has -S with an optional repository path. The rest of the problems are captured in the issues referenced herein, so I think that's it for this one. If not, feel free to reopen.
Thanks for this. I'll try it. Sorry about the lack of reply; we've been working less due to the plague and only just got back to this.
@vladak What would be the similar steps to reindex one repository?
I don't think it is possible to reindex just one repository of a given project.
So, I got back to working on this project this sprint, tried your suggested instructions, and got: …
What am I doing wrong?
I'm starting to think that you're talking about a completely different thing when you talk about a repository. Are we not talking about, for example, something like a Git repo? Because I only HAVE one of those in each project.
Any directory directly underneath the source root is a project (assuming projects are enabled). A project may have zero or more repositories. A repository is a checkout from a Source Code Management system such as Mercurial or Git. For example, the directory structure can look like this (directory listing only): …
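An illustrative layout matching that description (all directory names here are hypothetical):

```
/opengrok/src                <- source root
|-- projA                    <- project backed by a single Git repository
|   `-- .git
`-- projB                    <- project containing two Mercurial repositories
    |-- repo1
    |   `-- .hg
    `-- repo2
        `-- .hg
```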
So, there are two projects. The … The level of granularity for indexing is a project. You cannot really index just one repository from a project.
The usage matches the documentation, so there is some problem with the web app. Is it actually deployed? Does a GET request to http://localhost:8080/source/api/v1/projects return anything?
The thing is, I checked, and we only HAVE one repository per project. There are no nested repositories; we don't use them. And the other day I ran the indexer with the -S option on, to compare its run time to the run without it (roughly 40 minutes), and I ended up with 12 hours of indexing, which is insane.
It returns the same "file not found" error I showed you before. But if I open a browser and navigate to the exposed port for the container, it's up and running and I can search it just fine, and if I have -S on, like I do now, I can even see the history.
I'd start by looking at where the time was actually spent. From the indexer logs this is quite easy: just search for … and spot any outliers. The log messages are per project and also for overall projects/repositories. From the former you can get some idea about "big" projects; from the latter you can see the overall times of the indexing phases. In the above example the Lucene project stands out because prior to indexing I removed all of its data. Without the actual indexer options it is hard to speculate why the sudden jump in time occurred; however, my guess would be that the added -S option basically enabled history cache creation, and it was created for the first time, which usually takes a lot of time.
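As a minimal sketch of that kind of triage (the log lines below are hypothetical stand-ins, not verbatim OpenGrok messages; the point is that the durations appear in parentheses and can be filtered out):

```shell
# Hypothetical excerpt standing in for a real indexer log:
cat > /tmp/indexer.log <<'EOF'
INFO: Done history cache for project projA (took 41.2 seconds)
INFO: Done history cache for project lucene (took 3:12:08)
INFO: Done indexing data of projA (took 55.7 seconds)
EOF

# Extract just the durations so outliers stand out:
grep -o 'took [^)]*' /tmp/indexer.log
```

Here the second line would immediately flag the project whose history cache dominated the run.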
Assuming 8080 is actually the port exposed by the container, there are two things you need to be aware of: …
No, no... I'm running all of the curl commands from inside the container. The issue seems to have been the "/source": if I run the request with that context path included, it works. Now, as to this: …
How do I add a project's repositories?
The API call will add the project and its repositories. In fact, as I discovered recently, it adds the repositories even in cases where they are not needed - see #3405.
Ok, I missed this one on first reading. So, if I understand you correctly, the history cache will take an eon to create for the first time, whether I tell it my projects and repos in advance or not?
Yes. Of course, it depends on the size of the history (there are some notable issues like #3416 or #3367). In general, the indexing process creates the history cache first (modulo SCMs that do not support history per directory) and then proceeds to create the actual index. So, if all projects are indexed together for the first time, the first phase of indexing, when the history is created, delays the second phase significantly. You can use the per-project workflow to make some of the projects available earlier. If you have a pre-existing index and want to enable history, you will need to reindex from scratch (modulo xrefs), because the history data is part of the index (so that it can be searched) and per-document (per-file) updates happen only if the given file changes on disk. Possibly this could be an enhancement.
For a demo of the per-project workflow, take a look at PR #3402. It basically converts the Docker image to use the per-project workflow while utilizing the Python utilities.
I'm running in a container based on a Docker image built from opengrok/docker, and an altered index.sh that currently looks like this: …
The index forms OK and seems fine on cursory examination, but the history link doesn't work, and there's no "current version"/last-commit list on the web app front page, which I do get with the 1.5-year-old version if I run it on the same files.
The only SEVERE I found in the log is a handful of these:
WARNING: CTags parsing problem:
java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170)
at java.io.BufferedInputStream.read(BufferedInputStream.java:336)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at org.opengrok.indexer.analysis.Ctags.readTags(Ctags.java:547)
at org.opengrok.indexer.analysis.Ctags.lambda$doCtags$2(Ctags.java:455)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Mar 16, 2020 11:31:08 PM org.opengrok.indexer.analysis.Ctags readTags
SEVERE: CTag reader cycle was interrupted!
Is this a settings issue? Did I put in the wrong flag somewhere? I mean, as far as I understand it, the -H means history is on, no?
Or is it the SEVERE?
Either way, please advise... and please excuse my being a pain.