History is enabled in index.sh, but no git history in webapp when running indexer without -S #3071

Closed
Ymoise opened this issue Mar 16, 2020 · 35 comments


Ymoise commented Mar 16, 2020

I'm running in a container based on a docker image built from opengrok/docker, and an altered index.sh that currently looks like this:

#!/bin/bash

LOCKFILE=/var/run/opengrok-indexer
URI="http://localhost:8080"
OPS=${INDEXER_FLAGS:='-H -P -G'}

if [ -f "$LOCKFILE" ]; then
        date +"%F %T Indexer still locked, skipping indexing"
        exit 1
fi

touch $LOCKFILE

date +"%F %T Indexing starting"
opengrok-indexer \
    -J=-Xmx8g \
    -J=-server \
    -a /opengrok/lib/opengrok.jar -- \
    -i hugeMatlabFile.mat \
    -m 256 \
    -v \
    --repository /opengrok/src/master \
    -s /opengrok/src \
    -d /opengrok/data \
    --remote on \
    -W /opengrok/etc/configuration.xml \
    -U "$URI" \
    $OPS \
    $INDEXER_OPT "$@"
date +"%F %T Indexing finished"

The index builds OK and looks fine on a cursory examination, but the history link doesn't work, and the web app front page no longer shows the "current version" list with the last commit, which I do get with the 1.5 year old version if I run it on the same files.


The only SEVERE-level problem I found in the log is a handful of these:

WARNING: CTags parsing problem:
java.io.IOException: Stream closed
    at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:336)
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:161)
    at java.io.BufferedReader.readLine(BufferedReader.java:324)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at org.opengrok.indexer.analysis.Ctags.readTags(Ctags.java:547)
    at org.opengrok.indexer.analysis.Ctags.lambda$doCtags$2(Ctags.java:455)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Mar 16, 2020 11:31:08 PM org.opengrok.indexer.analysis.Ctags readTags
SEVERE: CTag reader cycle was interrupted!


Is this a settings issue? Did I put in the wrong flag somewhere? I mean, as far as I understand it, the -H means history is on, no?

Or is it the SEVERE error?

Either way, please advise... and please excuse my being a pain.

Author

Ymoise commented Mar 17, 2020

On a side question - it currently takes each container ~25 minutes to re-index after a git pull.

If I fix the history issue, am I risking that number increasing in the extreme?

Re-indexing, using the 1.5 year old version consistently took about 7hrs, even if only one file had been changed.

At the cost of having to tell the developers "If you want a history - go to the repo and look it up", I'd keep the history off if it'll let me keep the short update time.

25 minutes is doable. 7hrs is not.

Will this put me back in 7hrs territory?

Member

vladak commented Mar 17, 2020

You do not need to alter index.sh in order to insert indexer options; there is an environment variable for that, see https://github.com/oracle/opengrok/tree/master/docker#environment-variables

Member

vladak commented Mar 17, 2020

A Ctags problem cannot be related to history generation because it happens in the second stage of indexing, after the history has already been generated (if enabled).

Check the indexer logs (see https://github.com/oracle/opengrok/tree/master/docker#indexer-logs) to see what happened with the history. It should say something like this near the beginning of the indexing when run with -H -S:

Mar 17, 2020 10:30:09 AM org.opengrok.indexer.util.Statistics report
INFO: Done scanning for repositories, found 7 repositories (took 555 ms)
Mar 17, 2020 10:30:09 AM org.opengrok.indexer.index.Indexer prepareIndexer
INFO: Generating history cache for all repositories ...
Mar 17, 2020 10:30:09 AM org.opengrok.indexer.history.HistoryGuru createCacheReal
INFO: Creating historycache for 7 repositories
Mar 17, 2020 10:30:09 AM org.opengrok.indexer.history.HistoryGuru createCache
INFO: Creating historycache for /var/opengrok/src/foo (MercurialRepository) without renamed file handling
Mar 17, 2020 10:30:09 AM org.opengrok.indexer.history.HistoryGuru createCache
INFO: Creating historycache for /var/opengrok/src/sudo (GitRepository) without renamed file handling
...
Mar 17, 2020 10:30:09 AM org.opengrok.indexer.util.Statistics report
INFO: Done historycache for /var/opengrok/src/sudo (took 20 ms)
Mar 17, 2020 10:30:10 AM org.opengrok.indexer.util.Statistics report
INFO: Done historycache for /var/opengrok/src/foo (took 194 ms)
INFO: Done historycache for all repositories (took 2.795 seconds)
Mar 17, 2020 10:30:12 AM org.opengrok.indexer.index.Indexer prepareIndexer
INFO: Done...

If I run the indexer just with -S and without -H the logs will look like this:

Mar 17, 2020 10:31:40 AM org.opengrok.indexer.index.Indexer prepareIndexer
INFO: Generating history cache for all repositories ...
Mar 17, 2020 10:31:40 AM org.opengrok.indexer.history.HistoryGuru createCacheReal
INFO: Creating historycache for 7 repositories
Mar 17, 2020 10:31:40 AM org.opengrok.indexer.history.HistoryGuru createCache
INFO: Skipping history cache creation of MercurialRepository repository in /var/opengrok/src/foo and its subdirectories
Mar 17, 2020 10:31:40 AM org.opengrok.indexer.history.HistoryGuru createCache
INFO: Skipping history cache creation of GitRepository repository in /var/opengrok/src/opengrok and its subdirectories
Mar 17, 2020 10:31:40 AM org.opengrok.indexer.history.HistoryGuru createCache
...
Mar 17, 2020 10:31:40 AM org.opengrok.indexer.util.Statistics report
INFO: Done historycache for all repositories (took 9 ms)
Mar 17, 2020 10:31:40 AM org.opengrok.indexer.index.Indexer prepareIndexer
INFO: Done...

If I run the indexer without -S and -H it will say just:

INFO: Generating history cache for all repositories ...
Mar 17, 2020 10:33:35 AM org.opengrok.indexer.history.HistoryGuru createCacheReal
INFO: Creating historycache for 0 repositories
Mar 17, 2020 10:33:35 AM org.opengrok.indexer.util.Statistics report
INFO: Done historycache for all repositories (took 5 ms)
Mar 17, 2020 10:33:35 AM org.opengrok.indexer.index.Indexer prepareIndexer
INFO: Done...

I.e. without scanning for repositories (or getting them from a read-only configuration) there is no way the history cache can be generated.
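
The three cases can be told apart by two things in the indexer log: the count in the "Creating historycache for N repositories" line and the presence of "Skipping history cache creation" lines. The following is only a minimal sketch, assuming the log was saved to a file; the function name history_summary is made up for illustration and is not part of OpenGrok.

```shell
# Hypothetical helper (not part of OpenGrok): summarize what an indexer
# log says about history cache generation.
#   repositories = N from the first "Creating historycache for N repositories" line
#   skipped      = number of "Skipping history cache creation" lines
history_summary() {
    local logfile="$1"
    local repos skipped
    repos=$(sed -n 's/.*Creating historycache for \([0-9][0-9]*\) repositor.*/\1/p' "$logfile" | head -n 1)
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
    skipped=$(grep -c 'Skipping history cache creation' "$logfile" || true)
    echo "repositories=${repos:-unknown} skipped=${skipped}"
}
```

Read against the logs above: repositories=0 means the indexer did not know about any repository, a non-zero skipped count means repositories were known but history was not generated for them, and a non-zero repository count with skipped=0 corresponds to the -H -S run.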

As for the Ctags failure, without seeing more context of the log it is hard to tell what happened. Possibly the ctags process went away for some reason.

Member

vladak commented Mar 17, 2020

> On a side question - it currently takes each container ~25 minutes to re-index after a git pull.
>
> If I fix the history issue, am I risking that number increasing in the extreme?
>
> Re-indexing, using the 1.5 year old version consistently took about 7hrs, even if only one file had been changed.
>
> At the cost of having to tell the developers "If you want a history - go to the repo and look it up", I'd keep the history off if it'll let me keep the short update time.
>
> 25 minutes is doable. 7hrs is not.
>
> Will this put me back in 7hrs territory?

The simplest way to find out is to try. With containers this is particularly easy.

The history indexing for most of the modern SCMs happens incrementally, so it should not add too much time. It depends on how large each increment is, of course.

Again, the indexer log contains statistics showing how long each step took (the log messages say e.g. something like Done indexing of directory /opengrok (took 3 ms)).

Member

vladak commented Mar 17, 2020

> On a side question - it currently takes each container ~25 minutes to re-index after a git pull.

The long time might be due to I/O (again assuming that the updates are on the small side), as the indexer traverses the whole source directory tree, be it the source root or the project root in the case of per project indexing. In the Docker environment it is the former.

You can see the time spent on directory traversal by subtracting the timestamps of the Starting traversal of directory and Starting indexing of directory messages. Also, you can run the indexer with --progress to see the file counts.

#3049 might actually track a similar problem, even though it focuses on other metrics.

Now, for SCMs based on changesets the indexer could somehow take the information about changed files from the mirror script (by looking into the "stats" of the incoming changesets) and process just these files.

Author

Ymoise commented Mar 17, 2020

So I have to have -S on to have the history on?

I have one repository in most containers... and 124 in one of them.

I took out the -S flag and added the repositories to index.sh with the --repository flag so it could skip the "searching for repositories" step which I have seen take up a whole hour - At least in the older version of opengrok.

In the above-quoted index.sh I do have one repository:

--repository /opengrok/src/master \

Would that not be enough to trigger history? It has to actually search, first?


As to the mirroring script, I'm running the indexer via "docker exec /scripts/index.sh". I'm not using the scripts. Are they a must now?

Member

vladak commented Mar 17, 2020

The --repository should be sufficient. The history generation just needs to have some knowledge of the repositories.

Member

vladak commented Mar 17, 2020

> As to the mirroring script, I'm running the indexer via "docker exec /scripts/index.sh". I'm not using the scripts. Are they a must now?

In the current docker image version the mirroring is done using the mirror script:

opengrok-mirror --all --uri "$URI"

The opengrok-mirror script queries the web app for the list of projects with repositories to sync. The indexer updates this list at the end of indexing.

Author

Ymoise commented Mar 17, 2020

I see. Mystery solved, then.

Thank you.

Member

vladak commented Mar 17, 2020

Can you see the history link now? What was the problem then?

Author

Ymoise commented Mar 18, 2020

No, I'm afraid not. :(

opengrok-mirror returns

No repositories for project master

Member

vladak commented Mar 18, 2020

> --repository /opengrok/src/master \

The help for --repository says:

        Path (relative to the source root) to a repository for generating
        history (if -H,--history is on). By default all discovered repositories
        are history-eligible; using --repository limits to only those specified.
        Option may be repeated.

Member

vladak commented Mar 18, 2020

Actually, using --repository to limit repository detection will not work. This option merely constrains history cache generation after repositories are already detected.

Member

vladak commented Mar 18, 2020

You can add the project and its repositories using the RESTful API (i.e. https://opengrok.docs.apiary.io/reference/0/projects/add-project) before running the indexer and after starting the webapp.

vladak changed the title from "History is enabled in index.sh, but no git history in webapp" to "History is enabled in index.sh, but no git history in webapp when running indexer without -S" on Mar 18, 2020

Member

vladak commented Mar 18, 2020

Possibly, we can make -S accept an optional argument and allow the option to be repeated. This way one could narrow the repository scan to a selected subset of directories.

@idodeclare
Contributor

Perhaps using -S but with possible reduction in --depth and multiple --disableRepository <type_name> to speed up any slow "searching for repositories"?

Member

vladak commented Mar 19, 2020

> You can add the project and its repositories using the RESTful API (i.e. https://opengrok.docs.apiary.io/reference/0/projects/add-project) before running the indexer and after starting the webapp.

This actually requires a couple of steps:

  • add the project (along with its repositories) to the web app:
curl -X POST -d bar -H "Content-Type:text/plain" \
    http://localhost:8080/source/api/v1/projects
  • check the repository was added:
curl http://localhost:8080/source/api/v1/projects/bar/repositories
  • retrieve the configuration from the web app and store it in a file:
curl http://localhost:8080/source/api/v1/configuration \
    > /opengrok/etc/readonly-config.xml 
  • run the indexer with the read-only configuration (so it has the knowledge of the repository) and limit history index to the repository:
opengrok-indexer \
    -J=-Xmx8g \
    -J=-server \
    -a /opengrok/lib/opengrok.jar -- \
    -m 256 \
    -v \
    -H -P \
    --repository foo \
    -s /opengrok/src \
    -d /opengrok/data \
    -R /opengrok/etc/readonly-config.xml \
    -W /opengrok/etc/configuration.xml \
    -U http://localhost:8080/source

Anyhow, this is getting outside of the Docker image territory.

Member

vladak commented Mar 19, 2020

Also, it looks like the --repository does not properly limit history cache generation in the second phase of indexing. Might be similar to #1022.

Member

vladak commented Mar 21, 2020

The 1.3.10 release has the -S with optional repository path.

The rest of the problems are captured in the issues referenced herein, so I think that's it for this one. If not, feel free to reopen.

vladak closed this as completed on Mar 21, 2020

Author

Ymoise commented Mar 24, 2020

> This actually requires couple of steps:
> [...]
> Anyhow, this is getting outside of the Docker image territory.

Thanks for this. I'll try it.

Sorry about the lack of reply. We've been working less, due to the plague, only just got back to this.


jetm commented Dec 22, 2020

> This actually requires couple of steps:
> [...]
> Anyhow, this is getting outside of the Docker image territory.

@vladak What would be the similar steps to reindex one repository?

Member

vladak commented Jan 4, 2021

> @vladak What would be the similar steps to reindex one repository?

I don't think it is possible to reindex just one repository of a given project.

Author

Ymoise commented Feb 4, 2021

> > This actually requires couple of steps:
> > [...]
> > Anyhow, this is getting outside of the Docker image territory.
>
> Thanks for this. I'll try it.
>
> Sorry about the lack of reply. We've been working less, due to the plague, only just got back to this.

So, I got back to working on this project, this sprint and I tried your suggested instructions and I got:

> curl -X POST -d bar -H "Content-Type:text/plain" http://localhost:8080/source/api/v1/projects

<!doctype html><html lang="en"><head><title>HTTP Status 404 – Not Found</title><style type="text/css">body {font-family:Tahoma,Arial,sans-serif;} h1, h2, h3, b {color:white;background-color:#525D76;} h1 {font-size:22px;} h2 {font-size:16px;} h3 {font-size:14px;} p {font-size:12px;} a {color:black;} .line {height:1px;background-color:#525D76;border:none;}</style></head><body><h1>HTTP Status 404 – Not Found</h1><hr class="line" /><p><b>Type</b> Status Report</p><p><b>Message</b> &#47;source&#47;api&#47;v1&#47;projects</p><p><b>Description</b> The origin server did not find a current representation for the target resource or is not willing to disclose that one exists.</p><hr class="line" /><h3>Apache Tomcat/9.0.33</h3></body></html>

What am I doing wrong?

Author

Ymoise commented Feb 4, 2021

> > @vladak What would be the similar steps to reindex one repository?
>
> I don't think it is possible to reindex just one repository of given project.

I'm starting to think that you're talking about a completely different thing when you talk about repository.

Are we not talking about, for example, something like a git repo? Because, I only HAVE one of those in each project.

Member

vladak commented Feb 4, 2021

> > > @vladak What would be the similar steps to reindex one repository?
>
> > I don't think it is possible to reindex just one repository of given project.
>
> I'm starting to think that you're talking about a completely different thing when you talk about repository.
>
> Are we not talking about, for example, something like a git repo? Because, I only HAVE one of those in each project.

Any directory directly underneath the source root is a project (assuming projects are enabled). A project may have zero or more repositories. A repository is a checkout from a Source Code Management system such as Mercurial or Git. For example, the directory structure can look like this (directory listing only):

/opengrok/src                   # source root
/opengrok/src/on                # the 'on' project
/opengrok/src/on/.hg            # the 'on' project has top level Mercurial repository
/opengrok/src/on/usr
/opengrok/src/on/usr/man
/opengrok/src/on/usr/man/.hg    # this is a nested Mercurial repository of the 'on' project
/opengrok/src/foo               # the 'foo' project; does not have any repositories
/opengrok/src/foo/bar           # sub-directory of the 'foo' project

So, there are two projects. The on project has 2 repositories, the foo project has none.

The level of granularity for indexing is a project. You cannot really index just one repository from a project.
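
To see how a given source root maps onto this project/repository model, the layout can be inspected directly. This is only a rough sketch: it treats every directory directly under the source root as a project and counts only .git and .hg markers beneath it, which is a simplification of what the indexer's repository scan does. The function name list_project_repos is made up for illustration.

```shell
# Hypothetical helper: print "<project> <repository count>" for every
# directory directly under the given source root. Only .git and .hg
# markers are counted; other SCM types are ignored in this sketch.
list_project_repos() {
    local src_root="$1"
    local project count
    for project in "$src_root"/*/; do
        [ -d "$project" ] || continue      # glob did not match any directory
        # -prune stops find from descending into the SCM metadata itself
        count=$(find "$project" \( -name .git -o -name .hg \) -prune -print | wc -l)
        echo "$(basename "$project") $((count))"
    done
}
```

Run as list_project_repos /opengrok/src, the layout above would be reported as two projects, with two repositories for on and none for foo.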

Member

vladak commented Feb 4, 2021

> So, I got back to working on this project, this sprint and I tried your suggested instructions and I got:
>
> [curl command and Tomcat "HTTP Status 404 Not Found" response, as above]
>
> What am I doing wrong?

The usage matches the documentation, so there is some problem with the web app. Is it actually deployed? Does a GET request to http://localhost:8080/source/api/v1/projects return anything?

Author

Ymoise commented Feb 7, 2021

> Any directory directly underneath source root is a project (assuming projects are enabled). A project may have zero or more repositories. [...]
>
> The level of granularity for indexing is a project. You cannot really index just one repository from a project.

The thing is, I checked, and we only HAVE one repository per project. There are no nested repositories. We don't use them.

And, the other day, I ran the indexer with the -S option on, to compare it to the run time without it on (roughly 40 minutes) and I ended up with 12 hours of indexing, which is insane.

Author

Ymoise commented Feb 7, 2021

> The usage matches the documentation so there is some problem with the web app. Is it actually deployed ? Does a GET request to http://localhost:8080/source/api/v1/projects return anything ?

It returns the same "file not found" error I showed you before.

But if I open a browser and navigate to the exposed port for the container, it's up and running and I can search it just fine, and if I have the -S on, like I do now, I can even see history.

Member

vladak commented Feb 8, 2021

> And, the other day, I ran the indexer with the -S option on, to compare it to the run time without it on (roughly 40 minutes) and I ended up with 12 hours of indexing, which is insane.

I'd start with looking for where the time was actually spent. From the indexer logs this is quite easy - just search for Statistics and it will spit out things like:

2021-02-08 10:15:23.626+0100 INFO t1 Statistics.logIt: Done scanning for repositories, found 20 repositories (took 1.743 seconds)
...
2021-02-08 10:15:23.709+0100 INFO t175 Statistics.logIt: Done historycache for /var/opengrok/src/opengrok (took 66 ms)
2021-02-08 10:16:35.060+0100 INFO t181 Statistics.logIt: Done historycache for /var/opengrok/src/Lucene (took 0:01:11)
2021-02-08 10:16:35.060+0100 INFO t1 Statistics.logIt: Done history cache for all repositories (took 0:01:11)
...

2021-02-08 10:16:38.082+0100 INFO t19 Statistics.logIt: Done traversal of directory /Lucene (took 2.828 seconds)
2021-02-08 10:16:48.975+0100 INFO t24 Statistics.logIt: Done traversal of directory /opengrok (took 746 ms)

...
2021-02-08 10:17:34.492+0100 INFO t24 Statistics.logIt: Done indexing of directory /opengrok (took 45.516 seconds)
2021-02-08 10:18:02.791+0100 INFO t19 Statistics.logIt: Done indexing of directory /Lucene (took 0:01:24)
...
2021-02-08 10:18:10.943+0100 INFO t1 Statistics.logIt: Done indexing data of all repositories (took 0:01:35)

and spot any outliers. There are log messages per project and also for all projects/repositories overall. From the former you can get some idea about "big" projects; from the latter you can see the overall times of the indexing phases. In the above example the Lucene project stands out because I removed all of its data prior to indexing.

Without the actual indexer options it is hard to speculate why the sudden jump in time occurred; however, my guess would be that the added -S option effectively enabled history cache creation, and the cache was created for the first time, which usually takes a lot of time.
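
Spotting the outliers can be semi-automated. The following is a minimal sketch, assuming the single-line log format shown above (lines containing Statistics and ending in "(took ...)") and only the three duration notations that appear in that excerpt; the function name slowest_steps is made up for illustration.

```shell
# Hypothetical helper: read indexer log lines on stdin, keep the
# Statistics lines that end in "(took ...)", and print
# "<seconds> <message>" with the slowest step first.
# Handles the duration notations "66 ms", "1.743 seconds" and "0:01:11".
slowest_steps() {
    grep 'Statistics' | awk '
        match($0, /\(took [^)]*\)$/) {
            # text between "(took " and the closing parenthesis
            took = substr($0, RSTART + 6, RLENGTH - 7)
            msg = $0
            sub(/^.*Statistics[^:]*: /, "", msg)   # drop timestamp and logger prefix
            sub(/ \(took [^)]*\)$/, "", msg)       # drop the duration suffix
            if (took ~ / ms$/) {
                secs = took / 1000                 # awk uses the leading number
            } else if (took ~ /:/) {
                n = split(took, part, ":")         # H:MM:SS
                secs = 0
                for (i = 1; i <= n; i++) secs = secs * 60 + part[i]
            } else {
                secs = took + 0                    # "N seconds"
            }
            printf "%.3f %s\n", secs, msg
        }' | sort -rn
}
```

Something like slowest_steps < indexer.log | head would then list the most expensive steps; the log file name here is just a placeholder.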

Member

vladak commented Feb 8, 2021

> > The usage matches the documentation so there is some problem with the web app. Is it actually deployed ? Does a GET request to http://localhost:8080/source/api/v1/projects return anything ?
>
> It returns the same "file not found" error, I showed you before.
>
> But if I open a browser and navigate to the exposed port for the container, it's up and running and I can search it just fine, and if I have the -S on, like I do now, I can even see history.

Assuming 8080 is actually the port exposed by the container, there are 2 things you need to be aware of:

  • by default /source is merely a redirect to / (assuming the URL_ROOT environment variable is not set; if it were set, /source would redirect to the alternative location specified by that variable), so you need to make the request to http://localhost:8080/api/v1/projects
  • in Docker, localhost:8080 is not really localhost, so if you access most of the API from outside of the container you will get the HTTP 401 Unauthorized error even with the correct URL. You need to perform the API request from inside, e.g. docker exec -it opengrok-test curl http://localhost:8080/api/v1/projects, or set up the web app with an API token and use the token for the API requests.
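
The first point can be sketched as a small helper that builds the API base URL as seen from inside the container. The function name api_base_url is made up, and the idea that the API lives under $URL_ROOT/api/v1 when URL_ROOT is set is an assumption based on the redirect behaviour described above, not something confirmed in this thread.

```shell
# Hypothetical helper: build the REST API base URL as seen from inside
# the container. The first argument mirrors the URL_ROOT environment
# variable; when it is empty or "/", the API is assumed to be served
# from /api/v1. The second argument is the port (default 8080).
api_base_url() {
    local url_root="${1:-/}"
    local port="${2:-8080}"
    url_root="${url_root%/}"     # strip one trailing slash: "/" becomes ""
    echo "http://localhost:${port}${url_root}/api/v1"
}
```

With the default setup this yields http://localhost:8080/api/v1, so the project list could be fetched with docker exec -it opengrok-test curl "$(api_base_url)/projects".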

Author

Ymoise commented Feb 9, 2021

No, no... I'm running all of the curl commands from inside the container.

The issue seems to have been the "/source". If I run curl http://localhost:8080/api/v1/projects/ I get ["folder1","folder2"]

Now, as to this:

> You can add the project and its repositories using the RESTful API (i.e. https://opengrok.docs.apiary.io/reference/0/projects/add-project) before running the indexer and after starting the webapp.
>
> This actually requires couple of steps:
>
>   • add the project (along with its repositories) to the web app:
> curl -X POST -d bar -H "Content-Type:text/plain" \
>     http://localhost:8080/source/api/v1/projects
>   • check the repository was added:
> curl http://localhost:8080/source/api/v1/projects/bar/repositories

How do I add a project's repositories?

Member

vladak commented Feb 9, 2021

> How do I add a project's repositories?

The API call will add the project and its repositories. In fact, as I discovered recently, it adds the repositories even when they are not needed - #3405.

Author

Ymoise commented Feb 9, 2021

> Without the actual indexer options it is hard to speculate why the sudden jump in time occurred however my guess would be that the added -S option basically enabled history cache creation and it was created for the first time which usually takes a lot of time.

Ok, I missed this one, on first reading.

So, if I understand you correctly, the history cache will take an eon to create for the first time, whether I tell it my projects and repos in advance, or not?

Member

vladak commented Feb 10, 2021

> So, if I understand you correctly, the history cache will take an eon to create for the first time, whether I tell it my projects and repos in advance, or not?

Yes. Of course, it depends on the size of the history (there are some notable issues like #3416 or #3367). In general, the indexing process creates the history cache first (apart from SCMs that do not support history per directory) and then proceeds to create the actual index. So, if all projects are indexed together for the first time, the first phase of indexing, in which the history is created, delays the second phase significantly. You can use the per project workflow to make some of the projects available earlier.

If you have a pre-existing index and want to enable history, you will need to reindex from scratch (apart from xrefs), because the history data is part of the index (so that it can be searched) and per document (i.e. per file) updates happen only if the given file changes on disk. Possibly this could be an enhancement.

Member

vladak commented Feb 10, 2021

As for a demo of the per project workflow, take a look at PR #3402. It basically converts the Docker image to use the per project workflow while utilizing the Python utilities.
