Still dups in search with 1.9.4 #4317
Firstly, it would help to get more information about the dups, i.e. quantify them and see how they change over time. Ideally, track when they started appearing after the reindex from scratch, especially w.r.t. periodic reindex events.
It is very difficult for me to quantify the number of dups and when they start to appear because of the size of the data indexed. Here is a case that has been reported on my UAT.
What I don't understand is the data regarding the number of commits for the duplicated file:
1154 - 116 = 1038, so I would have expected 1038 different commits in the history cache file.
When searching, Tomcat logs this for the duplicated file:
My mistake... I closed the issue.
As for the discrepancy between the …
Also, what do the changes for … Given that I am not able to reproduce this locally, I will need some extensive logging and problem pinpointing on your side.
Yes. The repository azerty in project QWERTY is a Git repository.
I'd propose the following strategy for identifying the pattern that leads to the duplicate documents:
As for the mirroring, one way would be to let it run organically, i.e. let the incoming changes to the repository be brought in as they happen in the origin. The alternative would be to start at a certain changeset in the repository and add a bunch of changesets at a time, followed by a reindex and an index check. Repeat until there are duplicates in the index.
This is happening in OpenGrok version 1.12.12. I will try to provide the details requested in the previous comment.
We have hit this problem with a Mercurial repository on 1.12.12 (i.e. history based reindex enabled and in action). The problem became apparent after an upgrade to 1.12.28. In 1.12.15 the AnalyzerGuru version was bumped due to the addition of proper YAML support and this revealed …
Managed to replicate the problem with https://github.com/oracle/solaris-userland/ using this script (it needs #4535 to avoid failing the document check due to zero live documents on the initial indexing):

```bash
#!/bin/bash
#
# attempt to reproduce https://github.com/oracle/opengrok/issues/4317
#
# based on git-commit-hopping.sh but with added randomness
# set -x
set -e
repo_url="https://github.com/oracle/solaris-userland/"
initial_rev=32c0d9faed7b049872ca9bd78f9bf3e901cff482 # from 2022
src_root="/var/tmp/src.opengrok-issue-4317"
data_root="/var/tmp/data.opengrok-issue-4317"
# Assumes built OpenGrok.
function run_indexer()
{
    # TODO: store the log somewhere
    java -cp '/home/vkotal/opengrok-vladak-scratch/distribution/target/dist/*' \
        org.opengrok.indexer.index.Indexer \
        -c /usr/local/bin/ctags \
        --economical \
        -H -S -P -s "$src_root" -d "$data_root" \
        -W "$data_root/config.xml" \
        >/dev/null 2>&1
    java -cp '/home/vkotal/opengrok-vladak-scratch/distribution/target/dist/*' \
        org.opengrok.indexer.index.Indexer \
        -H \
        -R "$data_root/config.xml" \
        --checkIndex documents
}
function get_next()
{
    ids=( $(git log --oneline --reverse ..origin/master | awk '{ print $1 }') )
    size="${#ids[@]}"
    modulo=16
    if (( size == 0 )); then
        echo ""
        return
    fi
    if (( modulo > size )); then
        modulo=$size
    fi
    n=`expr $RANDOM % $modulo`
    if (( n == 0 )); then
        n=1
    fi
    echo "${ids[$n]}"
}
project_root="$src_root/solaris-userland"
if [[ ! -d $src_root ]]; then
    echo "Cloning $repo_url to source root"
    git clone $repo_url "$project_root"
fi
echo "Removing data root"
rm -rf "$data_root"
cd "$project_root"
git checkout "$initial_rev"
while [[ 1 ]]; do
    rev=$(get_next)
    if [[ -z $rev ]]; then
        break
    fi
    echo "Checking out $rev"
    git checkout -q $rev
    run_indexer
done
```

It ended after just 4 iterations with:
which matches my internal observations. For the record, here is the sequence of changesets: 32c0d9fae (no index), 1870f2259, 3b039b16f, a45ee5bf6, 87ffc28c4.
The cause of the above is that in the last reindex for changeset 87ffc28c4, the list of files returned from the …
The ordering of the … The trouble lies here:

opengrok/opengrok-indexer/src/main/java/org/opengrok/indexer/history/FileCollector.java, line 42 in 636b3f3
Rather than assuming a particular ordering, the set should be returned unsorted and …
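As an illustration of that approach (hypothetical names; this is not the actual FileCollector API): the collector only accumulates paths and makes no promise about their ordering, and the consumer imposes whatever ordering it needs at the point of use.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical collector: accumulates paths, makes no ordering guarantee.
class PathCollector {
    private final Set<String> files = new HashSet<>();

    void addFile(String path) {
        files.add(path);
    }

    Set<String> getFiles() {
        return files;
    }
}

class IndexingConsumer {
    // The consumer sorts explicitly with the comparator it actually indexes by,
    // instead of relying on the collector's iteration order.
    static SortedSet<String> orderForIndexing(Set<String> files, Comparator<String> pathComparator) {
        SortedSet<String> sorted = new TreeSet<>(pathComparator);
        sorted.addAll(files);
        return sorted;
    }
}
```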
There is another problem at:

opengrok/opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java, line 1661 in ba9e18d
This is needed because otherwise a longer path (in terms of path components) that has a character with a lower value at the position where the other path has its separator would come out lesser, even though it should not.
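A small, self-contained demonstration of the ordering problem and of a separator-aware comparison (this comparator is an illustrative sketch, not OpenGrok's actual implementation):

```java
import java.util.Comparator;

public class PathOrderDemo {

    // Illustrative comparator that makes the path separator sort before any
    // other character, so everything under "foo/" orders before "foo-sub/".
    static final Comparator<String> PATH_COMPARATOR = (a, b) -> {
        int len = Math.min(a.length(), b.length());
        for (int i = 0; i < len; i++) {
            char ca = a.charAt(i);
            char cb = b.charAt(i);
            if (ca == cb) {
                continue;
            }
            if (ca == '/') {
                return -1;
            }
            if (cb == '/') {
                return 1;
            }
            return Character.compare(ca, cb);
        }
        return Integer.compare(a.length(), b.length());
    };

    public static void main(String[] args) {
        String a = "/proj/foo/bar.c";          // 3 path components
        String b = "/proj/foo-sub/dir/bar.c";  // 4 path components
        // Plain string ordering: '-' (0x2d) < '/' (0x2f), so the longer
        // path b sorts before a even though it should not.
        System.out.println(a.compareTo(b) > 0);                 // true
        // Separator-aware ordering keeps a (under foo/) before b.
        System.out.println(PATH_COMPARATOR.compare(a, b) < 0);  // true
    }
}
```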
There is some trickery involved with fixing this. After making sure the paths coming from the …

What seems to be happening in the case when the test fails is that the path that is later identified as a duplicate goes through …

I think the reason this is not 100% reproducible is that the heavy parallelization done in the 2nd phase of indexing leads to different ordering of how documents are added/deleted in the index, possibly leading to segment reduction or the lack of it. For the cases when the same test passes, there are no deleted documents for the paths that otherwise cause document verification failure in the last reindex.

For the record, here's the test script:

```bash
#!/bin/bash
#
# attempt to reproduce https://github.com/oracle/opengrok/issues/4317
# (more isolation)
#
# based on ~/opengrok-issue-4317.sh sans the randomness.
#
# It tests a particular sequence of changes that **SOMETIMES** leads
# to duplicate documents.
#
set -e
repo_url="https://github.com/oracle/solaris-userland/"
src_root="/var/tmp/src.opengrok-issue-4317"
data_root="/var/tmp/data.opengrok-issue-4317"
# Assumes built OpenGrok.
function run_indexer()
{
    java -cp '/home/vkotal/opengrok-vladak-scratch/distribution/target/dist/*' \
        -Djava.util.logging.config.file=/var/tmp/logging.properties-opengrok-FINEST_Console \
        org.opengrok.indexer.index.Indexer \
        -c /usr/local/bin/ctags \
        --economical \
        -H -S -P -s "$src_root" -d "$data_root" \
        -W "$data_root/config.xml" \
        >/var/tmp/opengrok-issue-4317-$1.log 2>&1
    java -cp '/home/vkotal/opengrok-vladak-scratch/distribution/target/dist/*' \
        org.opengrok.indexer.index.Indexer \
        -H \
        -R "$data_root/config.xml" \
        --checkIndex documents
}
project_root="$src_root/solaris-userland"
if [[ ! -d $src_root ]]; then
    echo "Cloning $repo_url to source root"
    git clone $repo_url "$project_root"
fi
echo "Removing data root"
rm -rf "$data_root"
echo "Removing logs"
rm -f /var/tmp/opengrok-issue*.log
cd "$project_root"
git checkout -q master
ids="3b039b16f 170ebb43b 80cc6ae18 52c217090 a609ca278 69a8daead 07e01a4e6 154009177 794af3182 c073248f7 823f2c28e 4a5e3cb85 341f9beb2 653378bce 4f8fe9ee8"
for id in $ids; do
    echo "### $id"
    git checkout -q $id
    #if [[ $id != "4f8fe9ee8" ]]; then
    #    run_indexer $id
    #fi
    run_indexer $id
done
```
What is causing the latest test failure is uid overlap/clash. For the iteration that leads to the duplicate document, a match for the file is found in … The first part of the uid is the path, the second part is the time stamp that is taken from the last modified time value of the file. I have yet to figure out whether this is caused by my approach (using …
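To illustrate how such a clash can arise (the helper below is a hypothetical sketch of a path-plus-timestamp uid, not OpenGrok's actual uid code): if the second part of the uid is derived from the file's last-modified time, then two different versions of the same path whose mtimes resolve to the same stamp end up with identical uids.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.SimpleDateFormat;
import java.util.Date;

public class UidSketch {

    // Hypothetical uid: project-relative path plus a timestamp derived from
    // the file's last-modified time. If the mtime does not change between
    // two versions of the file, the uid does not change either.
    static String uid(String relativePath, Path file) throws IOException {
        long mtime = Files.getLastModifiedTime(file).toMillis();
        String stamp = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date(mtime));
        return relativePath + '\u0000' + stamp;
    }
}
```

Under such a scheme, touching the files between checkouts (as the updated script further below does) forces a new timestamp and therefore a distinct uid for each indexed version.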
Git does not preserve/restore file time stamps, so this behavior is sort of expected. Therefore, the last failure was indeed caused by sloppy test setup. Detailed analysis is at https://gist.github.com/vladak/351f135c1f9b0ddc979be70ceaa20133. That said, to avoid such corruption in the future (not related to this issue, I believe) I added a uid/date check to the history based processing. Running the updated script (below) with an indexer that performs the check survived through 16 iterations. With the previous version of the script it failed after a handful of iterations.

```bash
#!/bin/bash
#
# attempt to reproduce https://github.com/oracle/opengrok/issues/4317
# (more isolation)
#
# based on ~/opengrok-issue-4317.sh sans the randomness.
#
# v4: touch the files to avoid duplicate uids
#
# It tests a particular sequence of changes that **SOMETIMES** leads
# to duplicate documents.
#
set -e
repo_url="https://github.com/oracle/solaris-userland/"
src_root="/var/tmp/src.opengrok-issue-4317"
data_root="/var/tmp/data.opengrok-issue-4317"
# Assumes built OpenGrok.
function run_indexer()
{
    java -cp '/home/vkotal/opengrok-vladak-scratch/distribution/target/dist/*' \
        -Djava.util.logging.config.file=/var/tmp/logging.properties-opengrok-FINEST_Console \
        org.opengrok.indexer.index.Indexer \
        -c /usr/local/bin/ctags \
        --economical \
        -H -S -P -s "$src_root" -d "$data_root" \
        -W "$data_root/config.xml" \
        >/var/tmp/opengrok-issue-4317-$1.log 2>&1
    java -cp '/home/vkotal/opengrok-vladak-scratch/distribution/target/dist/*' \
        -Djava.util.logging.config.file=/var/tmp/logging.properties-opengrok-FINEST_Console \
        org.opengrok.indexer.index.Indexer \
        -H \
        -R "$data_root/config.xml" \
        --checkIndex documents \
        >/var/tmp/opengrok-issue-4317-$1.check.log 2>&1
}
project_root="$src_root/solaris-userland"
if [[ ! -d $src_root ]]; then
    echo "Cloning $repo_url to source root"
    git clone $repo_url "$project_root"
fi
echo "Removing data root"
rm -rf "$data_root"
echo "Removing logs"
rm -f /var/tmp/opengrok-issue*.log
cd "$project_root"
git checkout -q master
# Establish a common time base line.
find "$src_root/" -type f -exec touch {} \;
ids="3b039b16f 170ebb43b 80cc6ae18 52c217090 a609ca278 69a8daead 07e01a4e6 154009177 794af3182 c073248f7 823f2c28e 4a5e3cb85 341f9beb2 653378bce 4f8fe9ee8"
for id in $ids; do
    echo "### $id"
    git checkout -q $id
    # Git does not preserve/restore file time stamps so simulate a git pull.
    # Ideally this should be done only for the "incoming" files, however it suffices for this use case.
    find "$src_root/" -type f -exec touch {} \;
    #if [[ $id != "4f8fe9ee8" ]]; then
    #    run_indexer $id
    #fi
    run_indexer $id
done
```

I am going to modify the original randomness-based script and will let it run for an extended time period to see if this really sticks.
Hi @vladak
On one of my PRD OpenGrok instances, I have installed 1.9.4 and reindexed everything from scratch.
Users report they still have dups when searching :-(
Same issue on another UAT instance running 1.9.4, where a user found a dup (after first finding a dup on the related PRD instance running 1.7.35, which indexes the same code).
What could I do to try to understand the issue?