check missing source files for documents in document check and fix path ordering in history based reindex #4535

Merged
vladak merged 2 commits into oracle:master from the missing_source_document_check branch
Feb 19, 2024

Conversation

@vladak vladak commented Jan 23, 2024

As noted in #4317 (comment), there is another way the index can be broken: live (i.e. not deleted) documents whose source file is missing. This change augments the document check to report such documents.
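To illustrate what "a live document missing its source file" means, here is a minimal sketch in bash. It is not how the OpenGrok document check is implemented; it only assumes that the paths of live documents have somehow been exported one per line, relative to the source root. Both that input format and the function name `report_missing_sources` are hypothetical.

```shell
#!/bin/bash

# Hypothetical helper, not part of OpenGrok: print every document path
# (read from stdin, one per line, relative to the source root) that has
# no corresponding regular file under the source root given as $1.
function report_missing_sources()
{
	local src_root="$1"
	local path

	while IFS= read -r path; do
		# Skip blank lines in the input.
		[[ -z $path ]] && continue
		# Strip the leading slash before joining with the source root.
		if [[ ! -f "$src_root/${path#/}" ]]; then
			echo "$path"
		fi
	done
}
```

It could be called as `report_missing_sources /path/to/src_root < paths.txt`, where `paths.txt` is the assumed export; any output line names a document whose source file is gone.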

@vladak vladak added the indexer label Jan 23, 2024
@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified (All contributors have signed the Oracle Contributor Agreement.) label Jan 23, 2024
@vladak vladak changed the title from "check missing source files for documents in document check" to "check missing source files for documents in document check and fix path ordering in history based reindex" Jan 29, 2024

vladak commented Jan 29, 2024

The adjusted document check is needed to test the fix for #4317, so I added that fix to this pull request as well.

@vladak vladak marked this pull request as draft January 30, 2024 09:15

vladak commented Feb 9, 2024

I let the following script run overnight on my laptop (Intel Core i5, 8 threads, SSD). It ran 36 times, with around 100 iterations per run. This gives me a high level of confidence that the duplicates are gone for good.

#!/bin/bash

#
# attempt to reproduce https://github.com/oracle/opengrok/issues/4317
#
# based on git-commit-hopping.sh but with added randomness

# set -x
set -e

repo_url="https://github.com/oracle/solaris-userland/"
initial_rev=32c0d9faed7b049872ca9bd78f9bf3e901cff482	# from 2022

src_root="/var/tmp/src.opengrok-issue-4317"
data_root="/var/tmp/data.opengrok-issue-4317"

# Assumes built OpenGrok.
function run_indexer()
{
	echo "Indexing $1"

	java -cp '/home/vkotal/opengrok-vladak-scratch/distribution/target/dist/*' \
	    -Djava.util.logging.config.file=/var/tmp/logging.properties-opengrok-FINEST_Console \
	    org.opengrok.indexer.index.Indexer \
	    -c /usr/local/bin/ctags \
	    --economical \
	    -H -S -P -s "$src_root" -d "$data_root" \
	    -W "$data_root/config.xml" \
	    >/var/tmp/opengrok-issue-4317-$1.log 2>&1

	java -cp '/home/vkotal/opengrok-vladak-scratch/distribution/target/dist/*' \
	    org.opengrok.indexer.index.Indexer \
	    -H \
	    -R "$data_root/config.xml" \
	    --checkIndex documents \
	    >/var/tmp/opengrok-issue-4317-$1.check.log 2>&1
}

function get_next()
{
	ids=( $(git log --oneline --reverse ..origin/master | awk '{ print $1 }') )
	size="${#ids[@]}"
	modulo=16
	if (( size == 0 )); then
		echo ""
		return
	fi
	if (( modulo > size )); then
		modulo=$size
	fi
	n=$(( RANDOM % modulo ))
	if (( n == 0 )); then
		n=1
	fi
	echo ${ids[$n]}
}

project_root="$src_root/solaris-userland"
if [[ ! -d $project_root ]]; then
	echo "Cloning $repo_url to source root"
	git clone $repo_url "$project_root"
fi

echo "Removing data root"
rm -rf "$data_root"

echo "Removing logs"
rm -f /var/tmp/opengrok-issue*.log

cd "$project_root"
echo "Checking out base rev $initial_rev"
git checkout -q "$initial_rev"
# Establish a common time base line.
find "$src_root/" -type f -exec touch {} \;

run_indexer $initial_rev

while [[ 1 ]]; do
	rev=$(get_next)
	if [[ -z $rev ]]; then
		break
	fi
	echo "Checking out $rev"
	git checkout -q $rev
	# Git does not preserve/restore file time stamps so simulate a git pull.
	# Ideally this should be done only for the "incoming" files, however it suffices for this use case.
	find "$src_root/" -type f -exec touch {} \;

	run_indexer $rev
done
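
The comment inside the loop notes that, ideally, only the "incoming" files would have their time stamps updated. A minimal sketch of that refinement follows. It was not part of the overnight runs; the function name `touch_incoming` is hypothetical, and it assumes it is called from the top level of the Git working tree after the new revision has been checked out.

```shell
#!/bin/bash

# Hypothetical refinement, not part of the original script: touch only the
# files that differ between two revisions of the repository in the current
# directory. Files deleted in the new revision are excluded.
function touch_incoming()
{
	local old_rev="$1"
	local new_rev="$2"
	local file

	# -z separates names with NUL so that unusual file names are handled.
	git diff --name-only -z --diff-filter=d "$old_rev" "$new_rev" |
	    while IFS= read -r -d '' file; do
		# Paths are relative to the top of the working tree.
		if [[ -f $file ]]; then
			touch -- "$file"
		fi
	done
}
```

In the loop it would be used by recording `old=$(git rev-parse HEAD)` before `git checkout -q $rev` and calling `touch_incoming "$old" "$rev"` afterwards. Touching every file, as the script does, is simpler and was sufficient to reproduce the problem.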

@vladak vladak force-pushed the missing_source_document_check branch from d2800be to 0ddf6e0 Compare February 13, 2024 10:47
@vladak vladak force-pushed the missing_source_document_check branch from 826e264 to 1cd7cd0 Compare February 13, 2024 16:57
@vladak vladak marked this pull request as ready for review February 14, 2024 08:43
@vladak vladak merged commit 520ce3b into oracle:master Feb 19, 2024
9 checks passed
@vladak vladak deleted the missing_source_document_check branch February 19, 2024 09:25