history based reindex #3951

Merged
merged 88 commits into from
Jun 8, 2022
fc53bae
truly incremental reindex
Nov 22, 2021
1665873
next stage
vladak Apr 29, 2022
32b4cd6
next chunk of changes
vladak Apr 29, 2022
f65b382
fix some nits
vladak Apr 29, 2022
6b9b964
IndexDatabase grew too long
vladak Apr 29, 2022
e18d9cb
fix more style nits
vladak Apr 29, 2022
390dbae
add missing whitespace
vladak Apr 29, 2022
b2fc531
fix testXrefGeneration()
vladak Apr 29, 2022
7e740a8
avoid NPE, fix test to be consistent
vladak Apr 29, 2022
12365fb
add global tunable
vladak Apr 30, 2022
37bbf5c
add notes/comments
vladak May 2, 2022
b8ad142
make it work in the basic mode
vladak May 5, 2022
7a438ac
add per project property
vladak May 5, 2022
cab9624
fix deleted files harvesting
vladak May 9, 2022
af69832
even for truly incremental reindex the whole index has to be traversed
vladak May 9, 2022
660da71
fix FileHistoryCacheTest
vladak May 9, 2022
1a0b523
renamed parts should be part of the changed files in HistoryEntry
vladak May 10, 2022
99642fe
remove debug-only code
vladak May 10, 2022
4328c1b
remove unused import
vladak May 10, 2022
f4192d9
handle trailing terms properly for history based reindex
vladak May 10, 2022
ca2549f
check if repository has history enabled
vladak May 10, 2022
2cf4698
refactor truly incremental check for repository
vladak May 10, 2022
8b9069f
convert visitor pattern (use list of visitors)
vladak May 10, 2022
72882ae
remove trailing space
vladak May 10, 2022
f5a0be4
move the CommitInfo construction
vladak May 10, 2022
2d8cba2
make indexDown*() testable
vladak May 11, 2022
f457197
truly incremental -> history based
vladak May 11, 2022
4fdf134
remove redundant public modifier
vladak May 11, 2022
4cad065
remove the VisibleForTesting annotation
vladak May 11, 2022
29312e6
fix nits
vladak May 11, 2022
dda1cb4
allow per-project override of history based reindex
vladak May 11, 2022
3c5b87c
fix wording
vladak May 11, 2022
e215867
split the getHistory() call for better readability
vladak May 11, 2022
fadf908
acquire the list of files during history cache generation
vladak May 11, 2022
238f2f2
apply Path.of()
vladak May 11, 2022
56a6ea2
remove unused imports
vladak May 11, 2022
8db1c39
check global configuration before creating FileCollector instance
vladak May 11, 2022
271cd18
add TODOs
vladak May 11, 2022
de55ca2
fix RepositoryWithPerPartesHistoryTest
vladak May 11, 2022
9717728
parametrize the cleanup test
vladak May 11, 2022
52a5969
assert file deletion
vladak May 11, 2022
655d978
add per partes test param
vladak May 12, 2022
aa32923
add test for forced reindex, fix it for history based reindex
vladak May 12, 2022
af9e959
add negative test
vladak May 16, 2022
7076963
fix style
vladak May 16, 2022
7587f28
overhaul testHistoryBasedReindexVsProjectWithDiverseRepos()
vladak May 16, 2022
bd29b31
add logs
vladak May 16, 2022
cd7eae8
initial reindex should use file traversal
vladak May 17, 2022
d4f69c6
restore the state
vladak May 17, 2022
23a4c82
split history for better readability
vladak May 17, 2022
b26f36f
add TODO
vladak May 17, 2022
a6a884e
address per project history based tunable
vladak May 17, 2022
eae92f1
add merge changeset
vladak May 18, 2022
343997e
refactor diff handling to a new method
vladak May 18, 2022
1a6dea6
simplify the merge changeset check
vladak May 18, 2022
39c1d57
introduce repository tunable
vladak May 18, 2022
5650cf1
fix tests
vladak May 19, 2022
c9bcef4
remove unused imports
vladak May 19, 2022
984a886
test merge changesets, fix project properties
vladak May 19, 2022
857731e
fix style
vladak May 19, 2022
88bcde9
remove TODO, the test does not fail anymore when run standalone
vladak May 19, 2022
5b1b346
rename the option to match the tunable
vladak May 19, 2022
ac7d53a
cleanup, check history is enabled for repository
vladak May 20, 2022
3372675
add checks for history related tunables
vladak May 20, 2022
c54bd05
use single Statistics instance when reporting file collection
vladak May 20, 2022
68ee168
add project-less based test for history based reindex
vladak May 20, 2022
a96f032
unwrap the line for better readability
vladak May 20, 2022
516b9eb
add check for numCommits argument value
vladak May 20, 2022
794d0b7
convert Mercurial to RepositoryWithHistoryTraversal
vladak May 21, 2022
1799a97
add Override annotation
vladak May 21, 2022
001ca5d
limit the visibility
vladak May 21, 2022
5559579
remove unused imports
vladak May 21, 2022
2e99b75
fix style
vladak May 21, 2022
ea354d5
fix style
vladak May 21, 2022
202e6da
do not consider history vs. history based reindex as configuration pr…
vladak May 23, 2022
61dce4c
move configuration check to Configuration class
vladak May 23, 2022
f542b68
reuse already existing copyDirectory()
vladak May 23, 2022
f682e5b
bump year
vladak May 23, 2022
6824825
copy files preserving attributes
vladak May 23, 2022
38dee2a
re-clone the Git repository in setup
vladak May 23, 2022
9cfb33d
make sure the move does not fail on Windows
vladak May 23, 2022
a4a222e
add asserts for Git operations
vladak May 23, 2022
08db34c
close the Git object
vladak May 24, 2022
6dc8614
fix the test
vladak May 24, 2022
7eea590
remove obsolete comment
vladak May 24, 2022
6f227d5
do not use main.o for Git tests
vladak May 24, 2022
b0a8246
fix Windows path
vladak May 24, 2022
855e7d6
use native path separator
vladak May 24, 2022
4 changes: 2 additions & 2 deletions dev/checkstyle/suppressions.xml
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ information: Portions Copyright [yyyy] [name of copyright owner]

CDDL HEADER END

Copyright (c) 2018, Oracle and/or its affiliates. All rights reserved.
Copyright (c) 2018, 2022, Oracle and/or its affiliates. All rights reserved.
Portions Copyright (c) 2018-2020, Chris Fraire <[email protected]>.

-->
Expand All @@ -43,7 +43,7 @@ Portions Copyright (c) 2018-2020, Chris Fraire <[email protected]>.
|Context\.java|HistoryContext\.java|Suggester\.java|
|ProjectHelperTestBase\.java|SearchHelper\.java" />

<suppress checks="FileLength" files="RuntimeEnvironment\.java" />
<suppress checks="FileLength" files="RuntimeEnvironment\.java|IndexDatabase\.java" />

<suppress checks="MethodLength" files="Indexer\.java|IndexDatabase\.java|AuthorizationFrameworkTest\.java" />

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -300,6 +300,8 @@ public final class Configuration {
private int connectTimeout = -1; // connect timeout in seconds
private int apiTimeout = -1; // API timeout in seconds

private boolean historyBasedReindex;

/*
* types of handling history for remote SCM repositories:
* ON - index history and display it in webapp
Expand Down Expand Up @@ -576,6 +578,7 @@ public Configuration() {
setTagsEnabled(false);
//setUserPage("http://www.myserver.org/viewProfile.jspa?username=");
// Set to empty string so we can append it to the URL unconditionally later.
setHistoryBasedReindex(true);
setUserPageSuffix("");
setWebappLAF("default");
// webappCtags is default(boolean)
Expand Down Expand Up @@ -1412,6 +1415,14 @@ public void setApiTimeout(int apiTimeout) {
this.apiTimeout = apiTimeout;
}

public boolean isHistoryBasedReindex() {
return historyBasedReindex;
}

public void setHistoryBasedReindex(boolean flag) {
historyBasedReindex = flag;
}

/**
* Write the current configuration to a file.
*
Expand Down Expand Up @@ -1524,4 +1535,45 @@ private static Configuration decodeObject(InputStream in) throws IOException {

return conf;
}

public static class ConfigurationException extends Exception {
static final long serialVersionUID = -1;

public ConfigurationException(String message) {
super(message);
}
}

/**
* Check if configuration is populated and self-consistent.
* @throws ConfigurationException on error
*/
public void checkConfiguration() throws ConfigurationException {

if (getSourceRoot() == null) {
throw new ConfigurationException("Source root is not specified.");
}

if (getDataRoot() == null) {
throw new ConfigurationException("Data root is not specified.");
}

if (!new File(getSourceRoot()).canRead()) {
throw new ConfigurationException("Source root directory '" + getSourceRoot() + "' must be readable.");
}

if (!new File(getDataRoot()).canWrite()) {
throw new ConfigurationException("Data root directory '" + getDataRoot() + "' must be writable.");
}

if (!isHistoryEnabled() && isHistoryBasedReindex()) {
LOGGER.log(Level.INFO, "History based reindex is on, however history is off. " +
"History has to be enabled for history based reindex.");
}

if (!isHistoryCache() && isHistoryBasedReindex()) {
LOGGER.log(Level.INFO, "History based reindex is on, however history cache is off. " +
"History cache has to be enabled for history based reindex.");
}
}
}
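Note that the two history-related checks in `checkConfiguration()` only log at INFO level, whereas the source root and data root checks throw `ConfigurationException`. The following is a minimal sketch of which tunable combinations produce a message. `ReindexTunablesSketch` and `advisories()` are hypothetical names used for illustration only and are not part of OpenGrok; the messages are shortened from the ones in the diff.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the three boolean tunables involved; not the OpenGrok Configuration class.
class ReindexTunablesSketch {
    final boolean historyEnabled;
    final boolean historyCache;
    final boolean historyBasedReindex;

    ReindexTunablesSketch(boolean historyEnabled, boolean historyCache, boolean historyBasedReindex) {
        this.historyEnabled = historyEnabled;
        this.historyCache = historyCache;
        this.historyBasedReindex = historyBasedReindex;
    }

    /** Returns the advisory messages instead of logging them, using the same two conditions as the diff. */
    List<String> advisories() {
        List<String> messages = new ArrayList<>();
        if (!historyEnabled && historyBasedReindex) {
            messages.add("History based reindex is on, however history is off.");
        }
        if (!historyCache && historyBasedReindex) {
            messages.add("History based reindex is on, however history cache is off.");
        }
        return messages;
    }

    public static void main(String[] args) {
        // Both prerequisites missing: two advisories, but no exception.
        System.out.println(new ReindexTunablesSketch(false, false, true).advisories());
    }
}
```

With history based reindex switched off, neither condition can hold, so no message is produced regardless of the other two tunables.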
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@
*/

/*
* Copyright (c) 2006, 2021, Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2006, 2022, Oracle and/or its affiliates. All rights reserved.
* Portions Copyright (c) 2018, Chris Fraire <[email protected]>.
*/
package org.opengrok.indexer.configuration;
Expand All @@ -34,6 +34,7 @@
import java.util.logging.Logger;
import java.util.regex.PatternSyntaxException;

import org.jetbrains.annotations.VisibleForTesting;
import org.opengrok.indexer.logger.LoggerFactory;
import org.opengrok.indexer.util.ClassUtil;
import org.opengrok.indexer.util.ForbiddenSymlinkException;
Expand Down Expand Up @@ -99,6 +100,11 @@ public class Project implements Comparable<Project>, Nameable, Serializable {
*/
private boolean indexed = false;

/**
* This flag sets per-project reindex based on traversing SCM history.
*/
private Boolean historyBasedReindex = null;

/**
* Set of groups which match this project.
*/
Expand Down Expand Up @@ -289,6 +295,28 @@ public void setMergeCommitsEnabled(boolean flag) {
this.mergeCommitsEnabled = flag;
}

/**
* @return true if this project uses history based reindex.
*/
public boolean isHistoryBasedReindex() {
return historyBasedReindex != null && historyBasedReindex;
}

/**
* @param flag true if project should use history based reindex, false otherwise.
*/
public void setHistoryBasedReindex(boolean flag) {
this.historyBasedReindex = flag;
}

@VisibleForTesting
public void clearProperties() {
historyBasedReindex = null;
mergeCommitsEnabled = null;
historyEnabled = null;
handleRenamedFiles = null;
}

/**
* Return groups where this project belongs.
*
Expand Down Expand Up @@ -436,6 +464,10 @@ public final void completeWithDefaults() {
if (reviewPattern == null) {
setReviewPattern(env.getReviewPattern());
}

if (historyBasedReindex == null) {
setHistoryBasedReindex(env.isHistoryBasedReindex());
}
}

/**
Expand Down Expand Up @@ -476,8 +508,7 @@ public static Project getProject(String path) {
* Get the project for a specific file.
*
* @param file the file to lookup
* @return the project that this file belongs to (or null if the file
* doesn't belong to a project)
* @return the project that this file belongs to (or {@code null} if the file doesn't belong to a project)
*/
public static Project getProject(File file) {
Project ret = null;
Expand Down
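The `Project` changes above rely on a nullable `Boolean`: `null` means the project has no setting of its own, and `completeWithDefaults()` then copies the global value. A minimal sketch of that pattern follows. `ProjectOverrideSketch` is a hypothetical class, and the global default is passed in as a parameter here rather than read from `RuntimeEnvironment`.

```java
// Hypothetical reduction of the per-project override logic; not the OpenGrok Project class.
class ProjectOverrideSketch {
    // null means "not set for this project", as in the diff.
    private Boolean historyBasedReindex = null;

    boolean isHistoryBasedReindex() {
        return historyBasedReindex != null && historyBasedReindex;
    }

    void setHistoryBasedReindex(boolean flag) {
        this.historyBasedReindex = flag;
    }

    /** Applies the global default only if the project has no explicit value. */
    void completeWithDefaults(boolean globalDefault) {
        if (historyBasedReindex == null) {
            setHistoryBasedReindex(globalDefault);
        }
    }

    public static void main(String[] args) {
        ProjectOverrideSketch inherits = new ProjectOverrideSketch();
        inherits.completeWithDefaults(true);

        ProjectOverrideSketch optsOut = new ProjectOverrideSketch();
        optsOut.setHistoryBasedReindex(false);
        optsOut.completeWithDefaults(true);

        System.out.println(inherits.isHistoryBasedReindex());
        System.out.println(optsOut.isHistoryBasedReindex());
    }
}
```

One consequence of this design is that the getter reports `false` for an unset project until the defaults have been applied.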
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,7 @@
import java.util.Collection;
import java.util.Collections;
import java.util.Date;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
Expand All @@ -62,8 +63,10 @@
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.NamedThreadFactory;
import org.jetbrains.annotations.VisibleForTesting;
import org.opengrok.indexer.authorization.AuthorizationFramework;
import org.opengrok.indexer.authorization.AuthorizationStack;
import org.opengrok.indexer.history.FileCollector;
import org.opengrok.indexer.history.HistoryGuru;
import org.opengrok.indexer.history.RepositoryInfo;
import org.opengrok.indexer.index.IndexDatabase;
Expand Down Expand Up @@ -137,6 +140,12 @@ public List<String> getSubFiles() {

private final List<String> subFiles = new ArrayList<>();

/**
* Maps project name to FileCollector object. This is used to pass the list of files acquired when
* generating history cache in the first phase of indexing to the second phase of indexing.
*/
private final Map<String, FileCollector> fileCollectorMap = new HashMap<>();

/**
* Creates a new instance of RuntimeEnvironment. Private to ensure a
* singleton anti-pattern.
Expand Down Expand Up @@ -465,7 +474,7 @@ public List<Project> getProjectList() {
/**
* Get project map.
*
* @return a Map with all of the projects
* @return a Map with all the projects
*/
public Map<String, Project> getProjects() {
return syncReadConfiguration(Configuration::getProjects);
Expand Down Expand Up @@ -1417,6 +1426,27 @@ public void setConnectTimeout(int connectTimeout) {
syncWriteConfiguration(connectTimeout, Configuration::setConnectTimeout);
}

public boolean isHistoryBasedReindex() {
return syncReadConfiguration(Configuration::isHistoryBasedReindex);
}

public void setHistoryBasedReindex(boolean flag) {
syncWriteConfiguration(flag, Configuration::setHistoryBasedReindex);
}

public FileCollector getFileCollector(String name) {
return fileCollectorMap.get(name);
}

public void setFileCollector(String name, FileCollector fileCollector) {
fileCollectorMap.put(name, fileCollector);
}

@VisibleForTesting
public void clearFileCollector() {
fileCollectorMap.clear();
}

/**
* Read a configuration file and set it as the current configuration.
*
Expand Down Expand Up @@ -1491,7 +1521,8 @@ public void writeConfiguration(String host) throws IOException, InterruptedExcep
* Project with some repository information is considered as a repository
* otherwise it is just a simple project.
*/
private void generateProjectRepositoriesMap() throws IOException {
@VisibleForTesting
public void generateProjectRepositoriesMap() throws IOException {
repository_map.clear();
for (RepositoryInfo r : getRepositories()) {
Project proj;
Expand Down
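The `fileCollectorMap` comment describes a hand-off: the first indexing phase (history cache generation) stores the collected files per project, and the second phase looks them up by project name. A minimal sketch of that hand-off is below. `CollectorHandoffSketch` and `planFor()` are hypothetical, and the fallback to directory traversal when nothing was collected is an assumption made for illustration, not something this part of the diff shows.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical reduction of the project name to collected files map; not RuntimeEnvironment.
class CollectorHandoffSketch {
    private final Map<String, SortedSet<String>> filesByProject = new HashMap<>();

    /** Phase one: remember the files gathered while generating the history cache. */
    void setFiles(String projectName, SortedSet<String> files) {
        filesByProject.put(projectName, files);
    }

    /** Phase two: returns null if nothing was collected for the project. */
    SortedSet<String> getFiles(String projectName) {
        return filesByProject.get(projectName);
    }

    /** Assumed policy: without collected files, fall back to walking the directory tree. */
    String planFor(String projectName) {
        SortedSet<String> files = getFiles(projectName);
        return files == null ? "directory traversal" : "history based: " + files;
    }

    public static void main(String[] args) {
        CollectorHandoffSketch env = new CollectorHandoffSketch();
        env.setFiles("projA", new TreeSet<>(Arrays.asList("/projA/main.c")));
        System.out.println(env.planFor("projA"));
        System.out.println(env.planFor("projB"));
    }
}
```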
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* See LICENSE.txt included in this distribution for the specific
* language governing permissions and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at LICENSE.txt.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/

/*
* Copyright (c) 2022, Oracle and/or its affiliates. All rights reserved.
*/
package org.opengrok.indexer.history;

import java.util.function.Consumer;

public abstract class ChangesetVisitor implements Consumer<RepositoryWithHistoryTraversal.ChangesetInfo> {
boolean consumeMergeChangesets;

protected ChangesetVisitor(boolean consumeMergeChangesets) {
this.consumeMergeChangesets = consumeMergeChangesets;
}
}
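`ChangesetVisitor` only carries the `consumeMergeChangesets` flag; how the traversal uses it is not visible in this part of the diff. The sketch below shows one plausible reading, in which a history traversal hands each changeset to a list of visitors and skips merge changesets for visitors that did not ask for them. Every name in it (`VisitorDispatchSketch`, `Changeset`, `IdRecorder`, `traverse`) is hypothetical, and the dispatch rule is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

class VisitorDispatchSketch {
    /** Hypothetical, simplified changeset: an identifier plus a merge flag. */
    static final class Changeset {
        final String id;
        final boolean merge;

        Changeset(String id, boolean merge) {
            this.id = id;
            this.merge = merge;
        }
    }

    /** Mirrors the shape of ChangesetVisitor: a Consumer that carries a merge-changeset flag. */
    abstract static class Visitor implements Consumer<Changeset> {
        final boolean consumeMergeChangesets;

        Visitor(boolean consumeMergeChangesets) {
            this.consumeMergeChangesets = consumeMergeChangesets;
        }
    }

    /** Illustrative visitor that records the identifiers it was handed. */
    static final class IdRecorder extends Visitor {
        final List<String> seen = new ArrayList<>();

        IdRecorder(boolean consumeMergeChangesets) {
            super(consumeMergeChangesets);
        }

        @Override
        public void accept(Changeset changeset) {
            seen.add(changeset.id);
        }
    }

    /** Assumed dispatch rule: a merge changeset reaches only the visitors that asked for it. */
    static void traverse(List<Changeset> history, List<? extends Visitor> visitors) {
        for (Changeset changeset : history) {
            for (Visitor visitor : visitors) {
                if (changeset.merge && !visitor.consumeMergeChangesets) {
                    continue;
                }
                visitor.accept(changeset);
            }
        }
    }

    public static void main(String[] args) {
        List<Changeset> history = Arrays.asList(
                new Changeset("c1", false), new Changeset("m1", true), new Changeset("c2", false));
        IdRecorder withMerges = new IdRecorder(true);
        IdRecorder withoutMerges = new IdRecorder(false);
        traverse(history, Arrays.asList(withMerges, withoutMerges));
        System.out.println(withMerges.seen);
        System.out.println(withoutMerges.seen);
    }
}
```

Keeping the flag on the visitor rather than on the traversal lets a single pass over the history serve visitors with different needs.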
Original file line number Diff line number Diff line change
@@ -0,0 +1,66 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* See LICENSE.txt included in this distribution for the specific
* language governing permissions and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at LICENSE.txt.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/

/*
* Copyright (c) 2022, Oracle and/or its affiliates. All rights reserved.
*/
package org.opengrok.indexer.history;

import java.util.Collection;
import java.util.SortedSet;
import java.util.TreeSet;

/**
* This class is meant to collect files that were touched in some way by SCM update.
* The visitor argument contains the files separated based on the type of modification performed,
* however the consumer of this class is not interested in this classification.
* This is because when incrementally indexing a bunch of changesets,
* in one changeset a file may be deleted, only to be re-added in the next changeset etc.
*/
public class FileCollector extends ChangesetVisitor {
private final SortedSet<String> files;

/**
* The files are kept in a {@code TreeSet} with natural {@code String} ordering, which is assumed to sort
* in the same way as {@code org.opengrok.indexer.index.IndexDatabase#FILENAME_COMPARATOR}.
*/
public FileCollector(boolean consumeMergeChangesets) {
super(consumeMergeChangesets);
files = new TreeSet<>();
}

public void accept(RepositoryWithHistoryTraversal.ChangesetInfo changesetInfo) {
if (changesetInfo.renamedFiles != null) {
files.addAll(changesetInfo.renamedFiles);
}
if (changesetInfo.files != null) {
files.addAll(changesetInfo.files);
}
if (changesetInfo.deletedFiles != null) {
files.addAll(changesetInfo.deletedFiles);
}
}

public SortedSet<String> getFiles() {
return files;
}

void addFiles(Collection<String> files) {
this.files.addAll(files);
}
}
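The class comment explains why `FileCollector` ignores the kind of modification: across several changesets the same path can be deleted and then re-added, so only the union of touched paths matters. A minimal sketch of that behaviour is below. `FileCollectorSketch` and its nested `Changeset` are hypothetical stand-ins for `FileCollector` and `RepositoryWithHistoryTraversal.ChangesetInfo`, and the paths are made up.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class FileCollectorSketch {
    /** Hypothetical stand-in for ChangesetInfo; any of the three lists may be null. */
    static final class Changeset {
        final List<String> files;
        final List<String> renamedFiles;
        final List<String> deletedFiles;

        Changeset(List<String> files, List<String> renamedFiles, List<String> deletedFiles) {
            this.files = files;
            this.renamedFiles = renamedFiles;
            this.deletedFiles = deletedFiles;
        }
    }

    // Sorted, duplicate-free union of every path seen so far.
    private final SortedSet<String> files = new TreeSet<>();

    /** Adds all paths from the changeset regardless of how they were modified. */
    void accept(Changeset changeset) {
        if (changeset.renamedFiles != null) {
            files.addAll(changeset.renamedFiles);
        }
        if (changeset.files != null) {
            files.addAll(changeset.files);
        }
        if (changeset.deletedFiles != null) {
            files.addAll(changeset.deletedFiles);
        }
    }

    SortedSet<String> getFiles() {
        return files;
    }

    public static void main(String[] args) {
        FileCollectorSketch collector = new FileCollectorSketch();
        // First changeset deletes b.c, second one adds it back along with a.c.
        collector.accept(new Changeset(Collections.emptyList(), null, Arrays.asList("/proj/b.c")));
        collector.accept(new Changeset(Arrays.asList("/proj/b.c", "/proj/a.c"), null, null));
        System.out.println(collector.getFiles());
    }
}
```

The consumer of the set then has to decide per path whether the file still exists in the working tree, since the set itself does not say.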