Incremental commit-graph files #156

derrickstolee · 2019-06-21T14:26:11Z

This PR matches the commits that Junio is tracking from this patch series.
The only commit required that is extra is a merge between vfs-2.22.0
and ds/commit-graph-incremental because we took ds/close-object-store
a bit earlier (we needed it). Also, there was a small conflict in packfile.[c|h]
related to ds/multi-pack-index-expire.

Here is a copy of the cover letter for that series:

The commit-graph is a valuable performance feature for repos with large
commit histories, but suffers from the same problem as git repack: it
rewrites the entire file every time. This can be slow when there are
millions of commits, especially after we stopped reading from the
commit-graph file during a write in 43d3561 (commit-graph write: don't die
if the existing graph is corrupt).

Instead, create a "chain" of commit-graphs in the
.git/objects/info/commit-graphs folder with name graph-{hash}.graph. The
list of hashes is given by the commit-graph-chain file, and also in a "base
graph chunk" in the commit-graph format. As we read a chain, we can verify
that the hashes match the trailing hash of each commit-graph we read along
the way and each hash below a level is expected by that graph file.

When writing, we don't always want to add a new level to the stack. This
would eventually result in performance degradation, especially when
searching for a commit (before we know its graph position). We decide to
merge levels of the stack when the new commits we will write is less than
half of the commits in the level above. This can be tweaked by the
--size-multiple and --max-commits options.

The performance is necessarily amortized across multiple writes, so I tested
by writing commit-graphs from the (non-rc) tags in the Linux repo. My test
included 72 tags, and wrote everything reachable from the tag using
--stdin-commits. Here are the overall perf numbers:

write --stdin-commits:         8m 12s
write --stdin-commits --split:    28s
write --split && verify --shallow: 60s

The parse_commit_buffer() method takes a repository pointer, so it should not refer to the_repository anymore. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The write_commit_graph() method uses die() to report failure and exit when confronted with an unexpected condition. This use of die() in a library function is incorrect and is now replaced by error() statements and an int return type. Return zero on success and a negative value on failure. Now that we use 'goto cleanup' to jump to the terminal condition on an error, we have new paths that could lead to uninitialized values. New initializers are added to correct for this. The builtins 'commit-graph', 'gc', and 'commit' call these methods, so update them to check the return value. Test that 'git commit-graph write' returns a proper error code when hitting a failure condition in write_commit_graph(). Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The write_commit_graph() and write_commit_graph_reachable() methods currently take two boolean parameters: 'append' and 'report_progress'. As we update these methods, adding more parameters this way becomes cluttered and hard to maintain. Collapse these parameters into a 'flags' parameter, and adjust the callers to provide flags as necessary. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The commit-graph feature began with a long list of planned benefits, most of which are now complete. The future work section has only a few items left. As for making more algorithms aware of generation numbers, some are only waiting for generation number v2 to ensure the performance matches the existing behavior using commit date. It is unlikely that we will ever send a commit-graph file as part of the protocol, since we would need to verify the data, and that is expensive. If we want to start trusting remote content, then that item can be investigated again. While there is more work to be done on the feature, having a section of the docs devoted to a TODO list is wasteful and hard to keep up-to-date. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The write_commit_graph() method is too large and complex. To simplify it, we should extract several helper functions. However, we will risk repeating a lot of declarations related to progress incidators and object id or commit lists. Create a new write_commit_graph_context struct that contains the core data structures used in this process. Replace the other local variables with the values inside the context object. Following this change, we will start to lift code segments wholesale out of the write_commit_graph() method and into helper functions. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The write_commit_graph() method is too complex, so we are extracting helper functions one by one. This extracts fill_oids_from_packs() that reads the given pack-file list and fills the oid list in the context. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The write_commit_graph() method is too complex, so we are extracting helper functions one by one. Extract fill_oids_from_commit_hex() that reads the given commit id list and fille the oid list in the context. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The write_commit_graph() method is too complex, so we are extracting helper functions one by one. Extract fill_oids_from_all_packs() that reads all pack-files for commits and fills the oid list in the context. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The write_commit_graph() method is too complex, so we are extracting helper functions one by one. Extract count_distinct_commits(), which sorts the oids list, then iterates through to find duplicates. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The write_commit_graph() method is too complex, so we are extracting helper functions one by one. Extract copy_oids_to_commits(), which fills the commits list with the distinct commits from the oids list. During this loop, it also counts the number of "extra" edges from octopus merges. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The write_commit_graph() method is too complex, so we are extracting helper functions one by one. Extract write_commit_graph_file() that takes all of the information in the context struct and writes the data to a commit-graph file. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The close_commit_graph() method took a repository struct, but then only uses the raw_object_store within. Change the function prototype to make the method more flexible. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The close_all_packs() method is used to close all read handles to pack-files and the multi-pack-index before running 'git gc --auto'. This is particularly important on the Windows platform, where read handles block any writes to those files. Replacing one of these files with a rename() will fail in this situation. The commit-graph also performs a rename, so is susceptable to this problem. We are careful to close the commit-graph before writing, but that doesn't work when a 'git fetch' (or similar) process runs 'git gc --auto' which may write a commit-graph. Here, close the commit-graph as part of close_all_packs(). Reported-by: Johannes Schindelin <[email protected]> Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The close_all_packs() method is now responsible for more than just pack-files. It also closes the commit-graph and the multi-pack-index. Rename the function to be more descriptive of its larger role. The name also fits because the input parameter is a raw_object_store. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Add a basic description of commit-graph chains. More details about the feature will be added as we add functionality. This introduction gives a high-level overview to the goals of the feature and the basic layout of commit-graph chains. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

To prepare for a chain of commit-graph files, augment the commit_graph struct to point to a base commit_graph. As we load commits from the graph, we may actually want to read from a base file according to the graph position. The "graph position" of a commit is given by concatenating the lexicographic commit orders from each of the commit-graph files in the chain. This means that we must distinguish two values: * lexicographic index : the position within the lexicographic order in a single commit-graph file. * graph position: the position within the concatenated order of multiple commit-graph files Given the lexicographic index of a commit in a graph, we can compute the graph position by adding the number of commits in the lower-level graphs. To find the lexicographic index of a commit, we subtract the number of commits in lower-level graphs. While here, change insert_parent_or_die() to take a uint32_t position, as that is the type used by its only caller and that makes more sense with the limits in the commit-graph format. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The helper function commit_compare() actually compares object_id structs, not commits. A future change to commit-graph.c will need to sort commit structs, so rename this function in advance. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Prepare the logic for reading a chain of commit-graphs. First, look for a file at $OBJDIR/info/commit-graph. If it exists, then use that file and stop. Next, look for the chain file at $OBJDIR/info/commit-graphs/commit-graph-chain. If this file exists, then load the hash values as line-separated values in that file and load $OBJDIR/info/commit-graphs/graph-{hash[i]}.graph for each hash[i] in that file. The file is given in order, so the first hash corresponds to the "base" file and the final hash corresponds to the "tip" file. This implementation assumes that all of the graph-{hash}.graph files are in the same object directory as the commit-graph-chain file. This will be updated in a future change. This change is purposefully simple so we can isolate the different concerns. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

To quickly verify a commit-graph chain is valid on load, we will read from the new "Base Graphs Chunk" of each file in the chain. This will prevent accidentally loading incorrect data from manually editing the commit-graph-chain file or renaming graph-{hash}.graph files. The commit_graph struct already had an object_id struct "oid", but it was never initialized or used. Add a line to read the hash from the end of the commit-graph file and into the oid member. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The number of chunks in a commit-graph file can change depending on whether we need the Extra Edges Chunk. We are going to add more optional chunks, and it will be helpful to rearrange this logic around the chunk count before doing so. Specifically, we need to finalize the number of chunks before writing the commit-graph header. Further, we also need to fill out the chunk lookup table dynamically and using "num_chunks" as we add optional chunks is useful for adding optional chunks in the future. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Extend write_commit_graph() to write a commit-graph chain when given the COMMIT_GRAPH_SPLIT flag. This implementation is purposefully simplistic in how it creates a new chain. The commits not already in the chain are added to a new tip commit-graph file. Much of the logic around writing a graph-{hash}.graph file and updating the commit-graph-chain file is the same as the commit-graph file case. However, there are several places where we need to do some extra logic in the split case. Track the list of graph filenames before and after the planned write. This will be more important when we start merging graph files, but it also allows us to upgrade our commit-graph file to the appropriate graph-{hash}.graph file when we upgrade to a chain of commit-graphs. Note that we use the eighth byte of the commit-graph header to store the number of base graph files. This determines the length of the base graphs chunk. A subtle change of behavior with the new logic is that we do not write a commit-graph if we our commit list is empty. This extends to the typical case, which is reflected in t5318-commit-graph.sh. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Add a new "--split" option to the 'git commit-graph write' subcommand. This option allows the optional behavior of writing a commit-graph chain. The current behavior will add a tip commit-graph containing any commits that are not in the existing commit-graph or commit-graph chain. Later changes will allow merging the chain and expiring out-dated files. Add a new test script (t5324-split-commit-graph.sh) that demonstrates this behavior. Helped-by: Johannes Schindelin <[email protected]> Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

When searching for a commit in a commit-graph chain of G graphs with N commits, the search takes O(G log N) time. If we always add a new tip graph with every write, the linear G term will start to dominate and slow the lookup process. To keep lookups fast, but also keep most incremental writes fast, create a strategy for merging levels of the commit-graph chain. The strategy is detailed in the commit-graph design document, but is summarized by these two conditions: 1. If the number of commits we are adding is more than half the number of commits in the graph below, then merge with that graph. 2. If we are writing more than 64,000 commits into a single graph, then merge with all lower graphs. The numeric values in the conditions above are currently constant, but can become config options in a future update. As we merge levels of the commit-graph chain, check that the commits still exist in the repository. A garbage-collection operation may have removed those commits from the object store and we do not want to persist them in the commit-graph chain. This is a non-issue if the 'git gc' process wrote a new, single-level commit-graph file. After we merge levels, the old graph-{hash}.graph files are no longer referenced by the commit-graph-chain file. We will expire these files in a future change. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

In an environment like a fork network, it is helpful to have a commit-graph chain that spans both the base repo and the fork repo. The fork is usually a small set of data on top of the large repo, but sometimes the fork is much larger. For example, git-for-windows/git has almost double the number of commits as git/git because it rebases its commits on every major version update. To allow cross-alternate commit-graph chains, we need a few pieces: 1. When looking for a graph-{hash}.graph file, check all alternates. 2. When merging commit-graph chains, do not merge across alternates. 3. When writing a new commit-graph chain based on a commit-graph file in another object directory, do not allow success if the base file has of the name "commit-graph" instead of "commit-graphs/graph-{hash}.graph". Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

As we merge commit-graph files in a commit-graph chain, we should clean up the files that are no longer used. This change introduces an 'expiry_window' value to the context, which is always zero (for now). We then check the modified time of each graph-{hash}.graph file in the $OBJDIR/info/commit-graphs folder and unlink the files that are older than the expiry_window. Since this is always zero, this immediately clears all unused graph files. We will update the value to match a config setting in a future change. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The split commit-graph feature is now fully implemented, but needs some more run-time configurability. Allow direct callers to 'git commit-graph write --split' to specify the values used in the merge strategy and the expire time. Update the documentation to specify these values. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

If we wrote a commit-graph chain, we only modified the tip file in the chain. It is valuable to verify what we wrote, but not waste time checking files we did not write. Add a '--shallow' option to the 'git commit-graph verify' subcommand and check that it does not read the base graph in a two-file chain. Making the verify subcommand read from a chain of commit-graphs takes some rearranging of the builtin code. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

If we write a commit-graph file without the split option, then we write to $OBJDIR/info/commit-graph and start to ignore the chains in $OBJDIR/info/commit-graphs/. Unlink the commit-graph-chain file and expire the graph-{hash}.graph files in $OBJDIR/info/commit-graphs/ during every write. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Octopus merges require an extra chunk of data in the commit-graph file format. Create a test that ensures the new --split option continues to work with an octopus merge. Specifically, ensure that the octopus merge has parents across layers to truly check that our graph position logic holds up correctly. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

We allow sharing commit-graph files across alternates. When we are writing a split commit-graph, we allow adding tip graph files that are not in the alternate, but include commits from our local repo. However, if our alternate is not using the split commit-graph format, its file is at .git/objects/info/commit-graph and we are trying to write files in .git/objects/info/commit-graphs/graph-{hash}.graph. We already have logic to ensure we do not merge across alternate boundaries, but we also cannot have a commit-graph chain to our alternate if uses the old filename structure. Create a test that verifies we create a new split commit-graph with only one level and we do not modify the existing commit-graph in the alternate. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

When writing commit-graph files, we append path data to an object directory, which may be specified by the user via the '--object-dir' option. If the user supplies a trailing slash, or some other alternative path format, the resulting path may be usable for writing to the correct location. However, when expiring graph files from the <obj-dir>/info/commit-graphs directory during a write, we need to compare paths with exact string matches. Normalize the commit-graph filenames to avoid ambiguity. This creates extra allocations, but this is a constant multiple of the number of commit-graph files, which should be a number in the single digits. Further normalize the object directory in the context. Due to a comparison between g->obj_dir and ctx->obj_dir in split_graph_merge_strategy(), a trailing slash would prevent any merging of layers within the same object directory. The check is there to ensure we do not merge across alternates. Update the tests to include a case with this trailing slash problem. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The 'git commit-graph verify' subcommand loads a commit-graph from a given object directory instead of using the standard method prepare_commit_graph(). During development of load_commit_graph_chain(), a version did not include prepare_alt_odb() as it was previously run by prepare_commit_graph() in most cases. Add a test that prevents that mistake from happening again. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Signed-off-by: Derrick Stolee <[email protected]>

jeffhostetler

I'm assuming that there was lots of upstream discussion, so I'll just skim this now.
I'm going to approve it before my battery dies and unblock you.

derrickstolee · 2019-07-01T13:14:24Z

@jeffhostetler This looks to be stable on-list. The only complaint right now is a test style nit. We should probably merge this in now and let it simmer in CI while I'm on vacation (but I'll be back a couple weeks before the release).

The incremental commit-graph format is designed to speed up commit-graph writes, and especially to allow fast verification of the new data. At least, most of the time. Uses the new feature in microsoft/git#156. By only writing the new data to a small "tip" commit-graph file, we can write much less data and only verify the new tip file. Current settings include `--size-multiple=4`: This guarantees each level of the chain is half the size of the level below. This means the chain length will be limited by log_4(N) where N is the number of commits. This means ~6 levels is the maximum theoretical chain length with 4,000,000 commits. We are not currently using the `--max-commits` setting. With these settings, the `git commit-graph verify` command will be much faster on most new tip files. For example, a tip graph with 3,300 commits uses 183 KB of data and took 0.7 seconds to verify.

derrickstolee added 30 commits May 13, 2019 11:52

commit-graph: fix the_repository reference

c794405

The parse_commit_buffer() method takes a repository pointer, so it should not refer to the_repository anymore. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

derrickstolee added 3 commits June 19, 2019 20:46

Merge 'ds/commit-graph-incremental' into vfs-2.22.0

a4618d5

Signed-off-by: Derrick Stolee <[email protected]>

derrickstolee added the enhancement label Jun 21, 2019

derrickstolee requested a review from jeffhostetler June 21, 2019 14:26

derrickstolee mentioned this pull request Jun 21, 2019

Incremental commit graph microsoft/VFSForGit#1283

Merged

jeffhostetler approved these changes Jun 21, 2019

View reviewed changes

derrickstolee merged commit ea22768 into microsoft:vfs-2.22.0 Jul 1, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incremental commit-graph files #156

Incremental commit-graph files #156

derrickstolee commented Jun 21, 2019 •

edited

Loading

jeffhostetler left a comment

derrickstolee commented Jul 1, 2019

Incremental commit-graph files #156

Incremental commit-graph files #156

Conversation

derrickstolee commented Jun 21, 2019 • edited Loading

jeffhostetler left a comment

Choose a reason for hiding this comment

derrickstolee commented Jul 1, 2019

derrickstolee commented Jun 21, 2019 •

edited

Loading