Commit
sparse-index: improve lstat caching of sparse paths
The clear_skip_worktree_from_present_files() method was first introduced in af6a518 (repo_read_index: clear SKIP_WORKTREE bit from files present in worktree, 2022-01-14) to allow better interaction with the working directory in the presence of paths outside of the sparse-checkout cone. The initial implementation would lstat() every single sparse tree to see if it existed, and if one did, then the sparse index would expand and every sparse file would be checked.

Since these lstat() calls were very expensive, this was improved in d79d299 (Accelerate clear_skip_worktree_from_present_files() by caching, 2022-01-14) by caching directories that do not exist. However, there are some inefficiencies in that caching mechanism.

The caching mechanism stored only the parent directory as not existing, even if a higher parent directory also does not exist. This means that wasted lstat() calls would occur when the sparse files change immediate parent directories but within the same root directory that does not exist.

To set up a scenario that triggers this code in an interesting way, we need a sparse-checkout in cone mode and a sparse index. To trigger the full index expansion and a call to the clear_skip_worktree_from_present_files_full() method, we need one of the sparse trees to actually exist on disk.

The performance test script p2000-sparse-operations.sh takes the sample repository and copies its HEAD to several copies nested in directories of the form f<i>/f<j>/f<k> where i, j, and k are numbers from 1 to 4. The sparse-checkout cone is then selected as "f2/f4/". Creating "f1/f1/" will trigger the behavior and also lead to some interesting cases for the caching algorithm since "f1/f1/" exists but "f1/f2/" and "f3/" do not.

This is difficult to notice when running performance tests using the Git repository (or a blow-up of the Git repository, as in p2000-sparse-operations.sh) because Git has a very shallow directory structure.
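
To make the inefficiency concrete, here is a small standalone simulation. It is a deliberately simplified model and not Git's path_found(): the names existing_dirs, dir_exists(), count_lstats() and sample_paths are hypothetical, the "filesystem" is an in-memory list, and the top-most variant simply walks down from the root. It only counts how many simulated lstat() calls a negative cache would make when it remembers the immediate parent versus the highest missing ancestor.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the worktree: directories that exist on disk. */
static const char *existing_dirs[] = { "f1/", "f1/f1/" };

/* Sorted sparse files, none of which exist on disk (illustrative only). */
static const char *sample_paths[] = {
	"f1/f1/f1/a", "f1/f1/f2/a", "f1/f2/f1/a", "f1/f2/f2/a",
	"f3/f1/f1/a", "f3/f1/f2/a", "f3/f2/f1/a", "f3/f2/f2/a",
};

/* Simulated lstat() of the directory named by the first len bytes of path. */
static int dir_exists(const char *path, size_t len)
{
	size_t i;

	for (i = 0; i < sizeof(existing_dirs) / sizeof(existing_dirs[0]); i++)
		if (strlen(existing_dirs[i]) == len &&
		    !strncmp(existing_dirs[i], path, len))
			return 1;
	return 0;
}

/*
 * Count simulated lstat() calls. With cache_topmost == 0 only the
 * immediate parent is remembered as missing; otherwise the highest
 * missing ancestor is remembered.
 */
static int count_lstats(const char **paths, size_t nr, int cache_topmost)
{
	const char *missing = NULL; /* path whose prefix is a known-missing dir */
	size_t missing_len = 0, len, i;
	const char *p;
	int calls = 0;

	assert(paths || !nr);
	for (i = 0; i < nr; i++) {
		const char *path = paths[i];
		const char *slash = strrchr(path, '/');

		if (missing && !strncmp(path, missing, missing_len))
			continue; /* inside a known-missing directory */

		calls++; /* lstat() of the file itself, which is absent */
		if (!slash)
			continue;

		if (!cache_topmost) {
			len = slash - path + 1;
			calls++; /* lstat() of the immediate parent */
			if (!dir_exists(path, len)) {
				missing = path;
				missing_len = len;
			}
			continue;
		}

		/* Walk down from the root to the first missing directory. */
		for (p = strchr(path, '/'); p; p = strchr(p + 1, '/')) {
			len = p - path + 1;
			calls++; /* lstat() of this ancestor */
			if (!dir_exists(path, len)) {
				missing = path;
				missing_len = len;
				break;
			}
		}
	}
	return calls;
}
```

For sample_paths, count_lstats(sample_paths, 8, 0) yields 16 simulated calls and count_lstats(sample_paths, 8, 1) yields 13. The difference is small here because the layout is shallow; in this model it grows with the number of distinct leaf directories under a missing root.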
This change reorganizes the caching algorithm to focus on storing both the deepest _existing_ directory and the next-level non-existing directory. By doing a little extra work on the first sparse file, we can short-circuit all of the sparse files that exist in that non-existing directory. When in a repository where the first sparse file is likely to have a much deeper path than the first non-existing directory, this can realize significant gains.

The details of this algorithm require careful attention, so the new implementation of path_found() has detailed comments, including the use of a new max_common_dir_prefix() method that may be of independent interest.

It's worth noting that this is not universally positive, since we are doing extra lstat() calls to establish the exact path to cache. In the blow-up of the Git repository, we can see that the lstat count _increases_ from 28 to 31. However, these numbers were already artificially low.

Using an internal monorepo with over two million paths at HEAD and a typical sparse-checkout cone such that the index contains ~190,000 entries (including over two thousand sparse trees), I was able to measure these lstat counts when one sparse directory actually exists on disk:

  Sparse files in expanded index: 1,841,997
       full_lstat_count (before):   173,259
        full_lstat_count (after):     6,521

This resulted in this absolute time change, on a warm disk:

  Time in full loop (before): 2.527 s
   Time in full loop (after): 0.071 s

(These times were calculated on a Windows machine, where lstat() is slower than on a similar Linux machine.)

Signed-off-by: Derrick Stolee <[email protected]>
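
For readers curious about max_common_dir_prefix(), the following is a minimal sketch of what such a helper might look like. The signature and exact semantics here are assumptions made for illustration, not a copy of the code in this commit: it is assumed to return the length of the longest prefix shared by two paths that ends at a directory separator, so that a cached directory can be compared against the next path.

```c
#include <stddef.h>

/*
 * Assumed behavior: return the length of the longest common prefix of
 * path1 and path2 that ends with '/', or 0 if they share no directory.
 */
static size_t max_common_dir_prefix(const char *path1, const char *path2)
{
	size_t common_prefix = 0;
	size_t i;

	for (i = 0; path1[i] && path2[i]; i++) {
		if (path1[i] != path2[i])
			break;

		/*
		 * Both paths agree up to and including this separator,
		 * so the shared directory prefix extends through it.
		 */
		if (path1[i] == '/')
			common_prefix = i + 1;
	}

	return common_prefix;
}
```

Under these assumptions, "f1/f2/f1/a" and "f1/f2/f2/b" share the six-byte prefix "f1/f2/", while "foo/bar" and "foobar/x" share none, because the comparison only counts whole directory components.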