Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support git partial clones with sparse checkouts #13232

Closed

Conversation

hmarr
Copy link
Contributor

@hmarr hmarr commented May 3, 2022

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same change?
  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your changes? Here's an example.
  • Have you successfully run brew style with your changes locally?
  • Have you successfully run brew typecheck with your changes locally?
  • Have you successfully run brew tests with your changes locally?

This adds a new option to enable downloads that use git partial clones with sparse checkouts, which can dramatically speed up clones of large repositories as well as reducing the checkout size on disk. The background and motivation is described in more detail in #13188.

Example (in a relatively small repo so there's not a huge difference)

Without partial clone and sparse checkout: 7.6s.

cask "sparse-test" do
  version :latest
  sha256 :no_check

  url "https://github.com/hmarr-testing/sparse-test",
      using:      :git,
      branch:     "main"

  name "Sparse test"

  artifact "food/cheese", target: '/tmp/cheese'
end
brew $ rm -rf /Users/hmarr/Library/Caches/Homebrew/Cask/sparse-test--git/
brew $ time bin/brew upgrade --cask hmarr/tap/sparse-test
==> Upgrading 1 outdated package:
hmarr/tap/sparse-test latest -> latest
==> Upgrading sparse-test
==> Cloning https://github.com/hmarr-testing/sparse-test
Cloning into '/Users/hmarr/Library/Caches/Homebrew/Cask/sparse-test--git'...
==> Checking out branch main
Already on 'main'
Your branch is up to date with 'origin/main'.
Warning: No checksum defined for cask 'sparse-test', skipping verification.
==> Backing Generic Artifact 'cheese' up to '/Users/hmarr/src/github.com/homebrew/brew/Caskroom/sparse-test/latest/food/cheese'
==> Removing Generic Artifact '/tmp/cheese'
==> Moving Generic Artifact 'cheese' to '/tmp/cheese'
==> Purging files for version latest of Cask sparse-test
🍺  sparse-test was successfully upgraded!
bin/brew upgrade --cask hmarr/tap/sparse-test  2.29s user 2.07s system 56% cpu 7.674 total

With partial clone and sparse checkout: 4.8s.

cask "sparse-test" do
  version :latest
  sha256 :no_check

  url "https://github.com/hmarr-testing/sparse-test",
      only_paths: ["food/cheese"],
      using:      :git,
      branch:     "main"

  name "Sparse test"

  artifact "food/cheese", target: '/tmp/cheese'
end
brew $ rm -rf /Users/hmarr/Library/Caches/Homebrew/Cask/sparse-test--git/                                                                                                                                         git-partial-clone-sparse-checkout
brew $ time bin/brew upgrade --cask hmarr/tap/sparse-test                                                                                                                                                         git-partial-clone-sparse-checkout
==> Upgrading 1 outdated package:
hmarr/tap/sparse-test latest -> latest
==> Upgrading sparse-test
==> Cloning https://github.com/hmarr-testing/sparse-test
Cloning into '/Users/hmarr/Library/Caches/Homebrew/Cask/sparse-test--git'...
==> Checking out branch main
Already on 'main'
Your branch is up to date with 'origin/main'.
Warning: No checksum defined for cask 'sparse-test', skipping verification.
==> Backing Generic Artifact 'cheese' up to '/Users/hmarr/src/github.com/homebrew/brew/Caskroom/sparse-test/latest/food/cheese'
==> Removing Generic Artifact '/tmp/cheese'
==> Moving Generic Artifact 'cheese' to '/tmp/cheese'
==> Purging files for version latest of Cask sparse-test
🍺  sparse-test was successfully upgraded!
bin/brew upgrade --cask hmarr/tap/sparse-test  1.79s user 1.27s system 62% cpu 4.866 total

How it works

It relies on two git features working in tandem:

  • Partial clone with a filter of blob:none and the --no-checkout option to avoid fetching any blobs during the initial clone.
  • Sparse checkout to limit the working tree to only the paths specified. When git checkout is run, only the blobs required for the sparse checkout are fetched (thanks to the partial clone), then only those specified paths appear in the repository working tree.

The sparse checkout is created by enabling the core.sparseCheckout option, and writing the paths to .git/info/sparse-checkout. There is a git-sparse-checkout command that makes this easier, but I've avoided using it as it's relatively new and marked as experimental, and doesn't exist in older versions of git.

Git version compatibility

It's a little hard to find builds for old git versions, but I built sevearl load of versions git from source to see what worked.

Sparse checkouts work with git 2.19, but although there's some support for partial clones, they don't work well. It seems to fetch more blobs than are necessary, and it fetches them one at a time, which is very slow.

From git 2.20, both sparse checkouts and partial clones seem to work well. This is the version of git that comes with macOS 10.15, which is the minimum recommended macOS verison Homebrew supports right now. For earlier versions of git, the option is ignored, meaning the full repository is checked out. This feel in line with the "best effort" support guarantees for older versions of macOS that are promised.

Handling repository updates

I did some testing to make sure that things don't to awry after the initial clone.

  • A regular repository update works fine, including when additional paths are added to the array
  • If the formula starts using the only_paths option for the first time, the repository should be configured correctly to use partial clones and sparse checkouts
  • I haven't handled the reverse case where a formula author stops using only_paths. It's not totally trivial as sparse checkouts set the skip-worktree bit, which needs to be removed to transition from a sparse checkout to a full checkout. I can see three options:
    1. When updating a repository, if only_paths is nil, check to see if the core.sparseCheckout option is enabled. If it is, set the paths in .git/info/sparse-checkout to *. This technically leaves the repository as a sparse checkout but includes all paths.
    2. The same as (1), but afterwards clean up by turning off the core.sparseCheckout option and deleting the .git/info/sparse-checkout file after git checkout has been run.
    3. Add a suffix like -sparse to the cache_tag if only_paths isn't nil, meaning we'd get a different cached repository if the option disabled.

Open questions / things I'd love feedback on

  • How to handle issue described above about handling the transition from a sparse checkout to a full checkout.
  • The option (only_paths) is an array of strings, where each string is a path that will be checked out. I suspect that most of the time people will only provide one path, so we could go with a string rather than an array.
  • Is the name only_paths clear enough? Other options I considered were sparse_checkout_paths, only_checkout_paths, paths_to_fetch.
  • I haven't written any new tests for this behaviour. It looks like there isn't much of a precedent for writing download strategy tests as the existing tests are a bit sparse. I'm happy to take your steer on what to do here.

Thanks for all your hard work keeping homebrew awesome! ❤️

@hmarr hmarr changed the title Git partial clone sparse checkout Support git partial clones with sparse checkouts May 3, 2022
Copy link
Member

@MikeMcQuaid MikeMcQuaid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far, nice work!

@MikeMcQuaid
Copy link
Member

  • I haven't written any new tests for this behaviour. It looks like there isn't much of a precedent for writing download strategy tests as the existing tests are a bit sparse. I'm happy to take your steer on what to do here.

I'd be interested to see a PR to a homebrew-* repository which makes use of this new feature so we can see what the DSL looks like from that side.

@hmarr
Copy link
Contributor Author

hmarr commented May 4, 2022

Thanks for the review @MikeMcQuaid 🙌

Here's a draft PR showing how the new DSL could be used: Homebrew/homebrew-cask-fonts#5758

Copy link
Member

@MikeMcQuaid MikeMcQuaid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me. I have only one question about naming but I'm happy with the implementation here.

Will merge a few other maintainers chime in on their preferred DSL naming here. Great work @hmarr!

@hmarr
Copy link
Contributor Author

hmarr commented May 7, 2022

Thanks @MikeMcQuaid!

Before merging, we should probably also decide on a way of handling upgrades from sparse to full checkouts (which would happen in the rare case someone removes only_paths). I listed the options I can see in the PR body, but they boil down to:

  1. Add something to the cache key to separate sparse checkouts from full checkouts. This is the simpler option to implement, but means that if you remove only_paths you'd re-clone the repo from scratch.
  2. Detect sparse checkouts and explicitly perform the upgrade when only_paths isn't specified. This would let us reuse the sparse checkout when upgrading to a full checkout, but I'm not convinced that marginal speed benefit in what must be a very rare case is worth the extra complexity in the download strategy code.

@MikeMcQuaid
Copy link
Member

  • Add something to the cache key to separate sparse checkouts from full checkouts. This is the simpler option to implement, but means that if you remove only_paths you'd re-clone the repo from scratch.

This seems fine 👍🏻

@apainintheneck
Copy link
Contributor

There is a small change that should be made related to the recently merged in PR #13358 (more information can be found there).

Essentially, Cask#checksumable? needs to be updated to ignore casks using the sparse checkout download strategy otherwise it will try and fail to hash a directory. This what we came up with.

Before

    def checksumable?
      url.using.blank? || url.using == :post
    end

After

    def checksumable?
      (url.using.blank? && url.only_paths.blank?) || url.using == :post
    end

Let me know if that makes sense.

@hmarr hmarr force-pushed the git-partial-clone-sparse-checkout branch from 920bc9b to a8c9d38 Compare June 12, 2022 14:49
@hmarr
Copy link
Contributor Author

hmarr commented Jun 12, 2022

@apainintheneck thanks for the heads up, I made that change in ccdd45d

@MikeMcQuaid I made the change to the cache path for sparse checkouts in a8c9d38. That was the last thing I had on my list, so I we're ready to go now. Once this is merged, we can start switching the font casks over to git from svn.

@MikeMcQuaid
Copy link
Member

@hmarr Just before this gets merged: interested if you have any thoughts on the comments in https://github.com/Homebrew/homebrew-cask-fonts/pull/5758/files. I ask because I want to ensure that we're creating the optimal API here before we merge and are (somewhat) stuck with it.

Thanks!

@apainintheneck
Copy link
Contributor

The url documentation over at the Cask Cookbook should also probably be updated at some point too.

@hmarr
Copy link
Contributor Author

hmarr commented Jun 30, 2022

@MikeMcQuaid good point! I just replied to your comments there (been away for a couple of weeks, sorry for being slow to get back to you 🐢)

@MikeMcQuaid
Copy link
Member

@hmarr Thanks! Replied over there.

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale No recent activity label Jul 22, 2022
@github-actions github-actions bot closed this Jul 30, 2022
@github-actions github-actions bot added the outdated PR was locked due to age label Aug 29, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
outdated PR was locked due to age stale No recent activity
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants