Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use GitHubs commit API to check out the tree directly #11176

Closed
wants to merge 2 commits into from

Conversation

Eh2406
Copy link
Contributor

@Eh2406 Eh2406 commented Oct 4, 2022

Cargo often needs to clone or update repos that have a lot of history, when all it needs is the contents of one particular reference. The most dramatic example is the index, which we clone or update regularly. sparse-registry will move us from using git for this particular case entirely, but in the meantime it would be nice to reduce the transfer costs.

GitHub Blog "Get up to speed with partial clone and shallow clone" describe how to use "shallow clone" (--depth=1) or "treeless clone" (--filter=) to make this less expensive. Unfortunately, libgit2 does not support either of these features. When GitOxide is ready, we should talk to GitHub about if these features make sense in our use case.

It turns out, without using any fancy features, you can fetch an arbitrary git object ID and everything it links to. If we knew the ID of the Tree of the commit of the reference we were trying to check out, we could fetch everything we needed without any of the history! GitHub provides an API endpoint for retrieving information about a commit. One of the included pieces of information is the ID of the Tree. Even better, we already hit that API as a fast path for discovering that we are up-to-date. So instead of that fast path returning the sha we return the commit.tree.sha, and we no longer clone the history!

In fact there are more minor implementation details:

  1. We need to store the sha and commit.tree.sha somewhere so that we can use it as an etag, and so that we know what to check out, as git no longer has Commits to refer to. I have chosen to make a blob and a ref point to that blob.
  2. Our checkout code needs to be slightly modified because it doesn't make sense to update the HEAD commit if we don't have history.

There is one failing test, specifically added because this cannot be merged until we talk to GitHub about:

  • Is it acceptable for us to drive a significant portion of the traffic to fetching a Tree object instead of its associated Commit?
  • Is it acceptable for us update a repository that has Tree objects checked out, to a different tree object?
  • Occasionally, the call to https://api.github.com/repos/OWNER/REPO/commits/REF will fail after succeeding in the past. (For example due to rate limiting.) Leading to us normally cloning a repository which had previously only had Tree objects. Is this acceptable?

You have publicly asked people not to do any kind of fetch in a repository that was checked out using "shallow clone", is this the same situation?

Fundamentally this change reduces disk footprint and network bandwidth by an enormous amount. Different repos will have different ratios. But the index is quite significant, in the 80% range, depending on how recently it was squashed.

@rust-highfive
Copy link

r? @epage

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 4, 2022
@Byron
Copy link
Member

Byron commented Oct 5, 2022

I find this method downright genius as it uses the GitHub API in conjunction with the standard git protocol features (namely, fetching objects specified by hash) to emulate a shallow clone which is even more space efficient than an actual shallow clone, as it skip on commits entirely. Let's call it 'super-shallow' clone. Thus, even once shallow cloning is available, assuming there is no blockers on GitHubs side, using this fast-path would be beneficial either way.

From a performance perspective on the server side, judging from my own limited experience, I'd be surprised if GitHub would have an issue with it. It will be good to hear that from them though.

Something I was always interested in was to keep the local crates-index clone as standard as possible to allow users to interact with it like they normally would. With this PR we get a step further away from that as HEAD seems to be bypassed entirely, that is it won't point to a recent commit (we wouldn't have that anymore), nor to a tree. Pointing it to a tree would probably disturb most tools out there as well so there is little value. Furthermore, the requirement to store an etag as alternative to a tree as reference for 'last seen' for GitHub requires more than HEAD for storage.

I hope to get to integrating cloning and fetching the crates index using gitoxide soon and open a PR for that, by then I should understand all the details which should allow to maybe provide more helpful comments here as well. (Something I'd like to understand more is how the transition from this 'super-shallow' implementation works back to the 'normal' one, and vice-versa)

@Eh2406
Copy link
Contributor Author

Eh2406 commented Oct 5, 2022

One thing I forgot to do in this PR and forgot to mention, we can easily change the hash used by Cargo in the file path where the git db is stored. This can prevent old and new Cargos from interacting with the same git dbs. Hopefully then the only time we transition back to a normal clone is on a failed api call.

And if updates are the expensive part then we can make it opt in, so people can use it where updates are not likely (on CI).

Copy link
Contributor

@epage epage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good. As I wasn't as familiar with this, the area I was most concerned about was that invalidation still works correctly and it seems so.

src/cargo/sources/git/utils.rs Outdated Show resolved Hide resolved
src/cargo/sources/git/utils.rs Outdated Show resolved Hide resolved
Comment on lines 213 to 221
let id = if tree {
shalow_data.tree
} else {
shalow_data.etag
};
Some(id.parse::<Oid>().ok()?)
})
{
return Ok(reference);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned over the brittleness of this API.

If tree is true, it will resolve to a tree if and only if we are in the fast path. All (transitive) callers need to know this is just a generic object id and not specifically a commit id (which is an easy assumption to make).

In my ideal world

  • We would have separate resolve_to_object and resolve_to_commit functions.
    • object is more accurate than tree as we can't assume its a tree
    • I feel like we should make this more obvious in the caller than just using a true or false
    • separate functions as the difference in which to use is a design-time decision and not a runtime decision
  • resolve_to_object would not do any peels, making it more obvious when someone misuses the API

My main concern with this ideal is how much code would be duplicated. I guess its not so bad if

  • We have a _resolve_to_object function that is just the old code path without peels. ``resolve_to_commit` would call it and do a peel.
  • We have a _find_shallow_blob function that just gets the blob and each of the top-level functions extract the relevant field from that.

Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely agree with your point that this is a compile time distinction. I will rework the interface to match that (when I next have time to code). I think the two relevant functions are resolve_to_tree (which always points to a tree that git has on disk) and resolve_to_commit (which will always point to a commit, but may not be something git has on disk).

src/cargo/sources/git/utils.rs Outdated Show resolved Hide resolved
src/cargo/sources/git/utils.rs Outdated Show resolved Hide resolved
src/cargo/sources/git/utils.rs Outdated Show resolved Hide resolved
src/cargo/sources/registry/remote.rs Show resolved Hide resolved
src/cargo/sources/registry/remote.rs Outdated Show resolved Hide resolved
src/cargo/sources/registry/remote.rs Outdated Show resolved Hide resolved
@Eh2406
Copy link
Contributor Author

Eh2406 commented Oct 7, 2022

Ready for the next review

Comment on lines +224 to +230
#[test]
fn check_with_git_hub() {
panic!(
r#"nonstandard shallow clone may be worse than a full check out.
This test is here to make sure we do not merge until we have official signoff from GitHub"#
)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raising visibility of this

@larsxschneider
Copy link

👋 Hello from GitHub!

Using the commits API certainly works. However, if you are requesting a lot of files then you might run into our API rate limits.

If you are only interested in the content of the HEAD commit of a particular reference, then the most efficient way to get this data is our archive API.

Here is an example which downloads a tarball of the HEAD of the master branch of your crates.io-index repository and extracts it into the current directory:

$ curl -L https://api.github.com/repos/rust-lang/crates.io-index/tarball/master | tar zxf -

@Eh2406
Copy link
Contributor Author

Eh2406 commented Oct 21, 2022

Switching to using tarballs is a larger structural change. One I will definitely continue to look into. For some of Cargos needs we do in fact want to extract the output, where it might be a very good fit.

However, The Index is the place where Cargo puts the most load on GitHubs infrastructure. We do not extract the Index on disk, instead only extracting and parsing the files we happen to need. I don't think tarballs are compatible with random-access. We could get a zipball, but this is also significantly larger. The result would be a whole new kind of Index (Git/Sparse and now ZIP), with significant testing and implementation overhead, which would probably not merge before we stabilize Sparse Indexs.

@Eh2406
Copy link
Contributor Author

Eh2406 commented Nov 7, 2022

We need to store the sha and commit.tree.sha somewhere so that we can use it as an etag, and so that we know what to check out, as git no longer has Commits to refer to. I have chosen to make a blob and a ref point to that blob.

I realize the problem with this plan, git gc has no way of knowing that we are still using the object pointed at OID commit.tree.sha. I am switching to having two refs, one to the commit.tree.sha and one to a blob containing sha.

@Eh2406
Copy link
Contributor Author

Eh2406 commented Nov 8, 2022

Is it acceptable for us update a repository that has Tree objects checked out, to a different tree object?

I built a separate project that creates a new git repository and then fetches into it twice. The goal was to see the first one download a pack file of 64 MB, and the second one only downloading the changes. I used head: "e6c48cfe712d28e17914a2822b8f2f492ff22c80" whose tree: "a792a7874f7033acf3a8d8aacfd914c77c5673fd" as the first fetch and head: "37c5a461c41265670000d91eb76e9256b91f6dc3" whose tree: "63a4fe231ca6fcc6107b41f9c4f56cdbb6381481" for the second. Unfortunately, everything I tried ended up downloading two pack files that were both 64 MB.

Things I tried:

  • .fetch(&[&format!("+{tree}")]) where tree is the oid above.
  • .fetch(&[&format!("+{tree}:{old_tree}")]) where old_tree is the oid of the tree from the previous fetch.
  • .fetch(&[&format!("+{tree}:{old_head}")]) where old_head is the oid of the head from the previous fetch.
  • .fetch(&[&format!("+{tree}:refs/ref_that_point_to_old_tree")]) where refs/ref_that_point_to_old_tree does what it says on the tin.
  • .fetch(&[&format!("+{tree}:refs/ref_that_point_to_a_commit_that_points_to_old_tree")]) where refs/ref_that_point_to_a_commit_that_points_to_old_tree does what it says on the tin.
    git2::trace_set(git2::TraceLevel::Trace, |a, b| {
        if b.contains("have") {
            eprintln!("{a:?} {b}")
        }
    });

Was helpful in figuring out what's going on. libgit2only sends a have line when the oid points to a commit. However when it does point to a commit, the server has no way to know what tree that commit is associated with.

Unless we can figure out how to get deltas between two trees, I think this PR is probably dead in the water.

@Byron
Copy link
Member

Byron commented Nov 9, 2022

Unless we can figure out how to get deltas between two trees, I think this PR is probably dead in the water.

Does it have to be between two trees?

I see two distinct stages

  1. the first clone
  2. consecutive fetches

1) could be done, I think, by fetching single {tree}, and keeping the hash of the commit that refers to it without fetching it. That way, we have all data associated with the tree and can work with it.

For 2), and in order to get deltas, one would have to get libgit2 to produce a have line in the protocol, which it does only if the right-hand side of the refspec points to a commit object, which unfortunately we don't have in the object database. The left-hand side must also be a commit, and it's something that we know from using the GitHub API, and it doesn't have to exist locally either.

This narrows the problem down to: How can we make a commit object (that we know, but don't have) so that a ref points to it and it is picked up as have line in the protocol?

A look at the code reveals that the way libgit2 negotiates requires a revision walk, so it needs a real commit to work. But what constitutes a real commit? It's just a file like this:

tree <hash>
parent <our_previous_fake_commit>
author name <email> 1666095079 +0100
committer GitHub <[email protected]> 1666095079 +0100


commit message

The idea is to manufacture a commit that:

  • points to our previous known fake-commit. It's fake as it has a hash that doesn't match its contents
  • points to our actual tree.

Thus we create a commit graph made up of fake commits that is 'one-commit-per-fetch'. That way the server will get enough information, even without mentioning the intermediate commits that we don't know about. I know this is fine as gitoxide currently implements a naive negotiation which only mentions haves based on the tracking branches.

This sounds like a lot of workaround'ing to have another stab at getting this to work, but I think it could indeed work. The question is if it's worth the effort (and maintenance) in the light of gitoxide being around the corner which would bring true shallow clones and fetches.

Edit: It's worth noting that these 'fake' commits would be loose objects which have been renamed to match the required hash. During normal operation, git or libgit2 will never validate a hash so it's fine. I guess this makes it important that gc doesn't trip up on these either, so there is certainly more to be validated even if this workaround would work at first.

@Eh2406
Copy link
Contributor Author

Eh2406 commented Nov 9, 2022

Unfortunately there are two problems with that suggestion. One problem is that libgit2 does appear to check. In my experiments if the right hand side of a refspec point at a loose object that has been renamed to have the hash of head, then libgit2 gives "object hash mismatch" error.

The second issue is more subtle, what happens if the new tree contains a file whose contents existed in an old commit, but not in the tree associated with the old head. What would happen is:

  1. Initial clone of only old tree.
  2. Download Delta between old head and new tree. (This does not include the new version of the file, because we claim to have had old head which implies that we have the entire history, and so we have the blob already.)
  3. Attempts to look at the object in the new tree, and get an error because it does not exist in our object database. (As it was not in the old tree nor in the delta.)

@Byron
Copy link
Member

Byron commented Nov 9, 2022

So this is how it feels to be sad and happy at the same time 😅.

@Eh2406
Copy link
Contributor Author

Eh2406 commented Nov 11, 2022

Seems like the low-level control needed to make this work will have to wait on gitoxide. Which will probably also support more normal/standard operations, making this scheme unnecessary. Also hopefully sparse indexes will be stable by then, which also makes such extraordinary measures unjustified.

@Eh2406 Eh2406 closed this Nov 11, 2022
@Eh2406 Eh2406 deleted the pond branch November 29, 2022 19:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants