Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add GITSHA1 value to supported digest functions #257

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
65 changes: 65 additions & 0 deletions build/bazel/remote/execution/v2/remote_execution.proto
Original file line number Diff line number Diff line change
Expand Up @@ -1890,6 +1890,71 @@ message DigestFunction {
// The BLAKE3 hash function.
// See https://github.com/BLAKE3-team/BLAKE3.
BLAKE3 = 9;

// Use git-sha1 hahsing function prefixed with a one byte marker.
//
// Since git hashes blob content differently from trees, this type of
// information has to be transmitted in addition to the content and the
// hash. To this aim, the git hash values passed over the wire are prefixed
// with a single-byte marker, thus allowing the remote side to distinguish a
// blob from a tree without inspecting the (potentially large) content. The
// markers are
//
// - 0x62 for a git blob (0x62 corresponds to the character b)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which marker is used for Action and Command messages? Are those treated as plain blobs?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No special markers :)

The marker is introduced just to distinguish a tree object from a blob because git itself hashes them differently

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you're saying that Action and Command messages are stored as plain SHA-1 hashed objects, using regular SHA-1 hashes in their digests?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, maybe I was unclear. They get a marker. Action, for example is a blob and so will be marked with 62.

Everything is consistenly hashed according to the git-sha1 (which is different from plain sha1). If the object is a blob the hash is prefixed with 0x62, if it is a tree (for example the input root), with 0x74

Did I add even more entropy? :)

// - 0x74 for a git tree (0x74 corresponds to the character t)
//
// So
//
// Hash(blob) = 0x62<git sha1>
// Hash(tree) = 0x74<git sha1>
//
// Blob Upload
//
// A blob is uploaded to the remote side by passing its raw content as well as its
Copy link
Collaborator

@EdSchouten EdSchouten May 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "raw content" in this case? Does this include the "blob ${size}\0" prefix? I think it sort of has to, right? Otherwise you can't validate the hash of a file in a streaming manner.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the raw content is, for example, the real content of the file. blob ${size}\0 must be added by the hashing function

$ printf "foo" | git hash-object --stdin
19102815663d23f8b75a47e7a01965dcdc96468c

$ printf "blob 3\0foo" | sha1sum
19102815663d23f8b75a47e7a01965dcdc96468c  -

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow. So combined with the fact that sizes are not stored in directories, this means that hash validation can only take place after a file has been fully downloaded, right? That's pretty brutal.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say so.. that's the format used by git :/

// Digest containing the git hash value for a blob prefixed by 0x62.
//
// Tree Upload
//
// Whenever a Digest of a Directory message is expected, it can be
// convenientely replaced by the Digest of the corresponding git tree.
//
// Using this git tree representation makes tree handling much more
// efficient since the effort of traversing and uploading the content of a
// git tree occurs only once and for each subsequent request, and the git
Comment on lines +1922 to +1923
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that git trees would be uploaded in their entirety, even if only a single new file is included in the tree?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mostynb , yes. Git trees are simple files (aka blobs) that contain the ordered list of the files and trees that are directly included with their names and hashes (i.e., the whole tree is not flatten as in Directory messages). So, if a file change, the hash of that blob will change, so the git tree will change as well.

git trees are blobs whose content is constructed in a precise way, and its digest is computed
tree <size of the blob>\0<blob's content> | sha1sum

The content of a tree is a sequence of entries. Each entry is defined as follows
<hash bytes><object type> <object name>\0
where <object type> is an octal number that identifies a blob, an executable, a tree, or a symlink. <object name> is the name of the file or directory.

For example, given the following directory

$ tree tree-playground/
tree-playground/
├── bar.txt
└── foo
    └── foo.txt

1 directory, 2 files

the git tree object can be read

5716ca5987cbf97d6bb54920bea6adde242d87e6 0o10644 bar.txt
fcf0be4d7e45f0ef9592682ad68e42270b0366b4 0o40000 foo

From thins you can imagine how more complicated tree will look like.

If the content of bar.txt will change, its hash will change as well and so the tree that contains it. However, the tree foo will not change. So the client only needs to upload:

  • the new bar.txt
  • the new git tree, which contains the new hash for bar.txt.

HTH :)

// tree id can be directly passed around. We require the invariant that if a
// tree is part of any CAS, then all its content is also available in this
// CAS. To adhere to this invariant, the client side has to prove that the
// content of a tree is available in the CAS before uploading this tree
// (i.e., leaves must be uploaded before the root). The server side has to
// ensure this invariant holds. In particular, if the remote side implements
// any pruning strategy for the CAS, it has to honor this invariant when an
// element gets pruned.
//
// Another consequence of this efficient tree handling is that it improves action
// digest calculation noticeably, since known git trees referred by the root
// directory do not need to be traversed. This in turn allows to faster determine
// whether an action result is already available in the action cache or not.
//
// Tree Download
//
// Once an action is successfully executed, it might have generated output files or
// output directories in its staging area on the remote side. Each output file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Staging area" is a git concept, that does not necessarily exist on the server/worker side. If we make this part of REAPI I think it would need to be defined here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the comment is simply taking about the directory in which the action was executed. Maybe "action execution directory" is the better term.

// needs to be uploaded to its CAS with the corresponding git blob hash. Each
// output directory needs to be translated to a git tree object and uploaded to the
// CAS with the corresponding git tree hash. Only if the content of a tree is
// available in the CAS is the server side allowed to return the tree to the
// client.
//
// In case of a generated output directory, the server only returns the
// corresponding git tree id to the client instead of a flat list of all
// recursively generated output directories as part of a Tree Protobuf
// message. The remote side promises that each blob and subtree contained in
// the root tree is available in the remote CAS. Such blobs and trees must
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this should state that the blobs must be available via their "GITSHA1" digest? Or is it expected that they are also available by other digests supported by the server?

// be accessible, using the streaming interface, without specifying the size
// (since sizes are not stored in a git tree). Due to the Protobuf 3
// specification, which is used in this remote execution API, not specifying
// the size means the default value 0 is used.
GITSHA1 = 10;
Comment on lines +1938 to +1957
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I understand it correctly, OutputDirectory.tree_digest will now be used in two different ways:

  1. In case digest_function != GITSHA1, it will point to a single REv2 Tree object, which has a height of at most 1.
  2. In case digest_function == GITSHA1, it will point to a single Git Tree object, which is the equivalent of an REv2 Directory object. It can thus point to a hierarchy of CAS objects whose height is unconstrained.

In my opinion this is a feature that is independent of supporting Git's hashing algorithm and directory format. In #229 we essentially redefined REv2 Tree objects as being a simple binary container format around a sequence of directories. It is generic enough that you could also use that to pack multiple Git directory objects into a single CAS object.

But this is obviously not what you're interested in. You're interested in actually having them stored as separate CAS objects, so that it's possible to embed an output directory into the input root of a subsequent action in a fully opaque manner. And this is a valid request. I think that this is one of the shortcomings of the protocol as stands.

Could you please take a look at PR #258? In that PR I'm proposing that we add a new enumeration to the Command message that clients can use to specify the format in which they want workers to store output directories. By letting Justbuild set that to DIRECTORY_ONLY, it can access output directories in Git's native format.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, besides the desire to make use of the fact that git naturally has files and directories prehashed, the other desire is to avoid communication overhead in case the client only needs very few artifacts locally (say, the logs of failed tests) and otherwise wants to leave everything on the remote-execution side as it is only used as inputs for other actions. The latter can well be seen as an independent feature that also works with other hashes and directory representations.

}
}

Expand Down