-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remote: re-implement merkle tree buildling #7583
Conversation
src/main/java/com/google/devtools/build/lib/remote/RemoteSpawnCache.java
Outdated
Show resolved
Hide resolved
src/main/java/com/google/devtools/build/lib/remote/RemoteSpawnRunner.java
Outdated
Show resolved
Hide resolved
DigestUtil digestUtil) throws IOException { | ||
InputTree tree = | ||
InputTree.build(inputs, metadataProvider, execRoot, digestUtil); | ||
return MerkleTree.build(tree, digestUtil); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/MerkleTree.//?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
} | ||
|
||
private final Map<PathFragment, DirectoryNode> tree; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there's a way to just take a plain list of ActionInputs and efficiently build the merkle tree directly from that. This is using a map, not a list, and the DirectoryNode instances seem to contain all nested elements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe there is for a lexicographically sorted list and it should be fairly easy to implement. However, what InputTree
mostly is for is to expand directory inputs. I think after this change we could move the directory expansion to the SpawnInputExpander
and after that we can build a merkle tree directly from a list of ActionInputs
.
@benjaminp iirc you mentioned that you are not using the |
Yeah, we replace the merkle tree hashing with the following:
A subtle but important property of the above is that it exploits input aggregation that Bazel has already done. In particular, since runfiles trees are behind a middlemen artifact, you don't end up rehashing every runfile for every action a particular runfiles tree is in. This optimization greatly helps static asset pipelines, which often involve thousands of actions consisting of one 40,000 file tool that converts one input file to one output file. (See #3217 for a related example of how the wonderful world of web static asset processing can stress Bazel.) It seems it would be hard to easily do better than our implementation in pure hashing efficiency given that it's a straightforward |
@benjaminp thanks! I suppose we could also use the code snippet you posted for remote caching (which I understand you use), as technically it's really only remote execution that requires a merkle tree. I'd be happy to review a patch :-). |
src/main/java/com/google/devtools/build/lib/remote/RemoteSpawnCache.java
Outdated
Show resolved
Hide resolved
src/main/java/com/google/devtools/build/lib/remote/merkletree/InputTree.java
Show resolved
Hide resolved
The TreeNodeRepository served us well, but has become a bit dated over time and it's increasingly hard to understand and to fix bugs. Additionally, it was written with the initial intend of incrementally updating the merkle tree between actions, however internal performance analysis has shown that doing so is 1) hard to implement reliably and 2) would increase memory consumption multifold. This is a rewrite of the code that also fixes bugs like bazelbuild#4663. The merkle tree construction is split into two: - InputTree is an intermediate tree representation of a (sorted) list of inputs. - MerkleTree implements a visitor pattern over an InputTree and is responsible for serializing the InputTree to the protobuf merkle tree that's used by remote caching / execution. Benchmarks on my local machine have shown this implementation to consumes between 2-10x less CPU time than the current implementation. Tests by users have shown equally encouraging speedups [1]. [1] https://groups.google.com/d/msg/bazel-discuss/wPHrqm2z8lU/mzoRF236GQAJ
293d30d
to
5d7c238
Compare
5d7c238
to
f92c6b6
Compare
…tory InputTree is a tree representation of a list of action inputs, that also expands directory inputs and contains metadata (hash, size) about each input file. This class will be used in a follw up CL to replace the TreeNodeRepository for building merkle trees for remote caching / execution. Also see #7583. RELNOTES: None PiperOrigin-RevId: 237036900
The TreeNodeRepository served us well, but has become a bit dated over
time and it's increasingly hard to understand and to fix bugs. Additionally,
it was written with the initial intend of incrementally updating the merkle
tree between actions, however internal performance analysis has shown
that doing so is 1) hard to implement reliably and 2) would increase
memory consumption multifold.
This is a rewrite of the code that also fixes bugs like #4663. The
merkle tree construction is split into two:
of inputs.
responsible for serializing the InputTree to the protobuf merkle
tree that's used by remote caching / execution.
Benchmarks on my local machine have shown this implementation to
consumes between 2-10x less CPU time than the current implementation.
Tests by users have shown equally encouraging speedups [1].
[1] https://groups.google.com/d/msg/bazel-discuss/wPHrqm2z8lU/mzoRF236GQAJ