Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Distributed BuildKit (Swarm/Kubernetes/Mesos..) #62

Closed
AkihiroSuda opened this issue Jul 7, 2017 · 21 comments
Closed

RFC: Distributed BuildKit (Swarm/Kubernetes/Mesos..) #62

AkihiroSuda opened this issue Jul 7, 2017 · 21 comments

Comments

@AkihiroSuda
Copy link
Member

AkihiroSuda commented Jul 7, 2017


PTAL
https://docs.google.com/presentation/d/18ZJRm_0h25GP0uvDDEugAeeOkB6x8nOWLVUwBroV0X4/edit?usp=sharing

Agenda:

  • BuildKit cluster orchestration
  • Cache placement & Scheduling
  • Artifact output

Highly related issues:

@mlaventure
Copy link

How about (and this is a very swarm centric example):

  • create a docker network (e.g. buildkit)
  • create 2 services:
    1. One with only a single replica that would be the master
    2. Create n workers that would register to the master above via it's container name

The master buildkit binary could start the build as soon as a single worker is connected, and decide to split out the work to other workers as they get added to the queue.

This is just the idea that came to mind after reading your proposal. I am not at all familiar with the BuildKit design though, so this may not make sense at all, in which case just ignore it ;-)

@AkihiroSuda
Copy link
Member Author

@mlaventure
Thank you for the comment.
BuildKit worker containers (moby/buildkit:worker) need to create real workload containers (e.g. golang:1.8) via bind-mount containerd.sock (or "CinC"), and also need to connect them to an arbitrary CNI/CNM network.
I haven't looked into CNI/CNM deeper, but is it easily possible?

@mlaventure
Copy link

@AkihiroSuda not sure for CNI/CNM, you would have to ask the network people :)

My comment was about a 3rd mode compared to the ones you proposed. Mainly, to have a fix set of workers assigned instead of requiring buildkit to have access to the containerd socket so it can spawn them.

@dnephin
Copy link
Member

dnephin commented Jul 7, 2017

My 2 cents: there is very little value in splitting a single build across multiple nodes. If I'm understanding the proposal correctly, it seems to suggest that.

Most builds will not be any faster, and many builds will actually be slower because they have to wait for the cache to be distributed to a different node. Most builds are slow because of data transfer, so moving individual steps to different nodes is not actually going to speed anything up. We also shouldn't be encouraging anyone to build images that are so large they require more than a single node to build.

I think the more valuable "distributed build" feature is a "worker cluster" where a single master can be sent multiple builds, and each build might end up on a different node, but the entire build happens on a single node, and any artifacts are streamed back to the client.

I think that makes Topic 2: Cache placement much easier because you only need to distribute it before/after a build, instead of after each step.

Question about Topic 3, why would the artifacts be stored by buildkit? Wouldn't they be streamed back to the client that requested the build? An option to send to a registry also sounds pretty good.

@tonistiigi
Copy link
Member

topic1: I had something similar in mind as @mlaventure . We shouldn't spawn orchestration workers, that's a responsibility of some other tooling. I'm not sure I get the networking comment. The buildkit-workers themselves need to be on the same network but why would this be required for containers executing users processes? The workers do need to have access to containerd(or equal permissions) but it could be a secure subset of containerd api as well.

topic2: I'm not sure if using registry only would give the performance we are after. Without the registry, at least in theory, the overhead could be minimal in the future with better snapshot drivers. With the registry, we always have a cost of 2 additional writes and push/pull sync problem. It would simplify things a lot though. We should leave master HA out of this for now as it is a different topic.

topic3: I agree with @dnephin that this mostly depends on the exporter, worker itself should only need to worry about where it gets/puts the cache. For the push to a registry(or any central storage option), buildkit already should support exporting cache to any registry. So maybe this is just a way to configure master so that it automatically makes the current cache state HA(and removing local worker cache) in the background.

@dnephin

As I understand from topic2 example, the metadata of the cache is still kept on the master, and it is used for assigning workers. So possible copy operations only happen on branch splits. If the user has a single threaded definition, it should almost always only use one worker. Our tooling should discourage creating these single threaded definitions, but if the user still chooses to use them, their performance should be at least same as current docker build.

I think the more valuable "distributed build" feature is a "worker cluster" where a single master can be sent multiple builds, and each build might end up on a different node, but the entire build happens on a single node,

The current solver already doesn't separate requests from different clients and uses a shared graph. The logic would be same, just it is based on dependency chain, not on request identity.

We also shouldn't be encouraging anyone to build images that are so large they require more than a single node to build.

We should encourage bigger definitions because this gives us more information to make better decisions. The goal should be that splitting graph to more vertexes always improves performance and cache. Resulting artifact size rarely has a correlation with the graph size.

@AkihiroSuda
Copy link
Member Author

@dnephin

We also shouldn't be encouraging anyone to build images that are so large they require more than a single node to build.

I agree, but I think the future trend in the container world will be encouraging users to build "bundle" of images (i.e. DAB/Compose, or Helm) rather than a single image.
Such bundles would have larger LLB DAG and users will benefit from multi-node builder.

Question about Topic 3, why would the artifacts be stored by buildkit? Wouldn't they be streamed back to the client that requested the build? An option to send to a registry also sounds pretty good.

If "client" stands for something like /usr/bin/docker, streaming back to the "client" would cause extra network traffic.

What we need would be a mechanism to tell client "The worker X holds the cache you requested to build, so you can call Export() against the worker X and push the cache to a registry as an image (or other kind of exportation operation)."

@tonistiigi

The buildkit-workers themselves need to be on the same network but why would this be required for containers executing users processes? The workers do need to have access to containerd(or equal permissions) but it could be a secure subset of containerd api as well.

docker build --network would require connecting workload containers (e.g. golang:1.8) to an arbitrary network, which could not be just achived by just bind-mounting containerd.sock into moby/buildkit:worker containers.

Even if we can launch moby/buildkit:worker containers with --privileged (which is not supported in the current Docker Swarm-mode), we might still need some orchestrator-specific driver?

(cc @ijc PTAL if you are interested in - moby/swarmkit#2299 seems slightly related to this topic)

I'm not sure if using registry only would give the performance we are after.

My suggestion is not to use registry, and transfer caches across workers directly.
I just mentioned using registry as a possible alternative design.

Also, can I hear your opinion about another alternative design: using gossip-like protocol?
Using a gossip-like protocol would remove dependency for etcd / shared volume, and it might simplify deployment & operation. (though we need to take care of the bootstrap node unless we can rely on mDNS)

For implementation, Serf might be used.
According to https://groups.google.com/forum/#!topic/serfdom/GlTqAshmK_w , Serf has good scalability w.r.t. number of nodes: "for clusters of even thousands of nodes, delivery is still sub-second. "
However, due to limitation of UDP, its event size cannot exceed 512 bytes: https://github.com/hashicorp/serf/blob/60b345945e1a3453f219114823537ee2ceeb3500/serf/serf.go#L226
So not sure it is suitable for cachemap.

@tonistiigi
Copy link
Member

docker build --network would require connecting workload containers (e.g. golang:1.8) to an arbitrary network, which could not be just achived by just bind-mounting containerd.sock into moby/buildkit:worker containers.

We knew that --network would not be portable and that's why we resisted against that as long as we could. I wouldn't make any design decision based on that obscure feature. If the user wants to build with these specific settings with docker build, docker can prepare a custom worker instance for it that is connected to some local namespace and that can never be used for distributed workflows.

@AkihiroSuda
Copy link
Member Author

Can we make this forward with master-worker model?
The first task would be to split the master daemon (solver, session mgr) from the worker daemon.
Then we can incrementally add support for multi-worker and multi-master.

Or do we need more design discussion?

@tonistiigi
Copy link
Member

Can we make this forward with master-worker model?

yes

The first task would be to split the master daemon (solver, session mgr) from the worker daemon.
Then we can incrementally add support for multi-worker and multi-master.

Maybe first step would be to just have multiple instances or worker/snapshotter inside the same binary. Then, even when splitting up a worker binary it could (in the beginning) be an optional binary for using the remote cases. This makes sure that we don't get stuck on implementing the other features because grpc limitations. I'd like the grpc API to influence the rest of the design as little as possible.

@AkihiroSuda
Copy link
Member Author

@tonistiigi

Maybe first step would be to just have multiple instances or worker/snapshotter inside the same binary.

Sorry for my recent inactivity.
Here is my WIP branch to put multiple controller instances to a single buildd binary: https://github.com/moby/buildkit/compare/master...AkihiroSuda:plugin.20170807.1?expand=1

Please let me know the design SGTY before opening a PR?

@tonistiigi
Copy link
Member

@AkihiroSuda What's the use case for specifying a controller on the client side. My impression was that there would be a single controller, solver, instructioncache. And multiple snapshot/worker/contenthash implementations. Constraints for finding a worker would be defined per vertex, not for build job. Is there anything I'm missing that makes it impossible?

@AkihiroSuda
Copy link
Member Author

My first idea was to incrementally add ExecuteVertex() to api/services/control and let the master daemon to call ExecuteVertex() with GRPC metadata against worker daemons.
But on the second thought I found this design is not straightforward; we should just set metadata against per vertex as you pointed out, and also we shouldn't put daemon-to-daemon RPCs to api/services/control.

I'll update my WIP branch.

@AkihiroSuda
Copy link
Member Author

opened #114 for initial per-vertex metadata

@AkihiroSuda
Copy link
Member Author

opened #160 for more detailed roadmap

@AkihiroSuda
Copy link
Member Author

Children issues:

@AkihiroSuda
Copy link
Member Author

AkihiroSuda commented Apr 23, 2019

Poor man's distributed mode is here: #956 (consistent hashing for each of build invocation)

@drizzd
Copy link

drizzd commented May 1, 2019

I am using Gitlab CI with the Kubernetes executor to create build pods in a Kuberentes cluster. I configure the builds to execute docker build on the Kubernetes node (by mounting /var/run/docker.sock hostpath) so that the builds can re-use the local Docker cache from previous builds on the same node. Kubernetes takes care of load balancing the build pods. It does not consider the actual resource usage, but at least it will distribute the pods evenly across nodes.

The trouble with this approach is that the very same build needs to run on each of the nodes until the build is cached on all of them. Note that I have many rebuilds without changes because we are using a monorepo/microservice structure.

The consistent hash approach is a possible solution. But I have some Dockerfiles which take much longer to build than others. The build for a given Dockerfile will always runs on the same buildkitd pod, and therefore the load will not be balanced across nodes. Furthermore, there is a chance that some nodes will not be used at all.

Instead, I would like to synchronize caches via the registry, as discussed in #723. I could run buildkitd as a daemonset and configure build pods to connect to the buildkitd which is running on the same node.

What do you think?

@AkihiroSuda
Copy link
Member Author

registry cache is already implemented https://github.com/moby/buildkit#tofrom-registry

@drizzd
Copy link

drizzd commented May 5, 2019

Yes, registry cache is implemented. I am just pointing out that it is not clear how to do the load balancing on Kubernetes. Is it best to run one buildkitd instance per node (i.e., daemonset)? Or does it make sense to have multiple instances per node?

Instead of the daemonset, each build pod could have a buildkitd and a buildctl container. The buildctl container would have to wait for the buildkitd container to become ready. I would also have to give up on local caches, but the bigger issue is that it seems like an abuse of the buildkit architecture.

This would be much easier, for example, with img, since I can run the build directly in a pod as a standalone container. But img does not support caching the build layers in a registry.

@drizzd
Copy link

drizzd commented May 5, 2019

Oh, and it would be awesome if buildkit could be used with the Horizontal Pod Autoscaler. Again, I think this only works if the build is executed in the context of the build pod.

Actually, I want to use buildkitd/buildctl just like I would use dockerd/docker-run. The daemon runs on each node in the Kubernetes cluster. And the resources used by each build are associated with the build, and not with the daemon process.

@AkihiroSuda
Copy link
Member Author

https://github.com/docker/buildx already has support for deploying BuildKit on Kubernetes.
Probably we can close this issue.

More info: https://www.slideshare.net/AkihiroSuda/kubeconus2019-docker-inc-booth-distributed-builds-on-kubernetes-with-buildkit-and-docker-buildx-196169581

goller pushed a commit to goller/buildkit that referenced this issue Oct 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants