Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

simultaneous builds against the same builder drop status messages #4330

Closed
nicks opened this issue Oct 13, 2023 · 11 comments · Fixed by #4347
Closed

simultaneous builds against the same builder drop status messages #4330

nicks opened this issue Oct 13, 2023 · 11 comments · Fixed by #4347
Assignees

Comments

@nicks
Copy link
Contributor

nicks commented Oct 13, 2023

Repro steps:

  • Create a Dockerfile with a big context (like 200MB is good)
  • Open 3 separate terminal windows
  • In each window, execute
docker build -t tmp --load .
docker build -t tmp2 --load .
docker build -t tmp3 --load .

simultaneously

Expected result
All status messages correctly show in all windows

Actual result
One window shows all the status messages.

The other two only shows the first and last step of the build. The intermediate cached status messages are not shown.

See attached screenshot
Screenshot from 2023-10-13 12-24-04

Additional info
Obviously the repro case is a bit contrived. In practice, this comes up more frequently if you have a build script that's using a remote or ssh builder.

@tonistiigi
Copy link
Member

I've confirmed this. @jsternberg before taking this on, check with some older buildkit to make sure this is not a regression.

@jsternberg
Copy link
Collaborator

When I do this, I get the following message on two of the builds:

=> [internal] load build context                                                                                                                                8.6s
 => => transferring context: 409.71MB                                                                                                                            8.5s
 => CANCELED [2/4] ADD . .

When I do the same thing but use the shell to launch them all at the same time, I get 3 successful builds. So it seems like the small delay between switching to different terminals matters.

What I'm not sure of is why you're not seeing canceled messages. This is probably influenced by the docker buildx cli version. Mine is github.com/docker/buildx v0.11.2-desktop.1 986ab6afe790e25f022969a18bc0111cff170bc2. You can find this by doing docker buildx version.

I just mostly want to confirm I'm seeing the same behavior. I still wouldn't expect to see CANCELED but, if an older version was somehow not printing this correctly, it might explain why the status messages are missing.

@jsternberg
Copy link
Collaborator

Retried using --load since I didn't in my initial test. When --load is used, I see the canceled message on two of the builds and then it waits until the last one completes. Then all 3 will load the tarball.

I suspect what's happening is that buildkit is recognizing the 3 builds are the same and is canceling two of them. It's then waiting for the other process to complete and then performing the load. From what I can see, the most recent version of these seems to print the canceled message. I'd like to confirm that this is consistent with what you're seeing before I continue and understand if the current behavior is correct or if something else should be happening.

@jsternberg
Copy link
Collaborator

When I run the same builds on docker desktop rather than an external builder, I don't see the canceled messages so I believe the canceled messages must be getting dropped from the embedded version of buildkit in docker.

@tonistiigi
Copy link
Member

I did not see any "canceled" messages. I tried with container builder because I saw --load being used in repro but I just checked with docker driver and got the same output.

I suspect what's happening is that buildkit is recognizing the 3 builds are the same and is canceling two of them.

Starting same build would not cancel a previous one. There are some edge cases where a step can become canceled when after the step already started a new cache chain appears before it that makes the step cached. But that does not cancel the build.

This was the Dockerfile I used:

from golang:alpine
workdir /src
copy . .
run go env > env.txt
run go env GOCACHE > cachepath.txt

@jsternberg
Copy link
Collaborator

To clarify, the build didn't get canceled. Just the specific step was canceled. The build then went on to complete normally and the images were tagged. It seems the reason why the second and third build don't produce any status messages are because they're just waiting for the first build to finish.

@tonistiigi
Copy link
Member

Another data point. I modified the Dockerfile to

from golang:alpine
workdir /src
run sleep 3
copy . .
run go env > env.txt
run sleep 2 && go env GOCACHE > cachepath.txt

For the workdir and run sleep 3 I always get the progress for all builds (even if all builds are running much before these commands start). But it is the COPY . that breaks it and the commands after it. Context transfer itself is also always on all outputs.

@tonistiigi
Copy link
Member

Ah, I think I'm starting to understand what is going on. I think it is the "merging edges" case that is hit in here that is confusing the progress https://github.com/moby/buildkit/blob/master/docs/dev/solver.md#merging-edges . This means that is could be more complicated than I initially thought. Need to look for details and actual code path later.

@jsternberg I don't know what is causing the "canceled" condition for you. Presumably, you are using a different Dockerfile.

@jsternberg
Copy link
Collaborator

That seems to be consistent with this thing you mentioned earlier:

There are some edge cases where a step can become canceled when after the step already started a new cache chain appears before it that makes the step cached. But that does not cancel the build.

We might be having a situation where the copy starts and a cache chain shows up that invalidates the copy operation so the copy gets canceled. It seems the cancel message is either a bit flaky or shows up sometimes and doesn't in other times. What seems to be consistent is that the status messages for future cached steps don't get sent if the cache entry shows up after the build has started.

I tried randomizing the run sleep 3 step with this:

FROM golang:1.20-alpine
RUN sh -c 'sleep $(($RANDOM % 10))'
ADD . .
RUN go run main.go
RUN find . | wc -l > count.txt

When I do this, I don't see the canceled message on one of the builds (the other build completes as normal).

 => [internal] load build context                                                                                                                              5.6s
 => => transferring context: 409.71MB                                                                                                                          5.4s
 => [2/5] RUN sh -c 'sleep $(($RANDOM % 10))'                                                                                                                  0.8s

I'm pretty consistently able to trigger the bug with only 2 builds so 3 seems to be unnecessary.

@jsternberg I don't know what is causing the "canceled" condition for you. Presumably, you are using a different Dockerfile.

Possibly I have a faster computer in some way because I started working more recently? When I add the randomized sleep, I don't get the canceled message anymore. This seems to indicate to me that the canceled message happens because the copy operations are happening simultaneously. When they're not, it's just missing the cached status messages.

@jsternberg
Copy link
Collaborator

Talked with @tonistiigi a bit outside of the issue. This issue seems to happen due to the edge merging behavior. When edges get merged, the progress writers don't seem to be merged so the original edge's progress writer never gets informed about the edge's status. It's tied to the old edge which is discarded.

The reason why canceled sometimes appears is because an in-progress edge can get canceled if the merging happens while it's running. The cancel message shows up because the progress writer is still tied to the original edge, but the actual build isn't using that edge anymore so the cancel message is cosmetic and doesn't affect the build.

@jsternberg
Copy link
Collaborator

Reassigning this to @tonistiigi as he already has a PR that fixes this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants