simultaneous builds against the same builder drop status messages #4330
Comments
I've confirmed this. @jsternberg, before taking this on, check with some older buildkit to make sure this is not a regression.
When I do this, I get the following message on two of the builds:
When I do the same thing but use the shell to launch them all at the same time, I get 3 successful builds, so it seems the small delay from switching between terminals matters. What I'm not sure of is why you're not seeing canceled messages; this is probably influenced by the docker buildx CLI version. Mine is …. I just mostly want to confirm I'm seeing the same behavior. I still wouldn't expect to see …
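For reference, a minimal sketch of what launching them all at the same time from the shell can look like, assuming a builder named mybuilder and placeholder tags (neither name comes from this thread):

```sh
# Kick off three builds of the same context against one shared builder in parallel.
# "mybuilder" and the repro:* tags are hypothetical placeholders.
for i in 1 2 3; do
  docker buildx build --builder mybuilder --load -t repro:$i . &
done
wait  # wait for all three builds so their progress output can be compared
```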
Retried using …. I suspect what's happening is that buildkit is recognizing that the 3 builds are the same and is canceling two of them. It's then waiting for the other process to complete and then performing the load. From what I can see, the most recent version of these seems to print the canceled message. I'd like to confirm that this is consistent with what you're seeing before I continue, and to understand whether the current behavior is correct or something else should be happening.
When I run the same builds on Docker Desktop rather than an external builder, I don't see the canceled messages, so I believe the canceled messages must be getting dropped by the embedded version of buildkit in Docker.
I did not see any "canceled" messages. I tried with a container builder because I saw …
Starting the same build would not cancel a previous one. There are some edge cases where a step can become canceled if, after the step has already started, a new cache chain appears before it that makes the step cached. But that does not cancel the build. This was the Dockerfile I used:
To clarify, the build didn't get canceled; just the specific step was canceled. The build then went on to complete normally and the images were tagged. It seems the reason the second and third builds don't produce any status messages is that they're just waiting for the first build to finish.
Another data point. I modified the Dockerfile to …
For the …
Ah, I think I'm starting to understand what is going on. I think it is the "merging edges" case that is hit here and is confusing the progress: https://github.com/moby/buildkit/blob/master/docs/dev/solver.md#merging-edges . This means it could be more complicated than I initially thought. I need to look at the details and the actual code path later. @jsternberg I don't know what is causing the "canceled" condition for you. Presumably you are using a different Dockerfile.
That seems to be consistent with what you mentioned earlier:
We might be having a situation where the copy starts, then a cache chain shows up that invalidates the copy operation, so the copy gets canceled. It seems the cancel message is a bit flaky: it shows up sometimes and not at other times. What does seem consistent is that the status messages for future cached steps don't get sent if the cache entry shows up after the build has started. I tried randomizing the sleep:

```dockerfile
FROM golang:1.20-alpine
RUN sh -c 'sleep $(($RANDOM % 10))'
ADD . .
RUN go run main.go
RUN find . | wc -l > count.txt
```

When I do this, I don't see the canceled message on one of the builds (the other build completes as normal).
I'm pretty consistently able to trigger the bug with only 2 builds, so 3 seems to be unnecessary.
Possibly I have a faster computer in some way because I started working more recently? When I add the randomized sleep, I don't get the canceled message anymore. This seems to indicate that the canceled message happens because the copy operations are happening simultaneously; when they're not, it's just the cached status messages that are missing.
Talked with @tonistiigi a bit outside of the issue. This issue seems to happen due to the edge merging behavior. When edges get merged, the progress writers don't seem to be merged, so the original edge's progress writer never gets informed about the edge's status; it stays tied to the old edge, which is discarded. The reason canceled sometimes appears is that an in-progress edge can get canceled if the merging happens while it's running. The cancel message shows up because the progress writer is still tied to the original edge, but the actual build isn't using that edge anymore, so the cancel message is cosmetic and doesn't affect the build.
Reassigning this to @tonistiigi as he already has a PR that fixes this issue.
Repro steps:
Run the same build against the same builder from 3 terminal windows simultaneously
Expected result
All status messages show correctly in all windows.
Actual result
One window shows all the status messages.
The other two only show the first and last steps of the build. The intermediate cached status messages are not shown.
See attached screenshot
Additional info
Obviously the repro case is a bit contrived. In practice, this comes up more frequently if you have a build script that's using a remote or ssh builder.
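As a rough sketch of that kind of setup (the builder name and SSH endpoint below are placeholders, not details from this issue), a script driving a shared remote builder might look like:

```sh
# Create and select a builder backed by a remote Docker engine over SSH.
# "ci-builder" and the SSH endpoint are hypothetical placeholders.
docker buildx create --name ci-builder ssh://build@build-host.example.com
docker buildx use ci-builder

# A build script that starts the same build more than once against the shared
# builder hits the same situation as the manual three-terminal repro above.
docker buildx build -t app:ci-1 . &
docker buildx build -t app:ci-2 . &
wait
```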