-
Notifications
You must be signed in to change notification settings - Fork 259
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docker build freeze at exportLayer phase #696
Comments
@jstarks Sorry for pinging. You are the original author of this code, could you please provide some insight? |
I think we need to understand why os.RemoveAll is not returning. What version of Go is this? |
@jstarks The problem can be reproduced with the official binary for docker. The problem looks similar to: I basically looked at source code, and saw that the only relevant code in docker-ci-zap was the DestroyLayer call that I added in my patch. |
@jhowardmsft any insight? Looks weird to me that nobody is aware of this issue, as this is pretty easy to reproduce for me on two different machines. |
hi @jterry75 I see you are new main developer of hcsshim component. Do you have insight on this issue? Its sounds weird to me that I am the only one to see this issue. My assumption is that this is a kernel issue. Is there any channel were to report this issue with the kernel team? |
FWIW, I can repro the issue on both v10.0.19603.1000 and v10.0.19608.1000. Like @tardyp suggested, docker-ci-zap.exe on the hcs folders inside C:\ProgramData\Docker\tmp resolves the apparent race condition and allows image building to resume. I had no idea which hcs folder to remove, so I just tried randomly until it worked. LMK if I can provide any further info to help debug the root cause. |
I'm having the same issue. |
Same issue here |
Please, please fix this... |
Fix: microsoft#696 Signed-off-by: Pierre Tardy <[email protected]>
Fix: microsoft#696 Signed-off-by: Pierre Tardy <[email protected]>
Fix: microsoft#696 Signed-off-by: Pierre Tardy <[email protected]>
Fix: microsoft#696 Signed-off-by: Pierre Tardy <[email protected]>
Fix: microsoft#696 Signed-off-by: Pierre Tardy <[email protected]>
Fix: microsoft#696 Signed-off-by: Pierre Tardy <[email protected]>
Fix: microsoft#696 Signed-off-by: Pierre Tardy <[email protected]>
Fix: microsoft#696 Signed-off-by: Pierre Tardy <[email protected]>
Fix: microsoft#696 Signed-off-by: Pierre Tardy <[email protected]>
Fix: microsoft#696 Signed-off-by: Pierre Tardy <[email protected]>
Fix: microsoft#696 Signed-off-by: Pierre Tardy <[email protected]>
Same issue here. After running
|
Repros on docker engine 20.10.11 running on Windows 10 Enterprise 18363.1977 |
This still repros on 20.10.23 installing MSYS2. |
Based on microsoft/Windows-Containers#213 (comment), it looks like the underlying issue here is actually a Go issue, in that So it should be possible to reproduce this hang with a small Go binary that just calls The proposed fix in that comment is replacing If that's not the case, then perhaps there is a problem in the layer-export flow that is creating an invalid filename in the temporary staging area, and that would need to be fixed. However, since apparently the export works when you use docker-ci-zap/ |
This is what worked for me. After that I put a long sleep in after the MSYS2 install and I haven't been able to replicate it. I'm thinking it might be the OS (or some process) holding on to a lock on a file? What does Go do in the case that it can't delete something in |
It's possible both things are true, and there's a badly-named temporary file that gets cleaned up while you sleep, so doesn't end up in the export directory. tl;dr: I think the underlying problem is an invalid-UTF16-named file triggering an unterminating loop in golang/go#36375 is the upstream bug for " I'm not sure, but I wonder if the supposed invalid filename causes this loop in So this actually might be a Golang stdlib bug when trying to delete large, undeleteable directories... Edit: But I couldn't reproduce a hang with 1030 locked files in a directory. Looking more closely, I can't see a way to make this loop endless, unless you can have Which actually does point back at an "invalid filename" being able to cause this, I guess? Not sure how to reproduce this. Looking into this, if
golang/go#36375 is the hang I'm talking about, in this case triggered by long paths, and best-solved with Go 1.17 and "modern" Windows (1607 or later, so all the versions we care about are "modern"), and possibly a linker flag for the Go binary (unless it's set by-default?). I also just noticed that the PR contains an example invalid filename, so maybe such a file does break Hmm, on consideration, I suspect that if the filename is invalid UTF-16, then whatever |
tl;dr: It's an underlying cygwin/msys bug when pacman.exe tries to delete itself during upgrade. Aaaaaaargh. Yup, repro'd it and under Process Monitor, you can see Go trying to delete a file repeatedly with a faulty name from the recycle bin (in vim says it's FROM mcr.microsoft.com/windows/servercore:ltsc2022
RUN @powershell -NoProfile -ExecutionPolicy unrestricted -Command "(iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1')))"
RUN choco install -y msys2 Windows 11, Docker Desktop 4.16.1, fresh pull of the base image. Which actually means this is another symptom of moby/moby#32838 (comment) (see hcsshim/internal/wclayer/legacy.go Lines 113 to 120 in 144a587
And that previous occurrence was also with msys2, but not chocolatey. So I guess the msys2 first I wonder if a simpler fix (i.e. in hcsshim) would be to not include the Recycle bin in the staging directory, or manually remove it from the staging dir before export, since we don't export it anyway? (Not sure that's possible though, without messing with timestamps etc. Maybe targeted It might be worth trying to work out what msys2 is doing that creates this file in the first place, but that's fixing the known-trigger, not the underlying cause... Oh, FFS. It's an msys2/cygwin-specific bug: docker/for-win#8910 and msys2/MSYS2-packages#1253. It's putting deliberately-invalidly-named files directly into the Recycle bin's folder. MSYS2 implementation at https://github.com/msys2/msys2-runtime/blob/9acb704afe9d9d8a090f2391e2bb88add63997c6/winsup/cygwin/syscalls.cc#L299-L322, Cygwin (different prefix) at https://github.com/cygwin/cygwin/blob/a97fbb58e209586f92381e90aaec8fb211287938/winsup/cygwin/syscalls.cc#L299-L322. I am cross about this, it was first reported (AFAICT) to msys2 just short of five years ago, and was marked as "expected behaviour". Supposedly it shouldn't be happening on Windows 10 1809 or later under Cygwin 3.0 (per msys2/MSYS2-packages#1253 (comment) and https://cygwin.com/cygwin-ug-net/ov-new.html#ov-new3.0) but clearly it is. Poking through the cygwin code, at a minimum running executables end-up falling back to Technically, a Go bug that it hangs when it sees this file as part of Elevated |
I had a sneaking suspicion it was something like this. Nice work tracking it all down!
Also, I can imagine. That's just 😵💫 |
I also don't understand why it's using the recycling bin to delete temporary files. |
I will say I 100% did not expect this to turn out to be an actual msys/cygwin bug; I expected to find msys triggering some other bug somewhere. Like, who just writes invalidly-unicode-named files to disk for no particular reason? My guess is that they picked the Recycle Bin because one exists on each drive (so they can use rename to do the move) while The directory isn't really a problem (it's weird, but okay); it's the filename that just breaks random things in the world. Poking around, I noticed node.js also has an open bug report for being unable to handle files with invalid UTF-16 on Windows. I imagine anything cross-platform using UTF-8 to hold file paths internally will have exactly this problem. Side-note: I suspect this is the most-trivial workaround: FROM mcr.microsoft.com/windows/servercore:ltsc2022
RUN @powershell -NoProfile -ExecutionPolicy unrestricted -Command "(iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1')))"
RUN choco install -y msys2 && del /f /s /q "C:\$Recycle.Bin" I wonder if it's worth requesting that the chocolatey msys2 installer maintainers attempt to delete files starting with Anyway, barring a surprise fix in Go or Cygwin/MSYS, I think the best long-term fix/workaround will be microsoft/go-winio#261. Potentially making the It might be possible to prevail upon Cygwin/MSYS to just use |
OMG - I've had to workaround this issue for years by manually firing up another shell beside the process running docker build and deleting crap from its /c/ProgramData/Docker/tmp directory. This trick of doing && del /f /s /q "C:$Recycle.Bin" at the end of your msys install (whether by choco or mysys installer directly) seems to work like a charm!!! |
Good news: golang/go#59971 has a proposed Go-level fix, which I tested with the following Dockerfile: FROM mcr.microsoft.com/windows/servercore:ltsc2022
RUN @powershell -NoProfile -ExecutionPolicy unrestricted -Command "(iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1')))"
RUN ECHO "NOHANG" && choco install -y msys2 (The To test, I built dockerd.exe from Go 1.20.4:
They're tagged unsupported because these were local builds (not inside the build container) and because I had to hack on a couple of build scripts to work with PowerShell 7 and to use I slipped each build under Docker Desktop as Anyway, the Go 1.20.4 build reproduced the hang. I didn't check Process Monitor (I was cooking noodles, so it had like 10 minutes to unstick itself, and didn't), but removing the problematic files from
The
So yeah, not actually a hcsshim bug after all, just something likely to affect anyone calling Oh, blast. I just realised I could have tested the fix with just |
The fix has indeed landed in Go 1.21.0, so it's up to hcsshim downstreams to update to Go 1.21 to apply the fix this issue.
|
Use chocolatey to install VisualStudio VC build tools. Install msys2 and handle post-install hang: microsoft/hcsshim#696
This is follow-up of debug I did on docker/for-win#3229
It seems that my issue is slightly different from that one actually.
Setup
How to reproduce
When building a large docker container:
Actually my setup is a bit more complicated, as I am building behind corporate proxies, and I have to customize both choco and msys2 package to work in my env.
And.. I have no way to work without the proxies.
I have the same issue with other large docker images (dockerizing Matlab is the first usecase I have been experiencing this with, but It is even less reproducible in an open-source collaboration environment)
Symptom
When trying to build this container, the docker build freezes just after finishing the "RUN choco install msys2" command.
After many tries, I had some times where the build actually finished pretty much instantly.
I have been trying to reprod this with a simpler setup (a dockerfile creating thousands of file), but was unable to do so.
So I don't know exactly what triggers what happens to be a race condition
After doing some stack-traces, I observe that the code is stuck in
hcsshim/internal/wclayer/exportlayer.go
Line 74 in bd9b255
This code is using exportLayer syscall from the winfilter directory to the tmp directory.
Then when the tarfile has been produced, it will remove the tmp directory (like
\\\\?\\C:\\ProgramData\\docker\\tmp\\hcs425012433
) version of the layer.Workaround
After a bit of back and forth, I did try docker-ci-zap.exe on the hcs425012433 folder, then the docker build command will instantly unfreeze.
So I hacked a new dockerd-dev.exe using following patch.
I have no idea if this is the right solution for this problem or this is rather an issue with the windows kernel.
I can submit that patch as a PR if requested
The text was updated successfully, but these errors were encountered: