
docker build freeze at exportLayer phase #696

Open
tardyp opened this issue Sep 18, 2019 · 22 comments

@tardyp

tardyp commented Sep 18, 2019

This is follow-up of debug I did on docker/for-win#3229

It seems that my issue is slightly different from that one actually.

Setup

  • OS: Windows Server 2019 (headless)
  • docker version: 19.03.2 (client and server)

How to reproduce

When building a large docker container:

FROM mcr.microsoft.com/windows/servercore
RUN @powershell -NoProfile -ExecutionPolicy unrestricted -Command "(iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1')))" 
RUN choco install msys2

My actual setup is a bit more complicated: I am building behind corporate proxies and have to customize both the choco and msys2 packages to work in my environment. And I have no way to work without the proxies.

I have the same issue with other large docker images (dockerizing Matlab is the first use case I experienced this with, but that one is even less reproducible in an open-source collaboration environment).

Symptom

When trying to build this container, the docker build freezes just after finishing the "RUN choco install msys2" command.

After many tries, I had some runs where the build actually finished pretty much instantly.

I have been trying to reproduce this with a simpler setup (a Dockerfile creating thousands of files), but was unable to do so.

So I don't know exactly what triggers what appears to be a race condition.

After doing some stack-traces, I observe that the code is stuck in

os.RemoveAll(r.root)

os.RemoveAll(0xc000d1c000, 0x2a, 0x0, 0x0)
        C:/.GOROOT/src/os/path.go:67 +0x3c
github.com/docker/docker/vendor/github.com/Microsoft/hcsshim/internal/wclayer.(*legacyLayerReaderWrapper).Close(0xc0000ba980, 0xc0000ba980, 0x2546fe0)
        C:/go/src/github.com/docker/docker/vendor/github.com/Microsoft/hcsshim/internal/wclayer/exportlayer.go:74 +0x95
github.com/docker/docker/daemon/graphdriver/windows.(*Driver).exportLayer.func1.1(0x5f8, 0xc00078e000)
        C:/go/src/github.com/docker/docker/daemon/graphdriver/windows/windows.go:672 +0x120

This code uses the exportLayer syscall to export the layer from the windowsfilter directory to a tmp directory.
Then, once the tarfile has been produced, it removes the tmp copy of the layer (e.g. \\?\C:\ProgramData\docker\tmp\hcs425012433).

Workaround

After a bit of back and forth, I tried docker-ci-zap.exe on the hcs425012433 folder, and the docker build command instantly unfroze.

So I hacked a new dockerd-dev.exe using the following patch.

diff --git a/internal/wclayer/exportlayer.go b/internal/wclayer/exportlayer.go
index 0425b33..0753ff2 100644
--- a/internal/wclayer/exportlayer.go
+++ b/internal/wclayer/exportlayer.go
@@ -71,6 +71,10 @@ type legacyLayerReaderWrapper struct {

 func (r *legacyLayerReaderWrapper) Close() error {
        err := r.legacyLayerReader.Close()
+       // if the layer is not Destroyed at hcs level before removing
+       // we might enter in a race-condition for large containers
+       // which end-up in a hang of the os.RemoveAll() call
+       DestroyLayer(r.root)
        os.RemoveAll(r.root)
        return err
 }

I have no idea if this is the right solution for this problem, or whether this is rather an issue with the Windows kernel.

I can submit that patch as a PR if requested.

@tardyp
Author

tardyp commented Sep 26, 2019

@jstarks Sorry for pinging. You are the original author of this code, could you please provide some insight?

@jstarks
Member

jstarks commented Sep 26, 2019

I think we need to understand why os.RemoveAll is not returning. What version of Go is this?

@tardyp
Author

tardyp commented Sep 26, 2019

@jstarks The problem can be reproduced with the official docker binary.
Trying to remove the folder with PowerShell leads to the same hang (rm won't return),
so there must be some OS-level locking involved that is beyond my knowledge.

The problem looks similar to:
docker/for-win#1544
and moby/moby#26873
which suggest the use of docker-ci-zap by @jhowardmsft .

I basically looked at the source code and saw that the only relevant code in docker-ci-zap was the DestroyLayer call that I added in my patch.

@tardyp
Author

tardyp commented Oct 2, 2019

@jhowardmsft any insight? It looks weird to me that nobody else is aware of this issue, as it is pretty easy for me to reproduce on two different machines.

@tardyp
Author

tardyp commented Jan 9, 2020

Hi @jterry75, I see you are the new main developer of the hcsshim component. Do you have any insight into this issue?
Do you have a validation team testing containers with lots of files?

It sounds weird to me that I am the only one seeing this issue.

My assumption is that this is a kernel issue. Is there any channel where I can report this issue to the kernel team?

@CarlosMendonca

FWIW, I can repro the issue on both v10.0.19603.1000 and v10.0.19608.1000. Like @tardyp suggested, docker-ci-zap.exe on the hcs folders inside C:\ProgramData\Docker\tmp resolves the apparent race condition and allows image building to resume. I had no idea which hcs folder to remove, so I just tried randomly until it worked.

LMK if I can provide any further info to help debug the root cause.

@Weltgeist

I'm having the same issue.

@131

131 commented Dec 16, 2020

Same issue here

@tprk77

tprk77 commented Jan 11, 2021

Please, please fix this...

tardyp added a commit to tardyp/hcsshim that referenced this issue Jan 13, 2021
tardyp added a commit to tardyp/hcsshim that referenced this issue Jan 13, 2021
TBBle pushed a commit to TBBle/hcsshim that referenced this issue Jan 22, 2021
TBBle pushed a commit to TBBle/hcsshim that referenced this issue Jan 22, 2021
TBBle pushed a commit to TBBle/hcsshim that referenced this issue Jan 22, 2021
TBBle pushed a commit to TBBle/hcsshim that referenced this issue Feb 11, 2021
TBBle pushed a commit to TBBle/hcsshim that referenced this issue Feb 11, 2021
TBBle pushed a commit to TBBle/hcsshim that referenced this issue Feb 11, 2021
TBBle pushed a commit to TBBle/hcsshim that referenced this issue Feb 17, 2021
TBBle pushed a commit to TBBle/hcsshim that referenced this issue Feb 17, 2021
TBBle pushed a commit to TBBle/hcsshim that referenced this issue Feb 20, 2021
@dry4ng

dry4ng commented Mar 24, 2021

Same issue here. After running choco install -y git in a mcr.microsoft.com/dotnet/framework/sdk:3.5-windowsservercore-1909 container, the build hangs for 5 mins before proceeding.

Chocolatey installed 3/3 packages.
 See the log for details (C:\ProgramData\chocolatey\logs\chocolatey.log).
<hangs for 5 mins>
Removing intermediate container 4debda07c57c

@mwfriedm

mwfriedm commented Jan 5, 2022

Repros on docker engine 20.10.11 running on Windows 10 Enterprise 18363.1977

@pbrubaker

This still repros on 20.10.23 installing MSYS2.

@TBBle
Contributor

TBBle commented Mar 21, 2023

Based on microsoft/Windows-Containers#213 (comment), it looks like the underlying issue here is actually a Go issue, in that os.RemoveAll hangs when it hits something in the msys2 install, probably an invalid filename.

So it should be possible to reproduce this hang with a small Go binary that just calls os.RemoveAll on an msys2 install, I think, outside of containers.
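
If that theory holds, a minimal sketch like the following (my assumption, not tested code from this thread) should reproduce it outside containers: build it with an affected (pre-1.21) Go toolchain and point it at a directory tree containing the offending msys2-created files.

// removeall_repro.go - hypothetical standalone repro harness, not part of hcsshim.
package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s <directory-to-delete>", os.Args[0])
	}
	dir := os.Args[1]
	fmt.Println("removing", dir)
	// With an invalid-UTF-16 filename somewhere under dir, this call is
	// where the hang would be expected on affected Go versions.
	if err := os.RemoveAll(dir); err != nil {
		log.Fatalf("RemoveAll failed: %v", err)
	}
	fmt.Println("done")
}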

The proposed fix in that comment is replacing os.RemoveAll with a hand-coded os.RemoveAll-alike which doesn't get stuck. That is roughly what the DestroyLayer workaround in the initial commit (and using docker-ci-zap, which is a small wrapper around DestroyLayer) is doing, but way down at the OS level, and mostly by coincidence, since it doesn't confirm that you've actually given it a WCOW container image layer to destroy.

If that's not the case, then perhaps there is a problem in the layer-export flow that is creating an invalid filename in the temporary staging area, and that would need to be fixed. However, since apparently the export works when you use docker-ci-zap/DestroyLayer, I assume that's not the case.

@pbrubaker

However, since apparently the export works when you use docker-ci-zap/DestroyLayer, I assume that's not the case.

This is what worked for me. After that I put a long sleep in after the MSYS2 install and I haven't been able to replicate it. I'm thinking it might be the OS (or some process) holding on to a lock on a file? What does Go do in the case that it can't delete something in os.RemoveAll?

@TBBle
Contributor

TBBle commented Mar 26, 2023

It's possible both things are true, and there's a badly-named temporary file that gets cleaned up while you sleep, so doesn't end up in the export directory.


tl;dr: I think the underlying problem is an invalid-UTF16-named file triggering a non-terminating loop in os.RemoveAll, when it assumes No such file (caused by failing to round-trip the invalid UTF16 to string and back) means No such file (caused by the file already having been deleted).

golang/go#36375 is the upstream bug for "No such file for the wrong reasons causes os.RemoveAll to hang".


I'm not sure, but I wonder if the supposed invalid filename causes this loop in os.RemoveAll to keep attempting to delete the same invalid file, not break out of the loop, and then try again. I'm not certain that's possible though, unless there's more than 1024 undeletable files/directories in the one place, in which case this block will never be entered, and the loop will repeatedly fail to delete the same 1024 files endlessly.

So this actually might be a Golang stdlib bug when trying to delete large, undeleteable directories...

Edit: But I couldn't reproduce a hang with 1030 locked files in a directory. Looking more closely, I can't see a way to make this loop endless, unless you can have os.Remove return nil or IsNotExist for a file that does exist, in which case I think this code will loop endlessly given a directory with 1024 or more files after that file in the directory list.

Which actually does point back at an "invalid filename" being able to cause this, I guess? Not sure how to reproduce this.

Looking into this, if fixLongPath doesn't do its job for whatever reason, it's possible that the DeleteFileW call can return ERROR_FILE_NOT_FOUND if given a long file name in non-universal format.

fixLongPath has a bunch of early-out cases (although none of them are obviously applicable here), so I wonder if it's just that the MSYS installer has a very deep directory tree (again, possibly in some temporary state), and when combined with os.MkdirTemp("", "hcs") it causes fixLongPath to early-out, and breaks DeleteFileW.

golang/go#36375 is the hang I'm talking about, in this case triggered by long paths, and best-solved with Go 1.17 and "modern" Windows (1607 or later, so all the versions we care about are "modern"), and possibly a linker flag for the Go binary (unless it's set by-default?).

I also just noticed that the PR contains an example invalid filename, so maybe such a file does break DeleteFileW into returning ERROR_FILE_NOT_FOUND, in which case the same hang seen in golang/go#36375 would occur.

Hmm, on consideration, I suspect that if the filename is invalid UTF-16, then whatever string value comes out of fd.Readdirnames won't match the file on disk, producing the ERROR_FILE_NOT_FOUND result and hence triggering the hang. So nothing to do with fixLongPath.
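
To illustrate the round-trip failure I mean (this is my assumed mechanism, demonstrated with the same unpaired low surrogates as the observed prefix): decoding the on-disk name to a Go string substitutes U+FFFD for each unpaired surrogate, so the re-encoded name no longer matches the file on disk, and a delete by that name gets ERROR_FILE_NOT_FOUND.

package main

import (
	"fmt"
	"unicode/utf16"
)

func main() {
	// UTF-16 code units of a name like the one msys2 leaves behind:
	// '.' followed by unpaired low surrogates (invalid UTF-16).
	onDisk := []uint16{'.', 0xDC6D, 0xDC73, 0xDC79, 0xDC73}

	// Decoding to a Go string replaces each unpaired surrogate with U+FFFD...
	asString := string(utf16.Decode(onDisk))

	// ...so encoding the string back to UTF-16 does not reproduce the
	// original code units, and a delete by this name misses the real file.
	roundTripped := utf16.Encode([]rune(asString))

	fmt.Printf("on disk:       %x\n", onDisk)
	fmt.Printf("round-tripped: %x\n", roundTripped)
}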

@TBBle
Contributor

TBBle commented Mar 26, 2023

tl;dr: It's an underlying cygwin/msys bug when pacman.exe tries to delete itself during upgrade. Aaaaaaargh.


Yup, repro'd it, and under Process Monitor you can see Go repeatedly trying to delete a file with a faulty name from the recycle bin (in C:\$Recycle.Bin\S-1-5-93-2-1, i.e. ContainerAdministrator's recycle bin).

[Process Monitor screenshot: repeated failed delete attempts on the invalidly-named file]

vim says it's .<dc6d><dc73><dc79><dc73>000100000000061201256577a4f815a1. That's the same prefix as the example from microsoft/go-winio#261, but that might be a recycle-bin thing? In my case, it's the pacman.exe. I wonder if that's always the case?

FROM mcr.microsoft.com/windows/servercore:ltsc2022
RUN @powershell -NoProfile -ExecutionPolicy unrestricted -Command "(iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1')))" 
RUN choco install -y msys2

Windows 11, Docker Desktop 4.16.1, fresh pull of the base image.

Which actually means this is another symptom of moby/moby#32838 (comment) (see

// Indirect fix for https://github.com/moby/moby/issues/32838#issuecomment-343610048.
// Handle failure from what may be a golang bug in the conversion of
// UTF16 to UTF8 in files which are left in the recycle bin. Os.Lstat
// which is called by filepath.Walk will fail when a filename contains
// unicode characters. Skip the recycle bin regardless which is goodness.
if strings.EqualFold(path, filepath.Join(r.root, `Files\$Recycle.Bin`)) && info.IsDir() {
return filepath.SkipDir
}
); we're already skipping such files in the exported image as they caused a direct failure, but their mere presence in the staging directory is triggering a latent Go stdlib bug (infinite loop) during staging area cleanup.

And that previous occurrence was also with msys2, but not chocolatey. So I guess msys2's first pacman -Syu call is putting invalidly-named (or at least, Go-cannot-handle-them-ly-named) files in the recycle bin.

I wonder if a simpler fix (i.e. in hcsshim) would be to not include the Recycle bin in the staging directory, or manually remove it from the staging dir before export, since we don't export it anyway? (Not sure that's possible though, without messing with timestamps etc. Maybe targeted hcsshim.DestroyLayer for the recycle bin before export would replace the above fix too.)

It might be worth trying to work out what msys2 is doing that creates this file in the first place, but that's fixing the known-trigger, not the underlying cause...

Oh, FFS. It's an msys2/cygwin-specific bug: docker/for-win#8910 and msys2/MSYS2-packages#1253. It's putting deliberately-invalidly-named files directly into the Recycle bin's folder.

MSYS2 implementation at https://github.com/msys2/msys2-runtime/blob/9acb704afe9d9d8a090f2391e2bb88add63997c6/winsup/cygwin/syscalls.cc#L299-L322, Cygwin (different prefix) at https://github.com/cygwin/cygwin/blob/a97fbb58e209586f92381e90aaec8fb211287938/winsup/cygwin/syscalls.cc#L299-L322.

I am cross about this, it was first reported (AFAICT) to msys2 just short of five years ago, and was marked as "expected behaviour".

Supposedly it shouldn't be happening on Windows 10 1809 or later under Cygwin 3.0 (per msys2/MSYS2-packages#1253 (comment) and https://cygwin.com/cygwin-ug-net/ov-new.html#ov-new3.0) but clearly it is. Poking through the cygwin code, at a minimum running executables end-up falling back to try_to_bin, which is why I'm only seeing pacman.exe in that directory, as it probably tried to delete itself during an upgrade.

Technically, it's a Go bug that it hangs when it sees this file as part of os.RemoveAll, but PowerShell also hits a System.Text.EncoderFallbackException trying to interact with it on the command line, so I can't be too harsh.

Elevated cmd can happily delete that one file, and unstick the export.

@pbrubaker

pbrubaker commented Mar 26, 2023

tl;dr: It's an underlying cygwin/msys bug when pacman.exe tries to delete itself during upgrade. Aaaaaaargh.

Yup, repro'd it and under Process Monitor, you can see Go trying to delete a file repeatedly with a faulty name from the recycle

I had a sneaking suspicion it was something like this. Nice work tracking it all down!

I am cross about this, it was first reported (AFAICT) to msys2 just short of five years ago, and was marked as "expected behaviour".

Also, I can imagine. That's just 😵‍💫

@pbrubaker

I also don't understand why it's using the recycle bin to delete temporary files.

@TBBle
Contributor

TBBle commented Mar 26, 2023

I will say I 100% did not expect this to turn out to be an actual msys/cygwin bug; I expected to find msys triggering some other bug somewhere. Like, who just writes invalidly-unicode-named files to disk for no particular reason?

My guess is that they picked the Recycle Bin because one exists on each drive (so they can use rename to do the move) while %TEMP% only exists on one drive; these are the only directories I can think of that will be cleaned up by system processes, unless you leverage something like the registry keys that can delete files on next boot.

The directory isn't really a problem (it's weird, but okay); it's the filename that just breaks random things in the world. Poking around, I noticed node.js also has an open bug report for being unable to handle files with invalid UTF-16 on Windows. I imagine anything cross-platform using UTF-8 to hold file paths internally will have exactly this problem.

Side-note: I suspect this is the most-trivial workaround:

FROM mcr.microsoft.com/windows/servercore:ltsc2022
RUN @powershell -NoProfile -ExecutionPolicy unrestricted -Command "(iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1')))"
RUN choco install -y msys2 && del /f /s /q "C:\$Recycle.Bin"

I wonder if it's worth requesting that the chocolatey msys2 installer maintainers attempt to delete files starting with .<dc6d><dc73><dc79><dc73> from the current user's recycle bin after running the "system update" steps.

Anyway, barring a surprise fix in Go or Cygwin/MSYS, I think the best long-term fix/workaround will be microsoft/go-winio#261. Potentially making the os.RemoveAll work like that PR on Windows (i.e. upstreaming it) would work too, but that has a longer delivery time and probably more inertia to overcome.

It might be possible to prevail upon Cygwin/MSYS to just use .cyg and .msys everywhere (like they do on SMB shares, which probably don't suffer this sort of nonsense) but again, this has lead-time before users benefit, and no one seemed interested in fixing it five years ago when it was first raised. That also only fixes this case, although I don't know if anyone's ever reported this issue with something other than pacman in MSYS, so hopefully no one else in the world is sticking non-paired surrogate code-points into files we might try and delete from Go.

@LewisPringle

OMG - I've had to work around this issue for years by manually firing up another shell beside the process running docker build and deleting crap from its /c/ProgramData/Docker/tmp directory.

This trick of doing && del /f /s /q "C:\$Recycle.Bin" at the end of your msys install (whether by choco or the msys installer directly) seems to work like a charm!!!

@TBBle
Contributor

TBBle commented May 6, 2023

Good news: golang/go#59971 has a proposed Go-level fix, which I tested with the following Dockerfile:

FROM mcr.microsoft.com/windows/servercore:ltsc2022
RUN @powershell -NoProfile -ExecutionPolicy unrestricted -Command "(iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1')))"
RUN ECHO "NOHANG" && choco install -y msys2

(The ECHO is there so I can invalidate just that line from the build cache)

To test, I built dockerd.exe from v20.10.23 (i.e. the version used by the Docker Desktop version I happened to have installed) with both Go 1.20.4 (latest release) and Go with the relevant fix. This produced the following docker version info:

Go 1.20.4:

Server: Docker Desktop 4.17.1 (101757)
 Engine:
  Version:          0.0.0-dev
  API version:      1.41 (minimum version 1.24)
  Go version:       go1.20.4
  Git commit:       6051f14291-unsupported
  Built:            05/06/2023 05:02:47
  OS/Arch:          windows/amd64
  Experimental:     true

gotip download 493036:

Server: Docker Desktop 4.17.1 (101757)
 Engine:
  Version:          0.0.0-dev
  API version:      1.41 (minimum version 1.24)
  Go version:       devel go1.21-373489978b Fri May 5 18:29:34 2023 +0200
  Git commit:       6051f14291-unsupported
  Built:            05/06/2023 05:00:53
  OS/Arch:          windows/amd64
  Experimental:     true

They're tagged unsupported because these were local builds (not inside the build container) and because I had to hack on a couple of build scripts to work with PowerShell 7 and to use gotip.

I slipped each build under Docker Desktop as C:\Program Files\Docker\Docker\resources\dockerd.exe because I'm too lazy to configure a dockerd.exe running myself.

Anyway, the Go 1.20.4 build reproduced the hang. I didn't check Process Monitor (I was cooking noodles, so it had like 10 minutes to unstick itself, and didn't), but removing the problematic files from C:\ProgramData\Docker\tmp\hcs4184426182\Files\$Recycle.Bin\S-1-5-93-2-1\ immediately unhung it.

C:\Users\paulh\Documents\GoRemoveAll
> docker build -t msys120 .\context\
Sending build context to Docker daemon  2.048kB
Step 1/3 : FROM mcr.microsoft.com/windows/servercore:ltsc2022
 ---> d84fae21da7a
Step 2/3 : RUN @powershell -NoProfile -ExecutionPolicy unrestricted -Command "(iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1')))"
 ---> Using cache
 ---> 256ff2405873
Step 3/3 : RUN ECHO "HANG" && choco install -y msys2
 ---> Running in db0445b3d5e8
"HANG"
Chocolatey v1.3.1
...
Chocolatey installed 3/3 packages.
 See the log for details (C:\ProgramData\chocolatey\logs\chocolatey.log).
Removing intermediate container db0445b3d5e8
 ---> 819c033a0f2a
Successfully built 819c033a0f2a
Successfully tagged msys120:latest

The gotip download 493036 build did not hang and completed without complaint.

C:\Users\paulh\Documents\GoRemoveAll
> docker build -t msys493036 .\context\
Sending build context to Docker daemon  2.048kB
Step 1/3 : FROM mcr.microsoft.com/windows/servercore:ltsc2022
 ---> d84fae21da7a
Step 2/3 : RUN @powershell -NoProfile -ExecutionPolicy unrestricted -Command "(iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1')))"
 ---> Using cache
 ---> 256ff2405873
Step 3/3 : RUN ECHO "NOHANG" && choco install -y msys2
 ---> Running in bd2fecc8b6f7
"NOHANG"
Chocolatey v1.3.1
...
Chocolatey installed 3/3 packages.
 See the log for details (C:\ProgramData\chocolatey\logs\chocolatey.log).
Removing intermediate container bd2fecc8b6f7
 ---> a1bd6742367a
Successfully built a1bd6742367a
Successfully tagged msys493036:latest

So yeah, not actually a hcsshim bug after all, just something likely to affect anyone calling hcsshim.NewLayerReader or hcsshim/pkg/ociwclayer.ExportLayerToTar which calls it internally. AFAICT from GitHub Search that's just hcsshim itself, Docker, containerd, https://github.com/containers/storage, https://github.com/cloudfoundry/diff-exporter, and forks thereof. As such, I think microsoft/Windows-Containers#213 will be the better bug to track the fix migrating out into the ecosystem, once we know what Go versions it'll actually be available in.
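
As a rough sketch of that shared call path (assuming the current hcsshim/pkg/ociwclayer signature; the paths here are illustrative command-line arguments, not anyone's real code), the temporary staging cleanup that hangs sits inside this single call:

// export_sketch.go - hedged sketch of how downstreams hit this code path.
// The layer paths would normally come from graph driver / snapshotter metadata.
package main

import (
	"context"
	"log"
	"os"

	"github.com/Microsoft/hcsshim/pkg/ociwclayer"
)

func main() {
	if len(os.Args) < 3 {
		log.Fatalf("usage: %s <output.tar> <layer-path> [parent-layer-paths...]", os.Args[0])
	}
	out, err := os.Create(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// ExportLayerToTar stages the layer contents into a temporary directory
	// and cleans it up when the export finishes; that cleanup is where the
	// os.RemoveAll hang described above sits on affected Go versions.
	if err := ociwclayer.ExportLayerToTar(context.Background(), out, os.Args[2], os.Args[3:]); err != nil {
		log.Fatal(err)
	}
}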

Oh, blast. I just realised I could have tested the fix with just wclayer export rather than a whole build of Docker. >_<

@TBBle
Contributor

TBBle commented Aug 27, 2023

The fix has indeed landed in Go 1.21.0, so it's up to hcsshim downstreams to update to Go 1.21 to pick up the fix for this issue.

mobigliani added a commit to jrbe228/bazel-windows-docker-container that referenced this issue Nov 23, 2023
Use chocolatey to install VisualStudio VC build tools.
Install msys2 and handle post-install hang: microsoft/hcsshim#696