Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os: remove 5ms sleep on Windows in (*Process).Wait #25965

Closed
bradfitz opened this issue Jun 19, 2018 · 37 comments
Closed

os: remove 5ms sleep on Windows in (*Process).Wait #25965

bradfitz opened this issue Jun 19, 2018 · 37 comments
Assignees
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done. OS-Windows Performance
Milestone

Comments

@bradfitz
Copy link
Contributor

In our Windows implementation of os/Process.Wait, we always do a 5 millisecond sleep:

func (p *Process) wait() (ps *ProcessState, err error) {
	handle := atomic.LoadUintptr(&p.handle)
	s, e := syscall.WaitForSingleObject(syscall.Handle(handle), syscall.INFINITE)
	switch s {
	case syscall.WAIT_OBJECT_0:
		break
	case syscall.WAIT_FAILED:
		return nil, NewSyscallError("WaitForSingleObject", e)
	default:
		return nil, errors.New("os: unexpected result from WaitForSingleObject")
	}
	var ec uint32
	e = syscall.GetExitCodeProcess(syscall.Handle(handle), &ec)
	if e != nil {
		return nil, NewSyscallError("GetExitCodeProcess", e)
	}
	var u syscall.Rusage
	e = syscall.GetProcessTimes(syscall.Handle(handle), &u.CreationTime, &u.ExitTime, &u.KernelTime, &u.UserTime)
	if e != nil {
		return nil, NewSyscallError("GetProcessTimes", e)
	}
	p.setDone()
	// NOTE(brainman): It seems that sometimes process is not dead
	// when WaitForSingleObject returns. But we do not know any
	// other way to wait for it. Sleeping for a while seems to do
	// the trick sometimes. So we will sleep and smell the roses.
	defer time.Sleep(5 * time.Millisecond)
	defer p.Release()
	return &ProcessState{p.Pid, syscall.WaitStatus{ExitCode: ec}, &u}, nil
}

Sleeps in code are always gross. They're either too fast (and still flaky) or too slow (incurring extra delays on everybody when not needed).

This bug is about figuring out what we're doing wrong on Windows that made us need that sleep in the first place.

https://go-review.googlesource.com/c/go/+/84175 was one attempt to remove it, but the goal then was fixing a Windows XP test failure (for #17245), and it didn't fix the XP failure. We no longer support XP, though, so maybe that CL is okay now.

But maybe it's not correct.

There was also https://golang.org/cl/84896 from @johnsonj to use jobs to wait for process completion on windows, but it was "not working as expected on Server 2008" and @alexbrainman had concerns in #17245 (comment) (that whole thread is worth a read, even if it's mostly about debugging an XP issue, which might've been related to this bug.)

@bradfitz bradfitz added help wanted OS-Windows NeedsFix The path to resolution is known, but the work has not been done. labels Jun 19, 2018
@bradfitz bradfitz added this to the Unplanned milestone Jun 19, 2018
@alexbrainman
Copy link
Member

If you do not delete executable file immediately after its process exits, then you could remove that 5ms sleep, and nothing will change for you.

So we really need that extra sleep mainly to make cmd/go work properly, because cmd/go runs programs and deletes them immediately.

Perhaps we could move that 5ms sleep from os package and into cmd/go code.

Alex

@bradfitz
Copy link
Contributor Author

It should be possible to delete a file after it finishes executing, for any program, not just cmd/go.

This bug is about figuring out what synchronization we're missing.

@alexbrainman
Copy link
Member

This bug is about figuring out what synchronization we're missing.

I understand that. I was suggesting what to do in case we cannot find such synchronization.

Alex

@iamoryanmoshe
Copy link
Contributor

Is the sleep needed on every process we want to wait on?
WinApi docs state that

... if you have a thread that creates windows, use MsgWaitForMultipleObjects or MsgWaitForMultipleObjectsEx, rather than WaitForSingleObject.

We might want to use a different wait method on windows?

@bradfitz
Copy link
Contributor Author

When I tried to remove the sleep in https://go-review.googlesource.com/c/go/+/84175 I was doing it mostly for debugging Windows XP problems (#2866 and #17245), and it apparently didn't work.

But we don't support Windows XP anymore, so maybe we can remove it now?

But I forget everything about Windows I had learned while researching that patch.

/cc @alexbrainman @jordanrh1

@jordanrh1
Copy link
Contributor

I've seen a lot of code that waits on a process handle, and none of it sleeps. It would be a major bug if you couldn't trust a process handle to be signaled at the right time, and would surely be fixed. The sleep must be masking a problem somewhere else. I'm all for removing the sleep and seeing what shakes loose.

@alexbrainman
Copy link
Member

I'm all for removing the sleep and seeing what shakes loose.

SGTM

Alex

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/145221 mentions this issue: os: remove sleep in windows Process.Wait

@alexbrainman
Copy link
Member

@bradfitz @jordanrh1

CL 145221 have broken windows-2008 builder - latest snapshot of https://build.golang.org/

image

Even one failure on windows-2012 https://build.golang.org/log/d37c2bad6c49d20fe4bec6f1a8d380e583ea79e8

What should we do now?

Alex

@bradfitz
Copy link
Contributor Author

Haven't we seen those errors in the past, though? I guess removing the 5ms sleep just makes them more likely?

Is there some other handle we're not closing quickly?

@jordanrh1, could you audit some of these code paths and make sure we're doing things right? (clearly we aren't)

@jordanrh1
Copy link
Contributor

I'm not seeing anything obviously wrong with the code, but these are the things I'm suspecting:

  1. might be behavior of OS that it signals process handle before files are released
  2. might have to do with stdout/stderr handles
  3. might have to do with child processes
  4. might be a code defect in os.exec

To rule out (1), I want to see if I can repro the issue with a simple C program. If it is (1), then a Sleep may be the best solution, especially if Go's contract is that all files used by the process are released by the time Wait() returns.

@alexbrainman
Copy link
Member

Haven't we seen those errors in the past, though? I guess removing the 5ms sleep just makes them more likely?

Yes, these errors happens more often. Maybe it was happening on Windows XP is even more often.

Is there some other handle we're not closing quickly?

I think that once process handle is signaled - syscall.WaitForSingleObject returns - the process is exited. So opened handles could not matter here. But I could be wrong about that.

I'm suspecting:

  1. might be behavior of OS that it signals process handle before files are released
  2. might have to do with stdout/stderr handles
  3. might have to do with child processes
  4. might be a code defect in os.exec

To rule out (1), I want to see if I can repro the issue with a simple C program. If it is (1), then a Sleep may be the best solution, especially if Go's contract is that all files used by the process are released by the time Wait() returns.

I do not see how to write similar C program for (1). The program that fails is cmd/go tests - how do you write C program that imitate cmd/go test?

I don't see how it could be related to (2). Like I said above, we wait for syscall.WaitForSingleObject on process handle to return, so whatever handles process opens, they should be all closed now. And when new process starts, we use syscall.DuplicateHandle to create stdin, stdout and stderr handles for child process and close the handles immediately - see syscall.StartProcess for details.

I am not sure what to say about (3) and (4).

Should I revert CL 145221 in the meantime?

Alex

@bradfitz
Copy link
Contributor Author

bradfitz commented Nov 9, 2018

Should I revert CL 145221 in the meantime?

Sure? Super sad, but we don't have good short-term leads.

If you do, please reopen this bug and reference both this bug in the commit, and #23171.

/cc @aclements

@aclements
Copy link
Member

I do not see how to write similar C program for (1). The program that fails is cmd/go tests - how do you write C program that imitate cmd/go test?

How about creating an executable file (probably by just making a copy of an existing executable), run it, WaitForSingleObject, and then try to delete it, and do that in a loop? If the theory is that Windows signals the process handle before releasing the file, that should test it.

Is there any way to wait for locks on a file to be released? We can't just do that unconditionally here (there could be other copies of the binary running, too), but maybe we could make remove willing to block for a little while if the file is locked before failing?

@jordanrh1
Copy link
Contributor

Yes, what @aclements suggested.

I should mention that I don't have any cycles to test this until the 19th.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/148957 mentions this issue: Revert "os: remove sleep in windows Process.Wait"

@alexbrainman
Copy link
Member

reopen this bug and reference both this bug in the commit, and #23171.

https://go-review.googlesource.com/c/go/+/148957

How about creating an executable file (probably by just making a copy of an existing executable), run it, WaitForSingleObject, and then try to delete it, and do that in a loop?

I will try to do what you have asked. I would not be surprised, if the system configuration or OS version will also matter here - imagine, if you have an antivirus program eager to catch executable with virus in them, or another "helpful" service.

Is there any way to wait for locks on a file to be released? We can't just do that unconditionally here (there could be other copies of the binary running, too), but maybe we could make remove willing to block for a little while if the file is locked before failing?

I am not sure what locks you are talking about. Please, explain more.

All errors, we see here, happen during cmd/go test. Perhaps, the exe in questions starts itself as a child, and then parent process exists, and we notice parent exit, but do not wait for the child? But all examples of faults are simple run of a program - they do not look like starting any child processes.

Alex

@alexbrainman alexbrainman reopened this Nov 9, 2018
gopherbot pushed a commit that referenced this issue Nov 9, 2018
This reverts CL 145221 (commit 5c35973)

Reason for revert: breaks the build occasionally.

Updates #23171
Updates #25965

Change-Id: Ie1e3c76ab9bcd8d28b6118440b5f80c76f9b1852
Reviewed-on: https://go-review.googlesource.com/c/148957
Run-TryBot: Alex Brainman <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>
alexbrainman added a commit to alexbrainman/winapi that referenced this issue Apr 11, 2019
@alexbrainman
Copy link
Member

I will try to do what you have asked.

I created github.com/alexbrainman/winapi/tree/master/issue25965.TestIssue25965

The test runs required exercise for both Go version (only syscall package is used) and C version. The test fails immediately on my Windows 7 computer. But it was much harder to make it fail on my Windows 10 computer - I managed to make it fail after I added a loop of 100 iterations around the test - so be warned that the test might take a while to run. Anyway my Windows 10 output looks like this:

c:\>u:\test
PASS

c:\>u:\test
--- FAIL: TestIssue25965 (84.83s)
    main_test.go:166: running c exe failed: exit status 1 iteration 56: CopyFile failed (5)
FAIL

c:\>u:\test
--- FAIL: TestIssue25965 (92.83s)
    main_test.go:143: iteration 83 failed: CopyFile: Access is denied.
FAIL

c:\>u:\test
--- FAIL: TestIssue25965 (64.27s)
    main_test.go:166: running c exe failed: exit status 1 iteration 14: CopyFile failed (5)
FAIL

c:\>

Let me know, if you want me to do anything else.

Alex

@bcmills
Copy link
Contributor

bcmills commented May 7, 2021

(CC @bufflig: this is currently one of the larger sources of test flakes on the Windows builders.)

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/369017 mentions this issue: os/signal: remove delay on Wait() for a process for windows

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/371296 mentions this issue: testing: retry spurious errors from RemoveAll for temp directories

gopherbot pushed a commit that referenced this issue Dec 14, 2021
This works around what appears to be either a kernel bug or a Go
runtime or syscall bug affecting certain Windows versions
(possibly all pre-2016?).

The retry loop is a simplified version of the one used in
cmd/go/internal/robustio. We use the same 2-second arbitrary timeout
as was used in that package, since it seems to be reliable in practice
on the affected builders. (If it proves to be too short, we can
lengthen it, within reason, in a followup CL.)

Since this puts a higher-level workaround in place, we can also revert
the lower-level workaround added to a specific test in CL 345670.

This addresses the specific occurrences of the bug for users of
(*testing.T).TempDir, but does not fix the underlying bug for Go users
outside the "testing" package (which remains open as #25965).

Fixes #50051
Updates #48012
Updates #25965

Change-Id: I35be7125f32f05c8350787f5ca9a22974b8d0770
Reviewed-on: https://go-review.googlesource.com/c/go/+/371296
Trust: Bryan Mills <[email protected]>
Run-TryBot: Bryan Mills <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Patrik Nyblom <[email protected]>
Trust: Patrik Nyblom <[email protected]>
Run-TryBot: Patrik Nyblom <[email protected]>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/423194 mentions this issue: net/http: add a test to verify form tempfiles are deleted

gopherbot pushed a commit that referenced this issue Aug 12, 2022
The HTTP/1 server deletes multipart form tempfiles after ServeHTTP
returns, but the HTTP/2 server does not. Add a test to verify
cleanup happens in both cases, temporarily disabled for the
HTTP/2 path.

For #20253
Updates #25965

Change-Id: Ib753f2761fe73b29321d9d4337dbb5090fd193c2
Reviewed-on: https://go-review.googlesource.com/c/go/+/423194
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Damien Neil <[email protected]>
Reviewed-by: Robert Griesemer <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/463840 mentions this issue: os: remove unconditional wait sleep on Windows

@qmuntal
Copy link
Member

qmuntal commented Jan 28, 2023

I've given another shot with CL 463840. It's novel approach is that it calls WaitForSingleObject not on the handle itself but on a handle with SYNCHRONIZE rights. Well, this thread is quite long, so it might also been tried before, but if someone did, I haven't seen it.

@bobjalex
Copy link

bobjalex commented Jan 31, 2023 via email

@bcmills
Copy link
Contributor

bcmills commented Jan 31, 2023

@bobjalex

Tools that have to delete executables when their processes complete
should incorporate retry loops similar to the example above.

The problem with that philosophy is that many programs don't know whether the data they are deleting may include recently-executed programs. That forces either a layering violation (the use-agnostic tool needing to know about executables) or additional latency (the use-agnostic tool performing spurious retries).

This is not a theoretical problem: the Go testing package is one such tool. (See CL 371296.)

@bcmills
Copy link
Contributor

bcmills commented Jan 31, 2023

That also raises the problem of: how long do you retry for? If you attempt only a small number of retries, you'll have false-positive failures; but if you attempt a larger number of retries, then the latency in the cases where the retries are spurious gets worse and worse. For a project like Go that aims to support a wide variety of hardware and platforms, there isn't an obvious setting that addresses both of those constraints.

@bobjalex
Copy link

bobjalex commented Jan 31, 2023 via email

@qmuntal
Copy link
Member

qmuntal commented Jul 6, 2023

I took another stab at this from another side.

I'm pretty sure that all boils down to Windows not deleting a file when syscall.DeleteFile is used, but just marking it for deletion. This behavior is known to be problematic (e.g. dotnet/runtime#27958 and PowerShell/PowerShell#8211) when implementing recursive directory removal, i.e. os.RemoveAll, because a deleted file is deleted asynchronously. This can potentially make deleted files appear multiple times when traversing a directory for deletion, and what's worst is that the second syscall.DeleteFile call on the same file will return ERROR_ACCESS_DENIED. This could explain a bunch of Go issues: #50051, #45672, #23171, and #30789. It is important to note that this race condition is in part due to the way os.RemoveAll is implemented, so let me show how it can happen:

  1. The directory to be removed is opened with os.Open. source
  2. Child files are retrieved using Readdirnames. source
  3. Each child file is successfully marked for deletion using os.RemoveAll, but imagine foo.exe take 10s to be (asynchronously) deleted. source
  4. The directory is deleted using os.Remove. It fails because foo.exe is still not deleted. source
  5. The process starts again from the first step. Step 2 returns foo.exe again and step 3 fails with ERROR_ACCESS_DENIED.

This behavior is not documented, but it's what I could gather from exploring different sources, including the aforementioned dotnet and PowerShell issues.

Going back to the 5ms sleep, it seems that deleting an executable file whose process has just been released makes the asynchronous deletion process take more than for normal files. If we add the fact that testing cleanup calls os.RemoveAll on folders that are normally just a handful of files, the chances to trigger the race condition are high. Thus, the additional sleep time gives to the OS more time to do whatever cleanup is necessary before trying to remove the file using is.RemoveAll, reducing the chances to trigger the race condition.

Luckily for us, the syscall.DeleteFile behavior changed in Windows 10 1903. It no longer deleted files asynchronously but follows the POSIX semantics of deleting files synchronously, making it impossible to trigger the race condition in os.RemoveAll. Therefore, we can drop the 5ms sleep when running on newer Windows versions. Again, this behavior is not documented, but multiple frameworks already depend on it, see for example dotnet/corefx#38186.

This is inline with @alexbrainman and @bradfitz findings while investigating this issue, where they found that it doesn't occur in Windows 10. What I'm doing now is trying to explain those findings so we don't just remove the 5ms seconds based on experimentation but also on a good reasoning.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/509335 mentions this issue: os: remove 5ms sleep on Windows in (*Process).Wait

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done. OS-Windows Performance
Projects
None yet
Development

No branches or pull requests