Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal/syscall/windows/registry: registry subkey enumeration cannot handle thread migrations #49320

Closed
mknyszek opened this issue Nov 3, 2021 · 18 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@mknyszek
Copy link
Contributor

mknyszek commented Nov 3, 2021

In trying to land #44167, I discovered that an additional well-placed GC cycle in the internal/syscall/windows/registry tests caused one of the tests to mysteriously fail in a RegEnumKeyExW syscall with "access denied." This syscall is called multiple times in a row, and the first call never was what failed.

The total amount of code involved in the test is fairly small, so it was hard to see what could possibly be going wrong. Furthermore, the RegEnumKeyExW documentation doesn't indicate with certainty under what conditions it could fail (https://docs.microsoft.com/en-us/windows/win32/api/winreg/nf-winreg-regenumkeyexw#remarks). After looking at some execution traces, I stumbled upon the issue: if the goroutine executing these syscalls migrates between OS threads (not interrupting the syscall mind you, just in between consecutive calls), then subsequent calls will fail.

I confirmed this by adding runtime.LockOSThread to the ReadSubKeyNames method on Key, and that seems to fix it.

Steps forward here I think are:

  1. Call runtime.LockOSThread in ReadSubKeyNames.
  2. Document that RegEnumKeyExW requires the goroutine to be locked to its OS thread.

Another potential third step could be to expose a safer abstraction for this in the syscall package itself, but since the API is frozen, I don't think this is possible.

@mknyszek mknyszek self-assigned this Nov 3, 2021
@mknyszek mknyszek added this to the Go1.18 milestone Nov 3, 2021
@mknyszek mknyszek added the NeedsFix The path to resolution is known, but the work has not been done. label Nov 3, 2021
@bcmills
Copy link
Contributor

bcmills commented Nov 3, 2021

Another potential third step could be to expose a safer abstraction for this in the syscall package itself, but since the API is frozen, I don't think this is possible.

The unsafe version of the function can't be removed, but it can be deprecated in favor of a safer replacement if we can define one.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/361154 mentions this issue: syscall: fix and document uses of RegEnumKeyEx

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/362576 mentions this issue: windows/registry: lock OS thread while enumerating keys

gopherbot pushed a commit to golang/sys that referenced this issue Nov 9, 2021
CL 361154 updated the standard syscall package to document that
successive calls to syscall.RegEnumKeyEx must occur on the same OS
thread (after that requirement was empirically discovered in
golang/go#49320).

This use in x/sys needs to be updated to comply with the
newly-discovered requirement.

Fixes golang/go#49466.

Change-Id: Idc45d91f175e1db25c215213fcaa45982c2f5e6e
Reviewed-on: https://go-review.googlesource.com/c/sys/+/362576
Trust: Bryan C. Mills <[email protected]>
Run-TryBot: Bryan C. Mills <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
@alexbrainman
Copy link
Member

@mknyszek I tried to reproduce this issue on my Windows 10 laptop and windows-amd64-2016 builder and I cannot reproduce it.

I build

https://go-review.googlesource.com/c/sys/+/363574

and my test fails with

--- FAIL: TestReadSubKeyNames (0.00s)
    registry_test.go:44: key Accessibility
    registry_test.go:44: key Colors
    registry_test.go:44: key Cursors
    registry_test.go:44: key Desktop
    registry_test.go:44: key Input Method
    registry_test.go:44: key International
    registry_test.go:44: key Keyboard
    registry_test.go:44: key Mouse
    registry_test.go:44: key PowerCfg
    registry_test.go:46: BOOM
FAIL
FAIL	golang.org/x/sys/windows/registry	0.025s

If my change follows your explanation of the problem, the test should fail on line 44 instead of 46.

Where did I make mistake?

Thank you.

Alex

@alexbrainman
Copy link
Member

Pinging @mknyszek . Please reply to #49320 (comment)

Thank you.

Alex

1 similar comment
@alexbrainman
Copy link
Member

Pinging @mknyszek . Please reply to #49320 (comment)

Thank you.

Alex

@alexbrainman alexbrainman reopened this Dec 7, 2021
@mknyszek
Copy link
Contributor Author

mknyszek commented Dec 7, 2021

@alexbrainman The restriction may be limited to windows-386-2008. Perhaps that's a bug that was fixed in later editions. I didn't try to reproduce on later Windows, or windows/amd64.

@alexbrainman
Copy link
Member

The restriction may be limited to windows-386-2008.

My test #49320 (comment) was run on windows-386-2008 and windows-amd64-2016. So still I don't see any "restriction" even on windows-386-2008 builder.

What did you do differently from CL 363574 to reproduce this issue?

Thank you.

Alex

@mknyszek
Copy link
Contributor Author

mknyszek commented Dec 8, 2021

Then I don't know. Perhaps that assumption is wrong. What I did was I inserted a call to runtime.GC before the call to syscall.RegEnumKeyEx. An additional GC cycle should not cause a syscall API to fail. Adding in runtime.LockOSThread made it so that, even with the additional GC cycle, it still worked. Maybe it's some other effect of runtime.LockOSThread that resolved it.

It's possible this is a bug in the syscall path instead, maybe. Feel free to update the documentation if you think what I wrote is wrong.

@bcmills
Copy link
Contributor

bcmills commented Dec 8, 2021

@alexbrainman, FWIW this failure mode also reproduced in x/sys/windows/registry.TestReadSubKeyNames on the Go dashboard builders (mainly windows-386-2008, but also twice on windows-amd64-2016). That was #49466, and seems to have been resolved by adding a runtime.LockOSThread in CL 362576.

@alexbrainman
Copy link
Member

It's possible this is a bug in the syscall path instead, maybe.

Yes, it is quite possible the problem is in runtime, and not in Windows. You assumed that the problem is in OS, and that is a possibility. But we need some repro (maybe C code that fails in the same way) or at the very least repro in Go, before we should submit code based on that assumption. Your "fix" most likely just hiding a bug in runtime.

That was #49466, and seems to have been resolved by adding a runtime.LockOSThread in CL 362576.

CL 362576 assumes that there is a bug in Windows. And we cannot reproduce this bug. So we don't really know why CL 362576 fixes #49466. Like I said above, CL 362576 could be masking bug in runtime.

Alex

@mknyszek
Copy link
Contributor Author

mknyszek commented Dec 11, 2021

I agree with you that an OS bug is unlikely, but I don't believe it to be a bug. I never did. I believe (as I wrote in CL 362576) that I thought this was an undocumented requirement. My reasoning was the following: in C land (or really any other place), it's difficult to imagine that this syscall would be called in a loop across threads. Go is unique here where a thread switch can happen at almost any time. I think such a scenario is far likelier.

Clearly my original assertion was wrong in some way, however, and I admit that.

And we cannot reproduce this bug.

As I said before, I believe we can. Back when debugging this, I could have it fail nearly every time by adding a call to runtime.GC just before the call to syscall.RegEnumKeyEx while removing the calls to runtime.LockOSThread. Also, it should be reproducible by applying GOEXPERIMENT=heapminimum512kib and again removing the calls to runtime.LockOSThread; this was the original failure mode.

CL 362576 doesn't fix the issue as currently titled, but it does mitigate a real bug. It could be a bug in the runtime, I don't disagree. But the bug is mitigated. Given your attempts at reproduction, we should revert the changes to the documentation. But I don't think we should revert the mitigation.

On Monday I will confirm that I can reproduce it.

@mknyszek
Copy link
Contributor Author

mknyszek commented Dec 13, 2021

With this patch:

diff --git a/src/internal/syscall/windows/registry/key.go b/src/internal/syscall/windows/registry/key.go
index ec38cf9288..1bcc6b9b4c 100644
--- a/src/internal/syscall/windows/registry/key.go
+++ b/src/internal/syscall/windows/registry/key.go
@@ -91,12 +91,6 @@ func OpenKey(k Key, path string, access uint32) (Key, error) {

 // ReadSubKeyNames returns the names of subkeys of key k.
 func (k Key) ReadSubKeyNames() ([]string, error) {
-       // RegEnumKeyEx must be called repeatedly and to completion.
-       // During this time, this goroutine cannot migrate away from
-       // its current thread. See #49320.
-       runtime.LockOSThread()
-       defer runtime.UnlockOSThread()
-
        names := make([]string, 0)
        // Registry key size limit is 255 bytes and described there:
        // https://msdn.microsoft.com/library/windows/desktop/ms724872.aspx
@@ -105,6 +99,7 @@ loopItems:
        for i := uint32(0); ; i++ {
                l := uint32(len(buf))
                for {
+                       runtime.GC()
                        err := syscall.RegEnumKeyEx(syscall.Handle(k), i, &buf[0], &l, nil, nil, nil, nil)
                        if err == nil {
                                break

go test internal/syscall/windows/registry failed immediately for me on a windows-386-2008 gomote with:

--- FAIL: TestReadSubKeyNames (0.00s)
    registry_test.go:39: Access is denied.
FAIL
FAIL	internal/syscall/windows/registry	0.033s
FAIL
Error running run: exit status 1

I've confirmed that LockOSThread does appear to mitigate some issue here.

@alexbrainman
Copy link
Member

With this patch:

diff --git a/src/internal/syscall/windows/registry/key.go b/src/internal/syscall/windows/registry/key.go
index ec38cf9288..1bcc6b9b4c 100644
--- a/src/internal/syscall/windows/registry/key.go
+++ b/src/internal/syscall/windows/registry/key.go
@@ -91,12 +91,6 @@ func OpenKey(k Key, path string, access uint32) (Key, error) {

 // ReadSubKeyNames returns the names of subkeys of key k.
 func (k Key) ReadSubKeyNames() ([]string, error) {
-       // RegEnumKeyEx must be called repeatedly and to completion.
-       // During this time, this goroutine cannot migrate away from
-       // its current thread. See #49320.
-       runtime.LockOSThread()
-       defer runtime.UnlockOSThread()
-
        names := make([]string, 0)
        // Registry key size limit is 255 bytes and described there:
        // https://msdn.microsoft.com/library/windows/desktop/ms724872.aspx
@@ -105,6 +99,7 @@ loopItems:
        for i := uint32(0); ; i++ {
                l := uint32(len(buf))
                for {
+                       runtime.GC()
                        err := syscall.RegEnumKeyEx(syscall.Handle(k), i, &buf[0], &l, nil, nil, nil, nil)
                        if err == nil {
                                break

go test internal/syscall/windows/registry failed immediately for me on a windows-386-2008 gomote with:

--- FAIL: TestReadSubKeyNames (0.00s)
    registry_test.go:39: Access is denied.
FAIL
FAIL	internal/syscall/windows/registry	0.033s
FAIL
Error running run: exit status 1

Thank you @mknyszek I can reproduce this problem pretty much anywhere.

Unfortunately I still doo not know what the problem is.

The test fails because RegEnumKeyEx after a few calls returns ERROR_ACCESS_DENIED.

RegEnumKeyEx is called in a loop starting with index 0, then 1, then 2 and so on. It always succeeds when called with index of 0. I can see it fails with either index 1 or 2.

I also added a call to print GetCurrentThreadId value. And thread id changes every loop iteration.

I also checked that RegEnumKeyEx output parameters don't move in memory between the calls (I printed buf and l addresses).

I also noticed that

https://docs.microsoft.com/en-us/windows/win32/sysinfo/enumerating-registry-subkeys

uses KEY_READ instead of KEY_ENUMERATE_SUB_KEYS. So I also replaced registry.ENUMERATE_SUB_KEYS with registry.READ. And that fixes the problem. But

https://docs.microsoft.com/en-us/windows/win32/api/winreg/nf-winreg-regenumkeyexw

says to use KEY_ENUMERATE_SUB_KEYS, so I don't think my change is appropriate.

@zx2c4 perhaps you have some suggestions here. Perhaps you can step into RegEnumKeyEx to see why it returns ERROR_ACCESS_DENIED in our test.

Thank you.

Alex

@zx2c4
Copy link
Contributor

zx2c4 commented Dec 31, 2021

@zx2c4 perhaps you have some suggestions here. Perhaps you can step into RegEnumKeyEx to see why it returns ERROR_ACCESS_DENIED in our test.

Haven't done anything dynamic yet, but just statically reverse engineering things, it looks like:

RegEnumKeyExW->LocalBaseRegEnumKey->EnumTableGetKeyState->StateObjectListFind(arg->someMember, NtCurrentTeb()->ClientId.UniqueThread), and similarly with a call into EnumTableAddKey which also looks at the TID. There are a bunch of heap allocations and lookup table things going on. That leads me to suspect that a first call is querying quite a bit of info on subkeys, which is then cached by KernelBase.dll and doled out on each successive call to RegEnumKeyExW. This cache seems to be keyed based on the thread ID, so if things are migrated to a new thread, the call fails.

The more interesting question is why KEY_READ works but KEY_ENUMERATE_SUB_KEYS does not. I suspect what's happening is that it first tries to enumerate using NtQueryKey(KeyCachedInformation) to get the number of subkeys, which requires KEY_ENUMERATE_SUB_KEYS. When this fails, it then falls back to using NtEnumerateKey, one by one with no fancy caching, which only requires KEY_READ access.

If that theory holds true, moving to KEY_READ probably isn't good for performance.

I agree this per-thread cache behavior is weird and surprising, and if it can't be fixed in Windows, it should probably at least be documented, so CCing @jstarks.

@alexbrainman
Copy link
Member

Haven't done anything dynamic yet, but just statically reverse engineering things, it looks like:

Thanks for doing this. Your explanation sounds reasonable.

I agree this per-thread cache behavior is weird and surprising, and if it can't be fixed in Windows, it should probably at least be documented, so CCing @jstarks.

Yes. RegEnumKeyExW documentation does not say that all calls must come from the same thread.

I will leave this issue opened until we hear from @jstarks.

Alex

@jstarks
Copy link

jstarks commented Jan 3, 2022

Yes, this appears to be intentional behavior specifically for HKEY_CLASSES_ROOT as of Windows 2000. HKCU is a merged view of the machine-wide CLSID information (from HKLM) and the user-local CLSID information (from HKCU). This merging happens in user mode, and the associated state across calls is maintained in TLS.

There's no obvious way to improve this in the OS without introducing a new API, something that is quite unlikely to occur since there's a reasonable-enough workaround here.

I'll try to get the behavior documented.

@alexbrainman
Copy link
Member

Thank you @jstarks for explaining.

Yes, this appears to be intentional behavior ...

We are talking about RegEnumKeyExW returning ERROR_ACCESS_DENIED when called on a different thread from the thread that opened the key. I don't see how this can be intentional. It is just a broken implementation.

This merging happens in user mode, and the associated state across calls is maintained in TLS.

There's no obvious way to improve this in the OS without introducing a new API, ...

Again this is implementation details. I don't see why you need a new API. Unless you mean that new API is required to build similarly fast implementation. Or something else.

since there's a reasonable-enough workaround here.

Are you referring to the code that keeps internal/syscall/windows/registry.ReadSubKeyNames function execution on the same thread

https://go-review.googlesource.com/c/go/+/361154/

?

Then we had no other choice. And the requirement to stay on the same thread is not documented anywhere. So @mknyszek had a good guess. I did not believe it is possible for such Windows bug.

I'll try to get the behavior documented.

That would be nice. Thank you.

Closing this issue, because I am satisfied this a Windows bug. Nothing we can do here.

Alex

@golang golang locked and limited conversation to collaborators Jun 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

6 participants