Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Fix Windows non-blocking I/O #1555

Merged
merged 14 commits into from
Mar 24, 2023

Conversation

codercms
Copy link
Contributor

Created based on discussion here: #1525 (review)

Fixes #1552

…id that channel leaked in case when `SetBlockingMode` will return error
@jackc
Copy link
Owner

jackc commented Mar 18, 2023

It seems unfortunate for the concept of non-blocking mode to leak out of nbconn into the higher layers (e.g. the copy protocol support).

Would it be possible for this logic to be internal? I think that the only methods that could be affected from the outside are Read, Flush, and BufferReadUntilBlock (Write is not because writes are always buffered). It seems that at any given read or write call nbconn already knows whether it should be a blocking or non-blocking call and it could change mode or do nothing depending on the current mode.

@codercms
Copy link
Contributor Author

codercms commented Mar 19, 2023

@jackc I have updated the PR according to you comment. Please check it again.

  • CopyFrom works perfect with large amount of data
  • But some of tests won't complete (just hangs forever):
Click Me

image

UPD: I decided to use atomic counter to determine should socket be set to blocking or non-blocking mode, but it might not work on some CPUs or in some circumstances. Do you have any thoughts on this?

@codercms codercms changed the title [WIP] Fix Windows non-blocking I/O for CopyFrom [WIP] Fix Windows non-blocking I/O Mar 19, 2023
@drakkan
Copy link
Contributor

drakkan commented Mar 19, 2023

UPD: I decided to use atomic counter to determine should socket be set to blocking or non-blocking mode, but it might not work on some CPUs or in some circumstances. Do you have any thoughts on this?

32 bit atomic should work on any CPU, anyway you can use atomic.Int32.

@codercms
Copy link
Contributor Author

@drakkan thanks for the tip, I've updated PR

@jackc
Copy link
Owner

jackc commented Mar 20, 2023

This logic is tricky and there's a few cases I am concerned about.

What if SetBlockingMode(false) is called more times than SetBlockingMode(true)? It would seem that nbOperCnt could go negative and have unexpected behavior.


If SetBlockingMode can be entered concurrently then I think there may be some sort of race condition. What if there are two concurrent SetBlockingMode(true) calls? It would seem the following sequence of actions is possible:

  1. G1 increments nbOperCnt from 0 to 1 and continues.
  2. G2 increments nbOperCnt from 1 to 2 and returns.
  3. G2 uses the socket thinking it is in non-blocking mode.
  4. G1 calls setSockMode.

I suspect that the whole SetBlockingMode function needs to be guarded by a mutex.


It's possible that neither of these cases actually will occur based on how and where SetBlockingMode is called, but at the very least that would be leaving a landmine for the next person to update nbconn.

@codercms
Copy link
Contributor Author

@jackc I see your point, I've updated PR to use mutex instead of atomic

@drakkan
Copy link
Contributor

drakkan commented Mar 21, 2023

@codercms even with this patch applied some test cases hang as you also noticed

Z:\pgx>go test -v -timeout 30s -run ^TestInternalNonBlockingWriteWithDeadline$ github.com/jackc/pgx/v5/internal/nbconn
=== RUN   TestInternalNonBlockingWriteWithDeadline
=== RUN   TestInternalNonBlockingWriteWithDeadline/Pipe
=== RUN   TestInternalNonBlockingWriteWithDeadline/TCP_with_Fake_Non-blocking_IO
=== RUN   TestInternalNonBlockingWriteWithDeadline/TLS_over_TCP_with_Fake_Non-blocking_IO
=== RUN   TestInternalNonBlockingWriteWithDeadline/TCP_with_Real_Non-blocking_IO
    nbconn_test.go:320:
                Error Trace:    Z:\pgx\internal\nbconn\nbconn_test.go:320
                                                        Z:\pgx\internal\nbconn\nbconn_test.go:143
                Error:          An error is expected but got nil.
                Test:           TestInternalNonBlockingWriteWithDeadline/TCP_with_Real_Non-blocking_IO
=== RUN   TestInternalNonBlockingWriteWithDeadline/TLS_over_TCP_with_Real_Non-blocking_IO
panic: test timed out after 30s
running tests:
        TestInternalNonBlockingWriteWithDeadline (30s)
        TestInternalNonBlockingWriteWithDeadline/TLS_over_TCP_with_Real_Non-blocking_IO (30s)

goroutine 14 [running]:
testing.(*M).startAlarm.func1()
        C:/Program Files/Go/src/testing/testing.go:2241 +0x3b9
created by time.goFunc
        C:/Program Files/Go/src/time/sleep.go:176 +0x32

goroutine 1 [chan receive]:
testing.(*T).Run(0xc0000356c0, {0x4b90e3?, 0x23eb93?}, 0x4dc9c8)
        C:/Program Files/Go/src/testing/testing.go:1630 +0x405
testing.runTests.func1(0x6cde00?)
        C:/Program Files/Go/src/testing/testing.go:2036 +0x45
testing.tRunner(0xc0000356c0, 0xc00006fc88)
        C:/Program Files/Go/src/testing/testing.go:1576 +0x10b
testing.runTests(0xc0000530e0?, {0x6c7c80, 0xd, 0xd}, {0xc00007af68?, 0x100c00006fd10?, 0x6cd400?})
        C:/Program Files/Go/src/testing/testing.go:2034 +0x489
testing.(*M).Run(0xc0000530e0)
        C:/Program Files/Go/src/testing/testing.go:1906 +0x63a
main.main()
        _testmain.go:73 +0x1aa

goroutine 6 [chan receive]:
testing.(*T).Run(0xc000035860, {0x4b830a?, 0xc00000f2a8?}, 0xc00009c048)
        C:/Program Files/Go/src/testing/testing.go:1630 +0x405
github.com/jackc/pgx/v5/internal/nbconn_test.testVariants(0xc000049f60?, 0x4dc9c0)
        Z:/pgx/internal/nbconn/nbconn_test.go:107 +0x165
github.com/jackc/pgx/v5/internal/nbconn_test.TestInternalNonBlockingWriteWithDeadline(0x0?)
        Z:/pgx/internal/nbconn/nbconn_test.go:310 +0x25
testing.tRunner(0xc000035860, 0x4dc9c8)
        C:/Program Files/Go/src/testing/testing.go:1576 +0x10b
created by testing.(*T).Run
        C:/Program Files/Go/src/testing/testing.go:1629 +0x3ea

goroutine 20 [runnable, locked to thread]:
syscall.SyscallN(0x0?, {0xc0000958f0?, 0x1e4db5acd98?, 0x1e4b5800eb8?})
        C:/Program Files/Go/src/runtime/syscall_windows.go:557 +0x109
syscall.Syscall9(0xc00001e4e0?, 0xc00001e450?, 0xc00001eb10?, 0xc00001eba0?, 0xc0000403b0?, 0xc00001ee10?, 0x20a93f?, 0xc0000520a0?, 0xc000035d40?, 0x0, ...)
        C:/Program Files/Go/src/runtime/syscall_windows.go:507 +0x78
syscall.WSASend(0x1b8?, 0xc000095a28?, 0x1, 0xc000095a1c?, 0x0, 0x26b185?, 0x0?)
        C:/Program Files/Go/src/syscall/zsyscall_windows.go:1311 +0xb9
github.com/jackc/pgx/v5/internal/nbconn.(*NetConn).realNonblockingWrite.func1(0xc000130780?)
        Z:/pgx/internal/nbconn/nbconn_real_non_block_windows.go:51 +0x78
internal/poll.(*FD).RawWrite(0xc000130780, 0xc0000ac2c0)
        C:/Program Files/Go/src/internal/poll/fd_windows.go:1096 +0xbd
net.(*rawConn).Write(0xc000088038, 0xc0004d4000?)
        C:/Program Files/Go/src/net/rawconn.go:55 +0x45
github.com/jackc/pgx/v5/internal/nbconn.(*NetConn).realNonblockingWrite(0xc000002180, {0xc000ace000?, 0x471900?, 0x723160?})
        Z:/pgx/internal/nbconn/nbconn_real_non_block_windows.go:61 +0x122
github.com/jackc/pgx/v5/internal/nbconn.(*NetConn).nonblockingWrite(0x524840?, {0xc000ace000?, 0x524840?, 0x723160?})
        Z:/pgx/internal/nbconn/nbconn.go:408 +0x27
github.com/jackc/pgx/v5/internal/nbconn.(*NetConn).flush(0xc000002180)
        Z:/pgx/internal/nbconn/nbconn.go:317 +0x212
github.com/jackc/pgx/v5/internal/nbconn.(*NetConn).Flush(0x6cd400?)
        Z:/pgx/internal/nbconn/nbconn.go:289 +0xa5
github.com/jackc/pgx/v5/internal/nbconn.(*TLSConn).Flush(0x525a08?)
        Z:/pgx/internal/nbconn/nbconn.go:533 +0x1d
github.com/jackc/pgx/v5/internal/nbconn_test.TestInternalNonBlockingWriteWithDeadline.func1(0x525a08?, {0x527ac0, 0xc00006e0f0}, {0x0?, 0x0?})
        Z:/pgx/internal/nbconn/nbconn_test.go:319 +0x11e
github.com/jackc/pgx/v5/internal/nbconn_test.testVariants.func1(0xc000142820?)
        Z:/pgx/internal/nbconn/nbconn_test.go:143 +0x534
testing.tRunner(0xc000142820, 0xc00009c048)
        C:/Program Files/Go/src/testing/testing.go:1576 +0x10b
created by testing.(*T).Run
        C:/Program Files/Go/src/testing/testing.go:1629 +0x3ea

goroutine 13 [runnable, locked to thread]:
syscall.SyscallN(0x1e4db5fa9b8?, {0xc00156bcb0?, 0x1e4b580a398?, 0x1e4b5800a28?})
        C:/Program Files/Go/src/runtime/syscall_windows.go:557 +0x109
syscall.Syscall9(0x6375b0?, 0x6d528?, 0x1c78?, 0x11b48?, 0xc00156bd58?, 0x1f1719?, 0x723ea0?, 0x0?, 0xc00156bd98?, 0x0, ...)
        C:/Program Files/Go/src/runtime/syscall_windows.go:507 +0x78
syscall.WSARecv(0x1e4db5fa9b8?, 0x1c00156be10?, 0x1, 0x0?, 0xc00156bde8?, 0x26b125?, 0x67?)
        C:/Program Files/Go/src/syscall/zsyscall_windows.go:1295 +0xb8
github.com/jackc/pgx/v5/internal/nbconn.(*NetConn).realNonblockingRead.func1(0xc000130780?)
        Z:/pgx/internal/nbconn/nbconn_real_non_block_windows.go:91 +0x87
internal/poll.(*FD).RawRead(0xc000130780, 0xc0000ac000)
        C:/Program Files/Go/src/internal/poll/fd_windows.go:1067 +0xda
net.(*rawConn).Read(0xc000088038, 0xc00001e840?)
        C:/Program Files/Go/src/net/rawconn.go:43 +0x45
github.com/jackc/pgx/v5/internal/nbconn.(*NetConn).realNonblockingRead(0xc000002180, {0xc0003f6000?, 0xc000002180?, 0xc00156bee8?})
        Z:/pgx/internal/nbconn/nbconn_real_non_block_windows.go:101 +0x122
github.com/jackc/pgx/v5/internal/nbconn.(*NetConn).nonblockingRead(0x2000?, {0xc0003f6000?, 0x524840?, 0x723160?})
        Z:/pgx/internal/nbconn/nbconn.go:441 +0x27
github.com/jackc/pgx/v5/internal/nbconn.(*NetConn).BufferReadUntilBlock(0xc000002180)
        Z:/pgx/internal/nbconn/nbconn.go:358 +0xaf
github.com/jackc/pgx/v5/internal/nbconn.(*NetConn).bufferNonblockingRead.func1()
        Z:/pgx/internal/nbconn/nbconn.go:382 +0x3b
created by github.com/jackc/pgx/v5/internal/nbconn.(*NetConn).bufferNonblockingRead
        Z:/pgx/internal/nbconn/nbconn.go:380 +0x10a
FAIL    github.com/jackc/pgx/v5/internal/nbconn 30.596s
FAIL

take a look here

Z:/pgx/internal/nbconn/nbconn_real_non_block_windows.go:91 +0x87

and

Z:/pgx/internal/nbconn/nbconn_real_non_block_windows.go:51 +0x78

Deadlines don't seem to work, this may be the root cause.

@jackc maybe better suggest to use v4 on windows until this is fixed (I guess it might take some time)

@codercms
Copy link
Contributor Author

codercms commented Mar 21, 2023

@drakkan yes, deadlines are just ignored for Windows non-blocking I/O, because there is raw API calls. But actually I'm not sure how deadlines should work for non-blocking I/O, I guess it should be emulated somehow. @jackc could you explain how deadlines should work for non-blocking I/O?

@codercms
Copy link
Contributor Author

@jackc @drakkan I've added some deadline simulation, please have a look on it
All tests passed, except 2 cases:

  • TestInternalNonBlockingWrite/TLS_over_TCP_with_Fake_Non-blocking_IO (Error: Received unexpected error: local error: tls: bad record MAC)
  • TestInternalNonBlockingWriteWithDeadline/TCP_with_Real_Non-blocking_IO (Error: An error is expected but got nil.)

@codercms
Copy link
Contributor Author

codercms commented Mar 21, 2023

@jackc @drakkan I've made one more update and TestInternalNonBlockingWriteWithDeadline/TCP_with_Real_Non-blocking_IO now runs successfully.

UPD: @jackc should c.readDeadline / c.writeDeadline readings be guarded by mutex?

@drakkan
Copy link
Contributor

drakkan commented Mar 21, 2023

@jackc @drakkan I've made one more update and TestInternalNonBlockingWriteWithDeadline/TCP_with_Real_Non-blocking_IO now runs successfully.

UPD: @jackc should c.readDeadline / c.writeDeadline readings be guarded by mutex?

It looks good, thanks! If test cases cover all possible usages we should be fine now. The only failing test case is TestInternalNonBlockingWrite/TLS_over_TCP_with_Fake_Non-blocking_IO but this failure is unrelated to this PR.

To be honest, I'm not happy with all this platform-specific, complicated and hard-to-maintain code (in general, not only related to this PR), it can hide subtle and hard-to-diagnose bugs (e.g. the remaining failed test case)

@codercms
Copy link
Contributor Author

codercms commented Mar 21, 2023

@drakkan I agree, that platform specific code isn't a good approach, but unfortunately GoLang's stdlib doesn't have non-blocking I/O for Windows.

It's actually normal practice in C/C++ world to have platform specific code, especially for sockets, because Windows sockets API is different from sockets API in UNIX (and behavior is different too).
In UNIX-like systems - sockets are files (actually in UNIX everything is file), but in Windows files are only files and nothing more.

Projects like cURL, nginx, boost.asio (async I/O library for C++), libuv (event loop and non-blocking I/O library, used in Node.js) have different network layer implementation for Windows.

UNIX-like systems have advanced socket polling mechanisms like kqueue, epoll, but Windows doesn't have kqueue nor epoll, it has WaitForMultipleObjects.
UNIX-like systems have fcntl/ioctl syscalls to set socket to non-blocking mode, Windows have ioctlsocket syscall.
In general, there are many differences between sockets in UNIX-like systems and Windows.

@drakkan
Copy link
Contributor

drakkan commented Mar 22, 2023

@drakkan I agree, that platform specific code isn't a good approach, but unfortunately GoLang's stdlib doesn't have non-blocking I/O for Windows.

It's actually normal practice in C/C++ world to have platform specific code, especially for sockets, because Windows sockets API is different from sockets API in UNIX (and behavior is different too). In UNIX-like systems - sockets are files (actually in UNIX everything is file), but in Windows files are only files and nothing more.

Projects like cURL, nginx, boost.asio (async I/O library for C++), libuv (event loop and non-blocking I/O library, used in Node.js) have different network layer implementation for Windows.

UNIX-like systems have advanced socket polling mechanisms like kqueue, epoll, but Windows doesn't have kqueue nor epoll, it has WaitForMultipleObjects. UNIX-like systems have fcntl/ioctl syscalls to set socket to non-blocking mode, Windows have ioctlsocket syscall. In general, there are many differences between sockets in UNIX-like systems and Windows.

First of all I appreciate a lot your efforts.
Sure in C/C++ is normal, but not in Go and this means writing and maintaining very complicated code. I wonder if we can just use goroutines so the code will work the same on every platforms. pgx v4 and lib/pq does not have non-blocking I/O and seem to work fine.
Anyway this issue seems fixed. The only remaining test failure should not matter now that non blocking I/O works also on Windows and that code should not be reached

@codercms
Copy link
Contributor Author

@drakkan for sure we can use just goroutines and simulate non-blocking I/O, but see this comment - #1525 (comment)

I'm half tempted to try using goroutines and normal Go I/O again. I tried it a few years ago and the performance hit was significant, but maybe modern Go has higher performance...

@jackc
Copy link
Owner

jackc commented Mar 24, 2023

I think this is in a mergeable state. But I don't plan on tagging a new release for a while, I think it would be good to let the more adventurous people who run master give us some more real world testing.


@drakkan

To be honest, I'm not happy with all this platform-specific, complicated and hard-to-maintain code (in general, not only related to this PR), it can hide subtle and hard-to-diagnose bugs (e.g. the remaining failed test case)

You and me both. I wrote almost all of nbconn except for this new Windows support and I totally agree it is very complicated and tricky. Unfortunately, I have been unable to find a better way. (TLS support greatly complicates it as well. )

I wonder if we can just use goroutines so the code will work the same on every platforms. pgx v4 and lib/pq does not have non-blocking I/O and seem to work fine.

There were a few reasons for introducing non-blocking IO. One was to be able to efficiently test if a connection was live before sending a query. If a connection has been closed on the server end it is sometimes still possible to write to the connection. The connection closed error is not received until the connection is read from. This causes a problem where is is impossible to say if a insert/update/delete statement actually was executed. By testing the connection first this occurrence can be minimized.

In addition, the CopyFrom logic is much simpler with non-blocking I/O.

@jackc jackc merged commit e9d64ec into jackc:master Mar 24, 2023
@drakkan
Copy link
Contributor

drakkan commented Mar 25, 2023

I think this is in a mergeable state. But I don't plan on tagging a new release for a while, I think it would be good to let the more adventurous people who run master give us some more real world testing.

@drakkan

To be honest, I'm not happy with all this platform-specific, complicated and hard-to-maintain code (in general, not only related to this PR), it can hide subtle and hard-to-diagnose bugs (e.g. the remaining failed test case)

You and me both. I wrote almost all of nbconn except for this new Windows support and I totally agree it is very complicated and tricky. Unfortunately, I have been unable to find a better way. (TLS support greatly complicates it as well. )

I wonder if we can just use goroutines so the code will work the same on every platforms. pgx v4 and lib/pq does not have non-blocking I/O and seem to work fine.

There were a few reasons for introducing non-blocking IO. One was to be able to efficiently test if a connection was live before sending a query. If a connection has been closed on the server end it is sometimes still possible to write to the connection. The connection closed error is not received until the connection is read from. This causes a problem where is is impossible to say if a insert/update/delete statement actually was executed. By testing the connection first this occurrence can be minimized.

In addition, the CopyFrom logic is much simpler with non-blocking I/O.

@jackc thanks for the detailed explanation. I updated SFTPGo to use the latest version and I did several tests on a Windows VM. It works fine for me but SFTPGo is not a database intesive application. @codercms thanks for your work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The copy from function runs incorrectly on some machines
3 participants