Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try to make Windows non-blocking I/O #1525

Merged
merged 3 commits into from
Mar 4, 2023
Merged

Try to make Windows non-blocking I/O #1525

merged 3 commits into from
Mar 4, 2023

Conversation

codercms
Copy link
Contributor

@codercms codercms commented Feb 25, 2023

Resolves #1523 #1482

Do not merge: Possible broken tests

@codercms
Copy link
Contributor Author

codercms commented Feb 25, 2023

@jackc this PR works in real app, but I can't run tests for nbconn (nbconn_test.go) on my machine, because TestInternalNonBlockingWrite continuously eat RAM (10GB+) and I don't have enough RAM. Do you have any ideas?

@codercms codercms changed the title Try to make Windows non blocking io Try to make Windows non blocking I/O Feb 25, 2023
@codercms codercms changed the title Try to make Windows non blocking I/O Try to make Windows non-blocking I/O Feb 25, 2023
@jackc
Copy link
Owner

jackc commented Feb 25, 2023

It looks like TestInternalNonBlockingWrite/Pipe is hanging on master too. So I don't think that has anything to do with this change. Not sure why it's only failing on Windows though. There doesn't seem to be any platform specific code.

@jackc
Copy link
Owner

jackc commented Feb 25, 2023

I fixed TestInternalNonBlockingWrite/Pipe on master. There are still a few failures on TestInternalNonBlockingWrite/* but it doesn't hang now. Looking into those errors now.

@jackc
Copy link
Owner

jackc commented Feb 28, 2023

Looks good to me. I'm inclined to merge this as is.

Regarding the TestInternalNonBlockingWrite tests, I think I have the resource exhaustion you experienced fixed on master. But I still get a local error: tls: bad record MAC failure on a couple tests. And as crazy as it sounds, I think it is actually a bug in Go or in Windows. I spent a number of hours investigating and producing a test case. I filed an issue about it golang/go#58764.

If you get a chance could you help me with that error by:

  1. Run the test case (https://github.com/jackc/golang-windows-dupreads) I produced for that issue and see if you get the same error.
  2. Review my report net: TCP connection erroneously duplicates message on Windows  golang/go#58764. My report was very understandably greeted with a little skepticism. I would be skeptical too. But as far as I can tell, the bug has to be in the net package or below (i.e. Go runtime or Windows).

@codercms
Copy link
Contributor Author

codercms commented Feb 28, 2023

@jackc I tested the master branch, nbconn_test.go, everything is fine except two test cases, I got the same errors as you:

TestInternalNonBlockingWrite/TLS_over_TCP_with_Fake_Non-blocking_IO
Error:      	Received unexpected error:
        	            	local error: tls: bad record MAC

TestInternalNonBlockingWrite/TLS_over_TCP_with_Real_Non-blocking_IO
Error:      	Received unexpected error:
        	            	local error: tls: bad record MAC

P.S. If you need to test something on WIndows, I'm ready to help (and I'll look at this bug)
P.P.S. I also ran tests for nbconn_test.go on Ubuntu 22.04 (WSL 2, Go 1.18.1) and all tests pass, it looks like this bug really occurs on Windows only.

@codercms
Copy link
Contributor Author

codercms commented Mar 2, 2023

@jackc I have the same errors as you in this test repo - https://github.com/jackc/golang-windows-dupreads
Tried in different ways, but the error is always the same.
Look like a bug in Go's stdlib.

@drakkan
Copy link
Contributor

drakkan commented Mar 2, 2023

@jackc I have the same errors as you in this test repo - https://github.com/jackc/golang-windows-dupreads Tried in different ways, but the error is always the same. Look like a bug in Go's stdlib.

@codercms please read the latest comment in the upstream issue. Can you check if anything changes using Go 1.19.5? If you can't, I'll try it myself next weekend (I don't have a Windows test environment right now)

@codercms
Copy link
Contributor Author

codercms commented Mar 2, 2023

@drakkan same error with Go 1.19.5 & 1.18.10 & 1.16 & 1.15

@drakkan
Copy link
Contributor

drakkan commented Mar 2, 2023

@drakkan same error with Go 1.19.5 & 1.18.10 & 1.16 & 1.15

Thanks for your testing. Not sure how to try to help. Based on this comment bad record MAC is a very internal error. I hope Go team could take a look.

What are the pratical implications for pgx? Can we have sporadic errors in Windows with TLS enabled?

@jackc
Copy link
Owner

jackc commented Mar 2, 2023

See golang/go#58764 for my longer description of the problem. But I'm pretty sure the problem is not in the tls package. I made wrapper conns that sit below the tls conn and record the traffic. They are seeing the erroneous duplication of traffic. That is what breaks the tls connection. This also suggests that the problem is not exclusive to tls connections, it's just that's the only way I have managed to trigger the issue.

As far as its affect for pgx on Windows, it's concerning but I think it would be very difficult to trigger in real world usage. We can only duplicate this issue when at least 2.7 MB of writes are buffered on the client side and the server has simultaneously written a large amount of data. The only way I can foresee that happening is if a CopyFrom with a lot of data is executed and there is something on the server that causes a huge amount of data to be sent back while it is receiving the data (e.g. an insert trigger that did a lot of raise notice).

@jackc jackc merged commit 0dbb0a5 into jackc:master Mar 4, 2023
@jackc
Copy link
Owner

jackc commented Mar 4, 2023

I've decided to merge this now. The failing test predates this change and appears to be a Go or Windows issue. This change should be an improvement regardless.

@codercms codercms deleted the windows-non-blocking-io branch March 4, 2023 18:29
if c.nonblockWriteFunc == nil {
c.nonblockWriteFunc = func(fd uintptr) (done bool) {
// Make sock non-blocking
if err := setSockMode(fd, sockModeNonBlocking); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a test case so feel free to ignore this comment.
What if another goroutine ends a write and set the socket to blocking mode after this method ends and before calling WSASend?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@drakkan Connection isn't considered as thread (goroutine) safe , it should be managed through connection pool, that guarantees that only one goroutine can work with connection.

Here it is - https://github.com/jackc/pgx/wiki/Getting-started-with-pgx#using-a-connection-pool

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@drakkan Connection isn't considered as thread (goroutine) safe , it should be managed through connection pool, that guarantees that only one goroutine can work with connection.

Here it is - https://github.com/jackc/pgx/wiki/Getting-started-with-pgx#using-a-connection-pool

Ok thanks. I was just trying to figure out if it could be related to #1552, #1548

Copy link
Contributor Author

@codercms codercms Mar 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@drakkan hmm, #1552 looks really strange. I started debugging this case and found that there is inconsistent setSockMode calls:

var currSockMode = sockModeBlocking

func setSockMode(fd uintptr, mode sockMode) error {
	if mode == currSockMode {
		fmt.Printf("double setSockMode=%d\n", mode)
	}

	currSockMode = mode

	res, _, err := ioctlsocket.Call(fd, uintptr(FIONBIO), uintptr(unsafe.Pointer(&mode)))
	// Upon successful completion, the ioctlsocket returns zero.
	if res != 0 && err != nil {
		return err
	}

	return nil
}

Here is output when using CopyFrom with large amount of data:

double setSockMode=1
double setSockMode=0
double setSockMode=1
double setSockMode=0
double setSockMode=1
double setSockMode=0
double setSockMode=0
double setSockMode=0
double setSockMode=0
double setSockMode=0
double setSockMode=1
double setSockMode=0
double setSockMode=0
double setSockMode=0
double setSockMode=0
double setSockMode=1
(hangs forever)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added goroutine ids to output:

double setSockMode=1 (id=1)
double setSockMode=0 (id=1)
double setSockMode=0 (id=25)
double setSockMode=0 (id=1)
double setSockMode=0 (id=25)
double setSockMode=1 (id=25)

Looks like there is race condition

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have much expertise with Windows network programming, but a bit of doc and Google searching seems to indicate that it should be safe to for one thread to read and one thread to write from the safe socket. However, it appears that some of the winsock API is not thread safe. Perhaps that includes the ioctlsocket function.

...

But now that I think about it a bit more I think I have a more likely cause for the deadlock. pgx will read and write concurrently under certain conditions (such as bufferNonblockingRead when a write is blocked). Both of these functions set non-blocking mode, read or write, and set blocking mode.

What happens if the following sequence of events occurs:

  1. Writer sets non-blocking mode
  2. Reader sets non-blocking mode
  3. Writer writes
  4. Writer sets blocking mode
  5. Reader reads

The socket is unexpectedly in blocking mode for the read. If nothing is ever received it hangs.


As far as Unix goes, I think that since we are not changing the socket options, we are only reading and writing, it is safe.


I really wish the Go stdlib included non-blocking I/O. This system is so complicated...

I'm half tempted to try using goroutines and normal Go I/O again. I tried it a few years ago and the performance hit was significant, but maybe modern Go has higher performance...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackc You're absolutely right about that socket can be used for concurrent read/write (socket is full duplex).
I think implementing critical section (e.g. mutex) for Windows non-blocking I/O will be simplest solution. What do you think will this fix the problem? But additional benchmark is needed, because I/O actually will be not full duplex (you can only write or only read at the same time).

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be the simplest approach.

But it seems it should be possible to switch to non-blocking mode at the beginning of a sequence of read or writes and turn it off at the end. Not sure how messy that would be though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackc switching to non-blocking mode looks good too, because it doesn't hit performance.
I think non-blocking conn could expose setBlocking(blocking bool) method or something like this, for Windows it will call ioctlsocket, for Unix it will do nothing.

Copy link
Contributor Author

@codercms codercms Mar 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackc I made a draft PR for setting connection to non-blocking mode explicitly - #1555 . Please have a look on it

codercms added a commit to codercms/pgx that referenced this pull request Mar 18, 2023
jackc pushed a commit that referenced this pull request Mar 24, 2023
Created based on discussion here: #1525 (review)

Fixes #1552
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PGX v5 Pool poor performance on Windows
3 participants