CopyFrom is ~80x slower in v5.2.0 versus v4.17.2 #1481
Comments
There must be something platform or network specific. I'm showing v4 taking ~78% of the time of v5. Significant, but not 80x. I'm on macOS on an M2 Max using a unix socket. The memory allocations are still excessive, though.

If I had to guess, I would say something with the new non-blocking mode is the problem. Still looking, though.
This was causing allocations every time there was a non-blocking read with nothing to read. #1481
Found a place that was allocating a buffer for a non-blocking read and not releasing it back to the sync.Pool when no bytes were read. That happens a lot in the CopyFrom path. With that fixed (7941518), pgx v5 is faster than v4 for me:

jack@glados ~/dev/pgx_issues/pgx-1481 ±master⚡ » got -bench=. -benchmem
goos: darwin
goarch: arm64
pkg: github.com/jackc/pgx_issues/pgx-1481
BenchmarkPgx4-12    51    23224866 ns/op    351749 B/op        53 allocs/op
BenchmarkPgx5-12    51    22054759 ns/op   3926631 B/op    185413 allocs/op
PASS
ok      github.com/jackc/pgx_issues/pgx-1481    3.221s

YMMV, though; you were starting from a much slower place. Let me know how this change affects you. Also, are you on Windows? True non-blocking reads are only implemented on *nix right now. On Windows it is simulated. That could have a significant impact.
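A minimal sketch of the kind of leak described above: a buffer taken from a sync.Pool but not returned when a non-blocking read comes back empty. The pool and function names are illustrative, not pgx's actual iobufpool code.

```go
import (
	"net"
	"sync"
)

// bufPool stands in for a sync.Pool-backed buffer pool like pgx's iobufpool.
var bufPool = sync.Pool{
	New: func() any { return make([]byte, 8192) },
}

func nonblockingRead(conn net.Conn) ([]byte, error) {
	buf := bufPool.Get().([]byte)

	n, err := conn.Read(buf)
	if n == 0 {
		// Before the fix: returning here without bufPool.Put(buf) abandoned the
		// buffer to the GC, costing an allocation on every empty non-blocking read.
		bufPool.Put(buf)
		return nil, err
	}
	// On success the caller is responsible for releasing buf back to the pool.
	return buf[:n], err
}
```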
This significantly reduces memory allocations in paths that repeatedly encode the same type of values such as CopyFrom. #1481
And memoizing the encode plans reduces the allocation count by 1/3 (42a4719).
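A minimal sketch of what memoizing encode plans looks like in general. The map key, cache type, and interface here are illustrative; they are not pgx's internal pgtype API.

```go
import "reflect"

// planKey identifies an encode plan by PostgreSQL type OID and Go value type,
// so repeated rows in CopyFrom reuse the same plan instead of re-deriving it.
type planKey struct {
	oid    uint32
	goType reflect.Type
}

type encodePlan interface {
	Encode(value any, buf []byte) ([]byte, error)
}

type planCache struct {
	plans map[planKey]encodePlan
}

func (c *planCache) planFor(oid uint32, value any, derive func() encodePlan) encodePlan {
	key := planKey{oid: oid, goType: reflect.TypeOf(value)}
	if p, ok := c.plans[key]; ok {
		return p // cache hit: no new allocations for this column type
	}
	p := derive() // expensive path: inspects the value's type and allocates
	if c.plans == nil {
		c.plans = make(map[planKey]encodePlan)
	}
	c.plans[key] = p
	return p
}
```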
This saves an allocation on every call. #1481
Only create RawConn.Read callback once and have it use NetConn fields. Avoids the closure and some allocations. #1481
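A rough sketch of that pattern: create the syscall.RawConn read callback once and let it work through struct fields instead of capturing locals in a fresh closure on every read. This is a *nix-only illustration with invented names, not the actual pgx NetConn code.

```go
import "syscall"

type nbConn struct {
	rawConn syscall.RawConn

	// State the read callback needs, stored on the struct so the
	// callback can be created once and reused for every read.
	readBuf []byte
	readN   int
	readErr error
	readFn  func(fd uintptr) bool
}

func newNbConn(rc syscall.RawConn) *nbConn {
	c := &nbConn{rawConn: rc}
	// One method value, created once, instead of a new closure per call.
	c.readFn = c.rawRead
	return c
}

func (c *nbConn) rawRead(fd uintptr) bool {
	c.readN, c.readErr = syscall.Read(int(fd), c.readBuf)
	return true // do not retry; we want non-blocking semantics
}

func (c *nbConn) nonblockingRead(b []byte) (int, error) {
	c.readBuf = b
	defer func() { c.readBuf = nil }()
	if err := c.rawConn.Read(c.readFn); err != nil {
		return 0, err
	}
	return c.readN, c.readErr
}
```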
Thanks Jack for the quick answer and first fixes.

goos: windows
goarch: amd64
pkg: nph/bench
cpu: Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz
BenchmarkPgx4-6    12      98959325 ns/op    363179 B/op     60 allocs/op
BenchmarkPgx5-6     1    7976769700 ns/op    828320 B/op    563 allocs/op
PASS
ok      nph/bench    10.757s

I am indeed on a Windows env, so the underlying issue is probably Windows/network related, as you suggested. I also played with the buffer size in the CopyFrom method, and when setting buf := iobufpool.Get(6553600), it clearly converged perf-wise with v4:

goos: windows
goarch: amd64
pkg: nph/bench
cpu: Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz
BenchmarkPgx4-6    10     101899850 ns/op     363856 B/op     63 allocs/op
BenchmarkPgx5-6     6     171502650 ns/op   14198041 B/op    296 allocs/op
PASS
ok      nph/bench    3.825s

That is theoretically 100x fewer roundtrips, so the fishy part indeed lies within the send/receive logic for Windows. I came across this Microsoft KB article while looking for potential root causes:
Tried to tweak Windows with their method #4, no luck.

I timed reads (pgConn.receiveMessage) and writes (pgConn.frontend.SendUnbufferedEncodedCopyData) in PgConn::CopyFrom; reads are clearly the bottleneck:

Write time: 1662200ns

Inquiring further, it looks like the issue is located in fakeNonblockingRead() (used behind pgConn.peekMessage() / pgConn.frontend.Receive()): waiting for a message that never arrives results in 100ms lost each time (and ErrWouldBlock):

Receive: 106774000 ns - msg: , err: would block

```go
func (c *NetConn) fakeNonblockingRead(b []byte) (n int, err error) {
	c.readDeadlineLock.Lock()
	defer c.readDeadlineLock.Unlock()

	deadline := time.Now().Add(fakeNonblockingWaitDuration) // => 100ms
	if c.readDeadline.IsZero() || deadline.Before(c.readDeadline) {
		err = c.conn.SetReadDeadline(deadline)
		if err != nil {
			return 0, err
		}
		defer func() {
			// Ignoring error resetting deadline as there is nothing that can reasonably be done if it fails.
			c.conn.SetReadDeadline(c.readDeadline)
			if err != nil {
				if errors.Is(err, os.ErrDeadlineExceeded) {
					err = ErrWouldBlock
				}
			}
		}()
	}

	return c.conn.Read(b)
}
```
The reason for a high max wait time was to ensure that reads aren't cancelled when there is data waiting in Go's or the OS's receive buffer. Unfortunately, there is no way to know ahead of time how long this should take. This new code uses 2x the fastest successful read time as the max read time. This allows the code to adapt to whatever host it is running on. #1481
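A minimal sketch of that adaptive-deadline idea. The struct, field names, and initial 100ms guess are illustrative assumptions, not the actual pgx internals.

```go
import (
	"net"
	"time"
)

type adaptiveConn struct {
	conn            net.Conn
	fastestGoodRead time.Duration // fastest read that returned data
	maxNonblockWait time.Duration // current cap for a "non-blocking" read
}

func (c *adaptiveConn) fakeNonblockingRead(b []byte) (int, error) {
	if c.maxNonblockWait == 0 {
		c.maxNonblockWait = 100 * time.Millisecond // conservative initial guess
	}
	if err := c.conn.SetReadDeadline(time.Now().Add(c.maxNonblockWait)); err != nil {
		return 0, err
	}
	defer c.conn.SetReadDeadline(time.Time{}) // best-effort reset of the deadline

	start := time.Now()
	n, err := c.conn.Read(b)
	if n > 0 {
		elapsed := time.Since(start)
		if c.fastestGoodRead == 0 || elapsed < c.fastestGoodRead {
			c.fastestGoodRead = elapsed
			c.maxNonblockWait = 2 * elapsed // 2x the fastest successful read
		}
	}
	return n, err
}
```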
A few more optimizations and now the memory usage is comparable.
But the problem you are having is definitely caused by the fake non-blocking read. ... The problem is the lack of true non-blocking IO for Windows. I think it should be possible, but I'm not sure how.

Since we are left with fake non-blocking IO, the problem is figuring out a wait time that allows a read to succeed without waiting too long. Originally, I just chose a large number to ensure that we didn't get in a situation where all reads fail because the read deadline is too short.

898891a should improve this significantly. It keeps track of the time of successful reads and uses 2x the fastest read as the wait time for a fake non-blocking read. This allows it to adapt to the speed of the host and still provides reasonable assurance that reads will work when data is waiting in Go or the OS's receive buffer.

With those changes in place, I disabled true non-blocking IO and tried the benchmark again:

I think this should resolve this issue.
Thanks again Jack, much better indeed, but unfortunately still a 10x gap :/

goos: windows
goarch: amd64
pkg: nph/bench
cpu: Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz
BenchmarkPgx4-6     8     131495475 ns/op    373196 B/op     81 allocs/op
BenchmarkPgx5-6     1    1235863700 ns/op    770944 B/op    361 allocs/op
PASS
ok      nph/bench    2.731s

I am unfortunately no networking expert, but if non-blocking IO is not available on Windows, would it be possible to come back to the same kind of goroutine logic as in CopyFrom v4 (async write / sync receive), for Windows only?
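For reference, a minimal sketch of that v4-style split: one goroutine streams the COPY data while the caller keeps doing ordinary blocking reads. The function and the readResponses callback are illustrative, not v4's actual implementation.

```go
import (
	"io"
	"net"
)

func copyWithAsyncWrite(conn net.Conn, data io.Reader, readResponses func() error) error {
	writeErr := make(chan error, 1)

	// Async write path: stream the COPY data from a goroutine.
	go func() {
		_, err := io.Copy(conn, data)
		writeErr <- err
	}()

	// Sync receive path: plain blocking reads, no non-blocking tricks needed.
	if err := readResponses(); err != nil {
		return err
	}
	return <-writeErr
}
```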
To be clear, non-blocking IO is definitely possible on Windows, and is almost certainly possible from Go. But I don't have a Windows dev box, nor do I have the Windows expertise to easily implement it. The ideal solution is to get that implemented. I'd much rather see that happen than have two separate CopyFrom implementations. I'm going to post a new issue requesting help on this.
@jeremy-dupre One other thing: can you look at how long the fake non-blocking reads are taking again, and confirm that that is still the problem? I would have expected that to be less than 100µs, and if it is that low I would not expect the impact to be so great.
I timed line 454 (=> c.conn.Read(b)) in func (c *NetConn) fakeNonblockingRead:

conn.Read: 505600 ns; read 34 byte(s); err: <nil>; fakeNonblockingReadWaitDuration: 100000000 ns
conn.Read: 3889600 ns; read 0 byte(s); err: read tcp 192.168.1.25:62214->163.xxx.xxx.xxx:5432: i/o timeout; fakeNonblockingReadWaitDuration: 1011200 ns
conn.Read: 16316900 ns; read 0 byte(s); err: read tcp 192.168.1.25:62214->163.xxx.xxx.xxx:5432: i/o timeout; fakeNonblockingReadWaitDuration: 1011200 ns
conn.Read: 14711300 ns; read 0 byte(s); err: read tcp 192.168.1.25:62214->163.xxx.xxx.xxx:5432: i/o timeout; fakeNonblockingReadWaitDuration: 1011200 ns
conn.Read: 16246900 ns; read 0 byte(s); err: read tcp 192.168.1.25:62214->163.xxx.xxx.xxx:5432: i/o timeout; fakeNonblockingReadWaitDuration: 1011200 ns
conn.Read: 16352800 ns; read 0 byte(s); err: read tcp 192.168.1.25:62214->163.xxx.xxx.xxx:5432: i/o timeout; fakeNonblockingReadWaitDuration: 1011200 ns
conn.Read: 17267500 ns; read 0 byte(s); err: read tcp 192.168.1.25:62214->163.xxx.xxx.xxx:5432: i/o timeout; fakeNonblockingReadWaitDuration: 1011200 ns
...

Each read (except the first, and only, one that returns something) takes ~16ms and ends up in an i/o timeout, even though the fake read deadline is 1ms.
Good to know, thanks. I suppose the read deadline on Go / Windows isn't very strictly enforced.
See golang/go#44343.
Following golang/go#44343, I tried forcing a 1ms Windows timer resolution via winmm.dll's timeBeginPeriod before setting the read deadline:

```go
var (
	winmmDLL            = syscall.NewLazyDLL("winmm.dll")
	procTimeBeginPeriod = winmmDLL.NewProc("timeBeginPeriod")
)
```

```go
if c.readDeadline.IsZero() || deadline.Before(c.readDeadline) {
	procTimeBeginPeriod.Call(uintptr(1)) // 1ms; probably overkill to call it each time, but at least we can be sure it's not reset by some other process
	err = c.conn.SetReadDeadline(deadline)
```

and changed the proposed wait to 1.5x the measured read time:

```go
proposedWait := endTime.Sub(startTime) * 3 / 2
```

Then v5 is 20% faster than v4:

goos: windows
goarch: amd64
pkg: nph/bench
cpu: Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz
BenchmarkPgx4-6 10 122825480 ns/op 360039 B/op 65 allocs/op
BenchmarkPgx5-6 10 103812350 ns/op 790924 B/op 222 allocs/op
PASS
ok      nph/bench    5.414s

And when testing with a local Postgres server, the bottleneck indeed remains the minimal Windows timer 'precision':

conn.Read: 0 ns; read 34 byte(s); err: <nil>; fakeNonblockingReadWaitDuration: 1000 ns
conn.Read: **1053600 ns**; read 0 byte(s); err: read tcp 192.168.1.25:63651->163.xxx.xxx.xxx:5432: i/o timeout; fakeNonblockingReadWaitDuration: 1000 ns
conn.Read: **1047000 ns**; read 0 byte(s); err: read tcp 192.168.1.25:63651->163.xxx.xxx.xxx:5432: i/o timeout; fakeNonblockingReadWaitDuration: 1000 ns
conn.Read: **1052400 ns**; read 0 byte(s); err: read tcp 192.168.1.25:63651->163.xxx.xxx.xxx:5432: i/o timeout; fakeNonblockingReadWaitDuration: 1000 ns
conn.Read: **1026400 ns**; read 0 byte(s); err: read tcp 192.168.1.25:63651->163.xxx.xxx.xxx:5432: i/o timeout; fakeNonblockingReadWaitDuration: 1000 ns
I added a few more optimizations. 384a581 should use the copy buffer more efficiently. 9963c32 causes the fake non-blocking read to only read 1 byte the first 5 times bytes are read. This ensures that there are 4 reads where there is data already in Go or the OS's network receive buffer. In other words, this should give us the time necessary to get past Go's deadline system, but not require waiting for an actual network receive.

Your points 1 and 2 are working around the fundamental issue -- no real non-blocking I/O. I'd rather not spend the time and add the complexity on a workaround. FWIW, I looked into what it would take to get it implemented on Windows. It doesn't look too hard. Hopefully someone will tackle #1482.

Point 3 should be resolved by 9963c32.
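A minimal sketch of that 1-byte warm-up idea. The counter, threshold constant, and type are illustrative, not the code in 9963c32.

```go
import "net"

// warmupReads is how many successful reads are limited to a single byte, so
// their timing reflects data already buffered in Go or the OS rather than a
// full network round trip.
const warmupReads = 5

type timedConn struct {
	conn      net.Conn
	readCount int
}

func (c *timedConn) fakeNonblockingRead(b []byte) (int, error) {
	if c.readCount < warmupReads && len(b) > 1 {
		b = b[:1] // read a single byte while we are still calibrating read times
	}
	n, err := c.conn.Read(b)
	if n > 0 {
		c.readCount++ // count only reads that actually returned bytes
	}
	return n, err
}
```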
Thanks much for the fixes - greatly appreciated.
FYI, here's my latest bench - pgx v5 perf is still 40-ish % off pgx v4 perf:

goos: windows
goarch: amd64
pkg: nph/bench
cpu: Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz
BenchmarkPgx4-6 9 112712144 ns/op 370487 B/op 76 allocs/op
BenchmarkPgx5-6 7 154666243 ns/op 366804 B/op 178 allocs/op
PASS
ok      nph/bench    3.622s
Hello, I found this issue after running some benchmarks to decide whether to switch from INSERT to COPY. I thought it might be useful to share some numbers. I'm on Linux rather than Windows so not everything discussed here applies.
As can be seen, the memory improvements are significant 🎉

@jackc Do you foresee tagging a new release in the near future? No particular rush on my part, just curious what sort of timeline to expect.
@jeremy-dupre Try branch windows-non-blocking-io.
@jeremy-dupre Could you also try these changes - #1525?
Describe the bug
CopyFrom in v5.2.0 is ~80x slower than CopyFrom in v4.17.2
Likely because of the new logic implemented in: func (pgConn *PgConn) CopyFrom(ctx context.Context, r io.Reader, sql string) (CommandTag, error)
Here is the bench result:
goos: windows
goarch: amd64
pkg: nph/bench
cpu: Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz
BenchmarkPgx4-6 10 106087930 ns/op 363762 B/op 66 allocs/op
BenchmarkPgx5-6 1 8003570000 ns/op 2422688 B/op 100559 allocs/op
PASS
ok nph/bench 11.587s
To Reproduce
Run the bench provided below
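The original benchmark code did not survive in this copy of the thread. A minimal sketch of a comparable CopyFrom benchmark is shown here; the connection string, table definition, column names, and row count are assumptions, not the reporter's actual bench.

```go
package bench

import (
	"context"
	"testing"

	pgx "github.com/jackc/pgx/v5"
)

// BenchmarkPgx5CopyFrom copies a fixed set of rows into a pre-created table
// (e.g. CREATE TABLE copy_bench (id int8, val text)) on each iteration.
func BenchmarkPgx5CopyFrom(b *testing.B) {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, "postgres://user:pass@localhost:5432/bench")
	if err != nil {
		b.Fatal(err)
	}
	defer conn.Close(ctx)

	rows := make([][]any, 10000)
	for i := range rows {
		rows[i] = []any{int64(i), "some value"}
	}

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_, err := conn.CopyFrom(ctx, pgx.Identifier{"copy_bench"}, []string{"id", "val"}, pgx.CopyFromRows(rows))
		if err != nil {
			b.Fatal(err)
		}
	}
}
```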
Expected behavior
I would expect CopyFrom v5.x to be roughly on par with CopyFrom v4.x performance-wise
Actual behavior
CopyFrom v5.2.0 is 80x slower than v4.17.2
Version
Additional context