goroutines and tcp errors #723

Closed
Virtual-Machine opened this issue Mar 15, 2018 · 4 comments · Fixed by #1013
Comments

@Virtual-Machine

Full disclosure: I may have something misconfigured with Postgres.

max_connections on db = 1000

My understanding is that the db object returned by sql.Open() is a connection pool, and that it can be safely used from multiple goroutines for concurrent access.

However, the minimal example below does one of three things on my machine.

  1. Very rarely it runs to completion as expected.
  2. Most often it fails with : "read tcp [::1]:port->[::1]:5432: read: connection reset by peer"
  3. Rarely it fails with "write tcp [::1]:port->[::1]:5432: write: broken pipe"

Lowering the goroutines variable increases reliability; raising it makes failure all but guaranteed.
The number of records printed to the terminal before an error varies between runs of the same binary.

package main

import (
	"database/sql"
	"fmt"
	"os"

	_ "github.com/lib/pq"
)

func main() {
	// Connect to db
	connStr := "postgres://@localhost/test?sslmode=disable"
	db, err := sql.Open("postgres", connStr)
	if err != nil {
		panic(err.Error())
	}
	defer db.Close()

	err = db.Ping()
	if err != nil {
		panic(err.Error())
	}

	goroutines := 300

	// Channel to synchronize goroutines
	channel := make(chan int, goroutines)

	// Spawn goroutines to select records and iterate over them
	for i := 0; i < goroutines; i++ {
		go func(j int) {
			rows, err := db.Query("select name, age from contacts where age = $1", j)
			if err != nil {
				fmt.Println(err)
				os.Exit(-1)
			}
			var name string
			var age int
			for rows.Next() {
				err := rows.Scan(&name, &age)
				if err != nil {
					fmt.Println(err)
					os.Exit(-1)
				}
			}
			fmt.Println(name, age)
			err = rows.Err()
			if err != nil {
				fmt.Println(err)
				os.Exit(-1)
			}
			rows.Close()
			channel <- 0
		}(i)
	}

	// Wait for goroutines to finish
	for i := 0; i < goroutines; i++ {
		<-channel
	}
}
@cbandy
Contributor

cbandy commented Mar 20, 2018

When it fails, what is in the PostgreSQL (server) logs?

@Virtual-Machine
Author

Let me know if you want me to enable more detailed logging or another logging option. This is what I see in the log around the time of a pipe breaking:

2018-03-20 14:02:18.853 EDT [16753] LOCATION:  shmem_exit, ipc.c:259
2018-03-20 14:02:18.853 EDT [16753] DEBUG:  00000: proc_exit(-1): 0 callbacks to make
2018-03-20 14:02:18.853 EDT [16753] LOCATION:  proc_exit_prepare, ipc.c:188
2018-03-20 14:02:18.853 EDT [16504] DEBUG:  00000: forked new backend, pid=16756 socket=12
2018-03-20 14:02:18.853 EDT [16504] LOCATION:  BackendStartup, postmaster.c:4099
2018-03-20 14:02:18.853 EDT [16754] DEBUG:  00000: CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2018-03-20 14:02:18.853 EDT [16754] LOCATION:  ShowTransactionStateRec, xact.c:5020
2018-03-20 14:02:18.854 EDT [16754] LOG:  08006: could not send data to client: Broken pipe
2018-03-20 14:02:18.854 EDT [16754] LOCATION:  internal_flush, pqcomm.c:1464
2018-03-20 14:02:18.854 EDT [16754] FATAL:  08006: connection to client lost

@dharmjit

Hi, I am also facing the same issue: random broken pipe errors. Is there any workaround for this?

@gnuletik

@dharmjit See #871 (comment) for workarounds.

roylee17 added a commit to roylee17/sqlx that referenced this issue Mar 21, 2021
    I'm seeing "broken pipe" errors when working with CRDB using sqlx.
    The issue seemed to be that the TCP connections were disconnected while
    the conns in the db driver (pq) were still held as stale connections.

    It happens more often when the DB is behind a proxy.
    In our case, the pods were proxied by the envoy sidecar.

    There were other instances in the community reporting similar issues,
    which took different workarounds: sending periodic dummy queries from
    the app to mimic keepalives, lengthening the proxy idle timeout, or
    shortening the lifetime of db connections.

    This has been reported and fixed by the lib/pq upstream in v1.9+

      lib/pq#1013

      lib/pq#723
      lib/pq#897
      lib/pq#870

      grafana/grafana#29957
bonzofenix pushed a commit to cloudfoundry/app-autoscaler that referenced this issue Sep 24, 2021
To consume fix for lib/pq#723
as it has the same symptom as issues we have seen on Azure DB for PostgreSQL.