7.18.2: "Connection terminated unexpectedly" when using client.query with a pool when pool has been idle for 10 minutes (running in AWS Lambda) #2112

jcollum · 2020-02-24T22:05:37Z

Code is running in a lambda. When we run the function below it runs fine. But if we run it and then wait 10 minutes then run it again, we get an error every time. We are querying Aurora Postgres on AWS.

Code:

const {Pool} = require('pg');

// Global Connection, can be re-used!!
const pgPool = new Pool({
    user: process.env.PG_USER,
    host: process.env.PG_HOST,
    database: process.env.PG_DATABASE,
    password: process.env.PG_PASSWORD,
    port: process.env.PG_PORT,
    max: process.env.MAX_CLIENTS
});

pgPool.on('error', (err, client) => {
    console.error('Unexpected error in Postgress connection pool', err);
});

async function performQuery(event) {

    let queryString = null;
    let args = null;

    try {
        // some setup code here...  
        const client = await pgPool.connect();
        try {
            const res = await client.query(queryString, args);
            return res.rows;
        } finally {
            client.release();
        }
    } catch (err) {
        console.error('Problem executing export query:');
        console.error(err); // <-- this line is in the log below  
        throw err;
    }
}

This is what I see in the cloudwatch logs:

{
    "errorType": "Error",
    "errorMessage": "Connection terminated unexpectedly",
    "stack": [
    "Error: Connection terminated unexpectedly",
    "    at Connection.<anonymous> (/var/task/node_modules/pg/lib/client.js:255:9)",
    "    at Object.onceWrapper (events.js:312:28)",
    "    at Connection.emit (events.js:223:5)",
    "    at Connection.EventEmitter.emit (domain.js:475:20)",
    "    at Socket.<anonymous> (/var/task/node_modules/pg/lib/connection.js:78:10)",
    "    at Socket.emit (events.js:223:5)",
    "    at Socket.EventEmitter.emit (domain.js:475:20)",
    "    at TCP.<anonymous> (net.js:664:12)"
]
}

I've tried a few variations on this but the constant is the 10 minutes and the use of the Pool. To me this code is almost identical to the code in https://node-postgres.com/features/pooling.

So far it looks like the problem has been solved by using a Client instead:

const client = new Client({
            user: process.env.PG_USER,
            host: process.env.PG_HOST,
            database: process.env.PG_DATABASE,
            password: process.env.PG_PASSWORD,
            port: process.env.PG_PORT
        });

        await client.connect();
        const res = await client.query(queryString, args);
        await client.end();
        return res.rows;

The text was updated successfully, but these errors were encountered:

jamesdixon · 2020-02-25T00:15:31Z

@jcollum this is also happening for us and a number of others (see this thread knex/knex#3636).

We're seeing this in Node 10 / 12 in the Lambda environment. Only solution at the moment was to run a custom Node 8 runtime for our Lambda functions.

briandamaged · 2020-02-25T00:19:02Z

@jamesdixon : Ah -- I was just about to point you to this thread! lol

592da · 2020-02-25T09:40:32Z

dealing with the same issue on production,
working with serverless, over AWS lambdas, node v12 (over typeorm).

@briandamaged @jamesdixon any solution?
did not manage to set the runtime to node8...

please help

brianc · 2020-02-25T15:46:11Z

It looks like it's related to the lambda environment or something and not pg. According to this it's hitting both pg & mysql....which makes it somewhat unlikely to be caused by this library. I'm not sure there's much I can do w/o having steps to reproduce, unfortunately. Looking at @jcollum's example it looks like it's somehow related to the lambda going idle and killing the open connections to the db. Weirdly if you only run the lambda once every 10 minutes that should be well outside the idle timeout and the pool should have closed all it's connections. Anyways I can make lots of assumptions about what's happening but w/o steps to reproduce I'm not gonna be able to provide much concrete support. I'll follow along if more info comes up I'll look into it. Also: PRs are always welcome for issues like this if you figure out what it is! 😉

jamesdixon · 2020-02-25T17:35:14Z

Thanks for the input, @brianc!

According to this it's hitting both pg & mysql....which makes it somewhat unlikely to be caused by this library

That was the original thought but later on in the thread you referenced, one of the Knex maintainers pointed out that the MySQL error was actually different and may not be related.

The one thing we do know is that the issue isn't present in Node 8 but rears its ugly head in Node 10/12. Are you aware of any major changes in those versions related to this library that might be exacerbated in a Lambda environment?

briandamaged · 2020-02-25T18:02:04Z

@jamesdixon : Howdy! Just to clarify: what I meant was that Knex did not appear to be directly related to the pg-specific control path. However, I suspect that the same lower-level issue is affecting both "pg" and "mysql".

My guess is that either the server or the network is aggressively closing the connections. This is then causing the client-side libraries to "get confused" when they discover that the connection is already closed. (At least, that appears to be the case for "pg". The Client class checks the this._ending flag to see whether or not it was expecting the connection to be closed)

jamesdixon · 2020-02-25T18:34:48Z

Got it! Thanks for the clarification!

…

On Tue, Feb 25, 2020 at 11:02 AM Brian Lauber ***@***.***> wrote: @jamesdixon <https://github.com/jamesdixon> : Howdy! Just to clarify: what I meant was that Knex did not appear to be directly related to the pg-specific control path. However, I suspect that the same lower-level issue is affecting both "pg" and "mysql". My guess is that either the server or the network is aggressively closing the connections. This is then causing the client-side libraries to "get confused" when they discover that the connection is already closed. (At least, that appears to be the case for "pg". The Client class checks the this._ending flag to see whether or not it was expecting the connection to be closed) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#2112?email_source=notifications&email_token=AAA24IXYNC5ZRZGUCSJDUOTREVMJ5A5CNFSM4K2TYUC2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEM44G7I#issuecomment-590988157>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAA24ITEVY7NCGYWPMDCCKDREVMJ5ANCNFSM4K2TYUCQ> .

jcollum · 2020-02-25T19:24:15Z

I fixed it by not using a pool:

try {

// trimmed code here 

        const client = new Client({
            user: process.env.PG_USER,
            host: process.env.PG_HOST,
            database: process.env.PG_DATABASE,
            password: process.env.PG_PASSWORD,
            port: process.env.PG_PORT
        });

        await client.connect();
        const res = await client.query(queryString, args);
        await client.end();
        return res.rows;

    } catch (err) {
        console.error('Problem executing export query:');
        console.error(err);
        throw err;
    }

Minor performance hit there.

briandamaged · 2020-02-25T19:31:24Z

@jcollum : My guess is that the issue isn't actually "fixed", but rather "avoided". As in: whenever your code needs to perform a query, it is:

Establishing a new connection (via the freshly-created Client instance)
Performing the query
Ending the connection cleanly. (by invoking client.end() )

BTW: it looks like there is a connection leak in that snippet. It should probably do this:

try {
// trimmed code here 

        const client = new Client({
            user: process.env.PG_USER,
            host: process.env.PG_HOST,
            database: process.env.PG_DATABASE,
            password: process.env.PG_PASSWORD,
            port: process.env.PG_PORT
        });

        await client.connect();
        try {
          const res = await client.query(queryString, args);
          return res.rows;
        } finally {
          await client.end();
        }

    } catch (err) {
        console.error('Problem executing export query:');
        console.error(err);
        throw err;
    }

This will ensure that the connection will be closed even if an error occurs.

brianc · 2020-02-25T19:34:50Z

Yah opening a new client every time is something you can do, but the latency overhead of opening a client is aprox 20x the overhead of sending a query....so ideally you'd use a pool. You don't have to, particularly if your lambda is going cold then pooling really likely isn't buying you much. There might be a way to simulate this in test by creating a pool, reaching into the internals, and severing the streams manually behind the scenes. Anyone care to make a first pass crack at that in a gist or something? If it reproduces the issue I can spin it into a test, make it pass, and ship a fix!

jcollum · 2020-02-25T19:47:51Z

@jcollum : My guess is that the issue isn't actually "fixed", but rather "avoided". As in: whenever your code needs to perform a query, it is:

Well yes it's not a fix it's a workaround. Thanks for catching that potential leak.

briandamaged · 2020-02-25T19:58:52Z

@brianc : Thinking about this a bit more... I suspect that both "pg" and "knex" are encountering the same conceptual issue in their Pool implementations. Specifically: they are not re-checking the validity of the Client instances before yielding them from the pool. Here's what I'm seeing on the "knex" side, at least:

https://github.com/knex/knex/blob/8c07192ade0dde137f52b891d97944733f50713a/lib/client.js#L333-L335

Instead, this function should probably delegate to a pg-specific function that confirms that the Client instance is still valid.

I'll check the "pg" codebase in a few mins to see if I can spot a similar behavior there.

briandamaged · 2020-02-25T20:12:40Z

@brianc : So, it looks like the "pg" pool has some logic for removing idle clients; however, I suspect that it does not get invoked unless the underlying Client instance actually emits an "error". For ex:

node-postgres/packages/pg-pool/index.js

Line 216 in 1c8b6b9

client.on('error', idleListener)

However, it looks like the Client doesn't necessarily emit the "error" event when it detects that the connection has been closed:

node-postgres/packages/pg/lib/client.js

Lines 252 to 279 in 1c8b6b9

    
           con.once('end', () => { 
        
             const error = this._ending 
        
               ? new Error('Connection terminated') 
        
               : new Error('Connection terminated unexpectedly') 
        
             clearTimeout(connectionTimeoutHandle) 
        
             this._errorAllQueries(error) 
        
             if (!this._ending) { 
        
               // if the connection is ended without us calling .end() 
        
               // on this client then we have an unexpected disconnection 
        
               // treat this as an error unless we've already emitted an error 
        
               // during connection. 
        
               if (this._connecting && !this._connectionError) { 
        
                 if (callback) { 
        
                   callback(error) 
        
                 } else { 
        
                   connectedErrorHandler(error) 
        
                 } 
        
               } else if (!this._connectionError) { 
        
                 connectedErrorHandler(error) 
        
               } 
        
             } 
        
             process.nextTick(() => { 
        
               this.emit('end') 
        
             }) 
        
           })

So, if this analysis is correct, then it means there is a way for the Pool to contain Client instances that have already been disconnected. (Or rather: their underlying connection instance has already been ended)

briandamaged · 2020-02-25T22:55:26Z

@brianc : Here's a snippet that seems to confirm my suspicion:

const {Pool} = require('pg');

function createPool() {
  return new Pool({
      user: process.env.PG_USER,
      host: process.env.PG_HOST,
      database: process.env.PG_DATABASE,
      password: process.env.PG_PASSWORD,
      port: process.env.PG_PORT,
      max: process.env.MAX_CLIENTS
  });
}

async function run() {
  const pool = createPool();

  const c1 = await pool.connect();
  // Place the Client back into the pool...
  await c1.release();

  // Now intentionally end the lower-level Connection while Client is already in the pool
  c1.connection.end();

  // Attempt to obtain the connection again...
  const c2 = await pool.connect();
  try {
    // Explodes
    const res = await c2.query("select * from accounts", null);
    console.log("Yay");
  } catch(err) {
    console.log("Initial error")
    console.log("----------------")
    console.log(err);
  } finally {
    c2.release();
  }


  // Attempt to obtain the connection again...
  const c3 = await pool.connect();
  try {
    // Surprisingly, it succeeds this time.  This is because the pool was already
    // "fixed" thanks to the 'error' event that it overheard.
    const res = await c3.query("select * from accounts", null);
    console.log("Yay");
  } catch(err) {
    console.log("Second error")
    console.log("----------------")
    console.log(err);
  } finally {
    c3.release();
  }
}

run();

Sample output:

Initial error
----------------
Error: write after end
    at writeAfterEnd (_stream_writable.js:236:12)
    at Socket.Writable.write (_stream_writable.js:287:5)
    at Socket.write (net.js:711:40)
    at Connection.query (/Users/brianlauber/Documents/src/knex/node_modules/pg/lib/connection.js:234:15)
    at Query.submit (/Users/brianlauber/Documents/src/knex/node_modules/pg/lib/query.js:164:16)
    at Client._pulseQueryQueue (/Users/brianlauber/Documents/src/knex/node_modules/pg/lib/client.js:446:43)
    at Client.query (/Users/brianlauber/Documents/src/knex/node_modules/pg/lib/client.js:541:8)
    at run (/Users/brianlauber/Documents/src/knex/oops.js:43:26)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:189:7)
(node:57118) UnhandledPromiseRejectionWarning: Error: write after end
    at writeAfterEnd (_stream_writable.js:236:12)
    at Socket.Writable.write (_stream_writable.js:287:5)
    at Socket.write (net.js:711:40)
    at Connection.query (/Users/brianlauber/Documents/src/knex/node_modules/pg/lib/connection.js:234:15)
    at Query.submit (/Users/brianlauber/Documents/src/knex/node_modules/pg/lib/query.js:164:16)
    at Client._pulseQueryQueue (/Users/brianlauber/Documents/src/knex/node_modules/pg/lib/client.js:446:43)
    at Client.query (/Users/brianlauber/Documents/src/knex/node_modules/pg/lib/client.js:541:8)
    at run (/Users/brianlauber/Documents/src/knex/oops.js:43:26)
    at <anonymous>
    at process._tickCallback (internal/process/next_tick.js:189:7)
(node:57118) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:57118) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
Yay

brianc · 2020-02-26T00:49:25Z

that's perfect I'll get it looked at soon!

briandamaged · 2020-02-26T00:53:08Z

FYI: It looks like you can trigger the issue one layer deeper as well. Ex:

// Added the `.stream` portion here
c1.connection.stream.end();

jcollum · 2020-02-26T04:26:30Z

I'm impressed with how quickly this is being looked at

jamesdixon · 2020-02-26T04:30:25Z

@jcollum agreed.

Shout out to @briandamaged for digging into this. I'm so thankful for people smarter than I am 🤣

brianc · 2020-02-26T04:33:55Z

I'm impressed with how quickly this is being looked at

I've been trying to work on pg at least 5 days a week this year. I'll be looking at this first thing tomorrow morning. I'm not 100% sure that making the pool evict closed clients will fix the actual specific issue in lamdas but...it's good behavior to have either way.

jcollum · 2020-02-26T04:36:20Z

Thanks for being so on top of this.

…

On Tue, Feb 25, 2020 at 8:33 PM Brian C ***@***.***> wrote: I'm impressed with how quickly this is being looked at I've been *trying* to work on pg at least 5 days a week this year. I'll be looking at this first thing tomorrow morning. I'm not 100% sure that making the pool evict closed clients will fix the actual specific issue in lamdas but...it's good behavior to have either way. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#2112?email_source=notifications&email_token=AAFMAAVJEGOUR5O3PGT3XZTREXWLJA5CNFSM4K2TYUC2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEM6YE3I#issuecomment-591233645>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAFMAAQNUNKE5UJFFXNV7CDREXWLJANCNFSM4K2TYUCQ> .

jamesdixon · 2020-02-26T04:37:09Z

@brianc thanks for building such an important library. Your efforts are truly appreciated. Cheers!

brianc · 2020-02-26T04:52:01Z

❤️ thanks. Looking forward to making lots of folks node apps faster in the next few months when I do some work on perf!

…

On Tue, Feb 25, 2020 at 10:37 PM James Dixon ***@***.***> wrote: @brianc <https://github.com/brianc> thanks for building such an important library. Your efforts are truly appreciated. Cheers! — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#2112?email_source=notifications&email_token=AAAMHIMS2WOYO4X37VPYTO3REXWXLA5CNFSM4K2TYUC2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEM6YKYY#issuecomment-591234403>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAAMHIOKFBCV2UFEUSWWVRLREXWXLANCNFSM4K2TYUCQ> .

briandamaged · 2020-02-26T05:28:37Z

@brianc : Here's a variation on the earlier script that provides some more evidence around the issue:

const {Pool} = require('pg');

function delay(t) {
  return new Promise(function(resolve) {
    setTimeout(resolve, t);
  });
}


function createPool() {
  return new Pool({
      user: process.env.PG_USER,
      host: process.env.PG_HOST,
      database: process.env.PG_DATABASE,
      password: process.env.PG_PASSWORD,
      port: process.env.PG_PORT,
      max: process.env.MAX_CLIENTS
  });
}


async function run() {
  const pool = createPool();

  async function doStuff(label) {
    const client = await pool.connect();
    try {
      console.log(`${label}: START`)
      const res = await client.query("select 1", null);
      console.log(`${label}: SUCCEEDED`);
    } catch(err) {
      console.log(`${label}: ERROR  ->  ${err}`)
    } finally {
      client.release();
    }
  }

  const c1 = await pool.connect();
  await c1.query("select 1", null);
  await c1.release();


  // !!!
  // FYI: If you comment out this event handler, then the script
  //      will fail 100% of the time.  (Unhandled 'error' event)
  pool.on('error', function() {
    console.log("Pool handled error cleanly!")
  })


  // c1 is in the pool.  We're intentionally ending the
  // lower-level connection.
  c1.connection.stream.end();


  // !!!
  // FYI: With a 1 second delay, everything appears to work fine.
  //      However, if you reduce this delay too much, then you'll
  //      observe the errors that were reported.
  console.log("Delaying...");
  await delay(1000);
  console.log("Finished delay");


  await doStuff("ONE");
  await doStuff("TWO");
}

run();

There are 2 key things you can play around with here:

Removing the pool.on('error', ...) expression (which will cause the script to explode)
Adjusting / removing the delay. (Errors will begin to occur when the delay is set too low)

Even with a delay of 1 second, the problem seems to disappear. So, now I'm unsure if this script is accurately recreating the scenario described by this issue.

sehrope · 2020-02-26T14:07:17Z

Wrapping the pool clients would solve a lot of these problems. Right now pool clients are the raw client with additional functions to release the connection back to the pool. That means that code that directly manipulates the connection, ex: c1.connection.stream.end(), can break things internally. Ditto for continuing to use a client after you've told the pool that you're done with it. With a more restricted interface, say just a .query(...) and .release(...), it's harder to shoot yourself in the foot (though with JS it's never completely impossible).

You also need to ensure to call .release(err: Error) to evict broken clients:

  async function doStuff(label) {
    const client = await pool.connect();
    try {
      console.log(`${label}: START`)
      const res = await client.query("select 1", null);
      console.log(`${label}: SUCCEEDED`);
    } catch(err) {
      console.log(`${label}: ERROR  ->  ${err}`)
    } finally {
      client.release();
    }
  }

Calling .release() (without a truthy error) instructs the pool that you're done with client and it's okay to return to the pool for reuse. With proper eviction you'll still get parts of your coding potentially receiving broken clients, but at least they'll eventually evict themselves out of the pool.

briandamaged · 2020-02-26T15:43:32Z

@sehrope : Right. The purpose of the code snippet was to recreate a scenario where the low-level socket was being closed/ended outside of the Pool's managed lifecycle. So, you're correct: normally, you would not want to do that.

As for the .release(..) part: Javascript does not support try / catch / else syntax. So, expecting the developer to call .release(..) differently depending upon whether or not an exception was raised seems like a usability issue. Ex: it's not possible to do this:

  async function doStuff(label) {
    const client = await pool.connect();
    try {
      console.log(`${label}: START`)
      const res = await client.query("select 1", null);
      console.log(`${label}: SUCCEEDED`);
    } catch(err) {
      console.log(`${label}: ERROR  ->  ${err}`)
      client.release(err);
    } else {
      client.release();
    }

    // Other logic that needs to run after doing stuff w/ `client`
    // ....
  }

(Okay, fine. It's technically possible, but it forces the developer to create/check didErrorOccur variables)

In either case: the snippet's error is actually triggered by the first call to client.query(..). So, although the detail you pointed out about client.release(..) appears to be true, I'm not sure if it really impacts this snippet in any way. (Side note: the README.md for pg-pool doesn't seem to mention this detail about the call to .release(..), but the code itself looks like it aligns w/ what you're saying. https://github.com/brianc/node-postgres/tree/master/packages/pg-pool )

The main concept is: it's possible for Clients that are already in the Pool to lose their Connections. It's the Pool's responsibility to double-check the validity of a Client before handing it off to a caller. Based upon the issue reported here, it looks like there might be a corner case where this is not occurring.

(Or, alternatively: perhaps several people missed this detail about .release(..), and everybody has been inserting disconnected Clients back into the pool???)

sehrope · 2020-02-26T16:44:56Z

As for the .release(..) part: Javascript does not support try / catch / else syntax. So, expecting the developer to call .release(..) differently depending upon whether or not an exception was raised seems like a usability issue.

Yes it's a bit tricky and hopefully when the pool interface gets improved it'll be easier to use. That "pass an error to evict" logic for the pool has been for all the years I've used this driver. All my usage of the driver has a local wrapper that handles that logic so it's in one place in each app, but you're right that everybody kind of has to handle it themselves.

(Okay, fine. It's technically possible, but it forces the developer to create/check didErrorOccur variables)

Yes that's the logic to which I'm referring. Though rather than checking if an error occurred the usual approach is to have different release(...) calls based on whether the task resolved or rejected:

const pg = require('pg');
const pool = new Pool({ /* ... */ });

const withClient = async (task) => {
    const client = await pool.connect();
    try {
        const result = await task(client);
        // If we get here then the task was successful so assume the client is fine
        client.release();
        return result;
    } catch (err) {
        // If we get here then the task failed so assume the client is broken
        client.release(err);
        throw err;
    }
};

That logic is a bit over zealous as it'll evict potentially salvageable client errors (ex: after a successful ROLLBACK we'd expect the client to be usable again) but it's a decent default. If you expect to have lots of constraint violations that could thrash your pool you can make it a bit more intelligent to deal with that situation or have a transaction wrapper handle that.

In either case: the snippet's error is actually triggered by the first call to client.query(..). So, although the detail you pointed out about client.release(..) appears to be true, I'm not sure if it really impacts this snippet in any way.

It's important to evict broken connections or your pool will never heal and you'll keep getting the same broken connections. Evicting them will force your pool to (eventually) create new working connections.

(Side note: the README.md for pg-pool doesn't seem to mention this detail about the call to .release(..), but the code itself looks like it aligns w/ what you're saying. https://github.com/brianc/node-postgres/tree/master/packages/pg-pool )

Yes we should fix that. The docs site (https://node-postgres.com/api/pool#releaseCallback) has the right usage but the README must be really out of date.

The main concept is: it's possible for Clients that are already in the Pool to lose their Connections. It's the Pool's responsibility to double-check the validity of a Client before handing it off to a caller. Based upon the issue reported here, it looks like there might be a corner case where this is not occurring.

The end usage always needs to deal with broken connections as even if the pool tests the connection, it's a race condition between the pool testing the connection and the end usage actually doing some work with the connection:

Client asks to check out connection
Pool tests connection and verifies it's okay
Connection is killed
Client receives connection from pool

(Or, alternatively: perhaps several people missed this detail about .release(..), and everybody has been inserting disconnected Clients back into the pool???)

That's possible. For something like Lambda that can freeze your app and effectively kill any open sockets in the background, I'd suggest not using a pool at all if it's an option. Or you can add your own wrapper that repeatedly tries to fetch and test the connection prior to giving it to the rest of your code. Also, setting the idleTimeoutMillis on the pool to something lower than the Lambda freeze-after-inactive-seconds should limit the problem a bit as you'll be less likely to have any clients int he pool when the Lambda gets frozen.

bdbrownell · 2021-06-16T20:21:56Z

I believe this issue is also happening to me on GCP - google cloud functions - I think this is similar to AWS Lamba functions. Just wanted to comment to note that it was happening in other serverless/cloud environments incase others are running into this issue from a google cloud project.

This snippet is what I will be trying, I missed this in the documentation but I have a feeling it will alleviate the problem.

pool.on('error', () => {
console.log('something bad happend')
// re-initialize pool here
})
or some other kind of error handling.

michaelseibt · 2021-06-17T05:39:56Z

Cross-posting from TypeORM: typeorm/typeorm#5112 (comment)

Call this method from inside the event handler function.

const reconnectToDatabase = async () => {
  const connection = getConnection();
  const driver = connection.driver as any;
  for (const client of driver.master._clients) {
    try {
      await client.query('SELECT 1');
    } catch (error) {
      console.info('Reconnecting ...');
      await getConnection().driver.disconnect();
      await getConnection().driver.connect();
      break;
    }
  }
}

Please mind removing type annotations for plain node usage. This is targeting a TypeORM setup, but the concept should be working for node-pg as well. If someone could port this, highly appreciated :-)

We're flying with this since a while now, and it remediated the issue (tho, obviously, not fix it fully). The database connection is established outside the main handler function, allowing it to be re-used. Just this little snippet here checks if everything is still up and running. Might add a little TTL on that check, so it only fires every few minutes.

rols2015 · 2021-09-20T13:10:32Z

please check: #2439 (comment)

rick-seamless · 2022-01-20T15:45:51Z

My issue was that the Aurora reader I was connecting to was rebooting from falling behind the writer too much. Figured I'd post in case someone else was running into this.

balbatross · 2022-03-05T03:47:02Z

@brianc thank you for all your hard work, this to me is another case of big tech offloading it's problems to the open source community.

Running in to the same issue on EKS and seeing how many people are being affected makes me feel a little less crazy.

Gonna make my first sponsorship today to help support you, if I can track down a worthwhile solution I'll open a PR.

Arohanui

balbatross · 2022-03-05T05:10:27Z

For anyone else still struggling with this I think I may have found a resolution.

It seems to be largely an infrastructure provider issue as others have mentioned, AWS has a common practice of setting a 350 second timeout on idle TCP connections, Azure and GCP have different timeouts but similar issues.

From a cursory search ELB and NAT Gateway were the culprits for me.

The keepAlive flag implemented in node-pg follows the right setup from what I can see but it could be that the OS level and node aren't playing nice as the default timer for linux seems to be 7200.

I ended up resolving by setting up pgbouncer and adding a keepAlive timer of 300 to the config

pgbouncer.ini

tcp_keepalive = 1
tcp_keepidle = 300

Thank you all for helping me through the day

sharadpattanshetti · 2022-03-29T16:20:20Z

We added below config as well as the pool config but still get the same error:

'net.ipv4.tcp_keepalive_time 60
net.ipv4.tcp_keepalive_intvl 10
net.ipv4.tcp_keepalive_probes 6'

 database:
    client: pg
    connection:         
      host: 
      user: 
      password: 
    pool: 
      min: 0
      max: 30
      acquireTimeoutMillis: 30000
      idleTimeoutMillis: 30000

Dakuan · 2022-06-03T08:22:37Z

still a thing

ameer2468 · 2023-01-12T21:48:27Z

Any solutions? my RESTFUL API randomly disconnects because of this...otherwise ill just setup a cronjob to call the /health endpoint every minute or something.

boromisp · 2023-01-13T10:11:44Z

Any solutions? my RESTFUL API randomly disconnects because of this...otherwise ill just setup a cronjob to call the /health endpoint every minute or something.

pg.Pool is not a good fit for AWS Lambda. If you have your database and networking set up in a way, that TCP connections can survive without keepalives for long enough, it might work, but you are probably better off creating fresh connections in every lambda execution.
You are also losing out on the horizontal scaling this way.

I think the typical setup for serverless API-s connecting to traditional RDBMS is a separate connection pool (eg. pgbouncer).

bdbrownell · 2023-01-13T19:20:47Z

Any solutions? my RESTFUL API randomly disconnects because of this...otherwise ill just setup a cronjob to call the /health endpoint every minute or something.

pg.Pool is not a good fit for AWS Lambda. If you have your database and networking set up in a way, that TCP connections can survive without keepalives for long enough, it might work, but you are probably better off creating fresh connections in every lambda execution. You are also losing out on the horizontal scaling this way.

I think the typical setup for serverless API-s connecting to traditional RDBMS is a separate connection pool (eg. pgbouncer).

I have had similar experiences with this on Google Cloud using cloud functions, posted previously in here. This is actually the same conclusion that my team has come to as well. The way Cloud Functions are handled, at least on first generation functions on GCP, is that a single request is handled by a single instance. There will never be the ability for a connection pool to exist in this situation regardless of the timeout - Its one of the shortcomings of the FaaS architecture.

ameer2468 · 2023-01-13T22:20:32Z

Any solutions? my RESTFUL API randomly disconnects because of this...otherwise ill just setup a cronjob to call the /health endpoint every minute or something.

pg.Pool is not a good fit for AWS Lambda. If you have your database and networking set up in a way, that TCP connections can survive without keepalives for long enough, it might work, but you are probably better off creating fresh connections in every lambda execution. You are also losing out on the horizontal scaling this way.

I think the typical setup for serverless API-s connecting to traditional RDBMS is a separate connection pool (eg. pgbouncer).

Hey thanks for the reply, im not on AWS/AWS Lambda. My restful API is based on Fastify, here's a link just incase: https://www.fastify.io/

also im using client, not pool.

Edit: Will be doing this instead: https://github.com/fastify/fastify-postgres

mepc36 · 2023-04-27T17:19:51Z

For anyone else still struggling with this I think I may have found a resolution.

It seems to be largely an infrastructure provider issue as others have mentioned, AWS has a common practice of setting a 350 second timeout on idle TCP connections, Azure and GCP have different timeouts but similar issues.

From a cursory search ELB and NAT Gateway were the culprits for me.

The keepAlive flag implemented in node-pg follows the right setup from what I can see but it could be that the OS level and node aren't playing nice as the default timer for linux seems to be 7200.

I ended up resolving by setting up pgbouncer and adding a keepAlive timer of 300 to the config

pgbouncer.ini
tcp_keepalive = 1
tcp_keepidle = 300
Thank you all for helping me through the day

@balbatross Do you mind explaining in more detail how you set up pgbouncer, and why that fixed your problem? links to docs is fine. Googled but found nothing. Thanks!

bdbrownell · 2023-04-27T17:50:52Z

@mepc36 PGBouncer solves this by moving the pooling to PGbouncer instead of pg-node. It sits between the node application and the database and uses its own internal pooling to keep track of connections, which is transparent to your application. I don't think the configuration you quoted really has anything to do with why this solves the issue, PGbouncer is the solution to the problem. You would setup PGbouncer on a VM - compute engine on GCP - and your application would connect to that instead of postgres directly.

p.s. Nice to see a fellow developer from Philadelphia 😀

moltar · 2023-08-06T20:03:59Z

Quick question to everyone who is still experiencing this.

Do you use Aurora Serverless v1 by any chance?

The reason I'm asking is because I think this happens when the cluster experiences a scaling event. It drops all connections.

This doesn't happen on v2 by the way.

moltar · 2023-08-07T23:38:37Z

I think I got to the bottom of this with at least one particular case.

This is only applicable to Aurora Serverless v1, and maybe v2. I will test v2 later and report back.

Here's what happens:

A scaling event is triggered.

After ~ 2m the scaling event is triggered (cluster status = scaling-capacity), RDS runs this query:

-- 1: long string, looks like base64 encoded, but not
-- 2: The value of `SecondsBeforeTimeout` from `ScalingConfigurationInfo`
-- 3: Not sure what this is, seems to be always `2`

select * from aurora_serverless_prepare_seamless_scaling('long string', 60000, 2)

This query is blocking until all user queries are finished running.

As soon as the last query ends, it will drop all of the active connections within milliseconds.

I have tested this by running select pg_sleep(60). As soon as the query would return, TablePlus would throw:

Query 1 ERROR: SSL connection has been closed unexpectedly

So what actually happens in the Lambda environment?

It's a race condition!

Cluster entered a scaling-capacity state.
But the aurora_serverless_prepare_seamless_scaling query has not started yet. Remember - it runs in ~ 2 minutes.
Lambda gets invoked and an instance of Pool is created
Maybe some database activity happens for a while (> 2m), let's say it's a semi-long job Lambda, not a web request.
The function aurora_serverless_prepare_seamless_scaling is now patiently waiting for its moment to pounce 🐈
All of the transactions and queries are finished for whatever reason (e.g. maybe they run in a loop: fetch data -> save data), and there is a few millisecond window of opportunity.
But the Pool object is still alive and well, and Lambda is still running, just not querying the database.
aurora_serverless_prepare_seamless_scaling triggers scaling
All connections drop. 💥

Error: Connection terminated unexpectedly
    at Connection.<anonymous> (/node_modules/.pnpm/[email protected]/node_modules/pg/lib/client.js:132:73)
    at Object.onceWrapper (node:events:627:28)
    at Connection.emit (node:events:513:28)
    at Socket.<anonymous> (/node_modules/.pnpm/[email protected]/node_modules/pg/lib/connection.js:63:12)
    at Socket.emit (node:events:513:28)
    at TCP.<anonymous> (node:net:301:12)

I have corroborated this with a massive amount of Log Insight queries and a CloudWatch dashboard.

Error Count = number of errors in the log at the given time. The value is hard to see, but it's at 20.
Scaling Events = Cluster uptime (seconds). This is the recommended metric to measure scaling 🤷🏼

bdbrownell · 2023-08-09T03:02:58Z

@moltar I was experiencing this issue over in Google cloud functions - the Google equivalent of AWS lambda. I presume GCP has some timeout in the same way AWS is dropping the connections. It turns out there is no way for GCP cloud functions to handle concurrent requests, so it doesn’t actually make sense to try to start a pool as the function only handles 1 query per invocation. Version 2 has changed that, and allows concurrent requests. In a function as a service environment, IMO, the best way to handle this would be an external pool service like pgbouncer.

This fixes various issues with node-postgres, specifically that the connection pooling implementation does not work well: brianc/node-postgres#1611 brianc/node-postgres#2112

This fixes various issues with node-postgres, specifically that the connection pooling implementation does not work well: brianc/node-postgres#1611 brianc/node-postgres#2112 --------- Co-authored-by: Ivan Chub <[email protected]>

Tadeuchi · 2023-10-18T09:14:35Z

Hi, the issue with connection still seems to be appearing in the latest version: 8.11.3

I have encountered something very similar to what was described above and managed to create some simple code snippet with the problem. Maybe this will help with resolving the issue.

I'm just using pg-pool to connect to local postgres and insert simple json object.
When the object contains bigint the JSON.stringify in the pg lib fails, but the connection is already established.
Here is what I'm doing.

Simple table to store the jsonb.

create table test (
    value jsonb
);

JS inserting the objects:

async function insertUsingPool() {
    // following query works fine
    await pgPool.query(
        `INSERT INTO test (value) VALUES ($1)`,
        [{'test' : 1}]
    ).catch((e) => { console.error(e) });

    // this fails with error: TypeError: Do not know how to serialize a BigInt
    await pgPool.query(
        `INSERT INTO test (value) VALUES ($1)`,
        [{'test' : 2n}]
    ).catch((e) => { console.error(e) });

    // this fails with error: Error: Connection terminated unexpectedly
    await pgPool.query(
        `INSERT INTO test (value) VALUES ($1)`,
        [{'test' : 3}]
    ).catch((e) => { console.error(e) });
}

async function insertUsingClientPool() {
    const pgPoolClient = await pgPool.connect();

    // following query works fine
    await pgPoolClient.query(
        `INSERT INTO test (value) VALUES ($1)`,
        [{'test' : 4}]
    ).catch((e) => { console.error(e) });

    // this throws exception and hangs... TypeError: Do not know how to serialize a BigInt
    await pgPoolClient.query(
        `INSERT INTO test (value) VALUES ($1)`,
        [{'test' : 5n}]
    ).catch((e) => { console.error(e) });

    // not even executed...
    await pgPoolClient.query(
        `INSERT INTO test (value) VALUES ($1)`,
        [{'test' : 6}]
    ).catch((e) => { console.error(e) });
}

After running the insertUsingClientPool function,
the test table contains only two objects "{""test"": 1}", "{""test"": 4}"
and the process just hangs.
pg_stat_activity shows there is an active connection ongoing with the insert query.

I don't know, maybe I'm doing something stupid and not using the pg lib correctly, but maybe the JSON.stringify should be called before establishing the connection?

cjnoname · 2024-03-05T18:12:32Z

Any updates?

devodii · 2024-05-27T17:18:34Z

Any solutions? my RESTFUL API randomly disconnects because of this...otherwise ill just setup a cronjob to call the /health endpoint every minute or something.

How can we do this buddy?

jamesdixon mentioned this issue Feb 25, 2020

Connection terminated unexpectedly (Postgresql / AWS Lambda) knex/knex#3636

Open

michaelseibt mentioned this issue Feb 26, 2020

Broken Postgres clients are released back into the pool, but should be removed typeorm/typeorm#5112

Closed

brianc mentioned this issue Feb 26, 2020

Attempt to handle a stream being silently killed in the background #2121

Closed

mepc36 mentioned this issue Apr 27, 2023

Connection terminated unexpectedly with serverless db timgit/pg-boss#381

Closed

sehrope mentioned this issue Jun 22, 2023

Question: AWS Lambda + RDS Proxy and Pool class - Best practices? #3016

Closed

robknight mentioned this issue Aug 10, 2023

Use postgres-pool instead of node-postgres proofcarryingdata/zupass#438

Merged

bajtos mentioned this issue Sep 26, 2023

Handle "Connection terminated unexpectedly" CheckerNetwork/spark-api#78

Closed

geekyayush mentioned this issue Nov 16, 2023

Getting Cannot read properties of undefined (reading 'replace') error kysely-org/kysely-postgres-js#9

Closed

codyebberson mentioned this issue Jan 28, 2024

RDS proxy medplum/medplum#3827

Merged

mattkrick mentioned this issue Apr 22, 2024

fix: support PG reconnects ParabolInc/parabol#9663

Merged

7.18.2: "Connection terminated unexpectedly" when using client.query with a pool when pool has been idle for 10 minutes (running in AWS Lambda) #2112

7.18.2: "Connection terminated unexpectedly" when using client.query with a pool when pool has been idle for 10 minutes (running in AWS Lambda) #2112

Comments

jcollum commented Feb 24, 2020

jamesdixon commented Feb 25, 2020

briandamaged commented Feb 25, 2020

592da commented Feb 25, 2020

brianc commented Feb 25, 2020

jamesdixon commented Feb 25, 2020

briandamaged commented Feb 25, 2020

jamesdixon commented Feb 25, 2020 via email

jcollum commented Feb 25, 2020

briandamaged commented Feb 25, 2020

brianc commented Feb 25, 2020

jcollum commented Feb 25, 2020

briandamaged commented Feb 25, 2020 • edited Loading

briandamaged commented Feb 25, 2020

briandamaged commented Feb 25, 2020

brianc commented Feb 26, 2020

briandamaged commented Feb 26, 2020

jcollum commented Feb 26, 2020

jamesdixon commented Feb 26, 2020

brianc commented Feb 26, 2020

jcollum commented Feb 26, 2020 via email

jamesdixon commented Feb 26, 2020

brianc commented Feb 26, 2020 via email • edited Loading

briandamaged commented Feb 26, 2020

sehrope commented Feb 26, 2020

briandamaged commented Feb 26, 2020

sehrope commented Feb 26, 2020

bdbrownell commented Jun 16, 2021

michaelseibt commented Jun 17, 2021 • edited Loading

rols2015 commented Sep 20, 2021

rick-seamless commented Jan 20, 2022

balbatross commented Mar 5, 2022

balbatross commented Mar 5, 2022

pgbouncer.ini

sharadpattanshetti commented Mar 29, 2022 • edited Loading

Dakuan commented Jun 3, 2022

ameer2468 commented Jan 12, 2023

boromisp commented Jan 13, 2023

bdbrownell commented Jan 13, 2023

ameer2468 commented Jan 13, 2023 • edited Loading

mepc36 commented Apr 27, 2023

pgbouncer.ini

bdbrownell commented Apr 27, 2023

moltar commented Aug 6, 2023

moltar commented Aug 7, 2023 • edited Loading

bdbrownell commented Aug 9, 2023

Tadeuchi commented Oct 18, 2023

cjnoname commented Mar 5, 2024

devodii commented May 27, 2024

briandamaged commented Feb 25, 2020 •

edited

Loading

brianc commented Feb 26, 2020 via email •

edited

Loading

michaelseibt commented Jun 17, 2021 •

edited

Loading

sharadpattanshetti commented Mar 29, 2022 •

edited

Loading

ameer2468 commented Jan 13, 2023 •

edited

Loading

moltar commented Aug 7, 2023 •

edited

Loading