Race condition at-or-near keep-alive expiration #3141
Comments
Looking closer at the logs, I don't think there's anything undici can do. The socket is being closed by the server after the 2nd request has been sent, as far as I can tell.
One option could be to always add the retry interceptor for GETs (and just retry them). For POSTs and all other non-idempotent requests it is a bit more tricky.
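The distinction above can be sketched as a small wrapper. This is illustrative only — the function names and retry policy here are assumptions, not undici's actual interceptor API (undici ships its own retry machinery):

```javascript
// Methods that are safe to replay per HTTP semantics (illustrative policy).
const IDEMPOTENT = new Set(['GET', 'HEAD', 'OPTIONS', 'PUT', 'DELETE']);

// Hypothetical helper: retry only idempotent requests, surface failures
// for everything else.
async function requestWithRetry(doRequest, method, maxRetries = 1) {
  let attempt = 0;
  for (;;) {
    try {
      return await doRequest();
    } catch (err) {
      // A non-idempotent request (e.g. POST) cannot safely be replayed:
      // the first attempt may already have reached the server.
      if (!IDEMPOTENT.has(method) || attempt >= maxRetries) throw err;
      attempt += 1;
    }
  }
}
```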
What is the goal of keepAliveTimeoutThreshold? With a Keep-Alive: timeout=5 returned by the server, and a default keepAliveTimeoutThreshold of 1000ms, is it unexpected that after 4900ms a request is sent on the existing connection?
I think we should always keep 2s (or more) of margin to account for slow networks. So if the server gives us 5s, we should actually consider it 3s.
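The suggestion above can be sketched as a tiny helper (the function name and defaults are hypothetical, not undici's actual implementation):

```javascript
// Hypothetical helper illustrating the margin suggested above: subtract a
// ~2s threshold from the server's keep-alive hint, with a floor so very
// short hints don't collapse to zero.
function effectiveKeepAlive(serverHintMs, thresholdMs = 2000, floorMs = 1000) {
  return Math.max(serverHintMs - thresholdMs, floorMs);
}
```

With these numbers, a server-sent Keep-Alive of 5000ms is treated as 3000ms locally.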
Regardless of the default, it appears not to be working as expected, and the cause looks to be in lib/util/timers: if I stub out the fast timers implementation, there are no errors.
I will continue looking into lib/util/timers.
The problem is timers always fire at least 1000ms after the provided timeout in the best case, and 2000ms after in the "best worst case" - i.e. just unlucky with the scheduling time, but no event loop contention.
Test case:
Result:
Which means for a server-sent 5000ms keep-alive and the 1000ms default keepAliveTimeoutThreshold, the target is 4000ms, and the actual timeout is always too late - 5000ms or more. Happy to propose a solution via PR, or @ronag jump in if you prefer. Edit: I think the best course of action is:
Taking all three changes together, that means given a 5000ms keep-alive, in the best case we close the connection after 3000ms; in the best-worst case (i.e. the timeout is scheduled just after a previous tick) it takes 3500ms; and that leaves 1500ms of breathing room in the worst-worst case if the event loop is blocked.
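The lateness arithmetic described above can be captured in a toy model. This is a simplified illustration consistent with the numbers reported in the thread, not undici's actual lib/util/timers code:

```javascript
// Toy model of a coarse 1s timer wheel: the first tick after registration
// only arms the timer, and the delay is then rounded up to whole ticks plus
// one, so every timer fires at least one full tick after its target.
const TICK = 1000;

function actualFireDelay(delayMs, phaseMs) {
  // phaseMs: elapsed time since the previous tick when the timer is registered
  const untilFirstTick = TICK - phaseMs;
  const armedTicks = Math.ceil(delayMs / TICK) + 1;
  return untilFirstTick + armedTicks * TICK;
}
```

For the 4000ms target discussed above, this model fires between just over 5000ms (registered just before a tick) and 6000ms (registered just after one), matching "the actual timeout is always too late - 5000ms or more".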
* fix: tweak keep-alive timeout implementation (closes #3141)
* fix test impacted by default keepAliveTimeoutThreshold change
* adjust timer registration strategy
* only use fast timers with delay > 1000ms
Bug Description
I've managed to (fairly) reliably replicate an error when a request is made right around the time a connection is approaching its keep-alive timeout. I have seen it happen in both:
using each of the following clients:
Reproducible By
Run server.js:
Then run client-undici.js: OR client-http.js:
A few runs might need to be tried, and the starting let timeout = ... might need to be adjusted a bit, but give it a few tries and eventually you'll see a crash like:
To speed up finding the sweet spot, run with NODE_DEBUG=net; if you see the socket being torn down and then the connection being re-established successfully, the timeout needs to be reduced.
Expected Behavior
Node does not crash when making a request near keep alive expiration.
Logs & Screenshots
Using a similar approach, I managed to catch a similar error in-the-wild over the internet, but it was much harder to reliably reproduce.
Full NODE_DEBUG=* logs for each of the clients:
Environment
macOS arm64
Node v20.12.1/v21.7.3
Additional context
Not sure if this is purely a core Node problem, or just similar bugs in both node:http and undici, but I figured I'd start with the report here since I stumbled upon it in the midst of trying to replicate #3133.