Race condition at-or-near keep-alive expiration #3141
Comments
Looking closer at the logs, I don't think there's anything undici can do. The socket is being closed by the server after the 2nd request has been sent, as far as I can tell.
One option could be to always add the retry interceptor for GETs (and just retry them). For POSTs and all other non-idempotent requests it is a bit more tricky.
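The distinction above can be sketched as a small wrapper. This is illustrative only — the function names and retry policy here are assumptions, not undici's actual interceptor API (undici ships its own retry machinery):

```javascript
// Methods that are safe to replay per HTTP semantics (illustrative policy).
const IDEMPOTENT = new Set(['GET', 'HEAD', 'OPTIONS', 'PUT', 'DELETE']);

// Hypothetical helper: retry only idempotent requests, surface failures
// for everything else.
async function requestWithRetry(doRequest, method, maxRetries = 1) {
  let attempt = 0;
  for (;;) {
    try {
      return await doRequest();
    } catch (err) {
      // A non-idempotent request (e.g. POST) cannot safely be replayed:
      // the first attempt may already have reached the server.
      if (!IDEMPOTENT.has(method) || attempt >= maxRetries) throw err;
      attempt += 1;
    }
  }
}
```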
What is the goal of keepAliveTimeoutThreshold? With a Keep-Alive: timeout=5 returned by the server, and a default keepAliveTimeoutThreshold of 1000ms, is it unexpected that after 4900ms a request is sent on the existing connection?
I think we should always keep 2s (or more) of margin to account for slow networks. So if the server gives us 5s, we should actually consider it 3s.
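The suggestion above can be sketched as a tiny helper (the function name and defaults are hypothetical, not undici's actual implementation):

```javascript
// Hypothetical helper illustrating the margin suggested above: subtract a
// ~2s threshold from the server's keep-alive hint, with a floor so very
// short hints don't collapse to zero.
function effectiveKeepAlive(serverHintMs, thresholdMs = 2000, floorMs = 1000) {
  return Math.max(serverHintMs - thresholdMs, floorMs);
}
```

With these numbers, a server-sent Keep-Alive of 5000ms is treated as 3000ms locally.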
Regardless of the default, it appears not to be working as expected, and the cause looks to be in lib/util/timers: if I stub out the fast timers implementation, there are no errors.
I will continue looking into lib/util/timers.
The problem is timers always fire at least 1000ms after the provided timeout in the best case, and 2000ms after in the "best worst case" - i.e. just unlucky with the scheduling time, but no event loop contention.
Test case:
Result:
Which means for a server-sent 5000ms keep-alive and the 1000ms default keepAliveTimeoutThreshold, the target is 4000ms, and the actual timeout is always too late - 5000ms or more. Happy to propose a solution via PR, or @ronag jump in if you prefer. Edit: I think the best course of action is:
Taking all three changes together, that means given a 5000ms keep-alive, in the best case we close the connection after 3000ms; in the best-worst case (i.e. the timeout is scheduled just after a previous tick) it takes 3500ms; and that leaves 1500ms of breathing room in the worst-worst case if the event loop is blocked.
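The lateness arithmetic described above can be captured in a toy model. This is a simplified illustration consistent with the numbers reported in the thread, not undici's actual lib/util/timers code:

```javascript
// Toy model of a coarse 1s timer wheel: the first tick after registration
// only arms the timer, and the delay is then rounded up to whole ticks plus
// one, so every timer fires at least one full tick after its target.
const TICK = 1000;

function actualFireDelay(delayMs, phaseMs) {
  // phaseMs: elapsed time since the previous tick when the timer is registered
  const untilFirstTick = TICK - phaseMs;
  const armedTicks = Math.ceil(delayMs / TICK) + 1;
  return untilFirstTick + armedTicks * TICK;
}
```

For the 4000ms target discussed above, this model fires between just over 5000ms (registered just before a tick) and 6000ms (registered just after one), matching "the actual timeout is always too late - 5000ms or more".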
* fix: tweak keep-alive timeout implementation (closes #3141)
* fix test impacted by default keepAliveTimeoutThreshold change
* adjust timer registration strategy
* only use fast timers with delay > 1000ms
Bug Description
I've managed to (fairly) reliably replicate an error when a request is made right around the time a connection is approaching its keep-alive timeout. I have seen it happen in both:
using each of the following clients:
Reproducible By
Run server.js:
Then run client-undici.js: OR client-http.js:
A few runs might need to be tried, and the starting let timeout = ... might need to be adjusted a bit, but give it a few tries and eventually you'll see a crash like:
To speed up finding the sweet spot, run with NODE_DEBUG=net; if you see the socket being torn down and then the connection being re-established successfully, the timeout needs to be reduced.
Expected Behavior
Node does not crash when making a request near keep alive expiration.
Logs & Screenshots
Using a similar approach, I managed to catch a similar error in-the-wild over the internet, but it was much harder to reliably reproduce.
Full NODE_DEBUG=* logs for each of the clients:
Environment
macOS arm64
Node v20.12.1/v21.7.3
Additional context
Not sure if this is purely a core Node problem, or just similar bugs in both node:http and undici, but I figured I'd start with the report here since I stumbled upon it in the midst of trying to replicate #3133.