fix(core): do not allow responses to choke request and ping processing #633

ztzg · 2020-12-10T17:55:02Z

Why is this needed?

Without this patch, a single select event is processed per iteration of the ConnectionHandler event loop.

In a scenario where the client issues a large number of async requests with a large amplification factor, e.g. get_children_async on a large node, it is possible for the 'select' operation to almost always return a "response ready" socket, as the server is often able to process, serialize and ship a new response while Kazoo processes the previous one.

That response socket often (always?) ends up at the beginning of the list returned by select.

As only select_result[0] is processed in the loop, this can cause the client to ignore the "request ready" FD for a long time, during which no requests or pings are sent.

In effect, asynchronously "browsing" a large tree of nodes can stretch that duration to the point where it exceeds the timeout—causing the client to lose its session.

This patch considers both descriptors after select, and also arranges for pings to be sent in case it encounters an "unending" stream of responses to requests which were sent earlier.
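The starvation described above can be reproduced outside Kazoo with a pair of local sockets. The following is only a minimal sketch, not the patched code: `run_loop`, `resp_r`/`resp_w` (standing in for the ZooKeeper connection) and `req_r`/`req_w` (standing in for the client's "request ready" wake-up socket) are hypothetical names, and it assumes, as CPython does, that `select.select` returns ready objects in the order they were passed in.

```python
import select
import socket


def run_loop(check_both, iterations=50):
    """Return how many queued requests get serviced while responses keep arriving."""
    resp_r, resp_w = socket.socketpair()  # stand-in for the server connection
    req_r, req_w = socket.socketpair()    # stand-in for the "request ready" FD
    try:
        req_w.send(b"\0" * 3)  # three requests are waiting to be sent
        serviced = 0
        for _ in range(iterations):
            # The "server" always has another response ready.
            resp_w.send(b"r")
            ready = select.select([resp_r, req_r], [], [], 1.0)[0]
            if check_both:
                # Patched behaviour: look at both descriptors after select.
                if resp_r in ready:
                    resp_r.recv(1)
                if req_r in ready:
                    req_r.recv(1)
                    serviced += 1
            else:
                # Old behaviour: only the first ready descriptor is handled.
                first = ready[0]
                first.recv(1)
                if first is req_r:
                    serviced += 1
        return serviced
    finally:
        for sock in (resp_r, resp_w, req_r, req_w):
            sock.close()


if __name__ == "__main__":
    print("first-only:", run_loop(check_both=False))
    print("check-both:", run_loop(check_both=True))
```

With the first-only strategy the response socket is ready on every iteration and sits first in the result, so the queued requests are never reached; checking both descriptors drains them within the first few iterations.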

Does this PR introduce any breaking change?

Not purposefully :)

ztzg · 2020-12-10T17:55:18Z

Cc: @ceache.

ceache · 2020-12-10T21:00:36Z

LGTM (as discussed offline).

StephenSorriaux

Thanks for this PR, great catch.

Love the "Not purposefully :)" part also :)

jeffwidman

Agree with the others, looks good to me. I particularly appreciate the detailed explanation of the problem, as it made it very clear why the change was necessary.

jeffwidman · 2020-12-10T23:29:56Z

kazoo/protocol/connection.py

+                            response = self._read_socket(read_timeout)
+                            if response == CLOSE_RESPONSE:
+                                break
+                        if self._read_sock in s:


You may want to add a comment here saying something like:

Check if any requests need sending before proceeding to process more responses. Otherwise the responses may choke out the requests.

That way if someone refactors this code, they'll be aware of the issue... providing that "why" context is super important IMO for tricky code like this. Ideally we'd add a test, but this kind of perf issue is super hard to test reliably so I completely understand why there isn't one.

ztzg · 2020-12-13

Thank you for the review! Comment added.

jeffwidman · 2020-12-13

FWIW, you may already know this, but most modern fonts are designed so you only need one space after the period, not two. Not something that should hold up this PR at all, but just an FYI.

ztzg · 2020-12-14

Hi @jeffwidman,

Right. Old habits die hard… In any case, I should have paid closer attention to surrounding comments; sorry about that.

jeffwidman · 2020-12-13T20:29:02Z

Merged! Thank you again for the great PR and handling the niceties of squashing etc to make it easy for me to merge.

ztzg · 2020-12-14T07:22:51Z

Thank you!

liang-kang · 2023-07-04T10:46:11Z

Suppose the ZooKeeper server that a Kazoo client is connected to crashes, and the client has a looping task that keeps sending requests.

The client then enters the code below for every request and stays unaware for a long time that the server is gone.

if self._read_sock in s:
    self._send_request(read_timeout, connect_timeout)
    # Requests act as implicit pings.
    last_send = time.time()
    continue

With a client timeout of 40s and a request sent every 5s, it takes the client about 2 minutes to drop the connection on my platform.

Does that work as designed?
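The concern can be illustrated with a small simulated-time model. This is a sketch of one reading of the snippet above, not Kazoo's actual loop: `seconds_until_drop`, the ping deadline of half the timeout, and the `ping_outstanding` flag are all assumptions made for the illustration, and the server is assumed never to answer.

```python
def seconds_until_drop(timeout, request_interval, horizon=300.0):
    """Simulated seconds until an unanswered ping drops the connection.

    Returns None if that never happens within `horizon` seconds.
    Pass request_interval=None to model a client that sends no requests.
    """
    now = 0.0
    last_send = 0.0
    ping_outstanding = False
    next_request = request_interval
    while now < horizon:
        # Assumption: a ping is due half a timeout after the last send.
        deadline = last_send + timeout / 2.0
        if next_request is not None and next_request <= deadline:
            # select() wakes up for the queued request before the deadline.
            now = next_request
            last_send = now  # "Requests act as implicit pings."
            next_request += request_interval
            continue
        now = deadline
        if ping_outstanding:
            # The previous ping was never answered: drop the connection.
            return now
        ping_outstanding = True  # send a ping; the dead server stays silent
        last_send = now
    return None


if __name__ == "__main__":
    print(seconds_until_drop(40, None))  # idle client
    print(seconds_until_drop(40, 5))     # request every 5 seconds
```

In this model an idle client notices the dead server after one full timeout, while a client that sends a request every 5 seconds keeps pushing the deadline forward and never reaches the ping check. The drop would then have to come from elsewhere, for example a socket error; where the observed 2 minutes come from is not established in this thread.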

ceache requested review from ceache, bbangert, jeffwidman and StephenSorriaux December 10, 2020 20:59

StephenSorriaux previously approved these changes Dec 10, 2020

jeffwidman previously approved these changes Dec 10, 2020

ztzg dismissed stale reviews from jeffwidman and StephenSorriaux via aa2df84 December 13, 2020 15:13

ztzg force-pushed the queue-buildup-large-responses branch from 46f0873 to aa2df84 Compare December 13, 2020 15:13

StephenSorriaux approved these changes Dec 13, 2020

jeffwidman mentioned this pull request Dec 13, 2020

Emit warning if ping socket select takes longer #632

Open

jeffwidman approved these changes Dec 13, 2020

jeffwidman merged commit 89e0660 into python-zk:master Dec 13, 2020
