Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(core): do not allow responses to choke request and ping processing #633

Merged
merged 1 commit into from
Dec 13, 2020

Conversation

ztzg
Copy link
Contributor

@ztzg ztzg commented Dec 10, 2020

Why is this needed?

Without this patch, a single select event is processed by iteration in the ConnectionHandler event loop.

In a scenario where the client issues a large number of async requests with an important amplification factor, e.g. get_children_async on a large node, it is possible for the 'select' operation to almost always return a "response ready" socket—as the server is often able to
process, serialize and ship a new reponse while Kazoo processes the previous one.

That response socket often (always?) ends up at the beginning of the list returned by select.

As only select_result[0] is processed in the loop, this can cause the client to ignore the "request ready" FD for a long time, during which no requests or pings are sent.

In effect, asynchronously "browsing" a large tree of nodes can stretch that duration to the point where it exceeds the timeout—causing the client to lose its session.

This patch considers both descriptors after select, and also arranges for pings to be sent in case it encounters an "unending" stream of responses to requests which were sent earlier.

Does this PR introduce any breaking change?

Not purposefully :)

@ztzg
Copy link
Contributor Author

ztzg commented Dec 10, 2020

Cc: @ceache.

@ceache
Copy link
Contributor

ceache commented Dec 10, 2020

LGTM (as discussed offline).

Copy link
Member

@StephenSorriaux StephenSorriaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR, great catch.

Love the Not purposefully :) part also :)

jeffwidman
jeffwidman previously approved these changes Dec 10, 2020
Copy link
Member

@jeffwidman jeffwidman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with the others, looks good to me. I particularly appreciate the detailed explanation of the problem, as it made it very clear why the change was necessary.

response = self._read_socket(read_timeout)
if response == CLOSE_RESPONSE:
break
if self._read_sock in s:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to add a comment here saying something like:

Check if any requests need sending before proceeding to process more responses. Otherwise the responses may choke out the requests.

That way if someone refactors this code, they'll be aware of the issue... providing that "why" context is super important IMO for tricky code like this. Ideally we'd add a test, but this kind of perf issue is super hard to test reliably so I completely understand why there isn't one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review! Comment added.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, you may already know this, but most modern fonts are designed so you only need one space after the period, not two. Not something that should hold up this PR at all, but just an FYI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jeffwidman,

Right. Old habits die hard… In any case, I should have paid closer attention to surrounding comments; sorry about that.

Without this patch, a single select event is processed by iteration in
the 'ConnectionHandler' event loop.

In a scenario where the client issues a large number of async requests
with an important amplification factor, e.g. 'get_children_async' on a
large node, it is possible for the 'select' operation to almost always
return a "response ready" socket--as the server is often able to
process, serialize and ship a new reponse while Kazoo processes the
previous one.

That response socket often (always?) ends up at the beginning of the
list returned by 'select'.

As only 'select_result[0]' is processed in the loop, this can cause
the client to ignore the "request ready" FD for a long time, during
which no requests or pings are sent.

In effect, asynchronously "browsing" a large tree of nodes can stretch
that duration to the point where it exceeds the timeout--causing the
client to lose its session.

This patch considers both descriptors after 'select', and also
arranges for pings to be sent in case it encounters an "unending"
stream of responses to requests which were sent earlier.
@ztzg ztzg dismissed stale reviews from jeffwidman and StephenSorriaux via aa2df84 December 13, 2020 15:13
@ztzg ztzg force-pushed the queue-buildup-large-responses branch from 46f0873 to aa2df84 Compare December 13, 2020 15:13
@jeffwidman jeffwidman merged commit 89e0660 into python-zk:master Dec 13, 2020
@jeffwidman
Copy link
Member

Merged! Thank you again for the great PR and handling the niceties of squashing etc to make it easy for me to merge.

@ztzg
Copy link
Contributor Author

ztzg commented Dec 14, 2020

Thank you!

@liang-kang
Copy link

If kazoo client connected zk server crashed, and exits a loop task for client to send request.

The client will enter below code for every request and not be aware the server has gone for a long time.

if self._read_sock in s
    self._send_request(read_timeout, connect_timeout)
    # Requests act as implicit pings.
    last_send = time.time()
    continue

If set client timeout to 40s and send request every 5s, client will cost 2 minutes to drop the connect in my platform.

Does that work as design?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants