fix(core): do not allow responses to choke request and ping processing #633
Conversation
Cc: @ceache.
LGTM (as discussed offline).
Thanks for this PR, great catch. Love the "Not purposefully :)" part also :)
Agree with the others, looks good to me. I particularly appreciate the detailed explanation of the problem, as it made it very clear why the change was necessary.
```python
response = self._read_socket(read_timeout)
if response == CLOSE_RESPONSE:
    break
if self._read_sock in s:
```
You may want to add a comment here saying something like:
Check if any requests need sending before proceeding to process more responses. Otherwise the responses may choke out the requests.
That way if someone refactors this code, they'll be aware of the issue... providing that "why" context is super important IMO for tricky code like this. Ideally we'd add a test, but this kind of perf issue is super hard to test reliably so I completely understand why there isn't one.
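To make the "why" concrete, here is a minimal, self-contained sketch of the hazard using plain socketpairs. This is a hedged illustration only — the two sockets are stand-ins for kazoo's server connection and internal request pipe, not the actual kazoo internals:

```python
import select
import socket

# Two ready descriptors, mirroring the ConnectionHandler loop: one
# "server response" socket and one internal "request ready" socket.
# (These socketpairs are stand-ins, not kazoo's real sockets.)
resp_a, resp_b = socket.socketpair()
req_a, req_b = socket.socketpair()

resp_b.sendall(b"response")  # a response is waiting...
req_b.sendall(b"request")    # ...and so is a queued request

ready, _, _ = select.select([resp_a, req_a], [], [], 1.0)

# Check if any requests need sending before proceeding to process more
# responses; otherwise the responses may choke out the requests.
handled = []
if resp_a in ready:
    handled.append(resp_a.recv(16))
if req_a in ready:
    handled.append(req_a.recv(16))

print(handled)  # both descriptors are serviced in one pass
```

Servicing every ready descriptor per pass, rather than just the first, is the essence of the fix.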
Thank you for the review! Comment added.
FWIW, you may already know this, but most modern fonts are designed so you only need one space after the period, not two. Not something that should hold up this PR at all, but just an FYI.
Hi @jeffwidman,
Right. Old habits die hard… In any case, I should have paid closer attention to surrounding comments; sorry about that.
Without this patch, a single select event is processed per iteration of the 'ConnectionHandler' event loop. In a scenario where the client issues a large number of async requests with an important amplification factor, e.g. 'get_children_async' on a large node, it is possible for the 'select' operation to almost always return a "response ready" socket--as the server is often able to process, serialize and ship a new response while Kazoo processes the previous one. That response socket often (always?) ends up at the beginning of the list returned by 'select'. As only 'select_result[0]' is processed in the loop, this can cause the client to ignore the "request ready" FD for a long time, during which no requests or pings are sent. In effect, asynchronously "browsing" a large tree of nodes can stretch that duration to the point where it exceeds the timeout--causing the client to lose its session.

This patch considers both descriptors after 'select', and also arranges for pings to be sent in case it encounters an "unending" stream of responses to requests which were sent earlier.
Merged! Thank you again for the great PR and handling the niceties of squashing etc to make it easy for me to merge.
Thank you!
If the ZooKeeper server that the kazoo client is connected to crashes, the client's send loop exits and the client enters the code below for every request, remaining unaware for a long time that the server is gone.
With the client timeout set to 40s and a request sent every 5s, it takes about 2 minutes on my platform for the client to drop the connection. Is that working as designed?
Why is this needed?

Without this patch, a single select event is processed per iteration of the `ConnectionHandler` event loop. In a scenario where the client issues a large number of async requests with an important amplification factor, e.g. `get_children_async` on a large node, it is possible for the `select` operation to almost always return a "response ready" socket--as the server is often able to process, serialize and ship a new response while Kazoo processes the previous one. That response socket often (always?) ends up at the beginning of the list returned by `select`. As only `select_result[0]` is processed in the loop, this can cause the client to ignore the "request ready" FD for a long time, during which no requests or pings are sent. In effect, asynchronously "browsing" a large tree of nodes can stretch that duration to the point where it exceeds the timeout--causing the client to lose its session.

This patch considers both descriptors after `select`, and also arranges for pings to be sent in case it encounters an "unending" stream of responses to requests which were sent earlier.

Does this PR introduce any breaking change?
Not purposefully :)
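The ping half of the fix described above can be sketched as a simplified model: keep draining responses, but check the ping deadline between each one so an "unending" stream cannot postpone the heartbeat. The function and its parameters here are hypothetical illustrations, not kazoo's actual API:

```python
import time

def drain_responses(ready_responses, last_ping, ping_interval, send_ping):
    """Process a stream of ready responses, but never let them delay a
    due ping past its deadline (simplified model of the patch's intent)."""
    for handle in ready_responses:
        handle()  # process one response
        now = time.monotonic()
        if now - last_ping >= ping_interval:
            send_ping()  # heartbeat fires even mid-stream
            last_ping = now
    return last_ping

# Usage: even with a continuous stream of responses, an overdue ping
# is sent instead of waiting for the stream to end.
pings = []
start = time.monotonic()
drain_responses(
    ready_responses=[lambda: time.sleep(0.01)] * 5,
    last_ping=start - 1.0,  # ping already overdue when draining starts
    ping_interval=0.5,
    send_ping=lambda: pings.append(time.monotonic()),
)
print(len(pings))
```

The design point is simply that the heartbeat check lives inside the response-draining loop rather than after it, so session liveness no longer depends on the response stream pausing.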