
Limit the number of ephemeral nodes a session can create #118

Merged (17 commits) on Nov 8, 2023

Conversation

@GrantPSpencer commented Sep 25, 2023

Description

In ZooKeeper, an ephemeral node is only kept alive while the session that created it is still active. When a client closes its session, all ephemeral nodes created by that session are deleted. Currently, there is no limit on how many ephemeral nodes a single session can create, so if a large enough number of ephemeral nodes exist, the work needed to delete those nodes and communicate the deletions to all followers can overwhelm the ZooKeeper server. This behavior has caused two previous ZooKeeper site-up issues at LinkedIn.

My proposed solution to this problem is to throttle based on the cumulative number of bytes it takes to store the paths of all the znodes created within a session. When a request comes into the PrepRequestProcessor, we check the DataTree and deny any create request if the session has already reached or exceeded its byte limit.

With this approach, it is important to note that an accepted request does not immediately affect the ephemeral node count: the request must reach the FinalRequestProcessor before it affects the DataTree. This lag between accepting an ephemeral create request and updating the count means we do not strictly uphold the limit. However, I believe this inaccuracy is acceptable, as it should only cause minor variance in the total number of nodes allowed to be created and still achieves the primary goal of preventing a server from being overwhelmed when a session closes. If we wanted to be extra cautious, we could also fail requests at the end of the request processing pipeline, when the new node is actually added to the DataTree.
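The admission check described above could be sketched roughly as follows. This is a minimal, self-contained illustration rather than the actual PR code: the class and method names (`SessionEphemeralBudget`, `tryAdmitCreate`) are hypothetical, and the real implementation would consult the DataTree from the PrepRequestProcessor instead of keeping its own map.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a per-session byte-budget check for ephemeral
// creates. Names are illustrative, not from the PR itself.
class SessionEphemeralBudget {
    private final long byteLimit; // max cumulative ephemeral path bytes per session
    private final Map<Long, Long> bytesBySession = new HashMap<>();

    SessionEphemeralBudget(long byteLimit) {
        this.byteLimit = byteLimit;
    }

    // Called when an ephemeral create is pre-processed: deny it if the
    // session has already reached or exceeded its budget, otherwise
    // account for the new path and admit the request.
    boolean tryAdmitCreate(long sessionId, String path) {
        long used = bytesBySession.getOrDefault(sessionId, 0L);
        if (used >= byteLimit) {
            return false; // budget exhausted; reject the create
        }
        long pathBytes = path.getBytes(StandardCharsets.UTF_8).length;
        bytesBySession.put(sessionId, used + pathBytes);
        return true;
    }

    // Called when the session closes and its ephemerals are deleted.
    void onSessionClosed(long sessionId) {
        bytesBySession.remove(sessionId);
    }
}
```

Note that, matching the description above, the check is reached-or-exceeded rather than would-exceed, so a session can overshoot the limit by at most one path's worth of bytes.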

Four years ago a PR (apache#1144) for a similar feature was opened against open-source ZooKeeper, but it was never merged. The approach in that PR is slightly different: it keeps a counter in the PrepRequestProcessor to track the number of ephemeral nodes currently in the DataTree. That solution does not have the lag-time problem of my proposed solution, but it does not account for a request failing at any step other than the PrepRequestProcessor. This could lead to the counter desyncing from the actual DataTree and remaining in an error state until reboot. Attempting to cover every place a request could fail would also add a fair amount of constant work to the server for a feature intended to prevent an infrequent occurrence.

(Please let me know if this is not the correct repo/branch to target with these changes)

Tests

The following tests are written for this issue:
org/apache/zookeeper/server/quorum/EphemeralNodeThrottlingTest.java

The following is the result of running "mvn test" on the appropriate module:

$ mvn test -o -Dtest=EphemeralNodeThrottlingTest -pl zookeeper-server/

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.zookeeper.server.quorum.EphemeralNodeThrottlingTest
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 14.93 s - in org.apache.zookeeper.server.quorum.EphemeralNodeThrottlingTest
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  21.543 s
[INFO] Finished at: 2023-09-25T13:57:39-07:00
[INFO] ------------------------------------------------------------------------

@rahulrane50 (Collaborator) left a comment:

Overall in the right direction, Grant! Thanks for picking this up.

@rahulrane50 (Collaborator) left a comment:

Excellent job @GrantPSpencer! Let's test this in EI once it's in!

@zpinto (Member) left a comment:

Nice work Grant! I left a few comments.

Also, I have one question: why are we using the size of the path as the limit? Is it because we are worried about the depth of the tree of ephemeral nodes? If it is just the ephemeral node count in general, counting each node as 1 would be more efficient than iterating over all the characters in the path to get its size.

@GrantPSpencer (Author) replied:

> Nice work Grant! I left a few comments.
>
> Also, I have one question: why are we using the size of the path as the limit? Is it because we are worried about the depth of the tree of ephemeral nodes? If it is just the ephemeral node count in general, counting each node as 1 would be more efficient than iterating over all the characters in the path to get its size.

We're using the sum of all the path sizes for the session as the limit because the closeSession transaction will contain all of those znode paths. If the sum of those paths is larger than the jute max buffer, the server will crash. I did look into ways to split up the closeSession transaction but did not come up with any viable solutions.

Previously I used the number of ephemeral znodes as a proxy, but Kiran pointed out that estimating the sum of the path sizes would be a more accurate way to track it.
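The back-of-envelope arithmetic behind this reasoning can be sketched as follows. This is an illustrative, hypothetical helper, not code from the PR: it ignores jute serialization overhead and simply compares the cumulative UTF-8 byte length of a session's ephemeral paths against the default jute.maxbuffer of 0xfffff bytes (just under 1 MB).

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Rough estimate of how a session's ephemeral paths relate to the jute
// buffer limit. The real closeSession transaction also carries per-record
// serialization overhead, which this sketch deliberately ignores.
class CloseSessionSizeEstimate {
    // jute.maxbuffer defaults to 0xfffff bytes (just under 1 MB).
    static final long DEFAULT_JUTE_MAX_BUFFER = 0xfffff;

    // Sum of the UTF-8 byte lengths of all ephemeral paths in a session.
    static long estimatePathBytes(List<String> ephemeralPaths) {
        long total = 0;
        for (String p : ephemeralPaths) {
            total += p.getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    // Would a closeSession txn listing these paths fit under the default
    // buffer limit (ignoring serialization overhead)?
    static boolean fitsInJuteBuffer(List<String> ephemeralPaths) {
        return estimatePathBytes(ephemeralPaths) <= DEFAULT_JUTE_MAX_BUFFER;
    }
}
```

This is why a per-path count alone is an imprecise guard: many short paths and a few very long ones can represent the same node count but very different transaction sizes.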

@zpinto (Member) left a comment:

LGTM, really nice work @GrantPSpencer!

@desaikomal (Collaborator) left a comment:

You pivoted very fast from just a count to using size. This is a great change. Just one minor comment.

@desaikomal (Collaborator) left a comment:

Great change. Thanks @GrantPSpencer for working through this.

@rahulrane50 rahulrane50 merged commit 6433cbd into linkedin:branch-3.6 Nov 8, 2023
@abhilash1in abhilash1in mentioned this pull request Apr 16, 2024