Improve CostBalancerStrategy, deprecate cachingCost #14484
Conversation
@kfaraz This is a really neat idea.
Thanks for the suggestion, @AmatyaAvadhanula !

Referenced snippet: `druid/server/src/main/java/org/apache/druid/server/coordinator/balancer/CostBalancerStrategy.java`, lines 278 to 300 in 7d9656f.
```diff
@@ -334,42 +367,10 @@ public void testFindServerAfterExecutorShutdownThrowsException()
   );
 }

-  @Test(timeout = 90_000L)
-  public void testFindServerRaisesAlertOnTimeout()
+  private void verifyPlacementCost(DataSegment segment, ServerHolder server, double expectedCost)
```
Why do we manually pass the `expectedCost`?
Can we not compare the results of the method `computePlacementCost` with the sum of individual costs of the candidate segment across all segments on the server (i.e., the original implementation)?
Sure, that can be another assertion in this test or another test.
This particular test was added to ensure that cost remains the same even when we make changes to the way cost is computed, such as in this PR. That is why the exact numerical literals have been used.
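A minimal sketch of what such an extra assertion could look like, assuming it lives inside `CostBalancerStrategyTest` with a `strategy` field available. The helper name, the iteration API, the `computeJointSegmentsCost` signature, and the tolerance are assumptions, not the exact Druid code:

```java
// Sketch only: cross-check the optimized placement cost against a brute-force
// sum of joint costs with every segment already on the server.
private void verifyPlacementCostMatchesBruteForce(DataSegment segment, ServerHolder server)
{
  double bruteForceCost = 0.0;
  for (DataSegment servedSegment : server.getServer().iterateAllSegments()) {
    bruteForceCost += CostBalancerStrategy.computeJointSegmentsCost(segment, servedSegment);
  }
  Assert.assertEquals(bruteForceCost, strategy.computePlacementCost(segment, server), 1e-6);
}
```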
```diff
@@ -431,7 +433,7 @@ private int dropReplicas(
     Iterator<ServerHolder> serverIterator =
         (useRoundRobinAssignment || eligibleLiveServers.size() >= remainingNumToDrop)
```
Should this be `eligibleLiveServers.size() <= remainingNumToDrop`?
Yes, I think so. Thanks for catching this!
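For reference, a sketch of the corrected condition; the ternary branches shown here are assumptions about the surrounding code, not a verbatim quote:

```java
// Skip strategy-based server selection when every eligible server must drop a
// replica anyway, i.e. there are at least as many pending drops as servers.
Iterator<ServerHolder> serverIterator =
    (useRoundRobinAssignment || eligibleLiveServers.size() <= remainingNumToDrop)
    ? eligibleLiveServers.iterator()                               // assumed branch
    : strategy.pickServersToDrop(segment, eligibleLiveServers);   // assumed branch
```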
Thank you, LGTM!
Thanks a lot for the feedback, @AmatyaAvadhanula !!
Changes to `cost` strategy:
- In every `ServerHolder`, track the number of segments per datasource per interval.
- Perform cost computations for a given interval just once, and then multiply by a constant factor to account for the total segment count in that interval (a sketch of this idea follows the list).
- Do not perform joint cost computations with segments that are outside the compute interval (± 45 days) for the segment being considered for move.
- Remove metrics `segment/cost/*` as they were coordinator killers! Turning on these metrics (by setting `emitBalancingStats` to true) has often caused the coordinator to be stuck for hours. Moreover, they are too complicated to decipher and do not provide any meaningful insight into a Druid cluster.
- Add new, simpler metrics `segment/balancer/compute/*` to track cost computation time, count, and errors.

Other changes:
- Remove flaky test from `CostBalancerStrategyTest`.
- Add tests to verify that computed cost has remained unchanged.
- Remove usages of mock `BalancerStrategy` from `LoadRuleTest` and `BalanceSegmentsTest`.
- Clean up the `BalancerStrategy` interface.
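A minimal sketch of the interval-bucketing idea; all names here (`computeServerCost`, `intervalCost`, the count map) are illustrative and not the actual Druid implementation:

```java
import java.util.Map;
import org.joda.time.Interval;

// Illustrative sketch: evaluate the joint cost once per distinct interval on
// the server, then scale by how many segments that interval holds.
class IntervalBucketedCost
{
  double computeServerCost(Interval candidateInterval, Map<Interval, Integer> segmentCountPerInterval)
  {
    double totalCost = 0.0;
    for (Map.Entry<Interval, Integer> entry : segmentCountPerInterval.entrySet()) {
      // One cost computation per interval, reused for every segment in it
      totalCost += intervalCost(candidateInterval, entry.getKey()) * entry.getValue();
    }
    return totalCost;
  }

  // Placeholder for the pairwise interval cost function
  double intervalCost(Interval a, Interval b)
  {
    return 0.0; // actual cost math omitted in this sketch
  }
}
```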
`cachingCost` has been deprecated in #14484 and is not advised to be used in production clusters, as it may cause usage skew across historicals which the coordinator is unable to rectify. This PR completely disables the `cachingCost` strategy, as it has now been rendered redundant by recent performance improvements made to the `cost` strategy.

Changes:
- Disable the `cachingCost` strategy.
- Add `DisabledCachingCostBalancerStrategyFactory` for the time being so that we can give a proper error message before falling back to `CostBalancerStrategy` (a rough sketch follows this list). This will be removed in subsequent releases.
- Retain `CachingCostBalancerStrategy` for testing/benchmarking purposes.
- Add javadocs to `DiskNormalizedCostBalancerStrategy`.
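A rough sketch of the fallback behaviour described above; `DisabledCachingCostBalancerStrategyFactory` is the class named in this PR, but the method signature and log message here are assumptions:

```java
import com.google.common.util.concurrent.ListeningExecutorService;
import org.apache.druid.java.util.common.logger.Logger;

// Sketch: warn that cachingCost is disabled and fall back to the cost strategy.
public class DisabledCachingCostBalancerStrategyFactory implements BalancerStrategyFactory
{
  private static final Logger log = new Logger(DisabledCachingCostBalancerStrategyFactory.class);

  @Override
  public BalancerStrategy createBalancerStrategy(ListeningExecutorService exec)
  {
    log.warn("Balancer strategy[cachingCost] has been disabled. Falling back to strategy[cost].");
    return new CostBalancerStrategy(exec);
  }
}
```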
Motivation for balancing
With round robin segment assignment now enabled by default, coordinator runs are fast and clusters are generally well balanced. But as the name suggests, these assignments are round-robin and may be sub-optimal as they are not based on any server/segment-specific strategy.
So, in every coordinator run, assignment is immediately followed by balancing to ensure that segments are moved to the most strategic servers to optimize query performance. The balancing decisions are dictated by the strategy being used on the cluster. The most viable and strongly recommended candidate for this is the `cost` balancer strategy. The other two, `cachingCost` and `diskNormalized`, were meant to be improvements over `cost` but come with their own set of challenges.

Limits of `cost` balancer strategy

The segment balancing algorithm is throttled by `maxSegmentsToMove`, which has been a way to ensure that the coordinator does not get stuck performing too many cost computations. With the recent coordinator improvements, the upper bound for `maxSegmentsToMove` has now been set at 1000. While this number is good enough for smaller clusters with up to 100k used segments, it poses some problems at larger scales:
- Setting `maxSegmentsToMove` to higher values (say 5000) has often caused coordinator runs to get stuck for hours, resulting in the entire Druid cluster malfunctioning.
- As Druid clusters start handling more segments, this problem only gets compounded.

Thus, there is a need to improve the performance of the `cost` balancer strategy.

Proposed improvements
This PR avoids redundant cost computations by leveraging the following facts:
- Any segment of interval `I1` would have exactly the same cost with any segment of interval `I2`. In other words, computed interval costs can be reused. This is similar to the approach of the `cachingCost` strategy while bucketing segments into compute intervals.
- The joint cost of two segments that lie more than 45 days apart is negligible (see the check after this list). `CostBalancerStrategyTest.testJointSegmentsCostWith45DayGap()` also proves such costs to be negligible.
also proves such costs to be negligible.Changes to
cost
strategyServerHolder
, track the number of segments per datasource per intervalsegment/cost/*
as they were coordinator killers! Turning on these metrics (by settingemitBalancingStats
to true) has often caused the coordinator to be stuck for hours. Moreover, they are too complicated to decipher and do not provide any meaningful insight into the a Druid cluster.Other changes
CostBalancerStrategyTest
. No other functional change to this test to ensure that the new strategy works as expected.BalancerStrategy
fromLoadRuleTest
BalancerStrategy
interfaceTesting
All unit tests and simulation tests run successfully with no changes required.
Comparison of simulated run times, in seconds, of the `BalanceSegments` duty. Setup: 200 servers, `maxSegmentsToMove` = 1000, `balancerComputeThreads` = 1.

| Segments (intervals x partitions) | `maxSegmentsToMove` = 100, `balancerComputeThreads` = 1 | `maxSegmentsToMove` = 1000, `balancerComputeThreads` = 4 | `maxSegmentsToMove` = 1000, `balancerComputeThreads` = 4, x 10 datasources |
|---|---|---|---|

(Only the table headers survive in this extract; the measured run-time values are not recoverable.)
Cluster tests
🟨 PENDING
Release note
The `cost` balancer strategy performs much better now and is capable of moving more segments in a single coordinator run. These improvements were made by borrowing ideas from the `cachingCost` strategy.

The `cachingCost` strategy itself has been known to cause issues and is now deprecated. It is likely to be removed in future Druid releases and must not be used on any cluster.