Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docs: Changes for coordinator improvements #14590

Merged
merged 5 commits into from
Jul 18, 2023
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
33 changes: 27 additions & 6 deletions docs/configuration/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -887,7 +887,7 @@ These Coordinator static configurations can be defined in the `coordinator/runti
|Property|Possible Values|Description|Default|
|--------|---------------|-----------|-------|
|`druid.serverview.type`|batch or http|Segment discovery method to use. "http" enables discovering segments using HTTP instead of ZooKeeper.|http|
|`druid.coordinator.loadqueuepeon.type`|curator or http|Whether to use "http" or "curator" implementation to assign segment loads/drops to historical|http|
|`druid.coordinator.loadqueuepeon.type`|curator or http|Implementation to use to assign segment loads and drops to historicals. Curator-based implementation is now deprecated and all users should move to using HTTP-based segment assignments.|http|
kfaraz marked this conversation as resolved.
Show resolved Hide resolved
|`druid.coordinator.segment.awaitInitializationOnStart`|true or false|Whether the Coordinator will wait for its view of segments to fully initialize before starting up. If set to 'true', the Coordinator's HTTP server will not start up, and the Coordinator will not announce itself as available, until the server view is initialized.|true|

###### Additional config when "http" loadqueuepeon is used
Expand All @@ -907,7 +907,7 @@ These Coordinator static configurations can be defined in the `coordinator/runti

#### Dynamic Configuration

The Coordinator has dynamic configuration to change certain behavior on the fly.
The Coordinator has dynamic configurations to tune certain behavior on the fly, without requiring a service restart.

It is recommended that you use the [web console](../operations/web-console.md) to configure these parameters.
However, if you need to do it via HTTP, the JSON object can be submitted to the Coordinator via a POST request at:
Expand Down Expand Up @@ -949,10 +949,11 @@ Issuing a GET request at the same URL will return the spec that is currently in
|`millisToWaitBeforeDeleting`|How long does the Coordinator need to be a leader before it can start marking overshadowed segments as unused in metadata storage.|900000 (15 mins)|
|`mergeBytesLimit`|The maximum total uncompressed size in bytes of segments to merge.|524288000L|
|`mergeSegmentsLimit`|The maximum number of segments that can be in a single [append task](../ingestion/tasks.md).|100|
|`smartSegmentLoading`|Whether to turn on the new ["smart"-mode of segment loading](#smart-segment-loading) which dynamically computes the optimal values of several parameters that maximize Coordinator performance.|true|
kfaraz marked this conversation as resolved.
Show resolved Hide resolved
|`maxSegmentsToMove`|The maximum number of segments that can be moved at any given time.|100|
|`replicantLifetime`|The maximum number of Coordinator runs for a segment to be replicated before we start alerting.|15|
|`replicationThrottleLimit`|The maximum number of segments that can be in the replication queue of a historical tier at any given time.|500|
|`balancerComputeThreads`|Thread pool size for computing moving cost of segments in segment balancing. Consider increasing this if you have a lot of segments and moving segments starts to get stuck.|1|
|`replicantLifetime`|The maximum number of Coordinator runs for which a segment can wait in the load queue of a Historical before Druid raises an alert.|15|
|`replicationThrottleLimit`|The maximum number of segment replicas that can be assigned to a Historical tier in a single Coordinator run. This parameter is a defensive measure to prevent Historicals from getting overwhelmed loading extra replicas of segments that are already available in the cluster.|500|
kfaraz marked this conversation as resolved.
Show resolved Hide resolved
|`balancerComputeThreads`|Thread pool size for computing moving cost of segments during segment balancing. Consider increasing this if you have a lot of segments and moving segments starts to get stuck.|1|
kfaraz marked this conversation as resolved.
Show resolved Hide resolved
|`killDataSourceWhitelist`|List of specific data sources for which kill tasks are sent if property `druid.coordinator.kill.on` is true. This can be a list of comma-separated data source names or a JSON array.|none|
|`killPendingSegmentsSkipList`|List of data sources for which pendingSegments are _NOT_ cleaned up if property `druid.coordinator.kill.pendingSegments.on` is true. This can be a list of comma-separated data sources or a JSON array.|none|
|`maxSegmentsInNodeLoadingQueue`|The maximum number of segments allowed in the load queue of any given server. Use this parameter to load segments faster if, for example, the cluster contains slow-loading nodes or if there are too many segments to be replicated to a particular node (when faster loading is preferred to better segments distribution). The optimal value depends on the loading speed of segments, acceptable replication time and number of nodes. |500|
Expand All @@ -961,9 +962,29 @@ Issuing a GET request at the same URL will return the spec that is currently in
|`decommissioningMaxPercentOfMaxSegmentsToMove`| Upper limit of segments the Coordinator can move from decommissioning servers to active non-decommissioning servers during a single run. This value is relative to the total maximum number of segments that can be moved at any given time based upon the value of `maxSegmentsToMove`.<br /><br />If `decommissioningMaxPercentOfMaxSegmentsToMove` is 0, the Coordinator does not move segments to decommissioning servers, effectively putting them in a type of "maintenance" mode. In this case, decommissioning servers do not participate in balancing or assignment by load rules. The Coordinator still considers segments on decommissioning servers as candidates to replicate on active servers.<br /><br />Decommissioning can stall if there are no available active servers to move the segments to. You can use the maximum percent of decommissioning segment movements to prioritize balancing or to decrease commissioning time to prevent active servers from being overloaded. The value must be between 0 and 100.|70|
|`pauseCoordination`| Boolean flag for whether or not the coordinator should execute its various duties of coordinating the cluster. Setting this to true essentially pauses all coordination work while allowing the API to remain up. Duties that are paused include all classes that implement the `CoordinatorDuty` Interface. Such duties include: Segment balancing, Segment compaction, Submitting kill tasks for unused segments (if enabled), Logging of used segments in the cluster, Marking of newly unused or overshadowed segments, Matching and execution of load/drop rules for used segments, Unloading segments that are no longer marked as used from Historical servers. An example of when an admin may want to pause coordination would be if they are doing deep storage maintenance on HDFS Name Nodes with downtime and don't want the coordinator to be directing Historical Nodes to hit the Name Node with API requests until maintenance is done and the deep store is declared healthy for use again. |false|
|`replicateAfterLoadTimeout`| Boolean flag for whether or not additional replication is needed for segments that have failed to load due to the expiry of `druid.coordinator.load.timeout`. If this is set to true, the coordinator will attempt to replicate the failed segment on a different historical server. This helps improve the segment availability if there are a few slow historicals in the cluster. However, the slow historical may still load the segment later and the coordinator may issue drop requests if the segment is over-replicated.|false|
|`maxNonPrimaryReplicantsToLoad`|This is the maximum number of non-primary segment replicants to load per Coordination run. This number can be set to put a hard upper limit on the number of replicants loaded. It is a tool that can help prevent long delays in new data being available for query after events that require many non-primary replicants to be loaded by the cluster; such as a Historical node disconnecting from the cluster. The default value essentially means there is no limit on the number of replicants loaded per coordination cycle. If you want to use a non-default value for this config, you may want to start with it being `~20%` of the number of segments found on your Historical server with the most segments. You can use the Druid metric, `coordinator/time` with the filter `duty=org.apache.druid.server.coordinator.duty.RunRules` to see how different values of this config impact your Coordinator execution time.|`Integer.MAX_VALUE`|
|`maxNonPrimaryReplicantsToLoad`|The maximum number of replicas that can be assigned across all tiers in a single Coordinator run. This parameter serves the same purpose as `replicationThrottleLimit` except this limit applies at the cluster-level instead of per tier. The default value essentially means that there is no limit on the number of replicas assigned per coordination cycle. If you want to use a non-default value for this config, you may want to start with it being `~20%` of the number of segments found on the Historical server with the most segments. Use the Druid metric, `coordinator/time` with the filter `duty=org.apache.druid.server.coordinator.duty.RunRules` to see how different values of this config impact your Coordinator execution time.|`Integer.MAX_VALUE` (i.e. no limit)|
kfaraz marked this conversation as resolved.
Show resolved Hide resolved

##### Smart segment loading

The `smartSegmentLoading` mode of the Coordinator makes configuring it for segment loading and balancing much easier.
kfaraz marked this conversation as resolved.
Show resolved Hide resolved
In this mode, the Coordinator does not require the user to provide values of the following parameters and computes them automatically instead.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In this mode, the Coordinator does not require the user to provide values of the following parameters and computes them automatically instead.
If you enable this mode, do not provide values for the properties in the table below.```

**If provided, the values are simply ignored.**
kfaraz marked this conversation as resolved.
Show resolved Hide resolved
The computed values are based on the current state of the cluster and are meant to optimize the performance of the Coordinator.
kfaraz marked this conversation as resolved.
Show resolved Hide resolved

|Property|Computed value|Explanation|
kfaraz marked this conversation as resolved.
Show resolved Hide resolved
|--------|--------------|-----------|
|`useRoundRobinSegmentAssignment`|true|Speeds up segment assignment|
kfaraz marked this conversation as resolved.
Show resolved Hide resolved
|`maxSegmentsInNodeLoadingQueue`|0|Removes the limit on load queue size|
kfaraz marked this conversation as resolved.
Show resolved Hide resolved
|`replicationThrottleLimit`|2% of used segments, minimum value = 100|Ensures that replication is not done too aggressively in case of a historical disappearing only intermittently.|
kfaraz marked this conversation as resolved.
Show resolved Hide resolved
|`replicantLifetime`|60|Allows segments to wait about an hour (assuming a coordinator period of 1 minute) in the load queue before an alert is raised. This value is higher than the previous default of 15 because in `smartSegmentLoading` mode, load queues are not limited by size. Thus, segments might get assigned to a load queue even if the corresponding server is slow to load them.|
kfaraz marked this conversation as resolved.
Show resolved Hide resolved
|`maxNonPrimaryReplicantsToLoad`|`Integer.MAX_VALUE` (no limit)|This throttling is already handled by `replicationThrottleLimit`.|
|`maxSegmentsToMove`|2% of used segments, minimum value = 100, maximum value = 1000|Ensures that some segments are always moving in the cluster to keep it well balanced. The maximum value keeps the coordinator run times bounded.|
kfaraz marked this conversation as resolved.
Show resolved Hide resolved
|`decommissioningMaxPercentOfMaxSegmentsToMove`|100|Prioritizes move of segments from decommissioning servers so that they can be terminated quickly.|
kfaraz marked this conversation as resolved.
Show resolved Hide resolved

When `smartSegmentLoading` is disabled, the configured values of these parameters are used without any modification.
kfaraz marked this conversation as resolved.
Show resolved Hide resolved
You should disable this mode only if you want to explicitly set the value of any of the above parameters.
kfaraz marked this conversation as resolved.
Show resolved Hide resolved

##### Audit history
To view the audit history of Coordinator dynamic config issue a GET request to the URL -

```
Expand Down
22 changes: 12 additions & 10 deletions docs/operations/metrics.md
Original file line number Diff line number Diff line change
Expand Up @@ -283,19 +283,21 @@ These metrics are for the Druid Coordinator and are reset each time the Coordina

|Metric|Description|Dimensions|Normal Value|
|------|-----------|----------|------------|
|`segment/assigned/count`|Number of segments assigned to be loaded in the cluster.|`tier`|Varies|
|`segment/moved/count`|Number of segments moved in the cluster.|`tier`|Varies|
|`segment/unmoved/count`|Number of segments which were chosen for balancing but were found to be already optimally placed.|`tier`|Varies|
|`segment/dropped/count`|Number of segments chosen to be dropped from the cluster due to being over-replicated.|`tier`|Varies|
|`segment/deleted/count`|Number of segments marked as unused due to drop rules.| |Varies|
|`segment/unneeded/count`|Number of segments dropped due to being marked as unused.|`tier`|Varies|
|`segment/cost/raw`|Used in cost balancing. The raw cost of hosting segments.|`tier`|Varies|
|`segment/cost/normalization`|Used in cost balancing. The normalization of hosting segments.|`tier`|Varies|
|`segment/cost/normalized`|Used in cost balancing. The normalized cost of hosting segments.|`tier`|Varies|
|`segment/assigned/count`|Number of segments assigned to be loaded in the cluster.|`dataSource`, `tier`|Varies|
|`segment/moved/count`|Number of segments moved in the cluster.|`dataSource`, `tier`|Varies|
|`segment/dropped/count`|Number of segments chosen to be dropped from the cluster due to being over-replicated.|`dataSource`, `tier`|Varies|
|`segment/deleted/count`|Number of segments marked as unused due to drop rules.|`dataSource`|Varies|
|`segment/unneeded/count`|Number of segments dropped due to being marked as unused.|`dataSource`, `tier`|Varies|
|`segment/assignSkipped/count`|Number of segments that could not be assigned to any server for loading due to replication throttling, no available disk space, full load queue, or some other reason.|`dataSource`, `tier`, `description`|Varies|
kfaraz marked this conversation as resolved.
Show resolved Hide resolved
|`segment/moveSkipped/count`|Number of segments that were chosen for balancing but could not be moved either due to already being optimally placed or some other reason.|`dataSource`, `tier`, `description`|Varies|
kfaraz marked this conversation as resolved.
Show resolved Hide resolved
|`segment/dropSkipped/count`|Number of segments that could not be dropped from any server.|`dataSource`, `tier`, `description`|Varies|
|`segment/loadQueue/size`|Size in bytes of segments to load.|`server`|Varies|
|`segment/loadQueue/failed`|Number of segments that failed to load.|`server`|0|
|`segment/loadQueue/count`|Number of segments to load.|`server`|Varies|
|`segment/dropQueue/count`|Number of segments to drop.|`server`|Varies|
|`segment/loadQueue/assigned`|Number of segments assigned for load or drop to the load queue of a server.|`dataSource`, `server`|Varies|
|`segment/loadQueue/success`|Number of segment assignments that completed successfully.|`dataSource`, `server`|Varies|
|`segment/loadQueue/failed`|Number of segment assignments that failed to complete.|`dataSource`, `server`|0|
|`segment/loadQueue/cancelled`|Number of segment assignments that were cancelled before completion.|`dataSource`, `server`|Varies|
kfaraz marked this conversation as resolved.
Show resolved Hide resolved
|`segment/size`|Total size of used segments in a data source. Emitted only for data sources to which at least one used segment belongs.|`dataSource`|Varies|
|`segment/count`|Number of used segments belonging to a data source. Emitted only for data sources to which at least one used segment belongs.|`dataSource`|< max|
|`segment/overShadowed/count`|Number of segments marked as unused due to being overshadowed.| |Varies|
Expand Down