-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bug during initialization of HttpServerInventoryView #14517
Conversation
server/src/test/java/org/apache/druid/client/HttpServerInventoryViewTest.java
Fixed
Show resolved
Hide resolved
docs/operations/metrics.md
Outdated
|`segment/serverview/sync/healthy`|Sync status of the Broker with a segment-loading server such as a Historical or Peon. Emitted only when [HTTP-based server view](../configuration/index.md#segment-management) is enabled. This metric can be used in conjunction with `segment/serverview/sync/unstableTime` to debug slow startup of the Coordinator.|`server`, `tier`|1 for fully synced servers, 0 otherwise| | ||
|`segment/serverview/sync/unstableTime`|Time in milliseconds for which the Broker has been failing to sync with a segment-loading server. Emitted only when [HTTP-based server view](../configuration/index.md#segment-management) is enabled.|`server`, `tier`|Not emitted for synced servers.| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|`segment/serverview/sync/healthy`|Sync status of the Broker with a segment-loading server such as a Historical or Peon. Emitted only when [HTTP-based server view](../configuration/index.md#segment-management) is enabled. This metric can be used in conjunction with `segment/serverview/sync/unstableTime` to debug slow startup of the Coordinator.|`server`, `tier`|1 for fully synced servers, 0 otherwise| | |
|`segment/serverview/sync/unstableTime`|Time in milliseconds for which the Broker has been failing to sync with a segment-loading server. Emitted only when [HTTP-based server view](../configuration/index.md#segment-management) is enabled.|`server`, `tier`|Not emitted for synced servers.| | |
|`segment/serverview/sync/healthy`|Sync status of the Coordinator with a segment-loading server such as a Historical or Peon. Emitted only when [HTTP-based server view](../configuration/index.md#segment-management) is enabled. This metric can be used in conjunction with `segment/serverview/sync/unstableTime` to debug slow startup of the Coordinator.|`server`, `tier`|1 for fully synced servers, 0 otherwise| | |
|`segment/serverview/sync/unstableTime`|Time in milliseconds for which the Coordinator has been failing to sync with a segment-loading server. Emitted only when [HTTP-based server view](../configuration/index.md#segment-management) is enabled.|`server`, `tier`|Not emitted for synced servers.| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this! Did a bunch of copy-pastes and missed this.
} | ||
|
||
private void sync() | ||
{ | ||
if (!startStopLock.awaitStarted(1, TimeUnit.MILLISECONDS)) { | ||
log.info("Skipping sync() call for server[%s].", logIdentity); | ||
log.info("Skipping sync for server[%s] as lifecycle has not started.", logIdentity); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these error messages are also a bit cryptic. I wonder if we can make them more meaningful. like what does lifecycle even mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am updating the message to say "syncer has not started yet" as the startStopLock
denotes nothing but the lifecycle of the syncer itself.
ScheduledExecutors.scheduleAtFixedRate( | ||
executor, | ||
Duration.standardSeconds(60), | ||
Duration.standardMinutes(5), | ||
this::checkAndResetUnhealthyServers | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could run with fixed delay instead of fixed rate.
ScheduledExecutors.scheduleAtFixedRate( | ||
executor, | ||
Duration.standardSeconds(30), | ||
Duration.standardMinutes(1), | ||
this::emitServerStatusMetrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized that one benefit of having a separate executor is that metrics get emitted even if thread pool is busy for other reasons. Thought could be done later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, true. If I am making other changes, I will move metric emission and the check and reset logic to the separate executor.
server/src/main/java/org/apache/druid/client/HttpServerInventoryView.java
Show resolved
Hide resolved
} | ||
|
||
return false; | ||
final long unstableSeconds = getUnstableTimeMillis() / 1000; | ||
final String message = StringUtils.format( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
anything an admin can do when they see this message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added an action in case of exceeding the maxUnstableDuration.
Thanks for the feedback, @abhishekagarwal87 . I have addressed your comments. |
If a server is removed during `HttpServerInventoryView.serverInventoryInitialized`, the initialization gets stuck as this server is never synced. The method eventually times out (default 250s). Fix: Mark a server as stopped if it is removed. `serverInventoryInitialized` only waits for non-stopped servers to sync. Other changes: - Add new metrics for better debugging of slow broker/coordinator startup - `segment/serverview/sync/healthy`: whether the server view is syncing properly with a server - `segment/serverview/sync/unstableTime`: time for which sync with a server has been unstable - Clean up logging in `HttpServerInventoryView` and `ChangeRequestHttpSyncer` - Minor refactor for readability - Add utility class `Stopwatch` - Add tests and stubs
Split from #14468
Bug
Description: If a server is removed during
HttpServerInventoryView.serverInventoryInitialized
, the initialization gets stuck as this server is never synced. The method eventually times out (default 250s).Fix: Mark a server as stopped if it is removed.
serverInventoryInitialized
only waits for non-stopped servers to sync.Changes
HttpServerInventoryView
segment/serverview/sync/healthy
: A 1-0 metric that denotes if the server view is syncing properly with a serversegment/serverview/sync/unstableTime
: tracks the time in millis for which sync with a server has been unstableHttpServerInventoryView
andChangeRequestHttpSyncer
Stopwatch
Classes to review
ChangeRequestHttpSyncer
HttpServerInventoryView
Pending
Release note
This PR has: