Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug during initialization of HttpServerInventoryView #14517

Merged
merged 6 commits into from
Jul 6, 2023

Conversation

kfaraz
Copy link
Contributor

@kfaraz kfaraz commented Jul 3, 2023

Split from #14468

Bug

Description: If a server is removed during HttpServerInventoryView.serverInventoryInitialized, the initialization gets stuck as this server is never synced. The method eventually times out (default 250s).
Fix: Mark a server as stopped if it is removed. serverInventoryInitialized only waits for non-stopped servers to sync.

Changes

  • Fix bug in HttpServerInventoryView
  • Add new metrics for better debugging of slow broker/coordinator startup
    • segment/serverview/sync/healthy: A 1-0 metric that denotes if the server view is syncing properly with a server
    • segment/serverview/sync/unstableTime: tracks the time in millis for which sync with a server has been unstable
  • Clean up logging in HttpServerInventoryView and ChangeRequestHttpSyncer
  • Minor refactor for readability
  • Add utility class Stopwatch
  • Add tests and stubs

Classes to review

  • ChangeRequestHttpSyncer
  • HttpServerInventoryView

Pending

  • cluster testing

Release note


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@abhishekagarwal87 abhishekagarwal87 added this to the 27.0 milestone Jul 3, 2023
@kfaraz kfaraz closed this Jul 4, 2023
@kfaraz kfaraz reopened this Jul 4, 2023
Comment on lines 328 to 329
|`segment/serverview/sync/healthy`|Sync status of the Broker with a segment-loading server such as a Historical or Peon. Emitted only when [HTTP-based server view](../configuration/index.md#segment-management) is enabled. This metric can be used in conjunction with `segment/serverview/sync/unstableTime` to debug slow startup of the Coordinator.|`server`, `tier`|1 for fully synced servers, 0 otherwise|
|`segment/serverview/sync/unstableTime`|Time in milliseconds for which the Broker has been failing to sync with a segment-loading server. Emitted only when [HTTP-based server view](../configuration/index.md#segment-management) is enabled.|`server`, `tier`|Not emitted for synced servers.|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
|`segment/serverview/sync/healthy`|Sync status of the Broker with a segment-loading server such as a Historical or Peon. Emitted only when [HTTP-based server view](../configuration/index.md#segment-management) is enabled. This metric can be used in conjunction with `segment/serverview/sync/unstableTime` to debug slow startup of the Coordinator.|`server`, `tier`|1 for fully synced servers, 0 otherwise|
|`segment/serverview/sync/unstableTime`|Time in milliseconds for which the Broker has been failing to sync with a segment-loading server. Emitted only when [HTTP-based server view](../configuration/index.md#segment-management) is enabled.|`server`, `tier`|Not emitted for synced servers.|
|`segment/serverview/sync/healthy`|Sync status of the Coordinator with a segment-loading server such as a Historical or Peon. Emitted only when [HTTP-based server view](../configuration/index.md#segment-management) is enabled. This metric can be used in conjunction with `segment/serverview/sync/unstableTime` to debug slow startup of the Coordinator.|`server`, `tier`|1 for fully synced servers, 0 otherwise|
|`segment/serverview/sync/unstableTime`|Time in milliseconds for which the Coordinator has been failing to sync with a segment-loading server. Emitted only when [HTTP-based server view](../configuration/index.md#segment-management) is enabled.|`server`, `tier`|Not emitted for synced servers.|

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this! Did a bunch of copy-pastes and missed this.

}

private void sync()
{
if (!startStopLock.awaitStarted(1, TimeUnit.MILLISECONDS)) {
log.info("Skipping sync() call for server[%s].", logIdentity);
log.info("Skipping sync for server[%s] as lifecycle has not started.", logIdentity);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these error messages are also a bit cryptic. I wonder if we can make them more meaningful. like what does lifecycle even mean?

Copy link
Contributor Author

@kfaraz kfaraz Jul 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am updating the message to say "syncer has not started yet" as the startStopLock denotes nothing but the lifecycle of the syncer itself.

Comment on lines 217 to 222
ScheduledExecutors.scheduleAtFixedRate(
executor,
Duration.standardSeconds(60),
Duration.standardMinutes(5),
this::checkAndResetUnhealthyServers
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could run with fixed delay instead of fixed rate.

Comment on lines 223 to 227
ScheduledExecutors.scheduleAtFixedRate(
executor,
Duration.standardSeconds(30),
Duration.standardMinutes(1),
this::emitServerStatusMetrics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that one benefit of having a separate executor is that metrics get emitted even if thread pool is busy for other reasons. Thought could be done later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, true. If I am making other changes, I will move metric emission and the check and reset logic to the separate executor.

}

return false;
final long unstableSeconds = getUnstableTimeMillis() / 1000;
final String message = StringUtils.format(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anything an admin can do when they see this message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an action in case of exceeding the maxUnstableDuration.

@kfaraz
Copy link
Contributor Author

kfaraz commented Jul 5, 2023

Thanks for the feedback, @abhishekagarwal87 . I have addressed your comments.

@kfaraz kfaraz requested a review from abhishekagarwal87 July 6, 2023 02:37
@kfaraz kfaraz merged commit 87bb1b9 into apache:master Jul 6, 2023
@kfaraz kfaraz deleted the fix_bug_serverview branch July 6, 2023 07:34
sergioferragut pushed a commit to sergioferragut/druid that referenced this pull request Jul 21, 2023
If a server is removed during `HttpServerInventoryView.serverInventoryInitialized`,
the initialization gets stuck as this server is never synced. The method eventually times
out (default 250s).

Fix: Mark a server as stopped if it is removed. `serverInventoryInitialized` only waits for
non-stopped servers to sync.

Other changes:
- Add new metrics for better debugging of slow broker/coordinator startup
  - `segment/serverview/sync/healthy`: whether the server view is syncing properly with a server
  - `segment/serverview/sync/unstableTime`: time for which sync with a server has been unstable  
- Clean up logging in `HttpServerInventoryView` and `ChangeRequestHttpSyncer`
- Minor refactor for readability
- Add utility class `Stopwatch`
- Add tests and stubs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants