-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Rollup] Add more diagnostic stats to job #35471
Conversation
To help debug future performance issues, this adds the min/max/avg/count/total latencies (in milliseconds) for search and bulk phase. This latency is the total service time including transfer between nodes, not just the `took` time. It also adds the count of search/bulk failures encountered during runtime. This information is also in the log, but a runtime counter will help expose problems faster
Pinging @elastic/es-search-aggs |
static final ParseField BULK_LATENCY = new ParseField("bulk_latency_in_ms"); | ||
static final ParseField SEARCH_LATENCY = new ParseField("search_latency_in_ms"); | ||
static final ParseField SEARCH_FAILURES = new ParseField("search_failures"); | ||
static final ParseField BULK_FAILURES = new ParseField("bulk_failures"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think it would be nicer to call it INDEX_FAILURES
, BULK
is an implementation detail about how indexing is internally implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
nice addition! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks good @polyfractal . I left some comments
static final ParseField MAX = new ParseField("max"); | ||
static final ParseField AVG = new ParseField("avg"); | ||
static final ParseField COUNT = new ParseField("count"); | ||
static final ParseField TOTAL = new ParseField("total"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be consistent with the _stats
API can we call these bulk_time_in_millis
and query_time_in_millis
? I am also not sure if we need the min
, the max
and the avg
. It should be enough to have the total time spent in these operations and the number of calls per action ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@polyfractal Are MIN
, ..., ..., TOTAL
leftovers from previous iterations? They look unused to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right you are!
"trigger_count" : 0 | ||
"trigger_count" : 0, | ||
"bulk_failures": 0, | ||
"bulk_latency_in_ms": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we simplify this to:
"bulk_time_in_ms": 0,
"bulk_total": 0,
"search_time_in_ms": 0,
"search_total": 0
?
I don't think we need more than the total time and the number of invocations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can simplify these. :)
@@ -153,6 +153,7 @@ public synchronized boolean maybeTriggerAsyncJob(long now) { | |||
// fire off the search. Note this is async, the method will return from here | |||
executor.execute(() -> { | |||
try { | |||
stats.markStartSearch(); | |||
doNextSearch(buildSearchRequest(), ActionListener.wrap(this::onSearchResponse, exc -> finishWithFailure(exc))); | |||
} catch (Exception e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing stats.incrementSearchFailures()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ good catch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks @hendrikmuhs @jimczi! |
This adds some new statistics to the job to help with debugging performance issues: - Total search and index time (in milliseconds) encounteed by the indexer during runtime. This time is the total service time including transfer between nodes, not just the `took` time. - Total count of search and index requests. Together with the total times, this can be used to determine average request time. - Count of search/bulk failures encountered during runtime. This information is also in the log, but a runtime counter will help expose problems faster
* master: DOCS Audit event attributes in new format (elastic#35510) Scripting: Actually add joda time back to whitelist (elastic#35965) [DOCS] fix HLRC ILM doc misreferenced tag Add realm information for Authenticate API (elastic#35648) [ILM] add HLRC docs to remove-policy-from-index (elastic#35759) [Rollup] Update serialization version after backport [Rollup] Add more diagnostic stats to job (elastic#35471) Build: Fix gradle build for Mac OS (elastic#35968) Adds deprecation logging to ScriptDocValues#getValues. (elastic#34279) [Monitoring] Make Exporters Async (elastic#35765) [ILM] reduce time restriction on IndexLifecycleExplainResponse (elastic#35954) Remove use of AbstractComponent in xpack (elastic#35394) Deprecate types in search and multi search templates. (elastic#35669) Remove fromXContent from IndexUpgradeInfoResponse (elastic#35934)
To help debug future performance issues, this adds the min/max/avg/count/total latencies (in milliseconds) for search and bulk phase. This latency is the total service time including transfer between nodes, not just the
took
time.It also adds the count of search/bulk failures encountered during runtime. This information is also in the log, but a runtime counter will help expose problems faster.
Also updates the HLRC with the new response elements.
/cc @hendrikmuhs This adds the stats to the
IndexerJobStats
superclass, although all the xcontent stuff is done in the Rollup implementation