emit maxLag/avgLag in KafkaSupervisor #6587

QiuMM · 2018-11-07T18:38:59Z

Currently, the KafkaSupervisor just emit the total lag which is not enough. Some index tasks may have larger lag than others(e.g failure of the tasks can lead to some tasks have very large lag) , emit the max lag can let me monitor this thing. By the way, also emit the average lag.

jon-wei · 2018-11-07T19:54:21Z

docs/content/operations/metrics.md

@@ -129,7 +129,9 @@ emission period.|
 |`ingest/handoff/count`|Number of handoffs that happened.|dataSource, taskId, taskType.|Varies. Generally greater than 0 once every segment granular period if cluster operating normally|
 |`ingest/sink/count`|Number of sinks not handoffed.|dataSource, taskId, taskType.|1~3|
 |`ingest/events/messageGap`|Time gap between the data time in event and current system time.|dataSource, taskId, taskType.|Greater than 0, depends on the time carried in event |
-|`ingest/kafka/lag`|Applicable for Kafka Indexing Service. Total lag between the offsets consumed by the Kafka indexing tasks and latest offsets in Kafka brokers across all partitions. Minimum emission period for this metric is a minute.|dataSource.|Greater than 0, should not be a very high number |
+|`ingest/kafka/maxLag`|Applicable for Kafka Indexing Service. Max lag between the offsets consumed by the Kafka indexing tasks and latest offsets in Kafka brokers across all partitions. Minimum emission period for this metric is a minute.|dataSource.|Greater than 0, should not be a very high number |
+|`ingest/kafka/totalLag`|Applicable for Kafka Indexing Service. Total lag between the offsets consumed by the Kafka indexing tasks and latest offsets in Kafka brokers across all partitions. Minimum emission period for this metric is a minute.|dataSource.|Greater than 0, should not be a very high number |


This should be kept as ingest/kafka/lag for backwards compatibility

…ibility

gianm · 2018-11-13T22:13:56Z

How about adding a metric ingest/kafka/partitionLag with a dimension "partition" on it? Then you could take the average or max of those to get the average/max per-partition lag. But you could also look at the lag for each individual partition, which would be useful too.

Maybe include the task ID as a dimension too?

QiuMM · 2018-11-14T03:22:09Z

@gianm for metrics, I think it's enough to emit max lag, average lag and total lag across all partitions. In my scenario, some kafka topics may have 1000 partitions, so there will be lots of metric events if emit lag for each individual partition. If people want to look at the lag for each individual partition, they can get status of a supervisor through endpoint /druid/indexer/v1/supervisor/{id}/status. (overlord console ui has enabled this status endpoint).

gianm · 2018-11-28T10:05:51Z

@QiuMM Sure, that sounds ok to me. Thanks for describing your reasoning.

gianm

LGTM.

* created seekablestream classes * created seekablestreamsupervisor class * first attempt to integrate kafa indexing service to use SeekableStream * seekablestream bug fixes * kafkarecordsupplier * integrated kafka indexing service with seekablestream * implemented resume/suspend and refactored some package names * moved kinesis indexing service into core druid extensions * merged some changes from kafka supervisor race condition * integrated kinesis-indexing-service with seekablestream * unite tests for kinesis-indexing-service * various bug fixes for kinesis-indexing-service * refactored kinesisindexingtask * finished up more kinesis unit tests * more bug fixes for kinesis-indexing-service * finsihed refactoring kinesis unit tests * removed KinesisParititons and KafkaPartitions to use SeekableStreamPartitions * kinesis-indexing-service code cleanup and docs * merge #6291 merge #6337 merge #6383 * added more docs and reordered methods * fixd kinesis tests after merging master and added docs in seekablestream * fix various things from pr comment * improve recordsupplier and add unit tests * migrated to aws-java-sdk-kinesis * merge changes from master * fix pom files and forbiddenapi checks * checkpoint JavaType bug fix * fix pom and stuff * disable checkpointing in kinesis * fix kinesis sequence number null in closed shard * merge changes from master * fixes for kinesis tasks * capitalized <partitionType, sequenceType> * removed abstract class loggers * conform to guava api restrictions * add docker for travis other modules test * address comments * improve RecordSupplier to supply records in batch * fix strict compile issue * add test scope for localstack dependency * kinesis indexing task refactoring * comments * github comments * minor fix * removed unneeded readme * fix deserialization bug * fix various bugs * KinesisRecordSupplier unable to catch up to earliest position in stream bug fix * minor changes to kinesis * implement deaggregate for kinesis * Merge remote-tracking branch 'upstream/master' into seekablestream * fix kinesis offset discrepancy with kafka * kinesis record supplier disable getPosition * pr comments * mock for kinesis tests and remove docker dependency for unit tests * PR comments * avg lag in kafkasupervisor #6587 * refacotred SequenceMetadata in taskRunners * small fix * more small fix * recordsupplier resource leak * revert .travis.yml formatting * fix style * kinesis docs * doc part2 * more docs * comments * comments*2 * revert string replace changes * comments * teamcity * comments part 1 * comments part 2 * comments part 3 * merge #6754 * fix injection binding * comments * KinesisRegion refactor * comments part idk lol * can't think of a commit msg anymore * remove possiblyResetDataSourceMetadata() for IncrementalPublishingTaskRunner * commmmmmmmmmments * extra error handling in KinesisRecordSupplier getRecords * comments * quickfix * typo * oof

emit maxLag/totalLag/avgLag in KafkaSupervisor

60bef91

jon-wei reviewed Nov 7, 2018

View reviewed changes

modify ingest/kafka/totalLag to ingest/kafka/lag for backwards compat…

a92c41e

…ibility

QiuMM changed the title ~~emit maxLag/totalLag/avgLag in KafkaSupervisor~~ emit maxLag/avgLag in KafkaSupervisor Nov 9, 2018

gianm assigned jon-wei Nov 13, 2018

QiuMM added the Area - Metrics/Event Emitting label Nov 22, 2018

QiuMM added this to the 0.13.1 milestone Nov 22, 2018

gianm assigned gianm and unassigned jon-wei Nov 27, 2018

gianm approved these changes Nov 28, 2018

View reviewed changes

gianm merged commit c5405bb into apache:master Nov 28, 2018

jsun98 added a commit to jsun98/druid that referenced this pull request Dec 1, 2018

avg lag in kafkasupervisor apache#6587

c4553ee

janeklb mentioned this pull request Jan 24, 2019

Move ingest/kafka/* metrics into a separate section on the metrics docs #6895

Merged

QiuMM mentioned this pull request Feb 23, 2019

0.14.0-incubating release notes #7126

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

emit maxLag/avgLag in KafkaSupervisor #6587

emit maxLag/avgLag in KafkaSupervisor #6587

QiuMM commented Nov 7, 2018

jon-wei Nov 7, 2018

QiuMM Nov 8, 2018

gianm commented Nov 13, 2018

QiuMM commented Nov 14, 2018 •

edited

Loading

gianm commented Nov 28, 2018

gianm left a comment

emit maxLag/avgLag in KafkaSupervisor #6587

emit maxLag/avgLag in KafkaSupervisor #6587

Conversation

QiuMM commented Nov 7, 2018

jon-wei Nov 7, 2018

Choose a reason for hiding this comment

QiuMM Nov 8, 2018

Choose a reason for hiding this comment

gianm commented Nov 13, 2018

QiuMM commented Nov 14, 2018 • edited Loading

gianm commented Nov 28, 2018

gianm left a comment

Choose a reason for hiding this comment

QiuMM commented Nov 14, 2018 •

edited

Loading