feat(kafka): add consumer_lag metrics to Kafka #15106
Conversation
- fix a typo in Kafka error code name (Tested: no)
- add consumer_lag metric to Kafka (Tested: local run with Kafka)
@neuronull I guess you could review it.

Thanks for the contribution! Added it to the backlog. If I can't get to it today then someone will next week 👍

@bruceg could you review it please?
@neuronull maybe you have enough time today :)
The implementation here looks fine, but I am concerned about the increase in metric cardinality this might cause. Could somebody else weigh in if this is a potential problem or not?
src/internal_events/kafka.rs
Outdated
```rust
for (topic_id, topic) in &self.statistics.topics {
    for (partition_id, partition) in &topic.partitions {
        gauge!("kafka_consumer_lag", partition.consumer_lag as f64, "topic_id" => topic_id.clone(), "partition_id" => partition_id.to_string());
    }
}
```
This reads like it could cause quite a large number of new metrics. Is this something we should be making optional, or is it fine here because the number is relatively bounded?
My guess is that it should be relatively bounded. I don't think it will affect the number of metrics much, so I suggest leaving it as is, without an optional switch.
If users complain about the number of metrics, we can decide what should be done then. One option would be to add an additional flag (or something like that).
Normally you can filter out unused internal metrics with a filter transform. I think this is a really nice metric to have if you use Kafka as a source, since you don't need to install an extra Kafka lag exporter.
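As a rough sketch of that filtering approach (component names here are made up, and the exact condition syntax may vary across Vector versions):

```toml
# Hypothetical pipeline: collect Vector's own metrics, then drop the
# per-partition lag gauges before they reach the metrics sink.
[sources.internal]
type = "internal_metrics"

[transforms.drop_lag]
type = "filter"
inputs = ["internal"]
# VRL condition on the metric event's name field
condition = '.name != "kafka_consumer_lag"'

[sinks.prom]
type = "prometheus_exporter"
inputs = ["drop_lag"]
address = "0.0.0.0:9598"
```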
Internal metrics can be filtered before exporting them, but they will result in a higher number of metrics needing to be tracked internally.
I agree that it'd be good to let users opt into this. We've already had a few issues resulting from tagging metrics with high-cardinality tags, which prompted #15426. We likely need to come up with a more generic way to opt into these extra metrics, but for now we could just use a simple boolean flag on the source config.
@jszwedko so what is the proposed way to move this PR forward? Adding a specific field like `enable_lag_metrics: bool` to the source/sink config? In that case, we need to update both the Kafka source and sink configurations, not just the source configuration.
Hey! Yeah, I think something like that would be fine. I'd suggest `metrics.topic_lag_metric: true|false` to group together any metric-specific options like this. Does this metric apply to the sink? It seems to only apply to consumers (i.e. the source).
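Under that proposal, the source config might look like this (a sketch of the suggested shape, not final syntax; the connection values are placeholders):

```toml
[sources.kafka_in]
type = "kafka"
bootstrap_servers = "localhost:9092"
group_id = "consumer-group-name"
topics = ["topic-1"]

# Proposed grouping for metric-specific options:
[sources.kafka_in.metrics]
topic_lag_metric = true
```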
> Does this metric apply to the sink? It seems to only apply to consumers (aka the source).

I guess it should, because the same code is reused in the Kafka sink too.
I agree on making this opt-in, given we're adding T × P new metrics without any knowledge of what those numbers look like for typical users.
Fixed.
src/internal_events/kafka.rs
Outdated
```rust
for (topic_id, topic) in &self.statistics.topics {
    for (partition_id, partition) in &topic.partitions {
        gauge!("kafka_consumer_lag", partition.consumer_lag as f64, "topic_id" => topic_id.clone(), "partition_id" => partition_id.to_string());
    }
}
```
I agree on making this opt-in, given we're adding T × P new metrics without any knowledge of what those numbers look like for typical users.
Regression Test Results
Run ID: ef5a18d9-45e7-4ee6-9724-4e49c88191f5 — no interesting changes in experiment optimization goals were observed.
Regression Test Results
Run ID: 4421f2c2-f890-4c1b-8d26-188383ea5ff6 — changes in experiment optimization goals were detected (fine details of change detection per experiment collapsed).
@davidhuie-dd Friendly reminder :)
src/sources/kafka.rs
Outdated
```rust
#[configurable_component]
#[derive(Clone, Debug, Default)]
struct Metrics {
    /// Expose topic lag metrics.
```
Can we add information about the new metrics we'll be creating here? Names, tags, types, when this should be used, etc.
Just by looking at the docs, it's hard to tell what the flag does.
Yep, we definitely can and should. Could you please suggest the right place for this information about the metrics? I guess it should be somewhere in the `kafka` source CUE file near the `metrics` section, but I am not sure.
You would add an entry here: https://github.com/vectordotdev/vector/blob/master/website/cue/reference/components/sources/kafka.cue#L110
Then you would add specific info about the metric here (this is what the reference above is for): https://github.com/vectordotdev/vector/blob/master/website/cue/reference/components/sources/internal_metrics.cue#L618
Then, we should also mention inline here (after the `///`) the names of the metrics that are created as a result of enabling the flag.
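For reference, a docs entry in `internal_metrics.cue` might look roughly like this (field names are guesses based on the style of neighboring entries; check the actual schema in the file before copying):

```cue
kafka_consumer_lag: {
	description:       "The Kafka consumer lag, by topic and partition."
	type:              "gauge"
	default_namespace: "vector"
	tags: {
		topic_id: {description: "The Kafka topic."}
		partition_id: {description: "The Kafka partition."}
	}
}
```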
Fixed.
Regression Test Results
Run ID: e8e68ae9-33dc-480a-b126-10c6fa6c9d5b — changes in experiment optimization goals were detected (fine details of change detection per experiment collapsed).
src/sources/kafka.rs
Outdated
```rust
#[configurable_component]
#[derive(Clone, Debug, Default)]
struct Metrics {
    /// Expose topic lag metrics (`kafka_consumer_lag`).
```
Can we also say that it's for every topic and partition? This description still seems a bit opaque.
Rephrased a little bit :)
Regression Test Results
Run ID: 42464567-058d-4f20-8c32-53caa9d7c1b1 — changes in experiment optimization goals were detected (fine details of change detection per experiment collapsed).
```rust
#[derive(Clone, Debug, Default)]
struct Metrics {
    /// Expose topic lag metrics for all topics and partitions. The metric name is `kafka_consumer_lag`.
    pub topic_lag_metric: bool,
```
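To illustrate how the opt-in flag gates emission, here is a self-contained sketch. It mirrors the shape of the statistics callback discussed above with plain stand-in types (`PartitionStats`, `Statistics`, and `emit_lag_gauges` are all hypothetical names, not Vector's actual API), and returns the gauges as tuples instead of calling the `gauge!` macro:

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified mirror of the librdkafka statistics: per-topic,
// per-partition consumer lag (high watermark minus committed offset).
struct PartitionStats {
    consumer_lag: i64,
}

struct Statistics {
    topics: BTreeMap<String, BTreeMap<i32, PartitionStats>>,
}

/// Returns one (metric_name, topic_id, partition_id, value) tuple per
/// topic/partition, but only when the opt-in flag is set.
fn emit_lag_gauges(stats: &Statistics, topic_lag_metric: bool) -> Vec<(String, String, String, f64)> {
    let mut out = Vec::new();
    if !topic_lag_metric {
        // Flag is off: emit nothing, keeping metric cardinality unchanged.
        return out;
    }
    for (topic_id, partitions) in &stats.topics {
        for (partition_id, partition) in partitions {
            out.push((
                "kafka_consumer_lag".to_string(),
                topic_id.clone(),
                partition_id.to_string(),
                partition.consumer_lag as f64,
            ));
        }
    }
    out
}

fn main() {
    let mut parts = BTreeMap::new();
    parts.insert(0, PartitionStats { consumer_lag: 42 });
    parts.insert(1, PartitionStats { consumer_lag: 7 });
    let mut topics = BTreeMap::new();
    topics.insert("events".to_string(), parts);
    let stats = Statistics { topics };

    // Opted out: no gauges at all.
    assert!(emit_lag_gauges(&stats, false).is_empty());

    // Opted in: one gauge per topic x partition (T x P series).
    for (name, topic, partition, value) in emit_lag_gauges(&stats, true) {
        println!("{name} topic_id={topic} partition_id={partition} value={value}");
    }
}
```

The T × P cardinality concern from the review is visible here directly: the output length is exactly the number of (topic, partition) pairs in the statistics snapshot.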
nit: should this be plural since there will often be many metrics generated from this flag?
Fixed.
Regression Detector Results

Run ID: 3a4881a3-5544-47ac-8c39-4b8446581a3b

Explanation

A regression test is an integrated performance test. The table below, if present, lists those experiments that have experienced a statistically significant change in mean optimization goal performance between baseline and comparison SHAs with 90.00% confidence OR have been detected as newly erratic. Negative values mean that baseline is faster, positive comparison. Results that do not exhibit more than a ±5.00% change in their mean optimization goal are discarded. An experiment is erratic if its coefficient of variation is greater than 0.1. The abbreviated table will be omitted if no interesting change is observed.

No interesting changes in experiment optimization goals with confidence ≥ 90.00% and |Δ mean %| ≥ 5.00%. Fine details of change detection per experiment.
Resolves #13987

Just added the `kafka_consumer_lag` metric with `topic_id` and `partition_id` labels.