-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT: hardware metrics #876
Conversation
// Update CPU usage metric | ||
let cpu_usage = system.global_cpu_usage() as i64; | ||
crate::metrics::CPU_USAGE_PERCENTAGE | ||
.with_label_values(&["global"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better also add account_id to the label values, like .with_label_values(&["global", my_account_id.as_str()])
.
You'd likely need to pass in the account_id in the update_system_metrics() to be able to access it. The account_id value is available here:
mpc/chain-signatures/node/src/protocol/mod.rs
Line 204 in d8d8e38
let my_account_id = self.ctx.account_id.to_string(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be fixed now I think
chain-signatures/node/src/metrics.rs
Outdated
}); | ||
|
||
// Total Memory Metric | ||
pub(crate) static TOTAL_MEMORY_BYTES: Lazy<IntGaugeVec> = Lazy::new(|| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either this one or AVAILABLE_MEMORY_BYTES
is redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't we want a total to ensure partners have enough memory? I figured available would be useful to alert off of as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just saying that available + used = total
@@ -227,6 +228,9 @@ impl MpcSignProtocol { | |||
loop { | |||
let protocol_time = Instant::now(); | |||
tracing::trace!("trying to advance chain signatures protocol"); | |||
// Hardware metric refresh | |||
update_system_metrics(&my_account_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How often is this executed in practice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ppca might know better than I 😂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as often as a loop. Might be too often in this case. I think we can do once every minute?
add
let mut last_hardware_pull = Instant::now();
at around here:
mpc/chain-signatures/node/src/protocol/mod.rs
Line 215 in 11dd7e0
let mut last_pinged = Instant::now(); |
and
if last_hardware_pull.elapsed() > Duration::from_secs(60) {
update_system_metrics(&my_account_id);;
}
at around
mpc/chain-signatures/node/src/protocol/mod.rs
Line 229 in 11dd7e0
tracing::trace!("trying to advance chain signatures protocol"); |
This will update the metrics every minute. @volovyks @kmaus-near
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do that every ~10 seconds? I'm afraid we can lose CPU spikes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, let's change to if last_hardware_pull.elapsed() > Duration::from_secs(10)
then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the example the crate uses is 5s, which would be pretty common practice for hardware metrics, and is the lowest refresh rate in grafana.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5s sounds good to me
chain-signatures/node/src/metrics.rs
Outdated
}); | ||
|
||
// Disk Space Metric | ||
pub(crate) static DISK_SPACE_BYTES: Lazy<IntGaugeVec> = Lazy::new(|| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also have available and used for the DISK_SPACE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is Available disk space, mostly for alerting purposes. We can also add a total disk space just so there's visibility into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's specify in the name if it is used or remaining.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, I will do a calculation in grafana for used disk space since only total and available is part of that library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any blockers here?
chain-signatures/node/src/metrics.rs
Outdated
}); | ||
|
||
// Total Memory Metric | ||
pub(crate) static TOTAL_MEMORY_BYTES: Lazy<IntGaugeVec> = Lazy::new(|| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just saying that available + used = total
I think we are alright. I was slightly concerned with CPU metrics being inaccurate, but it seems they are a pretty decent representation of actual host metrics after watching the grafana metrics vs the host metrics. Let me convert this to a PR and remove the Total memory metric, if we need it we can calculate it from the other two. |
Please rebase latest develop and fix the test failures. |
if last_hardware_pull.elapsed() > Duration::from_secs(5) { | ||
update_system_metrics(&my_account_id); | ||
} | ||
|
||
tracing::debug!("trying to advance chain signatures protocol"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe delete this one then? there are 2 tracing::debug!("trying to advance chain signatures protocol");
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except that one repeated tracing :D
Still need to actually test if this does anything more than compile, but here are some eyes on what I came up with.