Broker CPU utilization underestimated on Kubernetes #1242
Comments
@kyguy thanks for the detailed analysis. Quick question: apparently this is addressed in newer JDK versions?
I originally tested this on the latest released versions of openjdk-8 (8u252-b09) and openjdk-14 (14.0.1+7). Although openjdk-14 contains the updates from this patch [1], the patch does not address the `getProcessCpuLoad()` method. The patch [1] does, however, fix some methods to be container aware [2], some of which we could potentially use, like `getSystemCpuLoad()`.

[1] https://bugs.openjdk.java.net/browse/JDK-8226575
Good analysis @kyguy - thanks. You're right - it seems that `getProcessCpuLoad()` is not container aware. +1 on using `getSystemCpuLoad()`.
See: linkedin#1242 This is an internal patch to use SystemCpuLoad to monitor broker pod CPU usage, knowing that we always run the broker in a container.
Although this change would be cleaner, the container-aware `getSystemCpuLoad()` is only available in OpenJDK 14; it has not yet been backported to versions 8 and 11.
Thanks for the detailed analysis and discussion @kyguy and @amuraru!
In a heterogeneous cluster, the CPU capacity of individual brokers should indeed be provided to account for the differences in compute power across brokers. However, rather than requiring users to manually populate a file with broker capacities (e.g. CPU cores), this process can be automated -- i.e. if users have access to an external resolver/service for broker capacities, they can customize their resolver to use this source for the capacity information (please see `BrokerCapacityConfigResolver`).
Exactly, the problem here is that Kubernetes pods are agnostic of the physical node they reside on.
The problem with this approach is that no matter how we resolve the capacities for the brokers, the metric reporters are always going to report CPU utilization values with respect to the CPU resources available on the physical host which the broker pods are scheduled on. As soon as another broker pod or any other application pod is scheduled to the same physical host as the original broker pod, the CPU utilization values will be underestimated and will not be trustworthy. Of course, we could restrict a physical host to only allow the hosting of a single broker pod, but that would remove the resource utilization benefits of running on Kubernetes in the first place! As stated above, we could fix this in the cluster model [1] by leveraging the Kubernetes API when building the cluster model, but I think it would be a lot cleaner to fix it in the metrics reporter. It would follow the Kubernetes model of treating pods as hosts and abstracting the physical hosts from users. Let me know what you think!

[1] Line 500 in f522db3
Contacted the OpenJDK community; they are in the process of backporting the container fixes of com.sun.management.OperatingSystemMXBean made in OpenJDK version 14 to versions 8 and 11. Once this is done, we would be able to use a fix like @amuraru's [2] to address this issue cleanly. In the meantime, I have provided a PR with a workaround [3].

I really think we should fix this issue in the metric reporter instead of the cluster model. From what I understand, Cruise Control determines the resource capacity limit of a node by summing the capacity limits of the brokers assigned to that node, and it relies on users to provide broker capacity information via a configuration file. This leaves it up to the user to schedule brokers to nodes. To schedule brokers to nodes effectively, a user needs to manage node resources, and node resource management is exactly what we are trying to outsource to Kubernetes. Altering the cluster model logic would shift a lot of the node resource management we get for free from Kubernetes to Cruise Control. Altering the metric reporter, however, lets Kubernetes handle the node resource management and lets Cruise Control focus on brokers and virtual hosts.

[1] https://bugs.openjdk.java.net/browse/JDK-8226575
@kyguy Thanks for contacting the OpenJDK community and creating the PR! It is great to hear that they are backporting the container fixes. I have no major concerns wrt PR #1277 -- just left some comments. As a side note, even with this change, in case the containers that host brokers have heterogeneous CPU capacity, this must still be reflected in the Capacity Config File of CC. For example, if one broker's container is allowed twice the CPU of another's, then to correctly estimate the impact of replica reassignment on CPU utilization, CC must know that the first broker has twice the CPU capacity of the second (a sketch of such a capacity config follows below).
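For illustration, a hedged sketch of what that could look like in Cruise Control's capacity config file, assuming the `brokerCapacities` JSON layout used by CC's sample capacity configs; the broker IDs and values are made up, with CPU expressed as a percentage so a broker with twice the cores gets twice the value:

```json
{
  "brokerCapacities": [
    {
      "brokerId": "0",
      "capacity": {"DISK": "100000", "CPU": "100", "NW_IN": "10000", "NW_OUT": "10000"},
      "doc": "Illustrative: broker 0's container is limited to 2 cores."
    },
    {
      "brokerId": "1",
      "capacity": {"DISK": "100000", "CPU": "200", "NW_IN": "10000", "NW_OUT": "10000"},
      "doc": "Illustrative: broker 1's container is limited to 4 cores, i.e. twice broker 0's CPU capacity."
    }
  ]
}
```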
Thanks for the feedback @efeg!
You are absolutely right - a deployment with brokers of heterogeneous CPU capacities will require extra configuration in the CC capacity config file. For this patch, I am assuming that the brokers' CPU capacities are homogeneous. This helps simplify the definition of broker hosts/pods in a Kubernetes resource and offloads the work of managing CPU resources from applications (e.g. operators and CC) to Kubernetes.
See: linkedin#1242 This is an internal patch to use SystemCpuLoad to monitor broker pod CPU usage, knowing that we always run the broker in a container. As `SystemCpuLoad` reports un-normalized CPU load across all cores, we do this normalization to match CC's expected value in the [0, 1] interval.
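For context, a minimal sketch of the normalization that commit message describes -- my reading of it, not the patch itself. It assumes the raw `getSystemCpuLoad()` value is a load summed across cores, and divides by the processor count visible to the JVM (which, on a container-aware JDK, reflects the cgroup quota) to land in Cruise Control's [0, 1] range:

```java
import java.lang.management.ManagementFactory;

public final class NormalizedCpuLoad {

  // Sketch only: assumes getSystemCpuLoad() reports load summed across all
  // visible cores, as described in the commit message above.
  public static double normalizedSystemCpuLoad() {
    com.sun.management.OperatingSystemMXBean os =
        (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
    double raw = os.getSystemCpuLoad();
    if (raw < 0) {
      return -1.0; // JVM has no sample yet; callers should skip this reading
    }
    // On a container-aware JDK, availableProcessors() reflects the cgroup
    // CPU limit, keeping the result within CC's expected [0, 1] range.
    int cores = Runtime.getRuntime().availableProcessors();
    return Math.min(1.0, raw / cores);
  }
}
```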
Underlying Problem
The method getProcessCpuLoad() [1] [2], which the Cruise Control Metrics Reporter uses to gather CPU utilization, is NOT cgroup aware. This causes the Cruise Control Metrics Reporter to underestimate the CPU utilization of the Kafka brokers.
No matter what Kubernetes resource restrictions are in place, the metric reporter will return:
CPU Utilization = ((allocated container cores) * (container CPU utilization)) / (cores on physical host)
For example, if you set a 2-core limit on a broker pod that is scheduled to a physical node with 8 cores and max out the broker's CPU, the reported CPU utilization will be:
0.25 = ((2 cores) * (1.0 utilization)) / (8 cores on physical host)
When the CPU utilization should be:
1.0
Rebalance issues tied to CPU resource underestimation
This causes problems in the following scenarios:
Kubernetes CPU resource limits
Although the brokers' CPU resources will still be properly restricted by K8s, the metric reporter will underestimate the utilization of the CPU resources that have been allocated. This will make brokers appear to have more CPU resources available than they actually have.
The metric reporter will show a CPU utilization of 50% for Broker0 (B0) even when Broker0 is really using 100% of its K8s-allocated CPU core. This could cause the rebalance operation to assign more partitions to a maxed-out Broker0.
B0 is using 100% of the CPU resources allocated to it by K8s and has no CPU capacity left, but the metric reporter reports that B0 is only using 50% of its CPU resources because it assumes that all of the node's CPU resources are available to B0.
One broker per node
Even if we only put one broker per node, the reported CPU utilization would only be correct if there were no K8s CPU limits and no other applications running on the same node. Even if this were the case, the estimated load of a broker on a node with multiple cores would not be weighted any differently than a broker on a node with one core. So it would be possible to overload a broker when moving partitions from a node with more cores to a node with fewer cores.
We could get around this issue by adding specific broker CPU capacity entries to the Cruise Control capacity configuration to account for the weight, but that would require tracking which node each broker is scheduled on, getting the number of CPU cores on that node, and updating the entries accordingly.
Multiple brokers per node
Even when a node is using 100% of its CPU resources, if there is more than one broker on that node, the metric reporter for each broker on that node will report a CPU utilization value that is less than 100%. This gives the appearance that these brokers have more CPU resources than they actually have.
In its cluster model, Cruise Control tracks and aggregates the broker load on hosts using hostnames [2]. On bare metal, this works fine since the hostname corresponds to the underlying node, but on K8s the hostname corresponds to the name of the broker's pod, so it's possible that more than one pod could be scheduled on the same physical host. One way to solve this issue would be to alter the Cruise Control metric reporter to query the node names of the broker pods from the K8s API and then update the CC cluster model accordingly.
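A lighter-weight variant of that idea, as a sketch: rather than calling the API server, the pod spec could expose `spec.nodeName` through the Kubernetes Downward API under an environment variable (the name `NODE_NAME` here is hypothetical, not an existing config), and the reporter could resolve the physical host like this:

```java
import java.net.InetAddress;

public final class NodeNameResolver {

  // Hypothetical env var; the pod spec must wire it up via the Downward API
  // (env.valueFrom.fieldRef: spec.nodeName) for this lookup to work.
  private static final String NODE_NAME_ENV = "NODE_NAME";

  /** Returns the Kubernetes node name, falling back to the pod hostname. */
  public static String physicalHostName() throws Exception {
    String nodeName = System.getenv(NODE_NAME_ENV);
    if (nodeName != null && !nodeName.isEmpty()) {
      return nodeName;
    }
    // Without the Downward API, the pod hostname is the best we can do,
    // which reintroduces the pod-vs-node ambiguity described above.
    return InetAddress.getLocalHost().getHostName();
  }
}
```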
Potential solution
One potential solution to the issues above would be to allow the Cruise Control Metrics Reporter to be configured to get the CPU utilization of the JVM process with a method that is aware of container boundaries. Right now, the metric reporter uses getProcessCpuLoad() [2], which gets the CPU usage of the JVM with respect to the physical node. There have been recent efforts to update these functions to be aware of their operating environment, whether it be a physical host or a container, but this specific method has not been updated.

The best approach I have found so far is to still use getProcessCpuLoad() and divide it by the fraction of the host's CPU resources that the container is allowed (see the sketch below). We could then have a Cruise Control Metrics Reporter configuration option that allows this function to be used in place of the original when operating in Kubernetes.
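A minimal sketch of that idea under stated assumptions: cgroup v1 mounted at the standard `/sys/fs/cgroup/cpu` paths, and the host's core count taken from `/proc/cpuinfo`, which a container still sees in full. The class and method names are illustrative, not from the PR. Using the example from this issue, a 0.25 host fraction times 8 host cores divided by 2 allowed cores recovers the true utilization of 1.0:

```java
import com.sun.management.OperatingSystemMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public final class CgroupScaledCpuLoad {

  /**
   * getProcessCpuLoad() reports the JVM's usage as a fraction of the whole
   * physical host; rescale it to a fraction of the container's CPU quota.
   */
  public static double containerProcessCpuLoad() throws IOException {
    OperatingSystemMXBean os =
        (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
    double hostFraction = os.getProcessCpuLoad();
    long quotaUs = readLong("/sys/fs/cgroup/cpu/cpu.cfs_quota_us");   // -1 == no limit
    long periodUs = readLong("/sys/fs/cgroup/cpu/cpu.cfs_period_us");
    if (hostFraction < 0 || quotaUs <= 0 || periodUs <= 0) {
      return hostFraction;  // no sample yet or no CPU limit: return the raw value
    }
    double allowedCores = (double) quotaUs / periodUs;  // e.g. 200000/100000 -> 2 cores
    int hostCores = hostCoreCount();
    // (host fraction) * (host cores) / (allowed cores) = container fraction.
    return Math.min(1.0, hostFraction * hostCores / allowedCores);
  }

  private static int hostCoreCount() throws IOException {
    // /proc/cpuinfo inside a container still lists the physical host's CPUs.
    try (Stream<String> lines = Files.lines(Paths.get("/proc/cpuinfo"))) {
      return (int) lines.filter(l -> l.startsWith("processor")).count();
    }
  }

  private static long readLong(String path) throws IOException {
    return Long.parseLong(Files.readAllLines(Paths.get(path)).get(0).trim());
  }
}
```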
[1] cruise-control/cruise-control-metrics-reporter/src/main/java/com/linkedin/kafka/cruisecontrol/metricsreporter/metric/MetricsUtils.java, Line 168 in 6448a82
[2] https://docs.oracle.com/javase/8/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html#getProcessCpuLoad--