Broker CPU utilization underestimated on Kubernetes #1242
Comments
@kyguy thanks for the detailed analysis. Quick question: apparently this is addressed in newer JDK versions?
I originally tested this on the latest released versions of openjdk-8 (8u252-b09) and openjdk-14 (14.0.1+7). Although openjdk-14 contains the updates from this patch [1], the patch does not address the `getProcessCpuLoad()` method. The patch [1] does, however, fix some methods to be container aware [2], some of which we could potentially use, like `getSystemCpuLoad()`.

[1] https://bugs.openjdk.java.net/browse/JDK-8226575
Good analysis @kyguy - thanks. You're right - it seems that `getProcessCpuLoad()` is not container aware. +1 on using `getSystemCpuLoad()`.
See: linkedin#1242 This is an internal patch to use SystemCpuLoad to monitor broker pod CPU usage, knowing that we always run the broker in a container.
Although this change would be cleaner, the container-aware `getSystemCpuLoad()` is only available in OpenJDK 14; it has not yet been backported to versions 8 and 11.
Thanks for the detailed analysis and discussion @kyguy and @amuraru!
In a heterogeneous cluster, the CPU capacity of individual brokers should indeed be provided to account for the differences in compute power across brokers. However, rather than requiring users to manually populate a file with broker capacities (e.g. CPU cores), this process can be automated -- i.e. if users have access to an external resolver/service for broker capacities, they can customize their resolver to use this source for the capacity information (please see `BrokerCapacityConfigResolver`).
Exactly, the problem here is that Kubernetes pods are agnostic of the physical node they reside on.
The problem with this approach is that no matter how we resolve the capacities for the brokers, the metric reporters are always going to report CPU utilization values with respect to the CPU resources available on the physical host which the broker pods are scheduled on. As soon as another broker pod or any other application pod is scheduled to the same physical host as the original broker pod, the CPU utilization values will be underestimated and will not be trustworthy. Of course, we could restrict a physical host to only allow the hosting of a single broker pod, but that would remove the resource utilization benefits of running on Kubernetes in the first place! As stated above, we could fix this in the cluster model [1] by leveraging the Kubernetes API when building the cluster model, but I think it would be a lot cleaner to fix it in the metrics reporter. It would follow the Kubernetes model of treating pods as hosts and abstracting the physical hosts from users. Let me know what you think!

[1] Line 500 in f522db3
Contacted the OpenJDK community; they are in the process of backporting the container fixes of com.sun.management.OperatingSystemMXBean made in OpenJDK version 14 to versions 8 and 11. Once this is done, we would be able to use a fix like @amuraru's [2] to address this issue cleanly. In the meantime, I have provided a PR with a workaround [3].

I really think we should fix this issue in the metric reporter instead of the cluster model. From what I understand, Cruise Control determines the resource capacity limit of a node by summing the capacity limits of the brokers assigned to that node, and it relies on users to provide broker capacity information via a configuration file. This leaves it up to the user to schedule brokers to nodes. To schedule brokers to nodes effectively, a user needs to manage node resources, and node resource management is exactly what we are trying to outsource to Kubernetes. Altering the cluster model logic would shift a lot of the node resource management we get for free from Kubernetes to Cruise Control. Altering the metric reporter, however, lets Kubernetes handle the node resource management and lets Cruise Control focus on brokers and virtual hosts.

[1] https://bugs.openjdk.java.net/browse/JDK-8226575
@kyguy Thanks for contacting the OpenJDK community and creating the PR! It is great to hear that they are backporting the container fixes. I have no major concerns wrt PR #1277 -- just left some comments. As a side note, even with this change, in case the containers that host brokers have heterogeneous CPU capacity, this must still be reflected in the Capacity Config File of CC. For example, if one broker's container is allowed twice the CPU of another's, then to correctly estimate the impact of replica reassignment on CPU utilization, CC must know that the first broker has twice the CPU capacity of the second (a sketch of such a capacity config follows below).
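For illustration, a hedged sketch of what that could look like in Cruise Control's capacity config file, assuming the `brokerCapacities` JSON layout used by CC's sample capacity configs; the broker IDs and values are made up, with CPU expressed as a percentage so a broker with twice the cores gets twice the value:

```json
{
  "brokerCapacities": [
    {
      "brokerId": "0",
      "capacity": {"DISK": "100000", "CPU": "100", "NW_IN": "10000", "NW_OUT": "10000"},
      "doc": "Illustrative: broker 0's container is limited to 2 cores."
    },
    {
      "brokerId": "1",
      "capacity": {"DISK": "100000", "CPU": "200", "NW_IN": "10000", "NW_OUT": "10000"},
      "doc": "Illustrative: broker 1's container is limited to 4 cores, i.e. twice broker 0's CPU capacity."
    }
  ]
}
```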
Thanks for the feedback @efeg!
You are absolutely right - a deployment with brokers of heterogeneous CPU capacities will require extra configuration in the CC capacity config file. For this patch, I am assuming that the brokers' CPU capacities are homogeneous. This helps simplify the definition of broker hosts/pods in a Kubernetes resource and offloads the work of managing CPU resources from applications (e.g. operators and CC) to Kubernetes.
See: linkedin#1242 This is an internal patch to use SystemCpuLoad to monitor broker pod CPU usage, knowing that we always run the broker in a container. As `SystemCpuLoad` reports un-normalized CPU load across all cores, we do this normalization to match CC's expected value in the [0, 1] interval.
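For context, a minimal sketch of the normalization that commit message describes -- my reading of it, not the patch itself. It assumes the raw `getSystemCpuLoad()` value is a load summed across cores, and divides by the processor count visible to the JVM (which, on a container-aware JDK, reflects the cgroup quota) to land in Cruise Control's [0, 1] range:

```java
import java.lang.management.ManagementFactory;

public final class NormalizedCpuLoad {

  // Sketch only: assumes getSystemCpuLoad() reports load summed across all
  // visible cores, as described in the commit message above.
  public static double normalizedSystemCpuLoad() {
    com.sun.management.OperatingSystemMXBean os =
        (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
    double raw = os.getSystemCpuLoad();
    if (raw < 0) {
      return -1.0; // JVM has no sample yet; callers should skip this reading
    }
    // On a container-aware JDK, availableProcessors() reflects the cgroup
    // CPU limit, keeping the result within CC's expected [0, 1] range.
    int cores = Runtime.getRuntime().availableProcessors();
    return Math.min(1.0, raw / cores);
  }
}
```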
Underlying Problem
The method getProcessCpuLoad() [1] [2], which the Cruise Control Metrics Reporter uses to gather CPU utilization, is NOT cgroup aware. This causes the Cruise Control Metrics Reporter to underestimate the CPU utilization of the Kafka brokers.
No matter what Kubernetes resource restrictions are in place, the metric reporter will return:
CPU Utilization = ((allocated container cores) * (container CPU utilization)) / (cores on physical host)
For example, if you set a 2-core limit on a broker pod that is scheduled to a physical node with 8 cores and max out the broker's CPU, the reported CPU utilization will be:
0.25 = ((2 cores) * (1.0 utilization)) / (8 cores on physical host)
When the CPU utilization should be:
1.0
Rebalance issues tied to CPU resource underestimation
This causes problems in the following scenarios:
Kubernetes CPU resource limits
Although the brokers' CPU resources will still be properly restricted by K8s, the metric reporter will underestimate the utilization of the CPU resources that have been allocated. This will make brokers appear to have more CPU resources available than they actually have.
The metric reporter will show a CPU utilization of 50% for Broker0 (B0) even when Broker0 is really using 100% of its K8s-allocated CPU core. This could cause the rebalance operation to assign more partitions to a maxed-out Broker0.
B0 is using 100% of the CPU resources allocated to it by K8s and has no CPU capacity left, but the metric reporter reports that B0 is only using 50% of its CPU resources because it assumes that all of the node's CPU resources are available to B0.
One broker per node
Even if we only put one broker per node, the reported CPU utilization would only be correct if there were no K8s CPU limits and no other applications running on the same node. Even if this were the case, the estimated load of a broker on a node with multiple cores would not be weighted any differently than a broker on a node with one core. So it would be possible to overload a broker when moving partitions from a node with more cores to a node with fewer cores.
We could get around this issue by adding specific broker CPU capacity entries to the Cruise Control capacity configuration to account for the weight, but that would require tracking which node each broker is scheduled on, getting the number of CPU cores on that node, and updating the entries accordingly.
Multiple brokers per node
Even when a node is using 100% of its CPU resources, if there is more than one broker on that node, the metric reporter for each broker on that node will report a CPU utilization value that is less than 100%. This gives the appearance that these brokers have more CPU resources than they actually have.
In its cluster model, Cruise Control tracks and aggregates the broker load on hosts using hostnames [2]. On bare metal, this works fine since the hostname corresponds to the underlying node, but on K8s the hostname corresponds to the name of the broker's pod, so it's possible that more than one pod could be scheduled on the same physical host. One way to solve this issue would be to alter the Cruise Control metric reporter to query the node names of the broker pods from the K8s API and then update the CC cluster model accordingly.
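A lighter-weight variant of that idea, as a sketch: rather than calling the API server, the pod spec could expose `spec.nodeName` through the Kubernetes Downward API under an environment variable (the name `NODE_NAME` here is hypothetical, not an existing config), and the reporter could resolve the physical host like this:

```java
import java.net.InetAddress;

public final class NodeNameResolver {

  // Hypothetical env var; the pod spec must wire it up via the Downward API
  // (env.valueFrom.fieldRef: spec.nodeName) for this lookup to work.
  private static final String NODE_NAME_ENV = "NODE_NAME";

  /** Returns the Kubernetes node name, falling back to the pod hostname. */
  public static String physicalHostName() throws Exception {
    String nodeName = System.getenv(NODE_NAME_ENV);
    if (nodeName != null && !nodeName.isEmpty()) {
      return nodeName;
    }
    // Without the Downward API, the pod hostname is the best we can do,
    // which reintroduces the pod-vs-node ambiguity described above.
    return InetAddress.getLocalHost().getHostName();
  }
}
```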
Potential solution
One potential solution to the issues above would be to allow the Cruise Control Metrics Reporter to be configured to get the CPU utilization of the JVM process with a method that is aware of container boundaries. Right now, the metric reporter uses getProcessCpuLoad() [2], which gets the CPU usage of the JVM with respect to the physical node. There have been recent efforts to update these functions to be aware of their operating environment, whether it be a physical host or a container, but this specific method has not been updated.

The best approach I have found so far is to still use getProcessCpuLoad() and divide it by the fraction of the host's CPU resources that the container is allowed (see the sketch below). We could then have a Cruise Control Metrics Reporter configuration option that allows this function to be used in place of the original when operating in Kubernetes.
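A minimal sketch of that idea under stated assumptions: cgroup v1 mounted at the standard `/sys/fs/cgroup/cpu` paths, and the host's core count taken from `/proc/cpuinfo`, which a container still sees in full. The class and method names are illustrative, not from the PR. Using the example from this issue, a 0.25 host fraction times 8 host cores divided by 2 allowed cores recovers the true utilization of 1.0:

```java
import com.sun.management.OperatingSystemMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public final class CgroupScaledCpuLoad {

  /**
   * getProcessCpuLoad() reports the JVM's usage as a fraction of the whole
   * physical host; rescale it to a fraction of the container's CPU quota.
   */
  public static double containerProcessCpuLoad() throws IOException {
    OperatingSystemMXBean os =
        (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
    double hostFraction = os.getProcessCpuLoad();
    long quotaUs = readLong("/sys/fs/cgroup/cpu/cpu.cfs_quota_us");   // -1 == no limit
    long periodUs = readLong("/sys/fs/cgroup/cpu/cpu.cfs_period_us");
    if (hostFraction < 0 || quotaUs <= 0 || periodUs <= 0) {
      return hostFraction;  // no sample yet or no CPU limit: return the raw value
    }
    double allowedCores = (double) quotaUs / periodUs;  // e.g. 200000/100000 -> 2 cores
    int hostCores = hostCoreCount();
    // (host fraction) * (host cores) / (allowed cores) = container fraction.
    return Math.min(1.0, hostFraction * hostCores / allowedCores);
  }

  private static int hostCoreCount() throws IOException {
    // /proc/cpuinfo inside a container still lists the physical host's CPUs.
    try (Stream<String> lines = Files.lines(Paths.get("/proc/cpuinfo"))) {
      return (int) lines.filter(l -> l.startsWith("processor")).count();
    }
  }

  private static long readLong(String path) throws IOException {
    return Long.parseLong(Files.readAllLines(Paths.get(path)).get(0).trim());
  }
}
```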
[1] cruise-control/cruise-control-metrics-reporter/src/main/java/com/linkedin/kafka/cruisecontrol/metricsreporter/metric/MetricsUtils.java, Line 168 in 6448a82
[2] https://docs.oracle.com/javase/8/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html#getProcessCpuLoad--