[YUNIKORN-2031] Fix a bug on metric of pending pods/resources #673

Huang-Wei · 2023-10-11T22:09:38Z

What is this PR for?

Fix a bug in metric of pending pods/resources. The bug seems to present in all YK versions.

What type of PR is it?

- Bug Fix

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-2031

How should this be tested?

I cannot find any existing test suite except for pkg/metrics/queue_test.go, but that's more simply checking the metric's existence. Pplease advice if you want tests in this PR, or in follow-up to add tests to cover metric accuracy generally.

For now, I did a local ad-hoc test to verify this PR works:

yzhangal

Thanks for the patch Wei. It seems

func (sq *Queue) decPendingResource(delta *resources.Resource) {

need a similar change. Right?

manirajv06 · 2023-10-12T10:11:51Z

Thanks for the patch Wei. It seems

func (sq *Queue) decPendingResource(delta *resources.Resource) {

need a similar change. Right?

yes

manirajv06 · 2023-10-12T10:18:55Z

Pplease advice if you want tests in this PR, or in follow-up to add tests to cover metric accuracy generally.

Yes, we should be doing metrics assertions in all possible places, at least for important state change events in objects like Queues, Application etc. Please file a separate jira. Though no. of changes is going to be high, it is good to do especially given that we are seeing increase in no. of bugs with respect to metrics recently.

Can we start with this? It is better to start by making changes in queue_test#TestPendingCalc only.

manirajv06

Please see comments.

Huang-Wei · 2023-10-12T23:02:47Z

Thanks @yzhangal @manirajv06 for the review. I added a minimum UT to repro the bug:

=== RUN   TestPendingCalc
2023-10-12T15:57:52.289-0700	INFO	core.scheduler.queue	objects/queue.go:144	configured queue added to scheduler	{"queueName": "root"}
2023-10-12T15:57:52.290-0700	INFO	core.scheduler.queue	objects/queue.go:144	configured queue added to scheduler	{"queueName": "root.leaf"}
    queue_test.go:300: assertion failed: error is not nil: 
        
        Diff:
        --- metric output does not match expectation; want
        +++ got:
        @@ -1,10 +1,2 @@
        -(bytes.Buffer) # HELP yunikorn_root_leaf_queue_resource Queue resource metrics. State of the resource includes `guaranteed`, `max`, `allocated`, `pending`, `preempting`.
        -# TYPE yunikorn_root_leaf_queue_resource gauge
        -yunikorn_root_leaf_queue_resource{resource="memory",state="pending"} 100
        -yunikorn_root_leaf_queue_resource{resource="vcores",state="pending"} 10
        -# HELP yunikorn_root_queue_resource Queue resource metrics. State of the resource includes `guaranteed`, `max`, `allocated`, `pending`, `preempting`.
        -# TYPE yunikorn_root_queue_resource gauge
        -yunikorn_root_queue_resource{resource="memory",state="pending"} 100
        -yunikorn_root_queue_resource{resource="vcores",state="pending"} 10
        +(bytes.Buffer) 
         
        -
        : unexpected metrics
--- FAIL: TestPendingCalc (0.00s)

To ensure we have enough UT coverage on metrics, a follow-up PR would be a better bet.

go.mod

pkg/scheduler/objects/queue_test.go

wilfred-s

This adds a lot of overhead. The call to updateAllocatedAndPendingResourceMetrics() updates three metrics when it only should do one. Three loops are run setting metrics for values that we know have not changed. We call the specific increment or decrement for each type separately. We need to update the metrics separately also.

The function should be split into three calls one for each type we track: allocated, pending and preempting.

wilfred-s · 2023-10-13T02:10:23Z

To ensure we have enough UT coverage on metrics, a follow-up PR would be a better bet.
Completely agree, a follow up jira is needed. Probably linked to the changes for using labels in the queue metrics instead of the names like we do now.

Huang-Wei · 2023-10-13T05:30:29Z

This adds a lot of overhead. The call to updateAllocatedAndPendingResourceMetrics() updates three metrics when it only should do one.

Agree. However, the function updateAllocatedAndPendingResourceMetrics() and how this pattern is used has been across several releases. In this PR, I followed existing pattern and focus on the accuracy issue, and try to make it as neat as possible so it's cherrypick-able.

For radical performance optimization on how metrics get updated, we should track in a separate JIRA.

wilfred-s · 2023-10-16T04:37:45Z

Agree. However, the function updateAllocatedAndPendingResourceMetrics() and how this pattern is used has been across several releases. In this PR, I followed existing pattern and focus on the accuracy issue, and try to make it as neat as possible so it's cherrypick-able.

I am also looking at keeping accurate track. We generate metrics entries at a point in time without changes to the underlying values. This could hide the fact that we do not call the metrics at the right time or don't call it for certain changes. They seem accurate in the long run as they are updated as a side effect of another change. This might not show if churn and update rate is high but could show for lower change rates.

Huang-Wei · 2023-10-16T18:10:05Z

@wilfred-s how should I proceed, like creating a separate PR to address the inefficiency issue and then rebase here?

wilfred-s · 2023-10-18T07:13:44Z

Adding the change here adds a total of 14 lines to the change. I would say that is small enough to include here and fix both in one go. If you split the change in two you will touch the same lines of code in both. I doubt that the extra change will affect the cherry pick.

Huang-Wei · 2023-10-18T17:23:26Z

I doubt that the extra change will affect the cherry pick.

That would cause the rebased PR not cherrypickable. That's why earlier I proposed to have this PR a pure fix (leave perf bits apart), and that would cherrypickable.

Anyway, let me try to include perf bits in this PR.

Huang-Wei · 2023-10-18T19:22:48Z

let me try to include perf bits in this PR.

This is done in the first commit, and followed by the pending metric fix.

(I mark updateAllocatedAndPendingResourceMetrics() as DEPRECATED for now. LMK if you want to remove it entirely)

PTAL @wilfred-s @yzhangal @manirajv06

codecov · 2023-10-22T02:02:03Z

Codecov Report

Merging #673 (17fac11) into master (ba3b362) will increase coverage by 0.00%.
Report is 2 commits behind head on master.
The diff coverage is 78.57%.

@@           Coverage Diff           @@
##           master     #673   +/-   ##
=======================================
  Coverage   77.67%   77.68%           
=======================================
  Files          79       79           
  Lines       13089    13098    +9     
=======================================
+ Hits        10167    10175    +8     
- Misses       2596     2598    +2     
+ Partials      326      325    -1

Files	Coverage Δ
pkg/scheduler/objects/application.go	`68.27% <100.00%> (ø)`
pkg/scheduler/objects/queue.go	`77.39% <76.92%> (-0.16%)`	⬇️

... and 1 file with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

wilfred-s

LGTM

yzhangal reviewed Oct 12, 2023

View reviewed changes

manirajv06 assigned manirajv06 and Huang-Wei and unassigned manirajv06 Oct 12, 2023

manirajv06 self-requested a review October 12, 2023 10:11

manirajv06 requested changes Oct 12, 2023

View reviewed changes

Huang-Wei commented Oct 12, 2023

View reviewed changes

go.mod Show resolved Hide resolved

Huang-Wei commented Oct 12, 2023

View reviewed changes

pkg/scheduler/objects/queue_test.go Show resolved Hide resolved

wilfred-s requested changes Oct 13, 2023

View reviewed changes

Huang-Wei force-pushed the fix-pending-metric branch from fc40665 to 201d0dc Compare October 13, 2023 06:04

Huang-Wei force-pushed the fix-pending-metric branch from 201d0dc to 7e84ff3 Compare October 18, 2023 19:20

Huang-Wei added 2 commits October 18, 2023 17:00

Split updatePreemptingResourceMetrics into 3 functions

f381a08

Fix a bug on metric of pending pods/resources

17fac11

Huang-Wei force-pushed the fix-pending-metric branch from 7e84ff3 to 17fac11 Compare October 19, 2023 00:01

Huang-Wei requested review from yzhangal, wilfred-s and manirajv06 October 19, 2023 17:39

wilfred-s approved these changes Oct 24, 2023

View reviewed changes

wilfred-s closed this in 5a9d1b5 Oct 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[YUNIKORN-2031] Fix a bug on metric of pending pods/resources #673

[YUNIKORN-2031] Fix a bug on metric of pending pods/resources #673

Huang-Wei commented Oct 11, 2023

yzhangal left a comment

manirajv06 commented Oct 12, 2023

manirajv06 commented Oct 12, 2023

manirajv06 left a comment

Huang-Wei commented Oct 12, 2023

wilfred-s left a comment

wilfred-s commented Oct 13, 2023

Huang-Wei commented Oct 13, 2023

wilfred-s commented Oct 16, 2023

Huang-Wei commented Oct 16, 2023

wilfred-s commented Oct 18, 2023

Huang-Wei commented Oct 18, 2023

Huang-Wei commented Oct 18, 2023

codecov bot commented Oct 22, 2023

wilfred-s left a comment

[YUNIKORN-2031] Fix a bug on metric of pending pods/resources #673

[YUNIKORN-2031] Fix a bug on metric of pending pods/resources #673

Conversation

Huang-Wei commented Oct 11, 2023

What is this PR for?

What type of PR is it?

What is the Jira issue?

How should this be tested?

yzhangal left a comment

Choose a reason for hiding this comment

manirajv06 commented Oct 12, 2023

manirajv06 commented Oct 12, 2023

manirajv06 left a comment

Choose a reason for hiding this comment

Huang-Wei commented Oct 12, 2023

wilfred-s left a comment

Choose a reason for hiding this comment

wilfred-s commented Oct 13, 2023

Huang-Wei commented Oct 13, 2023

wilfred-s commented Oct 16, 2023

Huang-Wei commented Oct 16, 2023

wilfred-s commented Oct 18, 2023

Huang-Wei commented Oct 18, 2023

Huang-Wei commented Oct 18, 2023

codecov bot commented Oct 22, 2023

Codecov Report

wilfred-s left a comment

Choose a reason for hiding this comment