Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YUNIKORN-2031] Fix a bug on metric of pending pods/resources #673

Closed
wants to merge 2 commits into from

Conversation

Huang-Wei
Copy link
Contributor

What is this PR for?

Fix a bug in metric of pending pods/resources. The bug seems to present in all YK versions.

What type of PR is it?

  • - Bug Fix

What is the Jira issue?

How should this be tested?

I cannot find any existing test suite except for pkg/metrics/queue_test.go, but that's more simply checking the metric's existence. Pplease advice if you want tests in this PR, or in follow-up to add tests to cover metric accuracy generally.

For now, I did a local ad-hoc test to verify this PR works:

image

Copy link

@yzhangal yzhangal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch Wei. It seems

func (sq *Queue) decPendingResource(delta *resources.Resource) {

need a similar change. Right?

@manirajv06 manirajv06 assigned manirajv06 and Huang-Wei and unassigned manirajv06 Oct 12, 2023
@manirajv06 manirajv06 self-requested a review October 12, 2023 10:11
@manirajv06
Copy link
Contributor

Thanks for the patch Wei. It seems

func (sq *Queue) decPendingResource(delta *resources.Resource) {

need a similar change. Right?

yes

@manirajv06
Copy link
Contributor

Pplease advice if you want tests in this PR, or in follow-up to add tests to cover metric accuracy generally.

Yes, we should be doing metrics assertions in all possible places, at least for important state change events in objects like Queues, Application etc. Please file a separate jira. Though no. of changes is going to be high, it is good to do especially given that we are seeing increase in no. of bugs with respect to metrics recently.

Can we start with this? It is better to start by making changes in queue_test#TestPendingCalc only.

Copy link
Contributor

@manirajv06 manirajv06 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see comments.

@Huang-Wei
Copy link
Contributor Author

Thanks @yzhangal @manirajv06 for the review. I added a minimum UT to repro the bug:

=== RUN   TestPendingCalc
2023-10-12T15:57:52.289-0700	INFO	core.scheduler.queue	objects/queue.go:144	configured queue added to scheduler	{"queueName": "root"}
2023-10-12T15:57:52.290-0700	INFO	core.scheduler.queue	objects/queue.go:144	configured queue added to scheduler	{"queueName": "root.leaf"}
    queue_test.go:300: assertion failed: error is not nil: 
        
        Diff:
        --- metric output does not match expectation; want
        +++ got:
        @@ -1,10 +1,2 @@
        -(bytes.Buffer) # HELP yunikorn_root_leaf_queue_resource Queue resource metrics. State of the resource includes `guaranteed`, `max`, `allocated`, `pending`, `preempting`.
        -# TYPE yunikorn_root_leaf_queue_resource gauge
        -yunikorn_root_leaf_queue_resource{resource="memory",state="pending"} 100
        -yunikorn_root_leaf_queue_resource{resource="vcores",state="pending"} 10
        -# HELP yunikorn_root_queue_resource Queue resource metrics. State of the resource includes `guaranteed`, `max`, `allocated`, `pending`, `preempting`.
        -# TYPE yunikorn_root_queue_resource gauge
        -yunikorn_root_queue_resource{resource="memory",state="pending"} 100
        -yunikorn_root_queue_resource{resource="vcores",state="pending"} 10
        +(bytes.Buffer) 
         
        -
        : unexpected metrics
--- FAIL: TestPendingCalc (0.00s)

To ensure we have enough UT coverage on metrics, a follow-up PR would be a better bet.

go.mod Show resolved Hide resolved
Copy link
Contributor

@wilfred-s wilfred-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds a lot of overhead. The call to updateAllocatedAndPendingResourceMetrics() updates three metrics when it only should do one. Three loops are run setting metrics for values that we know have not changed. We call the specific increment or decrement for each type separately. We need to update the metrics separately also.

The function should be split into three calls one for each type we track: allocated, pending and preempting.

@wilfred-s
Copy link
Contributor

To ensure we have enough UT coverage on metrics, a follow-up PR would be a better bet.
Completely agree, a follow up jira is needed. Probably linked to the changes for using labels in the queue metrics instead of the names like we do now.

@Huang-Wei
Copy link
Contributor Author

This adds a lot of overhead. The call to updateAllocatedAndPendingResourceMetrics() updates three metrics when it only should do one.

Agree. However, the function updateAllocatedAndPendingResourceMetrics() and how this pattern is used has been across several releases. In this PR, I followed existing pattern and focus on the accuracy issue, and try to make it as neat as possible so it's cherrypick-able.

For radical performance optimization on how metrics get updated, we should track in a separate JIRA.

@wilfred-s
Copy link
Contributor

Agree. However, the function updateAllocatedAndPendingResourceMetrics() and how this pattern is used has been across several releases. In this PR, I followed existing pattern and focus on the accuracy issue, and try to make it as neat as possible so it's cherrypick-able.

I am also looking at keeping accurate track. We generate metrics entries at a point in time without changes to the underlying values. This could hide the fact that we do not call the metrics at the right time or don't call it for certain changes. They seem accurate in the long run as they are updated as a side effect of another change. This might not show if churn and update rate is high but could show for lower change rates.

@Huang-Wei
Copy link
Contributor Author

@wilfred-s how should I proceed, like creating a separate PR to address the inefficiency issue and then rebase here?

@wilfred-s
Copy link
Contributor

Adding the change here adds a total of 14 lines to the change. I would say that is small enough to include here and fix both in one go. If you split the change in two you will touch the same lines of code in both. I doubt that the extra change will affect the cherry pick.

@Huang-Wei
Copy link
Contributor Author

I doubt that the extra change will affect the cherry pick.

That would cause the rebased PR not cherrypickable. That's why earlier I proposed to have this PR a pure fix (leave perf bits apart), and that would cherrypickable.

Anyway, let me try to include perf bits in this PR.

@Huang-Wei
Copy link
Contributor Author

let me try to include perf bits in this PR.

This is done in the first commit, and followed by the pending metric fix.

(I mark updateAllocatedAndPendingResourceMetrics() as DEPRECATED for now. LMK if you want to remove it entirely)

PTAL @wilfred-s @yzhangal @manirajv06

@codecov
Copy link

codecov bot commented Oct 22, 2023

Codecov Report

Merging #673 (17fac11) into master (ba3b362) will increase coverage by 0.00%.
Report is 2 commits behind head on master.
The diff coverage is 78.57%.

@@           Coverage Diff           @@
##           master     #673   +/-   ##
=======================================
  Coverage   77.67%   77.68%           
=======================================
  Files          79       79           
  Lines       13089    13098    +9     
=======================================
+ Hits        10167    10175    +8     
- Misses       2596     2598    +2     
+ Partials      326      325    -1     
Files Coverage Δ
pkg/scheduler/objects/application.go 68.27% <100.00%> (ø)
pkg/scheduler/objects/queue.go 77.39% <76.92%> (-0.16%) ⬇️

... and 1 file with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Contributor

@wilfred-s wilfred-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wilfred-s wilfred-s closed this in 5a9d1b5 Oct 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants