-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[YUNIKORN-2031] Fix a bug on metric of pending pods/resources #673
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch Wei. It seems
func (sq *Queue) decPendingResource(delta *resources.Resource) {
need a similar change. Right?
yes |
Yes, we should be doing metrics assertions in all possible places, at least for important state change events in objects like Queues, Application etc. Please file a separate jira. Though no. of changes is going to be high, it is good to do especially given that we are seeing increase in no. of bugs with respect to metrics recently. Can we start with this? It is better to start by making changes in queue_test#TestPendingCalc only. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see comments.
Thanks @yzhangal @manirajv06 for the review. I added a minimum UT to repro the bug: === RUN TestPendingCalc
2023-10-12T15:57:52.289-0700 INFO core.scheduler.queue objects/queue.go:144 configured queue added to scheduler {"queueName": "root"}
2023-10-12T15:57:52.290-0700 INFO core.scheduler.queue objects/queue.go:144 configured queue added to scheduler {"queueName": "root.leaf"}
queue_test.go:300: assertion failed: error is not nil:
Diff:
--- metric output does not match expectation; want
+++ got:
@@ -1,10 +1,2 @@
-(bytes.Buffer) # HELP yunikorn_root_leaf_queue_resource Queue resource metrics. State of the resource includes `guaranteed`, `max`, `allocated`, `pending`, `preempting`.
-# TYPE yunikorn_root_leaf_queue_resource gauge
-yunikorn_root_leaf_queue_resource{resource="memory",state="pending"} 100
-yunikorn_root_leaf_queue_resource{resource="vcores",state="pending"} 10
-# HELP yunikorn_root_queue_resource Queue resource metrics. State of the resource includes `guaranteed`, `max`, `allocated`, `pending`, `preempting`.
-# TYPE yunikorn_root_queue_resource gauge
-yunikorn_root_queue_resource{resource="memory",state="pending"} 100
-yunikorn_root_queue_resource{resource="vcores",state="pending"} 10
+(bytes.Buffer)
-
: unexpected metrics
--- FAIL: TestPendingCalc (0.00s) To ensure we have enough UT coverage on metrics, a follow-up PR would be a better bet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This adds a lot of overhead. The call to updateAllocatedAndPendingResourceMetrics()
updates three metrics when it only should do one. Three loops are run setting metrics for values that we know have not changed. We call the specific increment or decrement for each type separately. We need to update the metrics separately also.
The function should be split into three calls one for each type we track: allocated, pending and preempting.
|
Agree. However, the function For radical performance optimization on how metrics get updated, we should track in a separate JIRA. |
fc40665
to
201d0dc
Compare
I am also looking at keeping accurate track. We generate metrics entries at a point in time without changes to the underlying values. This could hide the fact that we do not call the metrics at the right time or don't call it for certain changes. They seem accurate in the long run as they are updated as a side effect of another change. This might not show if churn and update rate is high but could show for lower change rates. |
@wilfred-s how should I proceed, like creating a separate PR to address the inefficiency issue and then rebase here? |
Adding the change here adds a total of 14 lines to the change. I would say that is small enough to include here and fix both in one go. If you split the change in two you will touch the same lines of code in both. I doubt that the extra change will affect the cherry pick. |
That would cause the rebased PR not cherrypickable. That's why earlier I proposed to have this PR a pure fix (leave perf bits apart), and that would cherrypickable. Anyway, let me try to include perf bits in this PR. |
201d0dc
to
7e84ff3
Compare
This is done in the first commit, and followed by the pending metric fix. (I mark updateAllocatedAndPendingResourceMetrics() as DEPRECATED for now. LMK if you want to remove it entirely) |
7e84ff3
to
17fac11
Compare
Codecov Report
@@ Coverage Diff @@
## master #673 +/- ##
=======================================
Coverage 77.67% 77.68%
=======================================
Files 79 79
Lines 13089 13098 +9
=======================================
+ Hits 10167 10175 +8
- Misses 2596 2598 +2
+ Partials 326 325 -1
... and 1 file with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
What is this PR for?
Fix a bug in metric of pending pods/resources. The bug seems to present in all YK versions.
What type of PR is it?
What is the Jira issue?
How should this be tested?
I cannot find any existing test suite except for pkg/metrics/queue_test.go, but that's more simply checking the metric's existence. Pplease advice if you want tests in this PR, or in follow-up to add tests to cover metric accuracy generally.
For now, I did a local ad-hoc test to verify this PR works: