fix: Enhance GPU metrics collection and error handling in vGPU monitor #827
What type of PR is this?
/kind flake
This pull request includes significant changes to the `vGPUmonitor` application to improve its structure and functionality. The most important changes include the addition of context and signal handling, the restructuring of the metrics collection process, and the refactoring of the `watchAndFeedback` function to support graceful shutdowns.

Context and Signal Handling:
- `cmd/vGPUmonitor/main.go`: Added context and signal handling to enable graceful shutdown of the application. This includes capturing system signals and using a context to manage the lifecycle of goroutines.

Metrics Collection:
- `cmd/vGPUmonitor/metrics.go`: Refactored the metrics collection process by splitting it into multiple functions (`collectGPUInfo`, `collectPodAndContainerInfo`, `collectContainerMetrics`, etc.) to improve readability and maintainability.
- `cmd/vGPUmonitor/metrics.go`: Introduced the `sendMetric` helper function to streamline sending metrics to Prometheus.

Refactoring `watchAndFeedback`:
- `cmd/vGPUmonitor/feedback.go`: Refactored the `watchAndFeedback` function to support context-based cancellation, improving the application's ability to shut down gracefully.

Code Cleanup:
- `cmd/vGPUmonitor/feedback.go`: Removed the unused `time` import.
- `cmd/vGPUmonitor/metrics.go`: Removed unused imports and cleaned up the code to improve readability.

These changes collectively enhance the robustness and maintainability of the `vGPUmonitor` application.

What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
No