Feature Request: GPU Support #833
Comments
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Exactly the same situation. Did you resolve this issue already?
/remove-lifecycle stale
/remove-lifecycle rotten
I think NPD should support different configs for different devices or runtimes;
BTW, for GPU, we don't need to install more dependencies; it just adds env
Thanks for filing the feature request! I think this totally makes sense. Do you have a more concrete proposal? /cc @SergeyKanzhelev
Yes, accelerator health is important functionality, and it would be great to have it in NPD. It needs to be designed carefully, though. There is already some health checking in the device plugin (like https://github.com/NVIDIA/k8s-device-plugin/blob/bf58cc405af03d864b1502f147815d4c2271ab9a/internal/rm/health.go#L39) that we need to work nicely with. Even simple detection of device plugin health is a good starting point here. @AllenXu93 @ZongqiangZhang do you want to work on a more detailed design? I will definitely be interested in joining the effort.
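To make "simple detection of device plugin health" concrete, here is a minimal sketch, not a proposed design. It assumes the kubelet counts all registered devices under the node's `status.capacity` and only healthy ones under `status.allocatable`, so a gap between the two for `nvidia.com/gpu` suggests the device plugin has marked GPUs unhealthy. The function name and the sample node status are hypothetical.

```python
def unhealthy_gpu_count(node_status, resource="nvidia.com/gpu"):
    """Return capacity minus allocatable for an extended resource.

    Returns None when the resource is not advertised at all, which may
    mean the device plugin has not registered with the kubelet.
    Assumption: allocatable only counts devices reported as healthy.
    """
    capacity = node_status.get("capacity", {})
    allocatable = node_status.get("allocatable", {})
    if resource not in capacity:
        return None
    total = int(capacity[resource])
    healthy = int(allocatable.get(resource, "0"))
    return max(total - healthy, 0)


# Hypothetical node status, shaped like the `status` field of
# `kubectl get node <name> -o json`.
sample = {
    "capacity": {"nvidia.com/gpu": "8"},
    "allocatable": {"nvidia.com/gpu": "7"},
}
print(unhealthy_gpu_count(sample))  # prints 1
```

A check like this only reflects what the device plugin already reports, so it would complement rather than replace any independent probing done by NPD.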
/kind feature
Of course.
LGTM +1
/remove-lifecycle rotten
The important question here is what NPD will collect compared to the device plugin. Some design work is needed here.
/remove-lifecycle stale
I believe this is correct. We are ignoring just a handful of application-specific errors, so detection of device plugin health is a good starting point.
This feature request aims to enhance the Node Problem Detector with the ability to monitor GPUs on nodes and detect issues.
Currently, NPD has no direct visibility into GPUs. However, many workloads are GPU-accelerated, which makes GPU health an important part of node health. For example, GPUs are widely used in machine learning training and inference, especially LLM training, which may use tens of thousands of GPUs. If any one GPU in the cluster goes bad, the entire training cluster has to be restarted from the previous checkpoint.
This feature request adds the following capabilities:
GPU device monitoring: NPD will periodically collect GPU device info and look for crashes or errors via tools such as nvidia-smi, NVML, or DCGM.
GPU device monitoring: NPD will periodically check GPU device info to detect whether a GPU is "stuck" (e.g. the nvidia-smi command hangs).
TBD: GPU runtime monitoring: NPD will check for crashes or OOM issues reported in NVIDIA logs.
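The "stuck" check above could be prototyped as an NPD custom plugin. The following is a minimal sketch under stated assumptions, not a proposed implementation: it assumes the custom plugin exit-code convention (0 = OK, 1 = NonOK, 2 = Unknown) with the status message on stdout, and the names `check_command` and `main` plus the 10-second timeout are illustrative choices.

```python
"""Sketch of a hang check that could run as an NPD custom plugin."""
import subprocess

# Exit codes assumed from NPD's custom plugin convention.
OK, NON_OK, UNKNOWN = 0, 1, 2


def check_command(argv, timeout_s):
    """Run argv and classify the result as (exit_code, message).

    A command that does not finish within timeout_s is treated as stuck.
    """
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except FileNotFoundError:
        # The tool is not installed, so nothing can be said about the GPU.
        return UNKNOWN, f"{argv[0]} not found"
    except subprocess.TimeoutExpired:
        return NON_OK, f"{argv[0]} did not finish within {timeout_s}s"
    if result.returncode != 0:
        return NON_OK, f"{argv[0]} exited with code {result.returncode}"
    return OK, f"{argv[0]} completed"


def main():
    # `nvidia-smi -L` lists the GPUs on the node; the timeout is arbitrary.
    code, message = check_command(["nvidia-smi", "-L"], timeout_s=10)
    print(message)
    return code
```

When used as a script, the entry point would be `sys.exit(main())`. One caveat: `subprocess.run` kills the child on timeout and then waits for it, so if the process is stuck in uninterruptible sleep the script itself may block; the plugin-level timeout in the NPD configuration would then have to act as the backstop.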
Specifically, this feature request includes:
Looking forward to your feedback!