DRA: ResourceSlice Status for Device Health Tracking #5283

johnbelamaric · 2025-05-06T22:35:13Z

Enhancement Description

Devices like GPUs sometimes fail. At the recent Maintainers Summit in London, during the unconference we asked what Kubernetes is still missing for AI/ML workloads. Handling GPU failure was high on the list.

Currently, it is the responsibility of the DRA driver to exclude unhealthy devices from the ResourceSlice. However, this means that only the driver is well positioned to mitigate those failures. For example, a failure can often be repaired just by resetting the GPU device, without even rebooting the node. Different organizations are building different tooling to manage this mitigation. A single API to surface device issues would help those efforts, and may even enable integration with existing tooling such as the pluggable Node Problem Detector.

This issue proposes a ResourceSlice.Status API that drivers can optionally use to publish health information about their devices. This can be used by maintenance tooling to attempt mitigation, or alert administrators, or even reschedule larger multi-node jobs into new placement groups.

/sig node
/wg device-management
/cc @pohly @klueska @SergeyKanzhelev @asm582 @tardieu

One-line enhancement description (can be used as a release note): Enable DRA drivers to store device health and other device status in the ResourceSlice
Kubernetes Enhancement Proposal: TBD
Discussion Link: https://docs.google.com/document/d/1Zz_xhPemY28EqpcKSLPl-S7tWnOHPU_IDLVmZOoGS5k/edit?tab=t.0#bookmark=id.j08fcqbp6h8d
Primary contact (assignee): @johnbelamaric
Responsible SIGs: Node
Enhancement target (which target equals to which milestone):
- Alpha release target (x.y): 1.34
- Beta release target (x.y):
- Stable release target (x.y):
Alpha
- KEP (k/enhancements) update PR(s):
- Code (k/k) update PR(s):
- Docs (k/website) update PR(s):

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

SergeyKanzhelev · 2025-05-09T17:14:11Z

cc @ArangoGutierrez

kannon92 · 2025-05-09T20:18:58Z

Is #4680 related to this?

For #4680, it seems we never implemented the DRA health status.

johnbelamaric · 2025-05-09T20:34:28Z

It is related. We need to work out whether that one is enough, or whether we want both. With that one, I could achieve a similar result to this enhancement if I grabbed all devices with an admin access request. That could work for tooling/automation but is not a good UX for user troubleshooting.

So, I can see a place for both of these. We just need to decide.

ArangoGutierrez · 2025-05-13T16:05:05Z

Given that #4680 is halfway through (the device plugin part is merged), I think it makes sense to continue that work and see how we can integrate this proposal.

ArangoGutierrez · 2025-05-13T16:08:45Z

@johnbelamaric Once we decide how this KEP would align with #4680 I would like to volunteer to take on this KEP implementation, if that's ok with you.

johnbelamaric · 2025-05-14T16:30:19Z

> @johnbelamaric Once we decide how this KEP would align with #4680 I would like to volunteer to take on this KEP implementation, if that's ok with you.

Awesome, thanks.

johnbelamaric · 2025-05-16T21:44:41Z

@ArangoGutierrez are you interested in working on the design/KEP and working with @Jpsassine to figure out how this KEP and #4680 fit together? Or are you more focused on the implementation side?

johnbelamaric · 2025-05-16T21:45:09Z

> @ArangoGutierrez are you interested in working on the design/KEP and working with @Jpsassine to figure out how this KEP and #4680 fit together? Or are you more focused on the implementation side?

This KEP currently has no owner, see https://docs.google.com/presentation/d/1COH8dG8qZ9jEMXzQVdco74XmroZQTMYcUgwAES8FYWU/edit?resourcekey=0-RCm3FFt8pxiK-y-_YHUkOg&slide=id.g353812bec9d_0_26#slide=id.g353812bec9d_0_26

johnbelamaric · 2025-05-29T19:30:44Z

/assign @ArangoGutierrez

k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label May 6, 2025

github-project-automation bot added this to SIG Node 1.33 KEPs planning May 6, 2025

k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label May 6, 2025

github-project-automation bot added this to Dynamic Resource Allocation May 6, 2025

github-project-automation bot moved this to 🆕 New in Dynamic Resource Allocation May 6, 2025

pohly moved this from 🆕 New to 📋 Backlog in Dynamic Resource Allocation May 8, 2025

johnbelamaric added this to AI Critical KEPs (KEP issues, not PRs) May 28, 2025

johnbelamaric moved this to Pre-Alpha in AI Critical KEPs (KEP issues, not PRs) May 28, 2025

k8s-ci-robot assigned ArangoGutierrez May 29, 2025
