DRA: ResourceSlice Status for Device Health Tracking #5283

Open
4 tasks
johnbelamaric opened this issue May 6, 2025 · 9 comments
Labels
sig/node: Categorizes an issue or PR as relevant to SIG Node.
wg/device-management: Categorizes an issue or PR as relevant to WG Device Management.

Comments

@johnbelamaric
Member

Enhancement Description

Devices like GPUs sometimes fail. At the unconference during the recent Maintainers Summit in London, we asked what Kubernetes is still missing for AI/ML workloads. Handling GPU failures was high on the list.

Currently, it is the responsibility of the DRA driver to exclude unhealthy devices from the ResourceSlice. However, this means that only the driver is well positioned to mitigate those failures. For example, a failure can often be repaired just by resetting the GPU device, without even rebooting the node. Different organizations are building different tooling to manage this mitigation. A single API to surface device issues would help those efforts, and may even enable integration with existing tooling such as the pluggable Node Problem Detector.

This issue proposes a ResourceSlice.Status API that drivers can optionally use to publish health information about their devices. Maintenance tooling can use this information to attempt mitigation, alert administrators, or even reschedule larger multi-node jobs into new placement groups.
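
To make the idea concrete, here is a minimal Go sketch of how maintenance tooling might consume such a status. Nothing here is a real Kubernetes API: the issue does not define any fields, so `DeviceHealth`, `DeviceStatus`, `ResourceSliceStatus`, `NeedsMitigation`, and the `ResetRequired` reason are all hypothetical names chosen for illustration only.

```go
package main

import (
	"fmt"
	"time"
)

// DeviceHealth is a hypothetical health state. The issue does not define one.
type DeviceHealth string

const (
	Healthy   DeviceHealth = "Healthy"
	Unhealthy DeviceHealth = "Unhealthy"
	Unknown   DeviceHealth = "Unknown"
)

// DeviceStatus is a hypothetical per-device entry that a driver could publish.
type DeviceStatus struct {
	Name           string
	Health         DeviceHealth
	Reason         string    // illustrative, driver-defined reason
	LastTransition time.Time // when Health last changed
}

// ResourceSliceStatus is a hypothetical shape for the proposed
// ResourceSlice.Status. It is an assumption, not the actual design.
type ResourceSliceStatus struct {
	Devices []DeviceStatus
}

// NeedsMitigation returns the names of devices explicitly reported as
// Unhealthy. Devices reported as Unknown are left alone in this sketch.
func NeedsMitigation(s ResourceSliceStatus) []string {
	var names []string
	for _, d := range s.Devices {
		if d.Health == Unhealthy {
			names = append(names, d.Name)
		}
	}
	return names
}

func main() {
	status := ResourceSliceStatus{Devices: []DeviceStatus{
		{Name: "gpu-0", Health: Healthy},
		{Name: "gpu-1", Health: Unhealthy, Reason: "ResetRequired", LastTransition: time.Now()},
	}}
	// Tooling could reset these devices or alert an administrator.
	fmt.Println(NeedsMitigation(status)) // prints [gpu-1]
}
```

With a status like this, the driver keeps publishing the device while tooling outside the driver decides how to react, which is the separation of concerns the proposal is after.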

/sig node
/wg device-management
/cc @pohly @klueska @SergeyKanzhelev @asm582 @tardieu

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

@k8s-ci-robot added the sig/node label May 6, 2025
@k8s-ci-robot added the wg/device-management label May 6, 2025
@pohly moved this from 🆕 New to 📋 Backlog in Dynamic Resource Allocation May 8, 2025
@SergeyKanzhelev
Member

cc @ArangoGutierrez

@kannon92
Contributor

kannon92 commented May 9, 2025

Is #4680 related to this?

For #4680, it seems we never implemented the DRA health status.

@johnbelamaric
Member Author

It is related. We need to reconcile if that one is enough, or if we want both. With that one, I could achieve a similar result to this enhancement if I grabbed all devices with an admin access request. That could work for tooling/automation but is not a good UX for user troubleshooting.

So, I can see a place for both of these. We just need to decide.

@ArangoGutierrez
Contributor

Given that #4680 is halfway through (the device plugin part is merged), I think it makes sense to continue that work and see how we can integrate this proposal.

@ArangoGutierrez
Contributor

@johnbelamaric Once we decide how this KEP would align with #4680 I would like to volunteer to take on this KEP implementation, if that's ok with you.

@johnbelamaric
Member Author

> @johnbelamaric Once we decide how this KEP would align with #4680 I would like to volunteer to take on this KEP implementation, if that's ok with you.

Awesome, thanks.

@johnbelamaric
Member Author

@ArangoGutierrez are you interested in working on the design/KEP and working with @Jpsassine to figure out how this KEP and #4680 fit together? Or are you more focused on the implementation side?

@johnbelamaric
Member Author

> @ArangoGutierrez are you interested in working on the design/KEP and working with @Jpsassine to figure out how this KEP and #4680 fit together? Or are you more focused on the implementation side?

This KEP currently has no owner, see https://docs.google.com/presentation/d/1COH8dG8qZ9jEMXzQVdco74XmroZQTMYcUgwAES8FYWU/edit?resourcekey=0-RCm3FFt8pxiK-y-_YHUkOg&slide=id.g353812bec9d_0_26#slide=id.g353812bec9d_0_26

@johnbelamaric
Member Author

/assign @ArangoGutierrez

Projects
Status: 📋 Backlog
Status: No status
Development

No branches or pull requests

5 participants