DRA: ResourceSlice Status for Device Health Tracking #5283
Comments
It is related. We need to reconcile if that one is enough, or if we want both. With that one, I could achieve a similar result to this enhancement if I grabbed all devices with an admin access request. That could work for tooling/automation but is not a good UX for user troubleshooting. So, I can see a place for both of these. We just need to decide.

Given that #4680 is halfway through (Device plugin part merged), I think it makes sense to continue that work, and just see how we can integrate this proposal.

@johnbelamaric Once we decide how this KEP would align with #4680, I would like to volunteer to take on this KEP implementation, if that's ok with you.

Awesome, thanks.

@ArangoGutierrez are you interested in working on the design/KEP and working with @Jpsassine to figure out how this KEP and #4680 fit together? Or are you more focused on the implementation side?

This KEP currently has no owner, see https://docs.google.com/presentation/d/1COH8dG8qZ9jEMXzQVdco74XmroZQTMYcUgwAES8FYWU/edit?resourcekey=0-RCm3FFt8pxiK-y-_YHUkOg&slide=id.g353812bec9d_0_26#slide=id.g353812bec9d_0_26

/assign @ArangoGutierrez
Enhancement Description
Devices like GPUs sometimes fail. At the recent Maintainers Summit in London, during the unconference session, we asked what Kubernetes is still missing for AI/ML workloads. Handling GPU failure was high on the list.
Currently, it is the responsibility of the DRA driver to exclude unhealthy devices from the ResourceSlice. However, this means that only the driver is well positioned to mitigate those failures. For example, often a failure requires only a reset of the GPU device to repair. This can be done without even rebooting the node. Different organizations are building different tooling for managing this mitigation. A single API to surface device issues would help those efforts, and may even enable integration with existing tooling such as the pluggable Node Problem Detector.
This issue proposes a ResourceSlice.Status API that drivers can optionally use to publish health information about their devices. Maintenance tooling could use this information to attempt mitigation, alert administrators, or even reschedule larger multi-node jobs into new placement groups.
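To make the shape of such a status concrete, here is a minimal Go sketch of what a driver-published health entry could look like. All type and field names below (DeviceHealth, ResourceSliceStatus, Reason, and so on) are hypothetical illustrations of this proposal, not the final Kubernetes API.

```go
// Hypothetical sketch of a ResourceSlice status for device health.
// Names and fields are illustrative only; the actual API shape is
// to be decided in the KEP.
package main

import (
	"fmt"
	"time"
)

// DeviceHealth is one driver-reported health entry for a single device.
type DeviceHealth struct {
	// DeviceName matches the device's name in the ResourceSlice spec.
	DeviceName string
	// Healthy reports whether the driver currently considers the device usable.
	Healthy bool
	// Reason is a short, machine-readable cause, e.g. "XidError".
	Reason string
	// Message is a human-readable explanation for administrators.
	Message string
	// LastTransitionTime records when the health state last changed.
	LastTransitionTime time.Time
}

// ResourceSliceStatus is the hypothetical status block a driver would
// publish alongside the existing ResourceSlice spec.
type ResourceSliceStatus struct {
	Devices []DeviceHealth
}

func main() {
	// Example of what a driver might publish after detecting a GPU fault.
	status := ResourceSliceStatus{
		Devices: []DeviceHealth{
			{
				DeviceName:         "gpu-3",
				Healthy:            false,
				Reason:             "XidError",
				Message:            "GPU reported a fatal error; a device reset may recover it",
				LastTransitionTime: time.Now(),
			},
		},
	}
	for _, d := range status.Devices {
		fmt.Printf("%s healthy=%t reason=%s\n", d.DeviceName, d.Healthy, d.Reason)
	}
}
```

Under this sketch, maintenance tooling would watch for entries with Healthy=false and use Reason/Message to decide whether to reset the device, alert an administrator, or drain and reschedule affected workloads.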
/sig node
/wg device-management
/cc @pohly @klueska @SergeyKanzhelev @asm582 @tardieu
- k/enhancements update PR(s):
- k/k update PR(s):
- k/website update PR(s):

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.