
Graduate KEP-2340 to Stable #5330


Merged
merged 1 commit into kubernetes:master on May 22, 2025

Conversation

serathius
Contributor

@serathius serathius commented May 22, 2025

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 22, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 22, 2025
@deads2k
Contributor

deads2k commented May 22, 2025

Thank you for the detailed qualification notes.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 22, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, serathius

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 22, 2025
@k8s-ci-robot k8s-ci-robot merged commit 397f701 into kubernetes:master May 22, 2025
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone May 22, 2025
@serathius
Contributor Author

/cc @liggitt

@k8s-ci-robot k8s-ci-robot requested a review from liggitt May 22, 2025 18:24
Member

@wojtek-t wojtek-t left a comment


The PRR looks OK to me, but I'm not fully aligned with the remove-fallback part:

> With qualification results showing that fallback is not needed, we can go back to the original design:
> we should fail the requests and rely on rate limiting to prevent cascading failure, i.e. the `Retry-After` HTTP header (for
> well-behaved clients) and [Apiserver Priority and Fairness](https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20190228-priority-and-fairness.md).
> The main reason for that is the added complexity and the incorrect handling in APF, which assumes that request cost doesn't change.
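
For illustration, a minimal Go sketch of the "original design" quoted above: reject a consistent LIST with 429 and `Retry-After` when the watch cache cannot catch up, instead of falling back to etcd. The `watchCache` interface, `WaitUntilFreshAndList`, and the 3-second deadline are assumptions made for this sketch, not the actual apiserver code.

```go
package main

import (
	"context"
	"net/http"
	"time"
)

// watchCache stands in for the apiserver watch cache; the real interface differs.
type watchCache interface {
	// WaitUntilFreshAndList blocks until the cache has observed at least
	// the given etcd revision, or the context expires.
	WaitUntilFreshAndList(ctx context.Context, requiredRevision int64) ([]interface{}, error)
}

// serveConsistentList sketches the original (no-fallback) design: if the cache
// cannot catch up to the latest etcd revision in time, fail the request with
// 429 and Retry-After rather than listing from etcd.
func serveConsistentList(w http.ResponseWriter, r *http.Request, cache watchCache, latestEtcdRevision int64) {
	ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
	defer cancel()

	items, err := cache.WaitUntilFreshAndList(ctx, latestEtcdRevision)
	if err != nil {
		// No fallback: well-behaved clients honor Retry-After, and API
		// Priority and Fairness throttles the rest.
		w.Header().Set("Retry-After", "1")
		http.Error(w, "watch cache has not caught up to the required revision", http.StatusTooManyRequests)
		return
	}

	// Encoding the response is elided; items came from the cache, not etcd.
	_ = items
	w.WriteHeader(http.StatusOK)
}

func main() {}
```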
Member


I'm actually not 100% convinced here. I agree that incorrect APF handling poses some risk, but on the other hand, if the watchcache is lagging by 3s+ all the time, then after removing the fallback you will not be able to make a consistent list at all.

I would personally be more willing to take the APF risk here (especially since we haven't seen it materialize) than to expose us to the different risk of not being able to perform consistent lists at all.

@deads2k @jpbetz

Member


Also @liggitt

Contributor Author

@serathius serathius May 23, 2025


My perspective is that the APF handling presents a more immediate and critical threat. Fallbacks are predominantly observed during etcd upgrades, which break watches and force the watch cache to reinitialize. Requests caught mid-processing fall back to etcd before the cache starts blocking requests (because it is reinitializing), while their cost estimate is still based on the cache.

Consider a scenario where a significant number of these requests fall back during an upgrade. If these requests, instead of being properly deferred by APF, each drive LIST memory usage up to hundreds of megabytes, they can easily trigger a massive memory consumption spike. I've personally identified cases with over 100 such fallbacks to etcd. Upgrades are already disruptive; the fallback risks adding further instability due to OOMs.

Therefore, I advocate that we prioritize APF safety and API server availability during planned operations like upgrades, rather than accept the potential for degraded LIST availability.
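
To make the memory arithmetic concrete, here is a rough, self-contained Go illustration of the mismatch described above; the seat formula, per-object size, and request counts are made-up numbers for illustration only, not the real APF work estimator.

```go
package main

import "fmt"

// estimateSeats stands in for the APF work estimator: the cost is derived up
// front from how many objects the watch cache is expected to return.
func estimateSeats(expectedObjects int) int {
	const objectsPerSeat = 100 // made-up ratio
	seats := expectedObjects / objectsPerSeat
	if seats < 1 {
		seats = 1
	}
	return seats
}

func main() {
	// Estimate made while the request was still expected to be served from
	// the watch cache (cheap, already-decoded objects).
	const objects = 500
	seats := estimateSeats(objects)

	// If the request then falls back to etcd, every object is fetched and
	// decoded again; assume ~1 MiB per decoded object for a large resource.
	const bytesPerObject = 1 << 20
	actualMiB := objects * bytesPerObject >> 20

	fmt.Printf("APF reserved %d seats, but the etcd fallback allocates ~%d MiB\n", seats, actualMiB)
	fmt.Printf("100 concurrent fallbacks of this size need ~%d GiB that APF never accounted for\n", 100*actualMiB>>10)
}
```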

Member


I'm not saying there is no risk due to APF. I'm saying that you're completely ignoring the other side.

If my watchcache is lagging by 3s+ (and you can get into this state purely because of a thundering herd of writes), then you can no longer perform any consistent list. You can't debug, you can't reinitialize components.
I don't think we can just ignore that.

If you're worried about planned operations, I'm fairly sure there are other options available to us.
We can generally distinguish those cases, because then the watch breaks and the watchcache switches to a not-ready state.
We could easily detect this situation and, when the watchcache is not ready, fail the request instead of falling back.
Or some variation of that.

I just don't think we can remove the fallback in every case.
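
As a sketch of that variation (purely illustrative; the state names and the decision function are assumptions, not code from the KEP or the apiserver), the idea could look roughly like this in Go: fail fast only when the watchcache is explicitly not ready, and keep the etcd fallback when it is merely lagging.

```go
package main

import (
	"errors"
	"fmt"
)

type cacheState int

const (
	cacheReady    cacheState = iota // watch cache healthy and caught up
	cacheNotReady                   // watch broke (e.g. during an etcd upgrade), cache reinitializing
	cacheLagging                    // watch healthy but behind the latest etcd revision
)

var errRetryLater = errors.New("watch cache reinitializing, retry later")

// serveConsistentList sketches the suggested middle ground: reject only when
// the cache is not ready, fall back to etcd only when it is merely lagging.
func serveConsistentList(state cacheState) (source string, err error) {
	switch state {
	case cacheReady:
		return "watch cache", nil // normal consistent list from the cache
	case cacheNotReady:
		// Planned disruption: fail fast so a burst of expensive etcd
		// lists cannot blow the memory budget (client gets 429 + Retry-After).
		return "", errRetryLater
	case cacheLagging:
		// Unplanned lag (e.g. thundering herd of writes): fall back so
		// consistent lists remain possible for debugging and recovery.
		return "etcd", nil
	default:
		return "", errors.New("unknown cache state")
	}
}

func main() {
	for _, s := range []cacheState{cacheReady, cacheNotReady, cacheLagging} {
		src, err := serveConsistentList(s)
		fmt.Printf("state=%d source=%q err=%v\n", s, src, err)
	}
}
```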

Contributor Author

@serathius serathius May 23, 2025


> I'm not saying there is no risk due to APF. I'm saying that you're completely ignoring the other side.

I'm prioritizing APF because I'm not comfortable leaving a loophole. Overall, I think we should be stricter about memory management in the apiserver, and by leaving such a fallback in place we will never be able to keep apiserver memory in check. I have seen cases where apiserver memory grew from 10GB to over 400GB in seconds. Such cases are worse than the one you described, because they lead to OOMs and full unavailability; there is nothing left to debug. I agree with the original proposal that unavailability of consistent reads has less impact than cascading failure.

Member


> I agree with the original proposal that unavailability of consistent reads has less impact than cascading failure.

What has changed since then is that there was supposed to be a way to force a list from etcd, so you had a way to list in case of issues. With no explicit way to request a list from etcd, we have lost that.

I'm not comfortable with the APF issue either, but I know what the first AI (action item) would be if we faced the incident I described: "add fallback". The fallback isn't great, but I think we need some solution rather than just ignoring the problem.

Contributor Author


OK, I re-analyzed the fallbacks again. It looks like upgrades are not the main trigger, and it's hard to identify a single cause. Based on that, I think we should just graduate what we have, since it works well, and leave further revisions for the future.
