Graduate KEP-2340 to Stable #5330
Conversation
Thank you for the detailed qualification notes.

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, serathius

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/cc @liggitt
The PRR looks ok to me, but I'm not fully aligned with the part about removing the fallback.
> With qualification results showing that fallback is needed and we can go back to the original design.
> We should fail the requests and rely on rate limiting to prevent cascading failure. I.e. `Retry-After` HTTP header (for well-behaved clients) and [Apiserver Priority and Fairness](https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20190228-priority-and-fairness.md).
> The main reason for that is the added complexity and incorrect handling in APF that assumes that request cost doesn't change.
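For illustration, a minimal sketch of what the "fail and rely on rate limiting" behavior could look like at the HTTP level (hypothetical handler and names, not the actual apiserver code):

```go
package main

import (
	"fmt"
	"net/http"
)

// serveConsistentList is a made-up handler: when the watch cache cannot serve a
// consistent LIST, it rejects the request with 429 and a Retry-After header so
// well-behaved clients back off, instead of silently falling back to etcd.
func serveConsistentList(w http.ResponseWriter, r *http.Request, cacheReady bool, lagSeconds int) {
	if !cacheReady || lagSeconds > 3 {
		w.Header().Set("Retry-After", fmt.Sprintf("%d", 1+lagSeconds))
		http.Error(w, "watch cache is not yet consistent, retry later", http.StatusTooManyRequests)
		return
	}
	// ...serve the list from the watch cache here...
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/list", func(w http.ResponseWriter, r *http.Request) {
		serveConsistentList(w, r, true, 0) // pretend the cache is healthy
	})
	_ = http.ListenAndServe(":8080", nil)
}
```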
I'm actually not 100% convinced here. I agree that incorrect APF handling imposes some risk, but on the other hand, if the watchcache is lagging by 3s+ all the time, then after removing the fallback you will not be able to make a consistent list at all.
I would personally be more willing to take the APF risk here (especially since we haven't seen it materialize) and not expose us to the different risk of not being able to perform consistent lists at all.
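To make that failure mode concrete, a simplified sketch (made-up helpers and numbers, not the real cacher code) of why a permanently lagging cache plus no fallback means every consistent list just times out:

```go
package main

import (
	"errors"
	"fmt"
)

const pollsBeforeTimeout = 30 // stands in for the ~3s wait budget

// consistentList waits for the cache to reach the newest etcd revision; with
// no etcd fallback, a cache that never catches up turns into a hard failure.
func consistentList(cacheRev func() int64, etcdRev int64) error {
	for i := 0; i < pollsBeforeTimeout; i++ {
		if cacheRev() >= etcdRev {
			return nil // cache caught up; serve the list from the cache
		}
	}
	return errors.New("cache never caught up and there is no fallback: request fails")
}

func main() {
	lagging := func() int64 { return 950 } // cache stuck behind etcd revision 1000
	fmt.Println(consistentList(lagging, 1000))
}
```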
Also @liggitt
My perspective is that the APF handling presents a more immediate and critical threat. Fallbacks are predominantly observed during etcd upgrades, which break watches and force the watch cache to reinitialize. Requests caught mid-processing will fall back to etcd before the cache starts blocking requests (because it is reinitializing), with their cost estimate still based on the cache.
Consider a scenario where a significant number of these requests fall back during an upgrade. If these requests, instead of being properly deferred by APF, cause the memory usage of LIST requests to surge to hundreds of megabytes each, it could easily trigger a massive memory consumption spike. I've personally identified cases with over 100 such fallbacks to etcd. Upgrades are already disruptive; the fallback risks further instability due to OOMs.
Therefore, I advocate that we prioritize APF safety and API server availability during planned operations like upgrades, rather than accepting the potential for degraded LIST availability.
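A back-of-the-envelope sketch of the mismatch being described; the numbers below are illustrative assumptions based on the orders of magnitude mentioned in this thread, not measurements:

```go
package main

import "fmt"

func main() {
	const (
		estimatedPerListMB = 1   // cost APF assumes when the LIST is expected to be served from the cache
		actualPerListMB    = 300 // cost of the same LIST decoded from etcd (hundreds of MB each)
		fallbacks          = 100 // fallbacks observed during a single upgrade
	)
	// APF admitted the requests based on the small estimate, but memory is
	// consumed at the much larger actual cost, producing a sudden spike.
	fmt.Printf("APF budgeted ~%d MB, actual usage ~%d MB\n",
		fallbacks*estimatedPerListMB, fallbacks*actualPerListMB)
}
```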
I'm not saying there is no risk due to APF. I'm saying that you're completely ignoring the other side.
If my watchcache is lagging by 3s+ (and you can get into this state purely because of a thundering herd of writes), then you can no longer perform any consistent list. You can't debug, you can't reinitialize components.
I don't think we can just ignore that.
If you're worried about planned operations, I'm fairly sure there are other options available to us.
We can generally distinguish those cases, because then the watch breaks and the watchcache switches to the not-ready state.
So when the watchcache is not ready, we could fail the request instead of falling back.
Or some variation of it.
I just don't think we can remove the fallback in every case.
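A rough sketch of that distinction (made-up types and helper, not the actual cacher code): fail consistent lists while the watchcache is not ready, but keep the etcd fallback when it is ready yet lagging:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type watchCache struct {
	ready     bool          // false while the cache reinitializes after a broken watch
	lagBehind time.Duration // how far the cache is behind the newest etcd revision
}

var errRetryLater = errors.New("watch cache reinitializing, retry later")

// decideConsistentList picks how to handle a consistent LIST request.
func decideConsistentList(c watchCache, maxWait time.Duration) (string, error) {
	if !c.ready {
		// Planned disruption (e.g. etcd upgrade broke the watch): fail fast
		// rather than stampede etcd with expensive fallback LISTs.
		return "", errRetryLater
	}
	if c.lagBehind > maxWait {
		// Cache is healthy but behind: fall back to etcd so consistent reads
		// stay available for debugging and component reinitialization.
		return "serve from etcd", nil
	}
	return "wait for cache, then serve from cache", nil
}

func main() {
	fmt.Println(decideConsistentList(watchCache{ready: true, lagBehind: 5 * time.Second}, 3*time.Second))
	fmt.Println(decideConsistentList(watchCache{ready: false}, 3*time.Second))
}
```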
> I'm not saying there is no risk due to APF. I'm saying that you're completely ignoring the other side.
I'm prioritizing APF because I'm not comfortable leaving a loophole. Overall, I think we should be stricter about memory management in the apiserver, and by leaving such a fallback in place we will never be able to keep apiserver memory in check. I have seen cases where apiserver memory grew from 10GB to over 400GB in seconds. Such cases are worse than the one you described, because they lead to OOMs and full unavailability; there is nothing left to debug. I agree with the original proposal that unavailability of consistent reads has less impact than a cascading failure.
> I agree with the original proposal that unavailability of consistent reads has less impact than a cascading failure.
What changed since then is that there was supposed to be an explicit way to force a list from etcd, so you had a way to list in case of issues.
With no explicit way to request a list from etcd, we lost that.
I'm not comfortable with the APF issue either, but I know what the first action item would be if we faced the incident I described: "add fallback". The fallback isn't great, but I think we need some solution rather than just ignoring the problem.
Ok, I reanalysed the fallback again. It looks like upgrades are not the main trigger; it's hard to identify a single cause. Based on that, I think we should just graduate what we have, since it works well. We can leave further revisions for the future.
Ref #2340
/assign @jpbetz @wojtek-t @deads2k