Kubernetes API Priority & Fairness
With growing adoption of Kubernetes in the industry, there has been a steep rise in the number of custom controllers deployed to clusters. A lack of proper resiliency measures in these custom controllers can lead to unregulated traffic and, in turn, cluster instability.
There was a need to prioritise a given set of traffic across all the different clients without starving critical control-plane traffic.
This GitHub issue essentially captures the situation described above.
Typical traffic handled by a Kubernetes API server
- The Kubernetes API server issues requests to itself even when there are no client requests at all, for example when it has to fetch the status of the cluster.
- Kubernetes provides extension knobs that kick in while the API server is serving an incoming request. Such webhooks (admission/mutating) can spawn requests back to the kube API server.
- Controller bugs (unpaginated list requests, or retry storms against failed requests) can result in request loops against the API server.
- Node daemons, such as the kubelet running on every node, which continuously talk to the API server.
Any of these has tremendous potential to overwhelm the API server, which can lead to cluster instability.
What are the available options to tackle this situation?
- At the source (client side) — Kubernetes client-go provides a rate limiter to tackle the situation; however, client-side rate limiting has an inherent shortcoming: a client can simply choose to opt out of it.
- At the gateway (server side) — Kubernetes releases prior to 1.20 had a simple mechanism for protecting the API server against CPU & memory overloads -
- --max-requests-inflight (default: 400) — maximum number of non-mutating requests in flight at a given time.
- --max-mutating-requests-inflight (default: 200) — maximum number of mutating requests in flight at a given time.
- --min-request-timeout (default: 1800) — minimum number of seconds a handler must keep a request open before timing it out (currently only honoured by the watch request handler).
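As an illustration, these limits are set as flags on the kube-apiserver binary (the values shown here are just the defaults):

```
kube-apiserver \
  --max-requests-inflight=400 \
  --max-mutating-requests-inflight=200 \
  --min-request-timeout=1800 \
  ...
```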
These command-line flags regulated the total volume of inbound requests. Apart from the distinction between mutating and read-only, no other distinctions were made among requests; consequently, one subset of the request load could crowd out another, still overwhelming the server.
So, clearly, these guard rails were not enough!
API Priority & Fairness Introduction
Due to the shortcomings explained in the previous sections, APF (API Priority & Fairness) was eventually enabled by default in Kubernetes with release 1.20. It introduces a flow-control mechanism that allows platform owners to define API-level policies to regulate inbound requests to the API server.
By classifying API requests into flows & priorities, APF manages and throttles all inbound requests in a prioritised and fair manner.
Every request is matched against exactly one flow schema, which assigns the request to a priority level.
- When requests of a priority level are being throttled, requests of other priority levels remain unaffected.
- To further enforce fairness among requests of a priority level, the matching flow schema associates requests with flows, where requests originating from the same source are assigned the same flow distinguisher.
Below is an excerpt of the defaults available -
Among others, there is a system-leader-election FlowSchema, which is associated with the leader-election PriorityLevelConfiguration.
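For illustration, here is a trimmed-down sketch of what the system-leader-election FlowSchema looks like (fields abbreviated; run `kubectl get flowschema system-leader-election -o yaml` on your cluster for the exact definition):

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: FlowSchema
metadata:
  name: system-leader-election
spec:
  distinguisherMethod:
    type: ByUser                  # requests from the same user share a flow
  matchingPrecedence: 100         # lower value = matched first
  priorityLevelConfiguration:
    name: leader-election         # the priority level this schema feeds into
  rules:
  - subjects:                     # who this schema applies to
    - kind: User
      user:
        name: system:kube-controller-manager
    - kind: User
      user:
        name: system:kube-scheduler
    resourceRules:                # which requests it matches
    - apiGroups: ["coordination.k8s.io"]
      resources: ["leases"]
      namespaces: ["*"]
      verbs: ["get", "create", "update"]
```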
The rules describe the list of criteria used to identify matching requests. The flow schema matches a request if and only if:
- At least one of its subjects matches the subject making the request and
- At least one of its resourceRules or nonResourceRules matches the verb and (non-)resource being requested
The distinguisherMethod defines how the flow distinguishers are computed:
- ByUser, where requests originating from the same subject are grouped into the same flow so that users can’t overwhelm each other
- ByNamespace, where requests originating from the same namespace are grouped into the same flow so that workloads in one namespace can’t overwhelm those in other namespaces
- An empty (unset) value, where all requests are grouped into a single flow
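The three cases above can be sketched conceptually as follows (this is illustrative Python, not the actual apiserver code):

```python
def flow_distinguisher(method: str, username: str, namespace: str) -> str:
    """Conceptual sketch of deriving a flow distinguisher from a request,
    per the distinguisherMethod of the matched flow schema."""
    if method == "ByUser":
        return username    # all of this user's requests share one flow
    if method == "ByNamespace":
        return namespace   # all requests from this namespace share one flow
    return ""              # unset: every request falls into a single flow

# e.g. two controllers in different namespaces land in different flows
print(flow_distinguisher("ByNamespace", "system:serviceaccount:a:ctl", "team-a"))
```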
When matching requests, a flow schema with a lower matchingPrecedence has higher precedence than one with a higher matchingPrecedence.
The priorityLevelConfiguration refers to the priority level configuration resource that specifies the flow control attributes -
- The limited.assuredConcurrencyShares defines the concurrency shares used to calculate the assured concurrency value
- The limited.limitResponse defines the strategy to handle requests that can’t be executed immediately.
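Per the APF design, a priority level's assured concurrency value is its proportional share of the server's total concurrency limit (the sum of the two inflight flags described earlier). A small sketch, assuming a 600-request server limit and three hypothetical levels:

```python
import math

def assured_concurrency(shares: int, all_shares: list[int], server_limit: int) -> int:
    """Sketch of the APF formula: each priority level gets
    ceil(server_limit * its_shares / sum_of_all_levels_shares)."""
    return math.ceil(server_limit * shares / sum(all_shares))

# server_limit = 400 (read-only) + 200 (mutating) = 600
levels = [100, 30, 10]  # hypothetical assuredConcurrencyShares of three levels
print(assured_concurrency(100, levels, 600))  # -> 429
```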
limited.limitResponse.type supports two values:
- Queue, where requests are queued. The queueing behaviour can be further configured by adjusting the properties under limited.limitResponse.queuing (queues, queueLengthLimit, handSize).
- Reject where requests are dropped with an HTTP 429 error
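Putting it together, a PriorityLevelConfiguration with a Queue limit response could look like the following sketch (the name and values here are hypothetical, chosen for illustration):

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: PriorityLevelConfiguration
metadata:
  name: example-level            # hypothetical name
spec:
  type: Limited                  # subject to APF limits (as opposed to Exempt)
  limited:
    assuredConcurrencyShares: 30 # share of the server's concurrency limit
    limitResponse:
      type: Queue                # queue excess requests instead of rejecting
      queuing:
        queues: 64               # number of queues for this priority level
        queueLengthLimit: 50     # max waiting requests per queue
        handSize: 8              # candidate queues per flow (shuffle sharding)
```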