In-Place Resource Resize for Kubernetes Pods

Author: Vinay Kulkarni

If you have deployed Kubernetes pods with CPU and/or memory resources specified, you may have noticed that changing the resource values involves restarting the pod. This has been a disruptive operation for running workloads... until now.

In Kubernetes v1.27, we have added a new alpha feature that allows users to resize CPU/memory resources allocated to pods without restarting the containers. To facilitate this, the resources field in a pod's containers now allow mutation for cpu and memory resources. They can be changed simply by patching the running pod spec.

This also means that resources field in the pod spec can no longer be relied upon as an indicator of the pod's actual resources. Monitoring tools and other such applications must now look at new fields in the pod's status. Kubernetes queries the actual CPU and memory requests and limits enforced on the running containers via a CRI (Container Runtime Interface) API call to the runtime, such as containerd, which is responsible for running the containers. The response from container runtime is reflected in the pod's status.

In addition, a new restartPolicy for resize has been added. It gives users control over how their containers are handled when resources are resized.

What's new in Kubernetes v1.27?

Besides the addition of restart policy for resize in the pod's spec, three new fields have been added to the pod's status.

allocatedResources field in the container's statuses reflects the node resources allocated to the pod's containers.
resources field in the container's statuses reflects the actual resources (requests and limits) configured on running containers as reported by container runtime.
resize field in pod's status shows status of the last requested resize operation.
- A value of Proposed is an acknowledgement of the requested resize and indicates that request was validated and recorded.
- A value of InProgress indicates that the node has accepted the resize request and is in the process of applying the resize request to the pod's containers.
- A value of Deferred means that the requested resize cannot be granted at this time, and the node will keep retrying. The resize may be granted when other pods leave and free up node resources.
- A value of Infeasible is a signal that node cannot accommodate the requested resize. This can happen if the requested resize exceeds the maximum resources the node can ever allocate for a pod.

When to use this feature

Here are a few examples where this feature may be useful:

Pod is running on node but with either too much or too little resources.
Pods are not being scheduled due to lack of sufficient CPU or memory in a cluster that is underutilized by overprovisioned running pods.
Evicting certain stateful pods that are in need of more resources to schedule them on bigger nodes is an expensive or disruptive operation when there exist other lower priority pods on that node which can be either be resized down or moved to make room for stateful pods.

My goal for this feature, besides lowering the cost of running Kubernetes workloads, is to see a tangible reduction in the carbon footprint of such workloads.

How to use this feature

In order to use this feature in v1.27, the InPlacePodVerticalScaling feature gate must be enabled. A local cluster with this feature enabled can be started as shown below:


FEATURE_GATES=InPlacePodVerticalScaling=true ./hack/local-up-cluster.sh

Once the local cluster is up and running, Kubernetes users can schedule pods with resources and resize the pods via kubectl. An example of how to use this feature is illustrated in the following demo video.

Additionally, Karla Saur has written a great blog post that illustrates how this feature can be used with minikube.

Example Use Cases

Cloud-based Development Environment

In this scenario, developers or development teams write their code locally but build and test their code in Kubernetes pods with consistent configs that reflect production use. Such pods need minimal resources when the developers are writing code, but need significantly more CPU and memory when they build the code or run a battery of tests. This use case can leverage in-place pod resize feature (with a little help from eBPF) to quickly resize pod's resources and avoid kernel OOM (out of memory) killer from terminating their processes.

The below KubeCon North America 2022 conference talk illustrates this use case.

Java processes initialization CPU requirements

Some Java applications may need significantly more CPU during initialization than what is needed during normal process operation time. If such applications specify CPU requests and limits suited for normal operation, they may suffer from very long startup times. Such pods can request higher CPU values at the time of pod creation, and can be resized down to normal running needs once the application has finished initializing.

Known Issues

This feature enters Kubernetes v1.27 at alpha stage. Below are a few known issues users may encounter:

containerd versions below v1.6.9 do not have the CRI support needed for full end-to-end operation of this feature. Attempts to resize pods will appear to be stuck in the InProgress state, and resources field in the pod's status are never updated even though the new resources may have been enacted on the running containers.
Pod resize may encounter a race condition with other pod updates, causing delayed enactment of pod resize.
Reflecting the resized container resources in pod's status may take a long time.
Static CPU management policy is not supported with this feature.

Credits

This feature is a result of the efforts of a very collaborative Kubernetes community. Here's a little shoutout to just a few of the many many people that contributed countless hours of their time and helped make this happen.

@thockin for detail-oriented API design and air-tight code reviews.
@derekwaynecarr for simplifying the design and thorough API and node reviews.
@dchen1107 for bringing vast knowledge from Borg and helping us avoid pitfalls.
@ruiwen-zhao for adding containerd support that enabled full E2E implementation.
@wangchen615 for implementing comprehensive E2E tests and driving scheduler fixes.
@bobbypage for invaluable help getting CI ready and quickly investigating issues, covering for me on my vacation.
@Random-Liu for thorough kubelet reviews and identifying problematic race conditions.
@Huang-Wei, @ahg-g, @alculquicondor for helping get scheduler changes done.
@mikebrow @marosset for reviews on short notice that helped CRI changes make it into v1.25.
@endocrimes, @ehashman for helping ensure that the oft-overlooked tests are in good shape.
@mrunalp for reviewing cgroupv2 changes and ensuring clean handling of v1 vs v2.
@liggitt, @gjkim42 for tracking down, root-causing important missed issues post-merge.
@SergeyKanzhelev for supporting and shepherding various issues during the home stretch.
@pdgetrf for making the first prototype a reality.
@dashpole for bringing me up to speed on 'the Kubernetes way' of doing things.
@bsalamat, @kgolab for very thoughtful insights and suggestions in the early stages.
@sftim, @tengqm for ensuring docs are easy to follow.
@dims for being omnipresent and helping make merges happen at critical hours.
Release teams for ensuring that the project stayed healthy through this major change.

And finally, a BIG thanks to my very supportive management Dr. Xiaoning Ding and Dr. Ying Xiong for their patience and encouragement.

References

In-place pod resize feature design proposal - the KEP document.
In-place pod resize enhancement tracking issue.
Implementation of in-place pod resize feature - the main pull request.