Kubernetes Checkpointing — A Definitive Guide!

Ensuring Reliable and Efficient Workloads in Your Cluster!

Published in

FAUN — Developer Community 🐾

Apr 22, 2023

Checkpointing is a technique for ensuring that applications can recover from failures and maintain their state. It captures the state of a running process, including its memory, file descriptors, and other metadata. This information is stored as a checkpoint, which can later be used to resume the process from the same point in time, allowing for recovery from failures or migration between hosts.

Kubernetes introduced support for checkpointing in its 1.25 release as an alpha feature. In this post, we’ll explore the concept of Kubernetes checkpointing, its benefits, and how you can leverage it to improve your application’s fault tolerance.

CRIU — at a glance!

To implement Kubernetes checkpointing, you’ll need to use a container runtime that supports CRIU (Checkpoint/Restore in Userspace).

In simple terms, CRIU takes a snapshot of a program while it’s running so that it can be resumed later, just like you might pause and resume a video or a video game.

CRIU checkpoint high level workflow

Freeze the process - CRIU temporarily stops the target process to ensure a consistent snapshot. Using the PTRACE interface, CRIU takes control of the process and pauses it. (CRIU always operates on a process tree: one process and all of its child processes.)
Collect process information - In the next step, code is injected via PTRACE in the paused process. CRIU calls this code parasite-code. The parasite-code then runs from within the process’s address space and can access and dump/save/checkpoint the memory content of the process. CRIU gathers information about the process, such as its process ID (PID), parent process ID (PPID), process group ID (PGID), and other metadata.
Save memory - CRIU scans the process’s memory regions and saves them in the checkpoint image. It can also save memory in an incremental manner to reduce the checkpoint size, by tracking memory changes since the last checkpoint.
Save file descriptors - CRIU collects information about open file descriptors, such as files, pipes, and sockets, and saves them to the checkpoint image.
Save process hierarchy - If the process has child processes or threads, CRIU saves information about their state, hierarchy, and relationships.
Save CPU state - CRIU saves the CPU state, including registers and instruction pointers, allowing the process to resume execution from the same point upon restoration.
Save other resources - CRIU captures additional resources associated with the process, such as timers, signals, and process credentials.
Unfreeze the process - Once the checkpoint is complete, CRIU resumes the process by sending a CONT signal, allowing it to continue running.
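Much of the state listed above is visible under /proc on a Linux host, and CRIU itself relies heavily on /proc to collect it. The following is a minimal sketch, not CRIU's implementation: describe_process is a hypothetical helper that only counts what a checkpoint would have to capture for a single process.

```shell
#!/bin/sh
# Summarise the per-process state a checkpoint has to capture (Linux only).
# describe_process is a hypothetical helper, not part of CRIU.
describe_process() {
    pid="$1"
    # PID and parent PID are recorded in /proc/<pid>/status.
    awk '/^(Pid|PPid):/ { print tolower($1), $2 }' "/proc/$pid/status"
    # Every entry in /proc/<pid>/fd is one open file descriptor.
    echo "open_fds: $(ls "/proc/$pid/fd" | wc -l)"
    # Every line in /proc/<pid>/maps is one mapped memory region.
    echo "memory_regions: $(wc -l < "/proc/$pid/maps")"
}

# Inspect the current shell as an example target.
describe_process "$$"
```

Running it against a larger process, such as a JVM or a web server, gives a rough feel for how many memory regions and descriptors a checkpoint image has to describe.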

CRIU restore high level workflow

Preparation - CRIU first sets up the environment for restoring the process. This includes creating a new process with a unique process ID (PID) and setting up resources such as namespaces and file descriptors.
Memory restoration - The saved memory snapshot is mapped back into the new process’s memory space. This includes restoring memory regions, shared memory, and any other memory-related data.
File descriptors - The file descriptors that were open during the checkpoint are restored. This includes regular files, pipes, sockets, and other types of file descriptors.
Process hierarchy - If the checkpointed process had any child processes or threads, CRIU recreates them and restores their state.
CPU state - CRIU restores the CPU state, including registers and instruction pointers, so the process can continue executing from the point at which it was checkpointed.
Other resources - Additional resources such as timers, signals, and process credentials are restored to match the state at the time of the checkpoint.
Finalising - After all the resources are restored, CRIU unblocks the process, allowing it to continue running from where it left off.
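Outside of any container runtime, the two workflows map onto CRIU's dump and restore commands. The sketch below only assembles the command lines so they can be reviewed before being run as root. The helper names, the PID 4242, and the /tmp/ckpt directory are placeholders, and --shell-job is only needed when the target was started from an interactive shell.

```shell
#!/bin/sh
# Assemble (without executing) the criu command lines for a checkpoint/restore round trip.
# The helper names are hypothetical; the flags are standard criu options.

criu_dump_cmd() {
    # $1: PID at the root of the process tree, $2: directory for the image files.
    # By default criu dump terminates the process tree once the images are written.
    printf 'criu dump --tree %s --images-dir %s --shell-job\n' "$1" "$2"
}

criu_restore_cmd() {
    # $1: directory that holds the image files written by the dump.
    printf 'criu restore --images-dir %s --shell-job\n' "$1"
}

criu_dump_cmd 4242 /tmp/ckpt
criu_restore_cmd /tmp/ckpt
```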

CRIU with Docker

Docker has experimental support for CRIU (Checkpoint/Restore in Userspace), allowing you to checkpoint and restore running containers. To use it, you must enable Docker's experimental features.

Workflow of CRIU with Docker

Install CRIU - First, ensure that you have CRIU installed on your system. The installation process varies depending on your distribution, but you can find detailed instructions in the official CRIU documentation: https://criu.org/Installation
Enable experimental features in Docker - You need to enable experimental features in Docker to access the CRIU-related functionality. For Docker Desktop, you can enable experimental features in the settings. For Docker Engine, you can enable experimental features by editing the daemon.json configuration file and adding "experimental": true. Make sure to restart the Docker service after making the changes.

sanjit@sanjit-virtual-machine:~$ sudo nano /etc/docker/daemon.json

sanjit@sanjit-virtual-machine:~$ cat /etc/docker/daemon.json 
{
"experimental": true
}

sanjit@sanjit-virtual-machine:~$ sudo systemctl restart docker

Use the checkpoint and restore commands - Once you have CRIU installed and Docker’s experimental features enabled, you can use the following commands to checkpoint and restore containers:

Create a checkpoint - To create a checkpoint of a running container, use the docker checkpoint create command, followed by the container name or ID and a checkpoint name:


sanjit@sanjit-virtual-machine:~$ sudo docker run -d --name nginx nginx

sanjit@sanjit-virtual-machine:~$ sudo docker ps -a
CONTAINER ID   IMAGE         COMMAND                  CREATED          STATUS                      PORTS     NAMES
2bb5121b7ecf   nginx         "/docker-entrypoint.…"   31 seconds ago   Up 30 seconds               80/tcp    nginx

sanjit@sanjit-virtual-machine:~$ sudo docker checkpoint create nginx nginx-checkpoint
nginx-checkpoint

List checkpoints - You can list checkpoints for a specific container using the docker checkpoint ls command:

sanjit@sanjit-virtual-machine:~$ sudo docker checkpoint ls nginx
CHECKPOINT NAME
nginx-checkpoint

Restore a container from a checkpoint - By default, docker checkpoint create stops the container once the checkpoint has been written. To restore it, make sure the original container is stopped (not removed), and then use the docker start command with the --checkpoint option:

sanjit@sanjit-virtual-machine:~$ sudo docker start --checkpoint nginx-checkpoint nginx

sanjit@sanjit-virtual-machine:~$ sudo docker ps -a
CONTAINER ID   IMAGE         COMMAND                  CREATED          STATUS                      PORTS     NAMES
2bb5121b7ecf   nginx         "/docker-entrypoint.…"   2 minutes ago    Up 5 seconds                80/tcp    nginx

CRIU with Kubernetes

Kubernetes does not call CRIU (Checkpoint/Restore in Userspace) directly. The kubelet passes a checkpoint request to the container runtime, which hands it on to the low-level runtime and finally to CRIU. Because of these layers, the integration is not as seamless as it would be with a single container runtime, and you will need to perform some manual steps to manage checkpoints and restore containers.

Multi-layer complexity with Kubernetes Checkpoint Restore

Note also that, at the time of writing, checkpointing is an alpha-level feature in both CRI-O and Kubernetes, and its security implications are still under consideration.

Prerequisites

Make sure the following prerequisites are met before proceeding:

1. The feature is behind a feature gate, so enable the ContainerCheckpoint feature gate before trying to use it.

2. A v1.25+ Kubernetes cluster and a container runtime that supports container checkpointing:

   containerd - support is currently under discussion. See containerd pull request #6965 for more details.
   CRI-O - v1.25 has support for container checkpointing.

3. To use checkpointing in combination with CRI-O, the runtime needs to be started with the command-line option --enable-criu-support=true.

4. To restore a container back into a Pod, CRI-O must also be started with --drop-infra-ctr=false.

5. As the checkpointing functionality is provided by CRIU, it is also necessary to install CRIU. Usually runc or crun depend on CRIU, so it is installed automatically.
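As a sketch of the first prerequisite, the feature gate can be set in a kubelet configuration file. The helper name and the output path below are placeholders for illustration. On a real node you would merge the featureGates entry into the existing kubelet configuration rather than replace it, and restart the kubelet afterwards.

```shell
#!/bin/sh
# Write a minimal KubeletConfiguration fragment that enables the ContainerCheckpoint gate.
# write_checkpoint_gate is a hypothetical helper; the target path is a placeholder.
write_checkpoint_gate() {
    cat > "$1" <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  ContainerCheckpoint: true
EOF
}

fragment="${TMPDIR:-/tmp}/kubelet-checkpoint-gate.yaml"
write_checkpoint_gate "$fragment"
cat "$fragment"
```

If the kubelet is configured through command-line flags instead, the equivalent setting is --feature-gates=ContainerCheckpoint=true.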

Checkpointing

Once containers and pods are running it is possible to create a checkpoint. Checkpointing is currently only exposed on the kubelet level. Triggering this kubelet API will request the creation of a checkpoint from CRI-O. CRI-O requests a checkpoint from your low-level runtime (for example, runc). Seeing that request, runc invokes the criu tool to do the actual checkpointing.

Once the checkpointing has finished the checkpoint should be available at /var/lib/kubelet/checkpoints/checkpoint-<pod-name>_<namespace-name>-<container-name>-<timestamp>.tar

You could then use that tar archive to restore the container somewhere else.
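Because each archive name ends in a timestamp, a container that is checkpointed repeatedly leaves several archives behind. A small helper such as the hypothetical latest_checkpoint below can pick the newest one. It assumes the naming scheme shown above and takes the directory as an argument, so it can be tried on a scratch directory first.

```shell
#!/bin/sh
# Print the most recently modified checkpoint archive for one container.
# latest_checkpoint is a hypothetical helper, not part of Kubernetes.
latest_checkpoint() {
    dir="$1"; pod="$2"; ns="$3"; ctr="$4"
    # ls -t sorts newest first; the glob matches any timestamp suffix.
    ls -t "$dir"/checkpoint-"${pod}_${ns}-${ctr}"-*.tar 2>/dev/null | head -n 1
}

# Example: newest archive for container "counter" in pod "counters", namespace "default".
# Prints nothing if no matching archive exists.
latest_checkpoint /var/lib/kubelet/checkpoints counters default counter
```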

To checkpoint a container, you can run curl on the node where that container is running, and trigger a checkpoint:

$ curl -X POST "https://localhost:10250/checkpoint/namespace/podId/container"

For a container named counter in a pod named counters in a namespace named default, the request looks like this:

$ curl -X POST "https://localhost:10250/checkpoint/default/counters/counter"
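The kubelet API normally requires authentication, so a bare curl call is likely to be rejected. The sketch below builds the endpoint URL from the namespace, pod, and container names and wraps the request in a function that passes a client certificate. The function names are hypothetical, and the CERT and KEY variables are assumptions: where a suitable client certificate lives depends on how your cluster was set up.

```shell
#!/bin/sh
# Build the kubelet checkpoint endpoint for a container.
# Both helpers are hypothetical; port 10250 is the kubelet port used in the text above.
checkpoint_url() {
    # $1: namespace, $2: pod name, $3: container name
    printf 'https://localhost:10250/checkpoint/%s/%s/%s\n' "$1" "$2" "$3"
}

# Trigger a checkpoint. CERT and KEY must point at a client certificate the kubelet trusts.
# --insecure skips verification of the kubelet's serving certificate; drop it if you can
# supply the cluster CA with --cacert instead.
trigger_checkpoint() {
    curl --silent --show-error --insecure -X POST \
        --cert "${CERT:?set CERT to a client certificate}" \
        --key "${KEY:?set KEY to the matching private key}" \
        "$(checkpoint_url "$1" "$2" "$3")"
}

# Print the URL for the example from the text; call trigger_checkpoint on the node itself.
checkpoint_url default counters counter
```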

Restoring

To restore the previously checkpointed container directly in Kubernetes, it is necessary to convert the checkpoint archive into an image that can be pushed to a registry.

One possible way to convert the local checkpoint archive uses buildah and consists of the following steps:

$ newcontainer=$(buildah from scratch)
$ buildah add $newcontainer /var/lib/kubelet/checkpoints/checkpoint-<pod-name>_<namespace-name>-<container-name>-<timestamp>.tar /
$ buildah config --annotation=io.kubernetes.cri-o.annotations.checkpoint.name=<container-name> $newcontainer
$ buildah commit $newcontainer checkpoint-image:latest
$ buildah rm $newcontainer

The resulting image is not standardised and only works in combination with CRI-O; please consider the format pre-alpha. There are ongoing discussions to standardise the format of checkpoint images like this. Remember that this image format only works if CRI-O has been started with --enable-criu-support=true. The security implications of starting CRI-O with CRIU support are not yet clear, so both the functionality and the image format should be used with care.

Now, you’ll need to push that image to a container image registry. For example:

$ buildah push localhost/checkpoint-image:latest container-image-registry.example/user/checkpoint-image:latest

To restore this checkpoint image (container-image-registry.example/user/checkpoint-image:latest), the image needs to be listed in the specification for a Pod. Here's an example manifest:

apiVersion: v1
kind: Pod
metadata:
  generateName: example-
spec:
  containers:
  - name: <container-name>
    image: container-image-registry.example/user/checkpoint-image:latest
  nodeName: <destination-node>

Kubernetes places the new Pod on a node. The kubelet on that node instructs the container runtime (CRI-O in this example) to create and start a container based on the image specified as registry/user/checkpoint-image:latest. CRI-O detects that registry/user/checkpoint-image:latest is a reference to checkpoint data rather than a regular container image. Then, instead of the usual steps to create and start a container, CRI-O fetches the checkpoint data and restores the container from that checkpoint.
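To avoid editing the manifest by hand for every restore, the placeholders can be filled in by a small script. This is only a sketch: render_restore_pod is a hypothetical helper and node-2 is a made-up node name. It uses generateName so that each restore gets a unique Pod name, which means the output has to be submitted with kubectl create rather than kubectl apply.

```shell
#!/bin/sh
# Print a Pod manifest that restores a container from a checkpoint image.
# render_restore_pod is a hypothetical helper.
render_restore_pod() {
    # $1: container name, $2: checkpoint image reference, $3: destination node
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  generateName: example-
spec:
  containers:
  - name: $1
    image: $2
  nodeName: $3
EOF
}

render_restore_pod counter container-image-registry.example/user/checkpoint-image:latest node-2
```

Piping the output into kubectl create -f - submits the Pod to the cluster.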

Conclusion

Kubernetes checkpointing is a powerful tool for enhancing the fault tolerance and resiliency of your containerised applications. It is still an alpha feature and will change as friction in its adoption is reduced, but the capability it brings is already clear. By implementing a well-planned checkpointing strategy, you can minimise downtime, improve resource usage, and simplify application migration.

To conclude, here are some of the benefits of Kubernetes checkpointing and some recommended practices for using it.

Benefits

Enhanced fault tolerance - Checkpointing enables applications to recover from failures by resuming from the last known checkpoint. This reduces downtime and helps ensure that your application remains highly available.
Simplified migration - Checkpointing makes it easier to move running applications between hosts. By saving the application’s state, you can migrate it to a different node without losing progress or causing disruptions.
Faster startup - If your application takes a long time to warm up, restoring it from a checkpoint taken after the warm-up can substantially reduce its startup time.
Improved scaling - With checkpointing, you can more easily adapt your application to fluctuating demand. If a node becomes overloaded, you can migrate the application to another node with more resources.
Efficient resource usage - Checkpointing allows you to suspend long-running applications, freeing up resources for other tasks. When the application is needed again, it can be resumed from its checkpoint, so no progress is lost.

Best Practices for Kubernetes Checkpointing

Regularly create checkpoints - To minimise data loss in the event of a failure, create checkpoints at regular intervals, depending on your application’s requirements.
Monitor and manage resources - Checkpointing can consume significant system resources, particularly memory. Monitor your cluster’s resource usage and adjust your checkpointing strategy as needed to avoid performance issues.
Test your checkpointing strategy - Regularly test your checkpointing process to ensure it works as expected and can recover your application in case of failures.
Automate checkpoint management - Use automation tools like cron jobs or Kubernetes operators to create and manage checkpoints on a predefined schedule, ensuring your application is always protected.
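One piece of checkpoint management that is easy to automate is cleaning up old archives so they do not fill the node's disk. The following sketch keeps the newest archives in a directory and deletes the rest. prune_checkpoints is a hypothetical helper, and it assumes that archive names contain no newlines.

```shell
#!/bin/sh
# Keep only the newest archives in a checkpoint directory.
# prune_checkpoints is a hypothetical helper; it could be run from cron on the node.
prune_checkpoints() {
    dir="$1"; keep="$2"
    # ls -t lists newest first; tail skips the first $keep entries and passes on the rest.
    ls -t "$dir"/checkpoint-*.tar 2>/dev/null |
        tail -n +"$((keep + 1))" |
        while IFS= read -r archive; do
            rm -f -- "$archive"
        done
}

# Example (not executed here): keep the five newest archives.
# prune_checkpoints /var/lib/kubelet/checkpoints 5
```

Try it on a copy of the checkpoint directory first, since deleted archives cannot be recovered.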