checkpoint rootfs diff is wasteful #24826
@adrianreber Any thoughts?
@hanwen-flow Sounds like a good idea. You should do a proof of concept and open a PR to verify it. Another thing which would be interesting to see is whether using … helps.

The main problem seems to be that we write the file-system changes twice. Thinking about your approach, we would still need to write them twice: once from the container to the temporary directory, and a second time while creating the final tar archive from the temporary directory.

Thinking more about it: if we write the content from the container to the …

If you find a way to write the data only once, that would be great. If you have an idea how to improve it, please open a PR.
Why? CRIU has to create files in a temp directory because it is a separate process, but podman can create a tar.Writer in Go for the final file directly, insert the CRIU snapshots (from the file-system temp dir), and read the changed files from the container directly. If you do this without compression, the whole process could use kernel file copying (assuming the linked Go proposal goes through). This requires dropping the gratuitous io.Pipe() calls that connect tar streams to output sinks.
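A minimal sketch of what this could look like, assuming a single tar.Writer on the final file; all paths and helper names here are made up for illustration, and real podman code would reuse its existing archive helpers:

```go
// Hypothetical sketch: write the final checkpoint archive with one
// tar.Writer, pulling changed files straight from the container's
// upper layer instead of staging a rootfs-diff.tar first.
package main

import (
	"archive/tar"
	"io"
	"log"
	"os"
	"path/filepath"
)

// addFile streams one changed file into the open archive. The size
// comes from Stat and is written in the header before the content.
func addFile(tw *tar.Writer, root, rel string) error {
	path := filepath.Join(root, rel)
	fi, err := os.Stat(path)
	if err != nil {
		return err
	}
	hdr, err := tar.FileInfoHeader(fi, "")
	if err != nil {
		return err
	}
	hdr.Name = filepath.ToSlash(filepath.Join("rootfs-diff", rel))
	if err := tw.WriteHeader(hdr); err != nil {
		return err
	}
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// Plain io.Copy, no io.Pipe: the bytes flow directly from the
	// container file into the output archive.
	_, err = io.Copy(tw, f)
	return err
}

func main() {
	out, err := os.Create("checkpoint.tar")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()
	tw := tar.NewWriter(out)
	defer tw.Close()
	// The changed files would be enumerated from the overlay upper dir.
	if err := addFile(tw, "/var/lib/containers/example-upper", "etc/motd"); err != nil {
		log.Fatal(err)
	}
}
```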
Re. compression: I observed that zstd compression on the final archive incurs 10% overhead. |
Good question.
There is a tool called criu-image-streamer which tries to avoid it. So it would be possible to also optimize this step.
Sounds great, please open a PR. From my point of view there is nothing speaking against having this.
Do we need to worry about backwards compat here? If the layout is changed, podman would be unable to load a checkpoint created on a previous version, or the other way around.
I had a quick look. That is a small tool, so it could be reimplemented in podman or one of its dependencies, I guess? Re. compatibility: we could maintain the tar-in-tar format and stream the embedded rootfs tar directly into the final tar file without going to disk.
A better place to re-implement image streaming would be go-criu. We use these bindings in Podman. We can propose this as a GSoC project for next year: https://www.criu.org/Google_Summer_of_Code_Ideas
Now I remember again: we can't do this. We need to know the size of the file in advance, because the metadata (including size) precedes the file content.
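This limitation is easy to demonstrate with Go's archive/tar: the Size field is fixed when the header is written, and writing past it fails:

```go
// Tiny illustration of why the size must be known up front: the tar
// header (including Size) precedes the content, and the writer
// enforces it.
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
)

func main() {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	hdr := &tar.Header{Name: "payload", Mode: 0o644, Size: 4} // size fixed here
	if err := tw.WriteHeader(hdr); err != nil {
		panic(err)
	}
	if _, err := tw.Write([]byte("12345")); err != nil {
		fmt.Println(err) // archive/tar: write too long
	}
	tw.Close()
}
```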
This yields a 2x speed-up on creating and restoring snapshots, as it avoids creating and extracting the temporary rootfs-diff.tar file.

Tested with the following recipe:

```
$ cat files.sh
#!/bin/sh
dd if=/dev/urandom of=test bs=1M count=100
for i in $(seq 1 10); do
  cp test test$i &
done
wait
rm /usr/bin/znew

$ sudo ~/vc/containers/podman/podman run -dt -p 8080:80/tcp docker.io/library/httpd; \
  sudo ~/vc/containers/podman/podman container cp files.sh $(sudo ~/vc/containers/podman/podman ps -l --format "{{.ID}}"):files.sh ; \
  sudo ~/vc/containers/podman/podman container exec -it -l /bin/sh -x files.sh ; \
  (cd ../containers/podman/ && go build -tags "selinux seccomp" ./cmd/podman) && \
  sudo ~/vc/containers/podman/podman container checkpoint -e foo.tar -l && \
  sudo ~/vc/containers/podman/podman container restore -i foo.tar --name restored
```

For productionizing, this would need:

* supporting (de)compression, which I left out for simplicity
* a decision on what to do with backward compatibility
* similar treatment of the devshm and volumes tar files
* a more principled approach to storing multiple sections in a single tar file?
* upstreaming changes to the archive package in the storage repo (including tests)

Addresses containers#24826
I created #24984, which seems to work according to the test I described in the commit message. Regarding format stability: we could reap some benefits without changing the format by decompressing rootfs-diff.tar directly from the checkpoint tar file. That would only need some extensions to the untar functions in the archive package. The decompression speed (i.e., restore speed) is more important to me than the checkpoint speed, so that is an acceptable compromise for me. I suspect we'll be running this on BTRFS or XFS, so we can take advantage of zero-copy (see golang/go#70807 (comment)), which yields a much more substantial speed-up. For that, it is important that there are no intermediate buffers in the tar handling.
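A sketch of what such an extension could look like, assuming the outer archive contains an entry literally named rootfs-diff.tar; the entry name, target path, and handling of non-regular files are all simplified for illustration:

```go
// Hypothetical sketch: while scanning the outer checkpoint tar, treat
// the rootfs-diff.tar entry as a nested tar stream and extract it in
// one pass, without materializing rootfs-diff.tar on disk.
package main

import (
	"archive/tar"
	"io"
	"log"
	"os"
	"path/filepath"
)

func restoreRootfsDiff(snapshot, mount string) error {
	f, err := os.Open(snapshot)
	if err != nil {
		return err
	}
	defer f.Close()
	outer := tar.NewReader(f)
	for {
		hdr, err := outer.Next()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if hdr.Name != "rootfs-diff.tar" {
			continue // other checkpoint entries handled elsewhere
		}
		// The outer tar.Reader is positioned at the entry body, so it
		// can be consumed directly as the inner tar stream.
		inner := tar.NewReader(outer)
		for {
			ih, err := inner.Next()
			if err == io.EOF {
				break
			}
			if err != nil {
				return err
			}
			if ih.Typeflag != tar.TypeReg {
				continue // dirs, symlinks, whiteouts omitted in this sketch
			}
			// Real code must sanitize ih.Name against path traversal.
			dst := filepath.Join(mount, ih.Name)
			if err := os.MkdirAll(filepath.Dir(dst), 0o755); err != nil {
				return err
			}
			w, err := os.Create(dst)
			if err != nil {
				return err
			}
			if _, err := io.Copy(w, inner); err != nil {
				w.Close()
				return err
			}
			w.Close()
		}
	}
}

func main() {
	if err := restoreRootfsDiff("snapshot.tar", "/path/to/overlay/merged"); err != nil {
		log.Fatal(err)
	}
}
```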
I tried prototyping this in https://github.com/hanwen-flow/podman/commits/hacktar/. However, this doesn't readily work: the rootfs diff is read from … Since podman can decide what the diff looks like, it should be possible to get at the files, which are under the …
Got it to work; see #25050. Given the speed-up of this approach (upwards of 10x), I think pursuing a measly 2x speed-up is not really worth it.
Sorry, but I am confused. Can you describe again what the difference between your two approaches is, and what the improvement is? I think I understand it, but I am not 100% sure. Somewhere you wrote you are waiting for feedback on how to continue. If you have an improvement, I would like to see it merged into Podman. Backwards compatibility is important, but it sounds like you found a way to solve it, right?
Re. CRIU-streamer: for my use case, I always want to save the memory snapshot onto disk. IIUC, using criu-streamer doesn't really help; the bottleneck will be in syncing the contents to stable storage. Once it is on stable storage, copying the data around is essentially free because of reflink copying.
Current situation (checkpoint): container file system → rootfs-diff.tar in a temporary directory → final checkpoint tar.

(restore): checkpoint tar → rootfs-diff.tar in a temporary directory → container file system.

Problem: on most file systems, the cost of copying files is proportional to file size, and we have to copy the files twice to handle rootfs-diff.tar.

Solution 1: use a CoW file system; then copying files with copy_file_range(2) is essentially free. This is #25050.

Solution 2: skip rootfs-diff.tar as an intermediate.

Solution 2a (fix checkpoint & restore): inline the rootfs diffs directly in snapshot.tar; this is an incompatible format change, and speeds up both checkpoint and restore. Implemented in #24984.

Solution 2b (fix restore): keep the format, but use a custom extraction: instead of extracting rootfs-diff.tar to disk, when rootfs-diff.tar is encountered while processing snapshot.tar, directly pipe it into a tar.Reader, extracting into the overlayfs mount.

Solution 1 is fully backward compatible and achieves far superior speeds (see the reflink sketch below). This solution has my preference, and the changes are modest enough that we can patch them locally. It only works on CoW filesystems (ZFS, XFS, BTRFS). It requires vendoring Go's archive/tar package until https://go-review.googlesource.com/c/go/+/642736 goes through. I am not sure how acceptable this is for the maintainers of the containers/storage repo.
@hanwen We are currently working on adding support for encrypting rootfs changes and integrating this feature with Podman and CRI-O:
Have you considered a use-case where the content of rootfs-diff.tar is compressed and encrypted? |
The user running podman has permission to see a running container's contents anyway, so encrypting the snapshot while it is on the machine running podman doesn't add any security. If the snapshot is shipped off somewhere, you could just encrypt the whole snapshot tar file, either on the sending side or on the receiving server, depending on who you trust more. So, I don't understand the threat model that you are addressing. In our case, we are not concerned with protecting the snapshots from prying eyes. The step that goes from unencrypted file system to encrypted stream cannot be done through reflink copies, of course, but if you encrypt the snapshot tar, you can still reap the benefits of free copying when you create the rootfs diff tar.
The main difference is that processes in a running container reside in system memory, while snapshots are saved to persistent storage. Any user with access to the storage can read sensitive data stored in snapshots (e.g., passwords, API tokens, etc.). Our goal is to introduce built-in support for end-to-end encryption, where checkpoint data is encrypted before it is written to disk and decrypted as part of the restore process. In addition, container checkpoints can be very large (e.g., hundreds of gigabytes). While encrypting the whole snapshot tar file would work, writing the data to disk as plaintext and then reading it back to encrypt it not only increases storage requirements and performance overhead, but also introduces security risks, as an attacker could compromise the data flow and potentially gain access to sensitive information during this operation. I hope this helps clarify the threat model.
Which users are those? The storage dir in podman is only readable by root. If you can read the storage (i.e., if you are root), all security guarantees are gone, as root can read secrets directly from /dev/kmem.
For rootfs, with reflink copies on a CoW filesystem, there is no storage overhead, as the blocks are shared between all copies, including the files that the container itself uses as its file system. For the memory image it is attractive to stream the data directly from memory into a network pipe (see e.g. criu-streamer) to avoid overhead, but attackers that can bypass access controls on the FS can probably find other ways to get at your sensitive data, even if you avoid storing it.
I was referring to the use-cases where checkpoints are exported to a custom location (…).
In those cases, you can just run the stream through encryption just before it leaves podman, something along the lines of …
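The concrete suggestion was elided above; as a generic illustration, something like wrapping the output in Go's cipher.StreamWriter would encrypt the stream before it ever reaches disk. The key handling here is deliberately simplistic, and real code would want an authenticated (AEAD) mode rather than bare CTR:

```go
// Illustrative sketch only: encrypt the checkpoint stream on its way
// out, so no plaintext copy is ever written to disk. Stdin stands in
// for whatever produces the checkpoint bytes.
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"io"
	"log"
	"os"
)

func main() {
	out, err := os.Create("checkpoint.tar.enc")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	key := make([]byte, 32) // placeholder: really derived from a KMS or passphrase
	iv := make([]byte, aes.BlockSize)
	if _, err := rand.Read(iv); err != nil {
		log.Fatal(err)
	}
	if _, err := out.Write(iv); err != nil { // prepend IV so restore can decrypt
		log.Fatal(err)
	}

	block, err := aes.NewCipher(key)
	if err != nil {
		log.Fatal(err)
	}
	enc := &cipher.StreamWriter{S: cipher.NewCTR(block, iv), W: out}

	// CTR provides confidentiality but no integrity; production code
	// should use an AEAD construction instead.
	if _, err := io.Copy(enc, os.Stdin); err != nil {
		log.Fatal(err)
	}
}
```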
File encryption with gpg, openssl, or age, as well as OCI image encryption (…).
This is easy to do on a single node with Podman, but not so much in multi-tenant Kubernetes clusters.
If you implement encryption inside the snapshot, the snapshot is still a tar file, no? In order to get at the memory image, you still have to read through the entire file, because tar does not support seeking. Go AES decryption is roughly as fast as reading from an SSD (4 GB/sec, see e.g. https://go-review.googlesource.com/c/go/+/125316), so does that solve anything? IMO, the basic problem is that tar is a poor format because it is not seekable. This is also what makes serving OCI images over FUSE annoying (see https://github.com/awslabs/soci-snapshotter). I think this problem should rather be solved by having a file format that is better suited for seeking. I admit that I don't know how one would convince the entire ecosystem to move to a new format.
In …
That's my point: you are reading the entire tar file. If you are reading the entire file into user space (especially if it actually comes from remote cloud storage, i.e., is transported over TLS or similar), decrypting the stream as a whole adds just a little overhead.
Issue Description
Container checkpointing with CRIU works for me, but its speed is disappointing. What currently happens is:

1. the file-system changes of the container are tarred up into a temporary rootfs-diff.tar;
2. that tar file is then copied into the final exported checkpoint archive.
In my case, my containers have large local FS differences. On my laptop, tarring runs at 400 MB/s, so 10 GB of file-system differences takes 25 s to create the rootfs-diff.tar archive. Then this data goes through tar again, which takes another 25 s.
Wouldn't it be better to insert the rootfs diff directly into the snapshot tar (perhaps under a rootfs-diff/ directory)? Then my large file content goes through tar only once, yielding a 2x speedup.
Note that Go isn't particularly efficient with tar files either; see golang/go#70807.
Steps to reproduce the issue
1.
2.
3.
Describe the results you received
Describe the results you expected
podman info output
If you are unable to run podman info for any reason, please provide the podman version, operating system and its version and the architecture you are running.
Podman in a container
No
Privileged Or Rootless
Privileged
Upstream Latest Release
Yes
Additional environment details
Additional information
Additional information like issue happens only occasionally or issue happens with a particular architecture or on a particular setting