checkpoint rootfs diff is wasteful #24826
@adrianreber Any thoughts?
@hanwen-flow Sounds like a good idea. You should do a proof of concept and open a PR to verify it. Another thing which would be interesting to see is whether using … helps.

The main problem seems to be that we write the file-system changes twice. Thinking about your approach, we would still need to write them twice: once from the container to the temporary directory, and a second time while creating the final tar archive from the temporary directory.

Thinking more about it: if we write the content from the container to the …

If you find a way to write the data only once, that would be great. If you have an idea how to improve it, please open a PR.
Why? CRIU has to create files in a temp directory because it is a separate process, but podman can create a tar.Writer in Go for the final file directly, insert the CRIU snapshots (from the file-system temp dir), and read the changed files from the container directly. If you do this without compression, the whole process could use kernel file copying (assuming the linked Go proposal goes through). This requires dropping the gratuitous io.Pipe() calls that connect tar streams to output sinks.
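A minimal sketch of what this could look like, assuming a single tar.Writer on the final file; all paths and helper names here are made up for illustration, and real podman code would reuse its existing archive helpers:

```go
// Hypothetical sketch: write the final checkpoint archive with one
// tar.Writer, pulling changed files straight from the container's
// upper layer instead of staging a rootfs-diff.tar first.
package main

import (
	"archive/tar"
	"io"
	"log"
	"os"
	"path/filepath"
)

// addFile streams one changed file into the open archive. The size
// comes from Stat and is written in the header before the content.
func addFile(tw *tar.Writer, root, rel string) error {
	path := filepath.Join(root, rel)
	fi, err := os.Stat(path)
	if err != nil {
		return err
	}
	hdr, err := tar.FileInfoHeader(fi, "")
	if err != nil {
		return err
	}
	hdr.Name = filepath.ToSlash(filepath.Join("rootfs-diff", rel))
	if err := tw.WriteHeader(hdr); err != nil {
		return err
	}
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// Plain io.Copy, no io.Pipe: the bytes flow directly from the
	// container file into the output archive.
	_, err = io.Copy(tw, f)
	return err
}

func main() {
	out, err := os.Create("checkpoint.tar")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()
	tw := tar.NewWriter(out)
	defer tw.Close()
	// The changed files would be enumerated from the overlay upper dir.
	if err := addFile(tw, "/var/lib/containers/example-upper", "etc/motd"); err != nil {
		log.Fatal(err)
	}
}
```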
Re. compression: I observed that zstd compression on the final archive incurs 10% overhead. |
Good question.
There is a tool called criu-image-streamer which tries to avoid it. So it would be possible to also optimize this step.
Sounds great, please open a PR. From my point of view there is nothing speaking against having this.
Do we need to worry about backwards compat here? If the layout is changed, podman would be unable to load a checkpoint created on a previous version, or the other way around.
I had a quick look. That is a small tool, so it could be reimplemented in podman or one of its dependencies, I guess? Re. compatibility: we could maintain the tar-in-tar format and stream the embedded rootfs tar directly into the final tar file without going to disk.
A better place to re-implement image streaming would be go-criu. We use these bindings in Podman. We can propose this as a GSoC project for next year: https://www.criu.org/Google_Summer_of_Code_Ideas
Now I remember again: we can't do this. We need to know the size of the file in advance, because the metadata (including size) precedes the file content.
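This limitation is easy to demonstrate with Go's archive/tar: the Size field is fixed when the header is written, and writing past it fails:

```go
// Tiny illustration of why the size must be known up front: the tar
// header (including Size) precedes the content, and the writer
// enforces it.
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
)

func main() {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	hdr := &tar.Header{Name: "payload", Mode: 0o644, Size: 4} // size fixed here
	if err := tw.WriteHeader(hdr); err != nil {
		panic(err)
	}
	if _, err := tw.Write([]byte("12345")); err != nil {
		fmt.Println(err) // archive/tar: write too long
	}
	tw.Close()
}
```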
This yields a 2x speed-up on creating and restoring snapshots, as it avoids creating and extracting the temporary rootfs-diff.tar file.

Tested with the following recipe:

```
$ cat files.sh
#!/bin/sh
dd if=/dev/urandom of=test bs=1M count=100
for i in $(seq 1 10); do
  cp test test$i &
done
wait
rm /usr/bin/znew

$ sudo ~/vc/containers/podman/podman run -dt -p 8080:80/tcp docker.io/library/httpd; \
  sudo ~/vc/containers/podman/podman container cp files.sh $(sudo ~/vc/containers/podman/podman ps -l --format "{{.ID}}"):files.sh ; \
  sudo ~/vc/containers/podman/podman container exec -it -l /bin/sh -x files.sh ; \
  (cd ../containers/podman/ && go build -tags "selinux seccomp" ./cmd/podman) && \
  sudo ~/vc/containers/podman/podman container checkpoint -e foo.tar -l && \
  sudo ~/vc/containers/podman/podman container restore -i foo.tar --name restored
```

For productionizing, this would need:

* supporting (de)compression, which I left out for simplicity
* a decision on what to do with backward compatibility
* similar treatment of the devshm and volumes tar files
* a more principled approach to storing multiple sections in a single tar file?
* upstreaming changes to the archive package in the storage repo (including tests)

Addresses containers#24826
I created #24984, which seems to work according to the test I described in the commit message. Regarding format stability: we could reap some benefits without changing the format by decompressing rootfs-diff.tar directly from the checkpoint tar file. That would only need some extensions to the untar functions in the archive package. The decompression speed (i.e., restore speed) is more important to me than the checkpoint speed, so that is an acceptable compromise for me. I suspect we'll be running this on BTRFS or XFS, so we can take advantage of zero-copy (see golang/go#70807 (comment)), which yields a much more substantial speed-up. For that, it is important that there are no intermediate buffers in the tar handling.
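A sketch of what such an extension could look like, assuming the outer archive contains an entry literally named rootfs-diff.tar; the entry name, target path, and handling of non-regular files are all simplified for illustration:

```go
// Hypothetical sketch: while scanning the outer checkpoint tar, treat
// the rootfs-diff.tar entry as a nested tar stream and extract it in
// one pass, without materializing rootfs-diff.tar on disk.
package main

import (
	"archive/tar"
	"io"
	"log"
	"os"
	"path/filepath"
)

func restoreRootfsDiff(snapshot, mount string) error {
	f, err := os.Open(snapshot)
	if err != nil {
		return err
	}
	defer f.Close()
	outer := tar.NewReader(f)
	for {
		hdr, err := outer.Next()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if hdr.Name != "rootfs-diff.tar" {
			continue // other checkpoint entries handled elsewhere
		}
		// The outer tar.Reader is positioned at the entry body, so it
		// can be consumed directly as the inner tar stream.
		inner := tar.NewReader(outer)
		for {
			ih, err := inner.Next()
			if err == io.EOF {
				break
			}
			if err != nil {
				return err
			}
			if ih.Typeflag != tar.TypeReg {
				continue // dirs, symlinks, whiteouts omitted in this sketch
			}
			// Real code must sanitize ih.Name against path traversal.
			dst := filepath.Join(mount, ih.Name)
			if err := os.MkdirAll(filepath.Dir(dst), 0o755); err != nil {
				return err
			}
			w, err := os.Create(dst)
			if err != nil {
				return err
			}
			if _, err := io.Copy(w, inner); err != nil {
				w.Close()
				return err
			}
			w.Close()
		}
	}
}

func main() {
	if err := restoreRootfsDiff("snapshot.tar", "/path/to/overlay/merged"); err != nil {
		log.Fatal(err)
	}
}
```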
I tried prototyping this in https://github.com/hanwen-flow/podman/commits/hacktar/. However, this doesn't readily work: the rootfs diff is read from … Since podman can decide what the diff looks like, it should be possible to get at the files, which are under the …
Got it to work; see #25050. Given the speed-up of this approach (upwards of 10x), I think pursuing a measly 2x speed-up is not really worth it.
Sorry, but I am confused. Can you describe again what the difference between your two approaches is, and what the improvement is? I think I understand it, but I am not 100% sure. Somewhere you wrote you are waiting for feedback on how to continue. If you have an improvement, I would like to see it merged into Podman. Backwards compatibility is important, but it sounds like you found a way to solve it, right?
Re. CRIU-streamer: for my use case, I always want to save the memory snapshot onto disk. IIUC, using criu-streamer doesn't really help; the bottleneck will be in syncing the contents to stable storage. Once it is on stable storage, copying the data around is essentially free because of reflink copying.
Current situation (checkpoint): container file system → rootfs-diff.tar in a temporary directory → final checkpoint tar.

(restore): checkpoint tar → rootfs-diff.tar in a temporary directory → container file system.

Problem: on most file systems, the cost of copying files is proportional to file size, and we have to copy the files twice to handle rootfs-diff.tar.

Solution 1: use a CoW file system; then copying files with copy_file_range(2) is essentially free. This is #25050.

Solution 2: skip rootfs-diff.tar as an intermediate.

Solution 2a (fix checkpoint & restore): inline the rootfs diffs directly in snapshot.tar; this is an incompatible format change, and speeds up both checkpoint and restore. Implemented in #24984.

Solution 2b (fix restore): keep the format, but use a custom extraction: instead of extracting rootfs-diff.tar to disk, when rootfs-diff.tar is encountered while processing snapshot.tar, directly pipe it into a tar.Reader, extracting into the overlayfs mount.

Solution 1 is fully backward compatible and achieves far superior speeds (see the reflink sketch below). This solution has my preference, and the changes are modest enough that we can patch them locally. It only works on CoW filesystems (ZFS, XFS, BTRFS). It requires vendoring Go's archive/tar package until https://go-review.googlesource.com/c/go/+/642736 goes through. I am not sure how acceptable this is for the maintainers of the containers/storage repo.
@hanwen We are currently working on adding support for encrypting rootfs changes and integrating this feature with Podman and CRI-O:
Have you considered a use-case where the content of rootfs-diff.tar is compressed and encrypted? |
The user running podman has permission to see a running container's contents anyway, so encrypting the snapshot while it is on the machine running podman doesn't add any security. If the snapshot is shipped off somewhere, you could just encrypt the whole snapshot tar file, either on the sending side or on the receiving server, depending on who you trust more. So, I don't understand the threat model that you are addressing. In our case, we are not concerned with protecting the snapshots from prying eyes. The step that goes from unencrypted file system to encrypted stream cannot be done through reflink copies, of course, but if you encrypt the snapshot tar, you can still reap the benefits of free copying when you create the rootfs diff tar.
The main difference is that processes in a running container reside in system memory, while snapshots are saved to persistent storage. Any user with access to the storage can read sensitive data stored in snapshots (e.g., passwords, API tokens, etc.). Our goal is to introduce built-in support for end-to-end encryption, where checkpoint data is encrypted before it is written to disk and decrypted as part of the restore process. In addition, container checkpoints can be very large (e.g., hundreds of gigabytes). While encrypting the whole snapshot tar file would work, writing the data to disk as plaintext and then reading it back to encrypt it not only increases storage requirements and performance overhead, but also introduces security risks, as an attacker could compromise the data flow and potentially gain access to sensitive information during this operation. I hope this helps clarify the threat model.
Which users are those? The storage dir in podman is only readable by root. If you can read the storage (i.e., if you are root), all security guarantees are gone, as root can read secrets directly from /dev/kmem.
For rootfs, with reflink copies on a CoW filesystem, there is no storage overhead, as the blocks are shared between all copies, including the files that the container itself uses as its file system. For the memory image it is attractive to stream the data directly from memory into a network pipe (see e.g. criu-streamer) to avoid overhead, but attackers that can bypass access controls on the FS can probably find other ways to get at your sensitive data, even if you avoid storing it.
I was referring to the use-cases where checkpoints are exported to a custom location (…).
In those cases, you can just run the stream through encryption just before it leaves podman, something along the lines of …
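The concrete suggestion was elided above; as a generic illustration, something like wrapping the output in Go's cipher.StreamWriter would encrypt the stream before it ever reaches disk. The key handling here is deliberately simplistic, and real code would want an authenticated (AEAD) mode rather than bare CTR:

```go
// Illustrative sketch only: encrypt the checkpoint stream on its way
// out, so no plaintext copy is ever written to disk. Stdin stands in
// for whatever produces the checkpoint bytes.
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"io"
	"log"
	"os"
)

func main() {
	out, err := os.Create("checkpoint.tar.enc")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	key := make([]byte, 32) // placeholder: really derived from a KMS or passphrase
	iv := make([]byte, aes.BlockSize)
	if _, err := rand.Read(iv); err != nil {
		log.Fatal(err)
	}
	if _, err := out.Write(iv); err != nil { // prepend IV so restore can decrypt
		log.Fatal(err)
	}

	block, err := aes.NewCipher(key)
	if err != nil {
		log.Fatal(err)
	}
	enc := &cipher.StreamWriter{S: cipher.NewCTR(block, iv), W: out}

	// CTR provides confidentiality but no integrity; production code
	// should use an AEAD construction instead.
	if _, err := io.Copy(enc, os.Stdin); err != nil {
		log.Fatal(err)
	}
}
```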
File encryption with gpg, openssl, or age, as well as OCI image encryption (…).
This is easy to do on a single node with Podman, but not so much in multi-tenant Kubernetes clusters.
If you implement encryption inside the snapshot, the snapshot is still a tar file, no? In order to get at the memory image, you still have to read through the entire file, because tar does not support seeking. Go AES decryption is roughly as fast as reading from an SSD (4 GB/sec, see e.g. https://go-review.googlesource.com/c/go/+/125316), so does that solve anything? IMO, the basic problem is that tar is a poor format because it is not seekable. This is also what makes serving OCI images over FUSE annoying (see https://github.com/awslabs/soci-snapshotter). I think this problem should rather be solved by having a file format that is better suited for seeking. I admit that I don't know how one would convince the entire ecosystem to move to a new format.
In …
That's my point: you are reading the entire tar file. If you are reading the entire file into user space (especially if it actually comes from remote cloud storage, i.e., is transported over TLS or similar), decrypting the stream as a whole adds just a little overhead.
Issue Description
Container checkpointing with CRIU works for me, but its speed is disappointing. What currently happens is:

1. the file-system changes of the container are tarred up into a temporary rootfs-diff.tar;
2. that tar file is then copied into the final exported checkpoint archive.
In my case, my containers have large local FS differences. On my laptop, tarring runs at 400 MB/s, so 10 GB of file-system differences takes 25 s to create the rootfs-diff.tar archive. Then this data goes through tar again, which takes another 25 s.
Wouldn't it be better to insert the rootfs diff directly into the snapshot tar (perhaps under a rootfs-diff/ directory)? Then my large file content goes through tar only once, yielding a 2x speedup.
Note that Go isn't particularly efficient with tar files either; see golang/go#70807.
Steps to reproduce the issue
1.
2.
3.
Describe the results you received
Describe the results you expected
podman info output
If you are unable to run podman info for any reason, please provide the podman version, operating system and its version and the architecture you are running.
Podman in a container
No
Privileged Or Rootless
Privileged
Upstream Latest Release
Yes
Additional environment details
Additional information
Additional information like issue happens only occasionally or issue happens with a particular architecture or on a particular setting