Add encrypted scratch space for container sandbox #2212
Conversation
src/cloud-api-adaptor/podvm-mkosi/mkosi.skeleton/usr/lib/repart.d/30-scratch.conf
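For reference, a systemd-repart drop-in for such a scratch partition could look roughly like this (a sketch based on the repart.d(5) format; the actual contents of `30-scratch.conf` in this PR may differ):

```ini
# illustrative drop-in: create a LUKS-encrypted partition that grows
# into whatever free space is left on the disk at boot
[Partition]
Type=linux-generic
Label=scratch
Encrypt=key-file
```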
Would it also make sense to extend this approach to use an add-on disk, as available with some instance types? Not needed for the initial iteration, but is this something we should consider and lay the foundation for now?
I think it would be better to also add an annotation (e.g. …)
For future extensibility, I think we can freely iterate. We don't currently have a version contract between the PodVM and CAA. We would have to look at the user-facing API, maybe, to make this possible.
I agree. I didn't spot an existing fitting annotation in kata at first glance, so we might have to add it.
Yes, a new annotation will need to be added.
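To illustrate the direction, a per-pod setting might eventually be expressed as a pod annotation like the following (the annotation name is purely hypothetical; as noted above, kata would first need to grow such an annotation):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: large-image-workload
  annotations:
    # hypothetical name, not an existing kata annotation
    io.katacontainers.config.hypervisor.scratch_space_size: "8Gi"
spec:
  runtimeClassName: kata-remote
  containers:
    - name: workload
      image: example.com/pytorch-cuda:latest
```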
We discussed it in the last community call and concluded to start with a CAA-wide option. Once this is merged, we can start adding something to kata-runtime to make it tweakable per pod.
I think commits 1 & 2 need some tidying as there is new code added in 1 that is changed in 2 which was a bit confusing. Additionally it would have been nicer (for me to review) to split the resource refactor out from commit 1, into a previous commit. It's not a blocker, maybe adding a sentence to explain that in the commit message #1 would be helpful for future people looking back at this?
```diff
@@ -76,6 +76,7 @@ else
 	touch resources/buildBootableImage
 	nix develop ..#podvm-mkosi --command mkosi --environment=VARIANT_ID=production
 	qemu-img convert -f raw -O qcow2 build/system.raw build/podvm-$(PODVM_DISTRO)-$(ARCH).qcow2
+	qemu-img resize build/podvm-$(PODVM_DISTRO)-$(ARCH).qcow2 +100M
```
Is there any significance to +100M, or is it just a starter amount of space?
It needs to have a bit of size so repart will not stumble over it at launch, attempting to create a partition without space. Not sure if it needs to be +100M; it could also work with just +10M, I'd have to test.
Ok, I don't have an issue with the 100M, just wanted to check if there was special logic behind it.
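For illustration, the resize step discussed above and a quick verification might look like this on the command line (the image filename is an example):

```sh
# grow the qcow2 by 100 MiB of unpartitioned space; systemd-repart
# claims it at first boot to create the encrypted scratch partition
qemu-img resize build/podvm-fedora-amd64.qcow2 +100M

# verify the new virtual size
qemu-img info build/podvm-fedora-amd64.qcow2
```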
This is only being done on the x86 path here because only Azure supports this scratch space at the moment, I assume? Do you see any obstacles in the code for s390x support if/when needed?
I don't know much about partitioning schemes on s390x. If systemd-repart works the same way there, it should work out of the box. The cloud-provider code would need to consider the `.storage` property when provisioning a VM, and then repart would grab whatever is available and allocate it to the scratch space.
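As a rough illustration of the provider-side logic (a hypothetical sketch; the type and field names are assumptions, not the actual cloud-api-adaptor API):

```go
package main

import "fmt"

// Config is a stand-in for a provider config; 0 means "keep the
// image's default disk size".
type Config struct {
	RootVolumeSizeGB int
}

// effectiveDiskSize sketches how a provider could honor the CAA-wide
// option: any size beyond the image default becomes free space that
// systemd-repart can claim for the encrypted scratch partition.
func effectiveDiskSize(cfg Config, imageDefaultGB int) int {
	if cfg.RootVolumeSizeGB > imageDefaultGB {
		return cfg.RootVolumeSizeGB
	}
	return imageDefaultGB
}

func main() {
	fmt.Println(effectiveDiskSize(Config{RootVolumeSizeGB: 30}, 1)) // prints 30
}
```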
The CI wasn't building debug images due to a string/bool mismatch.

Signed-off-by: Magnus Kulke <[email protected]>
This adds the configuration for an encrypted scratch space on an mkosi image. At boot a /dev/sda4 partition will be created and encrypted with LUKS using an ephemeral key. The partition will use the space available on the image volume. By default the qcow2 image has 100MB allocated for this space; that amount will only work for very small images, hence we do not mount the scratch space to `/run/kata-containers` by default. If the kata-agent service unit encounters a `/run/peerpod/mount-scratch` file, it will mount the encrypted partition `/dev/sda4` to `/run/kata-containers`. This file is provisioned by `process-user-data`, configured via the CAA daemonset. A new CAA parameter has been introduced that allows specifying the disk size. If the disk size is larger than 0, a cloud provider can attempt to honor it to create space for an encrypted scratch partition that can be mounted to `/run/kata-containers`.

Signed-off-by: Magnus Kulke <[email protected]>
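The marker-file check described in the commit message could be sketched roughly like this (assumptions: the mapper device name and how the check is wired into the service unit may differ from the actual PR):

```sh
#!/bin/sh
# mount the encrypted scratch partition only if process-user-data
# has dropped the marker file (device name is an assumption)
if [ -f /run/peerpod/mount-scratch ]; then
    mount /dev/mapper/encrypted_disk /run/kata-containers
fi
```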
This looks like a good starting point to me. Thanks @mkulke!
/lgtm
fixes #1921
This is a PoC to address the problem of secure scratch space into which we can download and unpack images for the sandbox, write files at runtime, etc.
Currently this space is either backed by storage that is not protected by the TEE (packer) or by tmpfs mounts that consume RAM at runtime (mkosi). For larger images like pytorch w/ cuda this implies, in theory, that we have to set aside 8GB of RAM just to be able to download and unpack the image. tmpfs mounts are by default limited to a fraction of RAM, so we might end up having to overprovision the machine to be able to run a workload.
The proposed change introduces a CAA-wide parameter that allows specifying the disk size. A cloud provider can use this as an indicator for the size of the to-be-provisioned podvm disk. A zero value means we stick to the image's default size.
If a size has been given (the mkosi image is ~0.5GB), the image will be extended to that size. At launch the podvm will claim that space and create an ad-hoc encrypted volume on it. Depending on the presence of a marker file in user-data, the agent unit will mount that volume to `/run/kata-containers` prior to launching kata-agent.

I'm not super happy about the API (I would prefer to specify it per pod) or the implementation (the marker file), but I wanted to push early to get feedback. (Also, it's only implemented for Azure at the moment.)
FWIW it does what it's supposed to do; see some numbers in #1958.