feat: add checkpoint uds-core slim package #818

Merged
merged 53 commits into from
Nov 18, 2024

Conversation

Racer159
Contributor

@Racer159 Racer159 commented Sep 25, 2024

Description

This adds a ~75% faster way to deploy or reset a full uds-core cluster (theoretically would work for other preloaded things like testing GitLab Runner w/GitLab too).

Normal:
image

Checkpoint:
image

Tradeoffs:

  • Requires sudo - not sure of a great way around this without mangling volume permissions for containerd
  • May become unwieldy with more permutations (i.e. with layers work)
  • The cluster would be fully published (so all credentials are reused)

Related Issue

Fixes #N/A

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Other (security config, docs update, etc)

Checklist before merging

@Racer159 Racer159 changed the title from "feat: add frozen uds-core slim package" to "feat: add checkpoint uds-core slim package" Sep 27, 2024
@Racer159 Racer159 marked this pull request as ready for review September 27, 2024 22:54
@Racer159 Racer159 requested a review from a team as a code owner September 27, 2024 22:54
@Racer159 Racer159 self-assigned this Sep 27, 2024
@Racer159
Contributor Author

Racer159 commented Sep 28, 2024

Checkpoint task passed in this PR (except for the actual publish task)
image

Contributor

@catsby catsby left a comment

I'm not an approver but the code does look good to me. I would like to see more information on how to use this package though so it's more clear on how/why/when someone would want to use it.

packages/checkpoint-dev/README.md (outdated)
packages/checkpoint-dev/zarf.yaml
tasks.yaml (outdated)
.github/workflows/checkpoint.yaml (outdated)
.github/actions/setup/action.yaml (outdated)
packages/checkpoint-dev/zarf.yaml
packages/checkpoint-dev/checkpoint.sh (outdated)
@bburky
Member

bburky commented Oct 25, 2024

Probably ignore all of the following. I tried testing CRIU (docker checkpoint or podman container checkpoint). It almost works with Podman, except that it does not support nested containers. Recording my notes here anyway.


Did you try docker checkpoint, which uses CRIU and is somewhat meant for this purpose? It is still experimental, though: it requires "experimental": true in your /etc/docker/daemon.json and a CRIU package installed in your Linux distro.
https://docs.docker.com/reference/cli/docker/checkpoint/
https://criu.org/Docker
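
A minimal sketch of how one might check those two prerequisites before trying `docker checkpoint`. The helper name `check_checkpoint_prereqs` is hypothetical. It only assumes that `criu` is on the PATH when installed and that the daemon reports its experimental flag through `docker version`.

```shell
#!/bin/sh
# Sketch: report whether the prerequisites for `docker checkpoint` appear to be met.
# check_checkpoint_prereqs is a hypothetical helper name, not part of Docker or CRIU.
check_checkpoint_prereqs() {
  status=0

  # CRIU must be installed on the host for `docker checkpoint` to work.
  if command -v criu >/dev/null 2>&1; then
    echo "criu: found"
  else
    echo "criu: missing"
    status=1
  fi

  # The Docker daemon reports whether experimental features are enabled.
  experimental=$(docker version --format '{{.Server.Experimental}}' 2>/dev/null)
  if [ "$experimental" = "true" ]; then
    echo "docker experimental: enabled"
  else
    echo "docker experimental: disabled or daemon unreachable"
    status=1
  fi

  return "$status"
}

check_checkpoint_prereqs || echo "prerequisites not met"
```

This only reports state. It does not edit daemon.json or install anything, so it should be safe to run on a shared machine.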

If you use --checkpoint-dir you can save the checkpoint to disk and restore it after recreating the container (possibly on a different machine). There seems to be a bug during restore, but there's a workaround, see below.

docker rm -f count
sudo rm -rf /tmp/checkpoint

docker run -d --name=count busybox /bin/sh -c 'for i in $(seq 9999999); do echo "$i" && sleep 1; done'
docker checkpoint create --checkpoint-dir=/tmp/checkpoint count checkpoint1
docker rm count

docker create --name count busybox
# Apparently `docker start --checkpoint-dir` is broken, use workaround: https://github.com/moby/moby/issues/37344#issuecomment-450782189
# docker start --checkpoint-dir /tmp/checkpoint --checkpoint checkpoint1 count
sudo mv /tmp/checkpoint/checkpoint1 "/var/lib/docker/containers/$(docker ps -aq --no-trunc --filter name=count)/checkpoints/"
docker start --checkpoint=checkpoint1 count

docker ps
docker logs -f count

The biggest downside would be this is near impossible to use with Docker Desktop. A big advantage is the cluster never actually "stops", it's magically paused and resumed elsewhere.

Podman seems to support this too, and seems to be a bit more fully supported. k3d (somewhat) supports Podman too. Unlike Docker, Podman's CRIU support includes volumes and capturing multiple containers at once. It can apparently pack the checkpoint into an OCI image too (useful for publishing to GHCR?).

Except... this whole idea may be useless because I don't think CRIU supports checkpointing nested namespaces (which is how k3d works: it embeds sub-containers inside its parent Docker container for the k8s node).
https://github.com/checkpoint-restore/criu/blob/v4.0/criu/include/namespaces.h#L47-L48

limactl start template://podman-rootful
export DOCKER_HOST=unix://$HOME/.lima/podman-rootful/sock/podman.sock
k3d cluster create

limactl shell podman-rootful sudo podman container checkpoint --export=/tmp/lima/checkpoint.tgz k3d-k3s-default-server-0 k3d-k3s-default-serverlb
# Error:
#   Can't dump nested pid namespace for 4663
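
For contrast, here is a rough sketch of the export/import round trip for a single, non-nested Podman container (for example the busybox `count` container from the Docker notes above), which is the case CRIU is expected to handle. `DRY_RUN`, `run`, and `checkpoint_roundtrip` are hypothetical names for this illustration. With `DRY_RUN=1` (the default here) the commands are only printed, and the step that removes the container before restoring is an assumption about how to avoid a name clash.

```shell
#!/bin/sh
# Sketch: checkpoint a single, non-nested Podman container to a tarball and restore it.
# DRY_RUN, run, and checkpoint_roundtrip are hypothetical names for this illustration.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

checkpoint_roundtrip() {
  name="$1"
  archive="$2"
  # Dump the container state to an archive on disk.
  run sudo podman container checkpoint --export="$archive" "$name"
  # Assumption: remove the stopped container so the restore can recreate it.
  run sudo podman rm "$name"
  # Recreate and resume the container from the archive.
  run sudo podman container restore --import="$archive"
}

checkpoint_roundtrip count /tmp/checkpoint.tgz
```

Set `DRY_RUN=0` to actually run the commands on a rootful Podman host with CRIU installed.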

corang
corang previously approved these changes Nov 8, 2024
Contributor

@mjnagel mjnagel left a comment

Few more comments + need an update to the release-please config to ensure the checkpoint zarf.yaml is versioned properly: https://github.com/defenseunicorns/uds-core/blob/gotta-go-fast/release-please-config.json#L14

.github/workflows/checkpoint.yaml (outdated)
.github/workflows/checkpoint.yaml
.github/workflows/checkpoint.yaml
.gitignore (outdated)
packages/checkpoint-dev/README.md (outdated)
Contributor

@mjnagel mjnagel left a comment

LGTM, thanks for the work on this!!! Would be great to revisit the macOS support at some point and look at other places we could checkpoint things as well.

Contributor

@mjnagel mjnagel left a comment

Oops missed this one - need an update to the release-please config to ensure the checkpoint zarf.yaml is versioned properly: https://github.com/defenseunicorns/uds-core/blob/gotta-go-fast/release-please-config.json#L14

mjnagel
mjnagel previously approved these changes Nov 18, 2024
@mjnagel mjnagel merged commit d95f6be into main Nov 18, 2024
27 checks passed
@mjnagel mjnagel deleted the gotta-go-fast branch November 18, 2024 21:24
5 participants