
Evaluate if we want to move to BuildKit for automatic Docker image handling #89

beckerhe opened this issue Jul 20, 2023 · 1 comment

@beckerhe (Collaborator)

BuildKit makes it easy to manage Docker images in a CI scenario because it can automatically rebuild images when an image's inputs (the Dockerfile and build context) change. So no manual builds and pushes are needed.

One of many tutorials describing the workflow is here: https://testdriven.io/blog/faster-ci-builds-with-docker-cache/
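For illustration, a minimal sketch of the workflow the tutorial describes, using the BuildKit CLI; the registry and image names are placeholders:

```sh
# Pull the most recent image (if any) so its layers can seed the cache,
# then rebuild - only stages whose inputs changed are actually rebuilt -
# and push the result back. BUILDKIT_INLINE_CACHE=1 embeds cache metadata
# in the pushed image so later builds can consume it via --cache-from.
docker pull registry.example.com/ci/builder:latest || true
docker buildx build \
  --cache-from registry.example.com/ci/builder:latest \
  --build-arg BUILDKIT_INLINE_CACHE=1 \
  -t registry.example.com/ci/builder:latest \
  --push .
```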

PR #77 implemented some parts of this in its first commit (we later decided not to pursue it for now).

@beckerhe (Collaborator, Author) commented Jul 20, 2023

I'm going to document the different points we discussed whenever I have some time.

1. Layer Caching

Building the Docker images from scratch only works well when we have decent and reliable caching. We need two types of caching:

  1. Caching of main branch Docker images
  2. Caching of changes to Docker images in PRs between different iterations of the PR

The simplest approach is using GitHub Actions' built-in cache. Unfortunately, it's limited to 10 GiB of total cache space per repository, which we expect to exceed soon - especially with the CUDA containers.
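For reference, this is roughly what BuildKit's GHA cache backend looks like when invoked as raw buildx (in practice it's usually driven via docker/build-push-action); the image name and scope below are placeholders, and the backend relies on the cache credentials that GitHub Actions runners expose in the environment:

```sh
# Sketch: export/import the layer cache through GitHub Actions' cache
# service. mode=max also caches layers from intermediate build stages.
docker buildx build \
  --cache-from type=gha,scope=myimage \
  --cache-to type=gha,scope=myimage,mode=max \
  -t myimage:ci .
```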

Another option is to use a container registry for layer caching, which is the best-supported option in BuildKit. We can use GCP's Artifact Registry for that, which means we are not limited in size. We would create two registries - one for main branch caching and one for PR-based layer caching. Only trusted build runners can write into the former, but for PR-based caching we have to be careful about which jobs can write into the PR layer cache.
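A sketch of what that would look like against Artifact Registry; the project, repository, and image names are hypothetical:

```sh
# Main-branch job: read and write the shared layer cache in the
# trusted registry. mode=max exports layers from all build stages.
docker buildx build \
  --cache-from type=registry,ref=us-docker.pkg.dev/my-project/main-cache/myimage:cache \
  --cache-to type=registry,ref=us-docker.pkg.dev/my-project/main-cache/myimage:cache,mode=max \
  -t myimage:ci .
```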

We cannot give untrusted runners (which we want to support in the future) write access to that second registry. If we did, we would be vulnerable to cache poisoning attacks.

One option to solve this is to have an isolated cache per PR, as implemented by GHA's caching system. As far as I've seen, Docker registries don't support that (we would also need some federated authentication system to support it, which makes things even more complicated).

So only two options remain:

  1. Use the GHA cache for the PR caches (the 2nd type of cache from above). We might exceed the cache limit, which would effectively disable caching between PR iterations, but everything would still work. The big advantage is that we get proper cache isolation.
  2. Only build Docker images on trusted runners: Whenever we want to use a specific Docker image on an untrusted runner, we first build it on a trusted runner and push the cached layers to the caching registry. The untrusted runner can then pull the cached layers from that registry but cannot write to it. If we choose that workflow, we also need a postsubmit job which cleans up the layer cache when a PR gets merged. I assume we also need a second job which cleans up old layers after some time (from closed PRs that were never merged, for example). See the sketch after this list.
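A sketch of option 2's split between trusted and untrusted runners, again with hypothetical registry paths and PR number:

```sh
# Trusted runner: build and export the layer cache under a per-PR ref.
docker buildx build \
  --cache-from type=registry,ref=us-docker.pkg.dev/my-project/pr-cache/myimage:pr-123 \
  --cache-to type=registry,ref=us-docker.pkg.dev/my-project/pr-cache/myimage:pr-123,mode=max \
  -t myimage:pr-123 .

# Untrusted runner: same cache, but read-only. No --cache-to, and the
# runner's registry credentials only grant pull access.
docker buildx build \
  --cache-from type=registry,ref=us-docker.pkg.dev/my-project/pr-cache/myimage:pr-123 \
  -t myimage:pr-123 .

# Postsubmit cleanup once the PR is merged (plus a periodic job for
# stale refs from closed PRs):
gcloud artifacts docker images delete \
  us-docker.pkg.dev/my-project/pr-cache/myimage:pr-123 \
  --delete-tags --quiet
```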

I think both options have their advantages. The first allows a simpler workflow design, at the cost of losing caching for large Docker containers. The second has no size limit but needs more plumbing around it (like the cleanup jobs, for example).
