Feature: Improve Docker Module Build Speed #221

Closed
4 tasks done
d-ryan-ashcraft opened this issue Jul 22, 2024 · 2 comments · Fixed by #229
Labels
enhancement New feature or request
d-ryan-ashcraft commented Jul 22, 2024

Description

Upon migrating from orphedomos to fabric8's docker-maven-plugin, Docker module build time has increased substantially (+2.5 hours). Both approaches leverage Docker buildx/buildkit. We need to dig a level deeper to determine what is slowing these modules down and how the situation can be improved.

DOD

There are two primary reasons the build has slowed down post-migration:

  • Orphedomos was executing buildx via the default driver, whereas fabric8 is executing buildx via the docker-container driver. This change adds a lot of I/O time to transfer files and Docker layers to and from the docker-container driver. Currently, there is no way for fabric8 to use the default driver without falling back to the basic build. Unfortunately, the basic build has some negative side effects:
    • Several modules fail with cryptic errors when building in this fashion
    • There is no output while the containers are building, making debugging much harder and cache hits much more difficult to track down
  • When deploying, multi-architecture Docker builds are executed via the docker-container driver. For the non-native architecture, this is very slow.

Several opportunities exist to improve things, but they will not all be handled in this ticket. We'll look to shave off some time and then regroup to see what makes sense after letting those changes bake in a bit and be used by the team. A second or third ticket is expected.

Future ideas that may be worth exploring:

Test Strategy/Script

  • Run the build locally to ensure it builds normally
  • Run the CI build to ensure it builds normally

Build Time Changes

| Module | Original CI Time | Updated CI Time | Difference |
| --- | --- | --- | --- |
| Docker::Airflow | 1h 25m | REMOVED | -85m 0s |
| Docker::Data Lineage HTTP Consumer | 0m 35s | 0m 13s | -0m 25s |
| Docker::Configuration Store | 0m 19s | 0m 11s | -0m 8s |
| Docker::FastAPI | 1m 44s | 1m 22s | -0m 22s |
| Docker::Hive Server | 3m 13s | 2m 36s | -0m 37s |
| Docker::Hive MySql | 6m 20s | 6m 05s | -0m 15s |
| Docker::Jenkins Controller | 11m 50s | 4m 4s | -7m 46s |
| Docker::Jenkins Agent | 0m 9s | 6m 0s | +5m 51s |
| Docker::Kafka | 1m 20s | REMOVED | -1m 20s |
| Docker::Metadata | 1m 56s | 1m 23s | -0m 33s |
| Docker::MLFlow | 13m 1s | REMOVED | -13m 1s |
| Docker::Model Training APIs::REST | 2m 13s | 1m 30s | -0m 43s |
| Docker::Model Training API Sagemaker | 6m 28s | 3m 45s | -2m 43s |
| Docker::Nvidia | 7m 55s | 7m 41s | -0m 14s |
| Docker::Pipeline Invocation Service | 1m 47s | 0m 52s | -0m 55s |
| Docker::Policy Decision Point | 0m 27s | 0m 22s | -0m 5s |
| Docker::Quarkus | 2m 54s | 2m 49s | -0m 3s |
| Docker::Spark | 13m 52s | 11m 33s | -2m 19s |
| Docker::Spark Infrastructure | 0m 50s | 0m 28s | -0m 22s |
| Docker::Spark Operator | 16m 20s | 15m 53s | -0m 27s |
| Docker::Vault | 5m 59s | 4m 2s | -1m 57s |
| Docker::Versioning | 13m 19s | 7m 9s | -6m 10s |
| Totals | 3h 18m | 1h 18m | -1h 58m |

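
For anyone re-running these comparisons, here is a minimal sketch of how a Difference value can be derived from the two CI time columns. The helper names `parse_duration` and `format_difference` are hypothetical (they are not part of the build), and the duration format is assumed to match the `1h 25m` / `11m 50s` style used in the table.

```python
import re

# Seconds per unit for the duration style used in the table (assumed format).
UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_duration(text: str) -> int:
    """Convert a duration such as '1h 25m' or '11m 50s' to whole seconds."""
    parts = re.findall(r"(\d+)\s*([hms])", text)
    if not parts:
        raise ValueError(f"unrecognized duration: {text!r}")
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in parts)


def format_difference(original: str, updated: str) -> str:
    """Return updated minus original as a signed 'Xm Ys' string."""
    delta = parse_duration(updated) - parse_duration(original)
    sign = "+" if delta > 0 else "-" if delta < 0 else ""
    minutes, seconds = divmod(abs(delta), 60)
    return f"{sign}{minutes}m {seconds}s"


if __name__ == "__main__":
    # Two rows taken from the table: Jenkins Controller and Jenkins Agent.
    print(format_difference("11m 50s", "4m 4s"))
    print(format_difference("0m 9s", "6m 0s"))
```

Rows marked REMOVED have no updated time, so they would need to be handled separately (for example, by treating the updated time as zero).
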
d-ryan-ashcraft added the enhancement label Jul 22, 2024

d-ryan-ashcraft commented

Investigation Notes

We looked into a number of opportunities to improve build times, especially on CI where we use multi-architecture builds. Some high-level results, using the aissemble-mlflow module for timings:

| Docker Configuration | Average Run Time |
| --- | --- |
| Base docker | 61.2s |
| gzip enabled | 61.2s |
| bzip2 enabled | 63.4s |
| buildx docker | 105.3s |
| buildx docker gzip enabled | 109.0s |
| buildx docker bzip2 enabled | 114.0s |
| buildx with cache-to and cache-from enabled | 55.73s |
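
To put the timings above on a common scale, here is a small sketch that expresses each configuration relative to the base docker run. The numbers are copied from the results above; the function name `percent_change` is just an illustrative helper.

```python
# Average run times in seconds, copied from the timing results above.
TIMINGS = {
    "Base docker": 61.2,
    "gzip enabled": 61.2,
    "bzip2 enabled": 63.4,
    "buildx docker": 105.3,
    "buildx docker gzip enabled": 109.0,
    "buildx docker bzip2 enabled": 114.0,
    "buildx with cache-to and cache-from enabled": 55.73,
}


def percent_change(baseline: float, value: float) -> float:
    """Percentage change of value relative to baseline (negative = faster)."""
    return (value - baseline) / baseline * 100


if __name__ == "__main__":
    baseline = TIMINGS["Base docker"]
    for name, seconds in TIMINGS.items():
        print(f"{name}: {seconds}s ({percent_change(baseline, seconds):+.1f}% vs base docker)")
```
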

Additionally, the default buildx docker driver (local) gives the fastest build times. However, several of our existing modules fail to build as is with that driver, and the errors are not obvious. While testing the default driver (done by forking and slightly modifying the docker-maven-plugin), we confirmed that, as expected, the buildx docker-container driver needed to support multiple architecture targets cannot reuse the default driver's output and has to rebuild both architectures. It does, however, rebuild both simultaneously. This seems like an improvement over the sequential behavior we get when building via the standard approach with the base docker-container driver setup.

As such, we are looking at a first increment of build improvements that uses the following configuration:

  • Only build the Docker container during the docker-maven-plugin:build phase on mvn install (the normal local use case)
  • When the CI profile is activated, skip the build during docker-maven-plugin:build and instead let both arm64 and amd64 build simultaneously during docker-maven-plugin:deploy
  • In all cases, use a local cache-to and cache-from
    • NOTE: This makes cache tracking and expiration a bit more tricky, but let's see how it plays out
  • The following containers have been removed and pegged in their respective charts to the 1.7.0 release
    • The worst offender in terms of build time is aissemble-airflow which racks up a whopping 1 hour, 25 minute build time on both architectures. We don't directly use this or recommend using it. We'd like to remove all references, but that is beyond the scope of this ticket.
    • aissemble-mlflow comes in at 13.5 minutes. We aren't adding much, if any, value with this container and will soon upgrade it to the latest version in another ticket.
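
The local versus CI split described above can be sketched as the roughly equivalent `docker buildx build` invocations. This is only an illustration under stated assumptions: the actual change is made through docker-maven-plugin configuration, which is not shown in this issue, and the image name, cache directory, and the `buildx_command` helper are all hypothetical.

```python
import shlex
from typing import List


def buildx_command(image: str, cache_dir: str, ci: bool) -> List[str]:
    """Assemble a docker buildx argv for a local or CI build (illustrative only)."""
    cmd = [
        "docker", "buildx", "build",
        # In all cases, read from and write to a local build cache.
        "--cache-from", f"type=local,src={cache_dir}",
        "--cache-to", f"type=local,dest={cache_dir}",
        "--tag", image,
    ]
    if ci:
        # CI: build both architectures in one invocation and push the result.
        cmd += ["--platform", "linux/amd64,linux/arm64", "--push"]
    else:
        # Local: build only the native architecture and load it into Docker.
        cmd += ["--load"]
    cmd.append(".")  # build context: the current directory
    return cmd


if __name__ == "__main__":
    # Hypothetical image name and cache location.
    print(shlex.join(buildx_command("example/image:dev", "/tmp/buildx-cache", ci=False)))
    print(shlex.join(buildx_command("example/image:dev", "/tmp/buildx-cache", ci=True)))
```

Passing both platforms to a single invocation is what lets arm64 and amd64 build together rather than one after the other.
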

Initial results with this approach show the nearly 4-hour CI build time down to a still long, but much more manageable, ~2 hours. There is certainly more to improve upon, but this should get us well back towards more reasonable operations.

d-ryan-ashcraft added a commit that referenced this issue Jul 24, 2024
…general-docker/airflow.docker.file.vm

Co-authored-by: Emily Wilkins <[email protected]>
d-ryan-ashcraft added a commit that referenced this issue Jul 24, 2024
d-ryan-ashcraft added a commit that referenced this issue Jul 24, 2024
d-ryan-ashcraft added a commit that referenced this issue Jul 25, 2024
d-ryan-ashcraft added a commit that referenced this issue Jul 25, 2024
#221 🚀 improve build performance; #230 🔧 use ghcr.io as snapshot repo
colinpalmer-pro commented

Test steps passed.
