Skip to content

Latest commit

 

History

History
740 lines (545 loc) · 42.3 KB

IMAGES.rst

File metadata and controls

740 lines (545 loc) · 42.3 KB

Airflow has two images (build from Dockerfiles):

  • CI image (Dockerfile.ci) - used for running tests and local development
  • Production image (Dockerfile) - used to run production-ready Airflow installations

The images are named as follows:

apache/airflow:<BRANCH_OR_TAG>-python<PYTHON_MAJOR_MINOR_VERSION>[-ci][-manifest]

where:

  • BRANCH_OR_TAG - branch or tag used when creating the image. Examples: master, v1-10-test, 1.10.10 The master and v1-10-test labels are built from branches so they change over time. The 1.10.* and in the future 2.* labels are build from git tags and they are "fixed" once built.
  • PYTHON_MAJOR_MINOR_VERSION - version of python used to build the image. Examples: 3.5, 3.7
  • The -ci suffix is added for CI images
  • The -manifest is added for manifest images (see below for explanation of manifest images)

The easiest way to build those images is to use BREEZE.rst.

Note! Breeze by default builds production image from local sources. You can change it's behaviour by providing --install-airflow-version parameter, where you can specify the tag/branch used to download Airflow package from in github repository. You can also change the repository itself by adding --dockerhub-user and --dockerhub-repo flag values.

You can build the CI image using this command:

./breeze build-image

You can build production image using this command:

./breeze build-image --production-image

By adding --python <PYTHON_MAJOR_MINOR_VERSION> parameter you can build the image version for the chosen python version.

The images are build with default extras - different extras for CI and production image and you can change the extras via the --extras parameters. You can see default extras used via ./breeze flags.

For example if you want to build python 3.7 version of production image with "all" extras installed you should run this command:

./breeze build-image --python 3.7 --extras "all" --production-image

The command that builds the CI image is optimized to minimize the time needed to rebuild the image when the source code of Airflow evolves. This means that if you already have the image locally downloaded and built, the scripts will determine whether the rebuild is needed in the first place. Then the scripts will make sure that minimal number of steps are executed to rebuild parts of the image (for example, PIP dependencies) and will give you an image consistent with the one used during Continuous Integration.

The command that builds the production image is optimised for size of the image.

In Breeze by default, the airflow is installed using local sources of Apache Airflow.

You can also build production images from PIP packages via providing --install-airflow-version parameter to Breeze:

./breeze build-image --python 3.7 --extras=gcp --production-image --install-airflow-version=1.10.12

This will build the image using command similar to:

pip install apache-airflow[sendgrid]==1.10.12 \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-1.10.12/constraints-3.6.txt"

You can also build production images from specific Git version via providing --install-airflow-reference parameter to Breeze (this time constraints are taken from the constraints-master branch which is the HEAD of development for constraints):

pip install "https://github.com/apache/airflow/archive/<tag>.tar.gz#egg=apache-airflow" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-master/constraints-3.6.txt"

Default mechanism used in Breeze for building CI images uses images pulled from DockerHub or GitHub Image Registry. This is done to speed up local builds and CI builds - instead of 15 minutes for rebuild of CI images, it takes usually less than 3 minutes when cache is used. For CI builds this is usually the best strategy - to use default "pull" cache. This is default strategy when BREEZE.rst builds are performed.

For Production Image - which is far smaller and faster to build, it's better to use local build cache (the standard mechanism that docker uses. This is the default strategy for production images when BREEZE.rst builds are performed. The first time you run it, it will take considerably longer time than if you use the pull mechanism, but then when you do small, incremental changes to local sources, Dockerfile image= and scripts further rebuilds with local build cache will be considerably faster.

You can also disable build cache altogether. This is the strategy used by the scheduled builds in CI - they will always rebuild all the images from scratch.

You can change the strategy by providing one of the --build-cache-local, --build-cache-pulled or even --build-cache-disabled flags when you run Breeze commands. For example:

./breeze build-image --python 3.7 --build-cache-local

Will build the CI image using local build cache (note that it will take quite a long time the first time you run it).

./breeze build-image --python 3.7 --production-image --build-cache-pulled

Will build the production image with pulled images as cache.

./breeze build-image --python 3.7 --production-image --build-cache-disabled

Will build the production image from the scratch.

You can also turn local docker caching by setting DOCKER_CACHE variable to "local", "pulled", "disabled" and exporting it.

export DOCKER_CACHE="local"

or

export DOCKER_CACHE="disabled"

By default images are pulled and pushed from and to DockerHub registry when you use Breeze's push-image or build commands.

Our images are named like that:

apache/airflow:<BRANCH_OR_TAG>[-<PATCH>]-pythonX.Y         - for production images
apache/airflow:<BRANCH_OR_TAG>[-<PATCH>]-pythonX.Y-ci      - for CI images
apache/airflow:<BRANCH_OR_TAG>[-<PATCH>]-pythonX.Y-build   - for production build stage

For example:

apache/airflow:master-python3.6                - production "latest" image from current master
apache/airflow:master-python3.6-ci             - CI "latest" image from current master
apache/airflow:v1-10-test-python2.7-ci         - CI "latest" image from current v1-10-test branch
apache/airflow:1.10.10-python3.6               - production image for 1.10.10 release
apache/airflow:1.10.10-1-python3.6             - production image for 1.10.10 with some patches applied

You can see DockerHub images at https://hub.docker.com/repository/docker/apache/airflow

By default DockerHub registry is used when you push or pull such images. However for CI builds we keep the images in GitHub registry as well - this way we can easily push the images automatically after merge requests and use such images for Pull Requests as cache - which makes it much it much faster for CI builds (images are available in cache right after merged request in master finishes it's build), The difference is visible especially if significant changes are done in the Dockerfile.CI.

The images are named differently (in Docker definition of image names - registry URL is part of the image name if DockerHub is not used as registry). Also GitHub has its own structure for registries each project has its own registry naming convention that should be followed. The name of images for GitHub registry are:

docker.pkg.github.com/apache/airflow/<BRANCH>-pythonX.Y       - for production images
docker.pkg.github.com/apache/airflow/<BRANCH>-pythonX.Y-ci    - for CI images
docker.pkg.github.com/apache/airflow/<BRANCH>-pythonX.Y-build - for production build state

Note that we never push or pull TAG images to GitHub registry. It is only used for CI builds

You can see all the current GitHub images at https://github.com/apache/airflow/packages

In order to interact with the GitHub images you need to add --github-registry flag to the pull/push commands in Breeze. This way the images will be pulled/pushed from/to GitHub rather than from/to DockerHub. Images are build locally as apache/airflow images but then they are tagged with the right GitHub tags for you.

You can read more about the CI configuration and how CI builds are using DockerHub/GitHub images in CI.rst.

Note that you need to be committer and have the right to push to DockerHub and GitHub and you need to be logged in. Only committers can push images directly.

The CI image is used by Breeze as shell image but it is also used during CI build. The image is single segment image that contains Airflow installation with "all" dependencies installed. It is optimised for rebuild speed. It installs PIP dependencies from the current branch first - so that any changes in setup.py do not trigger reinstalling of all dependencies. There is a second step of installation that re-installs the dependencies from the latest sources so that we are sure that latest dependencies are installed.

The production image is a multi-segment image. The first segment "airflow-build-image" contains all the build essentials and related dependencies that allow to install airflow locally. By default the image is build from a released version of Airflow from Github, but by providing some extra arguments you can also build it from local sources. This is particularly useful in CI environment where we are using the image to run Kubernetes tests. See below for the list of arguments that should be provided to build production image from the local sources.

The image is primarily optimised for size of the final image, but also for speed of rebuilds - the 'airflow-build-image' segment uses the same technique as the CI builds for pre-installing PIP dependencies. It first pre-installs them from the right github branch and only after that final airflow installation is done from either local sources or remote location (PIP or github repository).

You can build the default production image with standard docker build command but they will only build default versions of the image and will not use the dockerhub versions of images as cache.

The following build arguments (--build-arg in docker build command) can be used for CI images:

Build argument Default value Description
PYTHON_BASE_IMAGE python:3.6-slim-buster Base python image
AIRFLOW_VERSION 2.0.0.dev0 version of Airflow
PYTHON_MAJOR_MINOR_VERSION 3.6 major/minor version of Python (should match base image)
DEPENDENCIES_EPOCH_NUMBER 2 increasing this number will reinstall all apt dependencies
PIP_NO_CACHE_DIR true if true, then no pip cache will be stored
HOME /root Home directory of the root user (CI image has root user as default)
AIRFLOW_HOME /root/airflow Airflow’s HOME (that’s where logs and sqlite databases are stored)
AIRFLOW_SOURCES /opt/airflow Mounted sources of Airflow
PIP_DEPENDENCIES_EPOCH_NUMBER 3 increasing that number will reinstall all PIP dependencies
CASS_DRIVER_NO_CYTHON 1 if set to 1 no CYTHON compilation is done for cassandra driver (much faster)
AIRFLOW_REPO apache/airflow the repository from which PIP dependencies are pre-installed
AIRFLOW_BRANCH master the branch from which PIP dependencies are pre-installed
AIRFLOW_CI_BUILD_EPOCH 1 increasing this value will reinstall PIP dependencies from the repository from scratch
AIRFLOW_EXTRAS all extras to install
ADDITIONAL_AIRFLOW_EXTRAS   additional extras to install
ADDITIONAL_PYTHON_DEPS   additional python dependencies to install
ADDITIONAL_DEV_DEPS   additional apt dev dependencies to install
ADDITIONAL_RUNTIME_DEPS   additional apt runtime dependencies to install

Here are some examples of how CI images can built manually. CI is always built from local sources.

This builds the CI image in version 3.7 with default extras ("all").

docker build . -f Dockerfile.ci --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-buster" \
  --build-arg PYTHON_MAJOR_MINOR_VERSION=3.7

This builds the CI image in version 3.6 with "gcp" extra only.

docker build . -f Dockerfile.ci --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-buster" \
  --build-arg PYTHON_MAJOR_MINOR_VERSION=3.6 --build-arg AIRFLOW_EXTRAS=gcp

This builds the CI image in version 3.6 with "apache-beam" extra added.

docker build . -f Dockerfile.ci --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-buster" \
  --build-arg PYTHON_MAJOR_MINOR_VERSION=3.6 --build-arg ADDITIONAL_AIRFLOW_EXTRAS="apache-beam"

This builds the CI image in version 3.6 with "mssql" additional package added.

docker build . -f Dockerfile.ci --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-buster" \
  --build-arg PYTHON_MAJOR_MINOR_VERSION=3.6 --build-arg ADDITIONAL_PYTHON_DEPS="mssql"

This builds the CI image in version 3.6 with "gcc" and "g++" additional apt dev dependencies added.

docker build . -f Dockerfile.ci --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-buster" \
  --build-arg PYTHON_MAJOR_MINOR_VERSION=3.6 --build-arg ADDITIONAL_DEV_DEPS="gcc g++"

This builds the CI image in version 3.6 with "jdbc" extra and "default-jre-headless" additional apt runtime dependencies added.

docker build . -f Dockerfile.ci --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-buster" \
  --build-arg PYTHON_MAJOR_MINOR_VERSION=3.6 --build-arg AIRFLOW_EXTRAS=jdbc --build-arg ADDITIONAL_RUNTIME_DEPS="default-jre-headless"

The following build arguments (--build-arg in docker build command) can be used for production images:

Build argument Default value Description
PYTHON_BASE_IMAGE python:3.6-slim-buster Base python image
PYTHON_MAJOR_MINOR_VERSION 3.6 major/minor version of Python (should match base image)
AIRFLOW_VERSION 2.0.0.dev0 version of Airflow
AIRFLOW_REPO apache/airflow the repository from which PIP dependencies are pre-installed
AIRFLOW_BRANCH master the branch from which PIP dependencies are pre-installed initially
AIRFLOW_CONSTRAINTS_REFERENCE constraints-master reference (branch or tag) from Github repository from which constraints are used. By default it is set to constraints-master but can be constraints-1-10 for 1.10.* versions or it could point to specific version for example constraints-1.10.12
AIRFLOW_EXTRAS (see Dockerfile) Default extras with which airflow is installed
ADDITIONAL_AIRFLOW_EXTRAS   Optional additional extras with which airflow is installed
ADDITIONAL_PYTHON_DEPS   Optional python packages to extend the image with some extra dependencies
ADDITIONAL_DEV_DEPS   additional apt dev dependencies to install
ADDITIONAL_RUNTIME_DEPS   additional apt runtime dependencies to install
AIRFLOW_HOME /opt/airflow Airflow’s HOME (that’s where logs and sqlite databases are stored)
AIRFLOW_UID 50000 Airflow user UID
AIRFLOW_GID 50000 Airflow group GID. Note that most files created on behalf of airflow user belong to the root group (0) to keep OpenShift Guidelines compatibility
AIRFLOW_USER_HOME_DIR /home/airflow Home directory of the Airflow user
CASS_DRIVER_BUILD_CONCURRENCY 8 Number of processors to use for cassandra PIP install (speeds up installing in case cassandra extra is used).

There are build arguments that determine the installation mechanism of Apache Airflow for the production image. There are three types of build:

  • From local sources (by default for example when you use docker build .)
  • You can build the image from released PyPi airflow package (used to build the official Docker image)
  • You can build the image from any version in GitHub repository(this is used mostly for system testing).
Build argument What to specify
AIRFLOW_INSTALL_SOURCES Should point to the sources of of Apache Airflow. It can be either "." for installation from local sources, "apache-airflow" for installation from packages and URL to installation from GitHub repository (see below) to install from any GitHub version
AIRFLOW_INSTALL_VERSION Optional - might be used for package installation case to set Airflow version for example "==1.10.10"
AIRFLOW_CONSTRAINTS_REFERENCE reference (branch or tag) from Github where constraints file is taken from. By default it is constraints-master but can be constraints-1-10 for 1.10.* constraint or if you want to point to specific varsion ``constraints-1.10.12
AIRFLOW_WWW In case of Airflow 2.0 it should be "www", in case of Airflow 1.10 series it should be "www_rbac". See examples below
AIRFLOW_SOURCES_FROM Sources of Airflow. Set it to "empty" to avoid costly Docker context copying in case of installation from the package or from GitHub URL. See examples below
AIRFLOW_SOURCES_TO Target for Airflow sources. Set to "/empty" to avoid costly Docker context copying in case of installation from the package or from GitHub URL. See examples below

This builds production image in version 3.6 with default extras from the local sources (master version of 2.0 currently):

docker build .

This builds the production image in version 3.7 with default extras from 1.10.12 tag and constraints taken from constraints-1-10-12 branch in Github.

docker build . \
  --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-buster" \
  --build-arg PYTHON_MAJOR_MINOR_VERSION=3.7 \
  --build-arg AIRFLOW_INSTALL_SOURCES="https://github.com/apache/airflow/archive/1.10.12.tar.gz#egg=apache-airflow" \
  --build-arg AIRFLOW_CONSTRAINTS_REFERENCE="constraints-1-10" \
  --build-arg AIRFLOW_BRANCH="v1-10-test" \
  --build-arg AIRFLOW_SOURCES_FROM="empty" \
  --build-arg AIRFLOW_SOURCES_TO="/empty"

This builds the production image in version 3.7 with default extras from 1.10.12 Pypi package and constraints taken from 1.10.12 tag in Github and pre-installed pip dependencies from the top of v1-10-test branch.

docker build . \
  --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-buster" \
  --build-arg PYTHON_MAJOR_MINOR_VERSION=3.7 \
  --build-arg AIRFLOW_INSTALL_SOURCES="apache-airflow" \
  --build-arg AIRFLOW_INSTALL_VERSION="==1.10.12" \
  --build-arg AIRFLOW_BRANCH="v1-10-test" \
  --build-arg AIRFLOW_CONSTRAINTS_REFERENCE="constraints-1.10.12" \
  --build-arg AIRFLOW_SOURCES_FROM="empty" \
  --build-arg AIRFLOW_SOURCES_TO="/empty"

This builds the production image in version 3.7 with additional airflow extras from 1.10.10 Pypi package and additional python dependencies and pre-installed pip dependencies from 1.10.10 tagged constraints.

docker build . \
  --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-buster" \
  --build-arg PYTHON_MAJOR_MINOR_VERSION=3.7 \
  --build-arg AIRFLOW_INSTALL_SOURCES="apache-airflow" \
  --build-arg AIRFLOW_INSTALL_VERSION="==1.10.10" \
  --build-arg AIRFLOW_BRANCH="v1-10-test" \
  --build-arg AIRFLOW_CONSTRAINTS_REFERENCE="constraints-1.10.10" \
  --build-arg AIRFLOW_SOURCES_FROM="empty" \
  --build-arg AIRFLOW_SOURCES_TO="/empty" \
  --build-arg ADDITIONAL_AIRFLOW_EXTRAS="mssql,hdfs"
  --build-arg ADDITIONAL_PYTHON_DEPS="sshtunnel oauth2client"

This builds the production image in version 3.7 with additional airflow extras from 1.10.10 Pypi package and additional apt dev and runtime dependencies.

docker build . \
  --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-buster" \
  --build-arg PYTHON_MAJOR_MINOR_VERSION=3.7 \
  --build-arg AIRFLOW_INSTALL_SOURCES="apache-airflow" \
  --build-arg AIRFLOW_INSTALL_VERSION="==1.10.12" \
  --build-arg AIRFLOW_CONSTRAINTS_REFERENCE="constraints-1-10" \
  --build-arg AIRFLOW_SOURCES_FROM="empty" \
  --build-arg AIRFLOW_SOURCES_TO="/empty" \
  --build-arg ADDITIONAL_AIRFLOW_EXTRAS="jdbc"
  --build-arg ADDITIONAL_DEV_DEPS="gcc g++"
  --build-arg ADDITIONAL_RUNTIME_DEPS="default-jre-headless"

Together with the main CI images we also build and push image manifests. Those manifests are very small images that contain only results of the docker inspect for the image. This is in order to be able to determine very quickly if the image in the docker registry has changed a lot since the last time. Unfortunately docker registry (specifically DockerHub registry) has no anonymous way of querying image details via API, you need to download the image to inspect it. We overcame it in the way that always when we build the image we build a very small image manifest and push it to registry together with the main CI image. The tag for the manifest image is the same as for the image it refers to with added -manifest suffix. The manifest image for apache/airflow:master-python3.6-ci is named apache/airflow:master-python3.6-ci-manifest.

Sometimes the image needs to be rebuilt from scratch. This is required, for example, when there is a security update of the Python version that all the images are based on and new version of the image is pushed to the repository. In this case it is usually faster to pull the latest images rather than rebuild them from scratch.

You can do it via the --force-pull-images flag to force pulling the latest images from the Docker Hub.

For production image:

./breeze build-image --force-pull-images --production-image

For CI image Breeze automatically uses force pulling in case it determines that your image is very outdated, however uou can also force it with the same flag.

./breeze build-image --force-pull-images
Both images have a set of scripts that can be used in the image. Those are:
  • /entrypoint - entrypoint script used when entering the image
  • /clean-logs - script for periodic log cleaning

The entrypoint in the CI image contains all the initialisation needed for tests to be immediately executed. It is copied from scripts/in_container/entrypoint_ci.sh.

The default behaviour is that you are dropped into bash shell. However if RUN_TESTS variable is set to "true", then tests passed as arguments are executed

The entrypoint performs those operations:

  • checks if the environment is ready to test (including database and all integrations). It waits until all the components are ready to work
  • installs older version of Airflow (if older version of Airflow is requested to be installed via INSTALL_AIRFLOW_VERSION variable.
  • Sets up Kerberos if Kerberos integration is enabled (generates and configures Kerberos token)
  • Sets up ssh keys for ssh tests and restarts teh SSH server
  • Sets all variables and configurations needed for unit tests to run
  • Reads additional variables set in files/airflow-breeze-config/variables.env by sourcing that file
  • In case of CI run sets parallelism to 2 to avoid excessive number of processes to run
  • In case of CI run sets default parameters for pytest
  • In case of running integration/long_running/quarantined tests - it sets the right pytest flags
  • Sets default "tests" target in case the target is not explicitly set as additional argument
  • Runs system tests if RUN_SYSTEM_TESTS flag is specified, otherwise runs regular unit and integration tests

The entrypoint in the PROD image contains all the initialisation needed for tests to be immediately executed. It is copied from scripts/in_container/entrypoint_prod.sh.

The PROD image entrypoint works as follows:

  • In case the user is not "airflow" (with undefined user id) and the group id of the user is set to 0 (root), then the user is dynamically added to /etc/passwd at entry using USER_NAME variable to define the user name. This is in order to accommodate the OpenShift Guidelines
  • If AIRFLOW__CORE__SQL_ALCHEMY_CONN variable is passed to the container and it is either mysql or postgres SQL alchemy connection, then the connection is checked and the script waits until the database is reachable.
  • If no AIRFLOW__CORE__SQL_ALCHEMY_CONN variable is set or if it is set to sqlite SQL alchemy connection then db reset is executed.
  • If AIRFLOW__CELERY__BROKER_URL variable is passed and scheduler, worker of flower command is used then the connection is checked and the script waits until the Celery broker database is reachable.
  • If first argument equals to "bash" - it dropped in bash shell or executes bash command if you specify extra arguments. For example:
docker run -it apache/airflow:master-python3.6 bash -c "ls -la"
total 16
drwxr-xr-x 4 airflow root 4096 Jun  5 18:12 .
drwxr-xr-x 1 root    root 4096 Jun  5 18:12 ..
drwxr-xr-x 2 airflow root 4096 Jun  5 18:12 dags
drwxr-xr-x 2 airflow root 4096 Jun  5 18:12 logs
  • If first argument is equal to "python" - you are dropped in python shell or python commands are executed if you pass extra parameters. For example:
> docker run -it apache/airflow:master-python3.6 python -c "print('test')"
test
  • If there are any other arguments - they are passed to "airflow" command
> docker run -it apache/airflow:master-python3.6 --help

usage: airflow [-h]
               {celery,config,connections,dags,db,info,kerberos,plugins,pools,roles,rotate_fernet_key,scheduler,sync_perm,tasks,users,variables,version,webserver}
               ...

positional arguments:

  Groups:
    celery              Start celery components
    connections         List/Add/Delete connections
    dags                List and manage DAGs
    db                  Database operations
    pools               CRUD operations on pools
    roles               Create/List roles
    tasks               List and manage tasks
    users               CRUD operations on users
    variables           CRUD operations on variables

  Commands:
    config              Show current application configuration
    info                Show information about current Airflow and environment
    kerberos            Start a kerberos ticket renewer
    plugins             Dump information about loaded plugins
    rotate-fernet-key   Rotate encrypted connection credentials and variables
    scheduler           Start a scheduler instance
    sync-perm           Update permissions for existing roles and DAGs
    version             Show the version
    webserver           Start a Airflow webserver instance

optional arguments:
  -h, --help            show this help message and exit

The production images have been released for the first time in 1.10.10 release of Airflow as "Alpha" quality ones. Between 1.10.10 the images are being improved and the 1.10.10 images should be patched and published several times separately in order to test them with the upcoming Helm Chart.

Those images are for development and testing only and should not be used outside of the development community.

The images were pushed with tags following the pattern: apache/airflow:1.10.10.1-alphaN-pythonX.Y. Patch level is an increasing number (starting from 1).

Those are alpha-quality releases however they contain the officially released Airflow 1.10.10 code. The main changes in the images are scripts embedded in the images.

The following versions were pushed:

Patch Tag pattern Description
1 1.10.10.1-alpha1-pythonX.Y Support for parameters added to bash and python commands
2 1.10.10-1-alpha2-pythonX.Y Added "/clean-logs" script

The commits used to generate those images are tagged with prod-image-1.10.10.1-alphaN tags.