Interactive docker image #709

Open
wants to merge 13 commits into
base: main
2 changes: 2 additions & 0 deletions .gitignore
@@ -27,6 +27,8 @@ bld/

# Visual Studio 2015/2017 cache/options directory
.vs/
# Visual Studio Code cache/options directory
.vscode/
# Uncomment if you have tasks that create the project's static files in wwwroot
#wwwroot/

1 change: 1 addition & 0 deletions binder/Dockerfile
@@ -0,0 +1 @@
FROM mcr.microsoft.com/dotnet-spark:2.4.6-0.12.1-interactive
7 changes: 7 additions & 0 deletions binder/README.md
@@ -0,0 +1,7 @@
# .NET for Apache Spark Interactive

This interactive notebook allows you to explore .NET for Apache Spark in your web browser.

To launch it, just click the button below:

[![Binder](./dotnet-spark-binder.svg)](https://mybinder.org/v2/gh/indy-3rdman/spark/docker_images_init?urlpath=lab/tree/nb/)
1 change: 1 addition & 0 deletions binder/dotnet-spark-binder.svg
69 changes: 69 additions & 0 deletions docker/images/interactive/README.md
@@ -0,0 +1,69 @@
# .NET for Apache Spark interactive Docker image

## Description

This directory contains the source code to build an interactive Docker image that uses [jupyter/base-notebook](https://hub.docker.com/r/jupyter/base-notebook) as its foundation.

## Building

To build the image, execute the [build.sh](build.sh) bash script. By default, it builds an image using the latest supported versions of .NET Core, Apache Spark and .NET for Apache Spark.

You can also build for different versions, by specifying one of the following options:

```bash
-a, --apache-spark
-d, --dotnet-spark
```

For more details please run

```bash
build.sh -h
```

Please note that not all version combinations are supported.
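
As a rough illustration of the options above, the following sketch prints one build.sh invocation per Apache Spark version instead of running it. The helper name `print_build_commands` is hypothetical and not part of this directory; the version numbers are taken from the supported lists in build.sh, and the `./build.sh` path assumes the script is run from this directory.

```bash
# Hypothetical dry-run helper (not part of build.sh).
# First argument: the .NET for Apache Spark version.
# Remaining arguments: the Apache Spark versions to build for.
print_build_commands() {
  dotnet_spark_version="$1"
  shift
  for spark_version in "$@"; do
    # Only prints the command; remove the "echo" to start the build.
    echo "./build.sh --apache-spark ${spark_version} --dotnet-spark ${dotnet_spark_version}"
  done
}

print_build_commands 1.0.0 2.4.7 3.0.1
```

Printing the commands first makes it easy to check the version combination before a build downloads anything.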

## The image build stages

Splitting the build into stages makes it efficient to build multiple images that share the same .NET Core SDK but use different .NET for Apache Spark or Apache Spark versions.
Shared dependencies (e.g. the .NET Core SDK) do not have to be downloaded again for each version, which saves time and bandwidth.

The three stages used in the build process are:

- ### **dotnet-interactive**

Builds on the jupyter/base-notebook image and installs the .NET Core SDK, along with Microsoft.DotNet.Interactive.

- ### **dotnet-spark-base (interactive)**

Adds the specified .NET for Apache Spark version to the dotnet-interactive image and also copies/builds the HelloSpark example into the image. HelloSpark is also used to install the correct microsoft-spark-*.jar version that is required to start a spark-submit session in debug mode.

- ### **dotnet-spark (interactive)**

Gets/installs the specified Apache Spark version and adds the example notebooks.
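
To make the three stages more concrete, here is a minimal sketch that prints the image tag each stage produces, plus the microsoft-spark JAR name, for a given set of versions. The helper `describe_build` is hypothetical; the tag patterns, the `3rdman` repository name and the Scala mapping (2.11 for Apache Spark 2.x, 2.12 for 3.x) mirror what build.sh in this directory does, so treat the output as an illustration rather than a guarantee.

```bash
# Hypothetical helper (not part of build.sh): print the names produced
# by each build stage for the given versions.
# Arguments: .NET Core version, .NET for Apache Spark version, Apache Spark version
describe_build() {
  dotnet_core_version="$1"
  dotnet_spark_version="$2"
  apache_spark_version="$3"

  # major.minor with the dot replaced by a dash, e.g. 3.0.1 -> 3-0
  short_version="$(echo "${apache_spark_version}" | cut -d. -f1,2 | tr '.' '-')"

  # Scala version depends on the Apache Spark major version
  case "${apache_spark_version}" in
    2.*) scala_version=2.11 ;;
    3.*) scala_version=2.12 ;;
    *) echo "unsupported Apache Spark version: ${apache_spark_version}" >&2
       return 1 ;;
  esac

  echo "dotnet-interactive:${dotnet_core_version}"
  echo "dotnet-spark-base-interactive:${dotnet_spark_version}"
  echo "3rdman/dotnet-spark:${dotnet_spark_version}-${apache_spark_version}-interactive"
  echo "microsoft-spark-${short_version}_${scala_version}-${dotnet_spark_version}.jar"
}

describe_build 3.1 1.0.0 3.0.1
```

Because only the last two names depend on the Apache Spark version, building for a second Spark version can reuse the first two images.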

## Docker Run Example

To start a new container based on the dotnet-spark interactive image, just run the following command.

```bash
docker run --name dotnet-spark-interactive -d -p 8888:8888 3rdman/dotnet-spark:interactive-latest
```

After that, examine the logs of the container to get the URL, including the authentication token, that is required to connect to Jupyter.

```bash
docker logs -f dotnet-spark-interactive
```
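
If you prefer not to scan the log output by eye, a small filter along the following lines may help. This is a sketch under assumptions: the helper name `extract_jupyter_url` is made up, and it assumes Jupyter prints its URL in the form `http://<host>:8888/...token=<token>`, which can differ between Jupyter versions.

```bash
# Hypothetical helper: read log lines on stdin and print the last URL
# that carries a token parameter.
extract_jupyter_url() {
  grep -oE 'https?://[^[:space:]]+token=[^[:space:]]+' | tail -n 1
}

# Usage (requires the container started above to be running).
# Jupyter typically logs to stderr, hence the 2>&1:
# docker logs dotnet-spark-interactive 2>&1 | extract_jupyter_url
```

Note that the usage line calls `docker logs` without `-f`, so the command returns instead of following the log stream.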

![launch](img/dotnet-interactive-docker-launch.gif)

It is important to start the .NET for Apache Spark backend in debug mode first, before using it in any of the notebooks.

The helper script start-spark-debug.sh can do this for you, as demonstrated below.

![debug](img/dotnet-interactive-start-debug.gif)

Once the backend is running, please open 02-basic-example.ipynb to learn how you can use .NET for Apache Spark in your own notebooks.

![example](img/dotnet-interactive-basic-example.gif)
251 changes: 251 additions & 0 deletions docker/images/interactive/build.sh
@@ -0,0 +1,251 @@
#!/usr/bin/env bash

# Create different versions of the .NET for Apache Spark interactive docker image
# based on the Apache Spark and .NET for Apache Spark version.

set -o errexit # abort on nonzero exitstatus
set -o nounset # abort on unbound variable
set -o pipefail # don't hide errors within pipes

readonly image_repository='3rdman'
readonly supported_apache_spark_versions=(

Review comment (Member): This may be a question for Spark team. Thoughts on how to keep this version list up-to-date and other versions included in this script up-to-date? It feels like there should be long term plans for getting this updated "automatically" as part of the release process. Without this they will become stale and/or be a maintenance burden.

Reply (Author): Agreed

"2.3.0" "2.3.1" "2.3.2" "2.3.3" "2.3.4"
"2.4.0" "2.4.1" "2.4.3" "2.4.4" "2.4.5" "2.4.6" "2.4.7"
"3.0.0" "3.0.1"
)
readonly supported_dotnet_spark_versions=("1.0.0")
readonly dotnet_core_version=3.1

dotnet_spark_version=1.0.0
dotnet_spark_jar=""
apache_spark_version=3.0.1
apache_spark_short_version="${apache_spark_version:0:3}"

main() {
# Parse the options and set the related variables
while [[ "$#" -gt 0 ]]; do
case $1 in
-a|--apache-spark) opt_check_apache_spark_version "$2"; shift ;;
-d|--dotnet-spark) opt_check_dotnet_spark_version "$2"; shift ;;
-h|--help) print_help
exit 0 ;;
*) echo "Unknown parameter passed: $1"; exit 1 ;;
esac
shift
done

echo "Building .NET for Apache Spark ${dotnet_spark_version} interactive image with Apache Spark ${apache_spark_version}"

# make sure the temporary folders are removed on exit, even if a build stage fails
trap finish EXIT

# execute the different build stages
cleanup

set_dotnet_spark_jar
build_dotnet_interactive
build_dotnet_spark_base_interactive
build_dotnet_spark_interactive

exit 0
}

#######################################
# Checks if the provided Apache Spark version number is supported
# Arguments:
# The version number string
# Result:
# Sets the global variable apache_spark_version if supported,
# otherwise exits with a related message
#######################################
opt_check_apache_spark_version() {
local provided_version="${1}"
local valid_version=""

for value in "${supported_apache_spark_versions[@]}"
do
[[ "${provided_version}" = "$value" ]] && valid_version="${provided_version}"
done

if [ -z "${valid_version}" ]
then
echo "${provided_version} is an unsupported Apache Spark version."
exit 1 ;
else
apache_spark_version="${valid_version}"
apache_spark_short_version="${apache_spark_version:0:3}"
fi
}

#######################################
# Checks if the provided .NET for Apache Spark version number is supported
# Arguments:
# The version number string
# Result:
# Sets the global variable dotnet_spark_version if supported,
# otherwise exits with a related message
#######################################
opt_check_dotnet_spark_version() {
local provided_version="${1}"
local valid_version=""

for value in "${supported_dotnet_spark_versions[@]}"
do
[[ "${provided_version}" = "$value" ]] && valid_version="${provided_version}"
done

if [ -z "${valid_version}" ]
then
echo "${provided_version} is an unsupported .NET for Apache Spark version."
exit 1 ;
else
dotnet_spark_version="${valid_version}"
fi
}

#######################################
# Replaces every occurrence of search_string by replacement_string in a file
# Arguments:
# The file name
# The string to search for
# The string to replace the search string with
# Result:
# An updated file with the replaced string
#######################################
replace_text_in_file() {
local filename=${1}
local search_string=${2}
local replacement_string=${3}

sh -c 'sed -i.bak "s/$1/$2/g" "$3" && rm "$3.bak"' _ "${search_string}" "${replacement_string}" "${filename}"
}

#######################################
# Sets the microsoft-spark JAR name based on the Apache Spark version
#######################################
set_dotnet_spark_jar() {
local scala_version="2.11"
local short_spark_version="${apache_spark_short_version//./-}"

case "${apache_spark_version:0:1}" in
2) scala_version=2.11 ;;
3) scala_version=2.12 ;;
esac

dotnet_spark_jar="microsoft-spark-${short_spark_version}_${scala_version}-${dotnet_spark_version}.jar"
}

#######################################
# Runs the docker build command with the related build arguments
# Arguments:
# The image name (incl. tag)
# Result:
# A local docker image with the specified name
#######################################
build_image() {
local image_name="${1}"
# Keep the build arguments in an array so each one is passed to docker as a separate word
local build_args=(
--build-arg "dotnet_core_version=${dotnet_core_version}"
--build-arg "dotnet_spark_version=${dotnet_spark_version}"
--build-arg "SPARK_VERSION=${apache_spark_version}"
--build-arg "DOTNET_SPARK_JAR=${dotnet_spark_jar}"
)

echo "Building ${image_name}"

docker build "${build_args[@]}" -t "${image_name}" .
}

#######################################
# Use the Dockerfile in the sub-folder dotnet-interactive to build the image of the first stage
# Result:
# A dotnet-interactive docker image tagged with the .NET core version
#######################################
build_dotnet_interactive() {
local image_name="dotnet-interactive:${dotnet_core_version}"

cd dotnet-interactive
build_image "${image_name}"
cd ~-
}

#######################################
# Use the Dockerfile in the sub-folder dotnet-spark-base to build the image of the second stage
# The image contains the specified .NET for Apache Spark version
# Result:
# A dotnet-spark-base-interactive docker image tagged with the .NET for Apache Spark version
#######################################
build_dotnet_spark_base_interactive() {
local image_name="dotnet-spark-base-interactive:${dotnet_spark_version}"

cd dotnet-spark-base
build_image "${image_name}"
cd ~-
}

#######################################
# Use the Dockerfile in the sub-folder dotnet-spark to build the image of the last stage
# The image contains the specified Apache Spark version
# Result:
# A dotnet-spark docker image tagged with the .NET for Apache Spark version, Apache Spark version and the suffix -interactive
#######################################
build_dotnet_spark_interactive() {
local image_name="${image_repository}/dotnet-spark:${dotnet_spark_version}-${apache_spark_version}-interactive"

cd dotnet-spark
cp --recursive templates/scripts ./bin
cp --recursive templates/HelloSpark ./HelloSpark

replace_text_in_file HelloSpark/HelloSpark.csproj "<TargetFramework><\/TargetFramework>" "<TargetFramework>netcoreapp${dotnet_core_version}<\/TargetFramework>"
replace_text_in_file HelloSpark/HelloSpark.csproj "PackageReference Include=\"Microsoft.Spark\" Version=\"\"" "PackageReference Include=\"Microsoft.Spark\" Version=\"${dotnet_spark_version}\""

replace_text_in_file HelloSpark/README.txt "netcoreappX.X" "netcoreapp${dotnet_core_version}"
replace_text_in_file HelloSpark/README.txt "spark-X.X.X" "spark-${apache_spark_short_version}.x"
replace_text_in_file HelloSpark/README.txt "microsoft-spark-${apache_spark_short_version}.x-X.X.X.jar" "${dotnet_spark_jar}"

replace_text_in_file bin/start-spark-debug.sh "microsoft-spark-X.X.X.jar" "${dotnet_spark_jar}"

replace_text_in_file 02-basic-example.ipynb "nuget: Microsoft.Spark,X.X.X" "nuget: Microsoft.Spark,${dotnet_spark_version}"

build_image "${image_name}"
cd ~-
}

#######################################
# Remove the temporary folders created during the different build stages
#######################################
cleanup()
{
cd dotnet-spark
rm --recursive --force bin
rm --recursive --force HelloSpark
cd ~-
}

finish()
{
result=$?
cleanup
exit ${result}
}

#######################################
# Display the help text
#######################################
print_help() {
cat <<HELPMSG
Usage: build.sh [OPTIONS]

Builds a .NET for Apache Spark interactive docker image

Options:
-a, --apache-spark A supported Apache Spark version to be used within the image
-d, --dotnet-spark The .NET for Apache Spark version to be used within the image
-h, --help Show this usage help

If -a or -d is not defined, default values are used

Apache Spark: $apache_spark_version
.NET for Apache Spark: $dotnet_spark_version
HELPMSG
}

main "${@}"
37 changes: 37 additions & 0 deletions docker/images/interactive/dotnet-interactive/Dockerfile
@@ -0,0 +1,37 @@
FROM jupyter/base-notebook:ubuntu-18.04
LABEL maintainer="Martin Kandlbinder <[email protected]>"

ARG DOTNET_CORE_VERSION=3.1
ENV DOTNET_CORE_VERSION=$DOTNET_CORE_VERSION \

Review comment (Member): Per the Dockerfile Best Practices, sort multi-line instructions to improve readability where possible (e.g. cross dependencies)

DOTNET_RUNNING_IN_CONTAINER=true \
DOTNET_USE_POLLING_FILE_WATCHER=true \
NUGET_XMLDOC_MODE=skip \
PATH="${PATH}:${HOME}/.dotnet/tools"

USER root

RUN apt-get update \
&& apt-get install -y --no-install-recommends \
apt-utils \

Review comment (Member): What is requiring all of these native dependencies? Several are already provided by the base image so they don't seem necessary to declare.

Reply (Author): This should be cleaned up now. Java obviously is required by Spark.

dialog \
libgssapi-krb5-2 \
libicu60 \
openjdk-8-jdk \
software-properties-common \
unzip \
&& wget -q https://packages.microsoft.com/config/ubuntu/18.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb \
&& dpkg -i packages-microsoft-prod.deb \
&& add-apt-repository universe \
&& apt-get install -y apt-transport-https \
&& apt-get update \
&& apt-get install -y dotnet-sdk-$DOTNET_CORE_VERSION \
&& apt-get clean && rm -rf /var/lib/apt/lists/* \
&& rm -rf packages-microsoft-prod.deb

COPY ./nuget.config ${HOME}/nuget.config

USER ${NB_USER}

RUN pip install nteract_on_jupyter \
&& dotnet tool install -g Microsoft.dotnet-interactive \
&& dotnet interactive jupyter install