Interactive docker image #709

Open · wants to merge 13 commits into base: main
Changes from 1 commit
16 changes: 10 additions & 6 deletions docker/images/interactive/README.md
@@ -2,11 +2,11 @@

## Description

This directory contains the source code to build a docker interactive image by using the [jupyter/base-notebook](https://hub.docker.com/r/jupyter/base-notebook) as foundation.
This directory contains the source code to build an interactive Docker image, using [jupyter/base-notebook](https://hub.docker.com/r/jupyter/base-notebook) as the foundation.

## Building

To build the image run the [build.sh](build.sh) bash script. Per default it should build an image using the latest supported versions of .NET Core, Apache Spark and .NET for Apache Spark.
To build the image, execute the [build.sh](build.sh) bash script. By default, it builds an image using the latest supported versions of .NET Core, Apache Spark and .NET for Apache Spark.
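
For example, a default build could be run like this (a minimal sketch; it assumes you are in the docker/images/interactive directory and that the script has the executable bit set, otherwise invoke it via bash):

```bash
# Builds with the latest supported versions of .NET Core,
# Apache Spark and .NET for Apache Spark (the script defaults)
./build.sh
```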

You can also build for different versions by specifying one of the following options:

@@ -34,20 +34,20 @@ The three stages used in the build process are:

Builds on the jupyter/base-notebook image and installs the .NET Core SDK, along with Microsoft.DotNet.Interactive.

- ### **dotnet-spark-interactive-base**
- ### **dotnet-spark-base (interactive)**

Adds the specified .NET for Apache Spark version to the dotnet-interactive image and also copies/builds the HelloSpark example into the image. HelloSpark is also used to install the correct microsoft-spark-*.jar version that is required to start a spark-submit session in debug mode.

- ### **dotnet-spark (interactive)**

Gets/installs the specified Apache Spark version and add the example notebooks.
Gets/installs the specified Apache Spark version and adds the example notebooks.

## Docker Run Example

To start a new container based on the dotnet-spark interactive image, just run the following command.

```bash
docker run --name dotnet-spark-interactive -d -p 8888:8888 mcr.microsoft.com/dotnet-spark:interactive-latest
docker run --name dotnet-spark-interactive -d -p 8888:8888 3rdman/dotnet-spark:interactive-latest
```

After that, examine the logs of the container to get the correct URL that is required to connect to Jupyter using the authentication token.
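
One way to view those logs is via `docker logs` (a quick sketch; the container name matches the run command above):

```bash
docker logs dotnet-spark-interactive
# Look for a line similar to:
#   http://127.0.0.1:8888/?token=<authentication token>
```
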
@@ -62,4 +62,8 @@ It is important to start the .NET for Apache Spark backend in debug mode first,

The helper script start-spark-debug.sh can do this for you, as demonstrated below.

![example](img/dotnet-interactive-docker-example.gif)
![debug](img/dotnet-interactive-start-debug.gif)
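
Alternatively, a hypothetical way to launch the script from the host is shown below; it assumes that start-spark-debug.sh is on the PATH inside the running container:

```bash
# Run the debug helper script inside the already running container
docker exec -it dotnet-spark-interactive start-spark-debug.sh
```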

Once the backend is running, please open 02-basic-example.ipynb to learn how you can use .NET for Apache Spark in your own notebooks.

![example](img/dotnet-interactive-basic-example.gif)
97 changes: 0 additions & 97 deletions docker/images/interactive/apache-spark/02-run-dotnet-spark.ipynb

This file was deleted.

30 changes: 0 additions & 30 deletions docker/images/interactive/apache-spark/Dockerfile

This file was deleted.

This file was deleted.

65 changes: 45 additions & 20 deletions docker/images/interactive/build.sh
@@ -8,12 +8,17 @@ set -o nounset # abort on unbound variable
set -o pipefail # don't hide errors within pipes

readonly image_repository='3rdman'
readonly supported_apache_spark_versions=("2.3.3" "2.3.4" "2.4.0" "2.4.1" "2.4.3" "2.4.4" "2.4.5" "2.4.6")
readonly supported_dotnet_spark_versions=("0.9.0" "0.10.0" "0.11.0" "0.12.1")
readonly supported_apache_spark_versions=(
Member: This may be a question for Spark team. Thoughts on how to keep this version list up-to-date and other versions included in this script up-to-date? It feels like there should be long term plans for getting this updated "automatically" as part of the release process. Without this they will become stale and/or be a maintenance burden.

Author: Agreed

"2.3.0" "2.3.1" "2.3.2" "2.3.3" "2.3.4"
"2.4.0" "2.4.1" "2.4.3" "2.4.4" "2.4.5" "2.4.6" "2.4.7"
"3.0.0" "3.0.1"
)
readonly supported_dotnet_spark_versions=("1.0.0")
readonly dotnet_core_version=3.1

dotnet_spark_version=0.12.1
apache_spark_version=2.4.6
dotnet_spark_version=1.0.0
dotnet_spark_jar=""
apache_spark_version=3.0.1
apache_spark_short_version="${apache_spark_version:0:3}"

main() {
@@ -34,8 +39,9 @@ main() {
# execute the different build stages
cleanup

set_dotnet_spark_jar
build_dotnet_interactive
build_dotnet_spark_interactive_base
build_dotnet_spark_base_interactive
build_dotnet_spark_interactive

trap finish EXIT ERR
@@ -113,6 +119,21 @@ replace_text_in_file() {
sh -c 'sed -i.bak "s/$1/$2/g" "$3" && rm "$3.bak"' _ "${search_string}" "${replacement_string}" "${filename}"
}

#######################################
# Sets the microsoft-spark JAR name based on the Apache Spark version
#######################################
set_dotnet_spark_jar() {
local scala_version="2.11"
local short_spark_version="${apache_spark_short_version//./-}"

case "${apache_spark_version:0:1}" in
2) scala_version=2.11 ;;
3) scala_version=2.12 ;;
esac

dotnet_spark_jar="microsoft-spark-${short_spark_version}_${scala_version}-${dotnet_spark_version}.jar"
}
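
As an illustration only (not part of the script): with the defaults above, apache_spark_short_version is "3.0", so the function resolves to the following values:

```bash
# apache_spark_version=3.0.1, dotnet_spark_version=1.0.0
#   short_spark_version="3-0"   (dots replaced by dashes)
#   scala_version="2.12"        (Spark 3.x)
#   dotnet_spark_jar="microsoft-spark-3-0_2.12-1.0.0.jar"
```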

#######################################
# Runs the docker build command with the related build arguments
# Arguments:
@@ -122,7 +143,10 @@ replace_text_in_file() {
#######################################
build_image() {
local image_name="${1}"
local build_args="--build-arg dotnet_core_version=${dotnet_core_version} --build-arg dotnet_spark_version=${dotnet_spark_version} --build-arg SPARK_VERSION=${apache_spark_version}"
local build_args="--build-arg dotnet_core_version=${dotnet_core_version}
--build-arg dotnet_spark_version=${dotnet_spark_version}
--build-arg SPARK_VERSION=${apache_spark_version}
--build-arg DOTNET_SPARK_JAR=${dotnet_spark_jar}"
local cmd="docker build ${build_args} -t ${image_name} ."

echo "Building ${image_name}"
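
For reference only (not part of the script): with the defaults above, the generated command is roughly equivalent to the following, where <image_name> stands for the tag passed in by the calling stage:

```bash
docker build \
    --build-arg dotnet_core_version=3.1 \
    --build-arg dotnet_spark_version=1.0.0 \
    --build-arg SPARK_VERSION=3.0.1 \
    --build-arg DOTNET_SPARK_JAR=microsoft-spark-3-0_2.12-1.0.0.jar \
    -t <image_name> .
```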
@@ -144,43 +168,44 @@ build_dotnet_interactive() {
}

#######################################
# Use the Dockerfile in the sub-folder dotnet-spark to build the image of the second stage
# Use the Dockerfile in the sub-folder dotnet-spark-base to build the image of the second stage
# The image contains the specified .NET for Apache Spark version plus the HelloSpark example
# for the correct TargetFramework and Microsoft.Spark package version
# Result:
# A dotnet-interactive-spark-base docker image tagged with the .NET for Apache Spark version
# A dotnet-spark-base-interactive docker image tagged with the .NET for Apache Spark version
#######################################
build_dotnet_spark_interactive_base() {
local image_name="dotnet-spark-interactive-base:${dotnet_spark_version}"
build_dotnet_spark_base_interactive() {
local image_name="dotnet-spark-base-interactive:${dotnet_spark_version}"

cd dotnet-spark
cd dotnet-spark-base
cp --recursive templates/HelloSpark ./HelloSpark

replace_text_in_file HelloSpark/HelloSpark.csproj "<TargetFramework><\/TargetFramework>" "<TargetFramework>netcoreapp${dotnet_core_version}<\/TargetFramework>"
replace_text_in_file HelloSpark/HelloSpark.csproj "PackageReference Include=\"Microsoft.Spark\" Version=\"\"" "PackageReference Include=\"Microsoft.Spark\" Version=\"${dotnet_spark_version}\""

replace_text_in_file HelloSpark/README.txt "netcoreappX.X" "netcoreapp${dotnet_core_version}"
replace_text_in_file HelloSpark/README.txt "spark-X.X.X" "spark-${apache_spark_short_version}.x"
replace_text_in_file HelloSpark/README.txt "spark-${apache_spark_short_version}.x-X.X.X.jar" "spark-${apache_spark_short_version}.x-${dotnet_spark_version}.jar"
replace_text_in_file HelloSpark/README.txt "microsoft-spark-${apache_spark_short_version}.x-X.X.X.jar" "${dotnet_spark_jar}"

build_image "${image_name}"
cd ~-

}

#######################################
# Use the Dockerfile in the sub-folder apache-spark to build the image of the last stage
# Use the Dockerfile in the sub-folder dotnet-spark to build the image of the last stage
# The image contains the specified Apache Spark version
# Result:
# A dotnet-spark docker image tagged with the Apache Spark version, .NET for Apache Spark version and the suffix -interactive
# A dotnet-spark docker image tagged with the .NET for Apache Spark version, Apache Spark version and the suffix -interactive
#######################################
build_dotnet_spark_interactive() {
local image_name="${image_repository}/dotnet-spark:${apache_spark_version}-${dotnet_spark_version}-interactive"
local image_name="${image_repository}/dotnet-spark:${dotnet_spark_version}-${apache_spark_version}-interactive"

cd apache-spark
cd dotnet-spark
cp --recursive templates/scripts ./bin

replace_text_in_file bin/start-spark-debug.sh "microsoft-spark-X.X.X" "microsoft-spark-${apache_spark_short_version}.x"
replace_text_in_file bin/start-spark-debug.sh "microsoft-spark-X.X.X.jar" "${dotnet_spark_jar}"

replace_text_in_file 02-basic-example.ipynb "nuget: Microsoft.Spark,X.X.X" "nuget: Microsoft.Spark,${dotnet_spark_version}"

build_image "${image_name}"
cd ~-
@@ -191,10 +216,10 @@
#######################################
cleanup()
{
cd apache-spark
cd dotnet-spark
rm --recursive --force bin
cd ~-
cd dotnet-spark
cd dotnet-spark-base
rm --recursive --force HelloSpark
cd ~-
}
31 changes: 15 additions & 16 deletions docker/images/interactive/dotnet-interactive/Dockerfile
@@ -1,18 +1,11 @@
FROM jupyter/base-notebook:ubuntu-18.04

ARG NB_USER=jovyan
ARG NB_UID=1000
ARG DOTNET_CORE_VERSION=3.1
ARG DEBIAN_FRONTEND=noninteractive

ENV DOTNET_CORE_VERSION=$DOTNET_CORE_VERSION
ENV USER ${NB_USER}
ENV NB_UID ${NB_UID}
ENV HOME /home/${NB_USER}
ENV PATH="${PATH}:${HOME}/.dotnet/tools"

ENV \
DOTNET_RUNNING_IN_CONTAINER=true \
ENV DOTNET_RUNNING_IN_CONTAINER=true \
DOTNET_USE_POLLING_FILE_WATCHER=true \
NUGET_XMLDOC_MODE=skip \
DOTNET_TRY_CLI_TELEMETRY_OPTOUT=true
Member: Since this is being proposed as part of the .NET project, telemetry should remain enabled.

@@ -21,24 +14,30 @@ USER root

RUN apt-get update \
&& apt-get install -y --no-install-recommends \
dialog apt-utils wget ca-certificates openjdk-8-jdk bash software-properties-common supervisor unzip socat net-tools vim \
libc6 libgcc1 libgssapi-krb5-2 libicu60 libssl1.1 libstdc++6 zlib1g \
apt-utils \
Member: What is requiring all of these native dependencies? Several are already provided by the base image so they don't seem necessary to declare.

Author: This should be cleaned up now. Java obviously is required by spark.

bash \
dialog \
libc6 \
libgcc1 \
libgssapi-krb5-2 \
libicu60 \
libssl1.1 \
libstdc++6 zlib1g \
Member: Multiple packages listed together, should get split apart so that zlib1g is not overlooked.

Author: Should be in a separate line as well now.

openjdk-8-jdk \
software-properties-common \
unzip \
&& wget -q --show-progress --progress=bar:force:noscroll https://packages.microsoft.com/config/ubuntu/18.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb \
Member: Consider: I've typically seen Dockerfiles avoid using --show-progress as it does have a perf impact.

Author: That raises an interesting point about the purpose of the Dockerfile(s). As far as I am aware, the focus at the moment is to enable a user to build the image(s) themselves, rather than automating the image build process. That's why I thought it would be useful to show the download progress. For small downloads it doesn't really matter, though, so I have removed it. However, since the Spark download can take a while, I have added the following line to the dotnet-spark/Dockerfile:

&& echo "\nDownloading spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz ..." \

Does that make sense?

&& dpkg -i packages-microsoft-prod.deb \
&& add-apt-repository universe \
&& apt-get install -y apt-transport-https \
&& apt-get update \
&& apt-get install -y dotnet-sdk-$DOTNET_CORE_VERSION \
&& apt-get autoremove -y --purge \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean && rm -rf /var/lib/apt/lists/* \
&& rm -rf packages-microsoft-prod.deb

COPY ./nuget.config ${HOME}/nuget.config

RUN chown -R ${NB_UID} ${HOME}

USER ${USER}
USER ${NB_USER}

RUN pip install nteract_on_jupyter \
&& dotnet tool install -g Microsoft.dotnet-interactive \