This repository has been archived by the owner on Sep 3, 2022. It is now read-only.

Getting Started

Graham Wheeler edited this page Apr 28, 2016 · 21 revisions

Using Datalab in Google Cloud

The easiest way to use Google Cloud Datalab is on Google Cloud Platform. Head over to the Datalab site and deploy an instance into a Cloud project. From there you can easily work with data in other cloud services such as BigQuery, and deploy your data pipelines for execution in the cloud.

Using Datalab locally

Datalab is built and packaged as a Docker container, so you will need Docker configured and running locally. If you're on Mac or Windows, the easiest way to get Docker is via the Docker Toolbox. Download and install it, then open the Kitematic app that it installs. Kitematic creates and starts a 'default' VM and starts the Docker server.

Add a port mapping so you can use localhost for Datalab:

VBoxManage modifyvm default --natpf1 "datalab,tcp,,8081,,8081"
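
The rule string passed to --natpf1 packs six comma-separated fields: rule name, protocol, host IP, host port, guest IP and guest port. The two IP fields are left empty in the command above. As a minimal sketch, a helper can assemble that string so the port numbers are easier to change; `natpf_rule` is a hypothetical name, not part of VirtualBox or Datalab.

```shell
# natpf_rule NAME PROTO HOST_PORT GUEST_PORT
# Hypothetical helper: prints a VirtualBox NAT port-forwarding rule with
# the host IP and guest IP fields left empty.
natpf_rule() {
  printf '%s,%s,,%s,,%s\n' "$1" "$2" "$3" "$4"
}

# Print the rule used in this guide.
natpf_rule datalab tcp 8081 8081
```

The printed string can then be passed on, for example: VBoxManage modifyvm default --natpf1 "$(natpf_rule datalab tcp 8081 8081)".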

Clone the Datalab repo, build it and run it:

git clone https://github.com/GoogleCloudPlatform/datalab.git
cd datalab
source ./tools/initenv.sh
rm -rf build/
cd sources/
./build.sh
cd ../containers/datalab
./build.sh
./run.sh

Then open your browser to http://localhost:8081.
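
The server may take a moment to start answering after run.sh launches the container. A minimal sketch for polling the URL until it responds, assuming curl is installed; `wait_for_http` is a hypothetical helper, and the number of attempts is an arbitrary choice.

```shell
# wait_for_http URL [ATTEMPTS]
# Hypothetical helper: tries the URL once per second and returns 0 as
# soon as an HTTP response below 400 arrives, or 1 after ATTEMPTS tries
# (default 30).
wait_for_http() (
  url=$1
  attempts=${2:-30}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl --silent --fail --output /dev/null --max-time 2 "$url"; then
      exit 0
    fi
    i=$((i + 1))
    sleep 1
  done
  exit 1
)

# Example (run it in a second terminal while run.sh is starting):
#   wait_for_http http://localhost:8081 && echo "Datalab is answering"
```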

Note that to use any Google Cloud functionality you will need to set the project ID. You can do this by calling:

set_project_id('myproject')

within a cell in your notebook, or by setting an environment variable before running Datalab (in which case the project will be used as the default for all notebooks):

PROJECT_ID='myproject' ./run.sh

Replace 'myproject' with an appropriate ID for your use.
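
A malformed project ID is easier to catch before starting Datalab than from inside a notebook. The sketch below tests a candidate against the Google Cloud project ID format (6 to 30 characters; lowercase letters, digits and hyphens; starting with a letter and not ending with a hyphen). `valid_project_id` is a hypothetical helper, and it checks the format only, not whether the project exists or whether you can access it.

```shell
# valid_project_id ID
# Hypothetical helper: succeeds if ID matches the project ID format.
valid_project_id() {
  printf '%s\n' "$1" | grep -Eq '^[a-z][a-z0-9-]{4,28}[a-z0-9]$'
}

# Try a few candidates; the last one is deliberately malformed.
for id in myproject my-datalab-project My_Project; do
  if valid_project_id "$id"; then
    echo "$id: format ok"
  else
    echo "$id: not a valid project ID format"
  fi
done
```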

Using Datalab on GCE

First, build the container and push it to a container registry (see the previous section for building, and look at the stage.sh script instead of the run.sh script for an example of how to push the container). Then create a containers.yaml file. In the example below, replace IMAGE with the path to the container on GCR, Docker Hub or wherever you publish your container.

apiVersion: v1
kind: Pod
metadata:
  name: datalab
spec:
  containers:
    - name: datalab
      image: IMAGE
      command: ['/datalab/run.sh']
      imagePullPolicy: IfNotPresent
      ports:
        - containerPort: 8080
          hostPort: 8080
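
If you publish the container under more than one path, a small generator saves editing the manifest by hand. This is a sketch only: `write_manifest` is a hypothetical helper that writes the pod definition above with IMAGE replaced by its first argument.

```shell
# write_manifest IMAGE [FILE]
# Hypothetical helper: writes the pod manifest to FILE (default
# containers.yaml) with the given image path substituted for IMAGE.
write_manifest() {
  cat > "${2:-containers.yaml}" <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: datalab
spec:
  containers:
    - name: datalab
      image: $1
      command: ['/datalab/run.sh']
      imagePullPolicy: IfNotPresent
      ports:
        - containerPort: 8080
          hostPort: 8080
EOF
}
```

For example, write_manifest gcr.io/my-datalab-project/datalab would write containers.yaml in the current directory; that image path is a placeholder, not a published image.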

Make sure to set the project to the one you want to use for Datalab:

PROJECT_ID=my-datalab-project  # Set an appropriate ID here
gcloud config set project $PROJECT_ID

You can then use the YAML file to create a running compute instance on GCE:

LABEL=my-datalab  # pick a name here
ZONE=us-central1-a  # pick a zone here
MACHINE_TYPE=f1-micro  # pick a machine type here

gcloud compute instances create $LABEL --image container-vm --metadata-from-file google-container-manifest=containers.yaml --zone $ZONE --machine-type $MACHINE_TYPE --scopes cloud-platform
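
An empty or misspelled shell variable expands to nothing, which leaves a flag such as --machine-type without its value. A minimal sketch of a dry run that checks the three values and prints the command instead of executing it; `print_create_cmd` is a hypothetical helper, and the flags are copied from the command in this section.

```shell
# print_create_cmd LABEL ZONE MACHINE_TYPE
# Hypothetical helper: prints the instance-creation command, or fails
# if any of the three values is empty.
print_create_cmd() {
  if [ -z "$1" ] || [ -z "$2" ] || [ -z "$3" ]; then
    echo "error: LABEL, ZONE and MACHINE_TYPE must all be non-empty" >&2
    return 1
  fi
  printf 'gcloud compute instances create %s --image container-vm --metadata-from-file google-container-manifest=containers.yaml --zone %s --machine-type %s --scopes cloud-platform\n' "$1" "$2" "$3"
}

LABEL=my-datalab
ZONE=us-central1-a
MACHINE_TYPE=f1-micro
print_create_cmd "$LABEL" "$ZONE" "$MACHINE_TYPE"
```

Once the printed command looks right, copy it into the shell to run it.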

Expose the port through ssh tunneling:

gcloud compute ssh --zone $ZONE --ssh-flag="-L" --ssh-flag="8082:localhost:8080" $LABEL

This will open a tunnel from localhost:8082 to the GCE instance. The tunnel will stay open until you close the SSH shell. If you have just created the instance, it first has to do a docker pull to get the container, which can take about four minutes on first run. You can check when Datalab is ready by running:

sudo docker ps

in the SSH shell. You should see one or two containers running; once there are two, Datalab is ready. You can now connect to http://localhost:8082 on your local machine to start using Datalab.
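
The two-container check can be scripted rather than read off by eye. A minimal sketch, with hypothetical helper names: `count_ids` counts the container IDs that `docker ps -q` prints one per line, and `datalab_ready` succeeds once there are at least two.

```shell
# Hypothetical helper: counts the non-empty lines on standard input.
count_ids() {
  awk 'NF { n++ } END { print n + 0 }'
}

# Hypothetical helper: succeeds when standard input lists two or more
# container IDs.
datalab_ready() {
  [ "$(count_ids)" -ge 2 ]
}

# Example (in the SSH shell on the instance):
#   sudo docker ps -q | datalab_ready && echo "Datalab is ready"
```

This only counts running containers; it does not confirm that the two containers are the ones Datalab needs.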

Note that on first execution you (i) will have to click through a EULA page and (ii) may have to enable the Resource Manager API for your project if it is not yet enabled. This is so we can get the project ID for the service account (and enable the new project listing APIs that are used for that).

IMPORTANT: Currently notebooks are stored in the container, and if you restart the GCE instance they will be lost! These instructions will be updated soon to cover persistent disk images. For now you have to back up your notebooks manually, either by downloading them from the web interface or by creating a git repo in the GCE image and pushing them to a remote.
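
As a further stopgap, the notebooks can be bundled into a timestamped archive before a restart. This is a sketch under assumptions: `backup_notebooks` is a hypothetical helper, and this page does not say where the notebooks live inside the container, so the source directory is an argument you have to supply.

```shell
# backup_notebooks SRC_DIR DEST_DIR
# Hypothetical helper: archives SRC_DIR into
# DEST_DIR/notebooks-<timestamp>.tar.gz and prints the archive path.
# The body runs in a subshell so its variables stay local.
backup_notebooks() (
  src=$1
  dest=$2
  stamp=$(date +%Y%m%d-%H%M%S)
  archive="$dest/notebooks-$stamp.tar.gz"
  mkdir -p "$dest" &&
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" &&
    echo "$archive"
)

# Example, with placeholder paths:
#   backup_notebooks /path/to/notebooks /tmp/datalab-backups
```

The archive is written on the instance itself, so it still has to be copied elsewhere to survive a restart.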
