Docker-Data-Science-Workflow

Overall

This is a workflow I developed that allows me to spin up a more powerful machine when working with data. By using Docker and Git/GitHub, I can easily move work between my development computer and an instance in the cloud.

Please note that there are two folders (NLP and General), each containing a Dockerfile and a docker-compose file. I included two sets of files because NLP work needs some extra installation steps that I found useful but did not want to bake into every Docker image. If you are not doing NLP work, just use the files from the General folder.
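For orientation, here is a minimal sketch of what a docker-compose file for this setup could look like. It is illustrative only; the real files are in the General and NLP folders, and the service name, port, and volume paths below are assumptions rather than the repository's actual configuration.

version: '2'
services:
  notebook:
    build: .                     # build the image from the Dockerfile in this folder
    ports:
      - "8888:8888"              # publish the Jupyter notebook port
    env_file:
      - .env                     # project name and AWS credentials (never committed)
    volumes:
      - ./src:/home/jovyan/src   # notebooks end up in the src folder on the host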

Steps to Set Up

Before you can use this, there is some prep work you will have to do:

  1. Create an AWS account
  2. Create an IAM user/role that can access S3
  3. Create an AMI image
  4. Store your data in S3 (a sketch of uploading a file with boto3 follows this list)
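
As promised above, a minimal sketch of uploading a local file to S3 with boto3. The file, bucket, and key names are placeholders; substitute your own, and make sure the IAM user/role from step 2 can write to the bucket.

import boto3

# upload a local CSV to the project's bucket
# 'my-data-bucket' and 'raw/file.csv' are placeholder names
s3 = boto3.client('s3')
s3.upload_file('file.csv', 'my-data-bucket', 'raw/file.csv')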

I will not go into detail on creating an AWS account or IAM role, or on launching an EC2 instance, but I will outline creating the AMI image.

Create AMI

  1. Launch a t2.micro instance in EC2 using the Amazon Linux AMI. Select a security group that allows access to the Jupyter notebook port.
  2. SSH into the instance.
  3. Perform updates (sudo yum update)
  4. Install git (sudo yum install -y git)
  5. Install Docker (sudo yum install -y docker)
  6. Add ec2-user to the docker group (sudo usermod -a -G docker ec2-user)
  7. Install docker-compose, substituting the latest release version in the URL (sudo curl -L https://github.com/docker/compose/releases/download/1.9.0/docker-compose-$(uname -s)-$(uname -m) | sudo tee /usr/local/bin/docker-compose > /dev/null)
  8. Make the binary executable (sudo chmod +x /usr/local/bin/docker-compose)
  9. Start the Docker service (sudo service docker start)
  10. Configure Docker to start on boot (sudo chkconfig docker on)
  11. In the EC2 console, create an image (Actions >> Image >> Create Image)

Now when you need to launch a new instance, you can build from this image, which will already have git, Docker, and docker-compose installed.
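
For convenience, here are the shell commands from steps 3 through 10 collected in one place, exactly as above (the docker-compose release is pinned to 1.9.0; substitute a newer version if you prefer):

sudo yum update
sudo yum install -y git
sudo yum install -y docker
sudo usermod -a -G docker ec2-user
sudo curl -L https://github.com/docker/compose/releases/download/1.9.0/docker-compose-$(uname -s)-$(uname -m) | sudo tee /usr/local/bin/docker-compose > /dev/null
sudo chmod +x /usr/local/bin/docker-compose
sudo service docker start
sudo chkconfig docker on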

Steps to Launch

  1. Create a Git repository on GitHub
  2. Clone the repo on your development computer
  3. Add the Dockerfile and docker-compose files
  4. Add any additional libraries to be installed to the Dockerfile
  5. Add a .env file containing the project name and the AWS credentials (see the sketch after this list)
  6. Add the .pem file to the folder for easy access
  7. Commit the Dockerfile and docker-compose files to Git. DO NOT commit the .env file or the .pem file.
  8. Launch an EC2 instance using the AMI built previously
  9. Run "git clone <repo URL>" to clone the repo onto the instance
  10. Recreate the .env file on the instance ("touch .env", then "nano .env")
  11. Run "docker-compose up". Docker will download the necessary images, and a link to the Jupyter notebook will appear in the output.
  12. Copy the link into your browser and replace "localhost" with the public IP address of the EC2 instance.

Note: any notebooks you create will be saved in the src folder.
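
A sketch of what the .env file (steps 5 and 10) might contain. The exact variable names depend on how the docker-compose file passes them through; AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION are the standard names boto3 looks for, and COMPOSE_PROJECT_NAME is the variable docker-compose reads for the project name. The values shown are placeholders.

COMPOSE_PROJECT_NAME=my_project
AWS_ACCESS_KEY_ID=your_access_key_id
AWS_SECRET_ACCESS_KEY=your_secret_access_key
AWS_DEFAULT_REGION=us-east-1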

Code to Access S3

JSON

import boto3
import pandas as pd

# 'bucket_name' and 'file_name' are placeholders for your bucket and key
client = boto3.client('s3')
obj = client.get_object(Bucket='bucket_name', Key='file_name')
json_data = obj['Body'].read().decode('utf-8')
# lines=True expects JSON Lines (one JSON object per line); drop it for a single JSON document
data = pd.read_json(json_data, lines=True)

CSV

import io
import boto3
import pandas as pd

# 'bucket_name' and 'file_name' are placeholders for your bucket and key
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket_name', Key='file_name')
data = pd.read_csv(io.BytesIO(obj['Body'].read()))
