This is a workflow I developed that allows me to spin up a more powerful machine when working with data. By using Docker and Git/GitHub, I can easily move between my development computer and an instance in the cloud.
Please note that there are two folders (NLP and General), each containing a Dockerfile and a docker-compose file. I included two different sets of files because there are some extra installation steps for NLP work that I found useful but did not want to include in every Docker image. If you are not doing NLP, just use the files from the General folder.
Before you can use this, there is some prep work you will have to do:
- Create an AWS account
- Create an IAM user/role that can access S3
- Create an AMI image
- Store your data in S3 (a quick way to check access is sketched below)
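A quick way to confirm the IAM credentials can actually reach the bucket is a short boto3 check like the sketch below; the bucket name is a placeholder for whatever you created:

```python
import boto3

BUCKET = "my-data-bucket"  # placeholder -- use the bucket that holds your data

# boto3 reads credentials from environment variables, ~/.aws/credentials, or an
# attached IAM role, so nothing needs to be hard-coded here.
s3 = boto3.client("s3")

# Listing a few objects confirms the IAM user/role can read the bucket.
response = s3.list_objects_v2(Bucket=BUCKET, MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```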
I will not go into detail on how to create an AWS account, create an IAM role, or launch an EC2 instance, but I will outline creating the AMI image.
- Launch a t2.micro instance in EC2 using the Amazon Linux AMI. Select a security group that will allow you to access the Jupyter notebook (Jupyter's default port is 8888).
- SSH into the Instance.
- Perform updates (sudo yum update)
- Install Git (sudo yum install -y git)
- Install Docker (sudo yum install -y docker)
- Add the ec2-user to the docker group so you can run Docker without sudo (sudo usermod -a -G docker ec2-user)
- Install docker-compose, updating the version in the URL to the latest release (sudo curl -L https://github.com/docker/compose/releases/download/1.9.0/docker-compose-$(uname -s)-$(uname -m) | sudo tee /usr/local/bin/docker-compose > /dev/null)
- Make the binary executable (sudo chmod +x /usr/local/bin/docker-compose)
- Start the service (sudo service docker start)
- Configure Docker to start on boot (sudo chkconfig docker on)
- In the EC2 console, create an Image (Actions>>Image>>Create Image)
Now when you need to launch a new instance, you can build off of this image, which will already have Git, Docker, and docker-compose installed.
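If you would rather script the launch than click through the console, a boto3 sketch along these lines should work; the AMI ID, instance type, key pair, and security group are all placeholders for your own values:

```python
import boto3

ec2 = boto3.client("ec2")

# All identifiers below are placeholders -- use the AMI created above, your own
# key pair, and a security group that exposes the Jupyter port.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="m5.xlarge",
    KeyName="my-key-pair",
    SecurityGroupIds=["sg-0123456789abcdef0"],
    MinCount=1,
    MaxCount=1,
)
print("Launched", response["Instances"][0]["InstanceId"])
```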
- Create a Git repo on GitHub
- Clone the repo on your development computer
- Add the Dockerfile and docker-compose files
- Add any additional libraries to be installed in the Dockerfile
- Add the .env file with the project name and the AWS credentials (see the sketch after this list for how these values are picked up inside the container)
- Add the .pem file to the folder for easy access
- Commit the Dockerfile and docker-compose files to Git. DO NOT commit the .env file or the .pem file.
- Launch an EC2 instance using the AMI built previously
- Execute command: "git clone " to clone the repo
- Recreate the .env file on the instance ("touch .env", then "nano .env" to add the same values as before)
- Execute command "docker-compose up". Then docker will download the neccessary files and link to jupyter notebook will appear.
- Copy and paste the link into your browser, replacing "localhost" with the public IP address of the EC2 instance.
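The exact variable names depend on the docker-compose file you use, but assuming it forwards the .env values as the standard AWS environment variables, a quick check from inside a notebook looks something like this:

```python
import os
import boto3

# Assumption: docker-compose passes the .env values through as the standard AWS
# variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION).
# Check the docker-compose file for the names it actually uses.
print("Region:", os.environ.get("AWS_DEFAULT_REGION"))

# boto3 picks these variables up automatically, so no keys need to be
# hard-coded in a notebook. Listing buckets assumes the IAM user is allowed to.
s3 = boto3.client("s3")
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```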
Note: Any notebook created will be in a src folder.
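Once the notebook is up, data can be read straight from S3 with boto3. For example, to load a JSON-lines file into pandas: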
import boto3
import pandas as pd

client = boto3.client('s3')
obj = client.get_object(Bucket='bucket_name', Key='file_name')  # replace with your bucket and key
json_data = obj['Body'].read().decode('utf-8')  # raw JSON-lines text
data = pd.read_json(json_data, lines=True)  # one record per line
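And to load a CSV file: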
import io
import boto3
import pandas as pd
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket_name', Key='file_name')  # replace with your bucket and key
data = pd.read_csv(io.BytesIO(obj['Body'].read()))
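Because the EC2 instance is disposable, anything worth keeping should be written back to S3 before shutting it down. A minimal sketch, assuming a hypothetical local results.csv and the same placeholder bucket:

```python
import boto3

s3 = boto3.client('s3')

# Hypothetical local file and destination key -- adjust both to your project.
s3.upload_file('results.csv', 'bucket_name', 'results/results.csv')
```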