TWDE Datalab (on AWS)

Getting started on AWS

We have explored several ways to deploy the code on AWS. Our first approach was to create Elastic MapReduce clusters, but since we settled on pandas instead of Spark, we no longer do much distributed computing. We now use AWS resources in two main ways: AWS Data Pipeline and Jupyter on EC2. We use the former to run our decision tree model on larger data sets and the latter to run the Prophet time series model.

IMPORTANT: The Git repository does not contain AWS credentials or any other means of accessing an AWS account, so please make sure you have access to one. If you want to use the TWDE Datalab's AWS account, reach out to the maintainers.

Data Pipeline

If you haven't done so already, install the AWS command line tools. If you are installing them now, don't forget to configure your credentials, too:

  1. pip install awscli
  2. aws configure (this will ask you for your credentials and store them in ~/.aws)
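
As an optional sanity check before deploying (a standard AWS CLI command, not part of this repository's scripts), you can confirm which account and identity your credentials resolve to:

  1. aws sts get-caller-identity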

Now run the deployment script from the deployment directory:

  1. cd deployment
  2. ./deploy-pipeline.sh -j all -n {name for the pipeline goes here}

This script will do the following:

  • create a shell script based on run_pipeline.sh
  • upload the shell script to S3
  • create an AWS data pipeline following pipeline-definition.json
  • start the pipeline
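
For orientation, here is a minimal sketch of those four steps; the bucket name and the templating placeholder are assumptions, not the repository's actual values, and the real logic lives in deploy-pipeline.sh:

  #!/bin/bash
  # Sketch only -- "my-datalab-bucket" and the {{JOB}} placeholder are
  # illustrative assumptions; see deploy-pipeline.sh for the real script.
  set -e
  JOB="$1"            # e.g. "all", as passed via -j
  PIPELINE_NAME="$2"  # as passed via -n

  # 1. Create a shell script based on run_pipeline.sh.
  sed "s/{{JOB}}/$JOB/" run_pipeline.sh > /tmp/run_pipeline.sh

  # 2. Upload the shell script to S3.
  aws s3 cp /tmp/run_pipeline.sh s3://my-datalab-bucket/scripts/run_pipeline.sh

  # 3. Create an AWS data pipeline following pipeline-definition.json.
  PIPELINE_ID=$(aws datapipeline create-pipeline \
      --name "$PIPELINE_NAME" --unique-id "$PIPELINE_NAME" \
      --query pipelineId --output text)
  aws datapipeline put-pipeline-definition \
      --pipeline-id "$PIPELINE_ID" \
      --pipeline-definition file://pipeline-definition.json

  # 4. Start the pipeline.
  aws datapipeline activate-pipeline --pipeline-id "$PIPELINE_ID"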

The output (and logs) are available via the AWS console. Unfortunately, we've run into some issues with large file sizes, which are documented in #25.
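
The same information is also reachable from the command line with standard AWS CLI commands (the pipeline id below is a placeholder):

  1. aws datapipeline list-pipelines (look up the pipeline id)
  2. aws datapipeline list-runs --pipeline-id {pipeline-id} (show the status of each component run)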

Getting started using Jupyter on EC2

Another, perhaps even simpler, way to exploit cloud computing is to install Anaconda on an AWS EC2 instance and set up Jupyter Notebooks on AWS.

For running our Prophet time series model, we published a ready-to-go AMI, tw_datalab_prophet_forecast_favorita, that already includes the relevant Jupyter notebooks. Just search for this image under 'Community AMIs' when launching an EC2 instance and make sure you open port 8888. Then ssh into your machine and start the Jupyter server:

  1. jupyter notebook --no-browser --port=8888
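
If you prefer the command line to the console for the steps above, opening the port and connecting might look like this; the security group id, key file, and login user are placeholders, and the user name depends on the AMI's base distribution:

  1. aws ec2 authorize-security-group-ingress --group-id {security-group-id} --protocol tcp --port 8888 --cidr {your-ip}/32
  2. ssh -i {your-key}.pem ubuntu@{public-ip-of-ec2-machine}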

Afterwards you should be able to open Jupyter in your browser at https://ec2-{public-ip-of-ec2-machine}.{my-region}.compute.amazonaws.com:8888 (in the hostname, the dots in the IP are replaced by dashes). When asked for a password, simply type 'datalab'.