Our workflow is divided into several jobs, which run one after another automatically when called by `sh run_decisiontree_pipeline.sh`; each job uses data from the latest output of the step before it. The workflow looks like this:
We denormalize the data because machine learning algorithms typically expect a single input matrix. Denormalization turns n > 1 tables into 1 table. This one table is not how you typically store data in a database -- we're undoing the normalization that allows relational databases to be efficient.
Suppose there is a table called `sales`, a table called `items`, and a table called `stores`, which we combine into one table that contains the same data, but less efficiently.
The `sales` table:

Transaction Id | Item | Store Name |
---|---|---|
1 | Cheese | Zimmerstrasse Store |
2 | Cabbage | Erich-Fromm Platz Store |
3 | Carrots | Zimmerstrasse Store |

The `items` table:

Item Name | Category |
---|---|
Cheese | Dairy |
Cabbage | Produce |
Carrots | Produce |

The `stores` table:

Store Name | City |
---|---|
Erich-Fromm Platz Store | Frankfurt |
Zimmerstrasse Store | Berlin |

The combined, denormalized table:

Transaction Id | Item | Store Name | City | Category |
---|---|---|---|---|
1 | Cheese | Zimmerstrasse Store | Berlin | Dairy |
2 | Cabbage | Erich-Fromm Platz Store | Frankfurt | Produce |
3 | Carrots | Zimmerstrasse Store | Berlin | Produce |
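As a quick illustration, the same join can be sketched with pandas merges. The DataFrames below are toy versions of the tables above, not the real dataset, and the column names are chosen just for the example:

```python
import pandas as pd

# Toy versions of the three normalized tables above.
sales = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "item": ["Cheese", "Cabbage", "Carrots"],
    "store_name": ["Zimmerstrasse Store", "Erich-Fromm Platz Store", "Zimmerstrasse Store"],
})
items = pd.DataFrame({
    "item": ["Cheese", "Cabbage", "Carrots"],
    "category": ["Dairy", "Produce", "Produce"],
})
stores = pd.DataFrame({
    "store_name": ["Erich-Fromm Platz Store", "Zimmerstrasse Store"],
    "city": ["Frankfurt", "Berlin"],
})

# Denormalize: join on the columns the tables have in common.
big_table = sales.merge(items, on="item").merge(stores, on="store_name")
print(big_table)
```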
Now we have a table that's ready to be analyzed. In our pipeline, this merging is handled by `src/merger.py`, which:
- downloads raw data from `s3://twde-datalab/raw/`, or loads it from your local hard drive
- joins the files together based on columns they have in common
- adds columns to the DataFrame which are extracted out of the other columns
  - for example, extrapolating from dates (`2015-08-10`) to the day of the week (Mon, Tues, ...); see the sketch below
- saves its output to `merger/bigTable.csv`
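Here is a minimal sketch of that date-based feature, assuming a pandas DataFrame with a `date` column; it is an illustration, not the exact code in `src/merger.py`:

```python
import pandas as pd

# Hypothetical frame with a date column like the one in the raw data.
df = pd.DataFrame({"date": ["2015-08-10", "2015-08-11", "2015-08-12"]})

# Parse the strings into datetimes, then derive a day-of-the-week column.
df["date"] = pd.to_datetime(df["date"])
df["day_of_week"] = df["date"].dt.day_name()  # "Monday", "Tuesday", ...
print(df)
```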
We split the data into training data and validation data each time we run the pipeline. Training data is used to build our model, and validation data is then compared against the model's predictions, as if we had been given new data points. This prevents us from overfitting our model, and gives us a sanity check for whether we're improving the model or not.
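A minimal sketch of such a split using scikit-learn's `train_test_split`; the input path and the 80/20 ratio are assumptions for illustration, and the actual splitter script may choose its validation rows differently (for example, by date):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical input: the wide table produced by the merge step.
big_table = pd.read_csv("merger/bigTable.csv")

# Randomly withhold 20% of the rows as validation data.
train, validation = train_test_split(big_table, test_size=0.2, random_state=42)
print(len(train), "training rows,", len(validation), "validation rows")
```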
Consider the following graphs; think of each trend line as a model of the data.
- The first model, the linear model on the left, fails to capture the convex shape of the data. It won't be predictive for new data points.
- This is called underfitting.
- The second model captures the general trend of the data and is likely to continue generally describing the data even as new data points are provided.
- This model is neither underfit nor overfit; it's just right.
- The third model, the polynomial trend line all the way on the right, describes the data we have perfectly, but it's unlikely to be accurate for data points further along the x-axis if it were ever given new data.
- This is called overfitting.
- It's tempting to overfit a model because of how well it describes the data we already have, but it's much better to have a generally-right-but-never-perfectly-right model than a right-all-the-time-but-only-for-the-data-we-already-have model.
If we don't randomly withhold some of the data from ourselves and then evaluate our model against that withheld data, we will inevitably overfit the model and lose its general predictive power.
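The effect of those three trend lines can be reproduced numerically. The sketch below uses made-up convex data and polynomial fits of increasing degree; it shows that the most flexible model always wins on the data it was fit to, which is exactly why we need withheld data to judge it:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Made-up convex data with a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 0.5 * x**2 + rng.normal(scale=2.0, size=x.size)

# Too simple (underfit), about right, and far too flexible (overfit).
for degree in (1, 2, 15):
    model = Polynomial.fit(x, y, degree)
    error_on_seen_data = np.mean((y - model(x)) ** 2)
    print(f"degree {degree:>2}: error on the data we already have = {error_on_seen_data:.2f}")

# The degree-15 fit scores best here, but that says nothing about how it
# would do on new data points further along the x-axis.
```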
Step 3 of the pipeline is to supply data to a machine learning algorithm (or several) and make predictions by generalizing the data with a model. We provide our decision tree with the `train` data from `splitter`, and the algorithm learns to map the values of the various columns in the data to the `unit_sales` column. This mapping is the model. Then we use the model to make predictions on the `validation` data that `splitter` created for us, and see how well the model performed.
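A minimal sketch of this step with scikit-learn's `DecisionTreeRegressor`. The file paths, the assumption that all feature columns are already numeric, and the choice of mean absolute error are illustrative; the pipeline's own scripts may differ:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical outputs of the splitter step.
train = pd.read_csv("splitter/train.csv")
validation = pd.read_csv("splitter/validation.csv")

# Learn a mapping from the other columns (assumed numeric here) to unit_sales.
features = [column for column in train.columns if column != "unit_sales"]
model = DecisionTreeRegressor(random_state=42)
model.fit(train[features], train["unit_sales"])

# Predict on the withheld validation rows and see how far off we are.
predictions = model.predict(validation[features])
mean_absolute_error = (validation["unit_sales"] - predictions).abs().mean()
print(f"Mean absolute error on validation data: {mean_absolute_error:.2f}")
```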