Our workflow is divided into several jobs, which run one after another automatically when called by `sh run_decisiontree_pipeline.sh`; each job uses data from the latest output of the step before it. The workflow looks like this:
We denormalize the data because machine learning algorithms typically expect a single input matrix. Denormalization turns n > 1 tables into 1 table. This one table is not how you typically store data in a database -- we're undoing the normalization that allows relational databases to be efficient.
Suppose there is a table called `sales`, a table called `items`, and a table called `stores`, which we combine into one table that contains the same data, but less efficiently.
The `sales` table:

Transaction Id | Item | Store Name |
---|---|---|
1 | Cheese | Zimmerstrasse Store |
2 | Cabbage | Erich-Fromm Platz Store |
3 | Carrots | Zimmerstrasse Store |

The `items` table:

Item Name | Category |
---|---|
Cheese | Dairy |
Cabbage | Produce |
Carrots | Produce |

The `stores` table:

Store Name | City |
---|---|
Erich-Fromm Platz Store | Frankfurt |
Zimmerstrasse Store | Berlin |

The combined, denormalized table:

Transaction Id | Item | Store Name | City | Category |
---|---|---|---|---|
1 | Cheese | Zimmerstrasse Store | Berlin | Dairy |
2 | Cabbage | Erich-Fromm Platz Store | Frankfurt | Produce |
3 | Carrots | Zimmerstrasse Store | Berlin | Produce |
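As a quick illustration, the same join can be sketched with pandas merges. The DataFrames below are toy versions of the tables above, not the real dataset, and the column names are chosen just for the example:

```python
import pandas as pd

# Toy versions of the three normalized tables above.
sales = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "item": ["Cheese", "Cabbage", "Carrots"],
    "store_name": ["Zimmerstrasse Store", "Erich-Fromm Platz Store", "Zimmerstrasse Store"],
})
items = pd.DataFrame({
    "item": ["Cheese", "Cabbage", "Carrots"],
    "category": ["Dairy", "Produce", "Produce"],
})
stores = pd.DataFrame({
    "store_name": ["Erich-Fromm Platz Store", "Zimmerstrasse Store"],
    "city": ["Frankfurt", "Berlin"],
})

# Denormalize: join on the columns the tables have in common.
big_table = sales.merge(items, on="item").merge(stores, on="store_name")
print(big_table)
```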
Now we have a table that's ready to be analyzed. In our pipeline, this merging is handled by `src/merger.py`, which:
- downloads raw data from `s3://twde-datalab/raw/`, or loads it from your local hard drive
- joins the files together based on columns they have in common
- adds columns to the DataFrame which are extracted out of the other columns
  - for example, extrapolating from dates (`2015-08-10`) to the day of the week (Mon, Tues, ...); see the sketch below
- saves its output to `merger/bigTable.csv`
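Here is a minimal sketch of that date-based feature, assuming a pandas DataFrame with a `date` column; it is an illustration, not the exact code in `src/merger.py`:

```python
import pandas as pd

# Hypothetical frame with a date column like the one in the raw data.
df = pd.DataFrame({"date": ["2015-08-10", "2015-08-11", "2015-08-12"]})

# Parse the strings into datetimes, then derive a day-of-the-week column.
df["date"] = pd.to_datetime(df["date"])
df["day_of_week"] = df["date"].dt.day_name()  # "Monday", "Tuesday", ...
print(df)
```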
We split the data into training data and validation data each time we run the pipeline. Training data is used to build our model, and validation data is then compared against the model's predictions, as if we had been given new data points. This prevents us from overfitting our model, and gives us a sanity check for whether we're improving the model or not.
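A minimal sketch of such a split using scikit-learn's `train_test_split`; the input path and the 80/20 ratio are assumptions for illustration, and the actual splitter script may choose its validation rows differently (for example, by date):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical input: the wide table produced by the merge step.
big_table = pd.read_csv("merger/bigTable.csv")

# Randomly withhold 20% of the rows as validation data.
train, validation = train_test_split(big_table, test_size=0.2, random_state=42)
print(len(train), "training rows,", len(validation), "validation rows")
```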
Consider the following graphs; think of each trend line as a model of the data.
- The first model, the linear model on the left, fails to capture the convex shape of the data. It won't be predictive for new data points.
- This is called underfitting.
- The second model captures the general trend of the data and is likely to continue generally describing the data even as new data points are provided.
- This model is neither underfit nor overfit; it's just right.
- The third model, the polynomial trend line all the way on the right, describes the data we have perfectly, but it's unlikely to be accurate for data points further along the x-axis if it were ever given new data.
- This is called overfitting.
- It's tempting to overfit a model because of how well it describes the data we already have, but it's much better to have a generally-right-but-never-perfectly-right model than a right-all-the-time-but-only-for-the-data-we-already-have model.
If we don't randomly withhold some of the data from ourselves and then evaluate our model against that withheld data, we will inevitably overfit the model and lose its general predictive power.
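The effect of those three trend lines can be reproduced numerically. The sketch below uses made-up convex data and polynomial fits of increasing degree; it shows that the most flexible model always wins on the data it was fit to, which is exactly why we need withheld data to judge it:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Made-up convex data with a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 0.5 * x**2 + rng.normal(scale=2.0, size=x.size)

# Too simple (underfit), about right, and far too flexible (overfit).
for degree in (1, 2, 15):
    model = Polynomial.fit(x, y, degree)
    error_on_seen_data = np.mean((y - model(x)) ** 2)
    print(f"degree {degree:>2}: error on the data we already have = {error_on_seen_data:.2f}")

# The degree-15 fit scores best here, but that says nothing about how it
# would do on new data points further along the x-axis.
```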
Step 3 of the pipeline is to supply data to a machine learning algorithm (or several) and make predictions by generalizing the data with a model. We provide our decision tree with the `train` data from `splitter`, and the algorithm learns to map the values of the various columns in the data to the `unit_sales` column. This mapping is the model. Then we use the model to make predictions on the `validation` data that `splitter` created for us, and see how well the model performed.
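A minimal sketch of this step with scikit-learn's `DecisionTreeRegressor`. The file paths, the assumption that all feature columns are already numeric, and the choice of mean absolute error are illustrative; the pipeline's own scripts may differ:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical outputs of the splitter step.
train = pd.read_csv("splitter/train.csv")
validation = pd.read_csv("splitter/validation.csv")

# Learn a mapping from the other columns (assumed numeric here) to unit_sales.
features = [column for column in train.columns if column != "unit_sales"]
model = DecisionTreeRegressor(random_state=42)
model.fit(train[features], train["unit_sales"])

# Predict on the withheld validation rows and see how far off we are.
predictions = model.predict(validation[features])
mean_absolute_error = (validation["unit_sales"] - predictions).abs().mean()
print(f"Mean absolute error on validation data: {mean_absolute_error:.2f}")
```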