This repository has been archived by the owner on Apr 6, 2020. It is now read-only.

Commit

small typos, formatting
arifwider committed Dec 7, 2017
1 parent 2cd9ff9 commit a00ec6f
Showing 1 changed file with 11 additions and 11 deletions.
README.md: 22 changes (11 additions & 11 deletions)

This is the onboarding document for the TWDE Datalab. If you want to get involved […]


## Introduction
The purpose of this project is to help build a foundational knowledge pool around the fields of data science, machine learning, and intelligent empowerment for ThoughtWorkers in Germany. To do so, we've selected a competition from kaggle.com that, broadly speaking, compares to a realistic problem we would tackle for our clients. The specific problem is demand forecasting for an Ecuadorian grocery store company. For specifics, see [the Favorita Grocery Sales Forecasting Kaggle competition](https://www.kaggle.com/c/favorita-grocery-sales-forecasting)

## Data

We've been provided [4 years of purchasing history](https://www.kaggle.com/c/fav…) […]

Our workflow is divided into several jobs, which can be deployed one after another automatically on Amazon Web Services; each job downloads data from the latest output of the step before.
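
To illustrate how a job might find the most recent output of the step before, here is a minimal sketch using boto3; the bucket layout follows the `s3://twde-datalab/<job>/<timestamp>/` scheme described below, but the helper itself is a hypothetical illustration, not code from this repository.

```python
import boto3

# Hypothetical helper (not part of this repo): find the newest timestamped output
# prefix of a previous job, assuming the timestamps sort lexicographically.
def latest_output_prefix(bucket="twde-datalab", job="merger"):
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=bucket, Prefix=job + "/", Delimiter="/")
    prefixes = [p["Prefix"] for p in response.get("CommonPrefixes", [])]
    return max(prefixes) if prefixes else None

# Example: fetch the merged training table produced by the most recent merger run.
# prefix = latest_output_prefix()
# boto3.client("s3").download_file("twde-datalab", prefix + "bigTable.hdf", "bigTable.hdf")
```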

### Step 1: Denormalization (`src/merger.py`)
We denormalize the data for:
1. Consistent encoding of variables when we convert features from `{True, False, NaN}` to `{0, 1, -1}` (or else we might end up with True mapping to 1 or 0 inconsistently); see the encoding sketch just below this list
2. Machine learning algorithms, which typically prefer one input matrix
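
As a concrete example of point 1, a mapping along these lines (illustrative, not necessarily the exact code in `merger.py`; `onpromotion` is used here only as an example of such a feature) encodes a boolean column with missing values the same way everywhere:

```python
import numpy as np
import pandas as pd

# Illustrative encoding: True -> 1, False -> 0, NaN -> -1, applied consistently.
df = pd.DataFrame({"onpromotion": [True, False, np.nan, True]})
df["onpromotion"] = df["onpromotion"].map({True: 1, False: 0}).fillna(-1).astype(int)
print(df["onpromotion"].tolist())  # [1, 0, -1, 1]
```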

`src/merger.py`:
1. downloads raw data from `s3://twde-datalab/raw/`
2. joins files together based on columns they have in common
- one dataset maps date + item sales to store numbers, a second dataset maps store numbers to city, and a third dataset maps city to weather information
- by joining these datasets, we can associate each sale with the weather in that city on the day the items were sold
3. adds columns to the DataFrame which are extracted out of the other columns
- for example, deriving the day of the week (Mon, Tues, ...) from the date (`2015-08-10`); see the sketch after this list
4. uploads its (two file) output to `s3://twde-datalab/merger/<timestamp>/{bigTable.hdf,bigTestTable.hdf}`
- bigTable is the training data for our machine learning algorithms
- bigTestTable is the test data we are to predict sales for, now enriched with data like weather and prices
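
Here is the sketch referred to above: a condensed, hypothetical version of steps 2 and 3. File and column names are assumptions for illustration and are not the exact schema `merger.py` works with.

```python
import pandas as pd

# Assumed raw files and columns, for illustration only.
sales = pd.read_csv("train.csv")      # date, store_nbr, item_nbr, unit_sales, ...
stores = pd.read_csv("stores.csv")    # store_nbr, city, state, type, cluster
weather = pd.read_csv("weather.csv")  # date, city, temperature, ...

# Step 2: join the files on the columns they have in common
# (sales -> store metadata via store_nbr, then -> weather via city and date).
big_table = sales.merge(stores, on="store_nbr", how="left")
big_table = big_table.merge(weather, on=["city", "date"], how="left")

# Step 3: add columns extracted from existing ones, e.g. day of week from the date.
big_table["date"] = pd.to_datetime(big_table["date"])
big_table["day_of_week"] = big_table["date"].dt.dayofweek  # Mon=0 ... Sun=6

# Step 4: persist the merged table (written locally here; the real job uploads to S3).
big_table.to_hdf("bigTable.hdf", key="table", mode="w")
```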

### Step 2: Validation Preparation (`src/splitter.py`)
We split the data into training data and validation data each time we run the pipeline. Training data is used to make our model, and validation data is then compared to the model, as if we've been provided new data points. This prevents us from overfitting our model, and gives us a sanity check for whether we're improving the model or not.

Consider the following graphs; think of each trend line as a model of the data.

![image](https://user-images.githubusercontent.com/8107614/33661598-f91a92c6-da88-11e7-8a69-8c83fdf44ab1.png)

[…]
- **This is called overfitting.**
- It's tempting to overfit a model because of how well it describes the data we already have, but it's much better to have a generally-right-but-never-perfectly-right model than a right-all-the-time-but-only-for-the-data-we-already-have model.

If we don't randomly withhold some of the data from ourselves and then evaluate our model against that withheld data, we will inevitably overfit the model and lose our general predictivity.
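
To make the hold-out idea concrete, here is a minimal sketch of a random split; the 80/20 ratio, file names, and fixed random seed are illustrative assumptions, not necessarily what `src/splitter.py` does.

```python
import pandas as pd

# Illustrative random hold-out: withhold 20% of the merged table as validation data.
big_table = pd.read_hdf("bigTable.hdf")

validation = big_table.sample(frac=0.2, random_state=42)  # the withheld rows
train = big_table.drop(validation.index)                  # what the model is allowed to see

train.to_hdf("train.hdf", key="table", mode="w")
validation.to_hdf("validation.hdf", key="table", mode="w")
```

For time-ordered data like daily sales, holding out the most recent weeks instead of a purely random sample often mirrors the real prediction task more closely.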

### Step 3: Machine Learning Models (`src/decision_tree.py`)
Step 3 of the pipeline is to supply data to a machine learning algorithm (or several) and make predictions for the data asked of us in `test.csv`, as provided by the kaggle competition. See the [algorithms section](https://github.com/emilyagras/kaggle-favorita/blob/master/README.md#algorithms) below for more details on what we've implemented.

## Algorithms
We implement one machine learning model for the time being, which creates a model based on the training data, rates its own accuracy using the validation data, and creates predictions ready to be submitted to kaggle.com, all from the `train.csv` file that was provided through the competition.

A decision tree is one of the simplest algorithms to implement, which is why we've chosen it for our first approach. More complex variations of decision trees can be used to combat the downsides of a single tree; maybe you, dear reader, would like to implement one for us?
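
A minimal sketch of such a model using scikit-learn's `DecisionTreeRegressor`; the feature list, hyperparameters, error metric, and file names are illustrative assumptions rather than the exact choices in `src/decision_tree.py`.

```python
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Illustrative feature set; the real pipeline has many more engineered columns.
features = ["store_nbr", "item_nbr", "day_of_week", "onpromotion"]
target = "unit_sales"

train = pd.read_hdf("train.hdf")
validation = pd.read_hdf("validation.hdf")
test = pd.read_hdf("bigTestTable.hdf")

model = DecisionTreeRegressor(max_depth=10)  # depth chosen arbitrarily for the sketch
model.fit(train[features], train[target])

# Rate the model on the withheld validation data.
validation_error = mean_squared_error(validation[target], model.predict(validation[features]))
print("validation MSE:", round(validation_error, 3))

# Predict for the test rows and write a file ready to submit to kaggle.com.
test["unit_sales"] = model.predict(test[features])
test[["id", "unit_sales"]].to_csv("submission.csv", index=False)
```

Ensembles of trees, such as random forests or gradient-boosted trees, are the usual next step when a single tree starts to overfit.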

[…]
