- Authors: Derek Kruszewski, Yi Liu, Rob Blumberg, Carlina Kim
Data analysis project for Group 302 of DSCI 522 (Data Science Workflows), a Master of Data Science course at the University of British Columbia.
This project attempts to build a regression model to answer the research question:
Given a set of features related to racing horses, can we predict the outcome of a race?
The model produced predicts finish times with an R^2 score of 0.909.
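As a rough illustration of what that score measures, here is a minimal sketch of the R^2 calculation, assuming the reported figure is the usual coefficient of determination (1 minus the ratio of residual to total sum of squares). The function name and the finish times below are made up for illustration and are not taken from the project's scripts or data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviation from the mean of the targets.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical finish times in seconds (illustrative only).
    actual = [82.1, 83.4, 81.9, 84.0]
    predicted = [82.3, 83.1, 82.0, 83.8]
    print(round(r_squared(actual, predicted), 3))
```

A score of 1 means the predictions match the observed finish times exactly, while a score of 0 means the model does no better than always predicting the mean finish time.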
The dataset used to answer this question is the Hong Kong Horse Racing Dataset for Experts, publicly available through Kaggle (HorseBaby 2018). The data has been rehosted on GitHub for use with this project's scripts:
https://raw.githubusercontent.com/v5y8/horse_race_data/master
Please ensure the Makefile downloads the data from the GitHub repository above.
The final report can be found here.
There are two ways to replicate the analysis on your local machine.
Note: the instructions below depend on running the commands in a Unix shell (e.g., Terminal or Git Bash). If you are using Windows Command Prompt, replace /$(pwd) with PATH_ON_YOUR_COMPUTER.
- Install and run Docker.
- Clone this GitHub repository and run the following command at the command line/terminal from the root directory of this project:
docker run --rm -v /$(pwd):/home/DSCI_522_Group_302 v5y8/group_302_environment make -C /home/DSCI_522_Group_302 all
- To reset the repo to a clean slate, run the following command at the command line/terminal from the root directory of this project:
docker run --rm -v /$(pwd):/home/DSCI_522_Group_302 v5y8/group_302_environment make -C /home/DSCI_522_Group_302 clean
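For Windows Command Prompt users, the path substitution mentioned in the note above can be sketched as a small helper that assembles the same Docker command for an explicit host path. This is only a sketch: the function name `docker_make_command` and the example path are hypothetical, while the image name, container directory, and `make` targets are copied from the commands above.

```python
IMAGE = "v5y8/group_302_environment"
CONTAINER_DIR = "/home/DSCI_522_Group_302"


def docker_make_command(project_path, target="all"):
    """Return the docker invocation as an argument list.

    project_path plays the role of /$(pwd) (or PATH_ON_YOUR_COMPUTER);
    target is the make target, "all" or "clean".
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{project_path}:{CONTAINER_DIR}",  # mount the repo into the container
        IMAGE,
        "make", "-C", CONTAINER_DIR, target,
    ]


if __name__ == "__main__":
    # Hypothetical location of the cloned repository on a Windows machine.
    example_path = r"C:\Users\me\DSCI_522_Group_302"
    print(" ".join(docker_make_command(example_path, "all")))
```

The returned list can be passed to `subprocess.run` as-is, which avoids shell quoting issues if the path contains spaces.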
This method requires all dependencies listed below to be installed before running the analysis. Run the following command in the terminal at the root directory of this project (the script takes 15-20 minutes to fully execute):
make all
To reset this repository to a clean state, run the following command in the terminal at the root directory of this project:
make clean
The relationships between the scripts, data files and final outputs are summarised in the dependency diagram below.
Python 3.7.5 and Python Packages:
- pandas 0.25.3
- docopt 0.6.2
- numpy 1.17.4
- scikit-learn 0.22
- altair 3.2.0
- pandas-profiling 2.3.0
- matplotlib 3.1.1
- seaborn 0.9.0
- selenium 3.141.0
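Before running `make all` without Docker, it may help to confirm that the installed Python packages match the versions listed above. The following is a minimal sketch, not part of the project's scripts: the helper names are hypothetical, and it assumes `importlib.metadata`, which is in the standard library from Python 3.8 onward (on Python 3.7 the `importlib_metadata` backport provides the same functions).

```python
from importlib import metadata

# Versions copied from the list above.
PINNED = {
    "pandas": "0.25.3",
    "docopt": "0.6.2",
    "numpy": "1.17.4",
    "scikit-learn": "0.22",
    "altair": "3.2.0",
    "pandas-profiling": "2.3.0",
    "matplotlib": "3.1.1",
    "seaborn": "0.9.0",
    "selenium": "3.141.0",
}


def check_versions(pinned):
    """Map each package name to (expected version, installed version or None)."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        report[name] = (expected, installed)
    return report


def mismatches(report):
    """Keep only the packages whose installed version differs from the pin."""
    return {name: pair for name, pair in report.items() if pair[0] != pair[1]}


if __name__ == "__main__":
    for name, (expected, installed) in mismatches(check_versions(PINNED)).items():
        print(f"{name}: expected {expected}, found {installed}")
```

The comparison is an exact string match, so a newer but compatible release is still reported as a mismatch.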
R version 3.6.1 and R packages:
We welcome all contributions to this project! If you notice a bug, or have a feature request, please open up an issue here. If you'd like to contribute a feature or bug fix, you can fork our repo and submit a pull request. We will review pull requests within 7 days. All contributors must abide by our code of conduct.
HorseBaby. 2018. “Horse Racing Dataset for Experts (Hong Kong).” https://www.kaggle.com/hrosebaby/horse-racing-dataset-for-experts-hong-kong.