- Authors: Jarvis Nederlof, Roc Zhang, Jack Tan
We have built a regression model using a light gradient boosting model (LightGBM) to predict the number of minutes an NBA basketball player is expected to play in an upcoming game. Our final model performed well on an unseen test data set, achieving a mean squared error of 38.24 and a coefficient of determination of 0.65. Both metrics improve on our baseline for evaluation, a player's 5-game average minutes played, which scored 50.24 and 0.55, respectively.
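To make the comparison concrete, here is a minimal sketch of how the two metrics and a 5-game-average baseline could be computed. The function names and the toy minutes are illustrative assumptions, not the project's code or data. The baseline here averages the previous five games only, which is one plausible reading of "5-game average".

```python
from statistics import mean


def mean_squared_error(y_true, y_pred):
    """Average of squared differences between actual and predicted values."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rolling_average_baseline(minutes, window=5):
    """Predict each game's minutes as the mean of the previous `window` games.

    Returns (actual, predicted) for the games that have a full window of history.
    """
    actual, predicted = [], []
    for i in range(window, len(minutes)):
        actual.append(minutes[i])
        predicted.append(mean(minutes[i - window:i]))
    return actual, predicted


# Toy minutes for one hypothetical player (not taken from the data set).
minutes = [30, 32, 28, 35, 31, 33, 29, 36, 34, 30]
actual, predicted = rolling_average_baseline(minutes)
print("baseline MSE:", mean_squared_error(actual, predicted))
print("baseline R^2:", r_squared(actual, predicted))
```

The same two functions can score any model's predictions, so the model and the baseline are compared on identical terms.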
The data set used in this project is the NBA Enhanced Box Score and Standings (2012 - 2018) data set created by Paul Rossotti and hosted on Kaggle.com. It was sourced using APIs from xmlstats. A copy of this data set is hosted on a separate remote repository located here to allow easy download without authenticating a Kaggle account. The particular data file used can be accessed here. Each row in the data set represents a player's box score statistics for a particular game. The box score statistics are determined by statisticians working for the NBA. There were 151,493 data examples (rows).
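As a quick sanity check after downloading, the row count can be compared with the figure quoted above. This is a minimal sketch: the file path is taken from the make target in this README, the UTF-8 encoding is an assumption, and the count may differ if the quoted figure was measured after preprocessing.

```python
import csv
from pathlib import Path

EXPECTED_ROWS = 151_493  # figure quoted in this README


def count_rows(csv_path):
    """Count data rows (excluding the header line) in a CSV file."""
    # Encoding is assumed to be UTF-8; adjust if the file uses another one.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


if __name__ == "__main__":
    path = Path("data/2012-18_playerBoxScore.csv")  # assumed download location
    if path.exists():
        print(f"{count_rows(path)} rows (expected {EXPECTED_ROWS})")
    else:
        print(f"{path} not found; download the data first")
```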
The final report can be found here.
You can run this analysis a few different ways. Start by cloning/downloading this repository, and navigate to the root of the project using the command line.
To run the analysis using Docker type the following (fill <PATH_ON_YOUR_COMPUTER> with the absolute path to the root of this project on your computer):
> docker run --rm -v <PATH_ON_YOUR_COMPUTER>:/home/nba_minutes jnederlo/nba_minutes make -C '/home/nba_minutes' all
To clean up the analysis type:
> docker run --rm -v <PATH_ON_YOUR_COMPUTER>:/home/nba_minutes jnederlo/nba_minutes make -C '/home/nba_minutes' clean
The Docker container is hosted on Docker Hub and can be viewed here. The Dockerfile can be viewed here.
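If filling in the absolute path by hand is error-prone, the Docker commands above can be assembled programmatically. The sketch below is a hypothetical helper, not part of this repository. It reuses the image name and mount point from the commands above and resolves the project root to an absolute path.

```python
import subprocess
from pathlib import Path

IMAGE = "jnederlo/nba_minutes"        # image name from the commands above
CONTAINER_ROOT = "/home/nba_minutes"  # mount point from the commands above


def build_docker_command(project_root, target="all"):
    """Return the `docker run` argument list for a make target (e.g. all, clean)."""
    host_root = Path(project_root).resolve()  # the README asks for an absolute path
    return [
        "docker", "run", "--rm",
        "-v", f"{host_root}:{CONTAINER_ROOT}",
        IMAGE,
        "make", "-C", CONTAINER_ROOT, target,
    ]


def run_in_docker(project_root, target="all"):
    """Run the make target inside the container; requires Docker to be installed."""
    subprocess.run(build_docker_command(project_root, target), check=True)


if __name__ == "__main__":
    # Run from the root of the cloned repository; this only prints the command.
    print(" ".join(build_docker_command(Path.cwd(), "all")))
```

Calling `run_in_docker(Path.cwd(), "clean")` would then correspond to the clean-up command.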
Alternatively, you can use make commands from the root directory of this project to reproduce the analysis. The commands are as follows:
##### General commands #####
# Run the whole workflow
make all
# Clean all of the workflow outputs
make clean
##### Run the workflow one step at a time, in order #####
# Download the data and save to file
make data/2012-18_playerBoxScore.csv
# Wrangle and preprocess the data - generate features and save data to a file
make data/player_data_ready.csv
# Run the Exploratory Data Analysis (EDA) - save results in a file
make results/EDA-correl_df_neg_9.csv results/EDA-correl_df_pos_20.csv results/EDA-feat_corr.png results/EDA-hist_y.png
# Train the models and make predictions - generate figures for final report
make results/modelling-gbm_importance.png results/modelling-residual_plot.png results/modelling-score_table.csv
# Generate the final report
make report.pdf
You can view the Makefile here.
If running locally, and not with Docker, make sure you have the required dependencies installed.
- Python 3.7.5 and Python packages:
  - pandas==0.25.2
  - numpy==1.17.2
  - docopt==0.6.2
  - requests==2.20.0
  - tqdm==4.41.1
  - selenium==3.141.0
  - altair==4.0.1
  - scikit-learn==0.22.1
  - matplotlib==3.1.2
  - termcolor==1.1.0
  - jupyterlab==1.2.3
  - lightgbm==2.3.1
  - xgboost==0.90
- R version 3.6.1 and R packages:
  - tidyverse==1.2.1
  - docopt==0.6.2
- System requirements:
  - ChromeDriver==79.0.3945.36 (install with Homebrew: brew cask install chromedriver; click here for more information)
  - LaTeX (TeX Live 2019; click here for more information)
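Before running locally, it can help to confirm that the installed Python packages match the pins listed above. The following is a minimal sketch with hypothetical helper names. It checks only a subset of the pins, and on Python 3.7 it assumes the `importlib_metadata` backport is installed, since `importlib.metadata` joined the standard library in Python 3.8.

```python
try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:  # Python 3.7 needs the `importlib_metadata` backport
    from importlib_metadata import PackageNotFoundError, version

# A subset of the pins listed above; extend as needed.
PINNED = {
    "pandas": "0.25.2",
    "numpy": "1.17.2",
    "scikit-learn": "0.22.1",
    "lightgbm": "2.3.1",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin to (pinned, found)."""
    mismatches = {}
    for package, wanted in pinned.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {wanted}, found {found or 'not installed'}")
```

The `lookup` parameter exists so the comparison logic can be exercised without touching the real environment.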
The NBA Minutes Predictor materials here are licensed under the MIT License. If re-using or re-mixing, please provide attribution and a link to this repository.