Authors: Elliott Ribner, Mohammed Salama, Zoe Pan
This project was carried out as part of the requirements of the MDS-UBC program.
Predicting customer default payments is an important problem in bank risk management, and accurate predictive tools are needed to mitigate losses associated with credit risk. In this project, we use a logistic regression model and related machine learning techniques to predict customer default payments and thereby answer our research question: which features strongly predict default payment?
In this data analysis project, we use a data set that is publicly available from the UCI Machine Learning Repository (Irvine, CA: University of California, School of Information and Computer Science) to find which features predict default payment. The data, while hosted by UCI, were originally made public by Chung Hua University and Tamkang University in Taiwan. The data were collected in 2005 from the Taiwanese market and contain 30,000 instances in total, with 23 features covering customers' payments, bill histories, and demographic factors.
Because the classes in the response variable are imbalanced, we used SMOTE (Synthetic Minority Oversampling Technique) to create a balanced data set for fitting the model, and `RobustScaler` to scale the predictors. We then used a logistic regression model (`LogisticRegression`) together with `RFE` (recursive feature elimination) to prune and select the features that predict default payment. This narrowed the features from 23 down to 7: Amount of the given credit (NT dollar), EDUCATION, MARRIAGE, AGE, Past monthly repayment status in September 2005, Past monthly repayment status in September 2005, Amount of previous payment (NT dollar) in September 2005. Compared with a baseline model that uses all features, the reduced model improved prediction accuracy, overall test accuracy, and precision, while AUC was similar.
There are two ways to replicate the analysis in this project: with Docker, or without Docker by installing the dependencies locally.
Install Docker and then download/clone this repository. Next, use the command line to navigate to the root of this downloaded/cloned repo and type the following in the shell:
docker run --rm -v /$(pwd):/home/credit-analysis eribner201/credit-analysis make -C /home/credit-analysis all
To reset the repo to a clean state, with no intermediate or results files, run the following command at the command line/terminal from the root directory of this project:
docker run --rm -v /$(pwd):/home/credit-analysis eribner201/credit-analysis make -C /home/credit-analysis clean
Alternatively, without Docker, clone this GitHub repository, install the dependencies listed below, and run the following commands at the command line/terminal from the root directory of this project:
make all
To reset the repo to a clean state, with no intermediate or results files, run the following command at the command line/terminal from the root directory of this project:
make clean
- Python 3.7.3 and Python packages:
- docopt=0.6.2
- pandas=0.25.3
- sklearn=0.22
- altair=3.2.0
- numpy=1.17.4
- imblearn=0.6.1
- seaborn=0.9.0
- matplotlib=3.1.1
- R version 3.6.1 and R packages:
- tidyverse=1.2.1
- testthat=2.3.1
- docopt=0.6.1
- Jupyter Notebook extensions (needed to render variable values in notebook or HTML format):
- jupyter_contrib_nbextensions=0.5.1
- GNU make 4.2.1
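Before running `make all` locally, it can help to confirm that the installed Python packages match the versions listed above. The following is a small sketch, not part of the project: the `PINNED` mapping and the helpers `check_versions` and `matches` are hypothetical names, and the distribution names `scikit-learn` and `imbalanced-learn` are used in place of the import names `sklearn` and `imblearn`. On Python 3.7 it assumes the `importlib_metadata` backport is installed.

```python
try:
    from importlib.metadata import PackageNotFoundError, version
except ImportError:  # Python 3.7: fall back to the importlib_metadata backport
    from importlib_metadata import PackageNotFoundError, version

# Versions copied from the dependency list; keys are distribution names.
PINNED = {
    "docopt": "0.6.2",
    "pandas": "0.25.3",
    "scikit-learn": "0.22",
    "altair": "3.2.0",
    "numpy": "1.17.4",
    "imbalanced-learn": "0.6.1",
    "seaborn": "0.9.0",
    "matplotlib": "3.1.1",
    "jupyter_contrib_nbextensions": "0.5.1",
}


def check_versions(pinned):
    """Return {name: (wanted, found)}; found is None if not installed."""
    report = {}
    for name, wanted in pinned.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        report[name] = (wanted, found)
    return report


def matches(wanted, found):
    """Accept an exact match or a patch release, e.g. 0.22 matches 0.22.1."""
    return found is not None and (found == wanted or found.startswith(wanted + "."))


for name, (wanted, found) in check_versions(PINNED).items():
    status = "ok" if matches(wanted, found) else "MISMATCH"
    print(f"{name:32s} wanted {wanted:8s} found {found or 'not installed':15s} {status}")
```

A mismatch does not necessarily break the analysis, but it is the first thing to check if results differ from those produced with the Docker image.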
Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml.
Yeh, I. C., and C. H. Lien. 2009. “The Comparisons of Data Mining Techniques for the Predictive Accuracy of Probability of Default of Credit Card Clients.” Expert Systems with Applications 36 (2): 2473-2480.