This project predicts whether borrowers on an online peer-to-peer (P2P) lending platform will repay their loans on time and at lower interest rates. We examine factors related to the borrower, the loan, and the borrower's social profile to see how they affect loan performance. By analyzing this data, we can suggest ways for borrowers and lenders to increase their chances of successful lending and repayment.
- Introduction
- Data Collection
- Data Preprocessing
- EDA
- Feature Engineering
- Model Building
- Hyperparameter Tuning
- Models Evaluation
- Pipelining
- Deployment
We study the borrower-, loan-, and social-related determinants of performance in an online P2P lending market, modeling financial and social strength to predict whether borrowers could be funded at lower interest rates and whether lenders would be repaid on time.
This project consists of two parts:
- Binary classification prediction:
  - LoanStatus: whether the borrower will repay the loan or not.
- Multi-output regression prediction:
  - Preferred EMI: the ideal or desired amount a borrower would like to pay as their monthly installment towards a loan.
  - Preferred ROI: the expected or desired rate of return on an investment.
  - ELA: the maximum loan amount for each loan application based on the criteria set by the lending institution. It can be a significant feature when analyzing and modeling loan data, providing insight into loan approval decisions and potential borrowing capacity.
The data used was provided by Prosper, a P2P lending company.
The data was preprocessed by:
- Dropping features that are not important.
- Handling missing values: columns with less than 70% null values were imputed with the median (for numerical features) or the mode (for categorical features); columns with more nulls were dropped (see the sketch after this list).
- Handling outliers: values above the upper bound or below the lower bound were clipped to the nearest boundary value.
- Converting object-type features containing date and time values to datetime format so they can be used easily.
- Creating our four target variables and the features needed to calculate them.
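A minimal sketch of these preprocessing steps, assuming a pandas dataframe; the 1.5×IQR outlier bounds and the `ListingCreationDate` column are illustrative assumptions, not the exact project code:

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Drop columns with more than 70% nulls; impute the rest
    # (median for numeric features, mode for categorical ones).
    null_ratio = df.isnull().mean()
    df = df.drop(columns=null_ratio[null_ratio > 0.7].index)
    for col in df.columns[df.isnull().any()]:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])

    # Clip outliers to the upper/lower bounds (1.5 * IQR is an assumption).
    for col in df.select_dtypes(include=np.number).columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    # Convert object-typed date columns to datetime (column name assumed).
    if "ListingCreationDate" in df.columns:
        df["ListingCreationDate"] = pd.to_datetime(df["ListingCreationDate"])
    return df
```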
We used different plots to visualize the data and examine the relationships between features (example plots are sketched after this list):
- Univariate analysis: exploring each variable in the dataset separately.
- Multivariate analysis: evaluating multiple variables in the data (more than two) to identify any possible associations among them.
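For example, a hedged sketch of both kinds of plots; the file name and the `BorrowerAPR` column are assumptions based on the standard Prosper CSV export:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("prosperLoanData.csv")  # file name is an assumption

# Univariate analysis: the distribution of a single numeric feature.
sns.histplot(df["BorrowerAPR"].dropna(), kde=True)
plt.title("Distribution of BorrowerAPR")
plt.show()

# Multivariate analysis: a correlation heat map over the numeric features.
sns.heatmap(df.select_dtypes("number").corr(), cmap="coolwarm", center=0)
plt.title("Correlation between numeric features")
plt.show()
```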
- We dropped all the features used to calculate the four target variables, along with other unimportant features.
- We dropped the features that are highly correlated with our target variables; for each group of mutually correlated features, we kept one feature per heat map and dropped the rest.
- We created two new features based on existing ones to make the data easier to use.
- We used label encoding for multi-category features and binary encoding for features containing only true/false values (see the encoding sketch after this list).
- The four target variables were removed from the data and stored in separate variables.
- We checked for null values in the four target variables, found missing values in the 'EMI' target variable, and replaced them with the median value.
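A minimal sketch of the encoding step, assuming the dataframe from the preprocessing sketch above:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_features(df: pd.DataFrame) -> pd.DataFrame:
    # Binary encoding: map pure True/False columns to 1/0.
    for col in df.columns:
        vals = df[col].dropna()
        if len(vals) and vals.isin([True, False]).all():
            df[col] = df[col].astype(int)
    # Label encoding: map each category of a multi-category
    # feature to an integer code.
    for col in df.select_dtypes(include="object").columns:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    return df
```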
We used three approaches to obtain the relevant features for each target variable (the whole procedure is sketched after this list):
- MI classification, chi-squared, and Extra Trees classifier for the binary target variable:
  - We took the top 20 relevant features according to each approach.
  - Then we took the intersection of the three approaches' features.
- MI regression, f_regression (univariate selection), and Extra Trees regressor for the continuous target variables:
  - We took the top 20 relevant features according to each approach for each target variable.
  - Then, for each target variable, we took the union of the three approaches' relevant features.
  - Finally, we took the union of the resulting features across the three continuous target variables.
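A hedged sketch of this selection procedure; the function and variable names are illustrative, and shifting features by their minimum to satisfy chi-squared's non-negativity requirement is an assumption:

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor
from sklearn.feature_selection import (chi2, f_regression,
                                       mutual_info_classif,
                                       mutual_info_regression)

def top_k(scores, columns, k=20):
    # Names of the k highest-scoring features.
    return set(pd.Series(scores, index=columns).nlargest(k).index)

def select_binary_features(X, y, k=20):
    mi = top_k(mutual_info_classif(X, y), X.columns, k)
    # chi2 needs non-negative inputs, so shift each feature by its minimum.
    chi = top_k(chi2(X - X.min(), y)[0], X.columns, k)
    et = ExtraTreesClassifier(random_state=0).fit(X, y)
    trees = top_k(et.feature_importances_, X.columns, k)
    return mi & chi & trees  # intersection of the three approaches

def select_regression_features(X, y, k=20):
    mi = top_k(mutual_info_regression(X, y), X.columns, k)
    f = top_k(f_regression(X, y)[0], X.columns, k)
    et = ExtraTreesRegressor(random_state=0).fit(X, y)
    trees = top_k(et.feature_importances_, X.columns, k)
    return mi | f | trees  # union of the three approaches

# Union of the selected features across the three continuous targets:
# regression_features = (select_regression_features(X, y_emi)
#                        | select_regression_features(X, y_roi)
#                        | select_regression_features(X, y_ela))
```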
After obtaining the relevant features for all of our target variables, we created two new dataframes: one containing the relevant features for binary prediction and the other containing the relevant features for regression prediction. We then took a copy of each, giving us one copy for modeling (after applying standard scaling to it) and another, unscaled copy for pipelining, for both types of prediction (see the sketch below).
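A minimal sketch of that split, assuming `df`, `binary_features`, and `regression_features` come from the earlier sketches:

```python
from sklearn.preprocessing import StandardScaler

# One dataframe per prediction task (sorted for a deterministic column order).
X_binary = df[sorted(binary_features)].copy()
X_regression = df[sorted(regression_features)].copy()

# Scaled copies for direct modeling; the unscaled copies are kept for the
# pipelines, which apply StandardScaler internally.
X_binary_scaled = StandardScaler().fit_transform(X_binary)
X_regression_scaled = StandardScaler().fit_transform(X_regression)
```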
We experimented with various machine learning models, including:
- For the binary target variable (LoanStatus):
  - Accuracy score: 96.8%
- For the other three target variables (EMI, ROI, ELA):
  - R-squared (R2) score: 85.93%
  - R-squared (R2) score: 85.89%
  - R-squared (R2) score: 96.46%
  - R-squared (R2) score: 90.3%
  - R-squared (R2) score: 85.89%
- Each model was trained using the engineered features, and its performance was evaluated on a test dataset. This iterative process enabled us to identify the strengths and weaknesses of each model and select the most suitable one for our project (a minimal train-and-evaluate sketch follows).
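A minimal sketch of this loop for one model per task; `y_status`, `y_emi`, `y_roi`, and `y_ela` are assumed names for the stored target variables, and the scaled matrices come from the earlier sketch:

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Binary task: predict LoanStatus.
Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(
    X_binary_scaled, y_status, test_size=0.2, random_state=42)
clf = KNeighborsClassifier().fit(Xb_tr, yb_tr)
print("accuracy:", accuracy_score(yb_te, clf.predict(Xb_te)))

# Multi-output regression task: predict EMI, ROI, and ELA together.
Y = np.column_stack([y_emi, y_roi, y_ela])
Xr_tr, Xr_te, Yr_tr, Yr_te = train_test_split(
    X_regression_scaled, Y, test_size=0.2, random_state=42)
reg = MultiTaskLasso(alpha=1e-5).fit(Xr_tr, Yr_tr)
print("R2:", r2_score(Yr_te, reg.predict(Xr_te)))
```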
We performed hyperparameter tuning using the elbow method for the KNN model and GridSearchCV for the Ridge pipeline (both sketched after the results below). This process involved adjusting various parameters within the model to optimize its performance on the training data. By fine-tuning the model's hyperparameters, we were able to achieve better results and enhance the overall accuracy of our predictions.
After tuning:
- Using the elbow method, the KNN model's accuracy reached 94.3%, slightly higher than before tuning.
- Using GridSearchCV, the Ridge pipeline's R2 score came out at 85.7%, slightly lower than before tuning.
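A hedged sketch of both tuning techniques, reusing the splits from the modeling sketch; the k range and alpha grid are illustrative assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Elbow method: plot the test error rate against k and pick the "elbow".
ks = range(1, 21)
errors = []
for k in ks:
    model = KNeighborsClassifier(n_neighbors=k).fit(Xb_tr, yb_tr)
    errors.append(1 - model.score(Xb_te, yb_te))
plt.plot(list(ks), errors, marker="o")
plt.xlabel("k (number of neighbors)")
plt.ylabel("test error rate")
plt.show()

# GridSearchCV over the Ridge pipeline's regularization strength.
ridge_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2)),
    ("ridge", Ridge()),
])
search = GridSearchCV(ridge_pipe, {"ridge__alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(Xr_tr, Yr_tr)  # scoring defaults to the pipeline's R2 score
print(search.best_params_, search.best_score_)
```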
We evaluated:
- The classification models using accuracy, precision, recall, and F1-score. These metrics provided a comprehensive view of each model's performance, allowing us to compare the models objectively and select the best one for deployment.
- The multi-output regression models using test accuracy, R-squared (R2) score, and mean squared error (a metrics sketch follows this list).
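A minimal sketch of computing these metrics, assuming the fitted `clf` and `reg` models from the earlier sketch and a 0/1-encoded LoanStatus label:

```python
from sklearn.metrics import (accuracy_score, f1_score, mean_squared_error,
                             precision_score, r2_score, recall_score)

# Classification metrics (assumes LoanStatus is encoded as 0/1).
yb_pred = clf.predict(Xb_te)
print("accuracy :", accuracy_score(yb_te, yb_pred))
print("precision:", precision_score(yb_te, yb_pred))
print("recall   :", recall_score(yb_te, yb_pred))
print("F1-score :", f1_score(yb_te, yb_pred))

# Multi-output regression metrics (averaged across EMI, ROI, and ELA).
Yr_pred = reg.predict(Xr_te)
print("R2 :", r2_score(Yr_te, Yr_pred))
print("MSE:", mean_squared_error(Yr_te, Yr_pred))
```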
After splitting the data (x: relevant-features data, y: target variable/s), we created the following pipelines:
1) KNN pipeline: standard scaling, PCA with 4 components, and a KNN model. Accuracy: 94.78%
2) XGB classifier pipeline: standard scaling, PCA with 4 components, and an XGB classifier model. Accuracy: 95.22%
3) MultiTaskLasso pipeline: standard scaling, polynomial features of degree 2, and a MultiTaskLasso model with alpha=1e-05. Score: 86.35%
4) Multiple linear regression pipeline: standard scaling, polynomial features of degree 2, and a LinearRegression model. Score: 86.33%
5) XGB regression pipeline: standard scaling and an XGBRegressor model. Score: 96.5%
6) Decision tree regression pipeline: standard scaling and a DecisionTreeRegressor model. Score: 90.4%
7) Ridge regression pipeline: standard scaling, polynomial features of degree 2, and a Ridge model. Score: 86.34%
Note: we found that adding polynomial features to the XGB regressor and decision tree regressor models and pipelines decreased their scores, so we used polynomial features only in the other multi-output regression models. Two representative pipelines are sketched below.
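A hedged sketch of one pipeline per task (the KNN and MultiTaskLasso ones); the step names are illustrative, and in the project the pipelines are fit on the unscaled feature copies:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import MultiTaskLasso
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Classification: standard scaling -> PCA(4 components) -> KNN.
knn_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=4)),
    ("knn", KNeighborsClassifier()),
])
knn_pipe.fit(Xb_tr, yb_tr)
print("KNN pipeline accuracy:", knn_pipe.score(Xb_te, yb_te))

# Regression: standard scaling -> degree-2 polynomial features -> MultiTaskLasso.
lasso_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2)),
    ("lasso", MultiTaskLasso(alpha=1e-05)),
])
lasso_pipe.fit(Xr_tr, Yr_tr)
print("MultiTaskLasso pipeline R2:", lasso_pipe.score(Xr_te, Yr_te))
```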
Once we had selected and fine-tuned our models:
We saved our two best-scoring models using pickle, one for classification prediction and the other for multi-output regression prediction, and provided detailed steps and code for deploying them.
This included deploying the models on a local server and creating a Streamlit website with a user-friendly interface that loads the saved model files to run the prediction code (a minimal sketch follows).
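A minimal sketch of saving the models and serving them from a Streamlit app; the file names and input fields are illustrative assumptions (in practice the inputs must match the features the models were trained on):

```python
# save_models.py -- persist the two best pipelines after training.
import pickle

with open("classifier.pkl", "wb") as f:
    pickle.dump(knn_pipe, f)
with open("regressor.pkl", "wb") as f:
    pickle.dump(lasso_pipe, f)
```

```python
# app.py -- run with: streamlit run app.py
import pickle

import streamlit as st

st.title("P2P Loan Prediction")

with open("classifier.pkl", "rb") as f:
    classifier = pickle.load(f)
with open("regressor.pkl", "rb") as f:
    regressor = pickle.load(f)

# Collect feature values from the user (field names are illustrative and
# must match the features the models were trained on).
fields = ["CreditScore", "LoanOriginalAmount", "BorrowerAPR"]
features = [st.number_input(name) for name in fields]

if st.button("Predict"):
    st.write("Loan repaid on time:", bool(classifier.predict([features])[0]))
    emi, roi, ela = regressor.predict([features])[0]
    st.write(f"EMI: {emi:.2f}, ROI: {roi:.2f}, ELA: {ela:.2f}")
```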