
Predictive Analysis Using Social Profile in Online P2P Lending Decision

This project predicts whether borrowers on an online peer-to-peer (P2P) lending platform will repay their loans on time and at lower interest rates. We examine factors related to the borrower, the loan, and their social profile to see how they affect loan performance. By analyzing this data, we can suggest ways for borrowers and lenders to increase their chances of successful lending and repayment.

Introduction

We study the borrower-, loan-, and social-related determinants of performance predictability in an online P2P lending market by conceptualizing financial and social strength to predict whether borrowers can be funded at lower interest rates and lenders will be paid on time.

So, this project consists of two parts:

  1. Binary classification prediction:
  • LoanStatus : whether the borrower will repay the loan or not.
  2. Multi-output regression prediction:
  • Preferred EMI : the ideal or desired amount that a borrower would like to pay as their monthly installment towards a loan.
  • Preferred ROI : the expected or desired rate of return on an investment.
  • ELA : the maximum loan amount for each loan application based on the criteria set by the lending institution. It can be a significant feature in analyzing and modeling loan data, providing insight into loan approval decisions and potential borrowing capacities.

Data Collection

The data used was provided by Prosper.

Data Preprocessing

The data was preprocessed by :

  • Dropping features that are not important.

Screenshot 2023-07-09 232004

-------------------------------------------------------------------------

Screenshot 2023-07-09 232217

Handling null values

  • Missing values in the data : columns containing less than 70% null values were imputed with the median (for numerical features) or the mode (for categorical features); otherwise, the columns were dropped.
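
A minimal pandas sketch of this imputation logic (assuming the DataFrame is named `df`; the threshold handling mirrors the description above):

```python
import pandas as pd

def handle_nulls(df: pd.DataFrame, drop_threshold: float = 0.70) -> pd.DataFrame:
    """Impute columns with fewer nulls than the threshold; drop the rest."""
    df = df.copy()
    null_ratio = df.isnull().mean()

    # Drop columns where 70% or more of the values are missing.
    df = df.drop(columns=null_ratio[null_ratio >= drop_threshold].index)

    # Impute the remaining columns: median for numeric, mode for categorical.
    for col in df.columns[df.isnull().any()]:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])
    return df
```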

Screenshot 2023-07-09 232051

-------------------------------------------------------------------------

Screenshot 2023-07-09 232113

-------------------------------------------------------------------------

Screenshot 2023-07-09 232132

Handling outliers

  • Values higher than the upper bound or lower than the lower bound of acceptable values were replaced with the maximum or minimum value, respectively.
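
A hedged sketch of the capping step, assuming the 1.5×IQR rule defines the bounds (the notebook's exact bounds may differ):

```python
import pandas as pd

def cap_outliers(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Clip values outside the IQR bounds back to the bounds themselves."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    df[column] = df[column].clip(lower=lower, upper=upper)
    return df
```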

Screenshot 2023-07-09 232652

-------------------------------------------------------------------------

Screenshot 2023-07-09 232925

-------------------------------------------------------------------------

Screenshot 2023-07-09 232823

-------------------------------------------------------------------------

Screenshot 2023-07-09 232715

-------------------------------------------------------------------------

Screenshot 2023-07-09 233012

-------------------------------------------------------------------------

  • Converted the features (of dtype object) containing date and time values into a suitable format (dtype datetime) so they can be used easily in the analysis.
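
A short sketch of the conversion, with illustrative column names:

```python
import pandas as pd

# Column names are illustrative; the Prosper data contains similar date fields.
date_columns = ["ListingCreationDate", "ClosedDate", "DateCreditPulled"]
for col in date_columns:
    df[col] = pd.to_datetime(df[col], errors="coerce")
```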

Screenshot 2023-07-09 233051

-------------------------------------------------------------------------

  • Created our four target variables and the features needed to calculate them.
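
The exact formulas are defined in the notebook; as an illustration only, the standard EMI formula looks like this (ROI and ELA follow project-specific criteria and are not reproduced here):

```python
# Illustrative only: the notebook's own target-variable calculations may differ.
def emi(principal: float, annual_rate: float, term_months: int) -> float:
    """Standard equated monthly installment formula."""
    r = annual_rate / 12  # monthly interest rate
    return principal * r * (1 + r) ** term_months / ((1 + r) ** term_months - 1)

print(emi(10_000, 0.12, 36))  # ~332.14
```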

Screenshot 2023-07-09 232248

-------------------------------------------------------------------------

Screenshot 2023-07-09 233230

-------------------------------------------------------------------------

Screenshot 2023-07-09 233302

-------------------------------------------------------------------------

Screenshot 2023-07-09 233315

-------------------------------------------------------------------------

Screenshot 2023-07-09 233403

-------------------------------------------------------------------------

Screenshot 2023-07-09 233444

-------------------------------------------------------------------------

Screenshot 2023-07-09 233808

-------------------------------------------------------------------------

Screenshot 2023-07-09 233519

-------------------------------------------------------------------------

Screenshot 2023-07-09 233705

-------------------------------------------------------------------------

Screenshot 2023-07-09 233728

-------------------------------------------------------------------------

Screenshot 2023-07-09 233748

-------------------------------------------------------------------------

EDA ( Exploratory Data Analysis )

We used different plots to visualize the data and see the relationships between features (a plotting sketch follows at the end of this section):

Univariate Analysis

  • Explores each variable in the dataset separately.

Screenshot 2023-07-09 230641

-------------------------------------------------------------------------

Screenshot 2023-07-09 230618

-------------------------------------------------------------------------

Screenshot 2023-07-09 230857

-------------------------------------------------------------------------

Screenshot 2023-07-10 192231

-------------------------------------------------------------------------

Screenshot 2023-07-09 230912

Bivariate Analysis

  • A statistical method that examines how two different features in the data are related.

    Screenshot 2023-07-09 231119

-------------------------------------------------------------------------

Screenshot 2023-07-09 231202

-------------------------------------------------------------------------

Screenshot 2023-07-09 231658

-------------------------------------------------------------------------

Screenshot 2023-07-09 231620

-------------------------------------------------------------------------

Screenshot 2023-07-09 231545

-------------------------------------------------------------------------

Screenshot 2023-07-09 231448

-------------------------------------------------------------------------

Screenshot 2023-07-09 231411

-------------------------------------------------------------------------

Screenshot 2023-07-09 231321


Multivariate Analysis

  • It involves evaluating multiple variables in the data (more than two) to identify any possible association among them.

Screenshot 2023-07-09 231026
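
A minimal seaborn sketch of the three kinds of analysis (column names are illustrative):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Univariate: distribution of a single feature.
sns.histplot(df["BorrowerRate"], bins=30)
plt.show()

# Bivariate: relationship between two features.
sns.boxplot(x="LoanStatus", y="BorrowerRate", data=df)
plt.show()

# Multivariate: correlations among several numeric features.
sns.heatmap(df.select_dtypes("number").corr(), cmap="coolwarm")
plt.show()
```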

Feature Engineering

Dropping features

  • We dropped all the features used to calculate the four target variables, as well as other unimportant features.

Screenshot 2023-07-09 233955

-------------------------------------------------------------------------

Feature correlation

  • We dropped the features which have a high correlation with our target variables; for the remaining features that were highly correlated with each other, we kept only one from each correlated group shown in the heat maps.
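
A sketch of dropping one feature from each highly correlated group, assuming a correlation threshold of 0.9 (the notebook's threshold may differ):

```python
import numpy as np

# Keep only the upper triangle of the absolute correlation matrix,
# then drop any feature that correlates strongly with an earlier one.
corr = df.select_dtypes("number").corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)
```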

Screenshot 2023-07-09 234252

-------------------------------------------------------------------------

Screenshot 2023-07-09 234225

-------------------------------------------------------------------------

Screenshot 2023-07-09 234151

-------------------------------------------------------------------------

New features creation

  • We created two new features based on existing ones in the data so they can be used more easily.

Screenshot 2023-07-09 234549

-------------------------------------------------------------------------

Screenshot 2023-07-09 234438

-------------------------------------------------------------------------

Screenshot 2023-07-09 234424

-------------------------------------------------------------------------

Screenshot 2023-07-09 234340

Data Encoding

  • We used label encoding for multi-category features and binary encoding for features containing (True, False) values.
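
A short sketch of the encoding step, with illustrative column names:

```python
from sklearn.preprocessing import LabelEncoder

# Label encoding for multi-category features.
le = LabelEncoder()
for col in ["EmploymentStatus", "BorrowerState"]:
    df[col] = le.fit_transform(df[col].astype(str))

# Binary (True/False) features mapped to 1/0.
for col in ["IsBorrowerHomeowner", "IncomeVerifiable"]:
    df[col] = df[col].astype(int)
```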

Screenshot 2023-07-09 234715

-------------------------------------------------------------------------

Screenshot 2023-07-09 234642

Dropping target variables from the data

  • The four target variables were removed from the data and stored in separate variables.

Screenshot 2023-07-09 234735

Checking for nulls in target variables

  • We checked for null values in the four target variables. We realized that there were missing values in the 'EMI' target variable, so we replaced them with the median value.
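
A sketch of separating the targets and filling the missing EMI values (variable names are illustrative):

```python
# Separate the four target variables and impute the missing EMI values with the median.
targets = ["LoanStatus", "EMI", "ROI", "ELA"]
y = df[targets].copy()
X = df.drop(columns=targets)

y["EMI"] = y["EMI"].fillna(y["EMI"].median())
print(y.isnull().sum())
```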

Screenshot 2023-07-09 234748

-------------------------------------------------------------------------

Screenshot 2023-07-09 235831

-------------------------------------------------------------------------

Screenshot 2023-07-09 235734

Feature Selection

  • We used three approaches to get the relevant features for each target variable (a code sketch follows at the end of this section). These approaches are :

    1. MI classification, chi-squared, and Extra Trees classifier for the binary target variable :
      • We took the top 20 relevant features according to each approach.

    Screenshot 2023-07-09 234846

    - Then we took the intersection of the relevant features from the three approaches.
    

    Screenshot 2023-07-09 234831

    2. MI regression, f_regression (univariate selection), and Extra Trees regressor for the continuous target variables :
      • We took the top 20 relevant features according to each approach for each target variable.

    EMI

    Screenshot 2023-07-12 034710

    ROI

    Screenshot 2023-07-12 034740

    ELA

    Screenshot 2023-07-12 034811

  • Then, for each target variable, we took the union of the relevant features from the three approaches.

    EMI

    Screenshot 2023-07-12 034722

    ROI

    Screenshot 2023-07-12 034753

    ELA

    Screenshot 2023-07-12 034955

    • Finally, we took the union of the resulting features across the three target variables.

    Screenshot 2023-07-12 033945
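
A hedged sketch of this selection logic, assuming X holds the encoded features, y_class the LoanStatus labels, and y_reg a frame with the EMI, ROI, and ELA columns (all names are illustrative):

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor
from sklearn.feature_selection import chi2, f_regression, mutual_info_classif, mutual_info_regression
from sklearn.preprocessing import MinMaxScaler

def top_k(scores, columns, k=20):
    """Names of the k highest-scoring features."""
    return set(pd.Series(scores, index=columns).nlargest(k).index)

# Binary target (LoanStatus): three selectors, then the intersection of their top-20 sets.
X_pos = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)  # chi2 needs non-negative input
mi_feats = top_k(mutual_info_classif(X, y_class), X.columns)
chi_feats = top_k(chi2(X_pos, y_class)[0], X.columns)
tree_feats = top_k(ExtraTreesClassifier(random_state=0).fit(X, y_class).feature_importances_, X.columns)
binary_features = mi_feats & chi_feats & tree_feats

# Continuous targets (EMI, ROI, ELA): the union of the top-20 sets per target,
# then the union across the three targets.
regression_features = set()
for target in ["EMI", "ROI", "ELA"]:
    yt = y_reg[target]
    mi = top_k(mutual_info_regression(X, yt), X.columns)
    f = top_k(f_regression(X, yt)[0], X.columns)
    tree = top_k(ExtraTreesRegressor(random_state=0).fit(X, yt).feature_importances_, X.columns)
    regression_features |= (mi | f | tree)
```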

Standard Scaling

  • After getting all relevant features for our target variables, we created new dataframes: one containing the relevant features for binary prediction and another containing the relevant features for regression prediction. We then took a copy of each so that one copy is used in modeling after applying standard scaling and the other is used in pipelining for both types of prediction (see the sketch after this list).

    1. For binary target variable ( LoanStatus )

    Screenshot 2023-07-09 235028

    2. For the other three target variables ( EMI , ROI , ELA )

    Screenshot 2023-07-10 222151
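
A minimal sketch of this step, reusing the binary_features and regression_features sets from the selection sketch above (names are illustrative):

```python
from sklearn.preprocessing import StandardScaler

# One frame per prediction task, plus an unscaled copy kept for the pipelines.
X_binary = df[sorted(binary_features)].copy()
X_regression = df[sorted(regression_features)].copy()
X_binary_pipe, X_regression_pipe = X_binary.copy(), X_regression.copy()

scaler = StandardScaler()
X_binary_scaled = scaler.fit_transform(X_binary)
X_regression_scaled = scaler.fit_transform(X_regression)
```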

Model Building

We experimented with various machine learning models, including:

Classification models:

KNN

  • Accuracy score = 94% (this score is before hyperparameter tuning).

    Screenshot 2023-07-09 235346

    Screenshot 2023-07-09 235420( 1 )

XGBoost Classifier

  • Accuracy score = 96.8%.

Screenshot 2023-07-09 235647
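
A hedged sketch of the two classifiers, assuming the scaled binary-prediction frame and LoanStatus labels from the earlier sketches:

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X_binary_scaled, y_class, test_size=0.2, random_state=42)

knn = KNeighborsClassifier().fit(X_train, y_train)
xgb = XGBClassifier().fit(X_train, y_train)

print("KNN accuracy:", accuracy_score(y_test, knn.predict(X_test)))
print("XGB accuracy:", accuracy_score(y_test, xgb.predict(X_test)))
```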

Multi Output Regression models:

Polynomial Features creation

Screenshot 2023-07-10 222222

Lasso regression model

- R-squared (R2) Score: 85.93 %

Screenshot 2023-07-10 222248

Polynomial Regression models

1) Multiple linear regression model

 - R-squared (R2) Score: 85.89%

Screenshot 2023-07-10 222322

2) XGB regression model

 - R-squared (R2) Score: 96.46 %

Screenshot 2023-07-11 004644

3) Decision Tree regression model

 - R-squared (R2) Score: 90.3 %

Screenshot 2023-07-11 004550

4) Ridge regression model

 - R-squared (R2) Score : 85.89 %

Screenshot 2023-07-10 222505
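
A hedged sketch of the multi-output regressors, assuming a train/test split of the scaled regression frame and the three continuous targets (hyperparameters other than those stated above are illustrative):

```python
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_regression_scaled, y_reg, test_size=0.2, random_state=42)

# Polynomial features (degree 2, matching the pipelines below).
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train_reg)
X_test_poly = poly.transform(X_test_reg)

models = {
    "Lasso": Lasso(alpha=1e-05).fit(X_train_poly, y_train_reg),
    "Linear": LinearRegression().fit(X_train_poly, y_train_reg),
    "Ridge": Ridge().fit(X_train_poly, y_train_reg),
    "DecisionTree": DecisionTreeRegressor(random_state=0).fit(X_train_reg, y_train_reg),
    "XGB": MultiOutputRegressor(XGBRegressor()).fit(X_train_reg, y_train_reg),
}
for name, model in models.items():
    X_eval = X_test_poly if name in ("Lasso", "Linear", "Ridge") else X_test_reg
    print(name, "R2:", r2_score(y_test_reg, model.predict(X_eval)))
```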

  • Each model was trained using the engineered features, and their performance was evaluated on a test dataset. This iterative process enabled us to identify the strengths and weaknesses of each model and select the most suitable one for our project.

Hyperparameter Tuning

  • We performed hyperparameter tuning using the elbow method for the KNN model and GridSearchCV for the ridge pipeline. This process involved adjusting various parameters within the model to optimize its performance on the training data. By fine-tuning the model's hyperparameters, we were able to achieve better results and enhance the overall accuracy of our predictions.

  • After tuning :

    1. Using the elbow method, we got an accuracy of 94.3% for the KNN model, which is slightly higher than before tuning.

    Screenshot 2023-07-09 235555 ( 1 )

    -------------------------------------------------------------------------

Screenshot 2023-07-09 235535

  2. Using GridSearchCV, we got an R2 score of 85.7% for the ridge pipeline, which is slightly lower than before tuning.

Screenshot 2023-07-12 040607
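
A sketch of both tuning steps, with an assumed k range and alpha grid, reusing the splits from the earlier sketches:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Elbow method: error rate of KNN for k = 1..40; pick the k where the curve flattens.
error_rate = []
for k in range(1, 41):
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rate.append(np.mean(knn_k.predict(X_test) != y_test))

# GridSearchCV over the ridge pipeline's alpha (the grid itself is an assumption).
ridge_pipe = Pipeline([("scale", StandardScaler()),
                       ("poly", PolynomialFeatures(degree=2)),
                       ("ridge", Ridge())])
grid = GridSearchCV(ridge_pipe, {"ridge__alpha": [0.01, 0.1, 1, 10]}, cv=5, scoring="r2")
grid.fit(X_train_reg, y_train_reg)
print(grid.best_params_, grid.best_score_)
```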

Models Evaluation

We evaluated:

  • The Classification models using accuracy, precision, recall, and F1-score metrics. These metrics provided a comprehensive view of each model's performance, allowing us to compare them objectively and select the best model for deployment.

    KNN model evaluation (before tuning)

    Screenshot 2023-07-09 235420

    -------------------------------------------------------------------------

Screenshot 2023-07-09 235433

KNN model (after tuning)

Screenshot 2023-07-09 235608

-------------------------------------------------------------------------

Screenshot 2023-07-09 235555

XGB classification model :

Screenshot 2023-07-09 235710

  • The Multi-Output Regression models using test accuracy, R-squared (R2) score, and Mean Squared Error.

    Lasso Regression model

    Screenshot 2023-07-12 035448

    Polynomial Regression models :

    1) Multiple linear regression model

    Screenshot 2023-07-12 035506

    2) XGB regression model

    Screenshot 2023-07-12 035518

    3) Decision Tree regression model

Screenshot 2023-07-12 035538

4) Ridge regression model

Screenshot 2023-07-12 035555
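
A short sketch of the evaluation calls, reusing the fitted models and test splits from the sketches above:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             mean_squared_error, r2_score)

# Classification metrics for the LoanStatus models.
y_pred = knn.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))  # precision, recall, F1-score per class

# Regression metrics for the EMI / ROI / ELA models.
y_pred_reg = models["XGB"].predict(X_test_reg)
print("R2 :", r2_score(y_test_reg, y_pred_reg))
print("MSE:", mean_squared_error(y_test_reg, y_pred_reg))
```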

Pipelining

After data splitting (X → relevant features, y → target variable(s)), we created:

Classification pipelines :

 1) A KNN pipeline was created using standard scaling, PCA with 4 components, and a KNN model. We got an accuracy of 94.78%.

Screenshot 2023-07-11 010715

-------------------------------------------------------------------------

Screenshot 2023-07-11 011055

2) An XGB classifier pipeline was created using standard scaling, PCA with 4 components, and an XGB classifier model. We got an accuracy of 95.22%.

Screenshot 2023-07-11 011126

-------------------------------------------------------------------------

Screenshot 2023-07-11 011139

Multi-regression pipelines :

  1. A MultiTaskLasso pipeline was created using standard scaling, polynomial features with degree = 2, and a MultiTaskLasso model with alpha = 1e-05. We got an accuracy of 86.35%.

Screenshot 2023-07-12 041043

  2. A Multiple Linear Regression pipeline was created using standard scaling, polynomial features with degree = 2, and a LinearRegression model. We got an accuracy of 86.33%.

Screenshot 2023-07-12 041202

  3. An XGB Regression pipeline was created using standard scaling and an XGBRegressor model. We got an accuracy of 96.5%.

Screenshot 2023-07-12 041215

  4. A Decision Tree Regression pipeline was created using standard scaling and a DecisionTreeRegressor model. We got an accuracy of 90.4%.

Screenshot 2023-07-12 041236

  5. A Ridge Regression pipeline was created using standard scaling, polynomial features with degree = 2, and a Ridge model. We got an accuracy of 86.34%.

Screenshot 2023-07-12 041247

Note : We found that using polynomial features with the XGB Regressor and Decision Tree Regressor models and pipelines decreased the scores, so we used them only in the other multi-regression models.
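
A hedged sketch of one classification pipeline and one multi-regression pipeline (the remaining pipelines follow the same pattern; variable names come from the earlier sketches):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import MultiTaskLasso
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Classification pipeline: scaling -> PCA(4 components) -> KNN (the XGB pipeline is analogous).
knn_pipe = Pipeline([("scale", StandardScaler()),
                     ("pca", PCA(n_components=4)),
                     ("knn", KNeighborsClassifier())])
knn_pipe.fit(X_train, y_train)
print("KNN pipeline accuracy:", knn_pipe.score(X_test, y_test))

# Multi-regression pipeline: scaling -> polynomial features -> MultiTaskLasso
# (polynomial features are omitted for the XGB and Decision Tree pipelines).
lasso_pipe = Pipeline([("scale", StandardScaler()),
                       ("poly", PolynomialFeatures(degree=2)),
                       ("lasso", MultiTaskLasso(alpha=1e-05))])
lasso_pipe.fit(X_train_reg, y_train_reg)
print("MultiTaskLasso pipeline R2:", lasso_pipe.score(X_test_reg, y_test_reg))
```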

Deployment

Once we had selected and fine-tuned our models:

  • We saved our two models with the highest and best scores, one for classification prediction and the other for multi-regression prediction, using pickle. We provided detailed steps and code for deploying the model.

    Screenshot 2023-07-12 041303

  • This included deploying the model on a local server and creating a Streamlit website with a user-friendly interface that loads the saved model files to run the prediction code.
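
A minimal sketch of the saving and serving steps, with illustrative file and variable names:

```python
import pickle

import streamlit as st

# Save the best classification and regression models (names are illustrative).
with open("loan_status_model.pkl", "wb") as f:
    pickle.dump(knn_pipe, f)
with open("emi_roi_ela_model.pkl", "wb") as f:
    pickle.dump(lasso_pipe, f)

# Minimal Streamlit app loading a saved model for prediction.
model = pickle.load(open("loan_status_model.pkl", "rb"))
st.title("P2P Lending Prediction")
amount = st.number_input("Loan amount", min_value=0.0)
if st.button("Predict LoanStatus"):
    features = [[amount]]  # in practice, all relevant features collected from the form
    st.write(model.predict(features))
```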

WhatsApp Image 2023-07-13 at 1 46 55 AM

-------------------------------------------------------------------------

WhatsApp Image 2023-07-13 at 1 43 35 AM

-------------------------------------------------------------------------

WhatsApp Image 2023-07-13 at 1 44 23 AM

-------------------------------------------------------------------------

WhatsApp Image 2023-07-13 at 1 49 16 AM

-------------------------------------------------------------------------

  • Upon using our website :

    LoanStatus prediction

    WhatsApp Image 2023-07-13 at 1 53 56 AM

-------------------------------------------------------------------------

WhatsApp Image 2023-07-13 at 1 57 29 AM

EMI , ROI , ELA prediction

WhatsApp Image 2023-07-13 at 2 05 55 AM
