This project predicts whether borrowers on an online peer-to-peer (P2P) lending platform will repay their loans on time and at lower interest rates. We examine factors related to the borrower, the loan, and the borrower's social profile to see how they affect loan performance. By analyzing this data, we can suggest ways for borrowers and lenders to increase their chances of successful lending and repayment.
- Introduction
- Data Collection
- Data Preprocessing
- EDA
- Feature Engineering
- Model Building
- Hyperparameter Tuning
- Models Evaluation
- Pipelining
- Deployment
We study the borrower-, loan-, and social-related determinants of performance in an online P2P lending market, modeling financial and social strength to predict whether borrowers could be funded at lower interest rates and whether lenders would be repaid on time.
This project consists of two parts:
- Binary classification prediction:
  - LoanStatus: whether the borrower will repay the loan or not.
- Multi-output regression prediction:
  - Preferred EMI: the ideal or desired amount a borrower would like to pay as their monthly installment towards a loan.
  - Preferred ROI: the expected or desired rate of return on an investment.
  - ELA: the maximum loan amount for each loan application based on the criteria set by the lending institution. It can be a significant feature when analyzing and modeling loan data, providing insight into loan approval decisions and potential borrowing capacity.
The data used was provided by Prosper, a P2P lending company.
The data was preprocessed by:
- Dropping features that are not important.
- Handling missing values: columns with less than 70% null values were imputed with the median (for numerical features) or the mode (for categorical features); columns with more nulls were dropped (see the sketch after this list).
- Handling outliers: values above the upper bound or below the lower bound were clipped to the nearest boundary value.
- Converting object-type features containing date and time values to datetime format so they can be used easily.
- Creating our four target variables and the features needed to calculate them.
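A minimal sketch of these preprocessing steps, assuming a pandas dataframe; the 1.5×IQR outlier bounds and the `ListingCreationDate` column are illustrative assumptions, not the exact project code:

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Drop columns with more than 70% nulls; impute the rest
    # (median for numeric features, mode for categorical ones).
    null_ratio = df.isnull().mean()
    df = df.drop(columns=null_ratio[null_ratio > 0.7].index)
    for col in df.columns[df.isnull().any()]:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])

    # Clip outliers to the upper/lower bounds (1.5 * IQR is an assumption).
    for col in df.select_dtypes(include=np.number).columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    # Convert object-typed date columns to datetime (column name assumed).
    if "ListingCreationDate" in df.columns:
        df["ListingCreationDate"] = pd.to_datetime(df["ListingCreationDate"])
    return df
```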
We used different plots to visualize the data and examine the relationships between features (example plots are sketched after this list):
- Univariate analysis: exploring each variable in the dataset separately.
- Multivariate analysis: evaluating multiple variables in the data (more than two) to identify any possible associations among them.
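For example, a hedged sketch of both kinds of plots; the file name and the `BorrowerAPR` column are assumptions based on the standard Prosper CSV export:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("prosperLoanData.csv")  # file name is an assumption

# Univariate analysis: the distribution of a single numeric feature.
sns.histplot(df["BorrowerAPR"].dropna(), kde=True)
plt.title("Distribution of BorrowerAPR")
plt.show()

# Multivariate analysis: a correlation heat map over the numeric features.
sns.heatmap(df.select_dtypes("number").corr(), cmap="coolwarm", center=0)
plt.title("Correlation between numeric features")
plt.show()
```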
- We dropped all the features used to calculate the four target variables, along with other unimportant features.
- We dropped the features that are highly correlated with our target variables; for each group of mutually correlated features, we kept one feature per heat map and dropped the rest.
- We created two new features based on existing ones to make the data easier to use.
- We used label encoding for multi-category features and binary encoding for features containing only true/false values (see the encoding sketch after this list).
- The four target variables were removed from the data and stored in separate variables.
- We checked for null values in the four target variables, found missing values in the 'EMI' target variable, and replaced them with the median value.
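A minimal sketch of the encoding step, assuming the dataframe from the preprocessing sketch above:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_features(df: pd.DataFrame) -> pd.DataFrame:
    # Binary encoding: map pure True/False columns to 1/0.
    for col in df.columns:
        vals = df[col].dropna()
        if len(vals) and vals.isin([True, False]).all():
            df[col] = df[col].astype(int)
    # Label encoding: map each category of a multi-category
    # feature to an integer code.
    for col in df.select_dtypes(include="object").columns:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    return df
```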
We used three approaches to obtain the relevant features for each target variable (the whole procedure is sketched after this list):
- MI classification, chi-squared, and Extra Trees classifier for the binary target variable:
  - We took the top 20 relevant features according to each approach.
  - Then we took the intersection of the three approaches' features.
- MI regression, f_regression (univariate selection), and Extra Trees regressor for the continuous target variables:
  - We took the top 20 relevant features according to each approach for each target variable.
  - Then, for each target variable, we took the union of the three approaches' relevant features.
  - Finally, we took the union of the resulting features across the three continuous target variables.
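A hedged sketch of this selection procedure; the function and variable names are illustrative, and shifting features by their minimum to satisfy chi-squared's non-negativity requirement is an assumption:

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor
from sklearn.feature_selection import (chi2, f_regression,
                                       mutual_info_classif,
                                       mutual_info_regression)

def top_k(scores, columns, k=20):
    # Names of the k highest-scoring features.
    return set(pd.Series(scores, index=columns).nlargest(k).index)

def select_binary_features(X, y, k=20):
    mi = top_k(mutual_info_classif(X, y), X.columns, k)
    # chi2 needs non-negative inputs, so shift each feature by its minimum.
    chi = top_k(chi2(X - X.min(), y)[0], X.columns, k)
    et = ExtraTreesClassifier(random_state=0).fit(X, y)
    trees = top_k(et.feature_importances_, X.columns, k)
    return mi & chi & trees  # intersection of the three approaches

def select_regression_features(X, y, k=20):
    mi = top_k(mutual_info_regression(X, y), X.columns, k)
    f = top_k(f_regression(X, y)[0], X.columns, k)
    et = ExtraTreesRegressor(random_state=0).fit(X, y)
    trees = top_k(et.feature_importances_, X.columns, k)
    return mi | f | trees  # union of the three approaches

# Union of the selected features across the three continuous targets:
# regression_features = (select_regression_features(X, y_emi)
#                        | select_regression_features(X, y_roi)
#                        | select_regression_features(X, y_ela))
```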
After obtaining the relevant features for all of our target variables, we created two new dataframes: one containing the relevant features for binary prediction and the other containing the relevant features for regression prediction. We then took a copy of each, giving us one copy for modeling (after applying standard scaling to it) and another, unscaled copy for pipelining, for both types of prediction (see the sketch below).
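A minimal sketch of that split, assuming `df`, `binary_features`, and `regression_features` come from the earlier sketches:

```python
from sklearn.preprocessing import StandardScaler

# One dataframe per prediction task (sorted for a deterministic column order).
X_binary = df[sorted(binary_features)].copy()
X_regression = df[sorted(regression_features)].copy()

# Scaled copies for direct modeling; the unscaled copies are kept for the
# pipelines, which apply StandardScaler internally.
X_binary_scaled = StandardScaler().fit_transform(X_binary)
X_regression_scaled = StandardScaler().fit_transform(X_regression)
```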
We experimented with various machine learning models, including:
- For the binary target variable (LoanStatus):
  - Accuracy score: 96.8%
- For the other three target variables (EMI, ROI, ELA):
  - R-squared (R2) score: 85.93%
  - R-squared (R2) score: 85.89%
  - R-squared (R2) score: 96.46%
  - R-squared (R2) score: 90.3%
  - R-squared (R2) score: 85.89%
- Each model was trained using the engineered features, and its performance was evaluated on a test dataset. This iterative process enabled us to identify the strengths and weaknesses of each model and select the most suitable one for our project (a minimal train-and-evaluate sketch follows).
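A minimal sketch of this loop for one model per task; `y_status`, `y_emi`, `y_roi`, and `y_ela` are assumed names for the stored target variables, and the scaled matrices come from the earlier sketch:

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Binary task: predict LoanStatus.
Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(
    X_binary_scaled, y_status, test_size=0.2, random_state=42)
clf = KNeighborsClassifier().fit(Xb_tr, yb_tr)
print("accuracy:", accuracy_score(yb_te, clf.predict(Xb_te)))

# Multi-output regression task: predict EMI, ROI, and ELA together.
Y = np.column_stack([y_emi, y_roi, y_ela])
Xr_tr, Xr_te, Yr_tr, Yr_te = train_test_split(
    X_regression_scaled, Y, test_size=0.2, random_state=42)
reg = MultiTaskLasso(alpha=1e-5).fit(Xr_tr, Yr_tr)
print("R2:", r2_score(Yr_te, reg.predict(Xr_te)))
```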
We performed hyperparameter tuning using the elbow method for the KNN model and GridSearchCV for the Ridge pipeline (both sketched after the results below). This process involved adjusting various parameters within the model to optimize its performance on the training data. By fine-tuning the model's hyperparameters, we were able to achieve better results and enhance the overall accuracy of our predictions.
After tuning:
- Using the elbow method, the KNN model's accuracy reached 94.3%, slightly higher than before tuning.
- Using GridSearchCV, the Ridge pipeline's R2 score came out at 85.7%, slightly lower than before tuning.
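A hedged sketch of both tuning techniques, reusing the splits from the modeling sketch; the k range and alpha grid are illustrative assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Elbow method: plot the test error rate against k and pick the "elbow".
ks = range(1, 21)
errors = []
for k in ks:
    model = KNeighborsClassifier(n_neighbors=k).fit(Xb_tr, yb_tr)
    errors.append(1 - model.score(Xb_te, yb_te))
plt.plot(list(ks), errors, marker="o")
plt.xlabel("k (number of neighbors)")
plt.ylabel("test error rate")
plt.show()

# GridSearchCV over the Ridge pipeline's regularization strength.
ridge_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2)),
    ("ridge", Ridge()),
])
search = GridSearchCV(ridge_pipe, {"ridge__alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(Xr_tr, Yr_tr)  # scoring defaults to the pipeline's R2 score
print(search.best_params_, search.best_score_)
```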
We evaluated:
- The classification models using accuracy, precision, recall, and F1-score. These metrics provided a comprehensive view of each model's performance, allowing us to compare the models objectively and select the best one for deployment.
- The multi-output regression models using test accuracy, R-squared (R2) score, and mean squared error (a metrics sketch follows this list).
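A minimal sketch of computing these metrics, assuming the fitted `clf` and `reg` models from the earlier sketch and a 0/1-encoded LoanStatus label:

```python
from sklearn.metrics import (accuracy_score, f1_score, mean_squared_error,
                             precision_score, r2_score, recall_score)

# Classification metrics (assumes LoanStatus is encoded as 0/1).
yb_pred = clf.predict(Xb_te)
print("accuracy :", accuracy_score(yb_te, yb_pred))
print("precision:", precision_score(yb_te, yb_pred))
print("recall   :", recall_score(yb_te, yb_pred))
print("F1-score :", f1_score(yb_te, yb_pred))

# Multi-output regression metrics (averaged across EMI, ROI, and ELA).
Yr_pred = reg.predict(Xr_te)
print("R2 :", r2_score(Yr_te, Yr_pred))
print("MSE:", mean_squared_error(Yr_te, Yr_pred))
```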
After splitting the data (x: relevant-features data, y: target variable/s), we created the following pipelines:
1) KNN pipeline: standard scaling, PCA with 4 components, and a KNN model. Accuracy: 94.78%
2) XGB classifier pipeline: standard scaling, PCA with 4 components, and an XGB classifier model. Accuracy: 95.22%
3) MultiTaskLasso pipeline: standard scaling, polynomial features of degree 2, and a MultiTaskLasso model with alpha=1e-05. Score: 86.35%
4) Multiple linear regression pipeline: standard scaling, polynomial features of degree 2, and a LinearRegression model. Score: 86.33%
5) XGB regression pipeline: standard scaling and an XGBRegressor model. Score: 96.5%
6) Decision tree regression pipeline: standard scaling and a DecisionTreeRegressor model. Score: 90.4%
7) Ridge regression pipeline: standard scaling, polynomial features of degree 2, and a Ridge model. Score: 86.34%
Note: we found that adding polynomial features to the XGB regressor and decision tree regressor models and pipelines decreased their scores, so we used polynomial features only in the other multi-output regression models. Two representative pipelines are sketched below.
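A hedged sketch of one pipeline per task (the KNN and MultiTaskLasso ones); the step names are illustrative, and in the project the pipelines are fit on the unscaled feature copies:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import MultiTaskLasso
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Classification: standard scaling -> PCA(4 components) -> KNN.
knn_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=4)),
    ("knn", KNeighborsClassifier()),
])
knn_pipe.fit(Xb_tr, yb_tr)
print("KNN pipeline accuracy:", knn_pipe.score(Xb_te, yb_te))

# Regression: standard scaling -> degree-2 polynomial features -> MultiTaskLasso.
lasso_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2)),
    ("lasso", MultiTaskLasso(alpha=1e-05)),
])
lasso_pipe.fit(Xr_tr, Yr_tr)
print("MultiTaskLasso pipeline R2:", lasso_pipe.score(Xr_te, Yr_te))
```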
Once we had selected and fine-tuned our models:
We saved our two best-scoring models using pickle, one for classification prediction and the other for multi-output regression prediction, and provided detailed steps and code for deploying them.
This included deploying the models on a local server and creating a Streamlit website with a user-friendly interface that loads the saved model files to run the prediction code (a minimal sketch follows).
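A minimal sketch of saving the models and serving them from a Streamlit app; the file names and input fields are illustrative assumptions (in practice the inputs must match the features the models were trained on):

```python
# save_models.py -- persist the two best pipelines after training.
import pickle

with open("classifier.pkl", "wb") as f:
    pickle.dump(knn_pipe, f)
with open("regressor.pkl", "wb") as f:
    pickle.dump(lasso_pipe, f)
```

```python
# app.py -- run with: streamlit run app.py
import pickle

import streamlit as st

st.title("P2P Loan Prediction")

with open("classifier.pkl", "rb") as f:
    classifier = pickle.load(f)
with open("regressor.pkl", "rb") as f:
    regressor = pickle.load(f)

# Collect feature values from the user (field names are illustrative and
# must match the features the models were trained on).
fields = ["CreditScore", "LoanOriginalAmount", "BorrowerAPR"]
features = [st.number_input(name) for name in fields]

if st.button("Predict"):
    st.write("Loan repaid on time:", bool(classifier.predict([features])[0]))
    emi, roi, ela = regressor.predict([features])[0]
    st.write(f"EMI: {emi:.2f}, ROI: {roi:.2f}, ELA: {ela:.2f}")
```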