## <center><font color=navy>Big Data Economics</font></center>
### <center>Logistic Regression: an essential BD tool</center>
#### <center>Ali Habibnia</center>
 
<center> Assistant Professor, Department of Economics, </center>
<center> and Division of Computational Modeling & Data Analytics at Virginia Tech</center>
<center> habibnia@vt.edu </center> 

<div class="alert alert-block alert-info">

Recruiters in industry expect you to know at least two algorithms: Linear Regression and Logistic Regression. Due to their ease of interpretation, consultancy firms use these algorithms extensively.

<img src="images/funny.png" width="600">
</div>




### Readings:

- https://www.khanacademy.org for basic math and stats.
1. ***Chapter 4.4*** [The Elements of Statistical Learning: Data Mining, Inference, and Prediction](https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print12.pdf). 
2. For a quick review see: ***Chapter 9.3***, Understanding Machine Learning From Theory to Algorithms.

### Overview

Many a time, situations arise where the dependent variable isn't normally distributed; i.e., the assumption of normality is violated. For example, think of a problem when the dependent variable is binary. Will you still use Multiple Regression? Of course not!

Often we have to resolve questions with binary or yes/no outcomes.

For example:

* _Does a patient have cancer?_

* _Will a team win the next game?_

* _Will the customer buy my product?_

* _Will I get the loan?_

* Can we get a loan, from the Lending Club, of 10,000 dollars at 12 per cent or less, with a FICO Score of 720?

> Logistic regression is used to predict the outcome of a categorical variable. A categorical variable is a variable that can take only specific and limited values.


In high dimensions, it is often convenient to think binary.



#### Types of Logistic Regression

* Binary Logistic Regression: The target variable has only two possible outcomes such as Spam or Not Spam, Cancer or No Cancer.
* Multinomial Logistic Regression: The target variable has three or more nominal categories such as predicting the type of Wine.
* Ordinal Logistic Regression: The target variable has three or more ordinal categories such as restaurant or product rating from 1 to 5.



#### How does Logistic Regression work?

Linear regression is well suited for estimating continuous values, but it isn’t the best tool for predicting categorical variables. Logistic regression extends the linear model to a categorical target: it is a generalized linear model that predicts the probability of occurrence of a binary event by passing a linear combination of the predictors through the logistic function (the inverse of the logit function).

>
>Linear Regression Equation:
>
>$$ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p $$

In Logistic Regression, we use the same equation but with some modifications made to $y$. Let's reiterate a fact about Logistic Regression: we calculate probabilities, and probabilities always lie between 0 and 1. In other words, we can say:

* The response value must be positive.
* It should be lower than 1.

We know the exponential of any value is always a positive number. And any positive number divided by that number plus 1 will always be lower than 1. Let's implement these two findings:

>Logistic Function:
>$$ p = \frac{e^{y}}{1+e^{y}} $$

Sigmoid "S" shape functions have the same characteristics.

$$ p = \frac{1}{1+e^{-y}} $$

>Apply Sigmoid function on linear regression:
>$$ p = \frac{e^{(\beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p)}}{1+e^{(\beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p)}}$$
>

Now we are convinced that the probability value will always lie between 0 and 1.
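
As a quick numerical check of the two forms above, the following minimal sketch evaluates both and confirms that they agree and stay strictly between 0 and 1. The function names `logistic` and `sigmoid` are chosen here for illustration only.

```python
from math import exp

def logistic(y):
    """The form e^y / (1 + e^y)."""
    return exp(y) / (1 + exp(y))

def sigmoid(y):
    """The equivalent form 1 / (1 + e^(-y))."""
    return 1 / (1 + exp(-y))

for y in (-5.0, -1.0, 0.0, 1.0, 5.0):
    p = logistic(y)
    # Both forms give the same value, and it is always a valid probability
    assert abs(p - sigmoid(y)) < 1e-12
    assert 0 < p < 1
    print(f"y = {y:+.1f}  ->  p = {p:.4f}")
```

Note that $y = 0$ maps to $p = 0.5$, large negative $y$ pushes $p$ towards 0, and large positive $y$ pushes $p$ towards 1.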


#### A familiar example

We are going to start by plotting something we understand in the real world, although we may never actually have plotted it before.

Let's say the x-axis shows the number of points scored by an NFL team in a game, and the outcome on the y-axis is whether they lost or won that game, indicated by a value of 0 or 1 respectively. 

Then a plot for these scores might look like this:

<img src="files/images/b1fig1_nfloutcomes.png" />

So, how do we predict whether we have a win or a loss if we are given a score? 
Clearly linear regression is not a good model. 
If we plot a normal linear regression over our data points, it looks like this:

<img src="files/images/b1fig2_nfloutcomes_withline.png" />
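
To see numerically why a straight line is a poor fit, here is a small sketch that fits ordinary least squares to made-up scores and outcomes. The data below are hypothetical and are not the values behind the plots; `linear_prediction` is an illustrative helper name.

```python
# Hypothetical (score, outcome) pairs: 0 = loss, 1 = win
scores   = [10, 14, 17, 20, 21, 24, 27, 28, 31, 35, 38, 42]
outcomes = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1]

n = len(scores)
mean_x = sum(scores) / n
mean_y = sum(outcomes) / n

# Ordinary least squares for a single predictor
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, outcomes))
         / sum((x - mean_x) ** 2 for x in scores))
intercept = mean_y - slope * mean_x

def linear_prediction(score):
    """Fitted value of the straight line at a given score."""
    return intercept + slope * score

for s in (5, 25, 45):
    print(f"score = {s:2d}  ->  fitted value = {linear_prediction(s):+.3f}")
```

With these made-up numbers the line drops below 0 for low scores and rises above 1 for high scores, so its fitted values cannot be read as probabilities.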

### How do we model this sort of data best?

We need a better way to model our data. We build a linear model for binary response data.

We will just pull a function out of the data science bag of tricks and show that it works reasonably well. We are going to understand how we came up with that function and how it is related to binary outcomes and odds.

This function will need to have a value of 0 for the loss scores and 1 for the win scores.
To make sense it will need to be 0 for some score and all scores below it and be 1 for some other score and all scores above it. And it will need to smoothly increase from 0 to 1 in the intermediate range.

Logistic Regression assumes a linear relationship between the independent variables and the link function (logit).

It will need to look something like this:
<img src="files/images/ey x.png" width="450"/>
<img src="files/images/standardSigmoidFunction.png" />

### Now for a spot of Math

A function that has the above shape is:

$$P(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x)}}$$

where $P(x)$ is the probability of a score of $x$ leading to a win. 
$\beta_0, \beta_1$ are parameters that we will estimate, so the curve fits our data.


Notice that we have a familiar looking linear function, 
$$\beta_0 + \beta_1x$$ 
but it's plugged into a formula that generates the shape we want. 

From the shape we can see that if Score was less than 20 then $P(x)$ would predict a loss, if Score was greater than 30, $P(x)$ would predict a win. But in the middle things would be somewhat fuzzy - we would have even odds when the score was around 25.

So this sort of function is what we use to model binary outcomes.
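
The behaviour described above can be sketched in a few lines, using the sigmoid form $1/(1+e^{-z})$ from earlier in these notes with $z = \beta_0 + \beta_1 x$. The parameter values below are hypothetical, picked only so that the even-odds point falls at a score of 25; they are not estimated from any data.

```python
from math import exp

# Hypothetical parameters, chosen so that beta0 + beta1 * 25 = 0, i.e. P(25) = 0.5
beta0, beta1 = -12.5, 0.5

def p_win(score):
    """Probability of a win for a given score under the assumed parameters."""
    z = beta0 + beta1 * score
    return 1 / (1 + exp(-z))

for score in (15, 20, 25, 30, 35):
    print(f"score = {score}  ->  P(win) = {p_win(score):.3f}")
```

Under these assumed values the probability is close to 0 below a score of 20, close to 1 above 30, and exactly 0.5 at 25.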

### Threshold functions for logistic regression


#### Odds, mathematically speaking. 

We are going to take the notion of odds, put a simple mathematical framework around it and then use our previous knowledge of linear regression to create a model that predicts binary outcomes. 

Basically all we need to know is that Probability is a number between 0 and 1 and indicates the likelihood of an event occurring. We remind ourselves that: 

probability = 0 is as good as the event being impossible and 

probability = 1 is as good as it being certain. 

We should also remind ourselves that if 
the probability of an event happening is $p$ 
then the probability of it not happening is $1 - p$. 

That's all the probability we need.

Having said that let's start talking about odds.


#### Odds and Odds Ratio

When bettors say the odds of winning are 1:4 what is this in terms of probability? 

It means 1 part chance of winning to 4 parts chance of losing. Note that the total number of parts is 5 and the probability of winning is 1 out of 5. So p is 1/5 = 0.2 and 1-p is 0.8. Here p is small and 1-p is large.

The odds might be 1:1 which means p = 1/2 and 1-p = 1/2, i.e. equal chances of an event happening or not = "even odds".

The odds might be 3:2 which means p = 0.6 and 1-p = 0.4. Here p is greater and 1-p is smaller.

So depending on the ratio of p to 1-p we have more or less confidence in a bet winning.

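This suggests we might want to look at the odds: 

$$Odds = \frac {p}{1-p}$$

This quantity is often loosely called the odds ratio (OR), although strictly speaking an odds ratio compares the odds of two different events or groups.

If the odds are high, say: 

$$\frac {p}{1-p} > 4$$ 

then the event might be considered very likely (this corresponds to $p > 0.8$), and if: 

$$\frac {p}{1-p} < 0.25$$ 

then very unlikely ($p < 0.2$). 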
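
The conversions between betting odds and probabilities used in this section can be written down directly. This is a minimal sketch; `odds_to_prob` and `prob_to_odds` are illustrative names.

```python
def odds_to_prob(parts_for, parts_against):
    """Convert odds quoted as parts_for:parts_against into a probability."""
    return parts_for / (parts_for + parts_against)

def prob_to_odds(p):
    """Convert a probability p into the ratio p / (1 - p)."""
    return p / (1 - p)

for parts_for, parts_against in ((1, 4), (1, 1), (3, 2)):
    p = odds_to_prob(parts_for, parts_against)
    print(f"{parts_for}:{parts_against}  ->  p = {p:.2f}, p/(1-p) = {prob_to_odds(p):.2f}")
```

The three cases reproduce the values in the text: 1:4 gives p = 0.2, 1:1 gives p = 0.5, and 3:2 gives p = 0.6.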


### The Logit Function

Mathematicians like to work with a function derived from this called the Logit function, or the LogOdds function. It is the (natural) log of the odds:

$$logit(p) = \log\left(\frac{p}{1-p}\right)$$ 


<img src="files/images/logodds.png" width="400"/>

The logit is the link function for Logistic Regression: we model the log-odds as a linear combination of the independent variables. With a single predictor we would say: 

$$logit(p) = \log\left( \frac{p}{1-p} \right) = \beta_0 + \beta_1X$$ 

where $X$ is the "value" of the event. The left side is the log-odds (the logit) and the right side is the linear combination of independent variables.

We can interpret the above equation as follows: a unit increase in $X$ multiplies the odds by $e^{\beta_1}$. In other words, the regression coefficients give the change in log(odds) of the response for a unit change in the predictor. However, since the relationship between $p(X)$ and $X$ is not a straight line, a unit change in the input does not change the probability by a fixed amount; the size of the change in $p$ depends on the current value of $X$.

So here instead of $Y = \beta_0 + \beta_1X$ we want to plot $logit(p)$ on the $Y$ axis and the event or the score on the $X$ axis. 

So this is how the linear model slips in - we want to express log odds as a linear function of score. 
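
The link between the S-shaped probability curve and the straight line in log-odds can be checked numerically. The sketch below assumes hypothetical parameters ($\beta_0 = -12.5$, $\beta_1 = 0.5$, not fitted to any data) and verifies two things: the logit of the predicted probability is linear in the score, and raising the score by one unit multiplies the odds by $e^{\beta_1}$.

```python
from math import exp, log

# Hypothetical parameters for illustration only
beta0, beta1 = -12.5, 0.5

def p_win(x):
    """Logistic probability for score x."""
    return 1 / (1 + exp(-(beta0 + beta1 * x)))

def odds(p):
    return p / (1 - p)

def logit(p):
    return log(odds(p))

# 1) logit(P(x)) recovers the linear expression beta0 + beta1 * x
for x in (20, 25, 30):
    print(f"x = {x}: logit(P(x)) = {logit(p_win(x)):+.4f}, beta0 + beta1*x = {beta0 + beta1 * x:+.4f}")

# 2) A one-unit increase in x multiplies the odds by exp(beta1)
ratio = odds(p_win(26)) / odds(p_win(25))
print(f"odds(26) / odds(25) = {ratio:.4f}, exp(beta1) = {exp(beta1):.4f}")
```

The probability itself changes by different amounts at different scores, but the multiplicative change in the odds is the same everywhere.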

Patience now, we are just one step away. 


#### How to estimate coefficients?

In Linear Regression, we use the Ordinary Least Squares (OLS) method, which minimizes the Sum of Squared Errors, to estimate the best coefficients and attain a good model fit. In Logistic Regression, we use the maximum likelihood method to determine the best coefficients and eventually a good model fit.

Maximum likelihood works like this: it tries to find the values of the coefficients $(\beta_0,\beta_1)$ under which the observed outcomes are most probable. In other words, for a binary classification (1/0), maximum likelihood will try to find values of $(\beta_0,\beta_1)$ such that the predicted probabilities are as close as possible to 1 for the observations labeled 1 and to 0 for the observations labeled 0. Maximizing the likelihood is equivalent to minimizing the negative log-likelihood, known as the log loss, which is written as


$$Log Loss = \sum_{(x,y)\in D} -y \cdot \log(y_{pred}) - (1 - y) \cdot \log(1 - y_{pred})$$

where $D$ is the data set, $y$ is the observed label (0 or 1) and $y_{pred}$ is the predicted probability that $y = 1$.
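
To make the log loss concrete, the following sketch evaluates the sum above for two sets of predicted probabilities. The labels and predictions are made up for illustration, and `log_loss` is a hand-written helper, not a library function.

```python
from math import log

def log_loss(y_true, y_pred):
    """Sum of -y*log(p) - (1-y)*log(1-p) over all observations."""
    return sum(-y * log(p) - (1 - y) * log(1 - p)
               for y, p in zip(y_true, y_pred))

# Hypothetical labels and two hypothetical sets of predicted probabilities
y_true     = [1, 0, 1, 1, 0]
confident  = [0.9, 0.1, 0.8, 0.7, 0.2]   # close to the observed labels
indecisive = [0.6, 0.5, 0.5, 0.4, 0.6]   # hovering around 0.5

print("log loss, confident predictions: ", round(log_loss(y_true, confident), 4))
print("log loss, indecisive predictions:", round(log_loss(y_true, indecisive), 4))
```

Predictions that sit closer to the observed labels give a smaller log loss, which is the quantity the fitting procedure drives down when it searches for the coefficients.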



#### How can you evaluate Logistic Regression model fit and accuracy ?

In Linear Regression, we check adjusted R², the F-statistic, MAE, and RMSE to evaluate model fit and accuracy. Logistic Regression, however, employs a different set of metrics, because here we deal with probabilities and categorical values. The following evaluation metrics are used for Logistic Regression:

1. Akaike Information Criterion (AIC)
2. Null Deviance and Residual Deviance
3. Confusion Matrix
4. Receiver Operating Characteristic (ROC) curve
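
Of the metrics listed, the confusion matrix is the easiest to compute by hand. The sketch below builds one from made-up labels and predicted probabilities at an assumed cut-off of 0.5; the helper name `confusion_matrix` and the data are illustrative only.

```python
def confusion_matrix(y_true, p_pred, threshold=0.5):
    """Return (tn, fp, fn, tp) after turning probabilities into 0/1 predictions."""
    tn = fp = fn = tp = 0
    for y, p in zip(y_true, p_pred):
        predicted = 1 if p >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1 and predicted == 0:
            fn += 1
        elif y == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Hypothetical observed labels and predicted probabilities
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
p_pred = [0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3]

tn, fp, fn, tp = confusion_matrix(y_true, p_pred)
accuracy = (tp + tn) / len(y_true)
print(f"TN={tn} FP={fp} FN={fn} TP={tp} accuracy={accuracy:.2f}")
```

Changing `threshold` trades false positives against false negatives; sweeping it from 0 to 1 and recording the resulting rates is what traces out the ROC curve.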


### Showcase

We're going to look at the data set from Lending Club, the first US peer-to-peer lending company. Let's assume we have a FICO Score (credit scores) of 720 and we want to borrow 10,000 dollars.
We would like to get an Interest Rate of 12 per cent or less.

The question we pose is: 

> Can we get a loan, from the Lending Club, of 10,000 dollars at 12 per cent or less, with a FICO Score of 720?

How do we use Logistic Regression here? Let's recast the problem as follows:

> What is the probability of getting a Loan, from the Lending Club, of 10,000 dollars at 12 per cent or less with a FICO Score of 720? 

Then let us decide that if we get a probability of less than 0.67 we say it means we won't get the loan and if it is greater than 0.67 we will. I.e. we are not confident until we have a 2/3 chance of getting it.

In reality we can set the threshold higher, say 0.8, if we want to be "more certain" that it will happen, but for this exercise we'll just say 0.67.


We start with a model of the form

$Interest Rate = \beta_0 + \beta_1*FICOScore + \beta_2*LoanAmount$

And then derive a second equation of the form:

$Z = Prob (InterestRate \le 12\%)$.

We apply this to the loan dataset and create a Logistic Regression Model.

In [1]:
import pandas as pd
dfr = pd.read_csv('data/loanf.csv')
dfr.head()

Unnamed: 0,Interest.Rate,FICO.Score,Loan.Length,Monthly.Income,Loan.Amount
6,15.31,670,36,4891.67,6000
11,19.72,670,36,3575.0,2000
12,14.27,665,36,4250.0,10625
13,21.67,670,60,14166.67,28000
21,21.98,665,36,6666.67,22000


In [3]:
# we add a column which indicates (True/False) whether the interest rate is <= 12 
dfr['TF']=dfr['Interest.Rate']<=12
# inspect again
dfr.head(80)
# we see that the TF values are False as Interest.Rate is higher than 12 in all these cases

Unnamed: 0,Interest.Rate,FICO.Score,Loan.Length,Monthly.Income,Loan.Amount,intercept,TF
6,15.31,670,36,4891.67,6000,1.0,False
11,19.72,670,36,3575.00,2000,1.0,False
12,14.27,665,36,4250.00,10625,1.0,False
13,21.67,670,60,14166.67,28000,1.0,False
21,21.98,665,36,6666.67,22000,1.0,False
...,...,...,...,...,...,...,...
394,18.75,670,36,4500.00,21250,1.0,False
396,14.33,670,36,3333.33,12000,1.0,False
407,22.47,670,60,5083.33,22000,1.0,False
413,15.80,665,36,6612.00,5000,1.0,False


In [6]:
# now we check the rows that have interest rate == 10 (just some number < 12)
# this is just to confirm that the TF value is True where we expect it to be
d = dfr[dfr['Interest.Rate']==10]
d.head()
# all is well

Unnamed: 0,Interest.Rate,FICO.Score,Loan.Length,Monthly.Income,Loan.Amount,TF
650,10.0,700,36,3250.0,2800,True
204,10.0,715,36,15416.67,6000,True
440,10.0,730,36,6250.0,21000,True
521,10.0,715,36,5000.0,12000,True
1017,10.0,735,60,4000.0,5000,True


Now we use our Logistic Regression modeler to create a Logit model using this data, with the 'TF' column as the dependent (or response) variable and 'FICO.Score' and 'Loan.Amount' as independent (or predictor) variables.


In [4]:
import statsmodels.api as sm
# statsmodels requires us to add a constant column representing the intercept
dfr['intercept']=1.0
# identify the independent variables 
ind_cols=['FICO.Score','Loan.Amount','intercept']
logit = sm.Logit(dfr['TF'], dfr[ind_cols])
result=logit.fit()

Optimization terminated successfully.
 Current function value: 0.319503
 Iterations 8

We should see some soothing messages from our software re-assuring us that all went well 
and giving us some numbers we may not find useful right now. 
More importantly we want the results.
What are the fitted coefficients that the software has computed?

In [5]:
# get the fitted coefficients from the results
coeff = result.params
print(coeff)

FICO.Score 0.087423
Loan.Amount -0.000174
intercept -60.125045
dtype: float64


The numbers above are the coefficients for the respective independent, i.e. predictor, variables in the linear expression.

So, using the above coefficients, the linear part of our predictor is 

$$z = -60.125 + 0.087423*FicoScore -0.000174*LoanAmount$$

Finally, the probability of our desired outcome, i.e. our getting a loan at 12% interest or less, is

$$p(z) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1*FicoScore + \beta_2*LoanAmount)}}$$ 

where $\beta_0 = -60.125, \beta_1 = 0.087423$ and $\beta_2 = -0.000174$

We create a function in code that encapsulates all this.

It takes as input a borrower's FICO score, the desired loan amount and the coefficient vector from our model. It returns the probability of getting the loan, a number between 0 and 1.

In [13]:
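from math import exp
def pz(fico, amt, coeff):
    # compute the linear expression by multiplying the inputs by their respective coefficients.
    # coeff is the pandas Series from result.params, so we look up each coefficient by its label
    # (positional access such as coeff[0] on a labeled Series is deprecated in recent pandas)
    z = coeff['FICO.Score']*fico + coeff['Loan.Amount']*amt + coeff['intercept']
    # apply the logistic function to turn z into a probability
    return 1/(1+exp(-z))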

Now we use our data FICO=720 and Loan Amount=10,000 to get a probability using the z value
and the logistic formula. 

In [14]:
pz(720,10000,coeff)

0.7463785889515121

This value of 0.746 tells us we have a good chance of getting the loan we want, according to our criterion, where anything above 0.67 was considered a 'yes'.
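
The decision rule can also be turned around: instead of asking for the probability at a given FICO score, we can ask which FICO score is needed to reach the 0.67 threshold. The sketch below is self-contained and assumes the coefficients as printed (rounded) in the model output above; `prob` and `min_fico` are illustrative helper names, not part of the original notebook.

```python
from math import exp, log

# Coefficients copied from the printed model output (rounded to 6 decimals)
B_FICO, B_AMT, B0 = 0.087423, -0.000174, -60.125045
THRESHOLD = 0.67

def prob(fico, amt):
    """Probability of an interest rate of 12% or less."""
    z = B0 + B_FICO * fico + B_AMT * amt
    return 1 / (1 + exp(-z))

def min_fico(amt, threshold=THRESHOLD):
    """Solve logit(threshold) = B0 + B_FICO*fico + B_AMT*amt for fico."""
    return (log(threshold / (1 - threshold)) - B0 - B_AMT * amt) / B_FICO

p = prob(720, 10000)
print(f"p = {p:.3f}, above threshold: {p >= THRESHOLD}")
print(f"FICO score needed for a 10,000 dollar loan: {min_fico(10000):.1f}")
```

Under this fitted model a borrower asking for 10,000 dollars crosses the 0.67 threshold at a FICO score of roughly 716, which is consistent with a score of 720 giving a "yes".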

Now we are going to try (fico, amt) pairs as follows:

* 720,20000
* 720,30000
* 820,10000
* 820,20000 
* 820,30000 

In [15]:
print("Trying multiple FICO Loan Amount combinations: ")
print('----')
print("fico=720, amt=10,000")
print(pz(720,10000,coeff))
print("fico=720, amt=20,000")
print(pz(720,20000,coeff))
print("fico=720, amt=30,000")
print(pz(720,30000,coeff))
print("fico=820, amt=10,000")
print(pz(820,10000,coeff))
print("fico=820, amt=20,000")
print(pz(820,20000,coeff))
print("fico=820, amt=30,000")
print(pz(820,30000,coeff))


Trying multiple FICO Loan Amount combinations: 
----
fico=720, amt=10,000
0.7463785889515121
fico=720, amt=20,000
0.3405398576881749
fico=720, amt=30,000
0.08308359523703378
fico=820, amt=10,000
0.9999457423271543
fico=820, amt=20,000
0.9996908677522978
fico=820, amt=30,000
0.9982408301380601


We see, as somewhat expected, that the person with a 720 FICO Score has a decreasing probability of getting loans as the amount increases.
However, the person with the 820 FICO Score is very likely to get loans with those amounts, again as expected.
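
The drop in probability for the 720 borrower follows the odds interpretation of the coefficients: each additional 10,000 dollars should multiply the odds by $e^{10000 \cdot \beta_2}$. The sketch below checks this against the probabilities printed above, assuming the rounded Loan.Amount coefficient of -0.000174 from the model output, so the match is only approximate.

```python
from math import exp

B_AMT = -0.000174  # Loan.Amount coefficient as printed (rounded)

# Probabilities printed above for fico=720 and amounts of 10,000, 20,000 and 30,000
p10, p20, p30 = 0.7463785889515121, 0.3405398576881749, 0.08308359523703378

def odds(p):
    return p / (1 - p)

ratio_10_to_20 = odds(p20) / odds(p10)
ratio_20_to_30 = odds(p30) / odds(p20)
expected = exp(B_AMT * 10000)

print(f"odds ratio 10k -> 20k: {ratio_10_to_20:.4f}")
print(f"odds ratio 20k -> 30k: {ratio_20_to_30:.4f}")
print(f"exp(10000 * beta_2):   {expected:.4f}")
```

Both steps shrink the odds by the same factor, even though the probability itself falls by different amounts each time.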