Linear Regression is a very simple approach to Supervised Learning and a useful tool for predicting quantitative responses. It also serves as a good jumping-off point for newer approaches: as we will see in later chapters, many fancier statistical learning methods, such as Generalized Linear Models, can be seen as generalizations or extensions of linear regression.
Suppose that a random sample of size $n$ is observed: $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$.

The random variables $X_1, X_2, \dots, X_p$ are the predictors (the features), and $Y$ is the quantitative response.

The Linear Regression Model represents a relation between the response variable $Y$ and the predictors $X_1, \dots, X_p$:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$$

where $\beta_0$ is the intercept, $\beta_j$ quantifies the association between the $j$-th predictor and the response, and $\varepsilon$ is a random error term with zero mean.

The parameters of the model, $\beta_0, \beta_1, \dots, \beta_p$, are unknown and must be estimated from the data, typically by the method of least squares.
Note that the relationship between the predictors and the response is not necessarily linear, as polynomial or interaction terms may be included, but it is necessarily linear in the beta coefficients. That is, the relationship is modeled as a linear combination of the parameters.
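To make this concrete, here is a minimal sketch (using NumPy and scikit-learn on made-up data) of a model with a polynomial term: it is nonlinear in the predictor $x$, yet still linear in the beta coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Made-up data: y is quadratic in x, plus noise
x = rng.uniform(-3, 3, size=100)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.3, size=100)

# The design matrix treats x and x^2 as two separate predictors,
# so the model is still a linear combination of the coefficients.
X = np.column_stack([x, x**2])

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # roughly 1.0 and [2.0, -0.5]
```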
Note that, in the general linear regression model, the response variable $Y$ has a normal distribution with mean

$$E[Y \mid X_1, \dots, X_p] = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$$

and constant variance $\sigma^2$.
The analytical solution for the parameter vector of the Linear Regression Model is called the Normal Equation. It is given by:

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

where $X$ is the $n \times (p + 1)$ design matrix (whose first column is all ones, for the intercept) and $y$ is the vector of observed responses.
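Here is a direct NumPy sketch of the Normal Equation on made-up data. Solving the linear system with `np.linalg.solve` is preferred to explicitly inverting $X^T X$:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up data: 100 observations, 2 predictors, known coefficients
X = rng.normal(size=(100, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so beta[0] plays the role of the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Normal Equation: beta = (X^T X)^{-1} X^T y,
# computed by solving the system (X^T X) beta = X^T y
beta = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
print(beta)  # roughly [3.0, 1.5, -2.0]
```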
In practice, however, it is unusual to compute the parameters of the Linear Regression model via the Normal Equation, since forming and inverting $X^T X$ is expensive and can be numerically unstable; faster or more stable algorithms are preferred (such as the SVD, LU/QR decompositions, or Gradient Descent).
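For illustration, the sketch below solves the same made-up problem as above in two of these ways: with NumPy's `np.linalg.lstsq` (which uses an SVD-based LAPACK routine) and with a simple batch Gradient Descent on the least-squares objective.

```python
import numpy as np

rng = np.random.default_rng(42)

# The same made-up data as in the Normal Equation sketch
X = rng.normal(size=(100, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)
X_design = np.column_stack([np.ones(len(X)), X])

# SVD-based least squares
beta_svd, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Batch gradient descent on the mean squared error
beta_gd = np.zeros(X_design.shape[1])
learning_rate = 0.1
for _ in range(2000):
    residuals = X_design @ beta_gd - y
    gradient = X_design.T @ residuals / len(y)
    beta_gd -= learning_rate * gradient

print(beta_svd)  # roughly [3.0, 1.5, -2.0]
print(beta_gd)   # converges to the same solution
```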
In a Regression Model, categorical variables must be encoded as numeric variables. There are two main encoding methods: one-hot encoding and dummy encoding.
In one-hot encoding, a category like ['BMW', 'AUDI', 'Volvo'] would be represented using three indicator variables:

BMW = [1, 0, 0]
AUDI = [0, 1, 0]
Volvo = [0, 0, 1]
In dummy encoding, we'd use the following strategy:

AUDI = [1, 0]
Volvo = [0, 1]
Notice there's no explicit variable for BMW: it is called the "reference category" and is represented by BMW = [0, 0]. Dropping one level in this way avoids perfect collinearity between the indicator variables and the intercept (the so-called dummy variable trap).
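As a sketch of both encodings in code, using pandas on a made-up column (note that `pd.get_dummies` orders categories alphabetically, so with `drop_first=True` the reference category here becomes AUDI rather than BMW):

```python
import pandas as pd

df = pd.DataFrame({"brand": ["BMW", "AUDI", "Volvo", "BMW"]})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df["brand"], dtype=int)

# Dummy encoding: drop the first category (alphabetically, AUDI),
# which then serves as the all-zeros reference category
dummy = pd.get_dummies(df["brand"], drop_first=True, dtype=int)

print(one_hot)
print(dummy)
```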