
4.1 Least Squares Method

4.1.1 History

The least squares method is a procedure for finding the function that best fits a set of data by minimizing the sum of squared errors. It gives a simple way to estimate unknown parameters so that the sum of squared errors between the fitted values and the actual data is as small as possible. The least squares method can also be used for curve fitting, and many other optimization problems can be expressed in least squares form as well.

In 1801, the Italian astronomer Giuseppe Piazzi discovered the first asteroid, Ceres. After 40 days of tracking, Piazzi lost sight of Ceres as it moved behind the sun. Scientists all over the world then tried to locate Ceres again using Piazzi's observational data, but most of them had no success. Gauss, then 24 years old, also calculated the orbit of Ceres, and the German astronomer Heinrich Olbers rediscovered Ceres based on the orbit Gauss had calculated.

The method of least squares used by Gauss was published in 1809 in his book "Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium". The French scientist Legendre had independently published the least squares method in 1806, but his work was not widely known at the time, and Legendre and Gauss later disputed who had established the method first.

In 1829, Gauss proved that, in a precise sense, the least squares estimator is optimal among a broad class of estimators; this result is now known as the Gauss-Markov theorem.

4.1.2 Mathematical Principles

Linear regression tries to learn:

$$z_i=w \cdot x_i+b \tag{1}$$

such that:

$$z_i \simeq y_i \tag{2}$$

where $x_i$ is the feature of sample $i$, $y_i$ is its label value, and $z_i$ is the model's prediction.

How do we learn $w$ and $b$? The mean squared error (MSE) is a commonly used loss function in regression tasks: $$ J = \frac{1}{2m}\sum_{i=1}^m(z_i-y_i)^2 = \frac{1}{2m}\sum_{i=1}^m(y_i-wx_i-b)^2 \tag{3} $$

$J$ is the loss function. Minimizing it amounts to finding the straight line that minimizes the sum of squared residuals from all samples to the line.
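To make formula 3 concrete, here is a minimal sketch of the loss $J$ in Python. This helper is not part of the chapter's code; it assumes X and Y are NumPy arrays holding the sample features and labels:

import numpy as np

# formula 3: J = 1/(2m) * sum_i (y_i - w*x_i - b)^2
def mse_loss(X, Y, w, b):
    m = len(X)
    residuals = Y - w * X - b          # y_i - w*x_i - b for every sample
    return np.sum(residuals ** 2) / (2 * m)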

Figure 4-3 Evaluation principle of the mean squared error function

In Figure 4-3, the circles are the samples and the straight line is the current fitting result. The figure on the left uses the perpendicular distance from each sample point to the line: we would have to locate the foot of the perpendicular from the slope of the line before computing the distance, which is slow. In engineering we usually use the method on the right instead: the vertical distance between the sample point and the line (the difference in the $y$ direction), which only requires a subtraction and is therefore very convenient.

Suppose a preliminary fitting result is the dashed line. Is this line appropriate? We can compute the distance from each point in the figure to the line and add up these values (all of them non-negative, so they cannot cancel each other) to obtain an overall error.

Because the points in the figure do not lie on one straight line, no single line can pass through all of them at the same time. Therefore, we can only keep changing the angle and position of the line so that the overall error becomes as small as possible (it can never reach exactly $0$), which means the overall deviation is minimal. The resulting straight line is then the answer we want.
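Before deriving the closed-form solution, the idea of "keep changing the angle and position of the line" can be illustrated with a brute-force sketch. The data and the search grid below are made up purely for illustration and are not part of the chapter's code:

import numpy as np

# toy data, roughly following y = 2x + 3 (illustrative only)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([5.1, 7.0, 9.2, 10.9, 13.1])

def mse_loss(X, Y, w, b):
    return np.sum((Y - w * X - b) ** 2) / (2 * len(X))

# try many candidate lines and keep the one with the smallest loss
best_w, best_b, best_loss = 0.0, 0.0, np.inf
for w in np.linspace(0.0, 4.0, 81):
    for b in np.linspace(0.0, 6.0, 121):
        loss = mse_loss(X, Y, w, b)
        if loss < best_loss:
            best_w, best_b, best_loss = w, b, loss

print("best w=%.2f, b=%.2f, loss=%.4f" % (best_w, best_b, best_loss))

A grid search like this is crude and slow; the derivation below gives the answer directly.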

To minimize the error, we take the partial derivatives of $J$ with respect to $w$ and $b$ and set them to $0$ (the condition for an extremum), which yields the optimal $w$ and $b$.

The derivation process is as follows:

$$ \begin{aligned} \frac{\partial{J}}{\partial{w}} &=\frac{\partial{(\frac{1}{2m}\sum_{i=1}^m(y_i-wx_i-b)^2)}}{\partial{w}} \\\ &= \frac{1}{m}\sum_{i=1}^m(y_i-wx_i-b)(-x_i) \end{aligned} \tag{4} $$

Let formula 4 be $0$:

$$ \sum_{i=1}^m(y_i-wx_i-b)x_i=0 \tag{5} $$

$$ \begin{aligned} \frac{\partial{J}}{\partial{b}} &=\frac{\partial{(\frac{1}{2m}\sum_{i=1}^m(y_i-wx_i-b)^2)}}{\partial{b}} \\\ &=\frac{1}{m}\sum_{i=1}^m(y_i-wx_i-b)(-1) \end{aligned} \tag{6} $$

Let formula 6 be $0$:

$$ \sum_{i=1}^m(y_i-wx_i-b)=0 \tag{7} $$

From formula 7, we can get (Suppose there are $m$ samples):

$$ \sum_{i=1}^m b = m \cdot b = \sum_{i=1}^m{y_i} - w\sum_{i=1}^m{x_i} \tag{8} $$

Divide both sides by $m$:

$$ b = \frac{1}{m}\left(\sum_{i=1}^m{y_i} - w\sum_{i=1}^m{x_i}\right)=\bar y-w \bar x \tag{9} $$

where:

$$ \bar y = \frac{1}{m}\sum_{i=1}^m y_i, \bar x=\frac{1}{m}\sum_{i=1}^m x_i \tag{10} $$

Substitute formula 10 into formula 5:

$$ \sum_{i=1}^m(y_i-wx_i-\bar y + w \bar x)x_i=0 $$

$$ \sum_{i=1}^m(x_i y_i-wx^2_i-x_i \bar y + w \bar x x_i)=0 $$

$$ \sum_{i=1}^m(x_iy_i-x_i \bar y)-w\sum_{i=1}^m(x^2_i - \bar x x_i) = 0 $$

$$ w = \frac{\sum_{i=1}^m(x_iy_i-x_i \bar y)}{\sum_{i=1}^m(x^2_i - \bar x x_i)} \tag{11} $$

Substitute formula 10 into formula 11:

$$ w = \frac{\sum_{i=1}^m (x_i \cdot y_i) - \sum_{i=1}^m x_i \cdot \frac{1}{m} \sum_{i=1}^m y_i}{\sum_{i=1}^m x^2_i - \sum_{i=1}^m x_i \cdot \frac{1}{m}\sum_{i=1}^m x_i} \tag{12} $$

Multiply both the numerator and denominator by $m$:

$$ w = \frac{m\sum_{i=1}^m x_i y_i - \sum_{i=1}^m x_i \sum_{i=1}^m y_i}{m\sum_{i=1}^m x^2_i - (\sum_{i=1}^m x_i)^2} \tag{13} $$

$$ b= \frac{1}{m} \sum_{i=1}^m(y_i-wx_i) \tag{14} $$
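As a sanity check on the derivation, the partial derivatives in formulas 4 and 6 can be verified symbolically for a tiny case. The sketch below uses SymPy with $m=2$ samples; it is only a verification aid and is not part of the chapter's code:

import sympy as sp

w, b = sp.symbols('w b')
x1, x2, y1, y2 = sp.symbols('x1 x2 y1 y2')
m = 2

# formula 3 written out for m = 2 samples
J = sp.Rational(1, 2 * m) * ((y1 - w * x1 - b) ** 2 + (y2 - w * x2 - b) ** 2)

# formulas 4 and 6 written out for m = 2 samples
expected_dw = sp.Rational(1, m) * ((y1 - w * x1 - b) * (-x1) + (y2 - w * x2 - b) * (-x2))
expected_db = sp.Rational(1, m) * (-(y1 - w * x1 - b) - (y2 - w * x2 - b))

# both differences simplify to 0, confirming formulas 4 and 6
print(sp.simplify(sp.diff(J, w) - expected_dw))
print(sp.simplify(sp.diff(J, b) - expected_db))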

In fact, formula 13 has many variants, and you will see different versions in different articles, which can be confusing. For example, the following two formulas are also correct solutions:

$$ w = \frac{\sum_{i=1}^m y_i(x_i-\bar x)}{\sum_{i=1}^m x^2_i - (\sum_{i=1}^m x_i)^2/m} \tag{15} $$

$$ w = \frac{\sum_{i=1}^m x_i(y_i-\bar y)}{\sum_{i=1}^m x^2_i - \bar x \sum_{i=1}^m x_i} \tag{16} $$

If you substitute formula 10 into either of the two formulas above, you should obtain the same answer as formula 13, although some algebraic manipulation is required. For example, the following identity is often overlooked:

$$ \begin{aligned} \sum_{i=1}^m (x_i \bar y) &= \bar y \sum_{i=1}^m x_i =\frac{1}{m}(\sum_{i=1}^m y_i) (\sum_{i=1}^m x_i) \\\ &=\frac{1}{m}(\sum_{i=1}^m x_i) (\sum_{i=1}^m y_i)= \bar x \sum_{i=1}^m y_i \\\ &=\sum_{i=1}^m (y_i \bar x) \end{aligned} \tag{17} $$
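The identity in formula 17 is easy to confirm numerically. The arrays below are made up for illustration; the point is simply that $\sum_{i=1}^m x_i \bar y$ and $\bar x \sum_{i=1}^m y_i$ coincide:

import numpy as np

# arbitrary sample arrays, for illustration only
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([2.5, 4.1, 5.8, 8.2])

left = np.sum(X * Y.mean())    # sum_i (x_i * y_bar)
right = X.mean() * np.sum(Y)   # x_bar * sum_i y_i
print(left, right)             # both values are the same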

4.1.3 Code

Let's use Python to implement the above calculation process:

Calculate the value of $w$

# In the functions below, X and Y are NumPy arrays of sample features and
# labels, and m is the number of samples.

# formula 15
def method1(X,Y,m):
    x_mean = X.mean()
    p = sum(Y*(X-x_mean))
    q = sum(X*X) - sum(X)*sum(X)/m
    w = p/q
    return w

# formula 16
def method2(X,Y,m):
    x_mean = X.mean()
    y_mean = Y.mean()
    p = sum(X*(Y-y_mean))
    q = sum(X*X) - x_mean*sum(X)
    w = p/q
    return w

# formula 13
def method3(X,Y,m):
    p = m*sum(X*Y) - sum(X)*sum(Y)
    q = m*sum(X*X) - sum(X)*sum(X)
    w = p/q
    return w

With the help of the function library (here, NumPy arrays provide sum() and mean()), we do not need to compute these basic quantities by hand.

Calculate the value of $b$

# formula 14
def calculate_b_1(X,Y,w,m):
    b = sum(Y-w*X)/m
    return b

# formula 9
def calculate_b_2(X,Y,w):
    b = Y.mean() - w * X.mean()
    return b
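A possible driver for the functions above is sketched here. The synthetic data is made up for illustration; the chapter's own sample file (see "Code location" below) is not reproduced, so the numbers printed will differ from those in section 4.1.4:

import numpy as np

# synthetic data roughly following y = 2x + 3, with a little noise (illustrative only)
m = 100
X = np.random.rand(m) * 10
Y = 2.0 * X + 3.0 + np.random.randn(m) * 0.5

# method1/2/3 and calculate_b_1/2 are the functions defined above
w1 = method1(X, Y, m)
w2 = method2(X, Y, m)
w3 = method3(X, Y, m)
b1 = calculate_b_1(X, Y, w1, m)
b2 = calculate_b_2(X, Y, w2)
b3 = calculate_b_1(X, Y, w3, m)

print("w1=%f, b1=%f" % (w1, b1))
print("w2=%f, b2=%f" % (w2, b2))
print("w3=%f, b3=%f" % (w3, b3))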

4.1.4 Calculation result

Using the above methods on the chapter's data, the final results are consistent, which serves as a cross-check:

w1=2.056827, b1=2.965434
w2=2.056827, b2=2.965434
w3=2.056827, b3=2.965434

Code location

ch04, Level1