Understanding Linear Regression

English
Statistics
Machine Learning
Author

Manuel Solano

Published

June 9, 2024


Linear regression is a fundamental yet powerful machine learning model that blends elements of statistics, linear algebra, and optimization. Understanding the concepts behind it provides a strong foundation for grasping the core principles and structures of many other machine learning models.

In this post, we’ll explore how to use the Fish Market Dataset to predict the weight of fish using a linear regression model. We’ll consider five key features: Length1, Length2, Length3, Height, and Width.

Fish Market Data
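Throughout the post we will work with these five feature columns and the fish weight as the target. As a minimal sketch, assuming the dataset has been downloaded locally as Fish.csv with the usual column names, it can be loaded like this:

```python
import pandas as pd

# Assumed local copy of the Fish Market dataset (file name is an assumption).
fish = pd.read_csv("Fish.csv")

# The five features described above and the target we want to predict.
features = ["Length1", "Length2", "Length3", "Height", "Width"]
X_features = fish[features].to_numpy()
y_true = fish["Weight"].to_numpy()

print(X_features.shape, y_true.shape)  # (m, 5) features and m true weights
```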

Let’s relabel our five features as \(x_1\), \(x_2\), \(x_3\), \(x_4\), and \(x_5\), and then write the fish weight as a function of these five features:

\[\displaystyle y = f(x_1,x_2,x_3, x_4, x_5)\]

Training Function (Model)

We assume that the weight depends linearly on the five features.

This means that the weight of a fish, \(y\), can be computed as a linear combination of its five measurements, plus a bias term \(w_0\), giving the following training function:

\[\displaystyle y = w_0 + x_1w_1 + x_2w_2 + x_3w_3 + x_4w_4 + x_5w_5\]
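In code, this training function is just a dot product plus the bias. Here is a minimal sketch; the measurements are made up for illustration, and the weights are placeholders until we actually train the model (they are roughly the values we will derive at the end of the post):

```python
import numpy as np

def predict(x, w):
    """Linear training function: w0 + x1*w1 + x2*w2 + ... + x5*w5.

    x holds the five measurements of one fish; w holds the six
    parameters, with the bias w0 in the first position.
    """
    return w[0] + np.dot(x, w[1:])

# Hypothetical measurements and placeholder weights, for illustration only.
x_example = np.array([23.2, 25.4, 30.0, 11.5, 4.0])
w_placeholder = np.array([-475.0, 82.8, -28.9, -28.5, 29.8, 31.0])
print(predict(x_example, w_placeholder))
```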

After our major decision in the modeling process to use a linear training function \(f(x_1,x_2,x_3, x_4, x_5)\), all we have to do is find the appropriate values of the parameters \(w_0\), \(w_1\), \(w_2\), \(w_3\), \(w_4\), and \(w_5\). We will learn the best values for the \(w\)’s from the data. The process of using the data to find the appropriate \(w\)’s is called training the model. A trained model is then a model whose \(w\) values have been decided on.

For linear models, each parameter gives its corresponding feature a certain weight in the prediction process. So if the magnitude of \(w_2\) is larger than the magnitude of \(w_5\), then the second feature plays a more important role than the fifth feature in our prediction, assuming that the second and the fifth features have comparable scales. This is one of the reasons it is good to scale or normalize the data before training the model.

Therefore, learning our \(w\)’s from the data allows us to mathematically compute the contribution of each feature to our predictions (or the importance of feature combinations, if some features were combined during the data preparation stage, before training). In other words, the model learns how the data features interact and how strong these interactions are.
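Since comparing weight magnitudes only makes sense when the features share comparable scales, a common preprocessing step is to standardize each feature column. A minimal sketch, with a tiny made-up matrix for illustration:

```python
import numpy as np

def standardize(X):
    """Rescale each feature column to zero mean and unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std, mean, std

# Illustrative matrix: 3 fish, 2 features on very different scales.
X_demo = np.array([[23.2, 4.02],
                   [25.4, 4.31],
                   [30.0, 4.69]])
X_scaled, mean, std = standardize(X_demo)
print(X_scaled.mean(axis=0))  # approximately 0 in each column
print(X_scaled.std(axis=0))   # approximately 1 in each column
```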

Loss Function

For linear regression, we use the mean squared error loss function. This function averages the squared errors between the predictions and the true values over the \(m\) data points.

\[ \text{Mean Squared Error (MSE)} = \displaystyle \frac{1}{m} \left( \left| y_{\text{predict}}^{1} - y_{\text{true}}^{1} \right|^2 + \left| y_{\text{predict}}^{2} - y_{\text{true}}^{2} \right|^2 + \cdots + \left| y_{\text{predict}}^{m} - y_{\text{true}}^{m} \right|^2 \right) \]

Let’s write the above expression more compactly using the sum notation:

\[ \text{Mean Squared Error (MSE)} = \frac{1}{m} \sum_{i=1}^{m} (y_{\text{predict}}^{i} - y_{\text{true}}^{i})^2 \] It is important to get into the habit of using the more compact linear algebra notation of vectors and matrices.

Both the software and the hardware built for machine learning models are optimized for computations with matrices and tensors (multidimensional matrices).

Using linear algebra notation, we can write the mean squared error as:

\[ \text{Mean Squared Error (MSE)} = \frac{1}{m} (\vec{y}_{\text{predict}} - \vec{y}_{\text{true}})^t (\vec{y}_{\text{predict}} - \vec{y}_{\text{true}}) \]

\[ \text{Mean Squared Error (MSE)} = \frac{1}{m} \left\| \vec{y}_{\text{predict}} - \vec{y}_{\text{true}} \right\|_{l^2}^2 \]

The last equality introduces the \(l^2\) norm of a vector which is just the \(\sqrt{\text{sum of squares of its components}}\).
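The two ways of writing the MSE, as an average of squared errors and as a squared \(l^2\) norm, give the same number. A minimal numpy sketch with made-up prediction and target values:

```python
import numpy as np

# Made-up predictions and true weights for m = 4 fish (illustration only).
y_predict = np.array([250.0, 280.0, 330.0, 370.0])
y_true = np.array([242.0, 290.0, 340.0, 363.0])
m = len(y_true)

# MSE as the average of squared errors ...
mse_sum = np.sum((y_predict - y_true) ** 2) / m

# ... and as the squared l2 norm of the error vector divided by m.
mse_norm = np.linalg.norm(y_predict - y_true) ** 2 / m

print(mse_sum, mse_norm)  # both print the same value
```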

Training Subset

Writing the training function for each of the \(m\) data points in our training subset gives the system:

\[ \begin{align*} y_{\text{predict}}^{1} &= w_0 + w_1 x_1^{1} + w_2 x_2^{1} + w_3 x_3^{1} + w_4 x_4^{1} + w_5 x_5^{1} \\ y_{\text{predict}}^{2} &= w_0 + w_1 x_1^{2} + w_2 x_2^{2} + w_3 x_3^{2} + w_4 x_4^{2} + w_5 x_5^{2} \\ &\vdots \\ y_{\text{predict}}^{m} &= w_0 + w_1 x_1^{m} + w_2 x_2^{m} + w_3 x_3^{m} + w_4 x_4^{m} + w_5 x_5^{m} \\ \end{align*} \]

We can arrange this system as:

\[ \begin{pmatrix} y_{\text{predict}}^{1} \\ y_{\text{predict}}^{2} \\ \vdots \\ y_{\text{predict}}^{m} \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} w_0 + \begin{pmatrix} x_1^{1} \\ x_1^{2} \\ \vdots \\ x_1^{m} \end{pmatrix} w_1 + \begin{pmatrix} x_2^{1} \\ x_2^{2} \\ \vdots \\ x_2^{m} \end{pmatrix} w_2 + \begin{pmatrix} x_3^{1} \\ x_3^{2} \\ \vdots \\ x_3^{m} \end{pmatrix} w_3 + \begin{pmatrix} x_4^{1} \\ x_4^{2} \\ \vdots \\ x_4^{m} \end{pmatrix} w_4 + \begin{pmatrix} x_5^{1} \\ x_5^{2} \\ \vdots \\ x_5^{m} \end{pmatrix} w_5 \]

The vector on the left-hand side of the equation is \(\vec{y}_{\text{predict}}\). The column vectors on the right-hand side, taken together, form the training subset \(X\) augmented with a column of ones, and the unknown weights \(w_0, w_1, \dots, w_5\) form the weight vector \(\vec{w}\).
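In numpy, building the augmented matrix and forming the predictions takes one line each. A minimal sketch with made-up measurements and placeholder weights:

```python
import numpy as np

# Made-up feature matrix: m = 3 fish, 5 measurements each (illustration only).
X_features = np.array([[23.2, 25.4, 30.0, 11.5, 4.0],
                       [24.0, 26.3, 31.2, 12.5, 4.3],
                       [23.9, 26.5, 31.1, 12.4, 4.7]])

# Augment with a column of ones so that w0 acts as the bias term.
m = X_features.shape[0]
X = np.column_stack([np.ones(m), X_features])

# With the weight vector w = (w0, w1, ..., w5), predictions are just X @ w.
w = np.array([-475.0, 82.8, -28.9, -28.5, 29.8, 31.0])  # placeholder values
y_predict = X @ w
print(y_predict)
```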

We can rewrite the equation as:

\[ \vec{y}_{\text{predict}} = X \vec{w} \] Now the formula of the mean squared error loss function, which we wrote before as:

\[ \text{Mean Squared Error (MSE)}= \\ \frac{1}{m} (\vec{y}_{\text{predict}} - \vec{y}_{\text{true}})^t (\vec{y}_{\text{predict}} - \vec{y}_{\text{true}}) = \\ \frac{1}{m} \left\| \vec{y}_{\text{predict}} - \vec{y}_{\text{true}} \right\|_{l^2}^2 \]

becomes:

\[ \text{Mean Squared Error (MSE)}= \\ \frac{1}{m} (X \vec{w} - \vec{y}_{\text{true}})^t (X \vec{w} - \vec{y}_{\text{true}}) = \\ \frac{1}{m} \left\| X \vec{w} - \vec{y}_{\text{true}} \right\|_{l^2}^2 \] We are now ready to find the \(\vec{w}\) that minimizes the loss function.
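As a quick sketch of this matrix form, here is the loss written as a numpy function of \(\vec{w}\); the augmented matrix and target values below are made up for illustration, and any \((m, 6)\) matrix with a length-\(m\) target vector would work the same way:

```python
import numpy as np

def mse_loss(w, X, y_true):
    """Mean squared error in matrix form: (1/m) * ||X w - y_true||^2."""
    residual = X @ w - y_true
    return residual @ residual / len(y_true)

# Made-up augmented matrix (leading column of ones) and targets, for illustration.
X = np.array([[1.0, 23.2, 25.4, 30.0, 11.5, 4.0],
              [1.0, 24.0, 26.3, 31.2, 12.5, 4.3],
              [1.0, 23.9, 26.5, 31.1, 12.4, 4.7]])
y_true = np.array([242.0, 290.0, 340.0])
w = np.zeros(6)
print(mse_loss(w, X, y_true))
```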

Correlated Features

Our model might have a problem if two or more features of the data are highly correlated. This means that there is a strong relationship between the features, so one of those features can be determined (or nearly determined) using a linear combination of the others. Thus, the corresponding feature columns are not linearly independent.

For matrices, this is a problem, since it indicates that the matrix either cannot be inverted (it is singular) or is ill-conditioned. Ill-conditioned matrices produce large instabilities in computations, since slight variations in the training data (which must always be assumed) produce large variations in the model parameters and hence render the model's predictions unreliable.

We desire well-conditioned matrices in our computations, so we must get rid of the sources of ill conditioning. When we have highly correlated features, one possible avenue is to include only one of them in our model, as the others do not add much information. Another solution is to apply dimension reduction techniques such as principal component analysis.
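One simple diagnostic, sketched below on synthetic data, is to look at the correlation matrix of the features and at the condition number of \(X^t X\); pairwise correlations near 1 and a huge condition number both flag the problem:

```python
import numpy as np

# Synthetic example: three "length" columns that are nearly multiples of
# one another, much like the three length measurements of a fish would be.
rng = np.random.default_rng(0)
length1 = rng.uniform(20, 40, size=50)
length2 = 1.08 * length1 + rng.normal(0, 0.1, size=50)
length3 = 1.15 * length1 + rng.normal(0, 0.1, size=50)
X = np.column_stack([length1, length2, length3])

# Correlations close to 1 indicate nearly redundant features.
print(np.corrcoef(X, rowvar=False))

# A very large condition number of X^t X signals ill conditioning.
print(np.linalg.cond(X.T @ X))
```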

Optimization

The goal is to find the values of the \(w\)’s that make our training function best fit the training data subset, where the word best is quantified using the loss function.

The first and second derivatives of a function are very important for locating optimizers, because the first derivative contains information on how fast a function increases at a point (so if you follow its direction, you might ascend to the maximum or descend to the minimum), and the second derivative contains information on the shape of the bowl of the function, whether it curves up or curves down.

Let’s remember a key idea from calculus: minimizers (and/or maximizers) happen at critical points, defined as the points where the derivative is either equal to zero or does not exist.

How do we locate the critical points in the interior of our search space?

Approach 1

  • Find the derivative of our loss function.
  • Set it equal to zero.
  • Solve for the \(w\)’s that make the derivative zero.

For functions whose derivatives are linear in the \(w\)’s, such as the mean squared error, it is fairly easy to solve for these \(w\)’s.

On the other hand, when our equations are nonlinear, finding these solutions is an entirely different story.

Approach 2

Another option is to follow the gradient direction to descend toward the minimum or ascend toward the maximum.
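As a preview, here is a minimal sketch of what gradient descent on the MSE could look like; the gradient formula \((2/m)\,X^t(X\vec{w} - \vec{y}_{\text{true}})\) follows from the derivation in the next section, and the learning rate and number of steps below are arbitrary illustrative choices:

```python
import numpy as np

def gradient_descent(X, y_true, learning_rate=1e-4, steps=100_000):
    """Minimize the MSE by repeatedly stepping against its gradient.

    X is the augmented (m, 6) matrix with a leading column of ones and
    y_true the vector of true fish weights. The MSE gradient with respect
    to w is (2/m) * X^t (X w - y_true).
    """
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(steps):
        gradient = (2 / m) * X.T @ (X @ w - y_true)
        w -= learning_rate * gradient
    return w

# Note: standardizing the features first (as discussed earlier) makes
# choosing a learning rate that converges much easier.
```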

We’ll use this approach later. For now, let’s focus on Approach 1.

Minimizing the loss function

\[ L(\vec{w}) = \frac{1}{m} (X \vec{w} - \vec{y}_{\text{true}})^t (X \vec{w} - \vec{y}_{\text{true}}) \]

\[ = \frac{1}{m} ((X \vec{w})^t - \vec{y}^t_{\text{true}}) (X \vec{w} - \vec{y}_{\text{true}}) \] \[ = \frac{1}{m} (\vec{w}^t X^t - \vec{y}^t_{\text{true}}) (X \vec{w} - \vec{y}_{\text{true}}) \] \[ = \frac{1}{m} (\vec{w}^t X^t X \vec{w} - \vec{w}^t X^t \vec{y}_{\text{true}} - \vec{y}^t_{\text{true}} X \vec{w} + \vec{y}^t_{\text{true}} \vec{y}_{\text{true}}) \] To simplify the notation, we substitute \(s = X^t X\) and \(\vec{a} = X^t \vec{y}_{\text{true}}\), and note that \(\vec{w}^t \vec{a}\) and \(\vec{a}^t \vec{w}\) are equal because both are scalars.

\[ = \frac{1}{m} (\vec{w}^t s \vec{w} - \vec{w}^t \vec{a} - \vec{a}^t \vec{w} + \vec{y}^t_{\text{true}} \vec{y}_{\text{true}}) \\ \] Taking the derivative with respect to \(\vec{w}\): \[ \frac{\partial L}{\partial \vec{w}} = \frac{1}{m} (2s \vec{w} - \vec{a} - \vec{a} + 0) \\ \]

Solving for \(\vec{w}\)

\[ \frac{1}{m} (2s\vec{w} -\vec{a} - \vec{a}) = 0 \\ \] \[ 2s\vec{w} - 2 \vec{a} = 0 \\ \]

\[ 2s\vec{w} = 2 \vec{a} \\ \]

\[ \vec{w} = s^{-1}\vec{a} \\ \]

Finally we have:

\[ \vec{w} = (X^t X)^{-1} X^t \vec{y}_{\text{true}} \\ \]
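Here is a minimal numpy sketch of this closed-form solution; in practice it is numerically safer to solve the linear system \((X^t X)\vec{w} = X^t \vec{y}_{\text{true}}\) than to form the inverse explicitly, and the synthetic data below is only there to check that known weights are recovered:

```python
import numpy as np

def fit_linear_regression(X, y_true):
    """Closed-form least squares: solve (X^t X) w = X^t y_true."""
    return np.linalg.solve(X.T @ X, X.T @ y_true)

# Sanity check on synthetic, noiseless data: the known weights are recovered.
rng = np.random.default_rng(1)
X_features = rng.uniform(1, 10, size=(20, 5))
X = np.column_stack([np.ones(20), X_features])
w_known = np.array([-475.0, 82.8, -28.9, -28.5, 29.8, 31.0])
y_true = X @ w_known
print(fit_linear_regression(X, y_true))  # approximately w_known
```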

Carrying out this computation on the Fish Market training data, the values of our weights are:

\[ \vec{w} = \begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ w_3 \\ w_4 \\ w_5 \end{pmatrix} = \begin{pmatrix} -475.1993 \\ 82.8497 \\ -28.8595 \\ -28.5077 \\ 29.8298 \\ 30.9725 \end{pmatrix} \]