As we mentioned in the Fitting Functions to Data blog, (Supervised) Machine Learning algorithms are built on the following structure:
Training Function (Model)
Loss Function
Optimization
In order to optimize the loss function we have two alternatives.
Approach 1
Find the derivative of our function.
Set it equal to 0.
Solve for the \(w\)’s that make our derivative zero.
Approach 2
Follow the gradient direction to descend toward the minimum or ascend toward the maximum.
Gradient Descent uses approach 2. To exemplify how Gradient Descent works, let’s work through an example.
Simple Example
Data
For practical purposes, let’s take just 3 data points with values in \(x\) and \(y\):
\(x_1\): 0.5, 2.3, 2.9
\(y\): 1.4, 1.9, 3.2
These can be represented in a scatter plot:
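As a minimal sketch (not part of the original post), these three points could be plotted with matplotlib:

```python
import matplotlib.pyplot as plt

# The three data points used throughout this example
x = [0.5, 2.3, 2.9]
y = [1.4, 1.9, 3.2]

plt.scatter(x, y)
plt.xlabel("$x_1$")
plt.ylabel("$y$")
plt.title("Training data")
plt.show()
```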
Linear Regression
For this data we could adjust a Linear Regression Model.
Training Function (Model)
Let’s remember that linear regression makes a linear combination of weights and features plus a bias.
\[\vec{y} = w_0 + w_1 \vec{x}_1\]
Loss Function SSR
For this example, let’s use the Sum of Squared Residuals (SSR) as our loss function, which can be represented like this:
\[\text{Sum of Squared Residuals (SSR)} = (\vec{y}_{\text{true}} - \vec{y}_{\text{predicted}})^t (\vec{y}_{\text{true}} - \vec{y}_{\text{predicted}})\]
Since this example has only three data points, we’ll use summation notation instead of linear algebra notation.
\[\text{Sum of Squared Residuals (SSR)} = \sum_{i=1}^{3} (y^i_{\text{true}} - y^i_{\text{predicted}})^2\]
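As a small illustrative sketch (assuming the three data points above), this formula translates almost directly into Python:

```python
def ssr(w0, w1, x, y):
    # Sum of Squared Residuals for the line y_pred = w0 + w1 * x
    return sum((yi - (w0 + w1 * xi)) ** 2 for xi, yi in zip(x, y))

x = [0.5, 2.3, 2.9]
y = [1.4, 1.9, 3.2]
print(ssr(0.0, 0.64, x, y))  # SSR for an arbitrary choice of weights
```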
Note:
In this part of the example, we will optimize \(w_0\) and assume a fixed value for \(w_1\).
Since \(y_{\text{predicted}} = w_0 + w_1\cdot x_1\) and we are assuming \(w_1 = 0.64\):
\[\text{Sum of Squared Residuals (SSR)} = (1.4 - (w_0 + 0.64 \cdot x^1_1))^2 + (1.9 - (w_0 + 0.64 \cdot x^2_1))^2 + (3.2 - (w_0 + 0.64 \cdot x^3_1))^2\]
\[\text{Sum of Squared Residuals (SSR)} = (1.4 - (w_0 + 0.64 \cdot 0.5))^2 + (1.9 - (w_0 + 0.64 \cdot 2.3))^2 + (3.2 - (w_0 + 0.64 \cdot 2.9))^2\]
\[L(\vec{w}^{i}) = (1.4 - (w_0 + 0.64 \cdot 0.5))^2 + (1.9 - (w_0 + 0.64 \cdot 2.3))^2 + (3.2 - (w_0 + 0.64 \cdot 2.9))^2\]
Optimization: Gradient Descent \(\vec{w}^{i+1} = \vec{w}^{i} - \eta \nabla L(\vec{w}^{i})\)
Gradient Descent is an iterative algorithm that updates the value of the parameters, moving them step by step towards the minimum of a function.
Its update rule is \(\vec{w}^{i+1} = \vec{w}^{i} - \eta \nabla L(\vec{w}^{i})\):
where:
\(\vec{w}^{i}\) represents the vector of parameters
\(\eta\) is the learning rate
\(L(\vec{w}^{i})\) is the Loss function
\(\nabla L(\vec{w}^{i})\) is the gradient of the Loss function with respect to the parameter vector \(\vec{w}^{i}\).
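In code, one application of this update rule is a single line. A minimal sketch (the loss gradient is passed in as a function; the toy quadratic in the example is only for illustration):

```python
def gradient_descent_step(w, grad, eta=0.1):
    # One update of the rule: w_new = w - eta * grad(w)
    return w - eta * grad(w)

# Toy usage with L(w) = w**2, whose gradient is 2*w
print(gradient_descent_step(1.0, lambda w: 2 * w))  # 0.8
```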
In this case our function is the loss function and we want to find the value of \(w_0\) where the loss function reaches its minimum.
For Gradient Descent we need to find the partial derivative of the loss function with respect to \(w_0\):
\[L(\vec{w}^{i}) = (1.4 - (w_0 + 0.64 \cdot 0.5))^2 + (1.9 - (w_0 + 0.64 \cdot 2.3))^2 + (3.2 - (w_0 + 0.64 \cdot 2.9))^2\]
Applying the chain rule:
\[\frac{\partial L(\vec{w}^{i})}{\partial w_0} = -2(1.4 - (w_0 + 0.64 \cdot 0.5)) - 2(1.9 - (w_0 + 0.64 \cdot 2.3)) - 2(3.2 - (w_0 + 0.64 \cdot 2.9))\]
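As a sketch, this partial derivative can be written as a small Python function so the numbers in the iterations below can be checked (the data and the fixed slope \(w_1 = 0.64\) are taken from the example above):

```python
x = [0.5, 2.3, 2.9]
y = [1.4, 1.9, 3.2]

def dL_dw0(w0, w1=0.64):
    # Partial derivative of the SSR with respect to w0 (w1 held fixed)
    return sum(-2 * (yi - (w0 + w1 * xi)) for xi, yi in zip(x, y))

print(round(dL_dw0(0.0), 1))  # -5.7, the slope used in the first iteration
```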
\(1^{st}\) Iteration
We need to initialize \(w_0\) to some value.
Let’s try \(0\). Substituting \(w_0 = 0\):
\[\frac{\partial L(\vec{w}^{i})}{\partial w_0} = -2(1.4 - (0 + 0.64 \cdot 0.5)) - 2(1.9 - (0 + 0.64 \cdot 2.3)) - 2(3.2 - (0 + 0.64 \cdot 2.9)) = -5.7\]
The value of the derivative gives us the slope of the loss function at the point \(w_0 = 0\).
Calculating the step size:
\[\text{Step Size} = \eta \nabla L(\vec{w}^{i})\]
\[\text{Step Size} = 0.1 \cdot -5.7 = -0.57\]
\[w_0^{\text{new}} = 0 - (-0.57) = 0.57\]
The \(w_0\) value updates to \(0.57\).
\(2^{nd}\) Iteration
Substituting the updated value in the derivative function:
\[\frac{\partial L(\vec{w}^{i})}{\partial w_0} = -2(1.4 - (0.57 + 0.64 \cdot 0.5)) - 2(1.9 - (0.57 + 0.64 \cdot 2.3)) - 2(3.2 - (0.57 + 0.64 \cdot 2.9)) = -2.3\]
The value of the derivative gives us the slope of the loss function at the point \(w_0 = 0.57\).
\[\text{Step Size} = 0.1 \cdot -2.3 = -0.23\]
\[w_0^{\text{new}} = 0.57 - (-0.23) = 0.8\]
The \(w_0\) value updates to \(0.8\).
\(3^{rd}\) Iteration
Substituting the updated value in the derivative function:
\[\frac{\partial L(\vec{w}^{i})}{\partial w_0} = -2(1.4 - (0.8 + 0.64 \cdot 0.5)) - 2(1.9 - (0.8 + 0.64 \cdot 2.3)) - 2(3.2 - (0.8 + 0.64 \cdot 2.9)) = -0.9\]
The value of the derivative gives us the slope of the loss function at the point \(w_0 = 0.8\).
\[\text{Step Size} = 0.1 \cdot -0.9 = -0.09\]
\[w_0^{\text{new}} = 0.8 - (-0.09) = 0.89\]
The \(w_0\) value updates to \(0.89\).
When to stop?
Gradient Descent has two stopping criteria:
When the step size is close to \(0\) (\(\leq 0.0001\)), which means that the slope at the current value of \(w_0\) is close to \(0\) and therefore we are at a critical point.
After a maximum number of iterations (1000 in this example).
Automated Algorithm in Python
I automated this algorithm in Python; it converges at the \(8^{th}\) iteration, giving a value of \(0.95\) for \(w_0\).
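The actual code is linked in the Code section below; here is a minimal sketch of how such a loop might look, assuming a learning rate of \(0.1\), an initial \(w_0 = 0\), and the stopping criteria described above:

```python
x = [0.5, 2.3, 2.9]
y = [1.4, 1.9, 3.2]

def dL_dw0(w0, w1=0.64):
    # Partial derivative of the SSR with respect to w0 (w1 held fixed at 0.64)
    return sum(-2 * (yi - (w0 + w1 * xi)) for xi, yi in zip(x, y))

def gradient_descent_w0(w0=0.0, eta=0.1, tol=0.0001, max_iter=1000):
    # Repeat the update w0 <- w0 - eta * dL/dw0 until the step size is tiny
    for i in range(1, max_iter + 1):
        step = eta * dL_dw0(w0)
        w0 -= step
        if abs(step) <= tol:
            break
    return w0, i

w0, n_iter = gradient_descent_w0()
print(round(w0, 2))  # approximately 0.95
```

The exact iteration at which the loop stops depends on the tolerance used, but the final value matches the \(0.95\) reported above.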
The regression line updates this way:
Optimization: Gradient Descent \(w_0\) & \(w_1\)
Now, to optimize \(w_0\) and \(w_1\) together, we need to apply the gradient of the loss function \[L(\vec{w}^{i}) = (1.4 - (w_0 + w_1 \cdot 0.5))^2 + (1.9 - (w_0 + w_1 \cdot 2.3))^2 + (3.2 - (w_0 + w_1 \cdot 2.9))^2\]
The partial derivative of \(L(\vec{w}^{i})\) with respect to \(w_0\):
\[\frac{\partial L(\vec{w}^{i})}{\partial w_0} = -2(1.4 - (w_0 + w_1 \cdot 0.5)) - 2(1.9 - (w_0 + w_1 \cdot 2.3)) - 2(3.2 - (w_0 + w_1 \cdot 2.9))\]
The partial derivative of \(L(\vec{w}^{i})\) with respect to \(w_1\):
\[\frac{\partial L(\vec{w}^{i})}{\partial w_1} = -2 \cdot 0.5\,(1.4 - (w_0 + w_1 \cdot 0.5)) - 2 \cdot 2.3\,(1.9 - (w_0 + w_1 \cdot 2.3)) - 2 \cdot 2.9\,(3.2 - (w_0 + w_1 \cdot 2.9))\]
Automated Algorithm
The algorithm converges at the \(33^{rd}\) iteration with values for \(w_0\) and \(w_1\) of \(0.95\) and \(0.64\), respectively.
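Again as a sketch (not the original code): the two partial derivatives drive simultaneous updates of \(w_0\) and \(w_1\). I assume a smaller learning rate of \(0.01\) here, since \(0.1\) can overshoot when both weights are updated at once, so the iteration count will differ from the one reported above:

```python
x = [0.5, 2.3, 2.9]
y = [1.4, 1.9, 3.2]

def gradients(w0, w1):
    # Partial derivatives of the SSR with respect to w0 and w1
    residuals = [yi - (w0 + w1 * xi) for xi, yi in zip(x, y)]
    dw0 = sum(-2 * r for r in residuals)
    dw1 = sum(-2 * xi * r for xi, r in zip(x, residuals))
    return dw0, dw1

def gradient_descent(w0=0.0, w1=0.0, eta=0.01, tol=0.0001, max_iter=10000):
    # Update both weights at every step until both step sizes are tiny
    for _ in range(max_iter):
        dw0, dw1 = gradients(w0, w1)
        step0, step1 = eta * dw0, eta * dw1
        w0, w1 = w0 - step0, w1 - step1
        if max(abs(step0), abs(step1)) <= tol:
            break
    return w0, w1

w0, w1 = gradient_descent()
print(round(w0, 2), round(w1, 2))  # approximately 0.95 and 0.64
```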
Code
All the resources I used for this blog are here
Acknowledgements
This blog is inspired by the YouTube video Gradient Descent, Step-by-Step by StatQuest with Josh Starmer. BAM!