Building a Neural Network from scratch

English
Machine Learning
Neural Networks
Author

Manuel Solano

Published

July 4, 2024

Neural Networks are computational graphs representing the formula of the training function.

Training, in the neural network context, means finding the parameter values, or weights, that enter into the formula of the training function by minimizing a loss function. This is similar to training linear regression, which we discussed here.

The mathematical structure remains the same:

1. Training Function
2. Loss Function
3. Optimization

The only difference is that for other models (linear regression, logistic regression, softmax regression, support vector machines, etc.), the training functions are uncomplicated. They linearly combine the data features, add a bias term (\(w_0\)), and pass the result into at most one nonlinear function (for example, the logistic function in logistic regression). As a consequence, the results of these models are also simple: a linear (flat) function for linear regression, and a linear division boundary between different classes for logistic regression, softmax regression and support vector machines. Even when we use these simple models to represent nonlinear data, such as in the case of polynomial regression (fitting the data to polynomial functions of the features) or support vector machines with the kernel trick, we still end up with linear functions or division boundaries, but these will either be in higher dimensions (for polynomial regression, the dimensions are the features and their powers) or in transformed dimensions (such as when we use the kernel trick with support vector machines).

For neural networks, on the other hand, the process of linearly combining the features, adding a bias term, and then passing the result through a nonlinear function (now called an activation function) is the computation that happens in just one neuron.
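In symbols, a single neuron with inputs \(x_1, x_2\), weights \(w_1, w_2\), bias \(w_0\), and activation function \(\sigma\) computes

\[a = \sigma(w_1 x_1 + w_2 x_2 + w_0)\]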

This simple process happens over and over again in dozens, hundreds, thousands or sometimes millions of neurons, arranged in layers, where the output of one layer acts as the input of the next layer.

We will discover that, after all, a neural network is just one mathematical function.

In this blog we will build a simple multilayer perceptron from scratch, using the library SymPy for symbolic calculations.

from sympy import *           # symbols, diff, log, etc. for symbolic math
from numpy import e           # Euler's number as a plain float
from PIL import Image
from IPython.display import display

# Architecture diagram of the network we will build
img = Image.open('media/graph.png')
display(img)

This is our perceptron’s architecture. It has an input layer with two features, x1 and x2, which are linearly combined and passed to a hidden layer where the activation function is the sigmoid function.

\[\text{Sigmoid Function: } \sigma(z) = \frac{1}{1 + e^{-z}}\]

In the last neuron, the activation function is also the sigmoid function, which means the output will be a number between 0 and 1, giving us a probability. Thus, this is a classification artificial neural network.
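Once the network is trained, that probability can be turned into a class label. Here is a minimal sketch, assuming the usual 0.5 cutoff (the threshold and the helper name classify are not part of the original post):

def classify(probability, threshold=0.5):
    # Map the network's output probability to a class label (assumed 0.5 cutoff)
    return 1 if probability >= threshold else 0

classify(0.8)  # -> 1
classify(0.2)  # -> 0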

# Weights and bias (symbolic, to be learned)
w13 = symbols('w13')
w14 = symbols('w14')
w23 = symbols('w23')
w24 = symbols('w24')
w35 = symbols('w35')
w45 = symbols('w45')
w0 = symbols('w0') # Bias

# Input Layer ----------------------------------------------------------------------------------------
# Features (a single training example)
x1 = 0.35
x2 = 0.9

# Hidden Layer ---------------------------------------------------------------------------------------
# Neuron 3 (note: the bias w0 is added with each input term, so it shows up as 2*w0 in z3)
z3 = (x1*w13+w0) + (x2*w23+w0)

# Activation function
f3 = 1 / (1+e**(-z3))

# Neuron 4
z4 = (x1*w14+w0) + (x2*w24+w0)

# Activation function
f4 = 1 / (1+e**(-z4))

# Output Layer ---------------------------------------------------------------------------------------
# Neuron 5
z5 = (f3*w35+w0) + (f4*w45+w0)

# Activation function
f5 = 1 / (1+e**(-z5))
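As a quick sanity check (a small addition, not in the original notebook), we can ask SymPy which symbols remain free in f5; they should be exactly the seven parameters we want to learn:

# The free symbols of the expression are the parameters still to be determined
f5.free_symbols  # expected: {w0, w13, w14, w23, w24, w35, w45}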

Training Function (Model)

After all the linear combinations and activations, we end up with just one mathematical function that represents our training function:

f5

\(\displaystyle \frac{1}{2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} + 1}\)
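A side note on the constant 2.71828182845905 that appears everywhere: because we imported e from numpy, it is a plain float, so SymPy keeps it as a number. A small alternative sketch (not in the original notebook) that keeps the exponential symbolic:

# Using SymPy's own exp (available from the star import above) keeps e symbolic when printed
f3_alt = 1 / (1 + exp(-z3))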

Loss Function

Given that our output activation is the sigmoid function, our loss function will be the binary cross-entropy loss function.
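For a single training example with true label \(y\) and predicted probability \(\hat{y} = f_5\), the binary cross-entropy loss is

\[L(y, \hat{y}) = -y \log(\hat{y}) - (1 - y)\log(1 - \hat{y})\]

which is exactly what the next cell implements.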

# Loss Function (Binary Cross Entropy)
y_true = 1

L = -y_true*log(f5) - (1-y_true)*log(1-f5)
L

\(\displaystyle - \log{\left(\frac{1}{2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} + 1} \right)}\)

Optimization: Gradient Descent \(\vec{w}^{i+1} = \vec{w}^{i} - \eta \nabla L(\vec{w}^{i})\)

In order to run gradient descent we need to calculate the partial derivative of the loss function with respect to each weight and the bias.
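SymPy's diff does this by applying the chain rule through the whole computational graph, which is what backpropagation does by hand. For example, the derivative with respect to \(w_{13}\) decomposes as

\[\frac{\partial L}{\partial w_{13}} = \frac{\partial L}{\partial f_5} \cdot \frac{\partial f_5}{\partial z_5} \cdot \frac{\partial z_5}{\partial f_3} \cdot \frac{\partial f_3}{\partial z_3} \cdot \frac{\partial z_3}{\partial w_{13}}\]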

To get a deeper view of how gradient descent works, check out this blog.

Partial Derivatives

dL_w0 = diff(L, w0)
dL_w0

\(\displaystyle \frac{2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} \left(- \frac{2.0 \cdot 2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} w_{35}}{\left(2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1\right)^{2}} - \frac{2.0 \cdot 2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} w_{45}}{\left(2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1\right)^{2}} - 2.0\right)}{2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} + 1}\)

dL_w13 = diff(L, w13)
dL_w13

\(\displaystyle - \frac{0.35 \cdot 2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} \cdot 2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} w_{35}}{\left(2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1\right)^{2} \cdot \left(2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} + 1\right)}\)

dL_w14 = diff(L, w14)
dL_w14

\(\displaystyle - \frac{0.35 \cdot 2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} \cdot 2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} w_{45}}{\left(2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1\right)^{2} \cdot \left(2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} + 1\right)}\)

dL_w23 = diff(L, w23)
dL_w23

\(\displaystyle - \frac{0.9 \cdot 2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} \cdot 2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} w_{35}}{\left(2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1\right)^{2} \cdot \left(2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} + 1\right)}\)

dL_w24 = diff(L, w24)
dL_w24

\(\displaystyle - \frac{0.9 \cdot 2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} \cdot 2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} w_{45}}{\left(2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1\right)^{2} \cdot \left(2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} + 1\right)}\)

dL_w35 = diff(L, w35)
dL_w35

\(\displaystyle - \frac{1.0 \cdot 2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}}}{\left(2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1\right) \left(2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} + 1\right)}\)

dL_w45 = diff(L, w45)
dL_w45

\(\displaystyle - \frac{1.0 \cdot 2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}}}{\left(2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1\right) \left(2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} + 1\right)}\)
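As a more compact alternative (a small sketch, not part of the original notebook), the same partial derivatives can be collected in one dictionary:

# Gradient of L with respect to every parameter, keyed by symbol
params = (w0, w13, w14, w23, w24, w35, w45)
gradient = {w: diff(L, w) for w in params}
# gradient[w13] is the same expression as dL_w13 above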

Gradient Descent Automation

# Optimize for w's
learning_rate = 0.3

# Initial values for the weights and bias (chosen arbitrarily)
w0_i = 0.1
w13_i = 0.2
w14_i = 0.8
w23_i = 0.6
w24_i = 0.4
w35_i = 0.1
w45_i = 0.5

# Iteration counter and initial step sizes (start above the stopping threshold)
i = 1
step_size_w0 = 1
step_size_w13 = 1
step_size_w14 = 1
step_size_w23 = 1
step_size_w24 = 1
step_size_w35 = 1
step_size_w45 = 1

# One update function per parameter; each takes a single gradient-descent step
def update_w0(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i):
    slope = dL_w0.subs({w0: w0_i, w13: w13_i, w14: w14_i, w23: w23_i, w24: w24_i, w35: w35_i, w45: w45_i})
    step_size = learning_rate * slope
    return w0_i - step_size, step_size

def update_w13(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i):
    slope = dL_w13.subs({w0: w0_i, w13: w13_i, w14: w14_i, w23: w23_i, w24: w24_i, w35: w35_i, w45: w45_i})
    step_size = learning_rate * slope
    return w13_i - step_size, step_size

def update_w14(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i):
    slope = dL_w14.subs({w0: w0_i, w13: w13_i, w14: w14_i, w23: w23_i, w24: w24_i, w35: w35_i, w45: w45_i})
    step_size = learning_rate * slope
    return w14_i - step_size, step_size

def update_w23(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i):
    slope = dL_w23.subs({w0: w0_i, w13: w13_i, w14: w14_i, w23: w23_i, w24: w24_i, w35: w35_i, w45: w45_i})
    step_size = learning_rate * slope
    return w23_i - step_size, step_size

def update_w24(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i):
    slope = dL_w24.subs({w0: w0_i, w13: w13_i, w14: w14_i, w23: w23_i, w24: w24_i, w35: w35_i, w45: w45_i})
    step_size = learning_rate * slope
    return w24_i - step_size, step_size

def update_w35(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i):
    slope = dL_w35.subs({w0: w0_i, w13: w13_i, w14: w14_i, w23: w23_i, w24: w24_i, w35: w35_i, w45: w45_i})
    step_size = learning_rate * slope
    return w35_i - step_size, step_size

def update_w45(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i):
    slope = dL_w45.subs({w0: w0_i, w13: w13_i, w14: w14_i, w23: w23_i, w24: w24_i, w35: w35_i, w45: w45_i})
    step_size = learning_rate * slope
    return w45_i - step_size, step_size
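The seven update functions above differ only in which derivative and which current value they use, so they could be collapsed into one helper. A sketch with a hypothetical name update_weight (the notebook keeps the seven separate functions):

def update_weight(dL_dw, current_value, values):
    # values maps each symbol (w0, w13, ...) to its current numeric value
    slope = dL_dw.subs(values)
    step_size = learning_rate * slope
    return current_value - step_size, step_size

# Example of one gradient-descent step for w13:
# values = {w0: w0_i, w13: w13_i, w14: w14_i, w23: w23_i, w24: w24_i, w35: w35_i, w45: w45_i}
# w13_i, step_w13 = update_weight(dL_w13, w13_i, values)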

Results of initial parameters

f5.subs({w0: w0_i, w13: w13_i, w14: w14_i, w23: w23_i, w24: w24_i, w35: w35_i, w45: w45_i})

\(\displaystyle 0.649864477526152\)

# Stop once every step size falls below the threshold (re-evaluated at the end of each iteration in the loop below)
stop_criteria = abs(step_size_w0) >= 0.001 or abs(step_size_w13) >= 0.001 or abs(step_size_w14) >= 0.001 or abs(step_size_w23) >= 0.001 or abs(step_size_w24) >= 0.001 or abs(step_size_w35) >= 0.001 or abs(step_size_w45) >= 0.001

I will leave this section commented out, as it is a heavy process.

"""
while stop_criteria and i != 1000:
    w0_i, step_w0 = update_w0(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i)
    w13_i, step_w13 = update_w13(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i)
    w14_i, step_w14 = update_w14(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i)
    w23_i, step_w23 = update_w23(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i)
    w24_i, step_w24 = update_w24(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i)
    w35_i, step_w35 = update_w35(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i)
    w45_i, step_w45 = update_w45(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i)
    

    print(f'''Iteration: {i} 
    StepSize_w0: {step_w0}
    New w0: {w0_i}
    
    StepSize_w13: {step_size_w13}
    New w13: {w13_i}
    
    StepSize_w14: {step_size_w14}
    New w14: {w14_i}
    
    StepSize_w23: {step_size_w23}
    New w13: {w23_i}
    
    StepSize_w24: {step_size_w24}
    New w24: {w24_i}
    
    StepSize_w35: {step_size_w35}
    New w35: {w35_i}
    
    StepSize_w45: {step_size_w45}
    New w45: {w45_i}
    
    ''')
  
    i += 1
    """
"\nwhile stop_criteria and i != 1000:\n    w0_i, step_w0 = update_w0(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i)\n    w13_i, step_w13 = update_w13(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i)\n    w14_i, step_w14 = update_w14(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i)\n    w23_i, step_w23 = update_w23(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i)\n    w24_i, step_w24 = update_w24(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i)\n    w35_i, step_w35 = update_w35(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i)\n    w45_i, step_w45 = update_w45(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i)\n    \n\n    print(f'''Iteration: {i} \n    StepSize_w0: {step_w0}\n    New w0: {w0_i}\n    \n    StepSize_w13: {step_size_w13}\n    New w13: {w13_i}\n    \n    StepSize_w14: {step_size_w14}\n    New w14: {w14_i}\n    \n    StepSize_w23: {step_size_w23}\n    New w13: {w23_i}\n    \n    StepSize_w24: {step_size_w24}\n    New w24: {w24_i}\n    \n    StepSize_w35: {step_size_w35}\n    New w35: {w35_i}\n    \n    StepSize_w45: {step_size_w45}\n    New w45: {w45_i}\n    \n    ''')\n  \n    i += 1\n    "

Results of trained parameters

#f5.subs({w0: w0_i, w13: w13_i, w14: w14_i, w23: w23_i, w24: w24_i, w35: w35_i, w45: w45_i})

When uncommented and run after the training loop, this prints 0.999453185534398.

The y_true value was 1, so it worked. The prediction improved from about 0.65 with the initial weights to about 0.9995 after training :)