from sympy import *
from numpy import e
Neural Networks are computational graphs representing the formula of the training function.
Training, in the context of neural networks, means finding the parameter values, or weights, that enter into the formula of the training function by minimizing a loss function. This is similar to training linear regression, which we discussed here.
The mathematical structure remains the same:
1. Training Function
2. Loss Function
3. Optimization
The only difference is that for other models (linear regression, logistic regression, softmax regression, support vector machines, etc.), the training functions are uncomplicated. They linearly combine the data features, add a bias term (\(w_0\)), and pass the result through at most one nonlinear function (for example, the logistic function in logistic regression). As a consequence, the results of these models are also simple: a linear (flat) function for linear regression, and a linear decision boundary between classes for logistic regression, softmax regression, and support vector machines. Even when we use these simple models to represent nonlinear data, such as in polynomial regression (fitting the data to polynomial functions of the features) or support vector machines with the kernel trick, we still end up with linear functions or decision boundaries, but these live either in higher dimensions (for polynomial regression, the dimensions are the features and their powers) or in transformed dimensions (as when we use the kernel trick with support vector machines).
For neural networks, on the other hand, the process of linearly combining the features, adding a bias term, and then passing the result through a nonlinear function (now called an activation function) is the computation that happens in just one neuron.
This simple process happens over and over again in dozens, hundreds, thousands, or sometimes millions of neurons, arranged in layers, where the output of one layer acts as the input of the next layer.
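For example, a single neuron that receives two features \(x_1\) and \(x_2\) with weights \(w_1\) and \(w_2\), a bias \(w_0\), and an activation function \(\sigma\) computes
\[a = \sigma(w_1 x_1 + w_2 x_2 + w_0).\]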
We will discover that, in the end, a neural network is just one mathematical function.
In this blog, we will build a simple multilayer perceptron from scratch using the library SymPy for symbolic calculations.
from PIL import Image
from IPython.display import display
img = Image.open('media/graph.png')
display(img)
This is our perceptron's architecture. It has an input layer with two features, x1 and x2, which are linearly combined and fed into a hidden layer where the activation function of each neuron is the sigmoid function.
\[\text{Sigmoid Function: } \sigma(z) = \frac{1}{1 + e^{-z}}\]
In the last neuron, the activation function is also the sigmoid function, which means the output will be a number between 0 and 1, giving us a probability. Thus, this is a classification artificial neural network.
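For instance, with the common 0.5 threshold (a convention on top of the network, not something the network itself fixes), the predicted class would be
\[\hat{y} = \begin{cases} 1 & \text{if } f_5 \geq 0.5 \\ 0 & \text{otherwise.} \end{cases}\]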
# Features
x1 = 0.35
x2 = 0.9
# Weights and bias
w13 = symbols('w13')
w14 = symbols('w14')
w23 = symbols('w23')
w24 = symbols('w24')
w35 = symbols('w35')
w45 = symbols('w45')
w0 = symbols('w0') # Bias
# Input Layer ----------------------------------------------------------------------------------------
x1 = 0.35
x2 = 0.9

# Hidden Layer ---------------------------------------------------------------------------------------
# Neuron 3
z3 = (x1*w13+w0) + (x2*w23+w0)

# Activation function
f3 = 1 / (1+e**(-z3))

# Neuron 4
z4 = (x1*w14+w0) + (x2*w24+w0)

# Activation function
f4 = 1 / (1+e**(-z4))

# Output Layer ---------------------------------------------------------------------------------------
# Neuron 5
z5 = (f3*w35+w0) + (f4*w45+w0)

# Activation function
f5 = 1 / (1+e**(-z5))
Training Function (Model)
After all the linear combinations and activation functions, we end up with just one mathematical function that represents our training function:
f5
\(\displaystyle \frac{1}{2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} + 1}\)
Loss Function
Given that our output function is the sigmoid function, our loss function will be the binary cross-entropy loss function.
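Written out, with \(\hat{y}\) denoting the network's output \(f_5\) and \(y\) the true label, the loss for a single example is
\[L(y, \hat{y}) = -\,y \log(\hat{y}) - (1 - y)\log(1 - \hat{y}).\]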
# Loss Function (Binary Cross Entropy)
y_true = 1

L = -y_true*log(f5) - (1-y_true)*log(1-f5)
L
\(\displaystyle - \log{\left(\frac{1}{2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} + 1} \right)}\)
Optimization: Gradient Descent

\[\vec{w}^{i+1} = \vec{w}^{i} - \eta \nabla L(\vec{w}^{i})\]
To run gradient descent, we need to calculate the partial derivative of the loss function with respect to each weight and the bias.
For a deeper look at how gradient descent works, check out this blog.
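For our network, the gradient stacks the seven partial derivatives that we compute next:
\[\nabla L = \left(\frac{\partial L}{\partial w_0},\; \frac{\partial L}{\partial w_{13}},\; \frac{\partial L}{\partial w_{14}},\; \frac{\partial L}{\partial w_{23}},\; \frac{\partial L}{\partial w_{24}},\; \frac{\partial L}{\partial w_{35}},\; \frac{\partial L}{\partial w_{45}}\right)\]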
Partial Derivatives
dL_w0 = diff(L, w0)
dL_w0
\(\displaystyle \frac{2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} \left(- \frac{2.0 \cdot 2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} w_{35}}{\left(2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1\right)^{2}} - \frac{2.0 \cdot 2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} w_{45}}{\left(2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1\right)^{2}} - 2.0\right)}{2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} + 1}\)
dL_w13 = diff(L, w13)
dL_w13
\(\displaystyle - \frac{0.35 \cdot 2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} \cdot 2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} w_{35}}{\left(2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1\right)^{2} \cdot \left(2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} + 1\right)}\)
dL_w14 = diff(L, w14)
dL_w14
\(\displaystyle - \frac{0.35 \cdot 2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} \cdot 2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} w_{45}}{\left(2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1\right)^{2} \cdot \left(2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} + 1\right)}\)
dL_w23 = diff(L, w23)
dL_w23
\(\displaystyle - \frac{0.9 \cdot 2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} \cdot 2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} w_{35}}{\left(2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1\right)^{2} \cdot \left(2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} + 1\right)}\)
dL_w24 = diff(L, w24)
dL_w24
\(\displaystyle - \frac{0.9 \cdot 2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} \cdot 2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} w_{45}}{\left(2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1\right)^{2} \cdot \left(2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} + 1\right)}\)
dL_w35 = diff(L, w35)
dL_w35
\(\displaystyle - \frac{1.0 \cdot 2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}}}{\left(2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1\right) \left(2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} + 1\right)}\)
dL_w45 = diff(L, w45)
dL_w45
\(\displaystyle - \frac{1.0 \cdot 2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}}}{\left(2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1\right) \left(2.71828182845905^{- 2 w_{0} - \frac{w_{35}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{13} - 0.9 w_{23}} + 1} - \frac{w_{45}}{2.71828182845905^{- 2 w_{0} - 0.35 w_{14} - 0.9 w_{24}} + 1}} + 1\right)}\)
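As a side note, the seven derivatives above could also be collected with a single list comprehension. This is just a compact sketch of the same computation; the names weights and gradient are mine and are not used in the rest of the post.
# Sketch: all partial derivatives of the loss collected in one list
weights = [w0, w13, w14, w23, w24, w35, w45]
gradient = [diff(L, w) for w in weights]  # same expressions as dL_w0, ..., dL_w45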
Gradient Descent automation
# Optimize for w's
learning_rate = 0.3

# Random Values for Weights and bias
w0_i = 0.1
w13_i = 0.2
w14_i = 0.8
w23_i = 0.6
w24_i = 0.4
w35_i = 0.1
w45_i = 0.5
i = 1

step_size_w0 = 1
step_size_w13 = 1
step_size_w14 = 1
step_size_w23 = 1
step_size_w24 = 1
step_size_w35 = 1
step_size_w45 = 1
def update_w0(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i):
    # Evaluate dL/dw0 at the current weights, take one gradient-descent step,
    # and return the updated w0 together with the step size
    slope_w0 = dL_w0.subs({w0: w0_i, w13: w13_i, w14: w14_i, w23: w23_i, w24: w24_i, w35: w35_i, w45: w45_i})
    step_size_w0 = learning_rate * slope_w0
    updated_w0 = w0_i - step_size_w0
    return updated_w0, step_size_w0
def update_w13(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i):
    slope_w13 = dL_w13.subs({w0: w0_i, w13: w13_i, w14: w14_i, w23: w23_i, w24: w24_i, w35: w35_i, w45: w45_i})
    step_size_w13 = learning_rate * slope_w13
    updated_w13 = w13_i - step_size_w13
    return updated_w13, step_size_w13
def update_w14(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i):
    slope_w14 = dL_w14.subs({w0: w0_i, w13: w13_i, w14: w14_i, w23: w23_i, w24: w24_i, w35: w35_i, w45: w45_i})
    step_size_w14 = learning_rate * slope_w14
    updated_w14 = w14_i - step_size_w14
    return updated_w14, step_size_w14
def update_w23(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i):
    slope_w23 = dL_w23.subs({w0: w0_i, w13: w13_i, w14: w14_i, w23: w23_i, w24: w24_i, w35: w35_i, w45: w45_i})
    step_size_w23 = learning_rate * slope_w23
    updated_w23 = w23_i - step_size_w23
    return updated_w23, step_size_w23
def update_w24(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i):
    slope_w24 = dL_w24.subs({w0: w0_i, w13: w13_i, w14: w14_i, w23: w23_i, w24: w24_i, w35: w35_i, w45: w45_i})
    step_size_w24 = learning_rate * slope_w24
    updated_w24 = w24_i - step_size_w24
    return updated_w24, step_size_w24
def update_w35(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i):
    slope_w35 = dL_w35.subs({w0: w0_i, w13: w13_i, w14: w14_i, w23: w23_i, w24: w24_i, w35: w35_i, w45: w45_i})
    step_size_w35 = learning_rate * slope_w35
    updated_w35 = w35_i - step_size_w35
    return updated_w35, step_size_w35
def update_w45(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i):
    slope_w45 = dL_w45.subs({w0: w0_i, w13: w13_i, w14: w14_i, w23: w23_i, w24: w24_i, w35: w35_i, w45: w45_i})
    step_size_w45 = learning_rate * slope_w45
    updated_w45 = w45_i - step_size_w45
    return updated_w45, step_size_w45
Results of random parameters
f5.subs({w0: w0_i, w13: w13_i, w14: w14_i, w23: w23_i, w24: w24_i, w35: w35_i, w45: w45_i})
\(\displaystyle 0.649864477526152\)
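As a quick sanity check (not run in the original post, so no output is shown here), the loss at these random initial weights could be evaluated the same way:
# Loss at the random initial weights (sketch; output intentionally not shown)
L.subs({w0: w0_i, w13: w13_i, w14: w14_i, w23: w23_i, w24: w24_i, w35: w35_i, w45: w45_i})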
stop_criteria = abs(step_size_w0) >= 0.001 or abs(step_size_w13) >= 0.001 or abs(step_size_w14) >= 0.001 or abs(step_size_w23) >= 0.001 or abs(step_size_w24) >= 0.001 or abs(step_size_w35) >= 0.001 or abs(step_size_w45) >= 0.001
I will leave this section commented out, as it is a heavy process to run.
"""
while stop_criteria and i != 1000:
    w0_i, step_w0 = update_w0(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i)
    w13_i, step_w13 = update_w13(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i)
    w14_i, step_w14 = update_w14(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i)
    w23_i, step_w23 = update_w23(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i)
    w24_i, step_w24 = update_w24(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i)
    w35_i, step_w35 = update_w35(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i)
    w45_i, step_w45 = update_w45(w0_i, w13_i, w14_i, w23_i, w24_i, w35_i, w45_i)

    print(f'''Iteration: {i}
    StepSize_w0: {step_w0}
    New w0: {w0_i}

    StepSize_w13: {step_w13}
    New w13: {w13_i}

    StepSize_w14: {step_w14}
    New w14: {w14_i}

    StepSize_w23: {step_w23}
    New w23: {w23_i}

    StepSize_w24: {step_w24}
    New w24: {w24_i}

    StepSize_w35: {step_w35}
    New w35: {w35_i}

    StepSize_w45: {step_w45}
    New w45: {w45_i}
    ''')

    i += 1
"""
Results of trained parameters
#f5.subs({w0: w0_i, w13: w13_i, w14: w14_i, w23: w23_i, w24: w24_i, w35: w35_i, w45: w45_i})
Prints 0.999453185534398
The y_true value was 1, so it worked. The prediction improved from about 0.65 with the random initial weights to about 0.999 after training :)
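One last note: the training loop is slow mainly because every update evaluates a large symbolic expression with .subs(). Assuming plain NumPy numbers are acceptable, one option would be to compile each derivative once with SymPy's lambdify; this is only a sketch, not what was run above.
# Sketch: compile each symbolic derivative into a fast numerical function
from sympy import lambdify

weight_syms = (w0, w13, w14, w23, w24, w35, w45)
grad_fns = {w: lambdify(weight_syms, diff(L, w), 'numpy') for w in weight_syms}

# Example call with the random initial weights:
# grad_fns[w0](0.1, 0.2, 0.8, 0.6, 0.4, 0.1, 0.5)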