Introduction to Neural Networks. Pt5 - Backpropagation
Backpropagation is How Neural Networks 'Learn'.
How to train a neural network, by tweaking its weights and biases during training
This is the final post in this mini-series in our Introduction to AI. In this post we will see how we use calculus to determine the sensitivity of our cost function to every single weight and every single bias in the network. Then we will use these sensitivities to calculate optimal tweaks & nudges that will, over the course of many, many epochs, get the network to perform the way we want it to.
How to determine the optimum set of millions of weights & biases?
Think of this like a giant mathematical combination lock. We have a cost function that measures the square of the difference between known outputs and what our network calculates. To solve the puzzle we need to come up with a set of weights and biases (the numbers on our combination lock) that get the cost function as low as possible (open the combination lock).
Think of backpropagation as helping solve a crazily complicated cryptex. Backpropagation will help you find a set of weights and biases in your network that minimises the error between the neural network's output and the training examples.
What impact will changing weights & biases have?
Let’s first think about what tweaking our weights and biases will do. We’ll look at biases first. A neuron’s bias term will determine how readily it fires. A high bias is ‘excitatory’: you can see below that a high bias will mean that the neuron fires at lower levels of activation.
A high bias means that a neuron will 'fire' at lower levels of input activation
Likewise, a low bias is inhibitory. We will need a lot of input activation or very high weights, or both, for the neuron to fire.
A low bias means that there needs to be a greater level of input activation (measured by 'x' in the diagram above) before the neuron will 'fire'
If we think about our weights the same thing is true. Higher weights create more excitatory signals in neurons in the network. Lower weights will inhibit the activation of neurons.
Increasing the weights that connect neurons will increase activations. A decrease in weights will naturally inhibit activations
The other factor that will determine the activation of a neuron is its inputs, that is, the activations of the neurons in the previous layer. Naturally we can’t directly change the activation of neurons in earlier layers. But we can influence the activations of neurons in prior layers by altering their weights and biases.
The changes to a neuron's bias, changes to a neuron's weights and changes to a neuron's inputs will all impact the activation of an artificial neuron.
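To make the effect of these three factors concrete, here is a minimal sketch of a single artificial neuron. It assumes a ReLU activation, and the input, weight and bias values are made-up numbers chosen only for illustration.

```python
def relu(z):
    """ReLU activation: passes positive values through, clamps negatives to zero."""
    return max(0.0, z)


def neuron(inputs, weights, bias):
    """Activation of one neuron: ReLU(weights . inputs + bias)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return relu(z)


inputs = [0.5, 0.2]    # hypothetical activations from the previous layer
weights = [1.0, 1.0]   # hypothetical weights

# A low (inhibitory) bias keeps the neuron silent for these inputs.
low_bias = neuron(inputs, weights, bias=-1.0)

# A high (excitatory) bias lets the same inputs make the neuron fire.
high_bias = neuron(inputs, weights, bias=0.5)

# Higher weights can overcome the low bias without touching the inputs.
bigger_weights = neuron(inputs, [3.0, 3.0], bias=-1.0)

print(low_bias, high_bias, bigger_weights)
```

The inputs themselves are never edited here. In a real network they are the outputs of the previous layer, so the only way to change them is through that layer's own weights and biases.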
Three types of things we can 'tweak' to get a better output
Therefore the three things that can alter any neuron’s activation are:
1. Its bias.
2. The neuron’s weights.
3. The activations of the neurons in the prior layer.
Thus, if we need an excitatory response for a particular neuron we can increase its own weights and biases. We can’t directly increase the activation of the prior layer’s neurons, as these are derived values. But we can increase the weights and biases of the neurons in the prior layer.
Similarly, if we need an inhibitory response for a particular neuron, we can decrease its own weights and biases. We can’t directly decrease the activation of the prior layer’s neurons, as these are derived values. But we can decrease the weights and biases of the neurons in the prior layer.
'Backpropagation' refers to indirectly changing the activations of neurons in prior layers. This is accomplished by changing the weights and biases in those prior layers
Two key takeaways...
Changes to the weights and biases of neurons in previous layers are where the term ‘Backpropagation’ comes from. When we are training our network, we ‘backpropagate’ our error from the output layer backwards through each hidden layer until we reach our input layer. Each neuron’s activation can be increased or decreased by altering its own weights and biases, but also by altering the weights and biases of neurons in earlier layers in the network.
The other thing to keep in mind is that not all nudges and tweaks to weights and biases are created equal. Weights are multiplied by activations. So even a large change to a weight that is connected to a neuron with low activation will have little effect. Similarly, a large change in upstream weights and biases for a neuron connected by a zero, or low weight will have zero or low impact.
The intuition behind 'Backpropagation'
Before we dive into the formal calculus let’s illustrate the two concepts we have just discussed. This will give us an intuition about what the backpropagation algorithm is doing. I hope you find this intuition helpful, rather than just looking at equations of derivatives, which can be a little dry on their own.
Let’s invoke our options pricing example and focus on the last three layers of the network. Our output layer, and the final two hidden layers. We will label these two hidden layers ‘l’ and ‘k’.
Modifying all the weights and biases in every layer to increase the output activation. In this example the activation is $0.72 too low.
Our calculated option price is too low...
We’ve provided all our time-series data and other parameters way upstream in our input layer. Let’s say our correct option price is $7.25. Our output is showing the incorrect answer of $6.53. Still a very long way away from the correct price. It is too low by seventy-two cents. There is clearly a lot of work to do to nudge the weights and biases before this would be a useful model capable of generalising. We need to increase the activation in our output layer. What will help with this?
The most proximate help is from the output neuron’s own bias. Increasing this will increase the output activation. We can also increase the three weights connecting the activations from layer ‘l’ to this output neuron. Finally, we can look at the activations in layer ‘l’ itself. Increases here will increase our output activation. Clearly, we can’t nudge these activations directly. But all these neurons have their own biases (Bl1, Bl2 & Bl3 in the animated diagram above).
Increases to these biases will increase the activations in layer ‘l’, which in turn will increase our output layer. Similarly, each of the neurons in layer ‘l’ has its own weights. These can be nudged up to increase the activations in layer ‘l’, which in turn will increase the activation in the output neuron.
We continue this process all the way back to our input layer. Increases to the biases in layer ‘k’ will push up the activations, as will changes to the weights that connect these neurons to the previous hidden layer, which is hidden off the screen to the left in the animated figure above.
Again this is where the term ‘Backpropagation’ comes from. The Error or Loss cascades back through the network from the output layer through all of the weights and biases of the previous hidden layer, and then through all of the weights and biases of the preceding hidden layer. This cascades back all the way to the input layer.
Not all tweaks and nudges are created equal
The other thing to keep in mind is that not all nudges are created equal. What do we mean by this? Well, look at this weight in the figure below.
The weight highlighted in yellow is connected to a neuron with a high activation (9.53). Even small changes to this weight will have a big impact on the output activation.
It is connected to a neuron with a high activation. Even small changes to this weight will have a big impact on the activation of the output neuron. At the same time look at this weight. This is connected to a neuron with a very low activation. Because of this even large changes in this weight will not have such an effect on the output.
Next look at the weight highlighted in the figure below. This has a very high strength. That means that even small changes to the activation of the highlighted neuron in layer ‘l’ will have a large impact on the output activation. Of course, we can’t change the value of this neuron directly, but it does mean that even small changes to the bias, or to the weights connected to this neuron, will have a large effect.
The neuron with activation of '3.18' is connected to the output neuron via a high weight. Even small increases to its bias, or its weights, will therefore have a large impact on the output activation.
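The uneven effect of nudges can be checked with a few lines of arithmetic. The sketch below reuses the two activations quoted in the figures (9.53 and 3.18); the third activation, all of the weights and the bias are hypothetical values picked for illustration, and the output neuron is assumed to use ReLU.

```python
def output_activation(weights, activations, bias):
    """ReLU(weights . activations + bias) for a single output neuron."""
    z = sum(w * a for w, a in zip(weights, activations)) + bias
    return max(0.0, z)


# 9.53 and 3.18 come from the figures; 0.02 is a hypothetical low activation.
activations = [9.53, 0.02, 3.18]
weights = [0.3, 0.3, 0.9]   # hypothetical weights
bias = 0.5                  # hypothetical bias

base = output_activation(weights, activations, bias)

# A small nudge (+0.01) to the weight fed by the high activation...
small_nudge_effect = output_activation([0.31, 0.3, 0.9], activations, bias) - base

# ...versus a large nudge (+1.0) to the weight fed by the low activation.
large_nudge_effect = output_activation([0.3, 1.3, 0.9], activations, bias) - base

print(small_nudge_effect, large_nudge_effect)
```

With these numbers the tiny nudge moves the output by about 0.0953, while the nudge that is a hundred times larger moves it by only about 0.02, because each weight's effect is scaled by the activation it multiplies.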
When the calculated answer is too high
Bear in mind that this is the result of one training example. We spoke in the last post about the importance of getting as many examples as you can. If we supply the market data and correct option price of a second example we would expect a different result, with a different loss. (In this case it has over-priced the option by four dollars and thirty-five cents).
In this example the network has overestimated the price of the option. Here we need to reduce weights and biases to get the output of the network closer to the correct value.
To lower this loss, we will need to decrease the weights and biases. Over the course of a whole epoch, we will see a range of losses. Our mean squared error cost function averages the squares of these losses, so every example contributes to the cost for that epoch.
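As a rough illustration of how the losses from different examples combine, the sketch below computes a mean squared error over the two examples discussed so far. The first target and output are the figures from the text; the second target of $10.00 is a hypothetical value, chosen so that the output over-prices it by $4.35.

```python
def mean_squared_error(targets, outputs):
    """Average of the squared differences between targets and network outputs."""
    squared_losses = [(t - o) ** 2 for t, o in zip(targets, outputs)]
    return sum(squared_losses) / len(squared_losses)


targets = [7.25, 10.00]   # 7.25 is from the text; 10.00 is hypothetical
outputs = [6.53, 14.35]   # too low by 0.72, then too high by 4.35

cost = mean_squared_error(targets, outputs)
print(cost)
```

Because each loss is squared before averaging, an under-priced example and an over-priced example both push the cost up rather than cancelling each other out.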
So, there are two very important takeaways before we cover the screen in Greek letters and subscripts. Firstly, understand how the error or loss backpropagates through the network from output to input. Secondly that the nudges to weights and biases are uneven. Even very small changes to activations that are connected through high weights will make a difference. Likewise, a very large nudge to a weight that takes an input from a neuron with low activation will have a limited effect on the final output.
If you are comfortable with these two concepts, then the calculus will be much more intuitive.
Calculus allows us to determine rates of change. How much one variable changes with respect to another. It also allows us to do things such as find the minimum of a function. Here we see a function in the animated figure below. We see the formula along with the graph. This graph plots the value of the function for a range of input values – ‘x’. This function has a minimum when x is 6.22.
Using calculus to determine rates of change and to find the minima of functions
Finding a function's minimum
Here is how we can use calculus to find the minimum of this function. The slope of the function is given by how much ‘y’ changes for a given change in x. If we make the changes very small, we can get the local tangent (or slope) of the curve at any point x. This slope could be positive, where y increases as x increases. Or it can be negative, where an increase in x will lead to a decrease in y. The slope can also be zero. It will be zero at the minimum of a function. If the slope is positive, then we will need to decrease x to head in the direction of the minimum. If the slope is negative, then we will need to increase x to move towards the minimum.
The slope can be steep – in which case we will want to alter x by a larger amount to get to the minimum. The slope can also be shallow, which it will be when we are near the minimum. Here we will make much smaller changes to x.
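Here is a minimal sketch of that idea in code. The function in the animated figure is not written out in the text, so this example uses a simple stand-in, f(x) = (x - 3)^2 + 1, whose minimum is at x = 3. The starting point and the `learning_rate` are arbitrary choices.

```python
def f(x):
    """Stand-in function with a single minimum at x = 3 (not the one in the figure)."""
    return (x - 3.0) ** 2 + 1.0


def slope(x):
    """Derivative of f: positive to the right of the minimum, negative to the left."""
    return 2.0 * (x - 3.0)


x = 0.0               # arbitrary starting point
learning_rate = 0.1   # scales how far we move for a given slope

for step in range(200):
    # Move against the slope: a steep slope gives a big change, a shallow one a small change.
    x -= learning_rate * slope(x)

print(x, f(x))
```

Each step moves x in the opposite direction to the slope, by an amount proportional to its steepness, so the changes shrink automatically as x approaches the minimum.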
Gradient Descent and 'Stochastic Gradient Descent' (Adding some noise)
This technique of using the direction and size of the slope to find a function’s minimum is called ‘Gradient Descent’. One problem here is that it can get stuck in local minima. As you can see there is another minimum on the animated chart above where x = 2. A way of avoiding getting stuck in local minima is to use a technique called ‘Stochastic Gradient Descent’ or SGD. Here, rather than computing the exact gradient by using all the training examples, we take a sample of the training set. This gives us an estimate of the true gradient, but one with enough noise or randomness to jolt us out of local minima.
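The sampling step can be sketched as follows. This is an illustration under simplifying assumptions rather than a full training loop: the 'network' is a single weight `w` in the model y = w * x, the training set is synthetic data generated with w = 2, and the batch size of 8 is arbitrary. The problem has no local minima, so the sketch only shows how a gradient is estimated from a random sample.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical training set generated from y = 2x.
xs = [rng.uniform(0.0, 1.0) for _ in range(200)]
ys = [2.0 * x for x in xs]


def gradient(w, batch):
    """Gradient of the mean squared error of y = w * x over the given example indices."""
    return sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)


w = 0.0
learning_rate = 0.1

for step in range(2000):
    # Estimate the gradient from a small random sample instead of the full set.
    batch = rng.sample(range(len(xs)), 8)
    w -= learning_rate * gradient(w, batch)

# The exact gradient over every example, for comparison with the sampled estimates.
full_gradient = gradient(w, range(len(xs)))
print(w, full_gradient)
```

Each update sees a slightly different estimate of the slope, which is the source of the noise described above, yet the weight still settles on the value that generated the data.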
So now we have a strategy – Stochastic Gradient Descent – that will help us choose the optimum set of weights and biases for our network to get our cost function as low as we can. We’ll walk through an example to show how this works.
Let’s start by writing our three equations for the neural network. Firstly the output layer from the final hidden layer, secondly any hidden layer from the previous hidden layer and finally the equation for the first hidden layer to the input layer. Writing the equations in this way simplifies the expression and lets us deal with an arbitrary number of hidden layers.
We can then write out our cost function as the mean of the squared error. The cost function itself is a function of our neural network. To make that explicit let’s plug our output layer equation into this function.
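The equations themselves appear in the animation rather than in the text, so the block below is one plausible way of writing them out. The notation is an assumption: f is the activation function, W and b are a layer's weights and biases, h is a layer's activations, x is the input, n labels the final hidden layer, o labels the output layer, and N is the number of training samples.

```latex
\begin{aligned}
\hat{y} &= f\!\left(W^{(o)} h^{(n)} + b^{(o)}\right)
  && \text{output layer from the final hidden layer} \\
h^{(j)} &= f\!\left(W^{(j)} h^{(j-1)} + b^{(j)}\right)
  && \text{any hidden layer from the previous hidden layer} \\
h^{(1)} &= f\!\left(W^{(1)} x + b^{(1)}\right)
  && \text{first hidden layer from the input layer} \\[6pt]
CF &= \frac{1}{N} \sum_{i=1}^{N} \left\lVert y_{\text{train},i} - \hat{y}_i \right\rVert^{2}
   = \frac{1}{N} \sum_{i=1}^{N} \left\lVert y_{\text{train},i}
     - f\!\left(W^{(o)} h^{(n)}_{i} + b^{(o)}\right) \right\rVert^{2}
\end{aligned}
```

The last line is the cost function with the output layer equation substituted in, which makes its dependence on the weights, the biases and the previous layer's activations explicit.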
An animated walkthrough 'Backpropagation' to accompany the explanation below.
We now have an expanded version of our cost function to look at the mean squared error between the training samples and the output of our network. This version is written in terms of the activation function, the weights, and biases as well as the activations of the previous layer. We want to minimise the cost function. To do that we need to nudge our weights and biases to get CF as low as possible. We will use calculus to determine the sensitivity of the cost function to our weights and biases so that we can make the right adjustments to our network while it is being trained.
Let’s focus on just one training sample which we will denote with the subscript ‘1’. For this sample we have a known output which we will call y_train. For many applications we will have multiple outputs, hence the representation here as a vector. For our option pricing solution, we have a single output. We can generalise this to a vector with a single value.
This ‘cost’ is a function of a function of a function. There are three nested functions. The first function is our composition equation. This multiplies the activations of the previous layer by weights and adds a bias. The second function is our activation function, which could be ReLU, hyperbolic tangent, sigmoid etc. Here we will use ReLU. Our third function is the square of the error. Subtract the output of our network from the correct training result and square this loss.
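The nesting is easy to see when each function is written separately. In this sketch the weights, bias and previous-layer activations are hypothetical values, chosen so that the output reproduces the $6.53 from the option pricing example against a correct price of $7.25.

```python
def composition(weights, prev_activations, bias):
    """First function: weighted sum of the previous layer's activations plus a bias."""
    return sum(w * h for w, h in zip(weights, prev_activations)) + bias


def relu(z):
    """Second function: the activation function (ReLU here)."""
    return max(0.0, z)


def squared_error(y_train, output):
    """Third function: square of the difference between target and output."""
    return (y_train - output) ** 2


weights = [0.5, 0.3, 0.7]            # hypothetical output-layer weights
prev_activations = [4.0, 2.0, 5.0]   # hypothetical activations of the previous layer
bias = 0.43                          # hypothetical bias
y_train = 7.25                       # correct option price from the example

# A function of a function of a function.
z = composition(weights, prev_activations, bias)
output = relu(z)
cost = squared_error(y_train, output)

print(output, cost)
```

Written this way, `cost` is literally `squared_error(y_train, relu(composition(...)))`, which is the structure the chain rule will be applied to.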
To have a strategy to nudge our weights and biases we want to know three things. Firstly, the sensitivity of the cost function to the weights in the output layer. When we change the weights, how much does the cost function change for this training example?
Secondly the sensitivity of the cost function to the bias in the output layer. When we change the bias vector, how much does the cost function change?
Finally, we would like to know the sensitivity of the cost function to the previous layer’s activations. As we’ve said we can’t directly nudge these activations. But we can nudge this layer’s weights and biases. If you recall this is where the idea of backpropagation comes in. By knowing the ideal adjustments we’d like to see in the activations of the previous layer we can derive how we would like that layer’s weights, biases, and that layer’s previous activations to change too. We can thus keep propagating the error back all the way to the input layer.
For nested functions we use the ‘Chain Rule’ from calculus. This states that for functions of functions of functions we simply multiply the derivatives of each function.
We have three variables here: the weights and biases of this layer and the activations of the previous layer. So, we will take the partial derivative with respect to each. This simply holds all the other variables constant and looks at the changes only to the variable in question.
So, the partial derivative of the cost function for this example with respect to the weights in the output layer (or ‘How much does the cost function change for a small change in the weights?’) can be determined by applying the chain rule to our cost function equation. The first function is our composition function: the weights multiplied by the activations, then added to the bias. If we differentiate this with respect to the weights, then the bias term disappears, and we are left with the activations. The derivative of our ReLU activation function is either zero or one. Finally, when we differentiate the third term (the square of the error) we get two multiplied by the error. For the sign to come out right, the error here is the network’s output minus the training value. We can simplify the equation to this. So, the derivative is either 0 or the loss multiplied by the previous layer’s activations multiplied by 2.
Let’s just pause and review where these terms come from. Firstly, we differentiate the composition function with respect to the weights. That gives us the activations from the previous layer. Then we differentiate our ReLU function, which gives us 1 or 0. Finally, we differentiate our squared error function. That gives us twice the loss.
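One way to gain confidence in this result is to compare the chain-rule formula with a brute-force estimate that nudges each weight slightly and measures the change in cost. The sketch below does that for a single output neuron. All numeric values are hypothetical, and the error is taken as output minus target so that the sign of the derivative is correct.

```python
def cost_for(weights, prev_activations, bias, y_train):
    """Return (z, cost) for one ReLU output neuron and one training example."""
    z = sum(w * h for w, h in zip(weights, prev_activations)) + bias
    output = max(0.0, z)
    return z, (y_train - output) ** 2


def grad_weights(weights, prev_activations, bias, y_train):
    """Chain rule: activations * ReLU slope * 2 * (output - target)."""
    z, _ = cost_for(weights, prev_activations, bias, y_train)
    output = max(0.0, z)
    relu_slope = 1.0 if z > 0 else 0.0
    d_cost_d_output = 2.0 * (output - y_train)
    return [h * relu_slope * d_cost_d_output for h in prev_activations]


weights = [0.5, 0.3, 0.7]            # hypothetical values
prev_activations = [4.0, 2.0, 5.0]
bias = 0.43
y_train = 7.25

analytic = grad_weights(weights, prev_activations, bias, y_train)

# Numerical check: nudge each weight up and down by a tiny amount.
eps = 1e-6
numeric = []
for i in range(len(weights)):
    up = list(weights)
    down = list(weights)
    up[i] += eps
    down[i] -= eps
    cost_up = cost_for(up, prev_activations, bias, y_train)[1]
    cost_down = cost_for(down, prev_activations, bias, y_train)[1]
    numeric.append((cost_up - cost_down) / (2 * eps))

print(analytic)
print(numeric)
```

The derivatives come out negative here, which says that increasing these weights lowers the cost. That matches the intuition from the under-priced option example, where the output needed to go up.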
We then go through the same process for the other two partial derivatives. To get the sensitivity of the cost function to changes in the bias in the output layer we differentiate our composition function with respect to the bias vector. The derivative of the composition function with respect to the bias is simply 1. Furthermore, the derivatives for the ReLU and the squared error are identical to those in our first equation, when we looked at the weights.
Finally, we want to determine our sensitivity to the previous layer’s activations. Again, to restate, we can’t directly alter these activations, but we can use the sensitivities to see what we would like to alter further up the network. This is the clever ‘backpropagation trick’ at play. The derivative of our composition function with respect to the previous layer’s activations is simply the weights. And following through the rest of the chain rule we get the now familiar terms for our ReLU and our squared error.
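The sketch below puts the three sensitivities side by side for one neuron, then feeds the sensitivity of a previous-layer activation into the same routine one layer further back. The helper `neuron_backward`, the two-layer setup and every numeric value are hypothetical. In a full network the sensitivity of a neuron's activation would be summed over every neuron it feeds into; here there is only one.

```python
def relu_slope(z):
    """Derivative of ReLU: 1 where the neuron is active, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def neuron_backward(weights, prev_activations, z, d_cost_d_activation):
    """Chain rule through one ReLU neuron.

    Returns the sensitivity of the cost to the neuron's weights, to its bias,
    and to the activations feeding into it.
    """
    delta = d_cost_d_activation * relu_slope(z)
    d_weights = [delta * h for h in prev_activations]  # composition derivative: activations
    d_bias = delta                                     # composition derivative: 1
    d_prev = [delta * w for w in weights]              # composition derivative: weights
    return d_weights, d_bias, d_prev


# Output neuron (hypothetical values).
out_weights = [0.5, 0.3, 0.7]
layer_l = [4.0, 2.0, 5.0]
out_bias = 0.43
y_train = 7.25

z_out = sum(w * h for w, h in zip(out_weights, layer_l)) + out_bias
output = max(0.0, z_out)
d_cost_d_output = 2.0 * (output - y_train)

_, d_out_bias, d_layer_l = neuron_backward(out_weights, layer_l, z_out, d_cost_d_output)

# First neuron of layer l: hypothetical weights and layer-k activations that give 4.0.
l1_weights = [1.0, 2.0]
layer_k = [1.0, 1.5]
l1_bias = 0.0
z_l1 = sum(w * h for w, h in zip(l1_weights, layer_k)) + l1_bias

# The desired change to this neuron's activation becomes the input to the same rule.
d_l1_weights, d_l1_bias, d_layer_k = neuron_backward(l1_weights, layer_k, z_l1, d_layer_l[0])

print(d_out_bias, d_layer_l)
print(d_l1_weights, d_l1_bias, d_layer_k)
```

The same function is applied at every layer; only its last argument changes, from the derivative of the squared error at the output to the activation sensitivity handed back by the layer in front.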
A visualisation of the nudges to weights (W) and biases (b). The desired changes to the previous layer's activations (h) tell us how much to nudge its own weights and biases. These desired changes propagate back through each layer of neurons. Hence 'Backpropagation'.
Let’s look at a matrix representation of this. In blue on the left we have our weight matrix. In yellow on the right we have our bias vector and in purple in the middle we have the activations of the previous layer. These are the inputs to the neurons in this layer.
We multiply the blue matrix by the purple vector, add the yellow bias terms and apply our activation function. What calculus and Stochastic Gradient Descent give us is a set of nudges for our training examples to minimise the loss. That is to say, they modify the weights and biases so that over time the network’s calculated answer gets closer and closer to the correct answer from our training data.
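To tie the matrix picture to the nudges, here is a small sketch of one layer written with a weight matrix, an activation vector and a bias vector, followed by repeated gradient steps on a single training example. The two-neuron layer, its values, the targets and the `learning_rate` are all hypothetical. Plain Python lists are used to keep the example dependency-free; a numerical library would normally handle the matrix arithmetic.

```python
def layer_forward(W, h, b):
    """ReLU(W h + b) for weight matrix W (list of rows), activations h and biases b."""
    zs = [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]
    return zs, [max(0.0, z) for z in zs]


def training_step(W, h, b, y_train, learning_rate):
    """Nudge every weight and bias against its own sensitivity."""
    zs, outputs = layer_forward(W, h, b)
    # Squared-error derivative times ReLU slope, one value per neuron.
    deltas = [2.0 * (o - y) * (1.0 if z > 0 else 0.0)
              for o, y, z in zip(outputs, y_train, zs)]
    new_W = [[w - learning_rate * d * x for w, x in zip(row, h)]
             for row, d in zip(W, deltas)]
    new_b = [bias - learning_rate * d for bias, d in zip(b, deltas)]
    return new_W, new_b


W = [[0.5, 0.3, 0.7],    # weight matrix: one row per neuron (hypothetical)
     [0.2, 0.1, 0.4]]
h = [4.0, 2.0, 5.0]      # activations of the previous layer (hypothetical)
b = [0.43, 0.1]          # bias vector (hypothetical)
y_train = [7.25, 2.0]    # first output starts too low, second starts too high

_, before = layer_forward(W, h, b)
for _ in range(20):
    W, b = training_step(W, h, b, y_train, learning_rate=0.01)
_, after = layer_forward(W, h, b)

print(before)
print(after)
```

The first neuron's weights and bias are nudged up and the second neuron's are nudged down, so both outputs close in on their targets over the repeated steps.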
That is a lot to take in all at once. Don’t worry if this takes a while to sink in. Frankly all you need to know here is that there is a strategy for determining the weights and biases in a neural network. They are not pulled from the ether, rather derived using a formal, iterative approach.
A neural network ‘learns’ by updating its weights and biases in response to training data.
Its incentive to ‘learn’ is to solve a mathematical equation. It has to come up with a set of weights and biases that minimise a cost function.
To solve this equation it uses a technique built on calculus, called Stochastic Gradient Descent.
This method looks at the sensitivity of the cost function to every weight and bias. From this sensitivity an optimum set of nudges to every weight and bias can be derived.