Richard Walker

Introduction to Neural Networks. Pt4 - 'Cost Functions'

Updated: Jun 13

In this fourth post on neural networks we are going to see how neural networks ‘learn’, or are ‘trained’. In the first post we looked at the anatomy of a neuron: its inputs, outputs, weights, bias and activation function. The second post took a closer look at non-linear activation functions. In the third instalment we saw how we could link neurons in layers, with the activations of neurons in one layer feeding the neurons in the next layer – all the way to an output layer. The output layer provides the results of the task that the network has been trained to learn.



Cost Functions explained

In the third post we also suggested that a neural network could be regarded as a function. A crazily complicated function, but a function nonetheless. It has a bunch of inputs and a bunch of outputs. Crucially, it has a large number of parameters, called weights and biases. It is these weights and biases in each neuron that we can tweak, and this tweaking is what allows the neural network to ‘learn’ a pattern or mapping between inputs and outputs.


The Neural Network as a Function

Let’s get a little bit more formal with that ‘a neural network could be regarded as a function’ statement. Functions can be written out. So let’s write down what a neural network looks like as a function. To make things a little easier to digest let’s not write out the whole thing in one go. Instead let’s look at the function from layer to layer. From the input layer through hidden layers and finally to the output layer.


Vector & Matrix Representation

Up to this point we’ve been looking at equations for a single neuron. While we can continue to write out the equations in this form it is more legible and compact to use matrix notation.


Vector representation of a neural network

If we want to calculate the activations of neurons a1, a2 and a3 we multiply the activations of the neurons in the previous layer (here h1 through h4) by their associated weights. We then add each neuron’s bias and apply the activation function, here denoted ‘f’. We can use vectors to represent each layer, with the elements of each vector representing the activations of the neurons in that layer. We can also use a vector to represent the biases of every neuron in that layer. The weights that multiply the activations in the previous layer can be held in a matrix. This vector and matrix notation helps us write out the equations in a way that generalises to any shape and size of network.
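In this notation – a sketch, assuming the four activations h1 through h4 feed the three neurons a1, a2 and a3 via a 4 x 3 weight matrix W – the whole layer’s computation collapses to:

```latex
\mathbf{a} = f\!\left(W^{\top}\mathbf{h} + \mathbf{b}\right),
\qquad
a_j = f\!\left(\sum_{k=1}^{4} w_{kj}\, h_k + b_j\right), \quad j = 1, 2, 3.
```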


Transforming Inputs to Outputs

Let’s represent our inputs as a vector x, and our outputs as a vector y. We’ll call our first hidden layer h1. If we have n inputs and m neurons in our first hidden layer, and if every input is connected to every neuron in this first layer, then we can describe the weights connecting these neurons as an n x m matrix. Let’s call this matrix W1. Every one of the neurons in this hidden layer will have a bias, so we will need a vector for that too. We shall call that b1. Finally we will need an activation function to generate the output of each neuron in the hidden layer. We’ll call that f1().


First Hidden Layer

That means we can write out the function to calculate all of the activations in our first hidden layer h1 as:


The activations of the first hidden layer, calculated from the input layer
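In the notation just defined – W1 an n x m matrix, b1 the bias vector and f1 the activation function – the equation in the figure can be sketched as:

```latex
\mathbf{h}_1 = f_1\!\left(W_1^{\top}\mathbf{x} + \mathbf{b}_1\right)
```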


General Hidden Layer

If we want to write the function that generates the activations in the ith hidden layer then we can similarly write it as:



Determining the activations of one hidden layer from the previous hidden layer
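Using the same conventions – Wi, bi and fi for the ith layer’s weights, biases and activation – a sketch of the general form is:

```latex
\mathbf{h}_i = f_i\!\left(W_i^{\top}\mathbf{h}_{i-1} + \mathbf{b}_i\right)
```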

Output Layer

This continues until we reach the output layer, where our output vector will be:



Calculating the output layer of a Neural Network
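In the same notation – with ‘out’ labelling the output layer’s weights, biases and activation, and hfinal the final hidden layer – a sketch of this last step is:

```latex
\mathbf{y} = f_{\text{out}}\!\left(W_{\text{out}}^{\top}\,\mathbf{h}_{\text{final}} + \mathbf{b}_{\text{out}}\right)
```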

Where the subscript ‘final’ denotes the final hidden layer. Written this way we have a recursive set of formulae that generate our output from our input: you can see how to get to the output y from the input x.


This vector representation translates easily to software

More importantly, you can see that you, or someone you know, could write the code in Java or Python or some other language to calculate y from x. This would mean writing some loops for the matrix multiplication, adding the bias terms and coding whatever activation function you chose. But hopefully even the most cynical among you can concede that this is a realistic endeavour. Seen this way, I hope you agree that a neural network is just a function that maps a vector of inputs, x, into a vector of outputs, y.
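To make that concrete, here is a minimal sketch of such a forward pass in Python with NumPy. The layer sizes, the ReLU activation and the random weights are illustrative assumptions, not values from this post:

```python
import numpy as np

def relu(z):
    # A common non-linear activation: max(0, z), applied element-wise.
    return np.maximum(0.0, z)

def forward_pass(x, weights, biases, activations):
    # Feed the input vector through each layer in turn:
    # multiply by the weights, add the bias, apply the activation.
    a = x
    for W, b, f in zip(weights, biases, activations):
        a = f(W.T @ a + b)  # W is (n_inputs x n_neurons), so W.T @ a works
    return a

rng = np.random.default_rng(0)

# A toy network: 4 inputs -> 3 hidden neurons -> 2 outputs.
weights = [rng.standard_normal((4, 3)), rng.standard_normal((3, 2))]
biases = [np.zeros(3), np.zeros(2)]
activations = [relu, relu]

x = np.array([1.0, 2.0, 3.0, 4.0])
y = forward_pass(x, weights, biases, activations)
print(y)  # a vector of 2 outputs
```

The loop is the recursive formula written out: each iteration consumes one layer’s weight matrix, bias vector and activation, exactly as in the layer-by-layer equations above.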


Training: 'Supervised Learning'

In this post we are going to focus on what is called supervised learning. For this we need a set of associated input/output pairs: some output data (e.g. an option price) and the input data (e.g. a time series of the underlying) that gives rise to that particular output. You cannot have too much data. The more associated pairs of data you have, the better your results will be.


To effectively 'learn' AI needs a lot of training data

The only substitute for training data is even more training data. No shortcuts, no free lunch. Get your hands on as much training data as you possibly can.


Key to successful AI projects is ensuring you have enough training samples

Lucidate’s three laws of training data:


1. You can never have enough training data

2. The only substitute for training data is more training data

3. No matter how much training data you have, get some more


Training your neural network; step-by-step

So, armed with as much training data as you can gather, you are now ready to train your network. This is essentially a brute-force approach. What we will do is nudge the weights and biases of each neuron in the network so that our AI is able to generalise. That is, it is able to come up with sensible outputs for sets of inputs that it has not seen in its training data.


It really is no more complicated than:


1. Present inputs and measure the accuracy of the associated outputs

2. Nudge the weights and biases a little

3. Repeat until satisfied


We will focus on step 1 for the remainder of this post and leave steps 2 and 3 for the next instalment.


The 'Cost Function' Explained

Now let’s talk about this mysterious ‘cost function’. Actually it is not that mysterious. It is simply a way of measuring how good, or how bad, the network is at predicting the correct output while it is learning. The cost function heavily penalises the network for really bad outputs – that is to say, outputs that are a long way from the target output in the training set. As in life, high costs are bad and low costs are good. For high costs – that is to say, large deviations from the correct output – we’ll want to nudge our weights and biases more. For low costs – that is to say, outputs that are close to the correct answer – we might only want to tweak the weights and biases a little.


Training with the 'Forward Pass'

So we can start with a completely random initialisation of the weights and biases in the network. Then we can present our first input, let the network do its multiplication, bias addition and activation thing, and let the outputs from one layer feed into the next. This is referred to as a ‘forward pass’. At the end of this forward pass, with randomly initialised weights and biases, you would expect a truly awful output. If this is your expectation you will usually not be disappointed: a network with random weights will be terrible at getting anywhere close to the correct answer. You can subtract the output calculated by the network from the output you expect and get an error. This error tells us how far the network was from the correct answer. You will often hear this error referred to as the ‘loss’.


Calculate the Error or 'Loss'

You can then repeat this for a bunch of other examples in the training set. This is the sequence:


1. Plug in the input data

2. Let the network calculate the output with a forward pass

3. Compare the calculated output to the output you expect


As you continue with this you will get some negative errors (underestimates) as well as positive errors (overestimates). You want all errors to accumulate; that is to say, you don’t want positive errors to cancel out negative errors. A common tactic for dealing with this is to square each error, as a square is always positive, or simply to take the absolute value of each error, ignoring its sign.
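A quick illustration in Python (the two error values are made up for the sketch): with one underestimate and one overestimate of the same size, the raw errors cancel to zero while the squared errors accumulate:

```python
errors = [-2.0, 2.0]  # one underestimate, one overestimate of equal size

raw_sum = sum(errors)                    # the signs cancel out
squared_sum = sum(e**2 for e in errors)  # squaring keeps every error positive

print(raw_sum)      # 0.0 -- looks like a perfect network, which it isn't
print(squared_sum)  # 8.0 -- the true accumulated error
```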


The Mean Squared Error - 'MSE'

Thus a common choice of cost function is the sum of the squared errors, or better still the mean of the squared errors, which normalises the answer. This means that at the end of a training run – after showing a whole bunch of input and output data to our network and comparing the outputs we calculate to the outputs we expect – we can square the errors, take the average and get a single number that quantifies how good (a low cost) or how bad (a high cost) our network is.


Remember also that our network is a function, mapping inputs to outputs. If we decide that our cost function is the mean of the squared errors – or ‘MSE’ – that becomes:


The 'Mean Squared Error' ('MSE') Cost Function
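In symbols – a sketch, with N training examples indexed by i – this cost is:

```latex
C_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left(y_{\text{train}}^{(i)} - y_{\text{network}}^{(i)}\right)^{2}
```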

Where ytrain is the correct output from our training set and ynetwork is what our network calculated.
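As a sketch, this MSE cost takes only a few lines of Python with NumPy (the variable names mirror the ytrain / ynetwork notation; the sample values are illustrative):

```python
import numpy as np

def mse(y_train, y_network):
    # Mean of the squared differences between target and calculated outputs.
    errors = np.asarray(y_train) - np.asarray(y_network)
    return np.mean(errors ** 2)

y_train = np.array([1.0, 2.0, 3.0])    # correct outputs from the training set
y_network = np.array([1.5, 2.0, 2.0])  # what the network calculated
print(mse(y_train, y_network))  # (0.25 + 0.0 + 1.0) / 3
```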


Each complete pass through the training examples is called an 'Epoch'

You will hear these training runs referred to as ‘epochs’, and at the end of every epoch we can calculate the cost. We can use this cost and some fancy calculus to nudge our weights and biases to get a better set of outputs in the next epoch.


The Cost Function not only measures how accurate the network is, but as we shall see in the next post, it also helps us determine the optimum weights and biases

The whole phrase ‘Cost Function’ neatly sums things up. The ‘cost’ bit measures how good or bad the network is. The ‘function’ bit means that it can be differentiated. We’ll see in the next post how we use calculus to get the derivatives of the cost function and use them as a mechanism to update our weights and biases.


In summary:

  • We’ve seen how we can represent the neural network as a function

  • We’ve applied a cost function to measure how accurately (or otherwise) the network has learned the mapping between inputs and outputs while training

  • We’ve looked at the mean squared error as a good choice of cost function. This ticks a lot of boxes: it ensures that positive and negative errors are cumulative (rather than cancelling each other out) and it normalises the error by taking an average.


In the next post we will see how we can use the derivatives of the cost function to get a strategy for tweaking the parameters of the network after every epoch. This goes by the formal name of ‘Backpropagation’.
