Richard Walker

Introduction to Neural Networks. Pt2 - Activation Functions

Updated: Jun 30, 2022

This is the second post in this series on neural networks. In the first post we introduced the artificial neuron. In the next posts we will discuss how neurons are stacked together in layers, as well as how we train networks to perform a particular task.


Click 'Play' on the video below, or read the blog post with animated graphics.

Full video of this blog post


What influences a neuron's activation?


We said that each neuron has an activation level. This activation is influenced by three things:


1. Inputs to the neuron

2. A set of weights and a bias term

3. An activation function


In this blog post we will take a closer look at activation functions. Activation functions perform two vital roles in neural networks. As we have seen, they play a part in producing the output of a neuron. But as we shall see in parts 4 & 5 of this blog series, they also play a vital role in training a neural network. When a network is being trained, adjustments are made to the weights and biases of each neuron. The strength and direction of the adjustments to these parameters depend on the activation function. Specifically, they depend on the first derivative, or slope, of the activation function.


Calculus is an important concept for neural network training or 'learning'


Let’s not get ahead of ourselves and worry too much about training, adjustments to parameters and calculus just now. We will come to that later in this series. But for now, when we look at different activation functions and we talk about their derivatives, please do bear in mind that there is a reason for this.


The reason is that these derivatives play a vital role in a neural network ‘learning’. It ‘learns’ by making small changes to its weights and biases when presented with training data.


First composition, then activation


Here is our schematic of an artificial neuron with three inputs again. The first step in the calculation of a neuron's output can be thought of as ‘composition’.


A schematic of a neuron with three inputs

Here we multiply each input by its weight, sum these products and then add a bias. The result of this composition is then fed into an activation function, and the result of that is the neuron's activation, or output. It is this activation function that we are interested in for this blog post.
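
To make this concrete, here is a minimal sketch of the composition step followed by an activation, written in Python with NumPy. The input, weight and bias values are made up purely for illustration.

```python
import numpy as np

# Three illustrative inputs, their weights and a bias (values are made up)
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.8, 0.1, -0.4])
bias = 0.2

# Composition: multiply each input by its weight, sum the products, add the bias
z = np.dot(inputs, weights) + bias

# Activation: feed the composed value through an activation function
# (here the Sigmoid from the previous post, which squashes z into the 0 to 1 range)
output = 1.0 / (1.0 + np.exp(-z))

print(z, output)
```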


Activations scale and normalise the neuron's inputs


Activation functions serve to scale and normalise a neuron's output. As you will see, all the activation functions here are non-linear (i.e. they are not straight lines). That is vital in allowing the network to 'learn' the complex non-linear relationships in the problems it is trying to solve.


The Sigmoid function and its first derivative

Here is the Sigmoid function that we saw in the last post. This squashes and normalises all the input data, from minus infinity to plus infinity, into an output range of 0 to 1. As you can see it also has a simple derivative, which helps a lot in training (more on that in posts four and five). In the 1980s and early 1990s the sigmoid function was the go-to activation function for neural networks. But it does suffer from a fatal flaw. Let me explain.
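
As a rough sketch, the Sigmoid and its derivative take only a few lines of Python. Notice how small the slope becomes once the input moves away from zero.

```python
import numpy as np

def sigmoid(z):
    """Squash any input into the range 0 to 1."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """The simple closed form of the slope: sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_derivative(0.0))    # 0.5 and 0.25 (the steepest point)
print(sigmoid(10.0), sigmoid_derivative(10.0))  # ~0.99995 and ~0.000045 (almost flat)
```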


Problems with too much scaling and normalisation - 'Vanishing Gradients'


The problem stems from squishing a huge input space (minus infinity to plus infinity) into a tiny output space (0 to 1). This means that some really big changes in the input produce only a tiny change in the output. In other words, the derivative is small. Small derivatives mean only tiny changes to the weights and biases during training. This slows the rate of 'learning' of the network.


Back in the 80s this frankly wasn’t much of a problem. Most neural networks were quite small, with only a few neurons and a small number of layers. But as networks get bigger you start having small gradients multiplied by other small gradients. That starts to give you vanishingly small adjustments and training grinds to a halt. You will see this referred to in the literature as the ‘Vanishing Gradients’ problem.
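
To see why this bites at scale, here is a toy illustration. The ten layers and the 0.25 slope (the Sigmoid's maximum possible gradient) are assumptions chosen purely for the arithmetic.

```python
# With the Sigmoid, the slope can never exceed 0.25. Multiplying even that
# best-case gradient across 10 layers gives a vanishingly small number.
max_slope = 0.25
print(max_slope ** 10)  # roughly 0.00000095
```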


So, while it is true that Sigmoid functions do have their uses, don't consign them to the scrap heap just yet. They are not the ubiquitous workhorses that they once were. But if you want a value between zero and one, perhaps for a model that is predicting a probability, where the likelihood can only lie between 0% and 100%, they might be a good choice.


A couple of ways to address some of the Sigmoid's shortcomings


An improvement on the Sigmoid from a 'squishing' perspective is the hyperbolic tangent. This had its heyday through the late 1990s and early 2000s as a common choice of function. Inputs here are compressed down into the -1 to +1 range. So the vanishing gradients problem is still there, but it is only half as bad. This function also has a well-behaved derivative that is relatively easy to compute.


The hyperbolic tangent - or tanh - activation function

We can extend this a little further with the ArcTan function. This ticks a lot of the boxes: it is non-linear, it provides both positive and negative outputs, and it compresses its outputs into a range between -1.57 and +1.57 (that is, between -π/2 and +π/2).


A further variation - the arctan activation function
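
Both of these alternatives are essentially one-liners in NumPy. Here is a quick sketch of the two functions and their derivatives, just to show their output ranges.

```python
import numpy as np

def tanh(z):
    """Hyperbolic tangent: compresses inputs into the range -1 to +1."""
    return np.tanh(z)

def tanh_derivative(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

def arctan(z):
    """ArcTan: compresses inputs into roughly -1.57 to +1.57 (-pi/2 to +pi/2)."""
    return np.arctan(z)

def arctan_derivative(z):
    """Derivative of arctan: 1 / (1 + z^2)."""
    return 1.0 / (1.0 + z ** 2)

print(tanh(10.0), arctan(10.0))  # both close to their upper limits: ~1.0 and ~1.47
```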

Frankly, it is not used that much. We are chasing the same vanishing-gradients black hole here, and as neural networks get bigger these types of activation function become more problematic. With the success of neural networks as a solution, coupled with advances in computing power and GPUs that allow truly huge networks, these activation functions are not so common nowadays. But they do have their uses in certain niches, especially in smaller-scale systems.


'ReLU' (and friends) to the rescue


Let’s turn our attention to what are undeniably the most common activation functions in use today. The place to start is ReLU – or Rectified Linear Unit. This is a simple function that returns 0 if the input is less than zero. For inputs greater than zero it simply returns the input. Very straightforward.


The 'ReLU' activation function

The derivative is very easy to compute as well. It is either 1 or 0. It is 1 if the input is greater than zero and 0 otherwise. This very simple derivative really speeds up training. These factors have made this type of activation function the most popular for large-scale networks and computer vision tasks.
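
Here is a minimal sketch of ReLU and its derivative in Python; the sample inputs are made up for illustration.

```python
import numpy as np

def relu(z):
    """Return 0 for negative inputs, otherwise return the input unchanged."""
    return np.maximum(0.0, z)

def relu_derivative(z):
    """The slope is 1 for positive inputs and 0 otherwise."""
    return np.where(z > 0, 1.0, 0.0)

print(relu(np.array([-2.0, 3.0])))             # [0. 3.]
print(relu_derivative(np.array([-2.0, 3.0])))  # [0. 1.]
```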


'Leaky ReLUs' & 'Dead ReLUs'


Given that the gradient for negative inputs is zero, there is still a potential problem. You may hear the terms 'dying ReLUs' or 'dead ReLUs', where training and the updating of the weights stall. If this does become an issue, many people turn to some pragmatic solutions.


The 'Leaky ReLU' - a pragmatic solution to 'Dying ReLUs'

One is to adopt a 'Leaky ReLU', where values below zero are multiplied by a tiny constant (the default for a Leaky ReLU is 0.01, but feel free to try out other values).
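
As a sketch, a Leaky ReLU is a one-line change to the plain ReLU above. The 0.01 slope below zero is the default mentioned above and is best treated as a tunable parameter.

```python
import numpy as np

def leaky_relu(z, negative_slope=0.01):
    """Like ReLU, but negative inputs are multiplied by a small constant
    rather than being zeroed out, so the gradient never dies completely."""
    return np.where(z > 0, z, negative_slope * z)

print(leaky_relu(np.array([-2.0, 3.0])))  # [-0.02  3.]
```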


Other variations - ELU


There are many other alternatives to ReLU, with the ELU, or Exponential Linear Unit, shown in the figure below.


Further variation - The Exponential Linear Unit

Here there is more computation involved, both in calculating the output and in calculating the derivative. For small networks this will not be a problem. But for networks with millions or tens of millions of neurons the additional training time will be noticeable.
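
For completeness, here is a sketch of an ELU. The exponential below zero is where the extra computation relative to ReLU comes from; alpha = 1.0 is a common choice and is assumed here.

```python
import numpy as np

def elu(z, alpha=1.0):
    """Return z for positive inputs; for negative inputs return
    alpha * (exp(z) - 1), which curves smoothly towards -alpha."""
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

def elu_derivative(z, alpha=1.0):
    """The slope is 1 for positive inputs and alpha * exp(z) for negative ones."""
    return np.where(z > 0, 1.0, alpha * np.exp(z))

print(elu(np.array([-2.0, 3.0])))  # roughly [-0.865  3.]
```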


In summary


This has been an introduction to activation functions. The goal has been to get the concepts across rather than to provide an exhaustive list or catalogue of the available options. Do understand what an activation function does. Do understand that calculating its derivative is important in training (to be covered later), and understand that, as with much of AI, there is a great deal of room for trial and experimentation with different approaches. When first training a neural network it is perhaps best to start with ReLU or the hyperbolic tangent. But do be on the lookout for areas where you might profit from some of the dozens of alternatives.

In the next post we will look at linking neurons together in layers: input layers, output layers and hidden layers. After that we will get into how we train and optimise these networks.



