Benjamin S. Knight

March 5th, 2017

I. Constructing a Neural Net

         My goal in writing this is to walk through a single iteration of a neural network in order to foster a better intuition for how neural nets operate. Neural nets have proved revolutionary in pattern inference. For example, Facebook's DeepFace facial recognition system uses neural nets to identify faces with 97% accuracy (The Verge, July 7, 2014). While neural nets can quickly assume truly daunting levels of complexity (e.g. AlexNet has 60 million parameters), they can be broken down into component parts, the underlying mechanics of which can be described using calculus.
         A neural network is an ensemble of nodes and the connections between them. These nodes are assembled into layers, with an input layer receiving an input corresponding to an observation (e.g. "Is the pixel red? [yes/no]") and the final layer producing an output (e.g. "Score assigned to 'apple' class for image X = 0.88"). These networks may be dozens of layers deep or more, hence the term 'deep learning.' As information traverses the network, each connection either amplifies or weakens the information passed through it via a system of weights. This stage where information permeates the network, is weighted and reweighted, and is then ultimately used to generate an output is referred to as a 'forward pass.'
         Even the most well-designed neural net won't get things right on the first try. The net needs exposure to additional information in order to learn. This learning mechanism takes the form of updates to the weights assigned to the connections between each node. In the event of a mistake, connections with weights that amplified the flow of erroneous information are penalized more heavily - obstructing that information pathway for subsequent forward passes. This process by which an error signal is generated and then passed backwards through the network's connections is referred to as 'backpropagation.'
         In this post I will walk step-by-step through a single epoch of a three-node neural network. An epoch is defined as one run of the model that utilizes the entirety of the available training data (for our purposes, the epoch will consist of a single observation, [8, -4]). Our example network has two input variables, X1 and X2, and one output variable Y. The interior of the network consists of two layers - one hidden layer of two nodes (H1 and H2) followed by a single output node O1, the value of which is transformed by the following sigmoid function.

$$ \sigma(x) = \displaystyle\frac{1}{1+e^{-x}} $$

        The sigmoid function serves to rescale the results from the neural net so that they reside between zero and one. For classification problems of a True/False nature, a value of 0.49 is far more tractable than -37.4. With all of the elements in place, our neural network takes the structure shown in Figure 1.
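For the curious, the sigmoid is a one-liner in code. Here is a minimal Python sketch (using NumPy, with illustrative inputs of my choosing):

```python
import numpy as np

def sigmoid(x):
    """Squash any real-valued input into the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(-37.4))  # ~0.0 - a strongly negative raw score maps to almost zero
print(sigmoid(0.0))    # 0.5 - the midpoint
```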

Figure 1: An Example Neural Net

A GIF showing the construction of a 2-inputs, 1-output, 3-node neural net with a sigmoid activation function in the output layer.

Before we can turn the crank of our neural net, we need to assign weights to the connections. We typically begin by initializing the weights with small random values chosen from a zero-mean Gaussian distribution. A standard deviation of about 0.01 is recommended (Hinton, 2010).
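In NumPy, that initialization might look like the following sketch; the (2, 2) and (2, 1) shapes are my assumption, matching the two-input, two-hidden-node, one-output network of Figure 1:

```python
import numpy as np

np.random.seed(42)  # fixed seed for reproducibility

# Small, zero-mean Gaussian weights with standard deviation 0.01
w_input_to_hidden = np.random.normal(loc=0.0, scale=0.01, size=(2, 2))
w_hidden_to_output = np.random.normal(loc=0.0, scale=0.01, size=(2, 1))
```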
         We start the forward pass by multiplying our initial inputs (X1 and X2) by their respective weights, and then summing the results to produce the hidden node values. Next, we repeat the process for the nodes in the hidden layer - multiplying the output of H1 by its weight, multiplying the output of H2 by its weight, and summing the two products to create the value associated with the output node O1. We conclude the forward pass by applying the sigmoid function, thus yielding our initial prediction of 0.21.
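The forward pass can be sketched in a few lines of Python. The hidden-to-output weights (0.7 and -0.6) and the hidden activations (-0.8 and 1.2) appear later in the worked example; the input-to-hidden weights below are illustrative values of my choosing that reproduce those hidden activations (the actual values live in Figure 2):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([8.0, -4.0])        # the single training observation

# Illustrative input-to-hidden weights chosen to reproduce the
# hidden activations used later in the post.
w1 = np.array([[0.1, 0.05],
               [0.4, -0.2]])     # rows: X1, X2; columns: H1, H2
w2 = np.array([0.7, -0.6])       # H1 -> O1 and H2 -> O1 weights

h = x @ w1                       # hidden layer: [-0.8, 1.2]
o = h @ w2                       # output node O1: -1.28
y_hat = sigmoid(o)               # ~0.2176, i.e. the 0.21 in the text

print(h, o, y_hat)
```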

Figure 2: Randomly Assigning Weights and Executing the Forward Pass

A GIF showing one forward pass of the neural network.

         An initial forward pass of our model generates a value of 0.21. However, the classification is incorrect (as we would expect given that the weights were randomly assigned). If our training set shows that the actual value for that observation is 1, how can we leverage that information to improve our model? Here we turn to calculus. By propagating the error signal backwards from the output layer to the input layer, we can construct an error gradient and use it to adjust the weights in such a way as to minimize the error. Because we are descending the gradient as we approach the point of minimum error, we refer to this process as gradient descent.


II. Gradient Descent

         If we wanted to plot the relationship between two variables X and Y, we might start by drawing a two-dimensional plot. If we went a step further and wanted to know what values of X correspond with the largest possible change in the value of Y, then we might instead show the rate of change in Y with respect to X by plotting the derivative. If we were interested in rates of change not just between X and Y, but also between Z and Y, then we would need a three-dimensional plot of the derivatives like the one shown below.

Figure 3: A Three Dimensional Visualization of Gradient Descent

A Three Dimensional Visualization of Gradient Descent Courtesy of Zoran Vrhovski.

Source: Zoran Vrhovski, May 29th, 2012

         A gradient is a representation of a derivative - a rate of change - projected across n dimensions. With only three variables, a gradient can still be represented visually. However, the number of variables generated by a neural net - the weights - can number in the millions. In this context, a more practical representation of a gradient is as an array of partial derivatives, i.e. the rate of change in the error signal E with respect to weight W1, the rate of change in E with respect to weight W2, and so forth. Thus, a neural net can be thought of as one very large differentiable function with many variables. For those wanting additional detail, I highly recommend the excellent series of video lectures from Khan Academy.
         Once we calculate the error gradient, we use it to adjust the weights in the neural net via gradient descent. In gradient descent we calculate an amount, Δ, by which to change each weight. To calculate Δ we take the error derivative associated with that node, multiply it by the input passing to it from the preceding node, and multiply the result by the learning rate, η (pronounced 'ee-tah').
         As we feed the neural net new and different information, the exact shape of the error gradient will naturally change. For this reason, smaller values of η tend to perform better. In other words, it is better to take many small steps in a direction that is generally correct than to take one very large step only to realize that the path was less than optimal.
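As a sketch, the update for a single weight might look like this in Python (the function and variable names are mine, not from a particular library). I use the standard convention of subtracting the gradient step; the worked example in section IV folds the sign into Δ itself:

```python
def weight_update(weight, error_derivative, node_input, learning_rate=0.5):
    """One gradient-descent step for a single weight.

    The step size is the error derivative at the receiving node,
    scaled by the input crossing this connection and by eta.
    """
    delta = learning_rate * error_derivative * node_input
    return weight - delta
```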


III. Calculating the Error Gradient

         To illustrate the process by which we calculate the error gradient, I will refer to Figure 4, which contains four elements. We have two nodes, i and j. Node i is located in the penultimate layer - a hidden layer - whereas j is in the output layer. Next, we have the outputs of these two nodes - yi for node i and yj for node j respectively. Finally, we have zj, the total weighted input into node j.


Figure 4: Calculating the Propagation of the Error Signal

A diagram establishing the variables used in backpropagation.

         To begin, we need a method of deriving a generalized error signal from all of the output nodes. The function we use for this is typically referred to as the 'loss function.' A very common loss function is the sum of squared errors as shown below.

$$ E = \displaystyle\frac{1}{2}\sum_j(t_j - y_j)^2 $$

Here the overall error signal generated by the entirety of the neural net is represented as E. To calculate this value, let's assign the true value of output node j to tj. Taking yj and tj together, we see that the contents of the parentheses is the difference between the predicted value and the actual value. We are interested in the absolute size of the error, and so we square the difference. Taking the absolute value would also yield the desired quantity. However, the absolute value function is not differentiable at zero - differentiability being a prerequisite of gradient descent. Next, we sum these squared differences for all of the output nodes. Lastly, we multiply the result by one half for the sake of convenience when we take the derivative.
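In Python, the loss function is a single line. A minimal sketch, where t and y are arrays holding the true and predicted output values:

```python
import numpy as np

def squared_error_loss(t, y):
    """E = 1/2 * sum over output nodes of (t_j - y_j)^2."""
    return 0.5 * np.sum((t - y) ** 2)

# With the single output from the running example: 0.5 * (1 - 0.21)^2
print(squared_error_loss(np.array([1.0]), np.array([0.21])))  # ~0.312
```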
         Now that we have a general error signal, we need to apply it to all of the nodes in the output layer. In other words, we need a way of capturing the extent to which a change in the output of node j impacts the overall level of error, E. For this reason, we take the partial derivative of the loss function with respect to the output node as shown below.

$$\begin{eqnarray} \frac{\partial E}{\partial y_j} = -(t_j -y_j) \nonumber \\ \end{eqnarray}$$

         Now that we know the error derivative corresponding to node j, we need to calculate what amount of error to propagate backwards to all of the nodes that connect to j. Recall that the weighted sum of the outputs of these precursor nodes is captured by the term zj. When we take the partial derivative with respect to zj we get the following expression.

$$\begin{eqnarray} \frac{\partial E}{\partial z_j} &=& \displaystyle\frac{dy_j}{dz_j} \displaystyle\frac{\partial E}{\partial y_j} \nonumber \\ &=& y_j(1-y_j)\displaystyle\frac{\partial E}{\partial y_j} \nonumber \\ \end{eqnarray}$$

The change in the amount of error signal E with respect to the inputs to node j is equal to the change in the output of node j with respect to the change in its inputs, multiplied by the change in E with respect to the output of node j. Note that the derivative dyj/dzj is written with an ordinary d because yj is a function of zj alone, whereas E depends on many variables, so its derivatives are partial derivatives written with the symbol ∂. The resulting value ends up being equal to the output of node j, times one minus the output of node j, times the partial derivative of the error term E with respect to the output of node j.
         Moving backwards from the output layer to the hidden layer, we use the error derivative for the inputs to node j to calculate the error derivative for the outputs of node i. In other words, we will use the partial derivative of E with respect to zj to calculate the partial derivative of E with respect to yi.

$$\begin{eqnarray} \frac{\partial E}{\partial y_i} &=& \displaystyle\sum_j\displaystyle\frac{dz_j}{dy_i} \displaystyle\frac{\partial E}{\partial z_j} \nonumber \\ &=& \displaystyle\sum_jw_{ij}\frac{\partial E}{\partial z_j} \nonumber \\ \end{eqnarray}$$

We can interpret the above expression as follows: the change in the error with respect to the output of node i is equal to the change in error associated with the inputs to a subsequent node j, times the weight of the connection bridging the two nodes, Wij. We repeat this process for all connections between node i and the subsequent layer. By summing the results, we derive the total error to backpropagate to node i.
         We can zoom in further and see how this plays out with a particular weight.

$$\begin{eqnarray} \frac{\partial E}{\partial w_{ij}} &=& \displaystyle\frac{\partial z_j}{\partial w_{ij}} \displaystyle\frac{\partial E}{\partial z_j} \nonumber \\ &=& y_i \frac{\partial E}{\partial z_j} \nonumber \\ \end{eqnarray}$$

The change in error with respect to a change in some weight Wij is equal to the change in error with respect to the inputs to the subsequent layer, multiplied by the output of the preceding node, yi - the input carried across that weight.
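The three expressions above translate almost line-for-line into code. Below is a sketch for a single sigmoid layer; the names (y_i, y_j, w, dE_dyj) mirror the notation of Figure 4 and are otherwise my own:

```python
import numpy as np

def backprop_layer(y_i, y_j, w, dE_dyj):
    """Propagate the error derivative back through one sigmoid layer.

    y_i    : outputs of the preceding layer, shape (n_i,)
    y_j    : outputs of this layer, shape (n_j,)
    w      : weights connecting layer i to layer j, shape (n_i, n_j)
    dE_dyj : dE/dy_j for this layer's outputs, shape (n_j,)
    """
    dE_dzj = y_j * (1.0 - y_j) * dE_dyj   # dE/dz_j = y_j (1 - y_j) dE/dy_j
    dE_dyi = w @ dE_dzj                   # dE/dy_i = sum_j w_ij dE/dz_j
    dE_dw = np.outer(y_i, dE_dzj)         # dE/dw_ij = y_i dE/dz_j
    return dE_dzj, dE_dyi, dE_dw
```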


IV. Tying it all Together

         In this section I take the backpropagation logic discussed in the previous section and apply it to the neural net shown in Figure 1. Based on the notation in Figure 1, input variables are designated as X, while hidden layer nodes and output layer nodes are represented as H and O respectively. Figure 5 illustrates this process of backpropagation, gradient derivation, and descent with a learning rate of η = 0.5.

Figure 5: Executing Backpropagation and Gradient Descent

A diagram illustrating backpropagation and gradient descent applied to the example neural net.

         We start our backpropagation by taking the error derivative at our sigmoid layer as a function of the total error, where x is the value passed to the sigmoid function (the output node's weighted sum, -1.28), y-hat is the predicted value, and y is the actual value.

$$\begin{eqnarray} \frac{\partial E}{\partial y_\sigma} &=& (y - \hat{y})\sigma(x)(1- \sigma(x)) \phantom{aaaa} \nonumber \\ &=& (1- 0.21)\Bigg(\displaystyle\frac{1}{1+e^{-(-1.28)}}\Bigg)\Bigg(1 - \displaystyle\frac{1}{1+e^{-(-1.28)}}\Bigg) \phantom{aaaa} \nonumber \\ &=& (1-0.21)(0.21)(1-0.21) \nonumber \\ &=& (0.79)(0.21)(0.79) \nonumber \\ &=& 0.131 \nonumber \\ \end{eqnarray}$$

Next we take the error derivative of the inputs to the sigmoid layer. Here yσ is the output of the sigmoid layer.

$$\begin{eqnarray} \frac{\partial E}{\partial z_\sigma} &=& y_\sigma(1-y_\sigma)\displaystyle\frac{\partial E}{\partial y_\sigma} \nonumber \\ &=& (0.21)(1-0.21)(0.131) \nonumber \\ &=& 0.0217 \nonumber \\ \end{eqnarray}$$

We now have the error derivative for the inputs to the sigmoid layer. We use this to calculate the error signal propagated to the two hidden nodes, H1 and H2.

$$\begin{eqnarray} \frac{\partial E}{\partial W_{H_1\sigma}} &=& H_1 \frac{\partial E}{\partial z_\sigma} \nonumber \phantom{aa} \rightarrow \phantom{aa} (-0.8)(0.0217) \phantom{aa} \rightarrow \phantom{a} -0.01736\\ \frac{\partial E}{\partial W_{H_2\sigma}} &=& H_2 \frac{\partial E}{\partial z_\sigma} \nonumber \phantom{aa} \rightarrow \phantom{aa} (1.2)(0.0217) \phantom{aaa} \rightarrow \phantom{aaa} 0.02604 \nonumber\\ \end{eqnarray}$$

Now that we have all of the error derivatives, we can employ gradient descent. In adjusting the weights for the output layer (W2a, W2b), we multiply together (1.) the learning rate parameter, η = 0.5, (2.) the output unit error, and (3.) the hidden unit activation value. For the hidden unit activation values, recall that

$$\begin{eqnarray} y_{H_1\sigma} &=& (H_1)(W_{H_1\sigma}) \phantom{aa} \rightarrow \phantom{aa} (-0.8)(0.7) \phantom{aa} \rightarrow \phantom{aa} -0.56 \nonumber\\ y_{H_2\sigma} &=& (H_2)(W_{H_2\sigma}) \phantom{aa} \rightarrow \phantom{aa} (1.2)(-0.6) \phantom{aa} \rightarrow \phantom{aa} -0.72 \nonumber\\ \end{eqnarray}$$

Given these inputs, the changes to the weights feeding the neural net's output layer are calculated as follows.

$$\begin{eqnarray} \Delta W_{H_1\sigma} &=& (\eta)\Big(\frac{\partial E}{\partial y_\sigma}\Big)(y_{H_1\sigma}) \phantom{aa} \rightarrow \phantom{aa} (0.5)(0.131)(-0.56) \phantom{aa} \rightarrow \phantom{aa} -0.036 \nonumber \\ \Delta W_{H_2\sigma} &=& (\eta)\Big(\frac{\partial E}{\partial y_\sigma}\Big)(y_{H_2\sigma}) \phantom{aa} \rightarrow \phantom{aa} (0.5)(0.131)(-0.72) \phantom{aa} \rightarrow \phantom{aa} -0.047 \nonumber \\ \end{eqnarray}$$

With the new weights in hand, we can see that we are slightly decreasing the flow of information both between the H1 node and the sigmoid node (0.7 - 0.036 → 0.664) and between the H2 node and the sigmoid node (-0.6 - 0.047 → -0.647).
         Moving on to the weights in the preceding layer (W1a, W1b, W1c, W1d), we adjust the updating formula slightly. The change in weights between the input layer and the hidden layer is calculated by multiplying together (1.) the learning rate, (2.) the hidden unit's error, and (3.) the input values, X1 = 8 and X2 = -4.

$$\begin{eqnarray} \Delta W_{X_1 H_1} &=& (\eta)\Big( \frac{\partial E}{\partial W_{H_1\sigma}}\Big)(X_1) \phantom{aa} \rightarrow \phantom{aa} (0.5)(-0.01736)(8) \phantom{aaaa} \rightarrow \phantom{aa} -0.069 \nonumber \\ \Delta W_{X_1 H_2} &=& (\eta)\Big( \frac{\partial E}{\partial W_{H_2\sigma}}\Big)(X_1) \phantom{aa} \rightarrow \phantom{aa} (0.5)(0.02604)(8) \phantom{aaaaaa} \rightarrow \phantom{aaaa} 0.104 \nonumber \\ \Delta W_{X_2 H_1} &=& (\eta)\Big( \frac{\partial E}{\partial W_{H_1\sigma}}\Big)(X_2) \phantom{aa} \rightarrow \phantom{aa} (0.5)(-0.01736)(-4) \phantom{aaa} \rightarrow \phantom{aaaa} 0.034 \nonumber \\ \Delta W_{X_2 H_2} &=& (\eta)\Big( \frac{\partial E}{\partial W_{H_2\sigma}}\Big)(X_2) \phantom{aa} \rightarrow \phantom{aa} (0.5)(0.02604)(-4) \phantom{aaaaa} \rightarrow \phantom{aa} -0.052 \nonumber \\ \end{eqnarray}$$

After updating the weights in the input layer by the amounts shown above, gradient descent is complete. Repeating this process should, in time, yield a configuration of weights that minimizes classification error.
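For readers who prefer to verify the arithmetic programmatically, the following sketch reproduces the forward and backward passes of Figures 2 and 5, following the same sequence of steps used above. Because the hand calculation rounds intermediate values to two or three decimal places, the printed numbers differ slightly from those in the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

eta = 0.5                          # learning rate
x = np.array([8.0, -4.0])          # inputs X1, X2
h = np.array([-0.8, 1.2])          # hidden activations H1, H2
w2 = np.array([0.7, -0.6])         # weights H1 -> sigma, H2 -> sigma
y_true = 1.0

# Forward pass
z = h @ w2                         # -1.28
y_hat = sigmoid(z)                 # ~0.218, rounded to 0.21 in the text

# Backward pass, step for step as in Figure 5
dE_dy = (y_true - y_hat) * y_hat * (1.0 - y_hat)   # ~0.133 (text: 0.131)
dE_dz = y_hat * (1.0 - y_hat) * dE_dy              # ~0.0227 (text: 0.0217)
dE_dw2 = h * dE_dz                                 # ~[-0.018, 0.027]

# Output-layer weight changes use the pre-sigmoid products H * W
y_h = h * w2                                       # [-0.56, -0.72]
delta_w2 = eta * dE_dy * y_h                       # ~[-0.037, -0.048]
w2_new = w2 + delta_w2                             # ~[0.663, -0.648] (text: 0.664, -0.647)

# Input-layer weight changes: eta * hidden-unit error * input value
delta_w1 = eta * np.outer(x, dE_dw2)               # ~[[-0.073, 0.109], [0.036, -0.054]]

print(y_hat, delta_w2, w2_new)
print(delta_w1)
```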


V. Final Thoughts

         Deep learning is a rapidly developing field, and the above discussion can only serve as the barest of introductions. There is a growing array of deep learning architectures, including convolutional neural networks and recurrent neural networks. The neural net laid out above utilizes a fully connected layer wherein all nodes are connected and all possible information channels are leveraged. However, there is a variety of layer types one can employ. In convolutional neural networks, convolutional layers capture local spatial context in the input, while pooling layers can be used for dimensionality reduction.
         Just as there is a variety of layers, there are multiple activation functions one can choose from. The example above uses a sigmoid function to constrain the output to the (0, 1) interval. However, one can see how small the error signal becomes as it is propagated backwards. This vanishing gradient problem becomes more and more pronounced as more layers are added - becoming especially pernicious in recurrent neural networks. To help address this, engineers and data scientists have turned to other activation functions, including the hyperbolic tangent function and rectified linear units (ReLUs).
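A small sketch makes the vanishing gradient concrete: the sigmoid's derivative never exceeds 0.25, so an error signal shrinks by at least a factor of four at every sigmoid layer it crosses, whereas a ReLU's derivative is exactly one for any positive input:

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)                       # peaks at 0.25 when x = 0

def relu_grad(x):
    return (np.asarray(x) > 0).astype(float)   # exactly 1 for positive inputs

# Best case for the sigmoid (x = 0 everywhere): after ten layers an
# error signal of 1.0 has shrunk by a factor of 0.25 per layer.
print(sigmoid_grad(0.0) ** 10)   # ~9.5e-07 - effectively vanished
print(relu_grad(3.0) ** 10)      # 1.0 - a chain of active ReLUs preserves it
```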

*    *    *

         For those interested in learning more, I highly recommend Udacity's Deep Learning Nanodegree Foundation course, as well as Coursera's Neural Networks for Machine Learning by Professor Geoffrey Hinton at the University of Toronto. In addition, Christopher Olah has an outstanding blog that served as inspiration for this post. Finally, Andrej Karpathy has generously made his lectures available on YouTube.


VI. References