Recurrent Neural Networks (RNN) & Long Short-Term Memory (LSTM) Models

Benjamin S. Knight, February 18th, 2017

             While neural nets excel at pattern recognition, certain types of problems require drawing on information from an observation’s broader context. Deciphering handwritten text in cursive is one such example (Bezerra et al., 2012). The task of identifying an individual character becomes far more tractable if we can also leverage the classifications of the preceding and subsequent characters. With these types of highly contextualized problems, pattern recognition effectively becomes sequence recognition.

             These sequences may be of a time-series nature (e.g., stills from a video feed), but need not be. Examples include sequences of letters, words, or even items within a shopping cart. Recurrent neural networks (RNNs) handle such sequences through the use of dedicated hidden layers. RNNs can be configured in a variety of ways, ranging from a many-to-one network for a task such as sentiment analysis to a one-to-many network for image captioning. Let us designate a specific hidden layer of nodes within the network as h, subscripted by t for the time step (although, again, the elements within the sequence need not form a time series). An RNN then derives an array of weights, f, to be applied to the interactions between the values of the hidden layer at the previous time step and the inputs at the current time step. Then, depending on the number of desired outputs, the values of the hidden layer at the current time step are passed on to the next hidden layer, to the output layer where they are weighted in turn, or both.

Figure 1: Derivation of the RNN Weights

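             As a rough illustration of Figure 1, the NumPy sketch below performs a single recurrent step: the new hidden state is a tanh of the weighted current input plus the weighted previous hidden state, and the output is a weighted read-out of that hidden state. The matrix names (W_xh, W_hh, W_hy), the tanh activation, and the toy dimensions are illustrative assumptions rather than a fixed prescription.

    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
        """One recurrent step: mix the current input x_t with the previous
        hidden state h_prev, then emit the new hidden state and an output."""
        h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)  # hidden state at step t
        y_t = W_hy @ h_t                           # output at step t
        return h_t, y_t

    # Toy dimensions (assumed): 4 input features, 8 hidden units, 2 outputs.
    rng = np.random.default_rng(0)
    W_xh = rng.normal(size=(8, 4))
    W_hh = rng.normal(size=(8, 8))
    W_hy = rng.normal(size=(2, 8))

    h = np.zeros(8)                     # h_0: the initial hidden state
    sequence = rng.normal(size=(5, 4))  # five time steps, four features each
    for x_t in sequence:                # unroll the recurrence over the sequence
        h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy)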


             As an example, imagine that we want to create a model that takes a word as an input and predicts whether the word is associated with positive feelings, i.e., sentiment analysis. Setting aside issues of capitalization and grammar, we can create a network with twenty-six inputs (the letters of the alphabet) and one output (a ‘yes’/’no’ positive-sentiment boolean). If we wanted a network that could accommodate words three letters in length, then we would need three hidden layers - one layer for every element in the sequence. Figure 2 depicts what such a network would look like after being abridged to support only three inputs.


Figure 2: A Many-to-One RNN with Sequences Three Elements in Length

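             To make these dimensions concrete, below is a minimal sketch of the many-to-one network in Figure 2: each letter is one-hot encoded as a 26-element vector, the hidden state is updated once per letter, and only the final hidden state feeds the single positive-sentiment output. The hidden size of sixteen, the sigmoid read-out, and the variable names are assumptions made for the example.

    import numpy as np

    rng = np.random.default_rng(1)
    n_features, n_hidden = 26, 16      # 26 letters in, 16 hidden units (assumed)

    W_xh = rng.normal(size=(n_hidden, n_features))
    W_hh = rng.normal(size=(n_hidden, n_hidden))
    w_out = rng.normal(size=n_hidden)  # weights feeding the single output node

    def encode(word):
        """One-hot encode a lowercase word as a (sequence length, 26) array."""
        return np.eye(n_features)[[ord(c) - ord('a') for c in word]]

    def predict_sentiment(word):
        h = np.zeros(n_hidden)                   # initial hidden state
        for x_t in encode(word):                 # one recurrent step per letter
            h = np.tanh(W_xh @ x_t + W_hh @ h)
        return 1 / (1 + np.exp(-w_out @ h))      # sigmoid: P(positive sentiment)

    print(predict_sentiment("joy"))              # weights are untrained, so ~random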


             Because we are modeling the interdependencies of the inputs not just relative to one another but also relative to their place within the sequence, an RNN will inevitably be more complex than a standard network. The three dimensions of the input are listed below (a brief shape sketch follows the list):

1. Number of Features:
             The number of possible values that an element within the sequence can assume.
2. Sequence Length:
             The number of time steps the network must accommodate - this corresponds to the number of
             hidden layers.
3. Batch Size:
             The number of observations propagated through the network at a given time. A full batch
             tends to yield the most accurate estimate of the gradient, whereas mini-batches tend to be
             less accurate but also less computationally expensive.
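             These three dimensions map directly onto the shape of the input array. The sketch below packs a mini-batch of one-hot encoded, three-letter words into a (batch size, sequence length, number of features) array; the sizes and the dimension ordering are illustrative, and some libraries expect the sequence dimension first.

    import numpy as np

    batch_size, sequence_length, num_features = 32, 3, 26   # assumed sizes

    # One one-hot-encoded, three-letter word per row of the mini-batch.
    batch = np.zeros((batch_size, sequence_length, num_features))
    batch[0] = np.eye(num_features)[[2, 0, 19]]   # e.g. the word "cat"

    print(batch.shape)   # (32, 3, 26)
    print(batch.size)    # 2496 values in a single mini-batch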


Figure 3: Required Inputs and Data Volume of a Standard Neural Network Versus a RNN

The data volume and computational cost of a recurrent neural network far exceed those of a conventional neural network.


The Vanishing Gradient Problem

             Another disadvantage of RNNs is that they are typically unable to discern patterns between elements that lie far apart (ten time steps or more) within a sequence. This is due to the vanishing gradient problem, a problem that afflicts any many-layered network, not just RNNs. First identified by Sepp Hochreiter (Hochreiter, 1991), the vanishing gradient problem occurs during backpropagation.
             Neural networks typically utilize hyperbolic tangent activation functions, whose derivatives lie in the range (0, 1]. During backpropagation, the chain rule is invoked, with the effect of multiplying these at-most-one (and usually much smaller) values n times. The end result is the exponential decay of the gradient (the error signal) as backpropagation proceeds from the output layers back to the front layers. In practical terms, the front layers of a many-layered network will tend to train very slowly, to the point of impracticality.
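             The decay is easy to reproduce numerically. The sketch below multiplies together the tanh derivatives encountered while backpropagating through a chain of n activations (the weight matrices are omitted for simplicity); the surviving gradient shrinks roughly exponentially as n grows.

    import numpy as np

    def surviving_gradient(n_steps, rng):
        """Product of the tanh derivatives along an n-step chain of activations.
        Each factor is 1 - tanh(z)^2, which lies in (0, 1], so the product
        decays as the chain grows (weight matrices omitted for simplicity)."""
        z = rng.normal(size=n_steps)          # pre-activations at each step
        return np.prod(1 - np.tanh(z) ** 2)   # chain rule: multiply the factors

    rng = np.random.default_rng(42)
    for n in (1, 5, 10, 20):
        print(n, surviving_gradient(n, rng))  # shrinks rapidly as n grows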

Long Short-Term Memory (LSTM) Networks

Figure 4: Walk-Through of a LSTM Cell

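             As a rough companion to Figure 4, the sketch below steps once through a standard LSTM cell: the forget, input, and output gates are sigmoids that control what is erased from, written to, and read out of the cell state, and it is this gated cell state that allows useful gradients to survive across long sequences. The gate names and toy dimensions are conventional choices for illustration rather than a transcription of the figure.

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, U, b):
        """One LSTM step. W, U, and b hold the input weights, recurrent weights,
        and biases for the forget (f), input (i), output (o), and candidate (g)
        transforms, keyed by those letters."""
        f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
        i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
        o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
        g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # candidate values
        c_t = f * c_prev + i * g     # cell state: keep some old, add some new
        h_t = o * np.tanh(c_t)       # hidden state exposed to the next layer
        return h_t, c_t

    # Toy dimensions (assumed): 26 input features, 16 hidden units.
    rng = np.random.default_rng(7)
    W = {k: rng.normal(size=(16, 26)) for k in 'fiog'}
    U = {k: rng.normal(size=(16, 16)) for k in 'fiog'}
    b = {k: np.zeros(16) for k in 'fiog'}

    h, c = np.zeros(16), np.zeros(16)
    for x_t in np.eye(26)[[2, 0, 19]]:       # the word "cat", one-hot encoded
        h, c = lstm_step(x_t, h, c, W, U, b)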

References