LSTM Network

A neural network with long short-term memory

People do not start thinking from scratch at every moment. When reading an article, you understand each word based on your understanding of the previous words. Thoughts persist and build on one another. LSTM networks rely on this principle.

Simple neural networks cannot do this, which is a major shortcoming. Imagine that you want to classify, in real time, the events happening in a movie. It is unclear how an ordinary neural network could use its knowledge of earlier events to reason about later ones.

Recurrent neural network (RNN)

Recurrent neural networks address this problem: they contain loops that allow information to persist.

A loop in an RNN

In the diagram above, a chunk of neural network, A, takes an input x and outputs a value h. The loop allows information to be passed from one step of the network to the next.

Because of these loops, RNNs look a bit mysterious. On closer inspection, however, they turn out to be not all that different from an ordinary neural network. An RNN can be thought of as multiple copies of the same network, each passing a message to its successor. Consider what happens when the loop is unrolled:

An unrolled RNN

The unrolled, chain-like structure shows that recurrent neural networks are closely related to sequences and lists. It is the natural neural network architecture for this kind of data.
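The idea of "multiple copies of the same network, each passing a message to its successor" can be sketched in a few lines of NumPy. This is only an illustration under assumptions of its own: the function names `rnn_step` and `run_rnn`, the tanh nonlinearity for module A, and the sizes and random weights are not taken from the article.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One copy of module A: combine the current input with the previous output."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

def run_rnn(xs, h0, W_x, W_h, b):
    """Unroll the loop: the same module (same weights) is applied at every step."""
    h, outputs = h0, []
    for x_t in xs:
        h = rnn_step(x_t, h, W_x, W_h, b)  # h carries the message to the next copy
        outputs.append(h)
    return np.stack(outputs)

# Hypothetical sizes and random weights, chosen only for the demonstration.
rng = np.random.default_rng(0)
input_size, hidden_size, steps = 3, 4, 5
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)
xs = rng.normal(size=(steps, input_size))

hs = run_rnn(xs, np.zeros(hidden_size), W_x, W_h, b)
print(hs.shape)  # (5, 4)
```

Each row of `hs` is the output h for one element of the input sequence, which is why this architecture fits sequence data so naturally.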

And they are certainly used! In recent years RNNs have achieved remarkable success in areas such as speech recognition, language modeling, translation, and image captioning, and the list goes on. LSTMs have been essential to these successes. They are a special kind of recurrent neural network that, for many tasks, works much better than the standard version. Almost all notable results based on recurrent neural networks have been achieved with LSTMs.

The problem of long-term dependencies

One appeal of RNNs is that they can connect previously seen information to the present task. For example, earlier video frames can inform the analysis of the current frame.

Sometimes only recent information is needed to complete a task. Consider a language model that tries to predict the next word based on the previous words. No additional context is needed to predict the last word in the phrase "clouds in the sky": it is fairly obvious that the word will be "sky". When the gap between the relevant information and the place where it is needed is small, an RNN can learn to use the past information.

But sometimes more context is needed. Consider trying to predict the last word in the text "I grew up in Germany ... I speak German fluently." The recent words suggest that the next word is probably the name of a language, but to name the correct language, the earlier mention of Germany must be taken into account. The gap between the relevant information and the point where it is needed can become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

In theory, RNNs are perfectly capable of handling such "long-term dependencies": a person could carefully pick the network parameters by hand to solve toy problems of this kind. Unfortunately, in practice, RNNs do not seem able to learn them. The problem was studied by Hochreiter (1991) and Bengio et al. (1994), who described the fundamental limitations of RNNs. Fortunately, LSTMs do not have this problem!
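One commonly cited reason for this difficulty is that the training signal (the gradient) passed back through many steps tends to shrink. The toy sketch below illustrates the effect with a hypothetical scalar RNN, h_t = tanh(w * h_{t-1} + x_t). The weight w = 0.9, the random inputs, and the function name are arbitrary assumptions, and the sketch is not a reproduction of the analyses cited above.

```python
import numpy as np

def gradient_through_time(w, xs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x_t in xs:
        h = np.tanh(w * h + x_t)
        grad *= w * (1.0 - h ** 2)  # chain rule: one factor per time step
    return grad

rng = np.random.default_rng(0)
xs = rng.normal(scale=0.5, size=50)  # hypothetical input sequence

short_gap = abs(gradient_through_time(0.9, xs[:5]))  # information 5 steps back
long_gap = abs(gradient_through_time(0.9, xs))       # information 50 steps back
print(short_gap, long_gap)
```

Every factor has magnitude at most |w|, so with |w| < 1 the influence of an input 50 steps back is bounded by 0.9^50, which is roughly 0.005, while an input 5 steps back can still have a sizeable effect.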

LSTM networks

LSTM (Long Short-Term Memory) networks are a kind of recurrent neural network that can learn long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997) and later refined and popularized by other researchers. They work well on a wide range of tasks and are still widely used.

LSTMs were designed explicitly to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating neural network modules. In a standard RNN, this repeating module has a very simple structure, such as a single tanh layer.

The repeating module in a standard RNN contains a single layer

The repeating module in an LSTM contains four interacting layers

The details are not important at the moment. Let us first introduce the notation used in the diagrams.

Each line carries an entire vector. The pink circles represent pointwise operations, such as vector addition. The yellow boxes are learned neural network layers. Merging lines denote concatenation of vectors, and a forking line denotes a vector being copied, with the copies going to different places.

How the LSTM network works

The key concept of the LSTM is the cell state: the horizontal line running along the top of the diagram.

The cell state can be compared to a conveyor belt: it runs straight through the entire chain and undergoes only minor linear changes, so information can easily flow along it unchanged.

The LSTM can remove information from the cell state or add information to it, as needed. This is regulated by carefully designed structures called gates.

A gate is a way to optionally let information through. It consists of a sigmoid neural network layer and a pointwise multiplication operation.

The sigmoid layer outputs numbers between zero and one that describe how much of each component should be let through. A value of 0 means "let nothing through", and a value of 1 means "let everything through".
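The gate mechanism can be sketched as a sigmoid followed by a pointwise multiplication. In a real network the values fed into the sigmoid come from a learned layer; here they are hand-picked, and the names `gate` and `gate_logits` are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    """Squash each entry of z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def gate(information, gate_logits):
    """Scale each component of `information` by a factor between 0 and 1."""
    return sigmoid(gate_logits) * information

information = np.array([2.0, -1.0, 0.5])
gate_logits = np.array([-20.0, 0.0, 20.0])  # nearly closed, half open, nearly open

print(gate(information, gate_logits))  # approximately [0, -0.5, 0.5]
```

The first component is almost completely blocked, the second is halved, and the third passes through almost unchanged.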

Step-by-step walk-through of the LSTM network

An LSTM has three such gates to protect and control the cell state. They are examined in more detail below.

Forget gate layer

The first step is to decide which information to throw away from the cell state. This decision is made by a sigmoid layer called the "forget gate layer". It looks at the previous output h_{t-1} and the current input x_t and outputs a number between 0 and 1 for each number in the cell state C_{t-1}. A 1 means "keep this completely", and a 0 means "delete this completely".

Let us return to the language model example, which tries to predict the next word based on all previous words. In such a task, the cell state might include the gender of the current subject so that the correct pronouns can be used. When a new subject appears, the gender of the previous subject should be forgotten.
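A minimal sketch of the forget gate layer, assuming the common formulation f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f), where the previous output and the current input are concatenated into one vector. The weight names, sizes, and random values are assumptions made for the demonstration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    """Return f_t: one value between 0 and 1 for each entry of the cell state."""
    return sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3
W_f = rng.normal(size=(hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

f_t = forget_gate(np.zeros(hidden_size), rng.normal(size=input_size), W_f, b_f)
print(f_t)  # four values strictly between 0 and 1
```

Entries of `f_t` close to 1 keep the corresponding cell state value; entries close to 0 erase it.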

Input gate layer

The next step is to decide which new information will be stored in the cell state. This takes place in two stages. First, a sigmoid layer called the "input gate layer" decides which values will be updated. Then a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the cell state. In the following step, these two are combined to produce an update to the state.

In the language model example, the gender of the new subject is added to the cell state to replace the gender of the old subject.
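The two stages can be sketched as two small layers that read the same concatenated vector [h_{t-1}, x_t]. The weight names (`W_i`, `W_c`), sizes, and random values are illustrative assumptions, not values from the article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate_and_candidate(h_prev, x_t, W_i, b_i, W_c, b_c):
    """Return (i_t, c_tilde): how much to update, and the proposed new values."""
    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ z + b_i)      # sigmoid layer: values between 0 and 1
    c_tilde = np.tanh(W_c @ z + b_c)  # tanh layer: candidates between -1 and 1
    return i_t, c_tilde

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3
shape = (hidden_size, hidden_size + input_size)
W_i, W_c = rng.normal(size=shape), rng.normal(size=shape)
b_i, b_c = np.zeros(hidden_size), np.zeros(hidden_size)

i_t, c_tilde = input_gate_and_candidate(
    np.zeros(hidden_size), rng.normal(size=input_size), W_i, b_i, W_c, b_c
)
print(i_t, c_tilde)
```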

New cell state

Now the previous cell state C_{t-1} is updated to give the new cell state C_t. The previous steps already decided what to do; it only remains to actually do it.

First, the old state is multiplied by f_t, which discards the information that was selected to be forgotten. Then i_t * C̃_t is added. These are the new candidate values, scaled by how much each state value was chosen to be updated.

In the language model example, this is the point at which the information about the old subject's gender is actually dropped and the new information is added.
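The update amounts to two pointwise multiplications and one addition, C_t = f_t * C_{t-1} + i_t * C̃_t. The small worked example below uses hand-picked gate values (an assumption, chosen to make the three typical cases visible) rather than the outputs of trained layers.

```python
import numpy as np

def update_cell_state(c_prev, f_t, i_t, c_tilde):
    """C_t = f_t * C_{t-1} + i_t * c_tilde, applied entry by entry."""
    return f_t * c_prev + i_t * c_tilde

c_prev = np.array([1.0, -2.0, 0.5])   # old cell state
f_t = np.array([1.0, 0.0, 0.5])       # keep, forget, keep half
i_t = np.array([0.0, 1.0, 0.5])       # add nothing, add fully, add half
c_tilde = np.array([0.9, 0.3, -1.0])  # candidate values

c_t = update_cell_state(c_prev, f_t, i_t, c_tilde)
print(c_t)  # entries: 1.0 (kept), 0.3 (replaced), -0.25 (blended)
```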

Finally, it must be decided what to output. The output is based on the cell state, but is a filtered version of it. First, a sigmoid layer decides which parts of the cell state are to be output. Then the cell state is passed through tanh (to push all values into the interval [-1, 1]) and multiplied by the output of the sigmoid gate, so that only the selected parts are output.

In the language model example, since the network has just seen a subject, it may want to output information relevant to a verb, in case that is what comes next. For example, it might output whether the subject is singular or plural, so that the verb can be conjugated correctly.
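Putting the steps together, one full step of a standard LSTM can be sketched as follows. This is a minimal illustration, not a reference implementation: the parameter dictionary, the helper `init_params`, the sizes, and the random initialization are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: forget, store, update the cell state, then output."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])      # input gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate values
    c_t = f_t * c_prev + i_t * c_tilde                    # new cell state
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])      # output gate
    h_t = o_t * np.tanh(c_t)                              # filtered cell state
    return h_t, c_t

def init_params(input_size, hidden_size, rng):
    """Hypothetical helper: small random weights and zero biases for each layer."""
    params = {}
    for name in "fico":
        params[f"W_{name}"] = rng.normal(
            scale=0.1, size=(hidden_size, hidden_size + input_size)
        )
        params[f"b_{name}"] = np.zeros(hidden_size)
    return params

rng = np.random.default_rng(0)
params = init_params(input_size=3, hidden_size=4, rng=rng)
h, c = np.zeros(4), np.zeros(4)
for x_t in rng.normal(size=(6, 3)):  # a sequence of six input vectors
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)  # (4,) (4,)
```

Note that the cell state `c` is changed only by a pointwise multiplication and an addition, which corresponds to the "conveyor belt" with small linear changes, whereas the output `h` is always a gated, squashed view of it.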

LSTM variants

The scheme described above is the standard LSTM. Not all LSTMs are the same, however. In fact, almost every paper uses a slightly different version. The differences are minor, but some of them are worth mentioning.

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), adds "peephole connections": the gate layers are allowed to look at the cell state.

In the diagram above, every gate has a peephole, but many papers give peepholes to only some of the gates.
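A gate that can look at the cell state (often called a peephole connection) can be sketched by adding the cell state to the vector the gate reads. The concatenated form used here, sigmoid(W · [C, h_{t-1}, x_t] + b), is one common way of writing it and is an assumption of this sketch, as are the names, sizes, and random weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_gate(c, h_prev, x_t, W, b):
    """A gate that reads the cell state c in addition to h_prev and x_t."""
    return sigmoid(W @ np.concatenate([c, h_prev, x_t]) + b)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3
W_f = rng.normal(size=(hidden_size, 2 * hidden_size + input_size))
b_f = np.zeros(hidden_size)
h_prev, x_t = np.zeros(hidden_size), rng.normal(size=input_size)

# Same h_prev and x_t, different cell states: the gate now responds to the cell.
f_small = peephole_gate(np.zeros(hidden_size), h_prev, x_t, W_f, b_f)
f_large = peephole_gate(np.full(hidden_size, 3.0), h_prev, x_t, W_f, b_f)
print(f_small, f_large)
```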

Another variant uses coupled forget and input gates. Instead of deciding separately what to forget and what new information to add, these decisions are made together. Information is forgotten only when something new is to be written in its place, and new values are written to the state only when something older is forgotten.
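With coupled gates, the amount written is simply whatever was not kept, so the update can be sketched as C_t = f_t * C_{t-1} + (1 - f_t) * C̃_t. The values below are hand-picked for illustration and the function name is an assumption.

```python
import numpy as np

def coupled_update(c_prev, f_t, c_tilde):
    """Write new values only where, and as much as, old ones are forgotten."""
    return f_t * c_prev + (1.0 - f_t) * c_tilde

c_prev = np.array([2.0, 2.0, 2.0])
f_t = np.array([1.0, 0.0, 0.25])        # keep all, forget all, keep a quarter
c_tilde = np.array([-1.0, -1.0, -1.0])  # candidate values

print(coupled_update(c_prev, f_t, c_tilde))  # entries: 2.0, -1.0, -0.25
```

Compared with the standard update, the separate input gate disappears, so there is one fewer sigmoid layer to learn.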

A somewhat more substantial variation on the LSTM is the Gated Recurrent Unit, or GRU for short, introduced by Cho et al. (2014). It merges the cell state and the hidden state and makes some other changes. The resulting model is simpler than standard LSTM models and has become increasingly popular.
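For comparison, a GRU step can be sketched as below. It follows the commonly stated formulation with an update gate z_t and a reset gate r_t and omits bias terms for brevity. Some libraries swap the roles of z_t and 1 - z_t, so the exact convention here is an assumption, as are the names, sizes, and random weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step; there is no separate cell state, only the hidden state."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx)  # update gate: how much of the state to rewrite
    r_t = sigmoid(W_r @ hx)  # reset gate: how much of the old state to consult
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde  # blend old state and candidate

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
shape = (hidden_size, hidden_size + input_size)
W_z, W_r, W_h = (rng.normal(scale=0.1, size=shape) for _ in range(3))

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):
    h = gru_step(x_t, h, W_z, W_r, W_h)
print(h.shape)  # (4,)
```

The final line of `gru_step` plays the same role as coupled forget and input gates: whatever fraction of the old state is dropped is replaced by the candidate.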

These are only a few of the best-known LSTM variants. There are many others, such as the depth-gated RNN of Yao et al. (2015). There are also completely different approaches to handling long-term dependencies, such as the Clockwork RNN of Koutnik et al. (2014).

Which of these variants is best? Do the differences matter? Greff et al. (2015) compare the popular LSTM variants in detail and find that they all perform roughly the same. Jozefowicz et al. (2015) tested more than ten thousand RNN architectures and found some that work better than LSTMs on certain tasks.


Written down as a set of equations, the LSTM looks rather intimidating. Hopefully, stepping through the whole scheme in this post has made it more accessible.

LSTMs were a big step in the development of RNNs. Naturally, another question arises: is there a further step? A common opinion among researchers is: "Yes! The next step is to use the attention mechanism!" How does it work? At each step, the RNN picks which information to look at from some larger collection of information. For example, when an RNN is used to generate a description of an image, it can select a part of the image to look at for each word it outputs. This is exactly what Xu et al. (2015) did, and it may be a good starting point for exploring attention.
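The idea of selecting what to look at can be sketched as "soft" attention: score every candidate region against the current state, turn the scores into weights that sum to one, and take the weighted average. The dot-product scoring used here is a simplifying assumption for illustration and is not necessarily the scoring function of the cited paper; the names and sizes are hypothetical as well.

```python
import numpy as np

def soft_attention(query, features):
    """Return (weights, context) for one query over a set of feature vectors."""
    scores = features @ query                # one relevance score per region
    weights = np.exp(scores - scores.max())  # softmax, shifted for stability
    weights /= weights.sum()
    context = weights @ features             # weighted average of the regions
    return weights, context

# Hypothetical data: 5 image regions with 4 features each, and the RNN state.
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 4))
query = rng.normal(size=4)

weights, context = soft_attention(query, features)
print(weights.round(3), context.shape)
```

Regions with larger weights contribute more to `context`, which is the information the network "views" when producing the next output word.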