Everyday life is full of problems in which information is not constant but changes over time. We communicate through sound waves, which vary from moment to moment, and a film tells its story through a series of pictures. Processing this kind of sequential information is essential in Artificial Intelligence, but traditional neural network architectures are of limited use for it. This post presents the main problems and their solutions, covering Seq2Seq architectures and the Attention Mechanism.

Prerequisites (9-12 minutes): Familiarity with RNN structure and LSTM layer

When the input or output data is sequential (written text, speech, video, translation), Recurrent Neural Networks (RNNs) should be used instead of conventional Fully Connected (FC) neural networks. The entire sequence is not always available at once, which hinders processing, and a conventional FC network does not generalize well across positions. For example, the word “blue” is a color attribute in most cases, but an FC network has to relearn this for every position in the sentence. This makes the model less efficient and increases the computational capacity required for the same level of accuracy.

Traditional RNN structure [1]

Traditional RNN structures are useful for many-to-one problems (such as classifying text by mood), for one-to-many problems, and for many-to-many problems where the input and output sequences have the same length. In the figure, blue represents the outputs and red represents the inputs.
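The following is a minimal sketch of a recurrent forward pass in plain Python, meant only to illustrate that the same weights are reused at every position and that the sequence length is free to vary. The function names, the sizes and the random weights are assumptions made for the example and do not come from any particular library.

```python
import math
import random


def rnn_step(x, h, W_xh, W_hh, b):
    """One recurrent step: h_new = tanh(W_xh x + W_hh h + b)."""
    h_new = []
    for i in range(len(h)):
        total = b[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][j] * h[j] for j in range(len(h)))
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(sequence, W_xh, W_hh, b):
    """Apply the same weights at every position and collect all hidden states."""
    h = [0.0] * len(b)
    states = []
    for x in sequence:
        h = rnn_step(x, h, W_xh, W_hh, b)
        states.append(h)
    return states


# Illustrative sizes and untrained random weights (assumptions for the sketch).
random.seed(0)
input_size, hidden_size = 3, 4
W_xh = [[random.uniform(-0.5, 0.5) for _ in range(input_size)]
        for _ in range(hidden_size)]
W_hh = [[random.uniform(-0.5, 0.5) for _ in range(hidden_size)]
        for _ in range(hidden_size)]
b = [0.0] * hidden_size

short_seq = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
long_seq = short_seq + [[0.0, 0.0, 1.0]] * 3

states_short = run_rnn(short_seq, W_xh, W_hh, b)
states_long = run_rnn(long_seq, W_xh, W_hh, b)

# Many-to-one: only the last hidden state would feed a classifier.
last_state = states_long[-1]
```

For a many-to-one task only `last_state` would be passed on, while a same-length many-to-many task would use every entry of `states_long`.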

Problem classification based on input length [2]

For many-to-many problems where the input and output lengths differ, we must use SEQ2SEQ structures: an encoder network compresses the input sequence into a vector, which is passed to a decoder network that creates the desired output sequence. A typical application is translation: the encoder is responsible for interpreting the English sentence, and the decoder for forming the German sentence. To convey information, SEQ2SEQ architectures use the hidden state vector produced by the encoder at the last time step. This is called the context vector.
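As a rough sketch of the encoder/decoder interface, the plain-Python example below runs an untrained encoder over a token sequence, keeps its last hidden state as the context vector, and lets a decoder emit tokens until an end token or a length limit is reached. The tanh cell stands in for an LSTM, and all names, sizes, token ids and weights are assumptions chosen for illustration; with random weights the output tokens are meaningless.

```python
import math
import random


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def cell(x, h, W_x, W_h):
    """Plain tanh recurrent cell (a stand-in for an LSTM layer)."""
    return [math.tanh(a + b) for a, b in zip(matvec(W_x, x), matvec(W_h, h))]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def encode(token_ids, W_x, W_h):
    """Read the whole input; the last hidden state is the context vector."""
    vocab_size, hidden_size = len(W_x[0]), len(W_x)
    h = [0.0] * hidden_size
    for t in token_ids:
        h = cell(one_hot(t, vocab_size), h, W_x, W_h)
    return h


def decode(context, W_x, W_h, W_out, start_id, end_id, max_len):
    """Start from the context vector and emit tokens until end_id or max_len."""
    vocab_size = len(W_out)
    h, prev, output = context, start_id, []
    for _ in range(max_len):
        h = cell(one_hot(prev, vocab_size), h, W_x, W_h)
        scores = matvec(W_out, h)
        prev = max(range(vocab_size), key=lambda i: scores[i])  # greedy choice
        if prev == end_id:
            break
        output.append(prev)
    return output


# Illustrative vocabulary sizes and untrained weights (assumptions).
rng = random.Random(1)
src_vocab, tgt_vocab, hidden = 6, 5, 8
enc_W_x, enc_W_h = rand_matrix(hidden, src_vocab, rng), rand_matrix(hidden, hidden, rng)
dec_W_x, dec_W_h = rand_matrix(hidden, tgt_vocab, rng), rand_matrix(hidden, hidden, rng)
W_out = rand_matrix(tgt_vocab, hidden, rng)

context_short = encode([1, 2], enc_W_x, enc_W_h)
context_long = encode([1, 2, 3, 4, 5, 0, 2], enc_W_x, enc_W_h)
output = decode(context_long, dec_W_x, dec_W_h, W_out,
                start_id=0, end_id=1, max_len=10)
```

Note that `context_short` and `context_long` have the same size even though the inputs differ in length, and that the output length is decided by the decoder rather than by the input.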

There are basically two ways to pass the encoder’s context vector to the decoder. These are briefly described below:

  1. The context vector is fed, unchanged, to the input of the decoder network at every time step. In this case, the internal state variables of the decoder network are initialized randomly. This method is less common.
  2. The context vector is used as the initial internal state of the decoder network. During prediction, the decoder’s input at each time step is its own estimated output from the previous time step (except at the first time step, where a special start character is used). During training, the decoder’s input is the target sequence shifted by one time step, beginning with the start character. This is known as teacher forcing. A drawback of teacher forcing is that at test time the decoder receives its own noisy predictions as input, which it never saw during training. One solution is scheduled sampling, which randomly switches the decoder’s input during training between the real data (teacher forcing) and the decoder’s own prediction from the previous time step.
2nd case: initial state is the context vector [3]
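The difference between teacher forcing and scheduled sampling comes down to how the decoder inputs are chosen during training, which the sketch below tries to make concrete. The start token, the example sentence and the `toy_predict` function are hypothetical; in particular `toy_predict` is a stateless stand-in for a real decoder step, which would also carry a hidden state.

```python
import random

START = "<s>"  # hypothetical start token


def teacher_forcing_inputs(target):
    """Decoder inputs are the target shifted right by one step, led by START."""
    return [START] + target[:-1]


def scheduled_sampling_inputs(target, predict, p_teacher, rng):
    """At each step feed the true previous token with probability p_teacher,
    otherwise feed the model's own previous prediction."""
    inputs, prev_prediction = [], None
    for t in range(len(target)):
        if t == 0:
            token = START
        elif rng.random() < p_teacher:
            token = target[t - 1]      # teacher forcing for this step
        else:
            token = prev_prediction    # the model's own previous output
        inputs.append(token)
        prev_prediction = predict(token)
    return inputs


def toy_predict(token):
    """Hypothetical stand-in for one decoder step (not a trained model)."""
    return token.upper()


target = ["ich", "bin", "hier", "</s>"]

forced = teacher_forcing_inputs(target)
always_teacher = scheduled_sampling_inputs(target, toy_predict, 1.0, random.Random(0))
never_teacher = scheduled_sampling_inputs(target, toy_predict, 0.0, random.Random(0))
mixed = scheduled_sampling_inputs(target, toy_predict, 0.5, random.Random(0))
```

With `p_teacher=1.0` the inputs coincide with plain teacher forcing, and with `p_teacher=0.0` the decoder sees only its own predictions, as it would at test time. In practice the probability is usually lowered gradually over the course of training.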

Two obstacles arise when processing long sentences with SEQ2SEQ architectures. The first derives from the RNN layers that make up the SEQ2SEQ structure: vanishing or exploding gradients and the resulting loss of information. This can be mitigated by building the SEQ2SEQ structure from LSTM layers instead of plain RNN layers (see Figure [3]). Long Short-Term Memory helps to carry important information from the beginning of the sentence to the context vector at the end. The second problem is caused by the SEQ2SEQ architecture itself, namely the fixed-length context vector (whose dimension equals the number of units in the LSTM layer). The encoder has a single vector in which to embed and pass on the meaning of the whole sentence. This is hard to achieve for long sentences, and human translators do not work that way either. With a fixed-length interface it is difficult to encode the subtle differences between very similar words, and these subtleties are the basis of quality translation. The Attention Mechanism was developed to solve this problem.
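The vanishing and exploding gradient effect can be illustrated with a scalar toy RNN, where the gradient of the last hidden state with respect to the first is a product with one factor per time step. The sketch below assumes the recurrence h_t = tanh(w * h_{t-1} + x_t); the weights 0.5 and 1.5 and the 50 steps are arbitrary values picked for the illustration.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # chain rule: one factor per time step
    return grad


steps = 50
# With zero inputs the hidden state stays at 0, so each factor is exactly w.
vanishing = gradient_through_time(0.5, [0.0] * steps)  # shrinks like 0.5 ** 50
exploding = gradient_through_time(1.5, [0.0] * steps)  # grows like 1.5 ** 50
```

Each factor of the product has magnitude at most |w|, so with |w| < 1 the signal from the start of a long sentence can only fade on its way to the end.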


The two popular Attention Mechanisms are Bahdanau attention [5] and Luong attention [4]. Their innovation is that the models don’t restrict the encoder/decoder interface to the encoder’s context vector. They allow the encoder output at every time step to be considered while translating, not just the last one. This is achieved through a so-called Attention context vector (not to be confused with the traditional SEQ2SEQ context vector, which is only the last encoder state). This vector is calculated from the encodings of the input information (words, letters, pictures, sounds) and can be created in several ways. Human decision making and comprehension work in a similar way: while reading we examine one part of a sentence at a time, and in visual perception we focus on a single object.

Attention is classified by which input encodings are taken into account when calculating the Attention context vector: global Attention uses all of them, while local Attention uses only specific parts. Depending on how the Attention context vector is constructed from the selected encodings, we speak of Soft or Hard Attention.


Soft Attention is calculated by weighting and summing the encodings of the different time steps. The weights can be produced in a variety of ways (for instance, by a conventional FC neural network with Softmax activation, based on the previous hidden state and the given input encodings). Because of the Softmax, the weights sum to 1, so they form a probability distribution over the input encodings. The encodings are then multiplied by their weights and the weighted encodings are added up; the result is the Attention context vector.
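The weighting and summation can be sketched in a few lines of plain Python. The scores here are given by hand rather than produced by a trained network, and the encodings are made-up two-dimensional vectors, so the numbers only illustrate the mechanics.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(scores, encodings):
    """Weighted sum of the encodings, one weight per input time step."""
    weights = softmax(scores)
    dim = len(encodings[0])
    context = [sum(w * enc[d] for w, enc in zip(weights, encodings))
               for d in range(dim)]
    return weights, context


# Three input time steps with made-up 2-dimensional encodings.
encodings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 0.0, 0.0]  # the first time step is scored as most relevant

weights, context = soft_attention(scores, encodings)
uniform_weights, uniform_context = soft_attention([0.0, 0.0, 0.0], encodings)
```

Because every step is a smooth function of the scores, the whole computation is differentiable and can be trained with ordinary backpropagation.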


In Hard Attention, the probability distribution is not used to construct a weighted sum but to control how often each encoding is sampled. The Attention context vector is then derived from the sampled encodings according to the Monte Carlo method. Because the sampling step is not differentiable, this method cannot be trained with plain backpropagation (it can be trained with Reinforcement Learning), so the Soft Attention approach is more widespread.
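A minimal sketch of the sampling step is shown below, assuming the attention weights are already available. The function name, the example weights and the choice to average several samples are assumptions for the illustration; with a single sample the context vector is simply one of the encodings.

```python
import random


def hard_attention(weights, encodings, rng, n_samples=1):
    """Sample time steps according to the attention distribution and
    average the sampled encodings (a Monte Carlo estimate)."""
    picks = rng.choices(range(len(encodings)), weights=weights, k=n_samples)
    dim = len(encodings[0])
    context = [sum(encodings[i][d] for i in picks) / n_samples
               for d in range(dim)]
    return picks, context


encodings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.2, 0.5, 0.3]  # made-up attention distribution

# One sample: the context vector is exactly one of the encodings.
picks, context = hard_attention(weights, encodings, random.Random(0))

# Many samples: the average approaches the Soft Attention weighted sum.
_, estimate = hard_attention(weights, encodings, random.Random(0), n_samples=5000)
```

The call to `rng.choices` is the step that has no gradient with respect to the weights, which is why Hard Attention needs a different training signal than Soft Attention.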

The context vector can then be added to an LSTM network in various ways; some examples are presented below. The notation has been standardized for easier comparison.



Simple Attention [6]

Only the Attention context vector is fed to the input of the decoder. This model resembles the first SEQ2SEQ structure described above, with the difference that the context vector is not static but recalculated at every time step with the help of Attention. The weights are produced by an FC neural network based on the decoder’s previous hidden state and the encodings.
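One possible form of such an FC scoring network is sketched below: a single tanh hidden layer that combines the previous decoder state with each encoding, followed by a Softmax over the time steps. This particular form (the additive score used by Bahdanau et al.) is an assumption for the example, as are the matrix names `W_s`, `W_h`, the vector `v`, the sizes and the random untrained weights.

```python
import math
import random


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fc_score(prev_state, encoding, W_s, W_h, v):
    """One-hidden-layer FC scorer: v . tanh(W_s s_prev + W_h h_i)."""
    hidden = [math.tanh(a + b)
              for a, b in zip(matvec(W_s, prev_state), matvec(W_h, encoding))]
    return sum(vi * hi for vi, hi in zip(v, hidden))


def attention_context(prev_state, encodings, W_s, W_h, v):
    """Score every encoding against the previous decoder state,
    normalize the scores, and return the weighted sum."""
    scores = [fc_score(prev_state, enc, W_s, W_h, v) for enc in encodings]
    weights = softmax(scores)
    dim = len(encodings[0])
    context = [sum(w * enc[d] for w, enc in zip(weights, encodings))
               for d in range(dim)]
    return weights, context


# Illustrative sizes and untrained random parameters (assumptions).
rng = random.Random(42)
enc_dim, dec_dim, attn_dim, n_steps = 3, 4, 5, 4
W_s = [[rng.uniform(-1, 1) for _ in range(dec_dim)] for _ in range(attn_dim)]
W_h = [[rng.uniform(-1, 1) for _ in range(enc_dim)] for _ in range(attn_dim)]
v = [rng.uniform(-1, 1) for _ in range(attn_dim)]
encodings = [[rng.uniform(-1, 1) for _ in range(enc_dim)] for _ in range(n_steps)]

# Two different decoder states yield two different context vectors.
state_a = [rng.uniform(-1, 1) for _ in range(dec_dim)]
state_b = [rng.uniform(-1, 1) for _ in range(dec_dim)]
weights_a, context_a = attention_context(state_a, encodings, W_s, W_h, v)
weights_b, context_b = attention_context(state_b, encodings, W_s, W_h, v)
```

Since the decoder state changes at every time step, calling `attention_context` inside the decoding loop gives a fresh context vector each time, in contrast to the static context vector of the plain SEQ2SEQ model.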



In this model, besides the Attention context vector, the value predicted at the previous time step is also fed to the input of the decoder’s LSTM network (the word ‘comment’ in the figure). The function f creates the attention weights.


Attention, context on the shared input [7]


This model’s output isn’t given directly by the usual LSTM decoder; rather, it is calculated from an Attention vector by an FC neural network. The Attention vector is the concatenation of the Attention context vector and the decoder output.

To create the weights needed to produce the Attention context vector, we use the current decoder state (note the difference: the previous models used the previous state) and all the encodings. Here, the input of the decoder’s LSTM is the previous Attention context vector and the previously predicted word. In this model, the Attention mechanism affects both the input and the output.
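The output side of this model can be sketched as follows. The example assumes a dot-product score between the current decoder state and each encoding (one of the scoring options in Luong et al., which requires both to have the same dimension) and a single linear layer with Softmax as the FC network. The name `W_out`, the sizes and the values are made up for the illustration.

```python
import math


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def luong_style_output(dec_state, encodings, W_out):
    """Attend with the CURRENT decoder state, then predict from the
    concatenation of the context vector and the decoder state."""
    # Dot-product score (an assumption; other score functions exist).
    scores = [sum(a * b for a, b in zip(dec_state, enc)) for enc in encodings]
    weights = softmax(scores)
    dim = len(encodings[0])
    context = [sum(w * enc[d] for w, enc in zip(weights, encodings))
               for d in range(dim)]
    attention_vector = context + dec_state  # list concatenation
    word_probs = softmax(matvec(W_out, attention_vector))
    return weights, attention_vector, word_probs


# Made-up encodings and a decoder state that points at the second one.
encodings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
dec_state = [0.0, 3.0, 0.0]

# Hypothetical output layer for a 2-word vocabulary (6 inputs, 2 outputs).
W_out = [[0.1, 0.2, 0.3, 0.0, 0.5, 0.0],
         [0.3, 0.2, 0.1, 0.0, -0.5, 0.0]]

weights, attention_vector, word_probs = luong_style_output(dec_state, encodings, W_out)
```

Here the second encoding receives the largest weight because it is most aligned with the decoder state, and `word_probs` depends on both the attended input and the decoder’s own state.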


Luong Attention [8]


The first popular model. The Attention context vector is calculated by an FC network using the decoder state from the previous time step and the input encodings. In contrast to the simplified Attention Mechanisms, here the Attention context vector is concatenated with the previous decoder state at every time step and then fed to the LSTM decoder layer. The decoder also receives its own output estimate from the previous time step, as in language models.


Bahdanau Attention [9]

To compare these models, in the next post we will build spelling-corrector neural networks and use them to demonstrate the accuracy of the different models.






[4] Thang Luong et al., Effective Approaches to Attention-based Neural Machine Translation, 2015

[5] Dzmitry Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate, 2014




