Transformers vs. LSTMs

Mehul Goel
4 min read · Mar 15, 2020

Starting with RNN

When RNNs were introduced, they added some spice to the basic neural network. Earlier, a vanilla neural network took fixed-size inputs, which was a problem when we wanted to feed in sequence-style input with no size limit or predetermined length.

An RNN, on the other hand, can take sequential input with no fixed or predetermined size.

Don’t confuse this with simply calling a vanilla neural network repeatedly over a sequence.

The two are different: in a sequence, the inputs are interconnected, so the first input influences the result produced for the second input.

So we need something that captures this relationship across inputs meaningfully, which is exactly what RNNs do.

An RNN remembers the past, and its further decisions are influenced by what it has learnt from it.

Note: Basic feed-forward networks “remember” things too, but only what they learnt during training.

The model works roughly like this:
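To make the idea concrete, here is a minimal sketch of a single recurrent step in plain NumPy. The tanh activation and the toy sizes (input_size=4, hidden_size=8) are illustrative assumptions, not details of any particular model.

```python
import numpy as np

# Minimal sketch of an RNN step (illustrative sizes, not a real model).
input_size, hidden_size = 4, 8
rng = np.random.default_rng(0)

W_xh = rng.normal(size=(hidden_size, input_size)) * 0.1   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_size, hidden_size)) * 0.1  # hidden-to-hidden weights
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time step: the new hidden state mixes the current input with the
    previous hidden state, which is how the past influences the present."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Process a sequence of arbitrary length, one step at a time.
sequence = [rng.normal(size=input_size) for _ in range(5)]
h = np.zeros(hidden_size)
for x_t in sequence:
    h = rnn_step(x_t, h)
print(h.shape)  # (8,)
```

Notice that the loop is inherently sequential: step t cannot start until step t-1 has finished, which is exactly the speed problem discussed next.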

Problems with RNN

It can’t remember over long distances (the vanishing gradient problem).

Processing is slow and training time is high, because it works through the sequence step by step.

Entry of LSTM

LSTMs are a modified version of the RNN that makes it easier to keep past data in memory. They address the vanishing gradient problem with a separate memory cell at each step, which lets the network remember things for longer. That makes LSTMs a better option for sequential problems that depend on data from far back, such as time-series prediction with time lags of unknown duration.

An LSTM has three gates: forget, input, and output.
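Here is a rough sketch of one LSTM step showing the three gates and the separate memory cell. The weight shapes and random initialisation are illustrative assumptions, not values from the article.

```python
import numpy as np

# Minimal sketch of one LSTM step with forget, input, and output gates.
hidden_size, input_size = 8, 4
rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix and bias per gate, plus one for the candidate cell state.
W = {name: rng.normal(size=(hidden_size, input_size + hidden_size)) * 0.1
     for name in ("forget", "input", "output", "candidate")}
b = {name: np.zeros(hidden_size) for name in W}

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(W["forget"] @ z + b["forget"])       # what to erase from memory
    i = sigmoid(W["input"] @ z + b["input"])         # what new info to write
    o = sigmoid(W["output"] @ z + b["output"])       # what to expose as output
    c_tilde = np.tanh(W["candidate"] @ z + b["candidate"])
    c = f * c_prev + i * c_tilde                     # the separate memory cell
    h = o * np.tanh(c)
    return h, c

h = c = np.zeros(hidden_size)
h, c = lstm_step(rng.normal(size=input_size), h, c)
```

The additive update of the memory cell (c) is what lets gradients flow over longer distances than in a plain RNN.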

Problems with LSTM

It is more complex and requires more computational power.

Slow processing and high training times remain, since it still works through the sequence step by step.

Transformers

Transformers turn this upside down. In sequential models like RNNs and LSTMs we pass inputs one by one, and word embeddings are generated one time step at a time. In a Transformer there is no concept of time steps: we pass all the inputs together and compute the word embeddings simultaneously.

Understanding Positional Encoders

A positional encoding is a vector that gives a word context based on its position in the sentence, since the same word can carry a different meaning at different positions.
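A common concrete choice is the sinusoidal positional encoding from the original Transformer paper; the sketch below assumes that scheme, with toy values for seq_len and d_model.

```python
import numpy as np

# Sketch of sinusoidal positional encoding (toy sizes for illustration).
def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(d_model)[None, :]             # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])          # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])          # odd dimensions use cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16)
```

Each position gets a unique pattern of values, so the model can tell “first word” from “fifth word” even though all words are processed at once.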

Attention is a technique used in neural networks. For RNNs, instead of encoding the whole sentence into a single hidden state, each word has a corresponding hidden state that is passed all the way to the decoding stage. Those hidden states are then used at each step of the RNN to decode.
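Inside the Transformer itself, attention takes the form of scaled dot-product attention over queries, keys, and values. A minimal sketch, with made-up 3×4 toy matrices rather than real embeddings:

```python
import numpy as np

# Sketch of scaled dot-product attention (the form used in the Transformer).
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # how much each word attends to every other word
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted mix of the value vectors

rng = np.random.default_rng(2)
Q = K = V = rng.normal(size=(3, 4))     # 3 words, 4-dim vectors; Q=K=V gives self-attention
out = attention(Q, K, V)
print(out.shape)  # (3, 4)
```

Because the score matrix is computed in one matrix multiplication, every word looks at every other word at the same time, with no sequential loop.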

The encoder and decoder blocks are actually stacks of multiple identical layers, and the number of encoders is equal to the number of decoders.

Note: the number of encoders and decoders is a hyperparameter.

As the figure below shows, the input goes through all the encoders one after another, and the output of the last encoder is passed to every decoder.
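A toy sketch of that data flow, where run_transformer, the encoder list, and the decoder list are all hypothetical placeholders rather than a real framework API:

```python
# Toy sketch of the encoder/decoder stacking described above.
def run_transformer(x, encoder_layers, decoder_layers, target):
    # The input passes through each encoder in turn.
    memory = x
    for encoder in encoder_layers:
        memory = encoder(memory)

    # Every decoder receives the same final encoder output ("memory").
    y = target
    for decoder in decoder_layers:
        y = decoder(y, memory)
    return y

# Placeholder "layers" just to show the flow; a real layer would contain
# self-attention and feed-forward sublayers.
encoders = [lambda m: m + 1 for _ in range(6)]
decoders = [lambda y, m: y + m for _ in range(6)]
print(run_transformer(0, encoders, decoders, target=0))
```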

In addition to the self-attention and feed-forward layers, each decoder has one more layer: encoder-decoder attention. This helps the decoder focus on the appropriate parts of the input sequence. You can now relate this to the complete diagram of the Transformer below:

Here, as you can see, on both the input and output sides the embeddings are combined with positional encodings, so the model gets the combined advantage of word embeddings and positional information.

You can read up further on self-attention for more detail.

Conclusion

Transformer models are attention-based models. They see the entire sentence as a whole, whereas an LSTM processes the sentence sequentially.

During training, an LSTM has to backpropagate through time, step by step along the sequence, whereas a Transformer has no such sequential dependency, so gradients for all positions can be computed in parallel.

The two take rather different approaches: Transformers are attention-based models that handle relationships between words more naturally, since the attention mechanism looks at all the words at once, whereas LSTMs capture long-term dependencies by carrying a memory forward through the sequence.

Transformers work in a largely parallel manner and can utilize hardware resources better than sequential models.
