Modern LLMs are incredibly complex and built on years of research, but the LLM revolution started with one key development: the transformer. Suddenly, scale was no longer the bottleneck; it was the whole point. Until the Transformer paper was released, most language modelling work was sequential: each token had to be processed after the previous one. Transformers made language modelling entirely parallel, unlocking the ability to fully utilise GPUs by processing all tokens at once rather than one after another, and this scales across single or multiple GPUs.
Rather surprisingly, transformers were originally built for machine translation, and not LLMs like we have today.
Transformers primarily work on the principle of attention. You can imagine attention as the model asking itself "Which parts of this input sequence should I really care about?" and weighting its answer based on those parts. This is what allows it to achieve complete parallelism.
Note: This article is part 2 of 2 in our series on the transformer. It assumes you understand what an embedding is, why it's useful, and what word embeddings and positional embeddings are. Link to part 1: https://techkareer.hashnode.dev/positional-embeddings-an-intuitive-guide
Why Transformers?
Before transformers, most NLP modelling was limited to sequence-to-sequence modelling: old-school models had to process the words of a sentence one at a time, in order, to predict the next ones. Notable examples include Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks.
RNNs
Basic RNNs consisted of a single layer of neurons that fed their outputs back to themselves as inputs. Because gradients have to be propagated back through every time step, they shrink or blow up over long sequences (the vanishing/exploding gradient problem), so sufficiently long sequences simply could not be modelled.
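To make the recurrence concrete, here is a minimal vanilla-RNN sketch in NumPy (the weight names and sizes are illustrative, not from any particular library or from the original article):

```python
import numpy as np

# A minimal vanilla RNN cell processing a toy sequence one token at a time.
rng = np.random.default_rng(0)
hidden_size, input_size = 8, 4
W_x = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights (the "fed back" part)
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time step: the previous hidden state is fed back in alongside the new input."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

sequence = rng.normal(size=(5, input_size))  # 5 toy token vectors
h = np.zeros(hidden_size)
for x_t in sequence:        # strictly sequential: step t cannot start until step t-1 is done
    h = rnn_step(x_t, h)
print(h.shape)              # (8,) - the whole sequence squeezed into one hidden state
```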
LSTMs
LSTMs significantly improved on this with a specialised cell architecture that decides what to remember and what to forget using input, forget, and output gates. But the processing was still sequential, token by token.
The problem with this was that the LSTM encoder converts the entire input sequence into a single fixed-length vector. Think of it as summarizing a whole paragraph into one short sentence.
The decoder then uses only this fixed-length vector (along with the words it has already generated) to produce the translation. This creates an information bottleneck. It's hard to perfectly represent a long, complex sentence in a single vector, leading to degraded performance, especially on longer sentences.
Bahdanau Attention
Finally, Bahdanau attention came along and let the decoder "look back at" (i.e. "attend to") the entire sequence of encoded words rather than the single fixed summary vector.
It's still considered a Seq2Seq model because it maps an input sequence to an output sequence using an encoder-decoder structure, with attention being an addition to it.
This attention mechanism is what paved the way for transformers, which ditch recurrence entirely in favour of self-attention, allowing each word to directly attend to every other word in the sequence simultaneously. Hence the paper's name: "Attention Is All You Need".
Enter Transformers
Transformers ditch recurrence entirely, opting to model solely with attention, where each word maps to every other word. Their primary purpose was machine translation, but the architecture has since been adopted into generative pre-trained transformer models (GPTs).
Notice how in the above example, the word "it" attends to "animal" and "tired", indicating that the model somehow knows that, in this case, "it" refers to the animal and that the animal was too tired.
Let’s walk through the transformer step by step.
- The input sequence, our initial sentence, comes in. Each token is converted into an embedding. For convenience here, we assume that each word is its own token; in reality, tokens are more abstract and a single word may be split into more than one token.
- This creates a tensor of size T × d_model, where T is the number of tokens and d_model is the size of the embeddings we generate. The original paper sets d_model to 512.
- We then generate positional embeddings for the same tokens. Positional embeddings are how transformers tell which words appear where in relation to one another, i.e. they let the model grasp the difference between "the cat ate the rat" and "the rat ate the cat". (Do read part 1 if you have not already.)
- This is the meat of the attention mechanism: multi-head attention. To understand multi-head attention, let's first discuss single-headed attention. Attention is described by the formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where Q, K and V are matrices generated at inference time from the input (using learned weights, as we'll see below), and d_k is the dimension of the keys.
Ok, what are these matrices and numbers?!
Q - Query Matrix. It denotes a question the sequence asks itself.
K - Key Matrix. It denotes who answers these questions.
V - Value Matrix. It holds the actual information each token passes along once we know which keys answer which queries.
Let’s explain with the help of an example. Say the sentence is “The chicken eats the grain”. In this case,
The Query might look like the grain asking “does anything eat me?”
The Key matrix would be a representation of the chicken answering saying, “I eat you!”
The Value matrix being the information the chicken actually contributes to the grain's representation once we know that its key answers the question.
The Softmax function accepts any vector and outputs a probability distribution over it that sums to 1. In our case, we apply softmax to every row of the score matrix, i.e. to each token's row of attention scores, to turn it into a probability distribution. A quick sketch is below.
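As a quick illustration (a sketch, not part of the original post), here is a numerically stable row-wise softmax in NumPy, the same operation we will shortly apply to each row of QKᵀ:

```python
import numpy as np

def softmax_rows(scores):
    """Apply softmax to every row of a matrix so that each row sums to 1."""
    # Subtracting the row-wise max keeps exp() from overflowing; it doesn't change the result.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 0.5]])
print(softmax_rows(scores))               # each row is now a probability distribution
print(softmax_rows(scores).sum(axis=-1))  # [1. 1.]
```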
We will explain this in greater depth, but for now it is imperative you understand what’s going on underneath here:
The input matrix, of dimensions T × d_model, is down-projected into two smaller representations, Q and K, each of dimensions T × d_k, these being the Queries and Keys respectively (usually d_k is far smaller than d_model; we will explore how this down-projection happens in the next section).

This now means that QKᵀ has dimensions T × T, and this is really useful as it allows every token to attend to every other token.
This is the crux of what we’ve really been trying to do so far! Complete parallelism allows each token to directly look at other tokens, meaning we’ve done all our processing in one step!
The generated QKᵀ matrix holds a raw score for how strongly each token relates to every other token. We apply softmax over each row, so that the attention values in each row add up to 1, giving us a probability distribution for each token. Softmax amplifies high values and suppresses small ones, letting us focus on what actually matters.
Note: The division by √d_k is a scaling factor that prevents the values of the multiplication from becoming too large.
Example: for our earlier "The chicken eats the grain" sentence, QKᵀ would be a 5 × 5 matrix of scores, with one row (and one column) per token.
Now, the multiplication by the V matrix (of size T × d_v) occurs, and we have our attention matrix! This lets us see how much each token cares about the others.

Note: You might notice V is of dimensions T × d_v instead of T × d_k. Usually the two are equal, but in some implementations they differ. For our purposes, d_v = d_k.
And all of this gives a single “head” of self-attention.
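Putting the formula into code, here is a minimal single-head scaled dot-product attention sketch in NumPy (the random Q, K, V stand in for real projections; 5 tokens as in the chicken example):

```python
import numpy as np

def softmax_rows(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (T, T): every token scored against every token
    weights = softmax_rows(scores)    # each row becomes a probability distribution
    return weights @ V                # (T, d_v): a weighted mix of value vectors per token

# Toy setup: T = 5 tokens ("The chicken eats the grain"), d_k = d_v = 4.
rng = np.random.default_rng(0)
T, d_k = 5, 4
Q, K, V = (rng.normal(size=(T, d_k)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 4) - one context-mixed vector per token, all computed in parallel
```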
What occurs during training, and how we get Q, K and V
During training, the transformer learns to create appropriate Q, K and V matrices. What actually gets trained are the weight matrices Wq, Wk and Wv.

At inference time, these weights are multiplied with the input embeddings to generate the Q, K and V matrices (Q = I·Wq, K = I·Wk, V = I·Wv, where I is the input matrix). A sketch follows.
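Here is a tiny sketch of that split between training and inference (random values stand in for the trained Wq, Wk, Wv; the names follow the article):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_k = 5, 512, 64

# Learned during training (random placeholders standing in for trained weights).
Wq = rng.normal(size=(d_model, d_k))
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_k))

# At inference time the input embeddings are simply projected with those weights.
I = rng.normal(size=(T, d_model))   # T token embeddings (word + positional)
Q, K, V = I @ Wq, I @ Wk, I @ Wv    # each of shape (T, d_k)
print(Q.shape, K.shape, V.shape)    # (5, 64) (5, 64) (5, 64)
```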
What is Multi-Head attention?
A single head of attention, with a single set of trained Wq, Wk and Wv matrices, might not be the best at capturing all the subtleties of a given input sequence. This is where multi-head attention comes into the picture.
Remember how we were down-projecting the T × d_model input matrix into T × d_k? What if we had h heads, where each head of attention has its own set of weights, i.e.:

$$Q_i = I \cdot Wq_i, \quad K_i = I \cdot Wk_i, \quad V_i = I \cdot Wv_i \qquad i = 1, \dots, h$$

where I is the input matrix. We thus train h different Wq, Wk and Wv matrices, and each head goes through its own unique matmul, generating h different queries, keys and values. The entire self-attention process is calculated for every Q_i, K_i, V_i.
After the self-attention calculation is done for each head, we get h matrices of dimensions T × d_v. These are concatenated and projected back:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) \cdot W_O$$

where W_O is an (h · d_v) × d_model matrix used to ensure the output dimensions are T × d_model.

Note: In the original paper by Vaswani et al., d_v is selected such that h · d_v = d_model. In that case the concatenated heads are already of dimensions T × d_model, so W_O does not need to change the shape; if the dimensions differ, W_O projects the result back to T × d_model.
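Here is a rough multi-head attention sketch (assumptions: h = 8 heads, d_k = d_v = d_model / h as in the paper, and random matrices standing in for trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, h = 5, 512, 8
d_k = d_v = d_model // h                     # 64, as in the original paper

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax_rows(Q @ K.T / np.sqrt(K.shape[-1])) @ V

# One (Wq, Wk, Wv) triple per head, plus the shared output projection W_O.
head_weights = [(rng.normal(size=(d_model, d_k)),
                 rng.normal(size=(d_model, d_k)),
                 rng.normal(size=(d_model, d_v))) for _ in range(h)]
W_O = rng.normal(size=(h * d_v, d_model))

def multi_head_attention(I):
    head_outputs = []
    for Wq, Wk, Wv in head_weights:                              # each head has its own projections
        head_outputs.append(attention(I @ Wq, I @ Wk, I @ Wv))   # (T, d_v)
    concat = np.concatenate(head_outputs, axis=-1)               # (T, h * d_v) = (T, d_model) here
    return concat @ W_O                                          # project back to (T, d_model)

I = rng.normal(size=(T, d_model))
print(multi_head_attention(I).shape)  # (5, 512)
```

Real implementations usually fuse all heads into a single batched matrix multiply rather than looping, but the result is the same.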
Break :)
Congrats on making it so far! Here’s a picture of my cat to cheer you on
Piecing It All Together
After all this, we’re finally ready to tackle the actual transformer itself.
Core Ideas of the Transformer:
- An Encoder stack and a Decoder stack.
- The Encoder provides the Decoder with vector representations of the input.
- The Decoder uses the latest Encoder output and the tokens it has generated so far to predict the next output token.
- A final Linear + Softmax layer converts those vectors back into words.
Encoder Stack
The encoder stack consists of N identical layers; in the original Vaswani et al. paper, N = 6.
Inside One Encoder Layer: Each layer has two main sub-layers:
- Multi-Head Self-Attention Mechanism:
- What it does: Allows each input token to look at (attend to) every other token in the input sequence (including itself). Calculates attention scores.
- Input: The output from the previous layer (or the initial embeddings).
- Output: A set of attention-infused vectors, same sequence length as the input.
- Position-wise Feed-Forward Network (FFN):
- What it does: A simple fully connected feed-forward network (two linear layers with a ReLU in between). It is applied independently to every token coming out of the Attention sub-layer.
- Why we do this: The attention mechanism itself involves a lot of linear combinations (weighted sums). The FFN introduces non-linearity, allowing the model to learn and represent more complex patterns and relationships in language that linear operations alone can NOT capture.
- Input: The output from the attention sub-layer (after Add & Norm).
- Output: Further processed vectors, same sequence length.
- Add & Norm: After each of the two sub-layers (Self-Attention and FFN), there's a residual connection followed by Layer Normalization:
- Residual Connection (Add): The sub-layer's input is added back to its output (output = x + Sublayer(x)). Put simply, it carries the original data forward alongside whatever the sub-layer computed, which helps the model keep a continuous sense of what the data means and makes deep stacks easier to train.
- Layer Normalization (Norm): Ensures the network remains stable. When you pass inputs through many layers, their distribution can change drastically. This keeps the distribution stable, which primarily prevents gradients from vanishing or exploding, (see RNNs).
- The first layer’s output is fed as input to the second layer in the encoder itself. This repeats until the Nth layer.
- Final Encoder Output: The final output of the Encoder is simply the output matrix resulting from the last Encoder layer's Add & Norm step (after the last FFN). Its shape is T × d_model. This will now be used in Encoder-Decoder Attention.
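To tie the encoder sub-layers together, here is a rough sketch of one encoder layer and a stack of N = 6 of them. The attention step is a simplified single-head stand-in, d_ff = 2048 follows the paper, and for brevity every layer reuses the same random placeholder weights:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_ff = 5, 512, 2048

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Normalise each token's vector to zero mean and unit variance (the Norm step).
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def self_attention(x, Wq, Wk, Wv):          # simplified single-head stand-in
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    return softmax_rows(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def ffn(x, W1, b1, W2, b2):                 # position-wise: applied to every token independently
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU provides the non-linearity

def encoder_layer(x, p):
    # Sub-layer 1: self-attention, then Add (residual) & Norm.
    x = layer_norm(x + self_attention(x, p["Wq"], p["Wk"], p["Wv"]))
    # Sub-layer 2: FFN, then Add & Norm. Shape stays (T, d_model) throughout.
    return layer_norm(x + ffn(x, p["W1"], p["b1"], p["W2"], p["b2"]))

params = {"Wq": rng.normal(size=(d_model, d_model)),
          "Wk": rng.normal(size=(d_model, d_model)),
          "Wv": rng.normal(size=(d_model, d_model)),
          "W1": rng.normal(size=(d_model, d_ff)), "b1": np.zeros(d_ff),
          "W2": rng.normal(size=(d_ff, d_model)), "b2": np.zeros(d_model)}

x = rng.normal(size=(T, d_model))           # embeddings + positional embeddings
for _ in range(6):                          # N = 6 stacked layers (same params reused for brevity)
    x = encoder_layer(x, params)
print(x.shape)                              # (5, 512) - the final Encoder output
```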
Decoder Stack
- Purpose: To generate the output sequence token by token, based on the encoded input representation and the tokens generated so far.
- Structure: Similar to the Encoder, the Decoder is also a stack of N identical layers (again, N = 6 in the paper).
- Inside One Decoder Layer: Each layer has three main sub-layers:
- Masked Multi-Head Self-Attention Mechanism:
- What it does: Similar to the Encoder's self-attention, but with one major difference: masking. When processing the token at position i, this mechanism only allows attention to positions less than or equal to i in the output sequence being generated. It prevents the Decoder from "cheating" during training by looking ahead at future tokens it hasn't predicted yet. In practice, the mask sets the upper-triangular part of the QKᵀ score matrix to -inf, so that after softmax those positions get exactly zero attention weight (see the sketch at the end of this section).
- Input: The output from the previous Decoder layer (or a special start token if just starting).
- Output: Attention-infused vectors for the output sequence generated so far.
- Multi-Head Encoder-Decoder Attention Mechanism:
- Inputs:
- Queries (Q): Come from the output of the previous sub-layer (Masked Self-Attention + Add & Norm) within the Decoder.
- Keys (K) and Values (V): Come directly from the output of the final Encoder layer. (These are the same K and V for every Decoder layer.)
- Note: How are K and V generated?
- The final output of the Encoder is used to generate these.
- There is a set of trained weights for keys and values that is SEPARATE from the weights of the Decoder's own multi-head self-attention, referred to as Wk_cross and Wv_cross.
- The output of the Encoder is multiplied with these weights. It is a similar process to the Encoder's multi-head attention, where the input matrix (here, the Encoder output) is multiplied with the learned Wk_cross and Wv_cross matrices to produce the K and V matrices.
- Note: This is for a SINGLE attention head. There are h attention heads like this, each with its own learned Wk_cross and Wv_cross matrices, similar to how each attention head in the Encoder's self-attention had its own.
- Process:
- Similar to the Encoder's self-attention, Encoder-Decoder attention follows the formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

with Q coming from the Decoder and K, V coming from the Encoder output.
- Output: Vectors representing the Decoder tokens, now informed by the relevant parts of the encoded input sequence; the h head outputs are concatenated (and projected) as before.
- Position-wise Feed-Forward Network (FFN): Identical in structure and function to the FFN in the Encoder layer, applied independently to each token coming out of the Encoder-Decoder attention sub-layer.
- Add & Norm: Just like in the Encoder, an Add & Norm step follows each of the three sub-layers in the Decoder.
Encoder-Decoder attention is the link between the Encoder and the Decoder. This is where the magic happens: it allows each token in the Decoder sequence generated so far to attend to all tokens in the final output representation from the Encoder.
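Here is a rough sketch of the two attention steps inside one Decoder layer: the causal mask that sets future positions to -inf, and cross-attention where Q comes from the Decoder while K and V come from the Encoder output (single head, random stand-in weights, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)
T_dec, T_enc, d_model, d_k = 4, 5, 512, 64

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_self_attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d_k)                      # (T_dec, T_dec)
    # Causal mask: positions above the diagonal (the "future") are set to -inf,
    # so after softmax they receive exactly zero attention weight.
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf
    return softmax_rows(scores) @ V

def cross_attention(dec_x, enc_out, Wq, Wk_cross, Wv_cross):
    Q = dec_x @ Wq                                       # queries come from the Decoder
    K = enc_out @ Wk_cross                               # keys and values come from the
    V = enc_out @ Wv_cross                               # final Encoder output
    scores = Q @ K.T / np.sqrt(d_k)                      # (T_dec, T_enc)
    return softmax_rows(scores) @ V                      # each Decoder token mixes Encoder info

# Random placeholders standing in for trained weights.
W = lambda: rng.normal(size=(d_model, d_k))
dec_x = rng.normal(size=(T_dec, d_model))                # Decoder tokens generated so far
enc_out = rng.normal(size=(T_enc, d_model))              # final Encoder output

print(masked_self_attention(dec_x, W(), W(), W()).shape)       # (4, 64)
print(cross_attention(dec_x, enc_out, W(), W(), W()).shape)    # (4, 64)
```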
Finishing touches. How do we go from numbers to words?
So after ALL of this we’ve got a tensor that’s supposed to give us the next predicted token. How do we go from a matrix to a word/token?
To predict the next token, we only need the last vector in this tensor, as it is now RICH in context from all the previous vectors. So we start with a single 1 × d_model vector.
- Linear: We apply a linear transform to project this vector into a vocab_size-dimensional space.
Note: here, vocab_size refers to the number of tokens in the dictionary of the model, i.e. how many words/tokens the model knows exist. If the model only knows 1000 words, vocab_size = 1000.
- Softmax: This vector now has each element directly corresponding to a word in the model's dictionary. The vector needs to act like a probability distribution. However, after all those matrix multiplications, it is NOT a probability distribution yet. This is where softmax comes in. Softmax converts the vector into a probability distribution, where the sum of all elements adds up to one. Softmax also exaggerates the gap between low and high values: higher values get pushed higher and lower values get pushed lower, reducing noise.
$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

The above equation gives the softmax value for a particular element, where:
- i: the specific token you’re calculating the probability for.
- j: used to loop through all tokens to calculate the total sum for the denominator.
AND FINALLY, from this probability distribution we pick (usually) the most probable next token.
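A minimal sketch of these final two steps (the vocabulary, weight matrix, and greedy pick below are illustrative placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 1000                    # toy vocabulary of 1000 tokens

vocab = [f"token_{i}" for i in range(vocab_size)]  # placeholder dictionary
W_out = rng.normal(size=(d_model, vocab_size))     # weights of the final Linear layer

last_vector = rng.normal(size=(d_model,))          # the context-rich last Decoder vector

logits = last_vector @ W_out                       # Linear: (d_model,) -> (vocab_size,)
probs = np.exp(logits - logits.max())              # Softmax: turn raw scores into a
probs /= probs.sum()                               # probability distribution summing to 1

next_token = vocab[int(np.argmax(probs))]          # greedy pick: the most probable token
print(next_token, float(probs.max()))
```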
And that’s it! You’ve successfully generated 1 (one) token! Rinse and repeat thousands of times and you get a coherent language translation.
Conclusion
And there you have it! If you've made it this far, give yourself a pat on the back. We've pulled apart the Transformer engine, piece by piece. From understanding why we needed to ditch sequential models, to getting our heads around Queries, Keys, and Values, Multi-Head Attention, the Encoder-Decoder dance, Add & Norm, FFNs, and finally turning numbers back into words. It is a lot!
The main takeaway? Attention really is all you need. It’s not just applicable in text models and LLMs. Image generation, image recognition, document extraction models also heavily rely on attention these days.
If you wanna see a cool application of this, check out this video, where Stanford students used an image recognition model built on self-attention to become the best AI at playing GeoGuessr!