Transformer Architecture: Self-Attention & MLOps Guide
Exploring attention and its role in contextual text understanding with walkthrough examples
By Kuriko IWAI

Table of Contents
Introduction
What is a Transformer Model
How Transformer Models Work
Training Transformer Models
The Transformer Model Family
Wrapping Up

Introduction
The transformer model revolutionized natural language processing (NLP) by processing entire sequences at once, leveraging techniques like the self-attention mechanism, positional encodings, and multi-head attention.
Although powerful, its underlying concepts can seem complex due to the intricate interplay of these mechanisms.
In this article, I’ll explore the core mechanism of the transformer model step by step using a walkthrough example.
What is a Transformer Model
A transformer is a type of neural network architecture that has revolutionized the field of deep learning, especially in natural language processing (NLP).
Unlike earlier models like Recurrent Neural Networks (RNNs) that process data sequentially, transformers process the entire sequence simultaneously.
This allows them to capture relationships between words regardless of their distance from each other in the sequence.
***
Transformer models are composed of an encoder and a decoder that work together to process and generate sequences.
The below diagram illustrates the architecture of the encoder and decoder with computation processes:

Figure A. Transformer model (Created by Kuriko IWAI based on the original image in Attention is all you need)
▫ Encoder
The encoder processes the input sequence and generates a rich representation of it through two sub-layers:
A multi-head attention layer and
A feed-forward layer.
▫ Decoder
The decoder, on the other hand, takes the output from the encoder and generates an output sequence by predicting the next token (word) through three sub-layers:
A multi-head self-attention layer,
An encoder-decoder attention layer and
A feed-forward layer.
◼ The Attention Mechanism
In both the encoder and the decoder, the core of the algorithm lies in the attention mechanism inside the attention layers.
The attention mechanism is a neural network component that allows the model to weigh the importance of different parts of the input sequence.
Instead of looking only at the previous word, the attention mechanism can examine the entire sequence at once and decide how and when to focus on specific positions in that sequence, considering the context.
For example, in the following sentence:
The animal didn’t cross the street because it was too tired.
the attention mechanism helps the model determine that "it" refers to "The animal" by processing the context of the entire sentence.
Since the transformer processes the sequence all at once, it excels at capturing long-range dependencies.
Let us take a look at how the mechanism works in transformer models in the next section.
How Transformer Models Work
As shown in Figure A, an encoder and decoder share many processing steps.
So, in this section, I’ll first explain the steps focusing on the encoder, and then highlight the key differences of the decoder.
◼ Step 1. Converting Raw Data into Input Embedding
The transformer model first reads raw data sequences and converts them into vector embeddings.
In a vector embedding, each element in the sequence is represented by its own feature vector that numerically reflects qualities like semantic meaning.
For example, the original sentence:
The animal didn’t cross the street because it was too tired.
is broken into tokens:
["The", "animal", "did", "n't", "cross", "the", "street", "because", "it", "was", "too", "tired", "."]
These tokens can be words, sub-words or characters.
Then, the model generates a numerical word vector for each token using a word embedding model:
The → [0.2, -0.5, 0.1, ...]
animal → [0.7, 0.3, -0.2, ...]
tired → [-0.9, 0.6, 0.4, ...]
These initial vectors are called input embeddings or embedding tokens (X) and become the starting point for the attention mechanism.
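As a rough sketch of this step, the snippet below tokenizes the sentence and looks up a small vector for each token; the token split and the 4-dimensional embedding values are illustrative stand-ins, not the output of any particular tokenizer or embedding model.

```python
import numpy as np

# Illustrative token list (a real tokenizer may split the sentence differently)
tokens = ["The", "animal", "did", "n't", "cross", "the", "street",
          "because", "it", "was", "too", "tired", "."]

# Hypothetical embedding table: one small random vector per unique token
# (real models use learned tables with hundreds of dimensions, e.g. 512 or 768)
rng = np.random.default_rng(0)
embedding_table = {token: rng.normal(size=4) for token in set(tokens)}

# Input embeddings X: one row per token in the sequence
X = np.stack([embedding_table[token] for token in tokens])
print(X.shape)  # (13, 4) -> 13 tokens, 4 dimensions each
```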
◼ Step 2. Adding Positional Encoding
While the input embedding captures the semantic meaning of a word, it does not contain any information about the word's position in the sentence.
The transformer’s attention mechanism cannot process the sequential order either because it processes all words simultaneously.
To solve this challenge, positional encoding is used to inject positional information into the word embeddings.
Mathematically, the model computes the positional vectors of each embedding in the input sequence, combining sine and cosine functions:

PE(pos, 2i) = sin(pos / 10000^{2i / d_{model}})
PE(pos, 2i+1) = cos(pos / 10000^{2i / d_{model}})
where:
PE: The positional encoding vector,
pos: The position of the token in the sequence (0,1,2,...),
i: The dimension index of the vector (0,1,2,...) (one for sine (2i) and one for cosine (2i+1)), and
d_{model}: The dimensionality of the model's embedding space (e.g., 512, 768).
These positional vectors are then added to the corresponding word's embedding, and fed into the model.
For example, consider the word "animal" at index 1 in the original sentence in Step 1:
Word Embedding: [0.7, 0.3, -0.2, ...] (represents its semantic meaning)
Positional Encoding: [0.1, 0.4, -0.3, ...] (represents its position at index 1)
Final Input Vector: [0.7 + 0.1, 0.3 + 0.4, -0.2 + -0.3, ...] = [0.8, 0.7, -0.5, ...]
The final, combined vector contains all the information that the model needs:
The word's semantic meaning and
Its precise location within the sentence.
The combined vector is passed on to the next sub-layer, the attention layer.
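Below is a minimal sketch of the sinusoidal positional encoding described above; the small d_model is chosen purely for illustration, and the resulting vectors are simply added element-wise to the input embeddings.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Compute sinusoidal positional encodings of shape (seq_len, d_model)."""
    positions = np.arange(seq_len)[:, np.newaxis]        # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]             # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions use cosine
    return pe

pe = positional_encoding(seq_len=13, d_model=4)
print(pe[1])  # positional vector for the token at index 1 ("animal")
# final_input = X + pe   # added element-wise to the input embeddings
```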
◼ Step 3. Attention Layer - Determining Attention Scores
In the attention layer, the model first determines correlations between the combined word vectors by computing attention scores.
The attention score weighs the alignment of the word vectors; a large attention score indicates well-aligned vectors.
In the computation process, the model uses the three vectors:
Q: A Query vector of the current word (with each row of q_i),
K: A Key vector for each of the other words in the sequence (with each row k_j), and
V: A Value vector containing the actual information of the data.
and computes the scaled dot product of Q and K:

score = QK^T / \sqrt{d_k}

where d_k is the dimension size of the vectors Q and K.
This \sqrt{d_k} term works as a scaling factor, preventing the scores from growing too large as the vector dimensions increase.
Let us consider the word vectors for "it", "animal", and "street" from the original sentence in Step 1, and assume the current word is "it".
In Steps 1 and 2, these word vectors capture the semantic meaning and positional information of the words such that:
The Query vector of the current word it: q_{it} = [0.8, 0.2]
The Key vector of animal: k_{animal} = [0.7, 0.3]
The Key vector of street: k_{street} = [-0.1, 0.9]
To find the relationship among the words, the model computes the dot product QK^T:
q_{it} *k_{animal} = (0.8 * 0.7)+(0.2 * 0.3) = 0.56 + 0.06 = 0.62
q_{it} * k_{street} = (0.8 * −0.1)+(0.2 * 0.9) = −0.08 + 0.18 = 0.10
Then, scale the dot products by \sqrt{d_k} with the dimension size d_k = 2:
Scaled attention score to "animal": 0.62 / \sqrt 2 ≈ 0.44
Scaled attention score to "street": 0.10 / \sqrt 2 ≈ 0.07
The results indicate a stronger semantic relationship between "it" and "animal" (0.44) than between "it" and "street" (0.07).
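Here is the same arithmetic as a small sketch in code, using the illustrative 2-dimensional vectors above:

```python
import numpy as np

q_it = np.array([0.8, 0.2])       # Query vector for "it"
k_animal = np.array([0.7, 0.3])   # Key vector for "animal"
k_street = np.array([-0.1, 0.9])  # Key vector for "street"

d_k = 2  # dimension of the Query/Key vectors

score_animal = q_it @ k_animal / np.sqrt(d_k)
score_street = q_it @ k_street / np.sqrt(d_k)
print(round(score_animal, 2), round(score_street, 2))  # 0.44 0.07
```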
◼ Step 4. Computing Attention Weights
Next, the model computes the attention weights by applying the softmax function to the scaled attention scores:

Attention weights = softmax(QK^T / \sqrt{d_k})

Because softmax is defined with the base of the natural logarithm e:

softmax(x_i) = e^{x_i} / \sum_j e^{x_j}

the attention weights of the current word "it" to "animal" and "street" are computed as:
Attention weight to "animal" = e^0.44 / (e^0.44 + e^0.07) ≈ 0.59
Attention weight to "street" = e^0.07 / (e^0.44 + e^0.07) ≈ 0.41
This means that when processing the word "it", the model pays about 60% of its attention to "animal" and about 40% to "street", which correctly reflects the sentence's meaning.
An attention weight of zero means the model disregards that word vector, while a weight of one means full attention.
These attention weights help transformer models focus on specific input at specific moments.
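Continuing the sketch, passing the scaled scores through softmax reproduces these weights:

```python
import numpy as np

# scaled scores of "it" against "animal" and "street" from Step 3
scores = np.array([0.44, 0.07])

weights = np.exp(scores) / np.exp(scores).sum()
print(weights.round(2))  # [0.59 0.41]
```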
◼ Step 5. Output from the Attention Layer
Step 5 is the last step of the attention layer.
The model computes a weighted sum of the Value vectors V to produce the final output in the attention layer:

Z = softmax(QK^T / \sqrt{d_k}) V
where:
Z: The output values from the attention layer,
d_k: The dimension size of the Query and Key vectors (Q, K), and
V: The Value vector.
Let us assume we have corresponding Value vectors for "animal" and "street" such that:
V_animal = [0.8, −0.4]
V_street = [0.1, 0.9]
In Step 4, we've already calculated the attention weights to "animal" (≈ 0.59) and to "street" (≈ 0.41).
So, the weighted Value vectors for the current word "it" are:
z_animal = 0.59 ⋅ [0.8, −0.4] = [0.47, −0.24]
z_street = 0.41 ⋅ [0.1, 0.9] = [0.04, 0.37]
The final output vector for "it" is the sum of these weighted values:
Z_it = [0.47, −0.24] + [0.04, 0.37] = [0.51, 0.13]
The new vector [0.51, 0.13] is the contextualized representation of the word "it" that contains semantic information from both "animal" and "street".
The vector gives more weight to the information from "animal", correctly capturing the fact that "it" refers to the animal in the sentence.
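The same weighted sum as a code sketch, continuing from the weights computed in Step 4:

```python
import numpy as np

weights = np.array([0.59, 0.41])   # attention weights for "animal" and "street"
V = np.array([[0.8, -0.4],         # Value vector for "animal"
              [0.1,  0.9]])        # Value vector for "street"

z_it = weights @ V                 # weighted sum of the Value vectors
print(z_it.round(2))               # [0.51 0.13]
```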
◼ The Multi-Head Attention
Before moving on to the next sub-layer, the feed-forward network layer, let us explore the multi-head attention mechanism, the key component of the attention layer.
Multi-head attention repeats Steps 3 to 5, enabling the model to process information from multiple perspectives simultaneously.
Instead of a single attention calculation, the model runs several attention mechanisms in parallel where each head learns to focus on different types of relationships within the data.
The first step is to split the input embedding into h evenly-sized subsets.
Then, each subset is fed into one of h parallel matrices of Q, K, and V (respectively called query head, key head, and value head).
Lastly, the outputs from these heads are fed into the corresponding subset of the next attention layer, called an attention head.
The diagram below shows how Head A processes the relationship of "it" to "animal" in the original sentence in Step 1 using eight heads:

Figure B. Simplified multi-head attention with eight heads (h=8) (Created by Kuriko IWAI)
In the diagram, the model first splits the input embeddings for "animal" and "it" into eight equal-sized subsets (red and blue circles).
Then, each subset is sent to a corresponding head:
Head 1 gets the first subsets of the “animal” and “it” embedding tokens,
Head 2 gets the second subsets of the “animal” and “it”, and
… continues to Head 8.
These eight heads learn different aspects of the relationships between the given word embeddings, leveraging their own unique Query (Q), Key (K), and Value (V) matrices (black circles in the diagram).
For example:
Head 1 learns the core grammatical link between “animal” and “it”.
Head 2 learns semantic relationship; both “animal” and “it” are subjects in their respective clauses.
Head 3 identifies the distance between the two words.
… and the list continues.
Then, each head generates unique attention outputs Z’s, which are concatenated back together to create a single, combined vector.
The key here is that each head operates the process in parallel, allowing the model to learn diverse aspects of the relationship simultaneously.
So, the new, richer vector now contains a fused representation of all the relationships learned by the individual heads.
This process is then repeated for every token in the sentence.
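The sketch below follows the splitting description above with two heads and tiny dimensions; real implementations differ in detail (for example, many project the full embedding inside each head), and all sizes and random weights here are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Steps 3 to 5 for one head: scores -> softmax weights -> weighted sum."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(42)
seq_len, d_model, num_heads = 13, 8, 2        # tiny sizes for illustration
d_head = d_model // num_heads

# input embeddings + positional encodings for the 13 tokens (random stand-ins)
X = rng.normal(size=(seq_len, d_model))

head_outputs = []
for h in range(num_heads):
    X_h = X[:, h * d_head:(h + 1) * d_head]   # the h-th subset of the embedding
    # each head has its own Query, Key, and Value weight matrices
    W_Q, W_K, W_V = (rng.normal(size=(d_head, d_head)) for _ in range(3))
    head_outputs.append(scaled_dot_product_attention(X_h @ W_Q, X_h @ W_K, X_h @ W_V))

# concatenate the per-head outputs back into a single vector per token
multi_head_output = np.concatenate(head_outputs, axis=-1)   # shape (13, 8)
print(multi_head_output.shape)
```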
◼ Step 6. Non-Linear Transformation
Next, the feed-forward network (FFN) layer takes the output from the attention layer and performs a non-linear transformation by applying the ReLU activation:

FFN(x) = ReLU(xW_1 + b_1)W_2 + b_2
where:
x: The input vector from the previous sub-layer,
W_1: The weight matrix to transform the input vector to a higher dimensional space,
W_2: The weight matrix to transform the input vector to the original dimensional space,
b1, b2: Bias vectors, and
ReLU: The Rectified Linear Unit activation function, which introduces non-linearity.
This process introduces non-linearity, allowing the model to learn more complex patterns beyond what the attention mechanism alone can capture.
Importantly, this feed-forward network is applied independently and identically to each token’s vector.
Let us assume the FFN layer has a small set of weights and biases W_1, b_1, W_2, and b_2 (the exact values here are purely illustrative).
The input vector x is the attention output for "it" from Step 5, so x = [0.51, 0.13].
After the linear expansion, the ReLU activation, and the projection back to the original dimension, the feed-forward network produces a new vector such as FFN(x) = [0.74, 0.58].
This new vector is a richer representation of the current word "it", which is passed on to the next layer in the transformer model.
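Since the concrete weight values are not shown here, the sketch below uses small assumed matrices (chosen so that the result matches the illustrative output above) to demonstrate the expand-then-project pattern:

```python
import numpy as np

x = np.array([0.51, 0.13])              # attention output for "it" from Step 5

# Assumed weights and biases, for illustration only
W_1 = np.array([[1.0, 0.0, 1.0, 0.0],
                [0.0, 1.0, 1.0, 0.0]])  # (2, 4): expand to a higher dimension
b_1 = np.array([0.0, 0.0, 0.0, 0.2])
W_2 = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.0, 0.5],
                [0.5, 0.0]])            # (4, 2): project back to the original dimension
b_2 = np.array([0.13, 0.13])

hidden = np.maximum(0, x @ W_1 + b_1)   # ReLU(xW_1 + b_1)
ffn_output = hidden @ W_2 + b_2         # project back to two dimensions
print(ffn_output.round(2))              # [0.74 0.58]
```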
◼ Step 7. Create the Sentence Embedding
The final step is to combine the refined vectors into a single, comprehensive vector that represents the entire sentence.
A common method is to use a special token like the [CLS] token.
The [CLS] token is designed to represent the aggregated meaning of the entire input sequence, which becomes the sentence embedding.
The original sentence:
The animal didn’t cross the street because it was too tired.
is represented by a 768-dimensional vector like:
[0.45, -0.21, 0.89, -1.01, 0.55, 0.08, ... , -0.34, 0.72]
which semantically encodes the entire meaning of the sentence.
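As a hedged, concrete example, one common way to obtain such a sentence embedding is to take the final hidden state of the [CLS] token from a pre-trained encoder-only model; the sketch below assumes the Hugging Face transformers and torch packages are installed, and the actual values depend on the checkpoint used.

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "The animal didn't cross the street because it was too tired."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The [CLS] token sits at position 0; its final hidden state is commonly used
# as the sentence embedding (768 dimensions for bert-base-uncased).
sentence_embedding = outputs.last_hidden_state[:, 0, :]
print(sentence_embedding.shape)  # torch.Size([1, 768])
```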
The sentence embedding is leveraged in tasks like:
Semantic search: Find other sentences with similar meanings, like “The tired dog didn’t go across the road.”,
Text classification: Classify the sentence’s sentiment (e.g., as neutral or negative), and
Information retrieval: Answer a question about the text by comparing the query’s embedding to the sentence’s embedding.
These are the seven steps performed by an encoder.
Now, let us explore a decoder, highlighting its key differences from the encoder.
◼ Key Differences of the Decoder
In an encoder, the attention mechanism accesses the entire input sequence simultaneously, making it suitable for tasks like text classification and sentiment analysis.
On the other hand, in a decoder, the attention mechanism processes the input sequence one token at a time, enabling it to generate a new sentence by predicting a suitable next token.
This requires two distinct differences:
Masked multi-head attention mechanism applied in Step 4 — to prevent the decoder from looking at the future tokens (aka, no cheating allowed), and
An additional encoder-decoder attention layer in Steps 3 to 5 to access the rich representation of the input sequence processed by an encoder.
▫ Masked Multi-head Attention Mechanism
A masked multi-head attention mechanism is a technique that applies a mask to the scaled attention scores.
This is essential for a decoder to effectively block the attention to subsequent words in the sequence, preventing it from looking at future tokens.
The difference is in Step 4, where the attention weights are computed from the scaled attention scores plus a Mask term:

Attention weights = softmax(QK^T / \sqrt{d_k} + Mask)

This Mask term is a triangular matrix with negative infinity values at future positions.
These negative infinities become zero after the softmax, because e^{−∞} approaches zero.
Using the same example as in the encoder, the scaled attention scores from the current word "it" are:
Scaled attention score to "animal" ≈ 0.44
Scaled attention score to "street" ≈ 0.07
These values are the same as in the encoder because "animal" and "street" are past tokens relative to "it".
But the decoder masks the score in the opposite direction, where "animal" attends to the future token "it", by applying the mask:
Scaled attention score from "animal" to "it" ≈ 0.44
Adding the mask: 0.44 + (−∞) = −∞
The attention weight: e^{−∞} = 0
The result shows that the mask prevents "it" from influencing "animal", because "it" is a future token relative to "animal".
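A short sketch of how such a causal mask can be built and applied to a matrix of scaled scores; the three-token sequence and the score values are made up for illustration.

```python
import numpy as np

# scaled attention scores for a toy three-token sequence (rows attend to columns)
scores = np.array([[0.30, 0.10, 0.44],
                   [0.10, 0.25, 0.07],
                   [0.44, 0.07, 0.50]])

# causal mask: negative infinity above the diagonal, so no token attends to future positions
mask = np.triu(np.full_like(scores, -np.inf), k=1)
masked = scores + mask

# softmax row by row; e^{-inf} = 0, so future tokens receive zero attention weight
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)
print(weights.round(2))
```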
▫ Encoder-decoder Attention Layer
In addition to the masked self-attention, the decoder includes a special encoder-decoder attention layer.
This layer performs the same attention computation as the encoder by taking the same Steps 3, 4, and 5.
But a crucial difference lies in how the Query (Q), Key (K), and Value (V) vectors are sourced:
Query vectors Q come from the decoder's masked self-attention layer, representing the tokens generated by the decoder so far.
Key/Value Vectors K, V come from the final output of the encoder, containing contextualized information of the entire input sequence.
This setup allows the decoder to query the rich representation of the input sequence from the encoder.
For instance, when translating the French word "l'animal", the decoder's query vector will have a high attention score (Step 4) with the encoder's key vector for "the animal", ensuring the correct translation.
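A compact sketch of how the two sources are wired together; the dimensions and random values below stand in for real encoder outputs and decoder states and are purely illustrative.

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(1)
d_model = 4                                      # tiny dimension for illustration
encoder_output = rng.normal(size=(13, d_model))  # contextualized input sequence (13 tokens)
decoder_states = rng.normal(size=(3, d_model))   # tokens generated by the decoder so far

# Q comes from the decoder; K and V come from the encoder output
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q = decoder_states @ W_Q
K = encoder_output @ W_K
V = encoder_output @ W_V

cross_attention_output = attention(Q, K, V)      # (3, 4): one vector per decoder token
print(cross_attention_output.shape)
```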
Training Transformer Models
The key to transformers' success is how well they capture long-range dependencies by leveraging the attention mechanism.
◼ The Low-Level Mechanics: Optimizing Model Parameters
During the training process, the model learns to optimize the Query (Q), Key (K), and Value (V) vectors for each token in the attention layers, following two major steps:
Step 1. Learned Weight Matrices:
The model learns three distinct weight matrices corresponding to Q, K, and V:
A query weight matrix (W_Q),
A key weight matrix (W_K), and
A value weight matrix (W_V).
These matrices are initialized with random values first, then adjusted during training to optimize the model’s performance.
Step 2. Linear Transformation
The Q, K, and V matrices are computed by multiplying the input embedding matrix (X) by their respective weight matrices:

Q = XW_Q
K = XW_K
V = XW_V
where:
Q, K, V: The Query, Key, and Value matrix,
W_Q, W_K, W_V: Weight matrix for Q, K, and V respectively, and
X: Input embedding.
This linear transformation projects the original embeddings into new vector spaces, making the embeddings better suited for the attention calculation.

Figure C. A simplified diagram of the transformer's attention mechanism (Created by Kuriko IWAI)
These computed Q, K, and V matrices are then used in the core attention formula in Step 3.
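In code, this projection is just three matrix multiplications; the shapes and random values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
seq_len, d_model = 13, 8                 # tiny sizes for illustration

X = rng.normal(size=(seq_len, d_model))  # input embeddings (+ positional encodings)

# learned weight matrices: initialized randomly, then updated during training
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # linear projections used in Step 3
print(Q.shape, K.shape, V.shape)         # (13, 8) each
```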
Then, in the feed-forward layer, the weight and bias vectors W_1, W_2, b_1, and b_2 are learned through backpropagation and gradient descent to minimize the model's prediction error.
◼ Self-Supervised Learning (SSL): Overcoming the Data Bottleneck
We've learned that the attention mechanism enables parallelization, performing the computation for each token independently.
This means that transformer models can perform many computations simultaneously, taking full advantage of the speed and power of GPUs.
As a result, these models can be trained on massive datasets, which makes self-supervised learning (SSL) invaluable.
SSL is a machine learning paradigm that trains a model to generate its own supervisory signals from the input data, without explicit human-annotated labels, in two main steps:
Step 1. Training models on pretext task
The first step is to train the model on an unlabeled dataset and make it solve a task where the answer is derived from the data itself.
For example:
Masked language modeling (MLM): The model is given a sentence with some words masked or hidden, and it must predict the missing words (used in models like BERT).
Image inpainting: The model is given an image with a section removed, and it must fill in the missing pixels.
Predicting next frame: The model is given a sequence of video frames and must predict the next one.
Step 2. Transfer learning onto the downstream task
After the model has learned a rich representation from the pretext task, its learned weights are transferred to a new, specific task with only a small labeled dataset.
This transfer learning significantly boosts performance on the new task with minimal extra effort.
SSL overcomes a major bottleneck of traditional supervised learning, the need for massive, expensive labeled datasets, and is widely applied across various domains, including computer vision (e.g., object detection and image classification) and audio processing (e.g., speech recognition).
That covers the core mechanics of transformer models.
In the next section, I’ll explore major transformer models with use cases.
The Transformer Model Family
The transformer model family is categorized into three main architectural types:
Encoder-only,
Decoder-only, and
Encoder-decoder models.
◼ Encoder-only Models (Auto-Encoding Models)
Encoder-only models are designed to create a rich representation of the input sequence by considering the context bidirectionally.
These models are suitable for tasks that require a deep understanding of the input, such as
Sentiment analysis,
Named entity recognition, and
Text classification.
Major models include:
BERT (Bidirectional Encoder Representations from Transformers): Developed by Google, BERT was a breakthrough model that introduced a bidirectional training approach.
RoBERTa (Robustly Optimized BERT Pretraining Approach): A more robustly trained version of BERT by Facebook, with key differences like using more data and longer training times.
ALBERT (A Lite BERT): A lighter and more efficient version of BERT that uses parameter sharing to significantly reduce the number of parameters.
DistilBERT: A distilled version of BERT, meaning it's smaller, faster, and cheaper to pre-train while retaining most of BERT's performance.
◼ Decoder-only Models
Decoder-only models are autoregressive models that predict the next token in a sequence based on the tokens that came before it.
These models are ideal for generative tasks where the output is created one token at a time, such as
Text generation,
Creative writing, and
Conversational AI.
Major models include:
GPT (Generative Pre-trained Transformer) Family: A series of models from OpenAI.
LLaMA (Large Language Model Meta AI) Family: A series of open-source models from Meta AI.
PaLM (Pathways Language Model): Developed by Google, a family of large language models.
BLOOM: A massive multilingual open-access model created through a collaboration of researchers and institutions.
◼ Encoder-Decoder Models (Sequence-to-Sequence / Seq2seq Models)
Encoder-decoder models use both an encoder and a decoder.
The encoder processes the input sequence to create a contextual representation, and the decoder uses that representation to generate the output sequence.
This architecture is perfect for transforming one sequence into another like
Text summarization,
Question answering, and
Translation.
Major models include:
T5 (Text-to-Text Transfer Transformer): A model from Google that reformulates every NLP problem into a text-to-text format.
BART (Bidirectional and Auto-Regressive Transformers): A model that uses a bidirectional encoder (like BERT) and an autoregressive decoder (like GPT).
PEGASUS: A model from Google specifically designed for abstractive summarization.
◼ Multimodal and Other Architectures
The transformer architecture is also adapted for tasks beyond traditional NLP, handling multimodality like images, audio, and text.
Major models include:
Vision Transformer (ViT): A model that applies the Transformer architecture directly to image classification by treating an image as a sequence of patches.
DALL-E: A model that generates images from text descriptions, using a Transformer to relate text to visual representations.
CLIP (Contrastive Language-Image Pre-training): A model trained on a massive dataset of images and their text descriptions to learn the relationship between images and text.
Wrapping Up
The Transformer model shifts NLP from sequential data processing to a simultaneous, attention-based approach.
The core innovation lies in the self-attention mechanism, which enables the model to weigh the importance of all tokens in a sequence at once.
By leveraging positional encodings, multi-head attention, and feed-forward networks, the Transformer creates rich, contextualized representations of data.
This architecture's ability to be parallelized has enabled it to be trained on vast datasets using self-supervised learning, leading to powerful models like BERT (encoder-only), GPT (decoder-only), and T5 (encoder-decoder).
The Transformer's success has extended beyond text, laying the foundation for advancements in fields like computer vision and multimodal AI.
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
Regularizing LLMs with Kullback-Leibler Divergence
The Definitive Guide to LLM Fine-Tuning: Objectives, Mechanisms, and Hardware
Grouped Query Attention (GQA): Balancing LLM Quality and Speed
Decoding CNNs: A Deep Dive into Convolutional Neural Network Architectures
Related Books for Further Understanding
These books cover a wide range of theory and practice, from fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Large Language Models: Language Understanding and Generation
Share What You Learned
Kuriko IWAI, "Transformer Architecture: Self-Attention & MLOps Guide" in Kernel Labs
https://kuriko-iwai.com/transformers
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll in the AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.



