Beyond the Window: Benchmarking Positional Encoding (PE) for LLM Extrapolation
Scaling context windows via PE extrapolation on unseen sequence lengths
By Kuriko IWAI

Table of Contents
Introduction
What is Positional Encoding
Key Types of Positional Encoding (PE) for LLMs
PE in Action: Benchmarking Extrapolation Capabilities
Conclusion

Introduction
Positional encoding (PE) is a key component of the Transformer architecture as its attention mechanism is set-invariant, requiring explicit positional information to process sequential data like language.
However, traditional PE methods face significant challenges in extrapolating to sequences much longer than those seen during training, limiting the Transformer's context window.
In this article, I’ll explore major PE methods and investigate their ability to handle longer sequences via extrapolation by instantiating and training a small Transformer on a synthetic long-sequence task.
What is Positional Encoding
Positional Encoding (PE) is a technique used in the Transformer architecture to inject information on the position and order of tokens within the input embedding.
◼ Overcoming Set-Invariance in Attention
The diagram below illustrates how a standard PE (absolute PE) works:

Kernel Labs | Kuriko IWAI | kuriko-iwai.com
Figure A. Transformer architecture (left) and how APE works (right) using BERT (Created by Kuriko IWAI)
The Transformer architecture (left side of Figure A) relies on the self-attention mechanism, where all tokens in a sequence are processed simultaneously without any intrinsic sense of word order.
This parallelism lets Transformers process long sequences much faster than traditional sequential models like Recurrent Neural Networks (RNNs).
Position Encoding (PE) plays a key role by adding information about the position of each token in the input sequence to create the final input vector (X in Figure A) which is passed onto the attention layer.
◼ The PE Mechanism
In Figure A, the Transformer model receives an input sentence:
“The car is chasing the dog.”
and proceeds through three main steps:
Tokenization: The sentence is tokenized, including special tokens like [CLS] (classification/start) and [SEP] (separator/end). Figure A uses the BERT tokenizer to generate nine tokens: [CLS], the, car, is, chasing, the, dog, ., [SEP].
ID assignment:
Position IDs are assigned from zero to the sequence length minus one (0 to 8).
Segment IDs (or Token Type IDs) are assigned to indicate which segment (sentence) a token belongs to (0 for all tokens in Figure A, as the input sequence has only one sentence).
Embedding: The IDs are converted into embedding vectors: token embeddings (pink boxes), position embeddings (green box), and segment embeddings (blue boxes).
The final input embedding vector (X) is the sum of these three embeddings:

X = E_{token} + E_{pos} + E_{seg}

where:
X: The input vector (shape: sequence length (N = 9 in Figure A) x embedding dimension (i.e., the hidden size of BERT = 768)),
E_{token}: Token embedding which provides the semantic meaning of the token,
E_{pos}: Positional embedding which provides the order of the word in the sequence, and
E_{seg}: Segment embedding which indicates which input segment the token belongs to.
Without PE, the model would treat:
"The car is chasing the dog." and
"The dog is chasing the car."
identically, losing all of the meaning carried by word order.
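As a toy illustration of this element-wise sum, consider the sketch below. The embedding values are made up for readability; a real BERT uses learned 768-dimensional vectors:

```python
# toy embedding tables with made-up values (a real BERT uses learned 768-dim vectors)
d_model = 4
E_token = {"the": [0.1, 0.2, 0.3, 0.4], "car": [0.5, 0.1, 0.0, 0.2]}
E_pos = [[0.01 * p] * d_model for p in range(9)]  # one vector per position id 0..8
E_seg = {0: [0.0] * d_model}  # single-sentence input -> segment id 0 for every token

def input_vector(token: str, pos: int, seg: int = 0) -> list:
    # X = E_token + E_pos + E_seg, summed element-wise
    return [tok + p + s for tok, p, s in zip(E_token[token], E_pos[pos], E_seg[seg])]

x_car = input_vector("car", pos=2)  # the token "car" at position 2
```

Because the position embedding differs at every index, the same token produces a different X at different positions, which is exactly what the attention layer needs.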
Key Types of Positional Encoding (PE) for LLMs
In this section, I’ll explore major PE types:
Absolute Positional Encoding
Relative Positional Encoding
Hybrid Approach
◼ Absolute Positional Encoding
Absolute PE assigns a unique vector to each time step, capturing its position in the sequence.
As we learned in the previous section, this process is generalized:
where X is the input vector passed onto the attention layer, and E_{token}, E_{pos}, E_{seg} are token/position/segment embeddings respectively:

Figure B. How APE works (Created by Kuriko IWAI)
In Figure B, the positional embedding is projected onto the Query (Q), Key (K), and Value (V) vectors such that:

Q = X W_Q, K = X W_K, V = X W_V

where Q, K, V are the Query, Key, and Value vectors, and W_Q, W_K, and W_V are the learnable weight matrices for each, respectively.
Absolute PE has three major approaches to generating the positional embedding E_{pos}:
Fixed Absolute Positional Encoding (FAPE)
Learnable Positional Encoding (LPE)
Time Absolute Position Encoding (tAPE)
▫ Fixed Absolute Positional Encoding (FAPE)
Fixed Absolute Positional Encoding (FAPE) is the most common approach, using sinusoidal functions to compute the positional information.
In FAPE, the value of the PE vector is generalized:
- For even dimensions (i.e., i = 0, 2, 4, …): PE(pos, i) = sin(pos / 10000^{i / d_{model}})
- For odd dimensions (i.e., i = 1, 3, 5, …): PE(pos, i) = cos(pos / 10000^{(i - 1) / d_{model}})
where:
pos: The absolute position of the token in the sequence (pos = 0, 1, …),
i: The dimension index from zero to d_{model},
d_{model}: The dimensionality of the embedding space (hidden size) (i.e., d_{model} = 768 for BERT), and
10000^{{i} / {d_{model}}}: The wavelength term of the sinusoid. As the dimension index i increases, the wavelength grows and the frequency decreases exponentially.
Pros:
Requires no extra parameters to learn.
Can generalize effectively to arbitrarily long sequence lengths due to the fixed sinusoidal formula.
Cons:
- The positional representation is inflexible and cannot be adapted or learned specifically for the task or dataset.
Best when:
Sequence lengths are highly variable.
The model must generalize to sequences longer than those seen during training.
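The sinusoidal formula above can be sketched in a few lines of plain Python (a minimal illustration, not the batched implementation shown later in this article):

```python
import math

def fape(pos: int, i: int, d_model: int = 768) -> float:
    # even dimension index -> sin, odd -> cos, each pair sharing the same frequency
    if i % 2 == 0:
        return math.sin(pos / 10000 ** (i / d_model))
    return math.cos(pos / 10000 ** ((i - 1) / d_model))
```

Because the values come from a fixed formula, fape(10_000, 0) is as well-defined as fape(0, 0): no table lookup can run out of range, which is why FAPE extrapolates by construction.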
▫ Learnable Positional Encoding (LPE)
Learnable Positional Encoding (LPE) is a type of absolute PE where positional embeddings are randomly initialized and then learned via backpropagation during the model's training process.
This approach is used by models like BERT and GPT, providing more flexibility because it does not rely on a mathematical formula to generate the embeddings.
Pros:
- Provides more flexibility as the embeddings are learned via backpropagation.
Cons:
- Fails to generalize to sequence lengths exceeding the maximum length seen during training.
Best when:
Sequence lengths are relatively fixed or bounded.
High flexibility and task-specific learning are desired.
▫ Time Absolute Position Encoding (tAPE)
Time Absolute Position Encoding (tAPE) is a variant of absolute PE that adapts the sinusoidal formula to the actual input length, aiming to model temporal relationships more faithfully.
The tAPE is calculated similarly to FAPE, using sin and cos alternately for even and odd indices, but with a scaling factor σ that depends on the total sequence length N:
where the base σ is defined as:
where:
pos: The absolute position of the token in the sequence (pos = 0, 1, …),
i: The dimension index from zero to d_{model},
d_{model}: The dimensionality of the embedding space, and
N: Sequence length. The total number of tokens in the current sequence.
The key difference from FAPE is that the base σ scales the frequency based on the input sequence length N, instead of the fixed base of 10,000.
Pros:
- Adapts to various sequence lengths, as the scaling factor depends on the total sequence length.
Cons:
- The formula is more complex than FAPE.
Best when:
- You want the flexibility of an LPE approach but also need generalization to various sequence lengths (combining the benefits of LPE and FAPE).
Using the case in Figure A, each approach generates unique positional embeddings:

Figure C. Comparison of positional embeddings generated by FAPE, LPE, and tAPE (d_{model} = 4, N = 9) (Created by Kuriko IWAI)
◼ Relative Positional Encoding
Relative Positional Encoding (RPE) incorporates the relative distance between time steps into the attention mechanism.
The diagram below illustrates how RPE works:

Figure D. How RPE works (Created by Kuriko IWAI)
First, RPE uses the token embedding alone as the input vector, without adding positional or segment embeddings:

X = E_{token}

So the Query (Q), Key (K), and Value (V) vectors project only the token embeddings:

Q = X W_Q, K = X W_K, V = X W_V

where Q, K, V are the Query, Key, and Value vectors, and W_Q, W_K, and W_V are the learnable weight matrices for each, respectively.
Next, the attention mechanism is directly modified with a relative positional bias b (red boxes in Figure D):

a_{i, j} = (q_i k_j^T + b_{i, j}) / √d_k
where:
a_{i, j}: The scaled, modified attention score between the i-th Query token and j-th Key token,
q_i: The i-th Query token in the Query vector,
k_j: The j-th Key token in the Key vector,
d_k: The dimensionality of the Key vector (scaling factor), and
b_{i, j}: The relative positional bias between the i-th Query token and the j-th Key token.
In RPE, the positional embeddings directly influence the attention weights, which allows the model to capture dependencies based on how far apart elements are.
Pros:
- Can inherently generalize to new sequence lengths.
Cons:
More complex attention mechanism.
Computationally more expensive due to the learned relative key vectors.
Best when:
- Robust generalization to varying lengths is critical.
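A minimal sketch of the biased attention score, using toy vectors (in practice the bias b_{i,j} is learned, not hand-set):

```python
import math

def rpe_score(q, k, b_ij, d_k=4):
    # a_ij = (q_i . k_j + b_ij) / sqrt(d_k): the relative bias shifts the raw dot product
    return (sum(qe * ke for qe, ke in zip(q, k)) + b_ij) / math.sqrt(d_k)

a = rpe_score([1.0, 0.0, 0.0, 0.0], [0.5, 0.0, 0.0, 0.0], b_ij=0.5)
```

Because b_ij depends only on the pair of positions, the same bias applies to any tokens at that relative distance, which is what lets RPE generalize across sequence lengths.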
RPE has a simplified variant: Efficient Relative Position Encoding (eRPE).
▫ Efficient Relative Position Encoding (eRPE)
Efficient Relative Position Encoding (eRPE) removes the Query vector q_i from the relative bias calculation, making the positional bias term static and content-independent:

a_{i, j} = (q_i k_j^T + b_{i-j}) / √d_k

where b_{i-j} is a learned scalar or vector bias for the relative distance i-j.
Pros:
- More computationally efficient/simpler than RPE.
Cons:
- Less expressive than RPE because the relative bias term does not consider the Query vector.
Best when:
A computationally simpler and faster RPE variant is preferred
A slight reduction in context-dependency is acceptable for efficiency.
Using the case in Figure A, each approach generates positional bias terms:

Figure E. Comparison of positional embeddings generated by RPE and eRPE (d_{model} = 4) (Created by Kuriko IWAI)
Figure E shows an example of computing the bias terms for the two tokens: “car“ (i = 2) and “dog“ (j = 6) whose relative position is k = -4.
RPE computes the relative bias terms based on the interaction between the Query vector q_2 and a learned relative key vector r_{-4} which depends on the distance between the two tokens.
On the other hand, eRPE computes the static bias terms based solely on the learned scalar b_{-4} which does not consider the Query vector's content.
◼ Hybrid Approaches
Hybrid approaches combine elements of both absolute and relative positional encodings to leverage the strengths of each.
There are three major variants:
Transformer with Untied Positional Encoding (TUPE)
Temporal Positional Encoding (T-PE)
Rotary Positional Embeddings (RoPE)
▫ Transformer with Untied Positional Encoding (TUPE)
Transformer with Untied Positional Encoding (TUPE) is a modification of RPE where the bias terms are introduced as separate, dedicated projections.
The score is calculated by summing three independent terms:

a_{i, j} = q_i k_j^T + q_i (r_{i-j}^K)^T + p_j k_i^T
where:
r_{i-j}^K: The learned relative key vector from RPE, and
p_j: The learned absolute positional key vector for position j.
Taking the case in Figure E as an example, consider the two tokens “car“(Query i=2) and “dog“ (Key j=6):
Content similarity: C_{2,6} = 0.5 (as shown in Figure E)
Query-relative key: b = 0.18 (as shown in Figure E)
Absolute key-content: (p_6 k_2)^T = 0.07
Then, the final TUPE attention score a = 0.5 + 0.18 + 0.07 = 0.75.
TUPE combines the strengths of APE through p_j and RPE through r_{i-j}^K.
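The three-term sum can be replayed with the numbers from the worked example above (the individual values are illustrative, taken from Figure E):

```python
def tupe_score(content_sim, rel_key_term, abs_key_term):
    # sum of the three independent terms: content, query-relative key, absolute key-content
    return content_sim + rel_key_term + abs_key_term

a = tupe_score(0.5, 0.18, 0.07)  # matches the 0.75 from the worked example (up to float rounding)
```

Each term can be computed and inspected independently, which is the "untied" part of TUPE: content and position contributions are never mixed inside one projection.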
Pros:
- Combines the strengths of both APE and RPE.
Cons:
The most complex attention score calculation among all.
Requires learning parameters for both absolute and relative positions.
Best when:
- Both the absolute position of a token and its relative distance to other tokens are considered important for high performance on a specific task.
▫ Temporal Positional Encoding (T-PE)
Temporal Positional Encoding (T-PE) is specifically designed for sequences with varying length and time intervals like time series or sensor data.
It modifies the positional encoding to be a function of the actual time t_i rather than just the token index i.
In T-PE, the positional information is generated by an RNN or a function h that takes the time interval Δt = t_i - t_j as input.
Then, the formula modifies the Key by adding the temporal embedding such that:

a_{i, j} = q_i (k_j + r_{i, j})^T / √d_k

where r_{i, j} = h(t_i - t_j) is the temporal embedding derived from the time difference.
Similar to TUPE, consider the two tokens “car“ (Query i=2) and “dog“ (Key j=6):
Time difference: Assign hypothetical timestamps to each token, such that t_2 = 10 (seconds) and t_6 = 15 (seconds), giving a time difference of Δt = -5.
Temporal embedding: The temporal embedding r is generated from this difference: r_{-5} = h(-5) = [0.1, -0.1, 0.2, -0.2] (assuming the dimensionality of the model is four).
Modified key: The Key vector is updated: k_6' = k_6 + r_{-5} = [0.6, 0.4, 0.7, 0.3], assuming the original k_6 = [0.5, 0.5, 0.5, 0.5].
Final score: The T-PE attention score is computed: a_{2,6} = (0.1 × (0.6 + 0.4 + 0.7 + 0.3)) / 2 = 0.1, assuming q_2 = [0.1, 0.1, 0.1, 0.1].
The low score reflects that a 5-second gap is less relevant for the specific query.
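The key modification and score above can be sketched with the hypothetical numbers from the example (h(-5) and the vectors are assumptions for illustration, not learned values):

```python
import math

def tpe_score(q, k, r, d_k=4):
    # modified key: k' = k + r, where r = h(t_i - t_j) is the temporal embedding
    k_mod = [ke + re for ke, re in zip(k, r)]
    return sum(qe * ke for qe, ke in zip(q, k_mod)) / math.sqrt(d_k)

q = [0.1, 0.1, 0.1, 0.1]
k = [0.5, 0.5, 0.5, 0.5]
r = [0.1, -0.1, 0.2, -0.2]  # hypothetical h(-5) from the example
a = tpe_score(q, k, r)  # ~0.1, as in the worked example
```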
Pros:
Can directly incorporate real-world time intervals between tokens.
Highly suitable for sequences with varying time intervals and non-uniform sampling.
Cons:
Requires external time information for each token.
Implementation is more complex, involving an RNN or function to generate the temporal embedding.
Best when:
- Specifically for time series or sensor data where the temporal gap between elements is irregular and relevant to the task.
▫ Rotary Positional Embeddings (RoPE)
Rotary Positional Embeddings (RoPE) leverages an absolute PE method to naturally induce relative position dependency when used within the standard dot-product attention mechanism.
RoPE works by modifying the linear projection steps for the Query and Key vectors in the attention layer, ensuring that the dot product of two RoPE-encoded vectors only depends on their relative distance.
First, RoPE assigns a unique rotation matrix R_m to the Query at position m to compute the rotated Query vector q_m:

q_m = R_m W_Q x_m

where W_Q is the weight matrix for the Query vector.
RoPE also assigns a unique rotation matrix R_n to the Key at position n to compute the rotated Key vector k_n:

k_n = R_n W_K x_n

where W_K is the weight matrix for the Key vector.
The dot product of the rotated q_m and k_n then depends only on the relative distance m - n:

q_m^T k_n = (R_m W_Q x_m)^T (R_n W_K x_n) = x_m^T W_Q^T R_{n-m} W_K x_n

Because the final attention score is influenced only by the distance m - n, the model effectively learns relative position, even though the mechanism uses rotation matrices tied to absolute positions.
Similar to the other examples, consider the two tokens “car“ (Query m = 2) and “dog“ (Key n = 6):
For a vector q_m and its absolute position m, the rotation is applied to pairs of dimensions (i, i+1):
Since d_{model} = 4, there are two frequency parameters:
θ_0 for dimensions 0 and 1 and
θ_1 for dimensions 2 and 3.
Let us use simple initial vectors: q = [0.1, 0.2, 0.3, 0.4] and k = [0.5, 0.6, 0.7, 0.8], and simplified frequencies: θ_0 = 10° and θ_1 = 5°.
The rotated Query q_2 for “car“ (m=2) is computed:
Rotation 20° (= θ_0 × 2):
q'_0 = 0.1(0.94) - 0.2(0.34) = 0.094 - 0.068 = 0.026
q'_1 = 0.1(0.34) + 0.2(0.94) = 0.034 + 0.188 = 0.222
Rotation 10° (= θ_1 × 2):
q'_2 = 0.3(0.98) - 0.4(0.17) = 0.294 - 0.068 = 0.226
q'_3 = 0.3(0.17) + 0.4(0.98) = 0.051 + 0.392 = 0.443
→ q_2 = [0.026, 0.222, 0.226, 0.443]
The rotated Key k_6 for “dog“ (n=6) is computed:
Rotation 60° (θ_0 × 6):
k'_0 = 0.5(0.5) - 0.6(0.866) = 0.25 - 0.52 = -0.27
k'_1 = 0.5(0.866) + 0.6(0.5) = 0.433 + 0.3 = 0.733
Rotation 30° (θ_1 × 6):
k'_2 = 0.7(0.866) - 0.8(0.5) = 0.606 - 0.4 = 0.206
k'_3 = 0.7(0.5) + 0.8(0.866) = 0.35 + 0.693 = 1.043
→ k_6 = [-0.27, 0.733, 0.206, 1.043]
Lastly, the attention score is the dot product of the rotated vectors, scaled by √d_k:
a_{2,6} = (q_2 k_6^T) / √d_k = 0.665 / 2 = 0.3325
The final score 0.3325 implicitly contains the positional dependency on the relative distance n - m = 4, even though RoPE only uses absolute rotation matrices for positions 2 and 6.
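A small sketch in plain Python reproduces the worked example and checks RoPE's defining property: shifting both positions by the same offset leaves the score unchanged. The θ values are the simplified ones above, not real RoPE frequencies:

```python
import math

def rotate(v, pos, thetas):
    # rotate each consecutive dimension pair (2k, 2k+1) by pos * theta_k
    out = []
    for pair, theta in zip(range(len(v) // 2), thetas):
        x, y = v[2 * pair], v[2 * pair + 1]
        ang = pos * theta
        out += [x * math.cos(ang) - y * math.sin(ang),
                x * math.sin(ang) + y * math.cos(ang)]
    return out

def score(q, k, m, n, thetas, d_k=4):
    # dot product of the rotated vectors, scaled by sqrt(d_k)
    qm, kn = rotate(q, m, thetas), rotate(k, n, thetas)
    return sum(a * b for a, b in zip(qm, kn)) / math.sqrt(d_k)

thetas = [math.radians(10), math.radians(5)]  # simplified theta_0, theta_1 from the example
q = [0.1, 0.2, 0.3, 0.4]
k = [0.5, 0.6, 0.7, 0.8]
a_26 = score(q, k, 2, 6, thetas)  # ~0.33, matching the worked example
```

Shifting both tokens by, say, ten positions (m = 12, n = 16) yields the same score, because only the angle difference (n - m)·θ survives the dot product.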
Pros:
Generalizes well to long sequences (extrapolation) because the relative distance is preserved by the constant angle difference.
Computationally efficient, as it does not require adding extra parameters or modifying the attention score structure like RPE does.
Cons:
Requires complex implementation of trigonometric functions and rotation operations.
The standard 1D RoPE struggles with higher-dimensional data like 2D/3D images without modification.
Best when:
Standard choice in modern LLMs like Llama.
Long-context handling and extrapolation to unseen sequence lengths are critical.
PE in Action: Benchmarking Extrapolation Capabilities
In this section, I’ll train a Transformer with each of the following PE methods and compare their performance on synthetic data:
FAPE,
LPE,
RPE, and
RoPE
First, I’ll define each PE using Python classes and functions.
◼ Defining FAPE
The FAPE class computes the absolute positional encoding and stores the vector as a non-trainable state of the model.
The forward method simply adds the PE values corresponding to the current batch to the input sequence.
import math
import torch as t
import torch.nn as nn


# fixed absolute positional encoding (fape)
class FAPE(nn.Module):
    def __init__(self, d_model: int, max_seq_len: int, t_dtype = t.float32):
        super().__init__()

        # compute pe (vector w/ shape (N, d_model))
        pe = t.zeros(max_seq_len, d_model, dtype=t_dtype)
        position = t.arange(0, max_seq_len, dtype=t_dtype).unsqueeze(1)
        div_term = t.exp(t.arange(0, d_model, 2, dtype=t_dtype) * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = t.sin(position * div_term)  # even indices
        pe[:, 1::2] = t.cos(position * div_term)  # odd indices

        # add extra dimension (1) to broadcast the pe table across all examples in a batch during the forward pass
        pe = pe.unsqueeze(0)  # shape - (1, N, d_model)

        # fape is not learnable, so store the pe table as a constant model state
        self.register_buffer('pe', pe)

    def forward(self, x: t.Tensor) -> t.Tensor:
        # add pe w.r.t. the current batch to the input seq x
        current_seq_len = x.size(1)
        current_pe = self.pe[:, :current_seq_len, :]  # shape - (1, current_seq_len (<= N), d_model)  # type: ignore
        x = x + current_pe
        return x
◼ Defining LPE
The LPE class, on the other hand, registers the positional encoding as a trainable parameter instead of a fixed buffer:
import torch as t
import torch.nn as nn


# learned positional embedding (lpe)
class LPE(nn.Module):
    def __init__(self, d_model: int, max_seq_len: int):
        super().__init__()

        # initialize pe as a learnable model parameter with random vals (shape - (N, d_model))
        self.pe = nn.Parameter(t.randn(max_seq_len, d_model))

    def forward(self, x: t.Tensor) -> t.Tensor:
        # add pe to the input seq x
        return x + self.pe[:x.size(1), :].unsqueeze(0)
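A toy contrast of the two designs' extrapolation behavior (illustrative plain Python, not the actual modules above): a learned table has a hard size limit, while a formula produces a vector for any position:

```python
import math
import random

max_trained_len = 4
d_model = 2

# lpe-style: a fixed-size learned table -- positions beyond max_trained_len simply don't exist
lpe_table = [[random.random() for _ in range(d_model)] for _ in range(max_trained_len)]

def lpe(pos):
    return lpe_table[pos]  # IndexError for pos >= max_trained_len

# fape-style: a formula -- any position, seen or unseen, maps to a vector
def fape(pos):
    return [math.sin(pos / 10000 ** (0 / d_model)),
            math.cos(pos / 10000 ** (0 / d_model))]

vec = fape(100)  # fine: this length was never "seen"
try:
    lpe(100)  # fails: outside the learned table
    extrapolated = True
except IndexError:
    extrapolated = False
```

This is exactly the failure mode benchmarked below: LPE must allocate its table up to the longest sequence it will ever see, while FAPE does not.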
◼ Defining RPE & RoPE
Both RPE and RoPE use the nn.Identity() module from PyTorch as a pass-through positional encoder, so we don’t need to define a specific class; their positional information is injected inside the attention layers instead.
For RoPE, I’ll define the compute_freqs_cis function to compute the rotation factors and the apply_rope function to rotate the token embeddings:
import torch as t


# rope util func 1 - apply rotation
def apply_rope(x: t.Tensor, freqs_cis: t.Tensor) -> t.Tensor:
    # reshape the input seq x
    x_ = x.float().reshape(x.shape[:-1] + (-1, 2))  # shape - (B, N, D_V // 2, 2), assuming D_V is even

    # take the real/imaginary components from x
    x_real, x_imag = x_.unbind(-1)  # shape - (B, N, D_V // 2)

    # truncate the pre-computed frequency table (freqs_cis) to match the actual length of the input seq x, and cast dtype
    freqs_cis_ = freqs_cis[:x.size(1), :].float()  # shape - (N, D_V // 2, 2)
    cos, sin = freqs_cis_.unbind(-1)  # shape - (N, D_V // 2)

    # apply rotation
    x_out_real = x_real * cos - x_imag * sin  # shape - (B, N, D_V // 2)
    x_out_imag = x_real * sin + x_imag * cos  # shape - (B, N, D_V // 2)

    # stack the results
    x_out = t.stack([x_out_real, x_out_imag], dim=-1).flatten(2)  # shape - (B, N, D_V)
    return x_out.type_as(x)


# rope util func 2 - compute rotation factors
def compute_freqs_cis(dim: int, end: int, theta: float = 10000.0) -> t.Tensor:
    freqs = 1.0 / (theta ** (t.arange(0, dim, 2)[: (dim // 2)].float() / dim))  # shape - (dim // 2,)
    t_ = t.arange(end, device=freqs.device, dtype=t.float32)
    freqs = t_.outer(freqs)  # shape - (end, dim // 2)
    freqs_cis = t.stack([t.cos(freqs), t.sin(freqs)], dim=-1)  # shape - (end, dim // 2, 2)
    return freqs_cis
◼ Instantiate Transformers
Then, I’ll define the Transformer class, which takes PE_type as an argument, and instantiate the model:
import math
from functools import partial
from typing import Literal

import torch as t
import torch.nn as nn
from transformers import AutoTokenizer


class Transformer(nn.Module):
    def __init__(
        self,
        d_model: int,
        d_V: int,
        H: int,
        max_seq_len: int,
        PE_type: Literal['fape', 'lpe', 'rpe', 'rope'] = 'fape',
        n_layers: int = 2,
        device: t.device = DEVICE,
    ):
        super().__init__()

        self.device = device
        self.d_model = d_model
        self.d_V = d_V
        self.H = H
        self.PE_type = PE_type
        self.max_seq_len = max_seq_len
        self.tokenizer = AutoTokenizer.from_pretrained('t5-small', model_max_length=max_seq_len)
        self.vocab_size = len(self.tokenizer)

        # pe
        match PE_type:
            case 'fape': self.positional_encoder = FAPE(d_model, max_seq_len)
            case 'lpe': self.positional_encoder = LPE(d_model, max_seq_len)
            case 'rpe': self.positional_encoder = nn.Identity()
            case 'rope':
                assert d_V % 2 == 0, 'd_V must be even for RoPE'
                self.positional_encoder = nn.Identity()
            case _:
                raise ValueError(f'unknown pe type: {PE_type}')
        self.freqs_cis = None if PE_type != 'rope' else compute_freqs_cis(d_V, max_seq_len).to(device)

        # encoder
        self.input_token_embedding = nn.Embedding(self.vocab_size, d_model)
        self.dropout_encoder = nn.Dropout(0.1)
        encoder_layer = partial(EncoderLayer, d_model=d_model, d_V=d_V, H=H, PE_type=self.PE_type, freqs_cis=self.freqs_cis)
        self.encoder = nn.ModuleList([encoder_layer() for _ in range(n_layers)])  # stacking

        # decoder
        self.target_token_embedding = nn.Embedding(self.vocab_size, d_model)
        self.dropout_decoder = nn.Dropout(0.1)
        decoder_layer = partial(DecoderLayer, d_model=d_model, d_V=d_V, H=H, PE_type=self.PE_type, freqs_cis=self.freqs_cis)
        self.decoder = nn.ModuleList([decoder_layer() for _ in range(n_layers)])

        # final head
        self.linear_head = nn.Linear(d_model, self.vocab_size)
        self.to(device, t.float32)

    def forward(self, input_ids: t.Tensor, Y_true: t.Tensor) -> t.Tensor:
        tgt_seq_len = Y_true.size(1) - 1  # input for decoder - Y_true[:, :-1]

        # encoder
        input_tokens = self.input_token_embedding(input_ids) * math.sqrt(self.d_model)
        enc_input = self.dropout_encoder(self.positional_encoder(input_tokens))  # apply pe
        encoder_output = enc_input  # forward pass
        for enc_layer in self.encoder: encoder_output = enc_layer(encoder_output)

        # decoder
        tgt_input_ids = Y_true[:, :-1]  # shifted right
        tgt = self.target_token_embedding(tgt_input_ids) * math.sqrt(self.d_model)
        dec_input = self.dropout_decoder(self.positional_encoder(tgt))  # apply pe
        tgt_mask = self._generate_square_subsequent_mask(tgt_seq_len).to(self.device)  # look-ahead mask for self-attention
        decoder_output = dec_input  # forward pass
        for dec_layer in self.decoder: decoder_output = dec_layer(decoder_output, encoder_output, tgt_mask)

        # final output
        logits = self.linear_head(decoder_output)
        return logits


# instantiate (renamed from `t` to avoid shadowing the torch alias)
model = Transformer(
    d_model=D_MODEL,
    d_V=D_V,
    H=H,
    max_seq_len=max_seq_len + extrapolate_len,
    PE_type=pe_type,  # either 'fape', 'lpe', 'rpe', 'rope'
    n_layers=N_LAYERS,
    device=DEVICE
)
◼ Examining Extrapolation Capabilities
The Transformer models were trained for 100 epochs on a synthetic dataset with a sequence length of 512, and the results were validated via extrapolation to a sequence length of 10,240.
Other hyperparameters are set as follows:
The dimensionality of the model: d_model = 128
The dimensionality of the Value vector: D_V = 32
The number of attention heads: H = 4 (ensuring H * D_V = d_model)
The number of identical encoder and/or decoder layers: N_LAYERS = 2
The size of the vocabulary that the model knows and can generate: VOCAB_SIZE = 1000
The batch size (the number of sequences processed together in a single forward/backward pass): BATCH_SIZE = 10
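For reference, the perplexity reported below is the exponential of the average per-token cross-entropy loss, so a near-zero loss maps to a perplexity near 1 (a minimal sketch):

```python
import math

def perplexity(per_token_losses):
    # perplexity = exp(mean cross-entropy loss per token)
    return math.exp(sum(per_token_losses) / len(per_token_losses))

ppl = perplexity([0.005, 0.006, 0.006])  # tiny losses -> perplexity just above 1
```

This is how an extrapolation perplexity of roughly 1.0057 should be read: the model is almost certain of every next token on the long, unseen sequences.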
◼ Results
All four positional encoding methods demonstrated excellent extrapolation capability, but RoPE achieved the best overall performance, excelling in both training fit and long-context generalization.
▫ Training Performance
RoPE performed the best on the trained context (N=512). It achieved the lowest perplexity by a significant margin (5,657.39), indicating the most effective learning and best fit to the training data.
LPE performed the worst on the trained context, resulting in the highest perplexity.

Figure F. Training convergence comparison (Ave. losses over 100 epochs)
▫ Extrapolation Performance (Out-of-Context)
All four methods successfully maintained an extremely low extrapolation perplexity, confirming their ability to process sequences 20 times longer than those they were trained on (512 vs 10,240).
RoPE also slightly outperformed the others in extrapolation, achieving the lowest extrapolation perplexity of 1.0057, which suggests the best generalization and prediction accuracy on the unseen long sequences.
All code is available in my GitHub repository.
Conclusion
Positional Encoding (PE) is a key element of the Transformer architecture, crucial for injecting the sequential order into the order-agnostic self-attention mechanism.
In our experiments, we observed that RoPE (Rotary Positional Embedding) slightly outperformed other methods like LPE, FAPE, and RPE, demonstrating superior performance in both training fit (lower perplexity) and long-context extrapolation capability.
Moving forward, these results suggest that PE methods designed for inherent long-context generalization, such as RoPE, are essential for developing large language models capable of reliably processing and understanding extensive input sequences.
◼ References
Positional Encoding in Transformer-Based Time Series Models: A Survey (arXiv:2502.12370)
RoFormer: Enhanced Transformer with Rotary Position Embedding (arXiv:2104.09864)
Continue Your Learning
If you enjoyed this blog, these related entries will complete the picture:
The Definitive Guide to LLM Fine-Tuning: Objectives, Mechanisms, and Hardware
LLM Decoding Strategies: A Guide to Algorithms and Sampling Methods
Tokenization Strategies for LLM Applications
Regularizing LLMs with Kullback-Leibler Divergence
Grouped Query Attention (GQA): Balancing LLM Quality and Speed
Related Books for Further Understanding
These books cover a wide range of theories and practices, from fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Generative AI with Transformers and Diffusion Models
Share What You Learned
Kuriko IWAI, "Beyond the Window: Benchmarking Positional Encoding (PE) for LLM Extrapolation" in Kernel Labs
https://kuriko-iwai.com/positional-encoding
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll in the AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.




