Grouped Query Attention (GQA): Balancing LLM Quality and Speed

Finding the perfect balance between MHA quality and MQA inference throughput

Deep Learning | Data Science | Python | LLM

By Kuriko IWAI


Table of Contents

Introduction
Attention Mechanism and Grouped-Query Attention
What is Grouped-Query Attention
Benefits of GQA - Comparing with MHA and MQA
Finding the Grouping Strategies
Optimal Strategies
Step 1. Defining Attention Layers
Step 2. Defining the Transformer
Step 3. Finding Optimal Grouping for Query Heads
Step 4. Instantiating Standard, MHA, MQA, and GQA Transformers
Step 5. Performance Evaluation
Results
Conclusion
Reference

Introduction

Transformers are widely used across AI domains, most notably as the core architecture of Large Language Models (LLMs).

However, their computational demands are massive, primarily because they spend a significant amount of time and energy accessing the Key-Value (KV) cache of previously processed tokens.

This process degrades inference throughput, especially when generating long, detailed responses.

Grouped-Query Attention (GQA) offers a smart solution to drastically cut down on memory usage and speed up inference.

In this article, I’ll examine a cost-effective method for optimizing the GQA configuration, comparing its performance with counterparts such as Multi-Head Attention (MHA) and Multi-Query Attention (MQA).

Attention Mechanism and Grouped-Query Attention

The attention mechanism is a key component of the Transformer architecture that weighs the importance of each word (token) in the input embedding.

The diagram below shows how the attention mechanism works, leveraging the Q-K-V mechanism:

Figure A. The Q-K-V mechanism in the encoder (left) and Transformer architecture (right) (Created by Kuriko IWAI)


In the attention layer (red boxes in Figure A), the Query (Q, orange box), Key (K, grey box), and Value (V, blue box) vectors are generated by a linear transformation of the input embedding X, such that:

Q = XW_{Q}, K = XW_{K}, and V = XW_{V}

where W_{Q}, W_{K}, and W_{V} are learnable weight matrices.

These vectors hold specific information:

  • Q: A query for the current token, describing what information it looks for in the input to understand the full context,

  • K: The labeled information of each token that Q will match against, and

  • V: The actual content (information payload) of each token,

and they are used to compute the attention output Z:

Z = \text{Softmax} \left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) \cdot V

where Q, K, V are the Q, K, V vectors and d_k is the dimension of the Key vector (scaling factor).

This final result is passed on to the feed forward layer to generate a context-rich representation of the input.
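
As a quick illustration of this computation, here is a minimal, self-contained sketch of scaled dot-product attention in PyTorch. The tensor shapes and dimension values are illustrative only and are not tied to the experiments later in this article.

import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: [batch, seq_len, d_k]
    d_k = K.size(-1)
    scores = Q @ K.transpose(-1, -2) / d_k ** 0.5  # [batch, seq_len, seq_len]
    A = torch.softmax(scores, dim=-1)              # attention weights (each row sums to 1)
    return A @ V                                   # [batch, seq_len, d_k]

X = torch.randn(2, 10, 512)                        # toy input embedding [batch, seq_len, d_model]
W_Q, W_K, W_V = (torch.nn.Linear(512, 64) for _ in range(3))
Z = scaled_dot_product_attention(W_Q(X), W_K(X), W_V(X))
print(Z.shape)  # torch.Size([2, 10, 64])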

What is Grouped-Query Attention

Grouped-Query Attention (GQA) is a type of attention mechanism designed to reduce memory bandwidth requirements and latency during the decoding phase.

The diagram below illustrates how GQA works:

Figure B. How GQA works (Created by Kuriko IWAI)


In Figure B, after receiving the input embedding X, the network creates eight heads and applies a linear transformation to generate eight corresponding Query vectors (Q(0) to Q(7)).

Then, it groups the Query vectors into four groups (Groups j=0 to j=3).

Each group computes its attention output (Z(0) to Z(3)) while sharing the same Key and Value projections within the group to reduce the KV cache size.

Lastly, the network concatenates the attention outputs from all groups and performs a linear transformation to generate the final output O.

Mathematically, the process is generalized by using i-th head (i ∈ {0, 1, …, H-1}) and j-th group (j ∈ {0, 1, …, G-1}):

Z_i^{(G_j)} = \operatorname{softmax}\left(\frac{Q_i (K_{G_j})^{\top}}{\sqrt{d_k}}\right) V_{G_j}

where:

  • Z_i^{(G_j)}: The attention output of the i-th query head, which belongs to the j-th group,

  • Q_i: The Query vector corresponding to the i-th query head,

  • K_{G_j}: The shared Key vector of the j-th group,

  • d_k: The dimension of the Key vector, and

  • V_{G_j}: The Value vector of the j-th group.

The GQA layer concatenates the outputs Z_i^{(G_j)} of all individual query heads and performs a linear transformation:

O_{GQA} = \text{concat}\left(Z_1^{(G_j)}, Z_2^{(G_k)}, \dots, Z_H^{(G_l)}\right) W^{O}

where:

  • O_{GQA}: The final output of the GQA layer (which is passed on to the feed-forward layer), and

  • W^O: The weight matrix of the output projection layer.
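
To make the grouping concrete, below is a hedged sketch that computes GQA with fused projections and torch.repeat_interleave, so that each group's single Key/Value head is shared by its query heads. This is a common way to express GQA in code; it is not the exact class-based implementation used later in this article.

import torch

B, N, d_model = 2, 10, 512
H, G, d_k = 8, 4, 64                                # 8 query heads grouped into 4 KV groups

W_Q = torch.nn.Linear(d_model, H * d_k)             # H query projections (fused into one layer)
W_K = torch.nn.Linear(d_model, G * d_k)             # only G Key projections
W_V = torch.nn.Linear(d_model, G * d_k)             # only G Value projections
W_O = torch.nn.Linear(H * d_k, d_model)             # output projection W^O

X = torch.randn(B, N, d_model)
Q = W_Q(X).view(B, N, H, d_k).transpose(1, 2)       # [B, H, N, d_k]
K = W_K(X).view(B, N, G, d_k).transpose(1, 2)       # [B, G, N, d_k]
V = W_V(X).view(B, N, G, d_k).transpose(1, 2)

# share each group's Key/Value head across its H // G query heads
K = K.repeat_interleave(H // G, dim=1)              # [B, H, N, d_k]
V = V.repeat_interleave(H // G, dim=1)

scores = Q @ K.transpose(-1, -2) / d_k ** 0.5
Z = torch.softmax(scores, dim=-1) @ V               # per-head attention outputs [B, H, N, d_k]
O = W_O(Z.transpose(1, 2).reshape(B, N, H * d_k))   # concatenate heads, then apply W^O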

Benefits of GQA - Comparing with MHA and MQA

GQA is a variant of the standard attention mechanism: Multi-Head Attention (MHA).

The diagram below compares GQA with MHA and another variant, Multi-Query Attention (MQA):

Figure C. Comparison of attention mechanisms - Left: MHA, Center: GQA, Right: MQA (Created by Kuriko IWAI)


MHA (left in Figure C) uses full sets of Key and Value vectors for each query head.

For example, in Figure C, MHA has eight query heads, so it has eight Key vectors and eight Value vectors.

As these query heads represent different subspaces (aspects) of the input sequence, MHA is the most expressive of the three in enriching contextual understanding.

However, its primary challenge is massive memory and compute consumption:

  • High computational cost with many matrix multiplications,

  • Large memory footprint required for backpropagation, and

  • High memory consumption during inference (decoding) due to its massive Key-Value (KV) cache size.

On the other hand, MQA (right in Figure C) is the opposite extreme: it saves memory and computation by sharing a single Key vector and a single Value vector across all eight query heads.

This significantly reduces the KV cache size, and hence the memory bandwidth requirements, making inference extremely fast.

But MQA compromises attention quality, since all query heads are forced to share the same Key and Value data.

GQA balances the trade-off between performance and speed by grouping multiple query heads.

In Figure B, GQA groups the eight query heads into four groups of two, halving the KV cache size compared to MHA.

This helps GQA achieve faster inference (low latency in generating tokens) while retaining near-MHA attention quality.

The table below summarizes the key characteristics:

                        MHA        GQA               MQA
Number of KV heads      H          G (1 < G < H)     1
Memory footprint        High       Mid               Low
Inference speed         Slow       Mid               Fast
Focus                   Quality    Quality & Speed   Speed

Table A. Attention mechanism comparison (Created by Kuriko IWAI)
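
As a rough back-of-the-envelope check on the memory footprint row, the snippet below estimates per-layer KV cache sizes for the three variants. The batch size, sequence length, head dimension, and fp16 precision are illustrative assumptions, not the settings of the later experiment.

def kv_cache_bytes(n_kv_heads: int, d_k: int = 64, batch: int = 8, seq_len: int = 2048, bytes_per_elem: int = 2) -> int:
    # Key and Value caches: two tensors of shape [batch, n_kv_heads, seq_len, d_k]
    return 2 * batch * n_kv_heads * seq_len * d_k * bytes_per_elem

H, G = 8, 4
for name, kv_heads in [("MHA", H), ("GQA", G), ("MQA", 1)]:
    print(f"{name}: {kv_cache_bytes(kv_heads) / 1e6:.1f} MB per layer")
# MHA: 33.6 MB, GQA: 16.8 MB, MQA: 4.2 MB per layer under these assumptions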

GQA offers a middle-ground solution for Transformers, but a question remains: what is the optimal grouping strategy?

I’ll explore this in the next section.

Finding the Grouping Strategies

Whether GQA achieves better performance and faster inference depends heavily on the grouping strategy.

In this section, I’ll optimize the query head grouping of the GQA Transformer, and then compare its performance and inference speed against three other Transformers:

  • Standard Transformer (as a baseline),

  • MHA Transformer, and

  • MQA Transformer.

All Transformers use an encoder-decoder architecture and are trained on an English-French translation dataset.

Optimal Strategies

For Performance

Merging query heads can compromise GQA performance compared to its MHA counterpart.

To mitigate this, I’ll use Procrustes analysis to enhance the similarity among the heads in each group, following the research paper Aligning Attention Heads Before Merging Them (arXiv:2412.20677v2).

This method computes cosine similarity scores between the Value representations of every pair of heads and selects the grouping with the highest total similarity score.

For Inference Speed

The optimal grouping strategy for inference speed depends on whether the system is:

  • Memory-bound: Waiting for data from the KV cache, common on budget GPUs or long sequences, or

  • Compute-bound: Waiting for calculation to finish, common on high-end GPUs or very large batches.

In this analysis, I’ll systematically sweep across:

  • Sequence length: N = 512 to 2,048 tokens and

  • Batch size: B = 1 to 16 requests,

while the memory bandwidth is set constant (no hardware diversity) to see which configurations maximize the benefits of the GQA Transformer.
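
As a rough intuition for the memory-bound case, the sketch below estimates the arithmetic intensity (attention FLOPs per byte of KV cache read) of one decode step for each variant. The constants (H = 8 query heads, d_k = 64, fp16) are assumptions for illustration only; higher intensity means the step is less likely to be starved by memory bandwidth.

def arithmetic_intensity(n_kv_heads: int, n_q_heads: int = 8, d_k: int = 64, bytes_per_elem: int = 2) -> float:
    flops_per_position = 4 * n_q_heads * d_k                    # QK^T and AV for one cached position
    bytes_per_position = 2 * n_kv_heads * d_k * bytes_per_elem  # read that position's Key and Value once
    return flops_per_position / bytes_per_position

for name, kv_heads in [("MHA", 8), ("GQA (G=4)", 4), ("MQA", 1)]:
    print(f"{name:10s} FLOPs per byte = {arithmetic_intensity(kv_heads):.1f}")
# MHA = 1.0, GQA = 2.0, MQA = 8.0: fewer KV heads raise arithmetic intensity,
# which matters most for long sequences and large batches (the memory-bound regime)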

Step 1. Defining Attention Layers

The first step is to define the four attention layers:

  • StandardAttention (baseline),

  • MHA,

  • MQA (Inheriting the StandardAttention class), and

  • GQA (Inheriting the StandardAttention class).

Each class generates the final output O, as shown in Figure C.

from typing import Optional
import torch as t
import torch.nn as nn
import torch.nn.functional as F


class StandardAttention(nn.Module):
    def __init__(self, d_model: int = 512, d_V: int = 64, projection: bool = True) -> None:  # d_model: dim of the model input, d_V: dim of the value vector
        super().__init__()
        self.d_model = d_model
        self.d_V = d_V
        self.projection = projection
        self.scaling_factor = 1 / t.sqrt(t.tensor(self.d_V, dtype=t.float32))  # 1 / sqrt(d_k)
        self.query = nn.Linear(in_features=self.d_model, out_features=self.d_V, bias=True)
        self.key = nn.Linear(in_features=self.d_model, out_features=self.d_V, bias=True)
        self.value = nn.Linear(in_features=self.d_model, out_features=self.d_V, bias=True)
        self.output_proj = nn.Linear(in_features=self.d_V, out_features=self.d_model, bias=False)  # output projection layer

    def self_attention(self, Q: t.Tensor, K: t.Tensor, V: t.Tensor, mask: Optional[t.BoolTensor] = None) -> t.Tensor:
        K_T = t.transpose(K, -1, -2)  # [b, D_V, N]
        S = t.matmul(Q, K_T) * self.scaling_factor  # attention scores [b, N, N]
        if mask is not None:
            S = S.masked_fill(mask == 0, float('-inf'))  # apply mask (if any)
        A = t.softmax(S, dim=-1)  # attention weights
        Z = t.matmul(A, V)
        return Z

    def forward(self, x: t.Tensor, mask: Optional[t.BoolTensor] = None) -> t.Tensor:
        Q = self.query(x)  # [b, N, D_V]
        K = self.key(x)  # [b, N, D_V]
        V = self.value(x)  # [b, N, D_V]
        Z = self.self_attention(Q, K, V, mask=mask)  # [b, N, D_V]
        O = self.output_proj(Z) if self.projection else Z  # [b, N, d_model]
        return O


class MHA(nn.Module):
    def __init__(self, d_model: int = 512, d_V: int = 64, H: int = 8) -> None:  # H: total number of heads
        super().__init__()
        # input features: H * d_V, output features: d_model
        self.proj = nn.Linear(in_features=H * d_V, out_features=d_model, bias=False)
        self.multihead = nn.ModuleList([StandardAttention(d_model, d_V, False) for _ in range(H)])

    def forward(self, x: t.Tensor, mask: Optional[t.BoolTensor] = None) -> t.Tensor:
        Z = t.cat([head(x, mask) for head in self.multihead], dim=2)
        O = self.proj(Z)
        return O


class MQA(StandardAttention):
    def __init__(self, d_model: int = 512, d_V: int = 64, n_queries: int = 8) -> None:
        super().__init__(d_model, d_V)
        self.n_queries = n_queries
        self.proj = nn.Linear(in_features=d_V * n_queries, out_features=d_model, bias=False)
        delattr(self, 'query')  # remove the inherited single query projection

        self.queries = nn.ModuleList([nn.Linear(in_features=d_model, out_features=d_V, bias=True) for _ in range(n_queries)])
        self.key = nn.Linear(in_features=d_model, out_features=d_V, bias=True)  # single shared Key projection
        self.value = nn.Linear(in_features=d_model, out_features=d_V, bias=True)  # single shared Value projection

    def forward(self, x: t.Tensor, mask: Optional[t.BoolTensor] = None) -> t.Tensor:
        K = self.key(x)
        V = self.value(x)
        Z = t.cat([self.self_attention(query(x), K, V, mask) for query in self.queries], dim=2)
        O = self.proj(Z)
        return O


class GQA(StandardAttention):
    def __init__(self, d_model: int = 512, d_V: int = 64, n_groups: int = 4, n_queries: int = 2) -> None:  # n_queries: query heads per group
        super().__init__(d_model, d_V)
        delattr(self, 'query')  # remove inherited projections
        delattr(self, 'key')
        delattr(self, 'value')
        # each group is an MQA block sharing one Key/Value projection
        self.groups = nn.ModuleList([MQA(d_model=d_model, d_V=d_V, n_queries=n_queries) for _ in range(n_groups)])
        self.proj = nn.Linear(in_features=d_model * n_groups, out_features=d_model, bias=False)

    def forward(self, x: t.Tensor, mask: Optional[t.BoolTensor] = None) -> t.Tensor:
        Z = t.cat([head(x, mask) for head in self.groups], dim=2)
        O = self.proj(Z)
        return O
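
Before wiring these modules into a full Transformer, a quick shape check (my own addition, not part of the original pipeline) confirms that every variant maps a [B, N, d_model] input back to [B, N, d_model] with the default hyperparameters:

x = t.randn(2, 16, 512)  # [B=2, N=16, d_model=512]
for layer in [StandardAttention(), MHA(), MQA(), GQA()]:
    print(layer.__class__.__name__, tuple(layer(x).shape))
# every variant prints (2, 16, 512)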

Step 2. Defining the Transformer

Next, I’ll define the Transformer class.

The Transformer class contains an EncoderLayer and a DecoderLayer, both of which use one of the attention layers defined in Step 1.

For the GQA module, the class also has the find_optimal_n_groups method, which aggregates the cosine similarity scores to select the best query head grouping.

import math
import random
import torch as t
import torch.nn as nn

# PositionalEncoding, EncoderLayer, DecoderLayer, DEVICE, and the helper methods
# _create_similarity_matrix / _compute_grouping_score are defined elsewhere in the repository.

class Transformer(nn.Module):
    def __init__(
            self,
            attention_module: StandardAttention | MHA | MQA | GQA,
            d_model: int,
            max_seq_len: int,
            tokenizer,
            device: t.device = DEVICE,
        ):
        super().__init__()

        # device
        self.device = device

        # dim, model name
        self.d_model = d_model
        self.attention_module = attention_module
        self.model_name = self.attention_module.__class__.__name__

        # tokenizer
        self.tokenizer = tokenizer
        self.vocab_size = len(self.tokenizer)
        self.pad_token_id = self.tokenizer.pad_token_id

        # encoder
        self.input_token_embedding = nn.Embedding(self.vocab_size, d_model)
        self.positional_encoder = PositionalEncoding(d_model=self.d_model, max_len=max_seq_len)
        self.dropout_encoder = nn.Dropout(0.1)  # applied after the embeddings
        self.encoder = EncoderLayer(attention_module=attention_module, d_model=d_model)

        # decoder
        self.target_token_embedding = nn.Embedding(self.vocab_size, d_model)
        self.dropout_decoder = nn.Dropout(0.1)
        self.decoder = DecoderLayer(attention_module=attention_module, d_model=d_model)

        # final linear/softmax head for logits
        self.linear_head = nn.Linear(d_model, self.vocab_size)

    def forward(self, input_ids: t.Tensor, Y_true: t.Tensor) -> tuple[t.Tensor, t.Tensor]:
        ## encoder
        # token embedding, scaling
        input_tokens = self.input_token_embedding(input_ids) * math.sqrt(self.d_model)
        # add positional encodings w/ dropout
        input_tokens = self.dropout_encoder(self.positional_encoder(input_tokens))
        # forward pass
        output_encoder = self.encoder(input_tokens)

        ## decoder
        # decoder input - Y_true shifted right to exclude the last item
        tgt_input_tokens = Y_true[:, :-1]
        # token embedding, scaling
        tgt = self.target_token_embedding(tgt_input_tokens) * math.sqrt(self.d_model)
        # add positional encodings w/ dropout
        tgt = self.dropout_decoder(self.positional_encoder(tgt))
        # forward pass
        output_decoder = self.decoder(tgt, output_encoder)

        ## final output
        # linear head to project d_model to vocab size
        logits = self.linear_head(output_decoder)
        # target labels - Y_true shifted left (T1, T2, ..., Tn, EOS)
        target_labels = Y_true[:, 1:]
        return logits, target_labels

    def simulate_grouping(self, H: int, D: int, sim_matrix: t.Tensor, max_iter: int) -> tuple[float, list[list[int]]]:
        G = H // D
        heads = list(range(H))
        random.shuffle(heads)
        current_grouping = [heads[i * D:(i + 1) * D] for i in range(G)]
        score_current = self._compute_grouping_score(current_grouping, sim_matrix)
        score_best = score_current
        best_grouping = current_grouping

        for _ in range(max_iter):
            # select two different groups for swapping
            g1_idx, g2_idx = random.sample(range(G), 2)

            # select a head index within each group
            h1_idx, h2_idx = random.randrange(D), random.randrange(D)

            # copy the current grouping
            new_grouping = [g[:] for g in current_grouping]

            # swap the two heads
            h1 = new_grouping[g1_idx][h1_idx]
            h2 = new_grouping[g2_idx][h2_idx]
            new_grouping[g1_idx][h1_idx] = h2
            new_grouping[g2_idx][h2_idx] = h1

            # calculate the score of the new grouping
            score_new = self._compute_grouping_score(new_grouping, sim_matrix)

            # accept the swap if the score improves or stays equal (greedy random-swap search)
            if score_new >= score_current:
                current_grouping = new_grouping
                score_current = score_new
                if score_new > score_best:
                    score_best = score_new
                    best_grouping = new_grouping

        return score_best, best_grouping

    def find_optimal_n_groups(self, H: int, V_caches: t.Tensor, max_iter_per_G: int) -> dict:
        print(f"... calculating similarity matrix for {H} heads ...")

        # create the head-to-head similarity matrix
        sim_matrix = self._create_similarity_matrix(V_caches.to(self.device))

        # list possible group sizes D (divisors of H)
        D_options = [d for d in range(1, H + 1) if H % d == 0]

        # start searching
        best_overall_score = -float('inf')
        optimal_N_GROUPS = H
        optimal_N_QUERIES = 1
        best_grouping_A = None
        print(f"... testing possible group sizes D's ...")
        for D in D_options:
            G = H // D

            # case 1. G = 1 - MQA - a single group
            if G == 1:
                grouping = [list(range(H))]
                score = self._compute_grouping_score(grouping, sim_matrix)
                print(f"   -> G={G} (D={D}): MQA with a single group. score = {score:.4f}")

            # case 2. D = 1 - MHA - trivial grouping, score is 0.0
            elif D == 1:
                score = 0.0
                grouping = [[i] for i in range(H)]
                print(f"   -> G={G} (D={D}): MHA with max groups. score = {score:.4f}")

            # case 3. G >= 2 - GQA - search for the best grouping
            else:
                score, grouping = self.simulate_grouping(H=H, D=D, sim_matrix=sim_matrix, max_iter=max_iter_per_G)
                print(f"   -> G={G} (D={D}): Best Sim Score (SA) = {score:.4f}")

            # update the optimal config
            if score > best_overall_score:
                best_overall_score = score
                optimal_N_GROUPS = G
                optimal_N_QUERIES = D
                best_grouping_A = grouping

        return {
            'optimal_N_GROUPS': optimal_N_GROUPS,
            'optimal_N_QUERIES': optimal_N_QUERIES,
            'max_score': best_overall_score,
            'best_grouping': best_grouping_A
        }
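
The helper methods referenced above, _create_similarity_matrix and _compute_grouping_score, live in the full repository. A minimal sketch of what they might look like is shown below (written as standalone functions; in the class they are methods on Transformer), assuming the similarity matrix holds pairwise cosine similarities between the flattened Value caches of the heads and the grouping score sums the within-group pairwise similarities. With one head per group (the MHA case), this score is 0.0, matching the hard-coded value in the code above.

import itertools
import torch as t
import torch.nn.functional as F

def _create_similarity_matrix(V_caches: t.Tensor) -> t.Tensor:
    # V_caches: [H, d_V, N] -> flatten each head's Value cache and compute
    # pairwise cosine similarity, producing an [H, H] matrix
    flat = F.normalize(V_caches.flatten(start_dim=1), dim=-1)  # [H, d_V * N]
    return flat @ flat.T

def _compute_grouping_score(grouping: list[list[int]], sim_matrix: t.Tensor) -> float:
    # total pairwise similarity of heads assigned to the same group
    score = 0.0
    for group in grouping:
        for i, j in itertools.combinations(group, 2):
            score += sim_matrix[i, j].item()
    return score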

Step 3. Finding Optimal Grouping for Query Heads

Next, using the Transformer class, I’ll find the optimal grouping for the GQA Transformer:

INITIAL_N_GROUPS = 4
INITIAL_N_QUERIES = 2
MAX_ITER_PER_G = 300
D_V = 64
H = 80
D_MODEL = D_V * H
SEQ_LEN = N_TEST  # N_TEST, N, TOKENIZER, and DEVICE are defined earlier in the full script

# instantiate the GQA Transformer
t_gqa = Transformer(
    attention_module=GQA(d_model=D_MODEL, d_V=D_V, n_groups=INITIAL_N_GROUPS, n_queries=INITIAL_N_QUERIES),
    d_model=D_MODEL,
    max_seq_len=N,
    tokenizer=TOKENIZER,
    device=DEVICE
)

# simulate V caches - shape: [H, d_V, N]
V_caches_init = t.randn(H, D_V, SEQ_LEN) * 0.1

# introduce a strong grouping bias to ensure a non-trivial solution exists
V_caches_init[0:4] += t.randn(1, D_V, SEQ_LEN) * 0.8  # bias heads 0-3 together
V_caches_init[4:8] += t.randn(1, D_V, SEQ_LEN) * 0.8  # bias heads 4-7 together

# execute the optimization method
optimal_result = t_gqa.find_optimal_n_groups(
    H=H, V_caches=V_caches_init, max_iter_per_G=MAX_ITER_PER_G
)

Results

The best grouping turned out to be 40 groups of two query heads each:

  • Optimal number of groups: 40

  • Optimal group size: 2

  • Maximum similarity score: 0.0319

  • Best grouping (head indices):
    [[13, 23], [70, 30], [9, 0], [79, 10], [62, 28], [38, 46], [47, 54], [44, 3], [1, 33], [76, 45], [49, 65], [71, 15], [26, 16], [78, 7], [77, 55], [36, 75], [73, 18], [14, 4], [67, 34], [2, 40], [48, 66], [5, 6], [60, 64], [17, 50], [8, 52], [35, 63], [53, 27], [56, 41], [68, 58], [51, 57], [42, 24], [39, 11], [12, 43], [21, 32], [20, 61], [69, 74], [72, 29], [59, 25], [19, 37], [22, 31]]
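
For the next step, the returned dictionary can be unpacked into the variable names the instantiation code in Step 4 expects; this small bridging snippet is my addition, using the keys returned by find_optimal_n_groups:

optimal_n_groups = optimal_result['optimal_N_GROUPS']    # 40
optimal_n_queries = optimal_result['optimal_N_QUERIES']  # 2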

Step 4. Instantiating Standard, MHA, MQA, and GQA Transformers

Using the optimal grouping results, I’ll instantiate the Transformers:

from transformers import AutoTokenizer

TOKENIZER = AutoTokenizer.from_pretrained("t5-small", model_max_length=N)  # pre-trained tokenizer
D_MODEL = 512
D_V = 64
H = 8
N_QUERIES = 8

def instantiate_transformers(N: int) -> list[Transformer]:
    t_baseline = Transformer(
        attention_module=StandardAttention(d_model=D_MODEL, d_V=D_V),
        d_model=D_MODEL,
        max_seq_len=N,
        tokenizer=TOKENIZER,
    )
    t_mha = Transformer(
        attention_module=MHA(d_model=D_MODEL, d_V=D_V, H=H),
        d_model=D_MODEL,
        max_seq_len=N,
        tokenizer=TOKENIZER,
    )
    t_mqa = Transformer(
        attention_module=MQA(d_model=D_MODEL, d_V=D_V, n_queries=N_QUERIES),
        d_model=D_MODEL,
        max_seq_len=N,
        tokenizer=TOKENIZER,
    )
    t_gqa = Transformer(
        attention_module=GQA(
            d_model=D_MODEL,
            d_V=D_V,
            n_groups=optimal_n_groups,   # optimal result from the previous step
            n_queries=optimal_n_queries  # optimal result from the previous step
        ),
        d_model=D_MODEL,
        max_seq_len=N,
        tokenizer=TOKENIZER,
        device=DEVICE
    )
    return [t_baseline, t_mha, t_mqa, t_gqa]

models = instantiate_transformers(N=N)
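
As a quick sanity check on model size (my own addition, not part of the original pipeline), the parameter counts of the four instantiated Transformers can be compared before training:

for model in models:
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{model.model_name}: {n_params:,} parameters")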

Step 5. Performance Evaluation

Lastly, I’ll train the four Transformers across different sequence lengths and batch sizes, and measure each model’s performance by its validation loss as well as its inference speed:

# sequence length options
N_VALUES = [512, 1026, 2048]

# batch size options
B_VALUES = [1, 4, 8, 16]

NUM_VAL_SAMPLES = 512
results = list()

# test_inference_speed, train_model, train_data_loader, and val_data_loader are defined elsewhere in the repository
for N in N_VALUES:
    for B in B_VALUES:
        print(f"\n[N={N}, B={B}]")

        models = instantiate_transformers(N=N)
        for model in models:
            model_name = model.model_name

            # metric 1. inference speed (latency in ms)
            try:
                latency = test_inference_speed(model, N=N, B=B)
                print(f"  -> {model_name} latency: {latency:,.3f} ms")
            except Exception as e:
                latency = float('nan')
                print(f"  -> {model_name} error - {e}")

            # metric 2. performance (validation loss)
            try:
                val_loss = train_model(model=model, train_data_loader=train_data_loader, val_data_loader=val_data_loader, num_epochs=10)
                print(f"  -> {model_name} min. val loss: {val_loss:,.4f}")
            except Exception as e:
                val_loss = float('nan')
                print(f"  -> {model_name} error - {e}")

            # store the result
            results.append(dict(model=model_name, N=N, B=B, latency_ms=latency, val_loss=val_loss))
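
The test_inference_speed and train_model helpers are defined in the full repository. For reference, a minimal sketch of how such a latency measurement might look is shown below, assuming it times full forward passes over random token ids and averages over a few repetitions; the actual helper may differ:

import time
import torch as t

def test_inference_speed_sketch(model: Transformer, N: int, B: int, n_repeats: int = 5) -> float:
    # random token ids standing in for a tokenized batch (illustrative only)
    input_ids = t.randint(0, model.vocab_size, (B, N))
    Y_true = t.randint(0, model.vocab_size, (B, N))
    model.eval()
    with t.no_grad():
        start = time.perf_counter()
        for _ in range(n_repeats):
            model(input_ids, Y_true)
        elapsed = time.perf_counter() - start
    return elapsed / n_repeats * 1000  # average latency in ms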

Results

Fewer, larger groups (closer to MQA) are better for long sequence lengths, in terms of both performance and speed.

More, smaller groups (closer to MHA) are better for short sequence lengths, as the parallelism benefit of having more KV heads matters when the memory bottleneck is less severe.

The entire source code is available in my GitHub repository.

Conclusion

Grouped-Query Attention (GQA) offers a critical balance between the high quality of MHA and the inference speed of MQA, making it essential for efficient Large Language Model (LLM) decoding.

In the experiment, I found the cost-optimal GQA configuration using Procrustes-style cosine similarity analysis, effectively reducing decoding latency.

To further refine efficiency for real-world LLM tasks with long contexts, testing across greater hardware diversity (from budget to high-end GPUs) would help find configurations tailored to specific hardware.

Reference

Aligning Attention Heads Before Merging Them, arXiv:2412.20677v2, https://arxiv.org/abs/2412.20677


Related Books for Further Understanding

These books cover a wide range of theory and practice, from the fundamentals to the PhD level.

  • Linear Algebra Done Right

  • Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

  • Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

  • Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

  • Hands-On Generative AI with Transformers and Diffusion Models

Share What You Learned

Kuriko IWAI, "Grouped Query Attention (GQA): Balancing LLM Quality and Speed" in Kernel Labs

https://kuriko-iwai.com/grouped-query-attention


Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.