The Reasoning Wall: A Comparative Benchmark of Llama 3.2 vs. Qwen 3
Stress-testing multi-hop logic chains using Multi-LogiEval, Process Benchmarking, and Thought-to-Output Ratios
By Kuriko IWAI

Table of Contents
Introduction
Understanding Multi-Hop Reasoning and Depth
Mapping Task Complexity to Reasoning Depth
How to Measure LLMs’ Reasoning Horizon
Wrapping Up

Introduction
Large Language Models (LLMs) have demonstrated proficiency in multi-step problem solving, yet their ability to navigate complex logical chains remains a challenge.
Assessing how models handle reasoning depth is critical because the metric directly impacts their reliability and performance in multi-hop reasoning tasks.
In this article, I’ll explore the core mechanics of reasoning depth and provide a framework for measuring it effectively based on specific task goals, assessing Llama 3.2 and Qwen 3 across four stress-test metrics:
Multi-LogiEval,
Process Benchmarking,
Robustness Coefficient, and
Thought-to-output Ratio.
Understanding Multi-Hop Reasoning and Depth
A multi-hop problem is a complex task that requires the model to navigate a logical chain, where each "hop" represents an inference necessary to bridge the gap between the initial query and the final answer.
The following diagram illustrates how reasoning depth scales with the number of logical operations required for a mathematical task:

Kernel Labs | Kuriko IWAI | kuriko-iwai.com
Figure A. Examples of the multi-hop problem and reasoning depth (Created by Kuriko IWAI)
Reasoning depth is defined by the logical distance between the input and the conclusion.
So, the left problem in Figure A has a reasoning depth of two (d = 2), while the right one has a depth of four (d = 4).
Regardless of the task type, multi-hop problems share several fundamental characteristics:
Layered information: Whether the task involves simple fact-linking or mathematical logic, the final solution is hidden beneath layers of sub-problems.
Logical stamina: Multi-hop problems require the model to handle a high reasoning depth without hallucinations or errors.
Scalability of complexity: Low depth problems are suitable for basic comparisons or direct retrieval, while high depth ones are more sophisticated reasoning tasks, where a single logical lapse halfway through the chain would invalidate the entire conclusion.
Mapping Task Complexity to Reasoning Depth
Understanding the relationship between the task at hand and its required reasoning steps is key to selecting model architectures.
Here, I classify major LLM tasks into three tiers:
Shallow (d < 3),
Moderate (d = 3–5), and
Deep (d = 6+)
and assess engineering requirements in each tier.
◼ The Taxonomy of Reasoning Depth
The following table categorizes tasks by the three tiers:

Figure B: Reasoning depth (d) across task domains (Created by Kuriko IWAI)
Each cell in Figure B includes a primary task, its engineering requirements, and major LLM variants available in the market.
As depth increases from shallow (left in Figure B) to deep (right), the model shifts from simple pattern matching to multi-step executive function.
◼ Shallow Reasoning: Direct Pattern Mapping
Shallow reasoning tasks have only up to two reasoning steps (d = 1 or d = 2).
At this depth, the model performs a one-and-done transformation.
It simply maps input features directly to an output label without iterative planning or internal deliberation.
Mechanism:
- Single-pass inference focusing on local feature extraction.
Examples (d = 1):
Tabular: Predict whether a transaction is fraudulent based on a single threshold.
NLP: Identify whether a customer review is “positive” or “negative”.
Vision: Label an image as “cat” or “dog”.
The Model Stack:
Models optimized for pattern matching, such as BERT, DistilBERT, or Gemini Flash, are the best fit.
For tabular data, classical trees like LightGBM can outperform LLMs due to lower latency and higher precision on structured data.
◼ Moderate Reasoning: Contextual Assembly
Moderate reasoning tasks have a reasoning depth of up to five and require the model to synthesize disparate information points into a coherent structure.
The model must maintain intermediate states in its context window to ensure the final output is logically consistent with the beginning.
Mechanism:
- Contextual synthesis and multi-point attention.
Examples (d = 5):
Tabular: Multi-horizon demand forecasting where the model must account for seasonality, holidays, and trend shifts.
NLP: Summarize a 50-page legal transcript while maintaining each speaker's intent.
Multimodal: Extract key financial data from a complex PDF invoice, and calculate the tax variance.
The Model Stack:
- Models that excel at contextualization, such as GPT-4o, Llama 3.1, or Claude 3.5 Sonnet, are a good fit.
◼ Deep Reasoning: System 2 Thinking
Deep reasoning tasks have a reasoning depth of six or more with sequential dependency, where the validity of the current step depends entirely on the logical integrity of the previous steps.
This requires branching logic, error detection, and backtracking.
Mechanism:
- Test-Time Compute and Chain-of-Thought (CoT) reinforcement.
Examples (d = 10):
Tabular: A humanoid robot calculating the torque required to lift an unknown object while balancing on uneven terrain.
NLP: Debug a microservices architecture where the error is distributed across three different files and two languages.
Vision: Reconstruct 3D Neural Radiance Fields (NeRF) from sparse 2D video frames.
The Model Stack:
Models utilizing System 2 thinking (thinking through the logic before committing to an answer), such as OpenAI o1/o3 and DeepSeek-R1.
Mid-tier LLMs without self-correction loops tend to fail due to logical hallucination midway through the chain.
◼ Engineering Requirements for Deep Reasoning
To successfully deploy models for deep reasoning, the infrastructure must shift from simple inference to complex state management.
Below are primary examples:
▫ Test-Time Computing
Unlike shallow models, deep reasoners scale their performance with thinking time.
By generating thousands of internal CoT tokens, models like DeepSeek-R1 can verify hypotheses and discard incorrect logic paths.
▫ KV Cache Management
Deep tasks demand massive Key-Value (KV) Cache capacity to interpret the long context and reasoning chain.
Efficient cache compression or PagedAttention is necessary to keep the model from losing track of earlier context.
▫ Feedback Loop and Self-Correction
Deep reasoner models act as their own critics.
For example, in a coding task, the model simulates the execution, identifies a potential exception, and rewrites the logic block internally.
Key Takeaway:
The goal here is to match the inference budget to the reasoning depth.
Using a reasoner model with its own correction loop (built for d = 10) on a sentiment task (d = 1) wastes significant compute.
How to Measure LLMs’ Reasoning Horizon
A suitable measurement of the reasoning depth depends on specific goals.
I break down four metrics to identify exactly where a model’s logic collapses, and conclude with how to choose a model based on the test results.
Multi-LogiEval: How much does the accuracy decay as the reasoning depth increases?
Process Benchmarking: At which reasoning step does the model hallucinate?
Robustness Coefficient: Does the model leverage first principles or just pattern matching?
Thought-to-Output Ratio: How deeply does the model think before committing to the final answer?
Multi-LogiEval
Multi-LogiEval is the most common stress-test framework designed to identify the reasoning depth horizon of LLMs.
It measures step-aware accuracy decay, which quantifies exactly where the model’s logical coherence collapses when increasing reasoning depth.
◼ The Core Metric
The method plots accuracy as a function of reasoning depth (d) and reveals the reasoning wall where the model’s accuracy at d (Acc@d) drops below a critical floor (e.g., 50%).
Acc@d is the ratio of correct answers to total questions:

Acc@d = (number of correct answers at depth d) / (total number of questions at depth d) (Eq. 1)

Then, the decay rate calculates how much the accuracy drops as the depth increases:

Decay rate = Acc@d8 - Acc@d2 (Eq. 2)

where Acc@d8 and Acc@d2 indicate the accuracy of the model at depth d = 8 and d = 2 respectively.
When this rate falls below the threshold, the corresponding depth is considered the model's reasoning wall.
For instance, computing the decay rates from d = 2 to the following depths can reveal the reasoning wall:
d = 2: 0
d = 3: - 20pt
d = 4: - 30pt
d = 5: - 32pt
d = 6: - 40pt
d = 7: - 55pt → Reasoning wall (Below the threshold)
d = 8: - 68pt
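This wall-detection sweep is easy to script. A minimal sketch, using the illustrative decay figures above and assuming a -50 pt decay threshold (the function name is my own):

```python
def find_reasoning_wall(acc_by_depth, baseline_depth=2, threshold=-50):
    """Return the first depth whose decay rate (in percentage points,
    relative to the baseline depth) falls below the threshold."""
    base = acc_by_depth[baseline_depth]
    for d in sorted(acc_by_depth):
        decay = acc_by_depth[d] - base  # negative = accuracy loss vs. baseline
        if decay < threshold:
            return d, decay
    return None, 0

# Illustrative Acc@d values (percent) matching the decay figures above
acc = {2: 99, 3: 79, 4: 69, 5: 67, 6: 59, 7: 44, 8: 31}
wall, decay = find_reasoning_wall(acc)  # wall = 7, decay = -55
```

The function returns the first depth whose decay crosses the threshold, which is exactly how the reasoning wall at d = 7 is identified above.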
◼ Advantages
Precise horizon mapping: Identifies the exact reasoning wall.
Diverse logical rules: Covers over 30+ inference rules, ensuring a holistic assessment of a model's logical horizon.
Zero-shot focus: Measures a model's intrinsic reasoning capacity rather than its ability to copy examples from a few-shot prompt.
◼ Limitations
Cliff-edge results: Performance drops sharply at the wall, and the benchmark offers little diagnostic data on how to fix the model once it passes its depth limit.
Synthetic nature: Because the logic chains are systematically generated to control for depth, the problems can sound artificial.
Arithmetic confusion: A model might hit the wall not because it lost the logic, but because it made a simple calculation slip (e.g., an arithmetic error at Step 4 cascades to the end, even though the model followed the logic throughout).
◼ Primary Use Cases
Model comparison & selection: When choosing between two models for a complex task like an AI Research Assistant, Multi-LogiEval reveals which one has the logical stamina for long-chain thinking.
Bottleneck identification in architecture: Multi-LogiEval can reveal if increasing model parameters actually increases reasoning depth. For example, it helps determine whether a 70B model is smarter than an 8B model, or whether it just has a larger memory of facts.
Risk assessment for autonomous agents: If an AI agent is tasked with a 10-step workflow, Multi-LogiEval predicts the probability of the agent failing at a specific step like Step 7 before deployment.
◼ Implementation Strategy: The Incremental Complexity Ladder
To implement this method, I’ll first create isomorphic logic tasks: multiple tasks that share an identical core concept, with logical interrupters injected as the chain extends.
I’ll take a mathematical logic problem as an example and demonstrate how it works.
▫ Isomorphic Logic Tasks
Baseline (d = 2): Shallow reasoning task. Establishes the model's floor accuracy for basic instruction following.
Task A (d = 5): Moderate reasoning task.
Task B (d = 8): Deep reasoning task.
I’ll then draft prompts for each task, combining Chain-of-Thought (CoT) prompting to identify the specific failure node:
Think step-by-step and number each step of your calculation below.
▫ The Baseline:
A train starts with 50 passengers.
10 people get off, and 5 people get on.
How many people are on the train now?
Logical steps (d = 2):
Start: 50
Step 1: 50 - 10 = 40
Step 2: 40 + 5 = 45
The correct answer: 45.
▫ Task A:
A train starts with 50 passengers.
At the first stop, 10 people get off, and 5 people get on.
At the second stop, 20% of the remaining people get off.
How many people are on the train now?
Logical steps (d = 5):
Start: 50
Step 1: 50 - 10 = 40
Step 2: 40 + 5 = 45
Step 3: 45 * 0.2 = 9
Step 4: 45 - 9 = 36
The correct answer: 36.
Complexity:
Includes a 20% reduction at Steps 3 and 4.
Requires the model to maintain the running state while performing operations that aren't simple addition.
▫ Task B (d = 8):
A train starts with 50 passengers.
At the first stop, 10 people get off, and 5 people get on.
At the second stop, 20% of the remaining people get off.
At the third stop, the number of passengers doubles.
At the last stop, 20 people get off, 10 people get on.
But the station staff decides that only 30 people can still be on the train.
Now, how many people must get off the train?
Logical steps:
Start: 50
Step 1: 50 - 10 = 40
Step 2: 40 + 5 = 45
Step 3: 45 * 0.2 = 9
Step 4: 45 - 9 = 36
Step 5: 36 × 2 = 72
Step 6: 72 - 20 = 52
Step 7: 52 + 10 = 62
Step 8: 62 - 30 = 32
The correct answer: 32.
Complexity:
Includes a doubling effect at Step 5 and a final capacity constraint at Step 8.
Requires high sequential dependency where a single error at Step 3 cascades into a total failure.
Notes:
A visual reasoning ladder for multimodal tasks
For multimodal models like GPT-4o or Llama 3.2-Vision, implement a visual reasoning ladder:
Image: A picture of a messy desk with various objects.
d=2 query: "How many pens are on the desk?"
d=5 query: "If I move the red pen to the blue mug, and the blue mug is currently next to the stapler, will the red pen be on the left or right of the stapler?"
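To keep the benchmark loop generic, the ladder tasks can be stored in a small registry keyed by depth. A sketch using the task texts and answers above (the TASKS dictionary and build_prompt helper are my own conventions):

```python
COT_INSTRUCTION = "Think step-by-step and number each step of your calculation below."

# Isomorphic logic ladder: reasoning depth -> task text and ground-truth answer
TASKS = {
    2: {  # Baseline
        "prompt": ("A train starts with 50 passengers. 10 people get off, and "
                   "5 people get on. How many people are on the train now?"),
        "answer": 45,
    },
    5: {  # Task A
        "prompt": ("A train starts with 50 passengers. At the first stop, 10 people "
                   "get off, and 5 people get on. At the second stop, 20% of the "
                   "remaining people get off. How many people are on the train now?"),
        "answer": 36,
    },
    8: {  # Task B
        "prompt": ("A train starts with 50 passengers. At the first stop, 10 people "
                   "get off, and 5 people get on. At the second stop, 20% of the "
                   "remaining people get off. At the third stop, the number of "
                   "passengers doubles. At the last stop, 20 people get off, 10 people "
                   "get on. But the station staff decides that only 30 people can still "
                   "be on the train. Now, how many people must get off the train?"),
        "answer": 32,
    },
}

def build_prompt(d: int) -> str:
    """Prepend the CoT instruction to the task at depth d."""
    return f"{COT_INSTRUCTION}\n\n{TASKS[d]['prompt']}"
```

The benchmark loop then only needs the depth d to fetch both the prompt and the expected answer.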
◼ Quantitative Analysis
Then, I’ll run N=100 iterations over each task to generate a statistically significant decay curve.
import re
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = 'Qwen/Qwen3-8B'

# config tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# config model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="cpu"
)

# config llm pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# task setup: the baseline task (d = 2) from the ladder above
d = 2
answers = {2: 45, 5: 36, 8: 32}
prompt = (
    "Think step-by-step and number each step of your calculation below. "
    "Finish with the line 'The correct answer: <number>'.\n"
    "A train starts with 50 passengers. 10 people get off, and 5 people get on. "
    "How many people are on the train now?"
)

# format the prompt with the chat template
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# iteration
n_iterations = 100
success_count = 0
for i in range(n_iterations):
    # call generator
    output_list = generator(
        formatted_prompt,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7
    )

    # extract the final answer from the generated text
    output = output_list[0]['generated_text']
    answer_part = output.replace(formatted_prompt, "")
    matches = re.findall(r"The correct answer: (\d+)", answer_part)

    # count success
    if matches and int(matches[-1]) == answers[d]:
        success_count += 1

# calc acc_d score
acc = success_count / n_iterations
Developer Tip:
In production benchmarking, I always set temperature=0 to ensure reproducibility, but for stress-testing stochastic failure, I intentionally sweep from 0.7 to 1.2.
The table below summarizes the results:
| Metric | Qwen 3 (8B) | Llama 3.2 (8B) |
|---|---|---|
| Acc@d=2 | 99% | 100% |
| Acc@d=5 | 62% | 94% |
| Acc@d=8 | 24% | 65% |
| Decay Rate | 24 - 99 = -75 points | 65 - 100 = -45 points |
| Reasoning Wall (t = 0.5) | d = 6 | d > 8 |
Table 1. Multi-LogiEval stress test results (Created by Kuriko IWAI)
The results highlight the strong reasoning depth of Llama 3.2 compared to its counterpart, Qwen 3.
Regarding the metric Acc@d (Eq. 1), both models are nearly perfect at simple, 2-step logic. But at 8 steps, Qwen 3 fails the test, whereas Llama 3.2 maintains a relatively high score (65%).
This results in a severe decay rate (Eq. 2) for Qwen 3, while Llama 3.2 maintains a stable decay rate, showing much higher logical endurance than its counterpart.
Notes:
A reasoning wall is common in mid-tier models.
They tend to show a significant performance drop at d = 5 (e.g., falling from 80% to 40%), while reasoner models like OpenAI's o3 or DeepSeek-R1 can maintain a flat line much longer.
◼ Scaling the Evaluation
To eliminate prompt luck, I’ll generate 20 variations of the logic ladder by, for example, changing "passengers on a train" to "inventory in a warehouse" or "packets in a network buffer".
Then, I’d observe the model’s failure patterns, either consistent or random:
Consistent step failure: For example, if the model fails at d = 4 across all 20 variations, it indicates a structural deficit in deductive state management because Step 4 handles the deduction.
Stochastic failure: If the failure points are random, it suggests a context window noise or attention drift issue where the model fails to consistently retrieve relevant information across iterations of the same prompt.
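A simple way to produce those surface variations without touching the logic is template substitution over the entity nouns. A sketch with three illustrative domains (the template is abbreviated to the first two hops, i.e., Task A's chain):

```python
# Shared logical skeleton; only the surface vocabulary changes per domain
TEMPLATE = (
    "A {container} starts with 50 {item}. "
    "At the first stop, 10 {item} {leave}, and 5 {item} {arrive}. "
    "At the second stop, 20% of the remaining {item} {leave}. "
    "How many {item} remain?"
)

DOMAINS = [
    {"container": "train", "item": "passengers", "leave": "get off", "arrive": "get on"},
    {"container": "warehouse", "item": "crates", "leave": "are shipped out", "arrive": "are received"},
    {"container": "network buffer", "item": "packets", "leave": "are dropped", "arrive": "are queued"},
]

variants = [TEMPLATE.format(**d) for d in DOMAINS]
# Every variant shares the same logical chain, so the answer (36) is invariant.
```

Because the skeleton is fixed, any accuracy gap between variants is attributable to surface phrasing, not to the logic.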
Developer Tip:
Stochastic failures can point to the underlying mechanics of the logic horizon:
“Lost in the Middle” attention: The model fails to maintain focus on information placed in the center of a long context window.
→ Solutions: Prompt engineering (move the critical data to the beginning or end, repeat key instructions, make the model summarize the context before performing the task).
Token competition: Irrelevant tokens win the attention weight, especially in high-noise environments.
→ Solutions: Prompt engineering (use strict formats like ### DATA ### or constraints like DO NOT …), few-shot sampling.
Softmax variance: Minor floating-point variations can cause the model to miss a logical connection it previously caught.
→ Solutions: Greedy decoding, lowering the temperature (T = 0).
Process Benchmarking
Process Benchmarking uses a larger LM as an LLM-as-a-Judge to grade individual steps in the CoT.
◼ The Core Metric
It computes the stepwise validity score such that:

Validity score = (number of valid steps) / (total number of steps) (Eq. 3)

and assesses the quality of each step.
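Once the judge has labeled each step, the validity score and the capped reasoning depth follow mechanically. A minimal sketch, assuming the judge discards every step after the first invalid one (the function name is my own):

```python
def stepwise_validity(verdicts):
    """verdicts: list of 0/1 judge labels, one per reasoning step.
    Steps after the first invalid step are discarded, since they
    build on a hallucinated premise."""
    capped = []
    for v in verdicts:
        if v == 0:
            break
        capped.append(v)
    depth = len(capped)            # effective reasoning depth
    score = depth / len(verdicts)  # valid steps / total steps
    return score, depth

# Qwen 3's Task B run: valid through Step 4, hallucination at Step 5
score, depth = stepwise_validity([1, 1, 1, 1, 0, 0, 0, 0])  # score = 0.5, depth = 4
```

The cap is what makes the metric strict: a lucky correct step after a hallucination does not count.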
◼ Advantages
Precision debugging: Identifies the exact failure node.
Granular insights: Provides a clear picture of a model's effective reasoning depth.
Eliminates lucky guesses: Distinguishes between a model that actually understands the logic and one that hallucinates its way to the right answer.
Reward signal accuracy: Provides a much better signal for RLHF by rewarding specific logical behaviors.
◼ Limitations
Computational expense: Running a judge model to evaluate every single step of a 10-step reasoning chain can be 10x more expensive and slower than standard benchmarking.
Judge reliability: The results are only as good as the judge. If the Judge model isn't significantly more capable than the model being tested, it may miss subtle logical fallacies.
Context window issues: For extremely long chains, the judge may lose track of earlier constraints, leading to recursive hallucinations where the evaluation itself is flawed.
◼ Primary Use Cases
Safety-critical reasoning: The "how" matters more than the final answer in medical or engineering fields because every step of the calculation must be verified for safety. Process benchmarking ensures no invisible errors exist in the middle of the logic.
Educational technology (AI tutoring): To be an effective teacher, an AI must identify exactly where a student went wrong. Process benchmarking allows the system to provide feedback like: "Step 1 through 3 are correct, but you made a calculation error in Step 4 when multiplying the percentages."
Logic-Heavy Workflows: Ideal for auditing legal analysis or financial forecasting where a single logical step (e.g., misinterpreting a tax code clause) invalidates the entire multi-page report.
Benchmark Transparency: Process benchmarking can detect data contamination, showing invalid middle steps even for the correct final answer, a red flag that the model simply memorized the correct answer from its training data.
◼ Implementation
The following table summarizes the results of Task B (d = 8):
| Reasoning Step in Task B | Logic Type | Qwen 3 (8B) | Llama 3.2 (8B) |
|---|---|---|---|
| Step 1 (Sub) | Simple Arithmetic | 1 | 1 |
| Step 2 (Add) | Simple Arithmetic | 1 | 1 |
| Step 3 (Perc) | Percentage Calculation | 1 | 1 |
| Step 4 (Sub) | State Tracking | 1 | 1 |
| Step 5 (Mult) | Scaling Logic | 0 (Hallucinated 36x2=82) | 1 |
| Step 6 (Sub) | Sequential Update | 0 | 1 |
| Step 7 (Add) | Sequential Update | 0 | 1 |
| Step 8 (Comp) | Final Constraint | 0 | 1 |
| Final Result | Correctness | Fail | Pass |
| Validity Score (Eq. 3) |  | 0.50 (4/8) | 1.00 (8/8) |
Table 2. Step-wise validity heatmap of Qwen 3 and Llama 3.2 over Task B (Created by Kuriko IWAI)
The LLM-as-a-Judge evaluates whether the response at each step is valid (1) or invalid (0).
Because Qwen 3 hallucinated at Step 5, the judge capped its reasoning depth at 4.
Although Qwen 3 might produce correct answers by chance after Step 5, the judge disregards them because they are built on a hallucinated premise.
Consequently, Qwen 3 earns only 4 out of 8, failing the test because its final validity score fell below the required threshold.
In contrast, Llama 3.2 maintained accuracy throughout, earning a perfect score of 1.00 and passing the test.
Robustness Coefficient
Robustness coefficient measures how brittle the reasoning is.
This method is based on the assumption that a model relying too much on pattern matching will likely fail when facing non-critical changes.
Taking Task B as an example, the method first changes a non-essential detail, like “passengers on the train” to “inventory in the warehouse”, and creates a perturbation test:
| Canonical Task: Task B Original | Isomorphic Variant: Task B Warehouse Version |
|---|---|
| Start with 50 passengers. | Start with 50 crates. |
| 10 off, 5 on. | 10 shipped out, 5 received. |
| 20% of remaining leave. | 20% of remaining are damaged/removed. |
| Total doubles. | Stock is replenished (doubled). |
| 20 off, 10 on. | 20 shipped out, 10 received. |
| Constraint: Only 30 can remain. | Constraint: Max capacity is 30. |
Table 3. The perturbation test with Task B isomorphs (Created by Kuriko IWAI)
◼ The Core Metric
Then, the method assesses whether the model’s logic holds up by calculating the robustness coefficient (ρ) such that:

ρ = [ (1/n) Σ_{i=1..n} score(V_i) ] / score(C) (Eq. 4)

where:
ρ: The robustness coefficient,
C: The canonical task (the original task),
V_i: The i-th isomorphic variant, a transformed version of the original task,
n: The total number of isomorphic variants tested, and
score(·): The model's success rate on a task (each run scores 1 for success and 0 for failure, averaged over runs).
The goal of Eq. 4 is to distinguish between true reasoning and pattern matching by running a perturbation test over n (e.g., n = 10) isomorphic variants.
Higher ρ indicates that the model's logic is more robust to surface perturbations:
ρ = 1.0: The model is perfectly robust; its logic is invariant to phrasing.
0.5 < ρ < 1.0: The model is brittle. It understands the logic but is easily distracted by attention drift.
ρ = 0.0: The model likely memorized the canonical answer during training and cannot actually reason through the steps.
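Eq. 4 reduces to a few lines of code once the canonical and variant success rates are measured. A sketch using the illustrative accuracies from Table 4 (the function name is my own):

```python
def robustness_coefficient(score_canonical, scores_variants):
    """Mean variant success rate divided by the canonical success rate.
    Scores are success rates in [0, 1]."""
    if score_canonical == 0:
        return 0.0  # nothing to be robust about
    n = len(scores_variants)
    return (sum(scores_variants) / n) / score_canonical

# Illustrative values from the perturbation test (accuracies as fractions)
rho_qwen = robustness_coefficient(0.24, [0.08])   # ~0.33 -> brittle
rho_llama = robustness_coefficient(0.78, [0.74])  # ~0.95 -> robust
```

With more variants, the list simply grows; the coefficient remains a ratio of average variant performance to canonical performance.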
For example, the perturbation test in Table 3 results as follows:
| Model | C = Task B Original | V_1 = Task B Warehouse | Robustness Coefficient ρ |
|---|---|---|---|
| Qwen 3 (8B) | 24% Acc | 8% Acc | 0.33 (Brittle) |
| Llama 3.2 (8B) | 78% Acc | 74% Acc | 0.95 (Robust) |
Table 4. The results of the perturbation test (Created by Kuriko IWAI)
The low coefficient of Qwen 3 indicates that the model is likely pattern matching where it can recognize train-style word problems but fails when the same logic is applied to a warehouse context.
Its authentic reasoning depth is lower than its benchmarked depth.
On the other hand, the high coefficient of Llama 3.2 suggests first-principles logic.
The model treats the entities (passengers vs. crates) as abstract variables, demonstrating that it is actually executing the logical steps rather than just predicting the next most likely token in a familiar puzzle format.
◼ Advantages
Exposes benchmark memorization: The most effective way to detect data contamination.
Measures authentic reasoning depth: By averaging scores across multiple isomorphic versions, the coefficient reveals a true reasoning depth, reflecting how the model will perform in unpredictable real-world scenarios.
Language & domain agnostic: Applied to anything from medical diagnosis logic to Python coding tasks, making it a universal stress test for reasoning depth.
◼ Limitations
Hard to standardize: Creating perfect perturbations is challenging, as some variations might trigger the model’s safety guardrails, making the variant unintentionally harder to solve.
Benchmark creation overhead: Requires 5,000 test questions (5 variations per question) instead of 1,000 to get a statistically significant coefficient, increasing the labor of benchmark creation.
Sensitivity to prompting: A model might fail for non-logic reasons (e.g., a model with a linguistic bias finds the warehouse version more formal-sounding and generates better results).
◼ Primary Use Cases
Enterprise AI workflow deployment: When a company deploys an AI workflow like invoice summarization, the AI must work regardless of whether the invoice comes from a shipping company or a local vendor. The robustness coefficient ensures the model isn't confused by varying terminology for the same underlying data.
Scientific & technical reasoning: In fields where constants and units change frequently like Materials Science or Chemistry. The robustness coefficient ensures the model can calculate density whether the units are in g/cm^3 or kg/m^3.
Detecting model overfitting: If the model's accuracy is 95% on the original but 40% on the perturbed version, it is a clear sign that the model's performance is based on memorization.
Automated Code Generation: The robustness coefficient ensures that the model can handle the algorithms over a variety of variants like list_a or user_inventory_data for production use.
Thought-to-Output Ratio
The thought-to-output ratio measures the model’s reasoning depth by its internal thinking effort: the number of hidden tokens generated before the final answer.
◼ The Core Metric
The thought-to-output ratio is calculated by dividing the number of internal reasoning tokens by the number of tokens in the final visible answer:

R_to = T_{hidden} / T_{output}

where:
R_to: The thought-to-output ratio,
T_{hidden}: The number of tokens generated within the reasoning blocks (e.g., between <think> tags), and
T_{output}: The number of tokens in the final answer provided to the user.
Higher R_to values indicate that the model needed to handle a deeper reasoning task:
R_to > 10: Deep reasoning tasks.
R_to = 1.0: Standard explanation.
R_to < 0.1: Direct answer. Little to no internal deliberation.
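When the raw generation exposes the reasoning block (Qwen 3 wraps it in <think> tags), R_to can be computed by separating hidden tokens from the visible answer. A sketch; the tokenizer argument can be any object with an encode method:

```python
import re

def thought_to_output_ratio(generated_text, tokenizer):
    """Split a reasoning model's raw output into hidden <think> tokens
    and visible answer tokens, then compute R_to."""
    hidden_parts = re.findall(r"<think>(.*?)</think>", generated_text, flags=re.DOTALL)
    visible = re.sub(r"<think>.*?</think>", "", generated_text, flags=re.DOTALL)
    t_hidden = sum(len(tokenizer.encode(p)) for p in hidden_parts)
    t_output = len(tokenizer.encode(visible))
    return t_hidden / t_output if t_output else float("inf")
```

For standard instruct models with no hidden block, t_hidden is 0 and the ratio collapses to 0.0, matching the Llama 3.2 row below.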
For example, Qwen 3 and Llama 3.2 consume different numbers of hidden tokens to generate outputs of similar length for Task B:
| Metric | Qwen 3 (8B) | Llama 3.2 (8B) | Interpretation |
|---|---|---|---|
| Thought Tokens T_{hidden} | 840 | 0* | Qwen uses "Thinking Mode" for complex steps. |
| Output Tokens T_{output} | 120 | 115 | Both produce similar-length final answers. |
| Thought-to-Output R_to | 840 / 120 = 7.0 | 0 / 115 = 0.0 | Qwen is a heavy thinker; Llama is an instant reasoner. |
*Note: Llama 3.2 8B is a standard instruct model and does not generate hidden reasoning tokens unless using a specific CoT prompt.
Table 5. Thought-to-output ratio comparison (Created by Kuriko IWAI)
The results indicate that Qwen 3 generated 7 internal tokens for every token it actually emitted.
This hidden work is what allowed the 8B model to push deep into Task B’s reasoning chain before collapsing.
On the other hand, Llama 3.2 produced the answer instantly without consuming any hidden tokens.
◼ Advantages
Measures cognitive effort: Provides a direct metric for how hard a model is working.
Identifies self-correction: Reasoning models use hidden tokens to backtrack. Monitoring these reveals if a model caught its own mistake before providing the final output.
Inference efficiency tuning: Ensures the model thinks long enough to be right, but not too long to become prohibitively expensive or slow.
Detecting fake reasoning: Reveals if a model is actually reasoning or just using a standard CoT template. True reasoning models show dynamic thought lengths that scale with problem difficulty.
◼ Limitations
Implementation feasibility: For many commercial models, hidden tokens are not visible to end users, making the metrics hard to compute in the first place.
Ruminating loops: A model stuck in a logical loop can inflate this metric by consuming many internal thought tokens without improving the response.
Illegibility: As reasoning depth increases, some models start to optimize for their own internal logic rather than human readability, mixing languages or emitting nonsensical characters in their internal thought tokens.
◼ Primary Use Cases
Benchmarking System 2 thinking capabilities in models designed for PhD-level science, advanced coding, and law.
Cost-benefit optimization: If a 7B reasoning model has a high thought-to-output ratio but achieves the same accuracy as a 70B model, the 7B model might be more intelligent but slower and potentially more expensive per query.
Debugging logic collapse by identifying when a model hits a reasoning wall and stops attempting to solve the problem.
Wrapping Up
An evaluation of Qwen 3 and Llama 3.2 across four performance metrics provides a comprehensive assessment of their deep reasoning capabilities (Task B).
◼ Summary of Metrics for Task B
| Metric | Qwen 3 (8B) | Llama 3.2 (8B) | Interpretation |
|---|---|---|---|
| Acc@d=8 | 24% | 65% | Qwen hits its reasoning wall at d = 6. |
| Step-Wise Validity Score | 0.50 | 1.00 | Qwen makes a math error at Step 5, invalidating the chain. |
| Robustness Coeff. | 0.33 | 0.95 | Qwen relies on pattern matching; Llama uses first principles. |
| Thought-to-Output (R_to) | 7.0 | 0.0 | Qwen consumes 7 hidden tokens per output token. |
Table 6. Evaluation summary (Created by Kuriko IWAI)
The results indicate that:
Llama 3.2 is an intuitive reasoner whose reasoning is baked into its pre-trained weights (T_{hidden} = 0), which is why it can solve the problem through first-principles logic.
Qwen 3 is a System 2 deliberator that needs to talk to itself to solve complex tasks (R_to = 7.0), but cannot always sustain deeper reasoning (reasoning wall at d = 6, Acc@d=8 = 24%), although its Step 5 collapse was an arithmetic slip rather than a logical one.
Given the results, I would recommend using:
Llama 3.2 (8B) for production environments where speed and reliability are paramount. Its reasoning is stable and phrasing-independent.
Qwen 3 (8B) for research or complex multi-step prompting where the latency of hidden tokens is acceptable. For math problems, pair it with an external calculator tool to prevent simple arithmetic errors.
Share What You Learned
Kuriko IWAI, "The Reasoning Wall: A Comparative Benchmark of Llama 3.2 vs. Qwen 3" in Kernel Labs
https://kuriko-iwai.com/llm-reasoning-depth-benchmark
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.


