CogniLoad

A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density

Understanding long-context reasoning performance through principled, controllable dimensions inspired by Cognitive Load Theory.

ICLR 2026  ·  31 models  ·  140 configurations

Benchmark grid

  • Configurations per model: 140
  • Difficulty levels (d): 1, 3, 5, 7, 10
  • Task lengths (N): 20, 50, 100, 250
  • Distractor ratios (ρ): 5%, 10%, 25%, 50%, 75%, 90%, 95%

What is CogniLoad?

Existing long-context benchmarks conflate context length with reasoning depth and distractor noise, making it hard to tell why a model fails. CogniLoad fixes this by generating natural-language logic puzzles where each cognitive demand is controlled independently.

Each puzzle presents a set of people with mutable attributes, a sequence of conditional update statements, and a final query. The model must track state changes across all updates and answer correctly.

Why Cognitive Load Theory?

CLT distinguishes three demands on working memory: intrinsic load (task complexity), extraneous load (irrelevant noise), and germane load (sustained schema maintenance). CogniLoad maps each to an independent benchmark parameter, enabling precise failure diagnosis rather than a single blended "difficulty" score.

What makes CogniLoad special?

CogniLoad is designed as a diagnostic benchmark, not just a hard one. The paper's main contribution is that it combines CLT-grounded dimensional control with randomized, reproducible generation, so evaluations can separate reasoning limits from benchmark artifacts.

Three genuinely independent axes

d, N, and ρ are controlled separately.

CogniLoad does not collapse reasoning depth, context length, and distractor density into one blended difficulty score. Each axis can be varied while the others stay fixed, making failure attribution much more precise.

Randomized generation counters leakage

Instances are generated automatically at evaluation time.

Puzzles are sampled from an algorithmic generator over people, attributes, rules, and queries rather than drawn from a small static set. That makes contamination and test-set memorization far less useful.
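Seed-driven sampling is one plausible way to get both properties at once; the sketch below is illustrative (the names, attributes, and function are hypothetical, not the benchmark's actual generator), but it shows how a fixed seed makes a randomly sampled instance exactly reproducible.

```python
import random

# Hypothetical sketch of seed-driven instance sampling: each seed
# deterministically yields one initial state, so instances are fresh
# per evaluation yet fully reproducible. Names/attributes are made up.
NAMES = ["Brent", "Anthony", "Clara", "Omar"]
ATTRIBUTES = {"socks": ["green", "purple", "yellow"],
              "gloves": ["purple", "yellow", "red"]}

def sample_initial_state(num_people, seed):
    rng = random.Random(seed)  # same seed -> same puzzle instance
    people = rng.sample(NAMES, num_people)
    return {p: {attr: rng.choice(vals)
                for attr, vals in ATTRIBUTES.items()}
            for p in people}

state_a = sample_initial_state(2, seed=42)
state_b = sample_initial_state(2, seed=42)
print(state_a == state_b)  # → True: reproducible despite random sampling
```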

Deterministic and reproducible

No LLM is used to write benchmark items.

Logical statements are rendered with deterministic templates and scored with a fixed evaluation pipeline. The result is a benchmark that is easier to reproduce and audit end to end.
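Template rendering of this kind might look like the following sketch; the rule tuple and template string are simplified assumptions, not the benchmark's actual format, but they illustrate why no LLM is needed to produce fluent items.

```python
# Sketch of deterministic template rendering, assuming a simplified rule
# structure (condition attribute/value -> new attribute/value); the real
# templates are richer, but equally deterministic.
TEMPLATE = "The people wearing {cond_value} {cond_attr} put on {new_value} {new_attr}."

def render(rule):
    cond_attr, cond_value, new_attr, new_value = rule
    return TEMPLATE.format(cond_attr=cond_attr, cond_value=cond_value,
                           new_attr=new_attr, new_value=new_value)

print(render(("socks", "green", "gloves", "yellow")))
# → The people wearing green socks put on yellow gloves.
```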

Distractors are truly distractors

Hay statements are validated not to alter the target person.

Generation constraints ensure distractor updates remain extraneous to the person of interest while preserving sequential reasoning pressure. Models still need to filter noise, but the noise is not accidentally changing the answer.
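A validation step like this can be sketched as a simulate-and-compare check; the function below is a hypothetical illustration of the idea (the benchmark's real rule format is richer), accepting a candidate hay update only if applying it leaves the person of interest untouched.

```python
import copy

def is_valid_distractor(people, target, update):
    """Hypothetical validator: accept a candidate hay update only if
    applying it leaves the target person's state unchanged. `update` is
    a simplified (condition, attribute, value) triple."""
    cond, attr, value = update
    trial = copy.deepcopy(people)
    for state in trial.values():
        if cond(state):
            state[attr] = value
    return trial[target] == people[target]

people = {
    "Brent":   {"socks": "green"},
    "Anthony": {"socks": "purple"},
}

# Only touches Anthony: a valid distractor when Brent is the target.
hay = (lambda p: p["socks"] == "purple", "socks", "red")
# Would recolor Brent's socks: rejected.
bad = (lambda p: p["socks"] == "green", "socks", "red")

print(is_valid_distractor(people, "Brent", hay))  # → True
print(is_valid_distractor(people, "Brent", bad))  # → False
```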

Three independent load dimensions

Each parameter targets a specific reasoning demand. They vary independently, so you can test models against one axis while holding the others constant.

Intrinsic load

Controlled by difficulty d.

How many entities, attributes, and rules interact. Higher d means more complex reasoning per step.

Extraneous load

Controlled by distractor ratio ρ.

The share of irrelevant updates the model must filter out. Higher ρ means more noise to ignore.

Germane load proxy

Controlled by task length N.

How many sequential updates the model must track. Longer tasks demand sustained schema maintenance.
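Because the three axes vary independently, the full evaluation grid is simply their Cartesian product. A minimal sketch (the class and constant names are illustrative, not from the benchmark's codebase) reproduces the 140 configurations per model stated above:

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class PuzzleConfig:          # illustrative name, not CogniLoad's own API
    d: int                   # intrinsic difficulty
    n: int                   # task length: number of sequential updates
    rho: float               # distractor ratio

DIFFICULTIES = (1, 3, 5, 7, 10)
LENGTHS = (20, 50, 100, 250)
RATIOS = (0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95)

# 5 difficulties x 4 lengths x 7 ratios = 140 configurations per model.
GRID = [PuzzleConfig(d, n, r)
        for d, n, r in product(DIFFICULTIES, LENGTHS, RATIOS)]
print(len(GRID))  # → 140
```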

How a puzzle works

Models receive an initial state, a sequence of conditional updates, and a query about one person's final attribute value.

Instruction

Solve this logic puzzle. Finalize your response with a single sentence about the asked property. Reason through the statements in strictly sequential order.

Initial state

  • Brent is wearing green socks and purple gloves and last listened to classical music.
  • Anthony is wearing purple socks and yellow gloves and last listened to disco music.
  • Other people each start with their own attribute values.

Update statements

  1. The people wearing green socks listen to electronic music.
  2. The people who last listened to classical music and wear purple gloves put on yellow gloves.
  3. Further updates are applied strictly in order.

Query

What color of socks is Brent wearing?
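The state tracking this worked example demands can be sketched directly; the update encoding below is a simplification of the benchmark's templates, but it follows the example's two rules and shows why strict sequential order matters: update 2 no longer fires for Brent because update 1 already changed his music.

```python
# Minimal sketch of the sequential state tracking a CogniLoad puzzle
# demands, mirroring the worked example above.
people = {
    "Brent":   {"socks": "green",  "gloves": "purple", "music": "classical"},
    "Anthony": {"socks": "purple", "gloves": "yellow", "music": "disco"},
}

# Each update: (condition on current state, attribute to set, new value).
updates = [
    (lambda p: p["socks"] == "green", "music", "electronic"),
    (lambda p: p["music"] == "classical" and p["gloves"] == "purple",
     "gloves", "yellow"),
]

# Apply strictly in order: each condition is evaluated against the state
# *after* all previous updates, so update 2 does not fire for Brent
# (his music became electronic in update 1) and his gloves stay purple.
for cond, attr, value in updates:
    for state in people.values():
        if cond(state):
            state[attr] = value

print(people["Brent"]["socks"])  # → green
```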

What the benchmark measures

d

Intrinsic Difficulty

Controls the number of interacting people, attribute categories, value domains, and logical conditions in each puzzle.

N

Task Length

Sets the number of sequential state updates that must be tracked without changing per-step difficulty or distractor ratio.

ρ

Distractor Ratio

Adjusts the proportion of distractor statements relative to person-of-interest statements, modulating how much irrelevant structure surrounds the signal.

Key findings

Aggregated behavior across 22 reasoning models reveals three consistent performance patterns.

Task length is the strongest bottleneck

Average accuracy falls from 66% at N = 20 to 24% at N = 250.

Mid-range distractors are hardest

Accuracy dips lowest at ρ = 50%, producing the characteristic U-shaped curve.

Capacity thresholds fingerprint each model

ECL₅₀, NT₅₀, and ID₅₀ distill raw performance curves into compact summaries of each model's limits on length, distractors, and complexity.

Performance degradation over task length

Task length is the dominant performance bottleneck. The results below show exact-match accuracy for the top five models by ECL₅₀.

Exact-match accuracy by task length N:

  Model                   N=20    N=50    N=100   N=250
  gpt-5-2025-08-07        99.7%   98.0%   96.3%   76.0%
  o3-2025-04-16           98.6%   97.7%   93.7%   67.8%
  gpt-5-mini-2025-08-07   95.7%   92.6%   83.7%   44.3%
  gemini-2.5-pro          99.4%   98.3%   82.0%   20.6%
  o4-mini-2025-04-16      98.9%   92.9%   75.7%   38.2%

(n = 350 per cell; n = 348 for o3 and o4-mini at N = 250.)


Exact-match accuracy with 90% Wilson confidence intervals when available.

Models shown: OpenAI (gpt-5-2025-08-07, o3-2025-04-16, gpt-5-mini-2025-08-07, o4-mini-2025-04-16) and Google (gemini-2.5-pro).

Top models by context capacity

ECL₅₀ estimates the task length N at which a model's accuracy falls to 50%. Higher is better.

1. gpt-5-2025-08-07 383
2. o3-2025-04-16 357
3. gpt-5-mini-2025-08-07 164
4. gemini-2.5-pro 153
5. o4-mini-2025-04-16 132
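One plausible way to extract such a threshold, sketched below under the assumption of linear interpolation in log N (the paper's actual ECL₅₀ fit may differ, e.g. a parametric curve, and values above the measured range such as 383 require extrapolation that this simple version does not attempt):

```python
import math

def ecl50(lengths, accuracies, threshold=0.5):
    """Hypothetical sketch: estimate the task length at which accuracy
    crosses `threshold`, via linear interpolation in log10(N)."""
    points = list(zip(lengths, accuracies))
    for (n0, a0), (n1, a1) in zip(points, points[1:]):
        if a0 >= threshold > a1:
            frac = (a0 - threshold) / (a0 - a1)
            log_n = math.log10(n0) + frac * (math.log10(n1) - math.log10(n0))
            return 10 ** log_n
    return None  # curve never crosses the threshold in the measured range

# gemini-2.5-pro accuracies from the task-length results above:
# the 50% crossing lands between N = 100 and N = 250.
est = ecl50([20, 50, 100, 250], [0.994, 0.983, 0.820, 0.206])
print(round(est))  # ≈ 161
```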

Citation

Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, and Benjamin Ricaud. CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density. ICLR 2026.

@inproceedings{kaiser2026cogniload,
  title={CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density},
  author={Daniel Kaiser and Arnoldo Frigessi and Ali Ramezani-Kebrya and Benjamin Ricaud},
  booktitle={International Conference on Learning Representations},
  year={2026}
}

Authors

Daniel Kaiser · Arnoldo Frigessi · Ali Ramezani-Kebrya · Benjamin Ricaud

  • Integreat - Norwegian Centre for Knowledge-driven Machine Learning
  • UiT The Arctic University of Norway
  • University of Oslo