Benchmark grid
- Configurations per model: 140
- Difficulty levels (d): 1, 3, 5, 7, 10
- Task lengths (N): 20, 50, 100, 250
- Distractor ratios (ρ): 5%, 10%, 25%, 50%, 75%, 90%, 95%
A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density
Understanding long-context reasoning performance through principled, controllable dimensions inspired by Cognitive Load Theory.
Existing long-context benchmarks conflate context length with reasoning depth and distractor noise, making it hard to tell why a model fails. CogniLoad fixes this by generating natural-language logic puzzles where each cognitive demand is controlled independently.
Each puzzle presents a set of people with mutable attributes, a sequence of conditional update statements, and a final query. The model must track state changes across all updates and answer correctly.
CLT distinguishes three demands on working memory: intrinsic load (task complexity), extraneous load (irrelevant noise), and germane load (sustained schema maintenance). CogniLoad maps each to an independent benchmark parameter, enabling precise failure diagnosis rather than a single blended "difficulty" score.
CogniLoad is designed as a diagnostic benchmark, not just a hard one. The paper's main contribution is that it combines CLT-grounded dimensional control with randomized, reproducible generation, so evaluations can separate reasoning limits from benchmark artifacts.
d, N, and ρ are controlled separately.
CogniLoad does not collapse reasoning depth, context length, and distractor density into one blended difficulty score. Each axis can be varied while the others stay fixed, making failure attribution much more precise.
Instances are generated automatically at evaluation time.
Puzzles are sampled from an algorithmic generator over people, attributes, rules, and queries rather than drawn from a small static set. That makes contamination and test-set memorization far less useful.
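The generation idea can be sketched in a few lines. This is a hypothetical simplification, not the official generator or its API: it samples people and attributes, applies updates strictly in sequence, and records the ground-truth answer, with ρ taken as the share of statements about the person of interest (per the parameter description below).

```python
import random

def generate_puzzle(d=3, n_updates=20, rho=0.5, seed=0):
    """Sketch of a CogniLoad-style generator (hypothetical, simplified):
    sample people with mutable attributes, emit sequential update
    statements, and derive the ground-truth answer from the final state."""
    rng = random.Random(seed)                      # seeded => reproducible
    people = [f"Person{i}" for i in range(d + 1)]  # more entities at higher d
    attrs = {"socks": ["red", "blue", "green"],
             "hat": ["tall", "flat", "round"]}
    # Initial state: every person gets a value for every attribute.
    state = {p: {a: rng.choice(vals) for a, vals in attrs.items()} for p in people}
    target = people[0]
    statements = []
    for _ in range(n_updates):
        # With probability rho the statement concerns the person of interest;
        # otherwise it is a distractor about someone else.
        person = target if rng.random() < rho else rng.choice(people[1:])
        attr = rng.choice(list(attrs))
        value = rng.choice(attrs[attr])
        state[person][attr] = value                # apply strictly in order
        statements.append(f"{person} now has {value} {attr}.")
    query_attr = rng.choice(list(attrs))
    question = f"What {query_attr} does {target} have at the end?"
    return statements, question, state[target][query_attr]
```

Because the generator is seeded and fully deterministic given its parameters, the same configuration can be regenerated exactly for auditing.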
No LLM is used to write benchmark items.
Logical statements are rendered with deterministic templates and scored with a fixed evaluation pipeline. The result is a benchmark that is easier to reproduce and audit end to end.
Hay statements are validated not to alter the target person.
Generation constraints ensure distractor updates remain extraneous to the person of interest while preserving sequential reasoning pressure. Models still need to filter noise, but the noise is not accidentally changing the answer.
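The simplest form of that constraint is a rejection check at generation time. A minimal sketch (hypothetical helper, not the benchmark's actual validator): a candidate hay statement is rejected if it references the person of interest at all, so it can never change the answer.

```python
def is_valid_hay(statement: str, target: str) -> bool:
    """Reject any distractor ("hay") statement that mentions the person of
    interest, so noise can never alter the queried final state.
    Hypothetical sketch; the real validator also has to handle conditional
    statements whose *condition* reads the target's state."""
    return target not in statement.split()
```

For example, `is_valid_hay("Alice now wears green socks.", "Brent")` passes, while any statement naming Brent is rejected and resampled.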
Each parameter targets a specific reasoning demand. They vary independently, so you can test models against one axis while holding the others constant.
Controlled by difficulty d.
How many entities, attributes, and rules interact. Higher d means more complex reasoning per step.
Controlled by distractor ratio ρ.
The proportion of statements about the person of interest relative to distractor updates. Lower ρ means more noise to ignore.
Controlled by task length N.
How many sequential updates the model must track. Longer tasks demand sustained schema maintenance.
Models receive an initial state, a sequence of conditional updates, and a query about one person's final attribute value.
Solve this logic puzzle. Finalize your response with a single sentence about the asked property. Reason through the statements in strictly sequential order.
What color of socks is Brent wearing?
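The required behavior is strictly sequential state tracking. A reference solver is a few lines of Python, sketched here against a simplified statement template (the benchmark's actual templates are richer and conditional):

```python
import re

def solve(initial_state: dict, statements: list[str],
          person: str, attribute: str) -> str:
    """Apply updates strictly in order, then read off the queried attribute.
    Assumes the simplified pattern '<Name> now wears <value> <attribute>.'"""
    state = {p: dict(attrs) for p, attrs in initial_state.items()}
    pattern = re.compile(r"^(\w+) now wears (\w+) (\w+)\.$")
    for s in statements:
        m = pattern.match(s)
        if m:
            name, value, attr = m.groups()
            state[name][attr] = value  # later statements overwrite earlier ones
    return state[person][attribute]

init = {"Brent": {"socks": "red"}, "Alice": {"socks": "blue"}}
updates = ["Alice now wears green socks.", "Brent now wears blue socks."]
solve(init, updates, "Brent", "socks")  # → "blue"
```

The solver is trivial precisely because the difficulty for an LLM lies in maintaining this state implicitly across hundreds of natural-language statements, not in the logic itself.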
d
Controls the number of interacting people, attribute categories, value domains, and logical conditions in each puzzle.
N
Sets the number of sequential state updates that must be tracked without changing per-step difficulty or distractor ratio.
ρ
Adjusts the proportion of person-of-interest statements relative to distractors, modulating how much irrelevant structure surrounds the signal.
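Crossing the three axes reproduces the evaluation grid quoted at the top of this page; a quick check with the values listed there:

```python
from itertools import product

# Parameter values as listed on this page.
d_levels = [1, 3, 5, 7, 10]
task_lengths = [20, 50, 100, 250]
distractor_ratios = [0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95]

grid = list(product(d_levels, task_lengths, distractor_ratios))
print(len(grid))  # 5 × 4 × 7 = 140 configurations per model
```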
Aggregated behavior across 22 reasoning models reveals three consistent performance patterns.
Average accuracy falls from 66% at N = 20 to 24% at N = 250.
Accuracy dips lowest at ρ = 50%, producing the characteristic U-shaped curve.
ECL₅₀, NT₅₀, and ID₅₀ distill raw performance curves into compact summaries of each model's limits on length, distractors, and complexity.
Task length serves as the dominant performance bottleneck. The following curves show the exact-match accuracy for the top 5 models by ECL₅₀.
Chart: exact-match accuracy vs. task length N, with 90% Wilson confidence intervals where available.
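The Wilson score interval behind those error bars is straightforward to compute; here is the standard formula (z ≈ 1.645 for a two-sided 90% interval):

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.645):
    """Wilson score interval for a binomial proportion.
    z = 1.645 gives the two-sided 90% interval used for accuracy curves."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - half, center + half)
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly at accuracies near 0% or 100%, which matters at the extremes of the benchmark grid.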
ECL₅₀ is the estimated task length N at which a model's accuracy falls to 50%. Higher is better.
| Rank | Model | ECL₅₀ |
|---|---|---|
| 1 | gpt-5-2025-08-07 | 383 |
| 2 | o3-2025-04-16 | 357 |
| 3 | gpt-5-mini-2025-08-07 | 164 |
| 4 | gemini-2.5-pro | 153 |
| 5 | o4-mini-2025-04-16 | 132 |
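A 50%-threshold like ECL₅₀ can be estimated from a measured accuracy curve by locating where it crosses 0.5. A minimal sketch using linear interpolation between the bracketing grid points (the paper may use a different fitting procedure):

```python
def threshold_50(lengths, accuracies):
    """Interpolate the task length at which accuracy first drops through 50%.
    Sketch only: linear interpolation between adjacent measured points."""
    points = list(zip(lengths, accuracies))
    for (n0, a0), (n1, a1) in zip(points, points[1:]):
        if a0 >= 0.5 > a1:
            # Linear interpolation between the two bracketing points.
            return n0 + (a0 - 0.5) * (n1 - n0) / (a0 - a1)
    return None  # curve never crosses 50% within the measured range

threshold_50([20, 50, 100, 250], [0.9, 0.7, 0.4, 0.2])  # → ~83.3
```

The same construction applies along the other axes, yielding the NT₅₀ (distractor) and ID₅₀ (intrinsic difficulty) thresholds.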
Charts, thresholds, and heatmap comparisons.
The full ICLR 2026 paper.
Generation, evaluation, and analysis.
Benchmark data on Hugging Face.
Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, and Benjamin Ricaud. CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density. ICLR 2026.
@inproceedings{kaiser2026cogniload,
title={CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density},
author={Daniel Kaiser and Arnoldo Frigessi and Ali Ramezani-Kebrya and Benjamin Ricaud},
booktitle={International Conference on Learning Representations},
year={2026}
}