Highest accuracy: gpt-5-2025-08-07 — 92.5%
Performance of 31 reasoning LLMs across independently controlled intrinsic difficulty, task length, and distractor density.
The accompanying ICLR 2026 paper evaluates 22 of these models on 100 random CogniLoad puzzles per configuration (14,000 puzzles per model), with a maximum context length of 32K tokens. All puzzles are solvable within this budget; context-budget overflows are model-specific serving failures and count as errors under the paper-default scoring.
Paper-default scoring. Context-budget overflows count as failures, reflecting both reasoning quality and serving stability.
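The distinction between paper-default scoring and a mode that excludes overflows can be sketched in a few lines. This is an illustration, not the benchmark's actual harness: the record keys `"correct"` and `"overflow"` are hypothetical.

```python
# Sketch of the two scoring modes, assuming each run is a dict with
# hypothetical keys "correct" (bool) and "overflow" (bool); the field
# names are illustrative, not the benchmark's actual schema.

def accuracy(runs, mode="paper-default"):
    """Paper-default: overflows count as failures.
    'reasoning-only': overflowed runs are dropped from the denominator."""
    if mode == "reasoning-only":
        runs = [r for r in runs if not r["overflow"]]
    if not runs:
        return float("nan")
    return sum(r["correct"] for r in runs) / len(runs)

runs = [
    {"correct": True,  "overflow": False},
    {"correct": False, "overflow": True},   # overflow: no answer produced
    {"correct": True,  "overflow": False},
    {"correct": False, "overflow": False},
]
print(accuracy(runs))                      # 0.5 (overflow counted as failure)
print(accuracy(runs, "reasoning-only"))    # ~0.667 (overflow excluded)
```

The gap between the two numbers grows with a model's overflow rate, which is why the paper-default mode is described as reflecting both reasoning quality and serving stability.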
The performance curves and d × ρ accuracy heatmaps below cover a selection of the evaluated models.
| Model | Developer | Overall | Acc @ N=250 | Acc @ d=10 | ECL₅₀¹ | NT₅₀² | ID₅₀³ | Overflow |
|---|---|---|---|---|---|---|---|---|
| gpt-5-2025-08-07 | OpenAI | 92.5% | 76.0% | 82.1% | 383 | — | 14.8 | 2.3% |
| o3-2025-04-16 | OpenAI | 89.5% | 67.8% | 80.4% | 357 | — | 19.1 | 1.7% |
| gpt-5-mini-2025-08-07 | OpenAI | 79.1% | 44.3% | 67.5% | 164 | — | 11.7 | 3.0% |
| o4-mini-2025-04-16 | OpenAI | 76.5% | 38.2% | 64.6% | 132 | — | 10.9 | 1.5% |
| gemini-2.5-pro | Google | 75.1% | 20.6% | 65.4% | 153 | — | 12.7 | 22.6% |
| gpt-oss-120b-medium | OpenAI | 67.4% | 38.6% | 38.6% | 111 | — | 6.9 | 0.4% |
| DeepSeek-R1-0528 | DeepSeek | 67.0% | 26.9% | 47.5% | 105 | — | 7.5 | 4.4% |
| gemini-2.5-flash | Google | 66.1% | 14.3% | 51.8% | 111 | — | 8.6 | 27.8% |
| gpt-oss-120b-high | OpenAI | 65.9% | 23.7% | 46.4% | 102 | — | 7.2 | 23.5% |
| Qwen3-Next-80B-A3B-Thinking-FP8 | Alibaba | 62.4% | 18.3% | 45.4% | 95 | — | 7.0 | 22.6% |
| DeepSeek-R1-Distill-Llama-70B | DeepSeek | 57.6% | 26.7% | 36.1% | 70 | 36.0% | 5.1 | 0.6% |
| gpt-oss-20b-medium | OpenAI | 54.9% | 24.6% | 27.9% | 66 | 34.8% | 4.9 | 1.1% |
| Phi-4-reasoning | Microsoft | 50.9% | 23.5% | 25.6% | 52 | 23.8% | 4.1 | 12.3% |
| DeepSeek-R1-Distill-Qwen-32B | DeepSeek | 50.8% | 29.2% | 31.4% | 54 | 12.3% | 4.0 | 2.3% |
| QwQ-32B | Alibaba | 50.5% | 29.1% | 30.0% | 48 | 25.0% | 3.6 | 3.2% |
| Qwen3-32B | Alibaba | 50.3% | 26.0% | 28.0% | 54 | 23.1% | 4.1 | 0.0% |
| gpt-oss-20b-high | OpenAI | 49.1% | 7.7% | 28.2% | 55 | 16.7% | 3.8 | 39.6% |
| GLM-Z1-32B-0414 | Zhipu AI | 47.5% | 24.3% | 23.1% | 47 | 15.6% | 3.7 | 0.0% |
| Phi-4-reasoning-plus | Microsoft | 46.4% | 19.6% | 20.5% | 46 | 15.6% | 3.7 | 20.6% |
| gpt-5-nano-2025-08-07 | OpenAI | 46.2% | 24.6% | 21.4% | 36 | 10.3% | 2.9 | 0.0% |
| Qwen3-30B-A3B | Alibaba | 44.8% | 24.8% | 22.7% | 37 | 4.4% | 3.0 | 1.1% |
| Qwen3-8B | Alibaba | 41.3% | 23.0% | 21.9% | 31 | — | 2.2 | 3.6% |
| gpt-oss-120b-low | OpenAI | 40.0% | 24.3% | 21.4% | 31 | — | 1.7 | 0.0% |
| gemini-2.5-flash-lite | Google | 37.9% | 29.1% | 16.4% | 9 | 92.7% | 1.5 | 1.2% |
| EXAONE-Deep-32B | LG AI | 34.7% | 24.3% | 16.9% | 14 | — | 1.0 | 0.3% |
| gpt-oss-20b-low | OpenAI | 26.9% | 19.4% | 10.4% | 4 | — | -0.4 | 0.0% |
| Phi-4-mini-reasoning | Microsoft | 25.7% | 21.3% | 11.4% | 1 | — | -0.5 | 0.3% |
| Qwen3-4B | Alibaba | 25.4% | 19.7% | 12.1% | 2 | — | -1.7 | 2.1% |
| DeepSeek-R1-Distill-Qwen-7B | DeepSeek | 24.8% | 18.7% | 11.5% | 3 | — | -0.5 | 7.8% |
| Qwen3-1.7B | Alibaba | 20.9% | 18.3% | 11.0% | 0 | — | -4.1 | 0.0% |
| DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek | 15.8% | 14.2% | 8.4% | 0 | — | -3.9 | 22.4% |
¹ ECL₅₀ (Effective Context Length): maximum number of statements the model can process while maintaining ≥50% accuracy. Higher values indicate superior context handling. Derived from the paper's binomial GLM at mean values of d and ρ.

² NT₅₀ (Needle-to-hay Threshold): minimum proportion of relevant (needle) information required to maintain ≥50% accuracy. Lower values indicate greater robustness to distractors. A missing value means the model does not cross the 50% threshold for any ρ ∈ [0, 1] at mean d and N.

³ ID₅₀ (Intrinsic Difficulty): maximum intrinsic complexity (number of interacting entities and attributes) the model can handle while maintaining ≥50% accuracy. Negative values indicate failure to reach 50% even at the simplest configuration under mean N and ρ.
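A 50%-accuracy threshold such as ECL₅₀ can be read directly off a fitted binomial GLM with a logit link: accuracy crosses 50% exactly where the linear predictor is zero. The sketch below uses made-up coefficients and a main-effects-only model; the paper's actual GLM may include interaction terms, which would change the algebra.

```python
# Minimal sketch of deriving ECL50 from a binomial GLM with logit link.
# Coefficient values are fabricated for illustration only.

def ecl50(beta0, beta_N, beta_d, beta_rho, d_mean, rho_mean):
    # logit(p) = beta0 + beta_N*N + beta_d*d + beta_rho*rho
    # p = 0.5  <=>  logit(p) = 0, so solve the linear predictor
    # for N at the mean values of d and rho.
    return -(beta0 + beta_d * d_mean + beta_rho * rho_mean) / beta_N

# Illustrative coefficients, not fitted values:
n50 = ecl50(beta0=3.0, beta_N=-0.01, beta_d=-0.15, beta_rho=0.5,
            d_mean=5.2, rho_mean=0.5)
print(round(n50))  # statement count at which predicted accuracy hits 50%
```

NT₅₀ and ID₅₀ follow the same recipe, solving the linear predictor for ρ or d instead of N while holding the other two dimensions at their means.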
Accuracy along each load dimension for the selected models (up to 12 displayed). Each panel reports exact-match accuracy with 90% Wilson confidence intervals when available.

Figure: Intrinsic difficulty d — performance as the number of interacting entities, attributes, and conditions increases.

Figure: Task length N — performance as the number of sequential state transitions increases.

Figure: Needle-to-hay ratio ρ [%] — performance as the share of person-of-interest statements shifts from sparse to dense.
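The accuracy curves carry 90% Wilson confidence intervals. The Wilson score interval is a standard construction for a binomial proportion; a self-contained version at the 90% level:

```python
import math

def wilson_interval(successes, n, z=1.645):
    """Wilson score interval for a binomial proportion.
    z = 1.645 is the two-sided critical value for the 90% level."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

# e.g. 92 correct out of 100 puzzles in one cell:
lo, hi = wilson_interval(92, 100)
print(f"[{lo:.3f}, {hi:.3f}]")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly near 0% and 100% accuracy, which matters for the saturated and collapsed cells in the heatmaps below.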
| d \ ρ | 5% | 10% | 25% | 50% | 75% | 90% | 95% |
|---|---|---|---|---|---|---|---|
| 1 | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| 3 | 90% | 70% | 90% | 100% | 100% | 100% | 90% |
| 5 | 90% | 80% | 70% | 60% | 90% | 100% | 90% |
| 7 | 100% | 90% | 50% | 70% | 60% | 60% | 40% |
| 10 | 80% | 90% | 80% | 0% | 0% | 10% | 10% |

| d \ ρ | 5% | 10% | 25% | 50% | 75% | 90% | 95% |
|---|---|---|---|---|---|---|---|
| 1 | 100% | 90% | 80% | 80% | 90% | 90% | 100% |
| 3 | 70% | 70% | 90% | 90% | 90% | 80% | 80% |
| 5 | 60% | 80% | 70% | 70% | 70% | 80% | 56% |
| 7 | 100% | 70% | 70% | 60% | 50% | 60% | 50% |
| 10 | 80% | 60% | 40% | 30% | 0% | 10% | 10% |

| d \ ρ | 5% | 10% | 25% | 50% | 75% | 90% | 95% |
|---|---|---|---|---|---|---|---|
| 1 | 100% | 80% | 80% | 80% | 80% | 70% | 100% |
| 3 | 60% | 30% | 40% | 70% | 60% | 70% | 70% |
| 5 | 50% | 40% | 40% | 20% | 50% | 70% | 80% |
| 7 | 40% | 50% | 30% | 20% | 10% | 30% | 20% |
| 10 | 0% | 0% | 0% | 0% | 0% | 0% | 10% |
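A d × ρ heatmap cell is just per-configuration accuracy. The aggregation can be sketched in plain Python; the records below are fabricated, and real cells would each average 100 puzzles:

```python
# Sketch of aggregating per-puzzle results into a d × ρ accuracy grid.
# The (d, rho, correct) tuples are fabricated illustration data.
from collections import defaultdict

results = [
    (1, 0.05, True), (1, 0.05, True),
    (10, 0.50, False), (10, 0.50, False),
    (5, 0.25, True), (5, 0.25, False),
]

cells = defaultdict(lambda: [0, 0])      # (d, rho) -> [correct, total]
for d, rho, ok in results:
    cells[(d, rho)][0] += ok
    cells[(d, rho)][1] += 1

heatmap = {k: 100 * c / n for k, (c, n) in cells.items()}
print(heatmap)  # accuracy in % per (d, rho) cell
```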
The three load dimensions interact non-linearly. Per-model logistic GLMs with pairwise and three-way interaction terms reveal the following significant effects among the 22 paper models.
- d × N — significant in 17/22 models: difficulty and length compound beyond additive predictions.
- d × ρ — significant in 13/22 models: higher intrinsic difficulty amplifies sensitivity to distractor noise.
- N × ρ — significant in 4/22 models: length and distractor ratio interact significantly.
- d × N × ρ — significant in 12/22 models: increasing the signal share yields larger gains when tasks are simultaneously long and complex.
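The feature construction behind such an interaction analysis is a design matrix with main effects plus all pairwise and three-way products. A sketch of that matrix (the actual fitting step, e.g. a logistic regression on puzzle correctness, is omitted):

```python
# Design matrix for a logistic GLM on d, N, rho with pairwise and
# three-way interaction terms, as in the per-model analysis above.
import numpy as np

def interaction_design(d, N, rho):
    d, N, rho = (np.asarray(x, dtype=float) for x in (d, N, rho))
    cols = [np.ones_like(d), d, N, rho,   # intercept + main effects
            d * N, d * rho, N * rho,      # pairwise interactions
            d * N * rho]                  # three-way interaction
    return np.column_stack(cols)

X = interaction_design(d=[1, 10], N=[50, 250], rho=[0.05, 0.95])
print(X.shape)  # one row per puzzle, 8 columns
```

A model is counted toward, say, the d × N tally when the coefficient on the `d * N` column is statistically significant in its per-model fit.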
The evaluation pipeline distinguishes reasoning failures from context-budget exhaustion and formatting errors, enabling precise failure attribution.
At N = 250, the most common non-overflow failure is incorrect final attribution — the model loses track of sequential state updates rather than failing to parse the task.
All puzzles are solvable within the 32K-token context budget. Overflows concentrate in specific model–condition pairs at extreme N due to output verbosity.
Answer-format failures rise in smaller models under high N and d, but account for a minor share of total errors compared to reasoning failures.
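The three-way failure attribution described above amounts to a short decision cascade. This is a sketch under assumed field names (`correct`, `context_overflow`, `answer_parseable` are hypothetical, not the pipeline's actual schema):

```python
# Sketch of the failure-attribution cascade. Overflow is checked first,
# then formatting, so "reasoning failure" means a well-formed but wrong
# final answer. Field names are illustrative.

def classify_failure(run):
    if run["correct"]:
        return "success"
    if run["context_overflow"]:
        return "context-budget overflow"
    if not run["answer_parseable"]:
        return "answer-format failure"
    return "reasoning failure"

run = {"correct": False, "context_overflow": False, "answer_parseable": True}
print(classify_failure(run))  # reasoning failure
```

The ordering matters: an overflowed run never produces a final answer, so it must be attributed before the format check, and only parseable-but-wrong answers count as reasoning failures.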