Results

Performance of 31 reasoning LLMs across independently controlled intrinsic difficulty, task length, and distractor density.

The ICLR 2026 paper evaluates 22 of these models on 100 random CogniLoad puzzles per configuration (five difficulty levels × four task lengths × seven distractor ratios = 140 configurations, i.e. 14,000 puzzles per model) with a maximum context length of 32K tokens. All puzzles are solvable within this budget; context-budget overflows are model-specific serving failures and are counted as errors under the paper-default scoring.
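The experimental grid behind these numbers can be reproduced with a short sketch; the level values are the ones shown in the performance curves and heatmaps below:

```python
from itertools import product

# Load-dimension levels, as displayed in the charts on this page
difficulties = [1, 3, 5, 7, 10]                # intrinsic difficulty d
lengths = [20, 50, 100, 250]                   # task length N (state transitions)
needle_ratios = [0.05, 0.10, 0.25, 0.50,
                 0.75, 0.90, 0.95]             # needle-to-hay ratio rho

configs = list(product(difficulties, lengths, needle_ratios))
puzzles_per_model = 100 * len(configs)         # 100 random puzzles per config

print(len(configs), puzzles_per_model)         # 140 14000
```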

Controls

Filter by developer, switch scoring mode, and select models for comparison.

Paper-default scoring. Context-budget overflows count as failures, reflecting both reasoning quality and serving stability.
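The two scoring modes differ only in how overflow runs enter the denominator. A minimal sketch, assuming a hypothetical per-run schema with boolean `correct` and `overflow` fields (not the page's actual data format):

```python
def accuracy(results, overflow_is_failure=True):
    """Score a list of runs, each a dict with boolean 'correct' and
    'overflow' fields (hypothetical schema).

    Paper-default: overflow runs stay in the denominator and count as
    failures. The alternative mode drops them from scoring entirely.
    """
    if not overflow_is_failure:
        results = [r for r in results if not r["overflow"]]
    if not results:
        return float("nan")
    return sum(r["correct"] and not r["overflow"] for r in results) / len(results)

runs = [
    {"correct": True,  "overflow": False},
    {"correct": False, "overflow": True},
    {"correct": True,  "overflow": False},
    {"correct": False, "overflow": False},
]
print(accuracy(runs))                             # 0.5 under paper-default
print(accuracy(runs, overflow_is_failure=False))  # ~0.667 with overflows dropped
```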


Highest accuracy: gpt-5-2025-08-07 (92.5%)

Best ECL₅₀: gpt-5-2025-08-07 (383 statements)

Hardest ρ regime: ρ = 50% (38.3% accuracy)

Selected models: 8 active in charts

Model comparison

Select models via checkboxes to populate the performance curves and heatmaps below.

Overall, N = 250, and d = 10 are exact-match accuracies overall and at the hardest task-length and difficulty settings; Overflow is the context-budget overflow rate. A dash marks a model that never crosses the 50% threshold (see footnote 2).

| Model | Developer | Overall | N = 250 | d = 10 | ECL₅₀¹ | NT₅₀² | ID₅₀³ | Overflow |
|---|---|---|---|---|---|---|---|---|
| gpt-5-2025-08-07 | OpenAI | 92.5% | 76.0% | 82.1% | 383 | – | 14.8 | 2.3% |
| o3-2025-04-16 | OpenAI | 89.5% | 67.8% | 80.4% | 357 | – | 19.1 | 1.7% |
| gpt-5-mini-2025-08-07 | OpenAI | 79.1% | 44.3% | 67.5% | 164 | – | 11.7 | 3.0% |
| o4-mini-2025-04-16 | OpenAI | 76.5% | 38.2% | 64.6% | 132 | – | 10.9 | 1.5% |
| gemini-2.5-pro | Google | 75.1% | 20.6% | 65.4% | 153 | – | 12.7 | 22.6% |
| gpt-oss-120b-medium | OpenAI | 67.4% | 38.6% | 38.6% | 111 | – | 6.9 | 0.4% |
| DeepSeek-R1-0528 | DeepSeek | 67.0% | 26.9% | 47.5% | 105 | – | 7.5 | 4.4% |
| gemini-2.5-flash | Google | 66.1% | 14.3% | 51.8% | 111 | – | 8.6 | 27.8% |
| gpt-oss-120b-high | OpenAI | 65.9% | 23.7% | 46.4% | 102 | – | 7.2 | 23.5% |
| Qwen3-Next-80B-A3B-Thinking-FP8 | Alibaba | 62.4% | 18.3% | 45.4% | 95 | – | 7.0 | 22.6% |
| DeepSeek-R1-Distill-Llama-70B | DeepSeek | 57.6% | 26.7% | 36.1% | 70 | 36.0% | 5.1 | 0.6% |
| gpt-oss-20b-medium | OpenAI | 54.9% | 24.6% | 27.9% | 66 | 34.8% | 4.9 | 1.1% |
| Phi-4-reasoning | Microsoft | 50.9% | 23.5% | 25.6% | 52 | 23.8% | 4.1 | 12.3% |
| DeepSeek-R1-Distill-Qwen-32B | DeepSeek | 50.8% | 29.2% | 31.4% | 54 | 12.3% | 4.0 | 2.3% |
| QwQ-32B | Alibaba | 50.5% | 29.1% | 30.0% | 48 | 25.0% | 3.6 | 3.2% |
| Qwen3-32B | Alibaba | 50.3% | 26.0% | 28.0% | 54 | 23.1% | 4.1 | 0.0% |
| gpt-oss-20b-high | OpenAI | 49.1% | 7.7% | 28.2% | 55 | 16.7% | 3.8 | 39.6% |
| GLM-Z1-32B-0414 | Zhipu AI | 47.5% | 24.3% | 23.1% | 47 | 15.6% | 3.7 | 0.0% |
| Phi-4-reasoning-plus | Microsoft | 46.4% | 19.6% | 20.5% | 46 | 15.6% | 3.7 | 20.6% |
| gpt-5-nano-2025-08-07 | OpenAI | 46.2% | 24.6% | 21.4% | 36 | 10.3% | 2.9 | 0.0% |
| Qwen3-30B-A3B | Alibaba | 44.8% | 24.8% | 22.7% | 37 | 4.4% | 3.0 | 1.1% |
| Qwen3-8B | Alibaba | 41.3% | 23.0% | 21.9% | 31 | – | 2.2 | 3.6% |
| gpt-oss-120b-low | OpenAI | 40.0% | 24.3% | 21.4% | 31 | – | 1.7 | 0.0% |
| gemini-2.5-flash-lite | Google | 37.9% | 29.1% | 16.4% | 9 | 92.7% | 1.5 | 1.2% |
| EXAONE-Deep-32B | LG AI | 34.7% | 24.3% | 16.9% | 14 | – | 1.0 | 0.3% |
| gpt-oss-20b-low | OpenAI | 26.9% | 19.4% | 10.4% | 4 | – | -0.4 | 0.0% |
| Phi-4-mini-reasoning | Microsoft | 25.7% | 21.3% | 11.4% | 1 | – | -0.5 | 0.3% |
| Qwen3-4B | Alibaba | 25.4% | 19.7% | 12.1% | 2 | – | -1.7 | 2.1% |
| DeepSeek-R1-Distill-Qwen-7B | DeepSeek | 24.8% | 18.7% | 11.5% | 3 | – | -0.5 | 7.8% |
| Qwen3-1.7B | Alibaba | 20.9% | 18.3% | 11.0% | 0 | – | -4.1 | 0.0% |
| DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek | 15.8% | 14.2% | 8.4% | 0 | – | -3.9 | 22.4% |

1 ECL₅₀ (Effective Context Length): Maximum number of statements the model can process while maintaining ≥50% accuracy. Higher values indicate superior context handling. Derived from the paper's binomial GLM at mean values of d and ρ.

2 NT₅₀ (Needle-to-hay Threshold): Minimum proportion of relevant (needle) information required to maintain ≥50% accuracy. Lower values indicate greater robustness to distractors. A missing value means the model does not cross the 50% threshold for any ρ ∈ [0, 1] at mean d and N.

3 ID₅₀ (Intrinsic Difficulty): Maximum intrinsic complexity (number of interacting entities and attributes) the model can handle while maintaining ≥50% accuracy. Negative values indicate failure to reach 50% even at the simplest configuration under mean N and ρ.
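Each 50% threshold follows from the fitted GLM by solving for the point where the linear predictor crosses zero (predicted accuracy 50%) with the other two loads held at their means. A sketch for ECL₅₀ under a main-effects binomial GLM with logit link, using made-up coefficients (the paper's fitted values are not reproduced here; NT₅₀ and ID₅₀ solve the same equation for ρ or d instead of N):

```python
def ecl50(b0, b_d, b_N, b_rho, d_mean, rho_mean):
    # Solve b0 + b_d*d_mean + b_N*N + b_rho*rho_mean = 0 for N,
    # i.e. the task length at which predicted accuracy hits 50%.
    return -(b0 + b_d * d_mean + b_rho * rho_mean) / b_N

# Hypothetical coefficients: accuracy falls with d and N, rises with rho
print(ecl50(b0=2.0, b_d=-0.10, b_N=-0.01, b_rho=1.0,
            d_mean=5.2, rho_mean=0.5))   # ~198 statements
```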

Performance curves

Accuracy along each load dimension for selected models. Up to 12 models displayed.

Intrinsic difficulty d

Performance as the number of interacting entities, attributes, and conditions increases.

| Model | d = 1 | d = 3 | d = 5 | d = 7 | d = 10 |
|---|---|---|---|---|---|
| gpt-5-2025-08-07 | 99.6% | 97.5% | 93.9% | 89.3% | 82.1% |
| o3-2025-04-16 | 96.1% | 92.9% | 89.6% | 88.6% | 80.4% |
| gpt-5-mini-2025-08-07 | 92.5% | 82.9% | 78.2% | 74.3% | 67.5% |
| o4-mini-2025-04-16 | 88.2% | 81.1% | 78.6% | 69.9% | 64.6% |
| gemini-2.5-pro | 94.6% | 74.3% | 71.8% | 69.3% | 65.4% |
| gpt-oss-120b-medium | 88.2% | 83.6% | 67.1% | 59.6% | 38.6% |
| DeepSeek-R1-0528 | 87.9% | 77.5% | 66.8% | 55.4% | 47.5% |
| gemini-2.5-flash | 89.3% | 71.1% | 62.9% | 55.7% | 51.8% |

n = 280 per cell (279 for a few cells).


Exact-match accuracy with 90% Wilson confidence intervals when available.
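The 90% Wilson intervals can be reproduced from each cell's success count and sample size; a minimal sketch of the standard Wilson score formula (z = 1.645 for a two-sided 90% interval):

```python
import math

def wilson_ci(successes, n, z=1.645):
    # Wilson score interval for a binomial proportion (90% for z = 1.645)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# e.g. 230/280 correct (82.1%, the hardest-difficulty cell above)
lo, hi = wilson_ci(230, 280)   # lo ~ 0.781, hi ~ 0.856
```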


Task length N

Performance as the number of sequential state transitions increases.

| Model | N = 20 | N = 50 | N = 100 | N = 250 |
|---|---|---|---|---|
| gpt-5-2025-08-07 | 99.7% | 98.0% | 96.3% | 76.0% |
| o3-2025-04-16 | 98.6% | 97.7% | 93.7% | 67.8% |
| gpt-5-mini-2025-08-07 | 95.7% | 92.6% | 83.7% | 44.3% |
| o4-mini-2025-04-16 | 98.9% | 92.9% | 75.7% | 38.2% |
| gemini-2.5-pro | 99.4% | 98.3% | 82.0% | 20.6% |
| gpt-oss-120b-medium | 91.1% | 77.7% | 62.3% | 38.6% |
| DeepSeek-R1-0528 | 97.4% | 87.4% | 56.3% | 26.9% |
| gemini-2.5-flash | 98.9% | 90.9% | 60.6% | 14.3% |

n = 350 per cell (348 for two cells).


Distractor ratio ρ

Performance as the share of statements about the person of interest (the needle-to-hay ratio ρ) varies from sparse to dense.

| Model | ρ = 5% | 10% | 25% | 50% | 75% | 90% | 95% |
|---|---|---|---|---|---|---|---|
| gpt-5-2025-08-07 | 97.5% | 95.5% | 93.0% | 88.5% | 90.0% | 92.0% | 91.0% |
| o3-2025-04-16 | 93.0% | 92.0% | 88.5% | 89.0% | 86.5% | 89.0% | 88.4% |
| gpt-5-mini-2025-08-07 | 84.0% | 77.5% | 77.5% | 70.5% | 77.0% | 80.5% | 86.5% |
| o4-mini-2025-04-16 | 80.5% | 77.0% | 72.0% | 67.0% | 73.0% | 83.3% | 82.5% |
| gemini-2.5-pro | 73.5% | 77.5% | 75.0% | 73.0% | 74.0% | 75.5% | 77.0% |
| gpt-oss-120b-medium | 78.0% | 68.5% | 63.0% | 61.0% | 64.5% | 67.5% | 69.5% |
| DeepSeek-R1-0528 | 73.5% | 66.0% | 62.0% | 63.5% | 63.5% | 69.5% | 71.0% |
| gemini-2.5-flash | 65.5% | 66.5% | 64.0% | 63.5% | 66.5% | 67.5% | 69.5% |

n = 200 per cell (198 for two cells).


Heatmap comparison

gpt-5-2025-08-07

| d \ ρ | 5% | 10% | 25% | 50% | 75% | 90% | 95% |
|---|---|---|---|---|---|---|---|
| 1 | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| 3 | 90% | 70% | 90% | 100% | 100% | 100% | 90% |
| 5 | 90% | 80% | 70% | 60% | 90% | 100% | 90% |
| 7 | 100% | 90% | 50% | 70% | 60% | 60% | 40% |
| 10 | 80% | 90% | 80% | 0% | 0% | 10% | 10% |

o3-2025-04-16

| d \ ρ | 5% | 10% | 25% | 50% | 75% | 90% | 95% |
|---|---|---|---|---|---|---|---|
| 1 | 100% | 90% | 80% | 80% | 90% | 90% | 100% |
| 3 | 70% | 70% | 90% | 90% | 90% | 80% | 80% |
| 5 | 60% | 80% | 70% | 70% | 70% | 80% | 56% |
| 7 | 100% | 70% | 70% | 60% | 50% | 60% | 50% |
| 10 | 80% | 60% | 40% | 30% | 0% | 10% | 10% |

gpt-5-mini-2025-08-07

| d \ ρ | 5% | 10% | 25% | 50% | 75% | 90% | 95% |
|---|---|---|---|---|---|---|---|
| 1 | 100% | 80% | 80% | 80% | 80% | 70% | 100% |
| 3 | 60% | 30% | 40% | 70% | 60% | 70% | 70% |
| 5 | 50% | 40% | 40% | 20% | 50% | 70% | 80% |
| 7 | 40% | 50% | 30% | 20% | 10% | 30% | 20% |
| 10 | 0% | 0% | 0% | 0% | 0% | 0% | 10% |

Load interactions

The three load dimensions interact non-linearly. Per-model logistic GLMs with pairwise and three-way interaction terms reveal the following significant effects among the 22 paper models.

d × N (significant in 17 / 22 models): difficulty and length compound beyond additive predictions.

d × ρ (significant in 13 / 22 models): higher intrinsic difficulty amplifies sensitivity to distractor noise.

N × ρ (significant in 4 / 22 models): length and distractor ratio interact significantly in only a minority of models.

d × N × ρ (significant in 12 / 22 models): increasing signal share yields larger gains when tasks are simultaneously long and complex.
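The per-model GLMs expand the three loads into main effects plus all interaction products, i.e. the design implied by the formula `correct ~ d * N * rho` in R/statsmodels notation. One row of that design matrix can be sketched as (any standardization of the covariates is omitted here):

```python
def design_row(d, N, rho):
    # One row of the logistic-GLM design matrix for correct ~ d * N * rho:
    # intercept, main effects, pairwise interactions, three-way interaction.
    return [1.0, d, N, rho,
            d * N, d * rho, N * rho,
            d * N * rho]

row = design_row(10, 250, 0.5)   # hardest d and N at balanced rho
```

A count such as "17 / 22" then reports how many per-model fits have a statistically significant coefficient on the corresponding interaction column.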

Failure modes

The evaluation pipeline distinguishes reasoning failures from context-budget exhaustion and formatting errors, enabling precise failure attribution.
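Such attribution can be sketched as a small classifier over each run. The answer format and function names below are hypothetical illustrations, not the paper's actual parser:

```python
import re

def extract_answer(response: str):
    # Hypothetical parser: expects a final "Answer: <value>" line
    m = re.search(r"Answer:\s*(\w+)", response)
    return m.group(1) if m else None

def classify(response: str, gold: str, hit_context_limit: bool) -> str:
    if hit_context_limit:
        return "context_overflow"      # serving/budget failure
    answer = extract_answer(response)
    if answer is None:
        return "format_error"          # formatting drift
    if answer == gold:
        return "correct"
    return "reasoning_error"           # e.g. lost state at high N
```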

State-tracking errors dominate

At N = 250, the most common non-overflow failure is incorrect final attribution — the model loses track of sequential state updates rather than failing to parse the task.

Context-budget overflows are model-specific

All puzzles are solvable within the 32K-token context budget. Overflows concentrate in specific model–condition pairs at extreme N due to output verbosity.

Formatting drift is secondary

Answer-format failures rise in smaller models under high N and d, but account for a minor share of total errors compared to reasoning failures.