Highest accuracy: gpt-5-2025-08-07 — 92.5%
Performance of 31 reasoning LLMs across independently controlled intrinsic difficulty, task length, and distractor density.
The accompanying ICLR 2026 paper evaluates 22 of these models on 100 random CogniLoad puzzles per configuration (14,000 puzzles per model), with a maximum context length of 32K tokens. All puzzles are solvable within this budget; context-budget overflows are model-specific serving failures and count as errors under the paper-default scoring.
Paper-default scoring. Context-budget overflows count as failures, reflecting both reasoning quality and serving stability.
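The distinction between paper-default scoring and a mode that excludes overflows can be sketched in a few lines. This is an illustration, not the benchmark's actual harness: the record keys `"correct"` and `"overflow"` are hypothetical.

```python
# Sketch of the two scoring modes, assuming each run is a dict with
# hypothetical keys "correct" (bool) and "overflow" (bool); the field
# names are illustrative, not the benchmark's actual schema.

def accuracy(runs, mode="paper-default"):
    """Paper-default: overflows count as failures.
    'reasoning-only': overflowed runs are dropped from the denominator."""
    if mode == "reasoning-only":
        runs = [r for r in runs if not r["overflow"]]
    if not runs:
        return float("nan")
    return sum(r["correct"] for r in runs) / len(runs)

runs = [
    {"correct": True,  "overflow": False},
    {"correct": False, "overflow": True},   # overflow: no answer produced
    {"correct": True,  "overflow": False},
    {"correct": False, "overflow": False},
]
print(accuracy(runs))                      # 0.5 (overflow counted as failure)
print(accuracy(runs, "reasoning-only"))    # ~0.667 (overflow excluded)
```

The gap between the two numbers grows with a model's overflow rate, which is why the paper-default mode is described as reflecting both reasoning quality and serving stability.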
The performance curves and d × ρ accuracy heatmaps below cover a selection of the evaluated models.
| Model | Developer | Overall | Acc @ N=250 | Acc @ d=10 | ECL₅₀¹ | NT₅₀² | ID₅₀³ | Overflow |
|---|---|---|---|---|---|---|---|---|
| gpt-5-2025-08-07 | OpenAI | 92.5% | 76.0% | 82.1% | 383 | — | 14.8 | 2.3% |
| o3-2025-04-16 | OpenAI | 89.5% | 67.8% | 80.4% | 357 | — | 19.1 | 1.7% |
| gpt-5-mini-2025-08-07 | OpenAI | 79.1% | 44.3% | 67.5% | 164 | — | 11.7 | 3.0% |
| o4-mini-2025-04-16 | OpenAI | 76.5% | 38.2% | 64.6% | 132 | — | 10.9 | 1.5% |
| gemini-2.5-pro | Google | 75.1% | 20.6% | 65.4% | 153 | — | 12.7 | 22.6% |
| gpt-oss-120b-medium | OpenAI | 67.4% | 38.6% | 38.6% | 111 | — | 6.9 | 0.4% |
| DeepSeek-R1-0528 | DeepSeek | 67.0% | 26.9% | 47.5% | 105 | — | 7.5 | 4.4% |
| gemini-2.5-flash | Google | 66.1% | 14.3% | 51.8% | 111 | — | 8.6 | 27.8% |
| gpt-oss-120b-high | OpenAI | 65.9% | 23.7% | 46.4% | 102 | — | 7.2 | 23.5% |
| Qwen3-Next-80B-A3B-Thinking-FP8 | Alibaba | 62.4% | 18.3% | 45.4% | 95 | — | 7.0 | 22.6% |
| DeepSeek-R1-Distill-Llama-70B | DeepSeek | 57.6% | 26.7% | 36.1% | 70 | 36.0% | 5.1 | 0.6% |
| gpt-oss-20b-medium | OpenAI | 54.9% | 24.6% | 27.9% | 66 | 34.8% | 4.9 | 1.1% |
| Phi-4-reasoning | Microsoft | 50.9% | 23.5% | 25.6% | 52 | 23.8% | 4.1 | 12.3% |
| DeepSeek-R1-Distill-Qwen-32B | DeepSeek | 50.8% | 29.2% | 31.4% | 54 | 12.3% | 4.0 | 2.3% |
| QwQ-32B | Alibaba | 50.5% | 29.1% | 30.0% | 48 | 25.0% | 3.6 | 3.2% |
| Qwen3-32B | Alibaba | 50.3% | 26.0% | 28.0% | 54 | 23.1% | 4.1 | 0.0% |
| gpt-oss-20b-high | OpenAI | 49.1% | 7.7% | 28.2% | 55 | 16.7% | 3.8 | 39.6% |
| GLM-Z1-32B-0414 | Zhipu AI | 47.5% | 24.3% | 23.1% | 47 | 15.6% | 3.7 | 0.0% |
| Phi-4-reasoning-plus | Microsoft | 46.4% | 19.6% | 20.5% | 46 | 15.6% | 3.7 | 20.6% |
| gpt-5-nano-2025-08-07 | OpenAI | 46.2% | 24.6% | 21.4% | 36 | 10.3% | 2.9 | 0.0% |
| Qwen3-30B-A3B | Alibaba | 44.8% | 24.8% | 22.7% | 37 | 4.4% | 3.0 | 1.1% |
| Qwen3-8B | Alibaba | 41.3% | 23.0% | 21.9% | 31 | — | 2.2 | 3.6% |
| gpt-oss-120b-low | OpenAI | 40.0% | 24.3% | 21.4% | 31 | — | 1.7 | 0.0% |
| gemini-2.5-flash-lite | Google | 37.9% | 29.1% | 16.4% | 9 | 92.7% | 1.5 | 1.2% |
| EXAONE-Deep-32B | LG AI | 34.7% | 24.3% | 16.9% | 14 | — | 1.0 | 0.3% |
| gpt-oss-20b-low | OpenAI | 26.9% | 19.4% | 10.4% | 4 | — | -0.4 | 0.0% |
| Phi-4-mini-reasoning | Microsoft | 25.7% | 21.3% | 11.4% | 1 | — | -0.5 | 0.3% |
| Qwen3-4B | Alibaba | 25.4% | 19.7% | 12.1% | 2 | — | -1.7 | 2.1% |
| DeepSeek-R1-Distill-Qwen-7B | DeepSeek | 24.8% | 18.7% | 11.5% | 3 | — | -0.5 | 7.8% |
| Qwen3-1.7B | Alibaba | 20.9% | 18.3% | 11.0% | 0 | — | -4.1 | 0.0% |
| DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek | 15.8% | 14.2% | 8.4% | 0 | — | -3.9 | 22.4% |
¹ ECL₅₀ (Effective Context Length): maximum number of statements the model can process while maintaining ≥50% accuracy. Higher values indicate superior context handling. Derived from the paper's binomial GLM at mean values of d and ρ.

² NT₅₀ (Needle-to-hay Threshold): minimum proportion of relevant (needle) information required to maintain ≥50% accuracy. Lower values indicate greater robustness to distractors. A missing value means the model does not cross the 50% threshold for any ρ ∈ [0, 1] at mean d and N.

³ ID₅₀ (Intrinsic Difficulty): maximum intrinsic complexity (number of interacting entities and attributes) the model can handle while maintaining ≥50% accuracy. Negative values indicate failure to reach 50% even at the simplest configuration under mean N and ρ.
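A 50%-accuracy threshold such as ECL₅₀ can be read directly off a fitted binomial GLM with a logit link: accuracy crosses 50% exactly where the linear predictor is zero. The sketch below uses made-up coefficients and a main-effects-only model; the paper's actual GLM may include interaction terms, which would change the algebra.

```python
# Minimal sketch of deriving ECL50 from a binomial GLM with logit link.
# Coefficient values are fabricated for illustration only.

def ecl50(beta0, beta_N, beta_d, beta_rho, d_mean, rho_mean):
    # logit(p) = beta0 + beta_N*N + beta_d*d + beta_rho*rho
    # p = 0.5  <=>  logit(p) = 0, so solve the linear predictor
    # for N at the mean values of d and rho.
    return -(beta0 + beta_d * d_mean + beta_rho * rho_mean) / beta_N

# Illustrative coefficients, not fitted values:
n50 = ecl50(beta0=3.0, beta_N=-0.01, beta_d=-0.15, beta_rho=0.5,
            d_mean=5.2, rho_mean=0.5)
print(round(n50))  # statement count at which predicted accuracy hits 50%
```

NT₅₀ and ID₅₀ follow the same recipe, solving the linear predictor for ρ or d instead of N while holding the other two dimensions at their means.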
Accuracy along each load dimension for the selected models (up to 12 displayed). Each panel reports exact-match accuracy with 90% Wilson confidence intervals when available.

Figure: Intrinsic difficulty d — performance as the number of interacting entities, attributes, and conditions increases.

Figure: Task length N — performance as the number of sequential state transitions increases.

Figure: Needle-to-hay ratio ρ [%] — performance as the share of person-of-interest statements shifts from sparse to dense.
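The accuracy curves carry 90% Wilson confidence intervals. The Wilson score interval is a standard construction for a binomial proportion; a self-contained version at the 90% level:

```python
import math

def wilson_interval(successes, n, z=1.645):
    """Wilson score interval for a binomial proportion.
    z = 1.645 is the two-sided critical value for the 90% level."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

# e.g. 92 correct out of 100 puzzles in one cell:
lo, hi = wilson_interval(92, 100)
print(f"[{lo:.3f}, {hi:.3f}]")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly near 0% and 100% accuracy, which matters for the saturated and collapsed cells in the heatmaps below.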
| d \ ρ | 5% | 10% | 25% | 50% | 75% | 90% | 95% |
|---|---|---|---|---|---|---|---|
| 1 | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| 3 | 90% | 70% | 90% | 100% | 100% | 100% | 90% |
| 5 | 90% | 80% | 70% | 60% | 90% | 100% | 90% |
| 7 | 100% | 90% | 50% | 70% | 60% | 60% | 40% |
| 10 | 80% | 90% | 80% | 0% | 0% | 10% | 10% |

| d \ ρ | 5% | 10% | 25% | 50% | 75% | 90% | 95% |
|---|---|---|---|---|---|---|---|
| 1 | 100% | 90% | 80% | 80% | 90% | 90% | 100% |
| 3 | 70% | 70% | 90% | 90% | 90% | 80% | 80% |
| 5 | 60% | 80% | 70% | 70% | 70% | 80% | 56% |
| 7 | 100% | 70% | 70% | 60% | 50% | 60% | 50% |
| 10 | 80% | 60% | 40% | 30% | 0% | 10% | 10% |

| d \ ρ | 5% | 10% | 25% | 50% | 75% | 90% | 95% |
|---|---|---|---|---|---|---|---|
| 1 | 100% | 80% | 80% | 80% | 80% | 70% | 100% |
| 3 | 60% | 30% | 40% | 70% | 60% | 70% | 70% |
| 5 | 50% | 40% | 40% | 20% | 50% | 70% | 80% |
| 7 | 40% | 50% | 30% | 20% | 10% | 30% | 20% |
| 10 | 0% | 0% | 0% | 0% | 0% | 0% | 10% |
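A d × ρ heatmap cell is just per-configuration accuracy. The aggregation can be sketched in plain Python; the records below are fabricated, and real cells would each average 100 puzzles:

```python
# Sketch of aggregating per-puzzle results into a d × ρ accuracy grid.
# The (d, rho, correct) tuples are fabricated illustration data.
from collections import defaultdict

results = [
    (1, 0.05, True), (1, 0.05, True),
    (10, 0.50, False), (10, 0.50, False),
    (5, 0.25, True), (5, 0.25, False),
]

cells = defaultdict(lambda: [0, 0])      # (d, rho) -> [correct, total]
for d, rho, ok in results:
    cells[(d, rho)][0] += ok
    cells[(d, rho)][1] += 1

heatmap = {k: 100 * c / n for k, (c, n) in cells.items()}
print(heatmap)  # accuracy in % per (d, rho) cell
```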
The three load dimensions interact non-linearly. Per-model logistic GLMs with pairwise and three-way interaction terms reveal the following significant effects among the 22 paper models.
- d × N — significant in 17/22 models: difficulty and length compound beyond additive predictions.
- d × ρ — significant in 13/22 models: higher intrinsic difficulty amplifies sensitivity to distractor noise.
- N × ρ — significant in 4/22 models: length and distractor ratio interact significantly.
- d × N × ρ — significant in 12/22 models: increasing the signal share yields larger gains when tasks are simultaneously long and complex.
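The feature construction behind such an interaction analysis is a design matrix with main effects plus all pairwise and three-way products. A sketch of that matrix (the actual fitting step, e.g. a logistic regression on puzzle correctness, is omitted):

```python
# Design matrix for a logistic GLM on d, N, rho with pairwise and
# three-way interaction terms, as in the per-model analysis above.
import numpy as np

def interaction_design(d, N, rho):
    d, N, rho = (np.asarray(x, dtype=float) for x in (d, N, rho))
    cols = [np.ones_like(d), d, N, rho,   # intercept + main effects
            d * N, d * rho, N * rho,      # pairwise interactions
            d * N * rho]                  # three-way interaction
    return np.column_stack(cols)

X = interaction_design(d=[1, 10], N=[50, 250], rho=[0.05, 0.95])
print(X.shape)  # one row per puzzle, 8 columns
```

A model is counted toward, say, the d × N tally when the coefficient on the `d * N` column is statistically significant in its per-model fit.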
The evaluation pipeline distinguishes reasoning failures from context-budget exhaustion and formatting errors, enabling precise failure attribution.
At N = 250, the most common non-overflow failure is incorrect final attribution — the model loses track of sequential state updates rather than failing to parse the task.
All puzzles are solvable within the 32K-token context budget. Overflows concentrate in specific model–condition pairs at extreme N due to output verbosity.
Answer-format failures rise in smaller models under high N and d, but account for a minor share of total errors compared to reasoning failures.
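The three-way failure attribution described above amounts to a short decision cascade. This is a sketch under assumed field names (`correct`, `context_overflow`, `answer_parseable` are hypothetical, not the pipeline's actual schema):

```python
# Sketch of the failure-attribution cascade. Overflow is checked first,
# then formatting, so "reasoning failure" means a well-formed but wrong
# final answer. Field names are illustrative.

def classify_failure(run):
    if run["correct"]:
        return "success"
    if run["context_overflow"]:
        return "context-budget overflow"
    if not run["answer_parseable"]:
        return "answer-format failure"
    return "reasoning failure"

run = {"correct": False, "context_overflow": False, "answer_parseable": True}
print(classify_failure(run))  # reasoning failure
```

The ordering matters: an overflowed run never produces a final answer, so it must be attributed before the format check, and only parseable-but-wrong answers count as reasoning failures.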