Capacity, Not Format: Rethinking Structured Reasoning Failures

Hengxin Fan

Jun 8, 2026 · arXiv:2606.09410v1

cs.AI · cs.CL

#953 of 3489 · Artificial Intelligence

Tournament Score: 1449 ± 45

Win Rate: 67%

Rating: 5.8 / 10

Significance 6 · Rigor 5.5 · Novelty 5.5 · Clarity 7.5

Abstract

Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity gradient, we separate format-specific effects from prompt-length confounds across 4 models and 5 benchmarks with 0% parse failures on successfully generated responses. We find that structured formats are capacity-dependent. Models with sufficient headroom absorb JSON constraints without degradation (Sonnet: $88.7 \pm 4.0\%$ JSON vs. $89.3 \pm 1.7\%$ CoT on MATH-Hard). In contrast, formats severely degrade models operating near their limits through two distinct mechanisms. First, under standard token budgets, Haiku drops 36.2pp ($p < 0.0001$) largely due to truncation. Second, even with extended budgets eliminating truncation, GPT-4o-mini drops 28.0pp ($p < 0.001$), revealing pure capacity competition independent of token exhaustion. This format penalty scales with schema complexity (McNemar $p < 0.0001$) and cannot be explained by prompt length alone. Furthermore, these results qualify claims of frontier model immunity: on AIME competition math, Opus 4.7 drops from 96.2% to 91.0% under JSON ($-5.3$ pp; the displayed percentages are independently rounded, and the exact difference is $7/133 = 5.26$ pp $\approx 5.3$ pp). A delayed-structure ablation (reasoning freely before formatting) recovers most of the lost accuracy (3-run mean: 80--87%), supporting the capacity competition mechanism. The practical implication is not to avoid structured output, but to match it to capacity: when a model is near its limits, think first, format later.
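The rounding note in the abstract can be checked with a few lines of arithmetic. The sketch below takes the denominator of 133 from the abstract; the correct-answer counts of 128 and 121 are assumptions inferred here from the reported 96.2% and 91.0%, and are not stated in the source.

```python
from fractions import Fraction

N = 133             # denominator given in the abstract
correct_cot = 128   # assumption: inferred from the reported 96.2%
correct_json = 121  # assumption: inferred from the reported 91.0%

acc_cot = Fraction(correct_cot, N)
acc_json = Fraction(correct_json, N)
drop = acc_cot - acc_json  # exact difference, 7/133

# Each percentage is rounded on its own, so the rounded values differ by
# 5.2 while the exact difference rounds to 5.3.
print(round(float(acc_cot) * 100, 1))   # 96.2
print(round(float(acc_json) * 100, 1))  # 91.0
print(round(float(drop) * 100, 2))      # 5.26
```

Under these assumed counts, subtracting the displayed percentages gives 5.2pp, which is why the abstract reports the exact 7/133 figure separately.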

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper reframes the "format tax" narrative — the idea that structured output formats (JSON, XML) uniformly degrade LLM reasoning — by arguing that the penalty is capacity-dependent. The central claim is that structured formatting only harms reasoning when a model operates near its capability boundary; models with sufficient "headroom" absorb format constraints without cost. The paper introduces a useful decomposition: an information-matched prose control isolates format-specific effects from prompt-length confounds, and a delayed-structure ablation (think first, format later) tests the capacity competition hypothesis directly.

The practical takeaway — match format demands to model capacity, use delayed serialization for hard tasks — is actionable and relevant to the large community building agentic systems with structured API outputs.
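A minimal sketch of what delayed serialization might look like in practice follows. It is not the paper's implementation: `generate` is a hypothetical text-in, text-out model call, the prompt wording is illustrative, and `fake_model` is a stub used only so the example runs without an API.

```python
import json
from typing import Callable


def think_then_format(problem: str, generate: Callable[[str], str]) -> dict:
    """Two-pass 'think first, format later' sketch.

    `generate` is a hypothetical model call; prompts are illustrative only.
    """
    # Pass 1: unconstrained reasoning, with no schema in the prompt.
    reasoning = generate(
        "Solve the problem step by step in plain prose.\n\n" + problem
    )
    # Pass 2: serialization only; the model copies the answer into a small schema.
    raw = generate(
        "Extract the final answer from the solution below and reply with "
        'JSON of the form {"answer": "..."} and nothing else.\n\n' + reasoning
    )
    result = json.loads(raw)
    result["reasoning"] = reasoning
    return result


def fake_model(prompt: str) -> str:
    # Stub standing in for a real model, for demonstration only.
    if prompt.startswith("Extract"):
        return '{"answer": "4"}'
    return "2 + 2 = 4, so the answer is 4."


print(think_then_format("What is 2 + 2?", fake_model))
```

The design choice is that the schema never appears in the first prompt, so any cost of formatting is paid only after the reasoning is already written down.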

Methodological Rigor

Strengths:

The information-matched detailed prose (DP) control is a genuinely useful methodological contribution. Prior work confounds prompt complexity with format requirements; by matching informational content without syntactic constraints, the paper cleanly decomposes prompt-length effects from format-specific effects (e.g., Haiku: -8pp for prompt length, -28pp for format-specific cost).

The four-level schema complexity gradient (minimal → heavy) provides dose-response evidence for the capacity competition mechanism, with monotonic degradation for weaker models across reasoning-bearing schemas.

Multiple ablations (delayed structure, token-budget, production API modes) attack the same hypothesis from different angles, providing converging evidence.

0% parse failure rate on successfully generated responses eliminates extraction artifacts as a confound.

The unified logistic regression with cross-validated capacity proxy and clustered standard errors is appropriately conservative.
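The decomposition and paired test described in the strengths above can be sketched as follows. This is an illustration under stated assumptions, not the paper's analysis code: the per-problem correctness vectors are synthetic, built only to mirror the -8pp and -28pp split quoted for Haiku, and the exact binomial form of McNemar's test is one common variant that the paper may or may not have used.

```python
from math import comb


def accuracy(correct: list[bool]) -> float:
    return sum(correct) / len(correct)


def decompose_penalty(cot, prose, json_):
    """Split the total JSON penalty (percentage points) into two parts."""
    prompt_length = (accuracy(prose) - accuracy(cot)) * 100      # CoT -> matched prose
    format_specific = (accuracy(json_) - accuracy(prose)) * 100  # matched prose -> JSON
    return prompt_length, format_specific


def mcnemar_exact(a, b) -> float:
    """Two-sided exact McNemar p-value for paired correctness vectors."""
    only_a = sum(x and not y for x, y in zip(a, b))  # a right, b wrong
    only_b = sum(y and not x for x, y in zip(a, b))  # b right, a wrong
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Synthetic data for 100 problems (NOT the paper's per-item results).
cot = [True] * 89 + [False] * 11
prose = [True] * 81 + [False] * 19
json_ = [True] * 53 + [False] * 47

length_pp, format_pp = decompose_penalty(cot, prose, json_)
print(round(length_pp, 1), round(format_pp, 1))
print(mcnemar_exact(prose, json_))
```

Because the prose control carries the same information as the schema, the second difference is the part attributable to syntax alone, and the paired test uses only the problems where the two conditions disagree.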

Weaknesses:

The study uses only closed-source models, making mechanistic claims about "capacity competition" inherently speculative. The paper acknowledges this but the terminology implies more mechanistic understanding than the evidence supports.

Sample sizes are modest (100 problems per benchmark in most cases), and several important cells rely on single runs. The cross-benchmark table (Table 5) is entirely single-run.

The Sonnet heavy-schema replication failure (93% → 60-63% in a later audit, attributed to "API serving-layer drift") is concerning. It undermines confidence in the stability of results from closed-API models and raises questions about which other results might be similarly fragile.

The schema complexity gradient confounds prompt length, field count, and nesting depth simultaneously, as acknowledged.

The "capacity" concept remains operationally underspecified. Haiku and Sonnet have near-identical CoT baselines (~89%) on MATH-Hard yet diverge dramatically under JSON, which the paper interprets as differences in "surplus capacity" proxied by token efficiency. This is post-hoc reasoning — the token efficiency metric is not independently validated as a capacity measure.
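The sample-size concern in the weaknesses above can be made concrete with a confidence interval. The sketch below computes a 95% Wilson score interval for a single run; the input of 89 correct out of 100 is an illustrative value chosen to resemble the roughly 89% CoT baselines mentioned, not a figure taken from the paper's tables.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


# Illustrative input: 89 correct out of 100 problems in a single run.
lo, hi = wilson_interval(89, 100)
print(f"{lo:.3f} to {hi:.3f}")
```

For this input the interval is roughly 12 percentage points wide, so single-run differences of a few points between cells are hard to distinguish from sampling noise.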

Potential Impact

Practical impact is likely the paper's strongest contribution. The reasoning-formatting separation principle and the concrete design recommendations (prefer delayed serialization, allow text alongside tool calls, keep schemas lightweight) are directly applicable to production LLM systems. The finding that forced function calling collapses accuracy to 10% for capacity-limited models is immediately actionable for API designers.

Scientific impact is moderate. The core finding — that format effects interact with task difficulty and model capability — is intuitive and not deeply surprising given existing scaling and emergence literature. The paper provides systematic evidence for what practitioners already suspect but doesn't offer a predictive theory. The "capacity" framework lacks formal definition beyond operational proxies.

The paper could influence how benchmarking studies control for format effects and may encourage API providers to design interfaces that separate reasoning from structured output.

Timeliness & Relevance

This paper addresses a timely and practical concern. With the proliferation of agentic systems, function calling, and structured output APIs, understanding when and why format constraints hurt reasoning is directly relevant. The concurrent work by Lee et al. (2026) on "The Format Tax" demonstrates this is an active area, and this paper's capacity-dependent framing adds a useful dimension. The finding that even frontier models (Opus 4.7) show degradation on sufficiently hard tasks challenges claims of frontier immunity that could lead to overconfident deployment.

Strengths & Limitations

Key Strengths:

1. The information-matched prose control is a clean methodological innovation that should become standard practice in format-effect studies.

2. Multiple converging ablations (delayed structure, token budget, schema gradient, production API modes) provide robust support for the central claim.

3. The practical recommendations are concrete, actionable, and immediately applicable.

4. The paper honestly acknowledges limitations, including the Sonnet replication issue and the frontier boundary gap.

Notable Weaknesses:

1. Reliance on closed-source models fundamentally limits mechanistic insight. Claims about "capacity competition" are descriptive rather than explanatory.

2. The operational definition of "capacity" is circular in places: models that are harmed by format are defined as capacity-limited, and capacity limitation is in turn evidenced by format harm. The token-efficiency proxy partially addresses this but is not independently validated.

3. The 4-model, 5-benchmark design, while spanning provider families, is still relatively narrow. The absence of open-weight models, non-English tasks, and generation tasks limits generalizability claims.

4. Statistical concerns: modest sample sizes, extensive single-run cells, post-hoc pooling of frontier results, and potential API instability weaken confidence in precise effect estimates.

5. The paper's framing as "rethinking" structured reasoning failures somewhat overstates the novelty — the interaction between task difficulty and prompting technique effects is well-established in the scaling literature.

Additional Observations

The paper is clearly written and well-organized, with effective visualizations. The acknowledgment of GPT-5.5 for language editing is appropriately transparent. The distinction between truncation-driven and non-truncation capacity competition (Haiku vs. GPT-4o-mini) is a useful empirical finding that enriches the narrative beyond simple token-budget explanations. However, the paper could benefit from a formal predictive model that forecasts format tax from observable model properties, rather than the current descriptive framework.

Rating: 5.8 / 10

Significance 6 · Rigor 5.5 · Novelty 5.5 · Clarity 7.5

Generated Jun 9, 2026

Comparison History (18)

Won vs. A History-Aware Visually Grounded Critic for Computer Use Agents

Paper 2 addresses a fundamental and ubiquitous issue in large language models: the interaction between structured output generation (like JSON) and reasoning capabilities. Its insights into capacity competition and the actionable 'think first, format later' strategy offer broad, immediate impact across virtually all LLM applications. In contrast, Paper 1 presents a valuable but more narrowly focused improvement for Computer Use Agents and GUI navigation, limiting its overall breadth of scientific impact compared to Paper 2.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. Bellman-Taylor Score Decoding for Markov Decision Processes with State-Dependent Feasible Action Sets

Paper 2 has higher likely scientific impact: it introduces a novel RL/OR framework for MDPs with state-dependent, implicitly constrained action sets—an important, broadly occurring but under-served setting. It offers methodological rigor via a formal performance guarantee decomposing approximation vs. learning error, and demonstrates applicability on queueing network control with scalable gains. This can generalize across constrained control, logistics, energy, and manufacturing. Paper 1 is timely and practically useful for LLM deployment, but its core contribution is largely empirical/diagnostic and may have narrower theoretical and cross-domain reach than a general constrained-MDP learning framework.

gpt-5.2 · Jun 10, 2026

Won vs. Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages

Paper 1 provides a more broadly applicable theoretical framework—capacity-dependent format penalties—that impacts how the entire field thinks about structured output from LLMs, affecting prompt engineering, API design, and evaluation methodology across all applications. Its rigorous experimental design (information-matched controls, complexity gradients, ablations) and actionable 'think first, format later' principle have wide practical relevance. Paper 2, while insightful about metaprogramming in coding agents, addresses a narrower niche (esoteric languages) with less generalizable implications for the broader AI/ML community.

claude-opus-4-6 · Jun 10, 2026

Won vs. Trace2Policy: From Expert Behavior Traces to Self-Evolving Decision Agents

Paper 1 addresses a fundamental and broadly applicable question about structured output's interaction with model capacity, providing rigorous methodology (information-matched controls, complexity gradients, ablations) across multiple models and benchmarks. Its findings—that format penalties are capacity-dependent, not inherent—have immediate implications for the entire LLM community's prompting practices. Paper 2 presents a valuable applied system (Trace2Policy) for compliance decision-making with real deployment results, but its impact is narrower, targeting a specific class of enterprise decision tasks. Paper 1's breadth of relevance across all structured-output LLM applications gives it higher potential scientific impact.

claude-opus-4-6 · Jun 10, 2026

Won vs. Anything2Skill: Compiling External Knowledge into Reusable Skills for Agents

Paper 1 offers a more fundamental and broadly applicable scientific contribution. It rigorously disentangles the relationship between model capacity and structured output formatting, a question relevant to essentially all LLM applications. The controlled experimental design across multiple models and benchmarks, with clear mechanistic explanations (truncation vs. capacity competition), provides actionable insights for the entire field. Paper 2, while practically useful, presents a more incremental engineering framework for skill extraction in narrow agent domains (CLI tools), with less generalizable scientific insight and limited benchmark diversity.

claude-opus-4-6 · Jun 9, 2026

Lost vs. Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

Paper 1 addresses a fundamental challenge in AI safety (reward hacking) by introducing a novel mechanistic framework to detect misalignment precursors before they manifest. This contributes deeply to RL and alignment theory. Paper 2 provides highly practical but empirically transient prompt-engineering insights that may become obsolete as base models improve, giving Paper 1 a higher long-term scientific impact.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. Beyond Probabilistic Similarity: Structural, Temporal, and Causal Limitations of Retrieval-Augmented Generation in the Legal Domain

Paper 1 offers a clearly novel, experimentally grounded contribution with quantified effects, strong controls (information-matched prose, schema complexity gradient), multiple models/benchmarks, and mechanistic ablations (delayed-structure) that yield actionable guidance (“think then format”). Its findings generalize beyond a single domain to LLM evaluation, prompting, and structured-output system design, making it timely and broadly impactful. Paper 2 is a valuable conceptual framework for legal RAG, but appears less empirically validated and its impact is more domain-specific and dependent on future engineering instantiations.

gpt-5.2 · Jun 9, 2026

Won vs. Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints

Paper 2 addresses a fundamental and pervasive issue in large language models: the degradation of reasoning under structured formatting constraints. Its findings challenge existing assumptions about the 'reasoning tax' and offer a broadly applicable, model-agnostic mechanism (capacity competition) and practical solution ('think first, format later'). While Paper 1 is highly valuable for clinical informatics, Paper 2's insights have a much broader potential impact across the entire AI and NLP community, affecting almost all applications that rely on structured LLM outputs.

gemini-3.1-pro-preview · Jun 9, 2026

Lost vs. When Should LLMs Be Less Specific? Selective Abstraction for Reliable Long-Form Text Generation

Paper 1 introduces a novel conceptual framework (Selective Abstraction) that addresses a major open problem in LLMs—factual hallucinations in long-form generation. By moving beyond binary abstention to granular abstraction (trading specificity for reliability), it offers a highly innovative methodology with broad real-world applicability for high-risk domains. While Paper 2 provides valuable empirical insights into structured output failures, Paper 1 proposes a more profound architectural approach to uncertainty mitigation that could fundamentally shift how reliable generation is evaluated and deployed.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. Boosting Brain-to-Image Decoding with TRIBE v2 Data Augmentation

Paper 1 addresses a fundamental and timely question about structured output in LLMs, offering a rigorous methodological framework (information-matched controls, complexity gradients, multiple models/benchmarks) that yields actionable insights for the rapidly growing field of AI engineering. Its 'capacity-dependent' framing reframes a widely-held assumption and provides practical guidance (think first, format later) applicable across many LLM deployment scenarios. Paper 2 makes a solid contribution to brain decoding via data augmentation, but addresses a narrower audience and builds more incrementally on existing encoding models. Paper 1's breadth of impact across AI/ML practice is greater.

claude-opus-4-6 · Jun 9, 2026
