Hengxin Fan
Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity gradient, we separate format-specific effects from prompt-length confounds across 4 models and 5 benchmarks with 0% parse failures on successfully generated responses. We find that structured formats are capacity-dependent. Models with sufficient headroom absorb JSON constraints without degradation (Sonnet: % JSON vs. % CoT on MATH-Hard). In contrast, formats severely degrade models operating near their limits through two distinct mechanisms. First, under standard token budgets, Haiku drops 36.2pp () largely due to truncation. Second, even with extended budgets eliminating truncation, GPT-4o-mini drops 28.0pp (), revealing pure capacity competition independent of token exhaustion. This format penalty scales with schema complexity (McNemar ) and cannot be explained by prompt length alone. Furthermore, these results qualify claims of frontier model immunity: on AIME competition math, Opus 4.7 drops from 96.2% to 91.0% under JSON (pp; the displayed percentages are independently rounded, exact difference is pp pp). A delayed-structure ablation -- reasoning freely before formatting -- recovers most of the lost accuracy (3-run mean: 80--87%), supporting the capacity competition mechanism. The practical implication is not to avoid structured output, but to match it to capacity: when a model is near its limits, think first, format later.
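The abstract cites a McNemar test for the schema-complexity effect, which fits because each benchmark item is answered under both conditions. As a rough illustration of what such a paired comparison involves, here is a minimal sketch of the exact (binomial) McNemar test. The function name and the toy data are hypothetical, and nothing here reproduces the paper's actual analysis or numbers.

```python
from math import comb


def mcnemar_exact(baseline_correct, treated_correct):
    """Two-sided exact McNemar test on paired boolean outcomes.

    Returns (b, c, p_value), where b counts items solved only under the
    baseline condition and c counts items solved only under the treated one.
    """
    if len(baseline_correct) != len(treated_correct):
        raise ValueError("paired lists must have equal length")
    pairs = list(zip(baseline_correct, treated_correct))
    # Only discordant pairs carry information about a difference.
    b = sum(1 for x, y in pairs if x and not y)
    c = sum(1 for x, y in pairs if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0
    k = min(b, c)
    # Under H0 each discordant pair is a fair coin flip: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical toy data: 10 items, free-form CoT vs. JSON-constrained output.
cot = [True] * 10
json_mode = [False] * 8 + [True] * 2
b, c, p = mcnemar_exact(cot, json_mode)
print(f"CoT-only correct: {b}, JSON-only correct: {c}, p = {p:.4f}")
```

Items that both conditions get right or both get wrong drop out of the test, so the result depends on how many items flip, not on the raw accuracy gap.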
This paper reframes the "format tax" narrative — the idea that structured output formats (JSON, XML) uniformly degrade LLM reasoning — by arguing that the penalty is capacity-dependent. The central claim is that structured formatting only harms reasoning when a model operates near its capability boundary; models with sufficient "headroom" absorb format constraints without cost. The paper introduces a useful decomposition: an information-matched prose control isolates format-specific effects from prompt-length confounds, and a delayed-structure ablation (think first, format later) tests the capacity competition hypothesis directly.
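To make the information-matched prose control concrete, the sketch below builds two prompts that request the same fields, one as JSON and one as plain prose, so that any accuracy gap cannot be attributed to one prompt asking for more information. The schema, function names, and wording are assumptions for illustration. The review text does not specify how the paper constructs its controls.

```python
import json

# Hypothetical two-field schema: field name -> what the field should contain.
SCHEMA = {
    "reasoning": "step-by-step solution",
    "answer": "final numeric answer",
}


def json_prompt(question, schema):
    """Ask for the fields as a JSON object."""
    template = json.dumps({k: f"<{v}>" for k, v in schema.items()}, indent=2)
    return f"{question}\n\nRespond with a JSON object of this shape:\n{template}"


def prose_control_prompt(question, schema):
    """Ask for the same fields, in the same order, as plain prose."""
    parts = [f"your {name} ({desc})" for name, desc in schema.items()]
    listing = ", then ".join(parts)
    return (
        f"{question}\n\nRespond in plain prose. Give {listing}. "
        "Do not use JSON or any other markup."
    )


question = "What is 17 * 23?"
print(json_prompt(question, SCHEMA))
print()
print(prose_control_prompt(question, SCHEMA))
```

Both prompts name every field and its description, so the two conditions differ in output syntax rather than in requested content. A real control would presumably also need to match prompt length more carefully than this sketch does.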
The practical takeaway — match format demands to model capacity, use delayed serialization for hard tasks — is actionable and relevant to the large community building agentic systems with structured API outputs.
Practical impact is likely the paper's strongest contribution. The reasoning-formatting separation principle and the concrete design recommendations (prefer delayed serialization, allow text alongside tool calls, keep schemas lightweight) are directly applicable to production LLM systems. The finding that forced function calling collapses accuracy to 10% for capacity-limited models is immediately actionable for API designers.
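The delayed-serialization recommendation can be sketched as a two-pass call: the model first reasons without format constraints, and a second call only copies the result into JSON. The `generate` callable stands in for whatever model client is in use. It, the prompt wording, and the `{"answer": ...}` shape are all hypothetical and are not taken from the paper.

```python
import json
from typing import Callable


def delayed_structure(question: str, generate: Callable[[str], str]) -> dict:
    """Think first, format later: reason freely, then serialize."""
    # Pass 1: unconstrained reasoning, no schema in the prompt.
    reasoning = generate(
        f"Solve the problem. Think step by step.\n\nProblem: {question}"
    )
    # Pass 2: serialization only; the model is told not to re-solve.
    raw = generate(
        "Copy the final answer from the solution below into JSON of the form "
        '{"answer": "..."}. Do not re-solve the problem.\n\n'
        f"Solution:\n{reasoning}"
    )
    parsed = json.loads(raw)  # raises ValueError if the output is not valid JSON
    return {"reasoning": reasoning, "answer": parsed["answer"]}


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the control flow."""
    if prompt.startswith("Solve"):
        return "2 + 2 = 4. Final answer: 4"
    return '{"answer": "4"}'


result = delayed_structure("What is 2 + 2?", fake_generate)
print(result["answer"])
```

The second pass is a transcription task rather than a reasoning task, which is the separation the recommendation relies on. The cost is an extra model call per item.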
Scientific impact is moderate. The core finding, that format effects interact with task difficulty and model capability, is intuitive and not deeply surprising given the existing scaling and emergence literature. The paper provides systematic evidence for what practitioners already suspect but does not offer a predictive theory. The "capacity" framework lacks a formal definition beyond operational proxies.
The paper could influence how benchmarking studies control for format effects and may encourage API providers to design interfaces that separate reasoning from structured output.
This paper addresses a timely and practical concern. With the proliferation of agentic systems, function calling, and structured output APIs, understanding when and why format constraints hurt reasoning is directly relevant. The concurrent work by Lee et al. (2026) on "The Format Tax" demonstrates this is an active area, and this paper's capacity-dependent framing adds a useful dimension. The finding that even frontier models (Opus 4.7) show degradation on sufficiently hard tasks challenges claims of frontier immunity that could lead to overconfident deployment.
Strengths:
1. The information-matched prose control is a clean methodological innovation that should become standard practice in format-effect studies.
2. Multiple converging ablations (delayed structure, token budget, schema gradient, production API modes) provide robust support for the central claim.
3. The practical recommendations are concrete, actionable, and immediately applicable.
4. The paper honestly acknowledges limitations, including the Sonnet replication issue and the frontier boundary gap.
Weaknesses:
1. Reliance on closed-source models fundamentally limits mechanistic insight. Claims about "capacity competition" are descriptive rather than explanatory.
2. The operational definition of "capacity" is circular in places: models harmed by formatting are labeled capacity-limited, and capacity limitation is in turn evidenced by format harm. The token-efficiency proxy partially addresses this but is not independently validated.
3. The 4-model, 5-benchmark design, while spanning provider families, is still relatively narrow. The absence of open-weight models, non-English tasks, and generation tasks limits generalizability claims.
4. Statistical concerns: modest sample sizes, extensive single-run cells, post-hoc pooling of frontier results, and potential API instability weaken confidence in precise effect estimates.
5. The paper's framing as "rethinking" structured reasoning failures somewhat overstates the novelty — the interaction between task difficulty and prompting technique effects is well-established in the scaling literature.
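To illustrate the sample-size concern in item 4, the sketch below computes a 95% Wilson score interval for an accuracy estimate. The counts are hypothetical and chosen only to show how wide the interval is at benchmark-scale sample sizes. They are not the paper's figures.

```python
from math import sqrt


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical cells: the same 50% accuracy measured on 100 vs. 1000 items.
for n in (100, 1000):
    lo, hi = wilson_interval(n // 2, n)
    print(f"n={n}: accuracy 50.0%, 95% CI [{lo:.1%}, {hi:.1%}]")
```

With 100 items the interval spans roughly 19 percentage points, so point estimates from a single run of that size leave considerable room for sampling noise. Paired tests on the same items are more sensitive than comparing two such intervals.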
The paper is clearly written and well-organized, with effective visualizations. The acknowledgment of GPT-5.5 for language editing is appropriately transparent. The distinction between truncation-driven and non-truncation capacity competition (Haiku vs. GPT-4o-mini) is a useful empirical finding that enriches the narrative beyond simple token-budget explanations. However, the paper could benefit from a formal predictive model that forecasts format tax from observable model properties, rather than the current descriptive framework.
Generated Jun 9, 2026
Paper 2 addresses a fundamental and ubiquitous issue in large language models: the interaction between structured output generation (like JSON) and reasoning capabilities. Its insights into capacity competition and the actionable 'think first, format later' strategy offer broad, immediate impact across virtually all LLM applications. In contrast, Paper 1 presents a valuable but more narrowly focused improvement for Computer Use Agents and GUI navigation, limiting its overall breadth of scientific impact compared to Paper 2.
Paper 2 has higher likely scientific impact: it introduces a novel RL/OR framework for MDPs with state-dependent, implicitly constrained action sets—an important, broadly occurring but under-served setting. It offers methodological rigor via a formal performance guarantee decomposing approximation vs. learning error, and demonstrates applicability on queueing network control with scalable gains. This can generalize across constrained control, logistics, energy, and manufacturing. Paper 1 is timely and practically useful for LLM deployment, but its core contribution is largely empirical/diagnostic and may have narrower theoretical and cross-domain reach than a general constrained-MDP learning framework.
Paper 1 provides a more broadly applicable theoretical framework—capacity-dependent format penalties—that impacts how the entire field thinks about structured output from LLMs, affecting prompt engineering, API design, and evaluation methodology across all applications. Its rigorous experimental design (information-matched controls, complexity gradients, ablations) and actionable 'think first, format later' principle have wide practical relevance. Paper 2, while insightful about metaprogramming in coding agents, addresses a narrower niche (esoteric languages) with less generalizable implications for the broader AI/ML community.
Paper 1 addresses a fundamental and broadly applicable question about structured output's interaction with model capacity, providing rigorous methodology (information-matched controls, complexity gradients, ablations) across multiple models and benchmarks. Its findings—that format penalties are capacity-dependent, not inherent—have immediate implications for the entire LLM community's prompting practices. Paper 2 presents a valuable applied system (Trace2Policy) for compliance decision-making with real deployment results, but its impact is narrower, targeting a specific class of enterprise decision tasks. Paper 1's breadth of relevance across all structured-output LLM applications gives it higher potential scientific impact.
Paper 1 offers a more fundamental and broadly applicable scientific contribution. It rigorously disentangles the relationship between model capacity and structured output formatting, a question relevant to essentially all LLM applications. The controlled experimental design across multiple models and benchmarks, with clear mechanistic explanations (truncation vs. capacity competition), provides actionable insights for the entire field. Paper 2, while practically useful, presents a more incremental engineering framework for skill extraction in narrow agent domains (CLI tools), with less generalizable scientific insight and limited benchmark diversity.
Paper 1 addresses a fundamental challenge in AI safety (reward hacking) by introducing a novel mechanistic framework to detect misalignment precursors before they manifest. This contributes deeply to RL and alignment theory. Paper 2 provides highly practical but empirically transient prompt-engineering insights that may become obsolete as base models improve, giving Paper 1 a higher long-term scientific impact.
Paper 1 offers a clearly novel, experimentally grounded contribution with quantified effects, strong controls (information-matched prose, schema complexity gradient), multiple models/benchmarks, and mechanistic ablations (delayed-structure) that yield actionable guidance (“think then format”). Its findings generalize beyond a single domain to LLM evaluation, prompting, and structured-output system design, making it timely and broadly impactful. Paper 2 is a valuable conceptual framework for legal RAG, but appears less empirically validated and its impact is more domain-specific and dependent on future engineering instantiations.
Paper 2 addresses a fundamental and pervasive issue in large language models: the degradation of reasoning under structured formatting constraints. Its findings challenge existing assumptions about the 'reasoning tax' and offer a broadly applicable, model-agnostic mechanism (capacity competition) and practical solution ('think first, format later'). While Paper 1 is highly valuable for clinical informatics, Paper 2's insights have a much broader potential impact across the entire AI and NLP community, affecting almost all applications that rely on structured LLM outputs.
Paper 1 introduces a novel conceptual framework (Selective Abstraction) that addresses a major open problem in LLMs—factual hallucinations in long-form generation. By moving beyond binary abstention to granular abstraction (trading specificity for reliability), it offers a highly innovative methodology with broad real-world applicability for high-risk domains. While Paper 2 provides valuable empirical insights into structured output failures, Paper 1 proposes a more profound architectural approach to uncertainty mitigation that could fundamentally shift how reliable generation is evaluated and deployed.
Paper 1 addresses a fundamental and timely question about structured output in LLMs, offering a rigorous methodological framework (information-matched controls, complexity gradients, multiple models/benchmarks) that yields actionable insights for the rapidly growing field of AI engineering. Its 'capacity-dependent' framing reframes a widely-held assumption and provides practical guidance (think first, format later) applicable across many LLM deployment scenarios. Paper 2 makes a solid contribution to brain decoding via data augmentation, but addresses a narrower audience and builds more incrementally on existing encoding models. Paper 1's breadth of impact across AI/ML practice is greater.