When Mean CE Fails: Median CE Can Better Track Language Model Quality

Hao Guo, Simon Dennis, Rivaan Patil, Kevin Shabahang

#1431 of 2682 · Artificial Intelligence
Tournament Score: 1401±42
Win Rate: 58% (11 wins, 8 losses, 19 matches)
Rating: 5.5/10

Abstract

Mean cross-entropy is the standard validation metric for language models, but it can fail to track model quality during training. We examine this in two common scenarios. First, in Qwen2.5-1.5B SFT on synthetic fact-learning, we find that mean CE rises substantially after the initial learning phase while held-out fact-recall accuracy remains near its peak. Second, we find that in top-K distillation on TinyStories, decreasing K improves median CE while worsening mean CE; the Top-5 student attains the highest LLM-judge score and crosses below its teacher on median CE, despite having the worst mean CE. In both cases, median CE correlates much more closely with task performance than does mean CE. Analyzing how bulk and tail percentile CE move during training reveals that training reshapes the empirical per-token CE distribution. In top-K distillation, smaller K yields a distribution with more mass at both extremes, decreasing the median and increasing the mean. In Qwen SFT, the bulk saturates quickly while the tail extends in the latter half of training. In both, the task-evaluation metric appears more sensitive to the bulk than to the tail. Practically, we recommend reporting a small set of percentile CE summaries alongside the mean, and using concordance among them as a tool to keep track of distribution reshaping, as well as a low-cost diagnostic for when mean and median CE disagree on model selection.
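The abstract's recommendation to report percentile CE alongside the mean can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the choice of percentiles, and the toy loss values are all assumptions made here. The toy data only shows the pattern the abstract describes, where the bulk improves while the tail gets worse.

```python
import math
from statistics import mean

def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) of an ascending list."""
    if not sorted_vals:
        raise ValueError("need at least one value")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def ce_summary(token_ce, qs=(50, 90, 95, 99)):
    """Mean plus a few percentiles of per-token cross-entropy."""
    vals = sorted(token_ce)
    out = {"mean": mean(vals)}
    for q in qs:
        out[f"p{q}"] = percentile(vals, q)
    return out

# Hypothetical per-token losses (not from the paper):
# the bulk of tokens improves while a small tail gets much worse.
early = [1.0] * 90 + [3.0] * 10
late = [0.5] * 90 + [9.0] * 10

summary_early = ce_summary(early)
summary_late = ce_summary(late)
print(summary_early)
print(summary_late)
```

On this toy data the median falls while the mean rises, so the two summaries would rank the two checkpoints in opposite order.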

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper identifies and characterizes a specific failure mode of mean cross-entropy (CE) as a validation metric for language models: when training reshapes the per-token loss distribution non-uniformly (bulk vs. tail moving differently), mean CE can become decoupled from actual task performance. The authors propose a simple, practical remedy—reporting percentile CE summaries (particularly median and p95) alongside the mean—and introduce a "concordance" diagnostic to detect when the choice of summary statistic becomes consequential for model selection.
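The review does not reproduce the paper's definition of concordance, so the sketch below uses one plausible reading as an assumption: the fraction of checkpoint pairs that two summary statistics order the same way. The function name and the sample values are hypothetical.

```python
from itertools import combinations

def concordance(a, b):
    """Fraction of checkpoint pairs that metrics a and b order the same way.

    Pairs tied under either metric are skipped. Returns NaN if no pair is comparable.
    """
    if len(a) != len(b):
        raise ValueError("metrics must cover the same checkpoints")
    agree = total = 0
    for i, j in combinations(range(len(a)), 2):
        da, db = a[i] - a[j], b[i] - b[j]
        if da == 0 or db == 0:
            continue
        total += 1
        agree += (da > 0) == (db > 0)
    return agree / total if total else float("nan")

# Hypothetical validation curves over four checkpoints (not from the paper).
mean_ce = [2.10, 1.80, 1.95, 2.30]
median_ce = [1.40, 1.10, 1.00, 0.95]

score = concordance(mean_ce, median_ce)
print(f"mean vs. median concordance: {score:.2f}")
```

A value near 1 would mean the choice of summary hardly matters for ranking checkpoints; a low value, as in this toy case, flags that mean and median CE would select different models.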

The paper demonstrates this in two scenarios: (1) Qwen2.5-1.5B supervised fine-tuning on synthetic fact-learning, where mean CE rises 60% in later epochs while accuracy barely changes (~3pp), and (2) top-K self-distillation on TinyStories, where decreasing K improves median CE and LLM-judge scores while worsening mean CE. The key insight is that task performance tracks the bulk of the loss distribution rather than the tail that dominates the mean.
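To make the top-K setting concrete, here is a rough sketch of how a top-K distillation target could be built from a teacher distribution. It assumes the retained probabilities are renormalized to sum to one; the paper may handle the discarded mass differently, and the function names and numbers are illustrative only.

```python
import math

def top_k_target(teacher_probs, k):
    """Keep the k largest teacher probabilities, zero the rest, renormalize.

    Renormalizing over the kept tokens is an assumption of this sketch.
    """
    order = sorted(range(len(teacher_probs)),
                   key=lambda i: teacher_probs[i], reverse=True)
    keep = order[:k]
    mass = sum(teacher_probs[i] for i in keep)
    target = [0.0] * len(teacher_probs)
    for i in keep:
        target[i] = teacher_probs[i] / mass
    return target

def soft_cross_entropy(target, student_probs, eps=1e-12):
    """Cross-entropy of the student against a soft target distribution."""
    return -sum(t * math.log(max(s, eps))
                for t, s in zip(target, student_probs) if t > 0)

# Hypothetical next-token distribution over a five-token vocabulary.
teacher = [0.50, 0.25, 0.15, 0.07, 0.03]
student = [0.40, 0.30, 0.15, 0.10, 0.05]

target_k2 = top_k_target(teacher, 2)
loss_k2 = soft_cross_entropy(target_k2, student)
print(target_k2, loss_k2)
```

With K = 2 the student receives no training signal on the three least likely tokens, which is the deliberate tail truncation the review refers to later.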

Methodological Rigor

The experimental design is generally sound, particularly the distillation experiment, which provides clean controlled variation (same architecture, data, and compute; only K varies). The monotonic dose-response across K values is convincing. The concordance framework, while simple, provides a useful formalization.

However, several limitations weaken the evidence:

Scale concerns. The primary experiments use ~34M-parameter models on TinyStories and a 1.5B-parameter model on synthetic facts. The FineWebEdu replication (150M student, 355M teacher) helps but still does not reach frontier scale. The authors acknowledge this, but the practical importance of the claim depends on whether the phenomenon matters at scale.

Evaluation weakness. The LLM-judge scores are modest in absolute terms (1.92–2.06 on a 1–5 scale) and the differences are small. The paired bootstrap for Top-5 vs. Full KL is significant (+0.145, 95% CI [+0.06, +0.24]), but Top-5 vs. teacher is "within bootstrap uncertainty." For linear attention, the authors explicitly decline to make significance claims. The reliance on a single judge (Claude Sonnet 4) with absolute scoring introduces potential biases that same-judge consistency doesn't fully address.
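For readers unfamiliar with the paired bootstrap cited above, a minimal sketch follows. It assumes a percentile bootstrap over per-prompt score differences; the resample count, the seed, and the judge scores are made up for illustration and do not reproduce the paper's procedure or numbers.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a - scores_b) over paired items."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equally long score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)
    # Resample prompts with replacement, keeping each pair together.
    boot_means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = boot_means[int(alpha / 2 * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, lo, hi

# Made-up judge scores for two models on the same eight prompts.
student = [2, 3, 2, 2, 3, 2, 1, 3]
baseline = [2, 2, 2, 1, 3, 2, 1, 2]

delta, ci_lo, ci_hi = paired_bootstrap_ci(student, baseline)
print(f"mean difference {delta:+.3f}, 95% CI [{ci_lo:+.3f}, {ci_hi:+.3f}]")
```

Pairing matters because both models are scored on the same prompts: resampling the differences removes prompt-level difficulty from the comparison. An interval that excludes zero is what "significant" means in the Top-5 vs. Full KL result quoted above.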

Synthetic fact-learning. The Qwen experiment uses artificially constructed data (120 fictional facts with formulaic phrasings). The tail examples at epoch 12 are dominated by completion-style format failures ("___" → "?"), suggesting the phenomenon may partly reflect format sensitivity rather than a general training dynamic.

Correlation vs. causation. The paper shows correlations between median CE and task performance but doesn't establish a causal mechanism for why downstream evaluations should track the bulk. The connection to token typicality (Meister et al., 2023) is suggestive but not developed.

Potential Impact

Practical utility. The recommendation to report percentile CE summaries is low-cost and immediately actionable. Any practitioner can implement this with minimal code changes. The concordance diagnostic is similarly lightweight. This could influence training monitoring practices.

Model selection. The finding that mean CE can mislead in distillation settings is directly relevant to the growing knowledge distillation literature. However, the scope is narrower than the title suggests—the paper demonstrates the phenomenon primarily in top-K distillation (a deliberate tail truncation) and in a specific SFT setup.

Limited generality. The "when" question from the introduction receives a partial answer: deliberate reshaping (top-K distillation) and implicit reshaping (SFT dynamics). But these are specific scenarios rather than a general characterization. The paper doesn't address pretraining at scale, RLHF, or other common training paradigms where this diagnostic would be most valuable.

Timeliness & Relevance

The paper addresses a real need. As LLM training grows more complex (multi-stage training, distillation, alignment), the limitations of perplexity as a validation metric become more pressing. The observations from Fang et al. (2025) and Ruan et al. (2024) about perplexity's limitations are gaining attention, and this paper offers a complementary perspective focused on distributional statistics rather than token weighting or scaling laws.

The connection to top-K distillation is timely given the growth of distillation methods (Dasgupta et al., 2026), offering a concrete example where standard evaluation can mislead.

Strengths

1. Clean experimental design in the distillation experiment with monotonic controlled variation.

2. Actionable recommendation that is trivial to implement—percentile reporting adds negligible compute.

3. Thorough appendices with robustness checks (linear attention, FineWebEdu, CBT correlation sanity check, per-percentile correlation tables).

4. Concordance framework provides a principled way to detect when summary statistics diverge.

5. Honest scope-setting—the authors explicitly note limitations and don't overclaim.

Limitations & Weaknesses

1. Scale gap. The most impactful claim would be that this matters for frontier model development; current evidence is at toy-to-moderate scale.

2. Narrow task coverage. Only fact-recall accuracy and LLM-judge story quality are tested as downstream metrics. No evaluation on standard benchmarks (MMLU, HumanEval, etc.).

3. The top-K mechanism is somewhat circular. Top-K distillation explicitly removes tail signal, so it's unsurprising that the resulting model has a different tail. The more interesting question—how often does implicit reshaping occur in standard training—is addressed only with one synthetic SFT experiment.

4. No theoretical grounding for why task performance should track the bulk. The paper is empirical throughout.

5. The "median is better" framing slightly oversells. Table 7 shows that in the Qwen case, p95 CE actually selects the best checkpoint, not median CE (which selects epoch 12, a worse checkpoint).
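The disagreement described in item 5 can be checked mechanically by selecting a checkpoint under each summary statistic and comparing the picks. The sketch below uses invented numbers, not the values in the paper's Table 7, and the helper name is hypothetical.

```python
def select_checkpoint(history, key):
    """Return the epoch whose validation summary `key` is lowest."""
    return min(history, key=lambda epoch: history[epoch][key])

# Invented validation summaries per epoch (not the paper's Table 7).
history = {
    4: {"mean": 1.20, "p50": 0.60, "p95": 3.10},
    8: {"mean": 1.45, "p50": 0.52, "p95": 4.00},
    12: {"mean": 1.90, "p50": 0.50, "p95": 5.50},
}

picks = {key: select_checkpoint(history, key) for key in ("mean", "p50", "p95")}
disagree = len(set(picks.values())) > 1
print(picks, "summaries disagree" if disagree else "summaries agree")
```

When the picks differ, no single summary should be trusted on its own, and the candidate checkpoints are worth comparing on the downstream task directly.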

Overall Assessment

This is a well-executed empirical study that identifies a real but somewhat niche phenomenon. The contribution is primarily diagnostic rather than methodological—it doesn't propose a new training objective or architecture, but rather a monitoring practice. The practical recommendation is sound and low-cost, making adoption easy. However, the limited scale, narrow evaluation, and partially circular experimental design (top-K removes tail → tail changes) moderate the significance. The paper makes a useful contribution to the LM evaluation toolkit but falls short of being a major advance.

Rating: 5.5/10
Significance 5 · Rigor 6 · Novelty 5.5 · Clarity 7.5

Generated May 26, 2026

Comparison History (19)

vs. Cultural Binding Heads in Language Models
gpt-5.2 · 5/28/2026

Paper 2 likely has higher impact: it challenges a ubiquitous evaluation practice (mean cross-entropy) with a simple, broadly applicable alternative (median/percentile CE) demonstrated across two realistic training regimes. The recommendation is low-cost to adopt, immediately actionable for model selection and monitoring, and relevant across architectures, datasets, and tasks—potentially influencing standard reporting norms. Paper 1 is novel mechanistic work with clear societal relevance, but its scope is narrower (specific cultural-binding behavior, benchmark-dependent), and adoption requires interpretability tooling and interventions that may be less standardized.

vs. Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability
gpt-5.2 · 5/28/2026

Paper 2 likely has higher impact due to broader cross-field relevance (logic, complexity theory, automated reasoning, evaluation of LLM reasoning), clearer real-world applicability (reliable assessment of models used in planning/verification-like tasks), and a methodological contribution (matched-pair protocol + ADR) that addresses known pitfalls in SAT evaluation and extends across representations (CNF, Vertex Cover, 3D packing). Paper 1 offers a useful diagnostic for LM training/selection, but its scope is narrower and its metric refinement (median/percentile CE) is less broadly transformative.

vs. Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation
claude-opus-4.6 · 5/28/2026

Paper 2 presents a novel edge-cloud framework addressing multiple practical challenges (privacy, bandwidth, multilingual translation) with a concrete system achieving state-of-the-art results across 45 languages. It combines architectural innovation with practical deployment considerations and releases code/models. Paper 1 offers a useful diagnostic insight (median vs. mean CE) but is more incremental—it identifies and characterizes an existing issue with a relatively straightforward recommendation (report percentile summaries). Paper 2 has broader real-world applicability and methodological contribution.

vs. The Compressive Knowledge Graph Hypothesis: Which Graph Facts Matter for Scientific Hypothesis Generation?
gpt-5.2 · 5/27/2026

Paper 2 has higher likely impact: it challenges a ubiquitous training/selection metric (mean cross-entropy) with a simple, broadly applicable alternative (median/percentile CE) and provides concrete evidence across two realistic settings (SFT fact learning, top-K distillation). The proposed diagnostic is low-cost, easy to adopt, and relevant to many labs, affecting evaluation practice across NLP/LLM training. Paper 1 is novel and useful for KG-guided scientific discovery, but its scope is narrower (battery materials + KG prompting) and the “compression/redundancy” finding may be more context-dependent.

vs. Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation
gpt-5.2 · 5/27/2026

Paper 2 likely has higher impact due to its creation of substantial community infrastructure (a large real-world dataset across hundreds of apps plus an open benchmarking toolkit), enabling broad, reproducible progress in an active area (VLM agents, mobile UI automation). Its findings on scaling and RL vs supervised finetuning are directly actionable and relevant to both academia and industry, with cross-field reach (HCI, RL, multimodal learning, agent evaluation). Paper 1 is a valuable metric insight for LM training diagnostics, but is narrower in scope and may see more incremental adoption.

vs. Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows
claude-opus-4.6 · 5/26/2026

Paper 1 addresses a critical and timely gap in LLM agent safety—trajectory-level hallucination detection in multi-agent systems—introducing a novel taxonomy, dataset, and evaluation framework. As agentic AI deployment accelerates in industry, this work has broad implications for reliability and safety. Paper 2 offers a useful but narrower practical insight (median vs. mean CE as a training metric), which is more of an incremental diagnostic recommendation than a paradigm shift. Paper 1's contribution is more novel, has broader cross-field impact, and addresses a more pressing problem in the AI safety landscape.

vs. Uncertainty Decomposition via Cyclical SG-MCMC and Soft-label Learning for Subjective NLP
gemini-3.1 · 5/26/2026

Paper 1 challenges a ubiquitous standard metric (mean cross-entropy) in language model training, offering a simple, universally applicable alternative (median CE) that better correlates with task performance. Its findings have broad, immediate implications for the entire field of LLM training, alignment, and evaluation. Paper 2, while methodologically rigorous, focuses on a much narrower subfield (uncertainty in subjective NLP), making its potential breadth of impact significantly smaller.

vs. MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
claude-opus-4.6 · 5/26/2026

MobileGym introduces a novel simulation platform enabling verifiable RL training for mobile GUI agents—a rapidly growing research area. It provides infrastructure (416 task templates, 28 apps, scalable parallel rollouts) that can accelerate an entire subfield, with strong sim-to-real transfer (95.1% retention). Paper 1 offers a useful but incremental diagnostic insight (median vs. mean CE), which is a practical recommendation rather than a new capability. MobileGym's broader applicability to GUI agents, RL research, and mobile automation gives it significantly higher potential impact across multiple communities.

vs. Explainable Wastewater Digital Twins: Adaptive Context-Conditioned Structured Simulators with Self-Falsifying Decision Support
gemini-3.1 · 5/26/2026

Paper 1 challenges a foundational metric (mean cross-entropy) in LLM training, proposing median CE as a more robust alternative. Given the massive scale of current LLM research, an improvement to standard evaluation metrics carries enormous potential for widespread, immediate adoption across the entire AI field, impacting countless models and downstream applications. While Paper 2 offers excellent methodological rigor and vital environmental applications, its direct scientific impact is more confined to the narrower domain of industrial control. Paper 1's fundamental insights into model evaluation offer broader, highly timely scientific impact.

vs. $D^2$-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing
gpt-5.2 · 5/26/2026

Paper 1 is more novel and impactful: it introduces a diffusion-LLM-specific safety monitoring paradigm leveraging intermediate denoising trajectories and a hesitation-based difficulty signal, then turns it into a practical, compute-adaptive routing system with strong multi-model, multi-dataset results. This has clear real-world deployment relevance (efficient, always-on safety) and broad implications for monitoring other iterative generative models. Paper 2 provides an important diagnostic/metrics insight (median vs mean CE) but is more incremental and primarily affects evaluation/reporting practices rather than enabling new system capabilities.

vs. CODESKILL: Learning Self-Evolving Skills for Coding Agents
gemini-3.1 · 5/26/2026

Paper 2 has higher potential scientific impact because it challenges a fundamental and universally used metric (mean cross-entropy) in language model training. By demonstrating scenarios where median CE better correlates with task performance and explaining the underlying distributional shifts, its findings are broadly applicable across all LLM research, evaluation, and distillation tasks. In contrast, Paper 1 presents a valuable but more narrowly focused framework specific to procedural skill learning for coding agents.

vs. Associations between echocardiographic traits and AI-ECG predictions of heart failure
claude-opus-4.6 · 5/26/2026

Paper 2 identifies a fundamental and broadly applicable issue with the standard validation metric (mean cross-entropy) used across all language model training, proposing a simple yet practical fix (median CE and percentile summaries). This has potential to impact the entire LLM training and evaluation ecosystem. Paper 1, while methodologically sound, is more incremental—it provides physiological validation of an existing AI-ECG model using established echocardiographic measures, offering interpretability insights but limited novelty in approach. Paper 2's breadth of impact across the rapidly growing LLM field gives it higher potential.

vs. SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
claude-opus-4.6 · 5/26/2026

Paper 2 addresses a fundamental and broadly applicable issue in language model training—the inadequacy of mean cross-entropy as a validation metric—and proposes a simple, practical remedy (median CE and percentile summaries) relevant to virtually all LLM training pipelines. Its insight about distribution reshaping during training is novel and immediately actionable across the entire ML community. Paper 1, while rigorous and addressing a real gap in LLM benchmarking for scientific assistance, targets a narrower audience (computational science) and contributes primarily an evaluation benchmark rather than a transferable methodological insight.

vs. EPPC-OASIS: Ontology-Aware Adaptation and Structured Inference Refinement for Electronic Patient-Provider Communication Mining in Secure Messages
gemini-3.1 · 5/26/2026

Paper 2 addresses a fundamental and ubiquitous metric (cross-entropy) in language model training, revealing critical flaws in how mean CE tracks model quality and proposing median CE as a better alternative. This finding has broad implications for all LLM research and development. In contrast, Paper 1 is highly specialized, focusing on a specific clinical NLP task (EPPC extraction) with modest performance gains, limiting its impact primarily to the medical informatics domain.

vs. ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization
gpt-5.2 · 5/26/2026

Paper 2 has higher potential impact: it introduces a neurosymbolic framework (ImProver 2) for automated proof optimization in Lean 4 with an expert-iteration pipeline, structured scaffolding, and new proof-structure metrics—advances that can directly improve formal math libraries, neural theorem proving training data, and maintainability at scale. It demonstrates strong empirical results (7B model beating much larger ones; scaffold helping across model sizes), suggesting methodological rigor and real-world applicability. Paper 1 is a valuable diagnostic insight for LM evaluation, but its contribution is narrower and more incremental.

vs. AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models
gemini-3.1 · 5/26/2026

Paper 1 challenges a deeply entrenched fundamental metric (mean cross-entropy) used universally in language model training. By demonstrating that median CE better tracks model quality and avoids mean CE's pitfalls, it offers a simple yet transformative insight that could shift how the entire NLP community evaluates and selects models. While Paper 2 provides a valuable benchmark for the growing field of audio-video generation, Paper 1's findings have a significantly broader scope and the potential to immediately impact a much larger portion of foundational AI research and development.

vs. PRIMA: Operational Patterns for Resilient Multi-Agent Research with Verifiable Identity and Convergent Feedback
gemini-3.1 · 5/26/2026

Paper 2 challenges a fundamental and universally used metric (mean cross-entropy) in LLM training, proposing median CE as a more accurate tracker of model quality. This insight has immense breadth of impact, potentially influencing how all language models are evaluated during training. Paper 1, while innovative in its engineering and multi-agent resilience strategies, addresses a more specialized subfield (multi-agent systems) and focuses heavily on architectural design, which may not generalize as widely across the broader machine learning community.

vs. Mediative Fuzzy Logic: From Type-1 Foundations to Type-2, Type-3 and Quantum Extensions
gemini-3.1 · 5/26/2026

Paper 1 addresses a fundamental metric in the highly active field of Large Language Model training. By demonstrating that median CE tracks model quality better than the standard mean CE during SFT and distillation, it offers an immediate, highly applicable improvement for AI practitioners. Paper 2 presents rigorous theoretical advancements in fuzzy logic, but its near-term practical adoption and breadth of impact are likely narrower. Due to the exploding relevance of LLM optimization and the widespread reliance on validation metrics, Paper 1 promises a significantly broader and more immediate scientific impact.

vs. AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions
gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact: it introduces a new benchmark for an emerging, high-stakes application area (computer-use agents) with clear real-world relevance and broad utility for evaluating robustness across models and labs. Benchmarks often become community standards, shaping subsequent research, and the work includes systematic corruptions, empirical findings, and a mitigation framework plus public release—supporting adoption and follow-on work. Paper 1 is insightful and useful for LM training diagnostics, but it is a narrower metric analysis and is less likely to catalyze widespread, cross-field uptake than a robustness benchmark for deployed agents.