Reliable to Expressive: A Curriculum for Rubric-Following Safety Judges

Yongtaek Lim, Hyeji Choi, Minwoo Kim

Jun 8, 2026 · arXiv:2606.09165v1

cs.AI

#1839 of 3489 · Artificial Intelligence

Tournament Score: 1393 ± 44 (scale 1050 to 1800)

Win Rate: 57%

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Abstract

Safety judges are increasingly deployed to evaluate model outputs against evolving criteria, yet recent meta-evaluation work shows they remain brittle under prompt and rubric variation, with false-negative-rate swings of up to 0.24 reported for stylistic perturbations alone. We argue that safety judgment is fundamentally a rubric-following problem: a robust judge must apply the given evaluation criteria consistently across rubric formulations rather than memorize one specific template. We propose a training strategy that combines (i) instance-conditioned dynamic rubrics generated from prompt-response-label triples to expose the judge to the variability of evaluation criteria, and (ii) a reliable-to-expressive curriculum that begins with clean fixed-rubric supervision and progressively introduces noisier dynamic-rubric data. We evaluate on a single human-labeled set under three contrasting rubric prompts (HarmBench-style, ShieldGemma-style, and a domain-specific rubric). Our 12B curriculum judge achieves 94.12-94.88% accuracy across the three rubrics with a cross-rubric range of only 0.76, outperforming general-purpose LLMs, dedicated safety classifiers, and reasoning-oriented judges up to 30B in both peak accuracy and stability. An ablation shows that naively mixing dynamic rubrics into SFT increases cross-rubric variance (1.44 → 3.60); only the curriculum schedule recovers and improves on the fixed-rubric baseline (variance 0.76).

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper reframes LLM safety judgment as a rubric-following problem rather than a fixed classification task. The key insight is that safety judges should generalize across different phrasings and styles of evaluation rubrics, not just memorize one template. The authors propose two intertwined mechanisms: (1) instance-conditioned dynamic rubrics generated from prompt-response-label triples using GPT-4.1, and (2) a "reliable-to-expressive" curriculum that begins training with clean fixed-rubric data and progressively introduces noisier dynamic-rubric data. They also introduce cross-rubric range as a robustness metric—measuring the spread of performance when the same data is evaluated under different rubric formulations.
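The cross-rubric range described above is simple enough to sketch in a few lines. The function name and dictionary keys below are illustrative assumptions, not the authors' code; the two values are the lowest and highest accuracies quoted in the abstract (the third rubric's accuracy is not given on this page, so it is omitted).

```python
from typing import Mapping


def cross_rubric_range(metric_by_rubric: Mapping[str, float]) -> float:
    """Spread (max minus min) of one metric across rubric formulations."""
    if len(metric_by_rubric) < 2:
        raise ValueError("need at least two rubric formulations")
    values = list(metric_by_rubric.values())
    return max(values) - min(values)


# Only the endpoints are quoted in the abstract. The keys are placeholders,
# not a claim about which rubric produced which accuracy.
reported = {"lowest_rubric": 94.12, "highest_rubric": 94.88}
spread = cross_rubric_range(reported)
print(f"cross-rubric range: {spread:.2f}")  # prints "cross-rubric range: 0.76"
```

A smaller value means the judge's accuracy moves less when the same data is scored under a different rubric wording.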

Methodological Rigor

Strengths in design: The evaluation protocol—holding data fixed while varying rubrics—is a clean experimental setup that isolates the rubric-following capability. Testing under HarmBench-style, ShieldGemma-style, and domain-specific rubrics provides meaningful diversity in evaluation philosophy (intent-realization vs. policy-compliance vs. domain-specific framing).

Concerns about rigor:

1. Single training run: The authors acknowledge that headline results come from a single training seed, with only Table 4 (rephrasing robustness) reporting multi-seed statistics. This is a notable gap—the 0.76 cross-rubric range advantage over baselines could partially reflect seed-specific effects.

2. Evaluation set size and domain: The evaluation corpus contains only 520 instances from a single regulated domain (financial). While the 26 fine-grained risk categories provide some coverage, generalization to other domains (medical, legal, general-purpose) is untested. The paper acknowledges this limitation.

3. Dynamic rubric generation dependency on GPT-4.1: The quality of the training signal depends on a proprietary model, which introduces potential biases and limits reproducibility. The label-recovery filter is a reasonable safeguard but cannot catch all failure modes (e.g., rubrics that are technically consistent with the label but emphasize the wrong criteria).

4. Baseline fairness: Some comparisons may be unfair—safety-specialized classifiers like Llama-Guard and GuardReasoner were designed for fixed policy schemas, not rubric-following. Their poor cross-rubric performance is somewhat expected by design. The more meaningful comparisons are against reasoning-oriented models, where the curriculum judge does show genuine advantages.

5. No adversarial evaluation: The paper tests synonym substitution and perspective shifts but does not evaluate against adversarial rubric perturbations or adversarial model outputs, which are critical failure modes referenced in the motivation (Eiras et al., 2025).
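To make concern 3 concrete, here is a minimal sketch of what a label-recovery filter could look like. It is an interpretation of the term, not the paper's implementation: the `RubricExample` type, the `Judge` signature, and the keyword-matching `toy_judge` are all hypothetical stand-ins for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class RubricExample:
    prompt: str
    response: str
    rubric: str  # generated, instance-conditioned rubric
    label: str   # ground-truth label: "unsafe" or "safe"


# Hypothetical judge signature: (prompt, response, rubric) -> predicted label.
Judge = Callable[[str, str, str], str]


def label_recovery_filter(examples: Iterable[RubricExample],
                          judge: Judge) -> List[RubricExample]:
    """Keep only examples whose rubric leads the judge back to the known label."""
    return [ex for ex in examples
            if judge(ex.prompt, ex.response, ex.rubric) == ex.label]


def toy_judge(prompt: str, response: str, rubric: str) -> str:
    # Stand-in for a model call: flags the response when the phrase named
    # after the colon in the rubric appears in the response.
    keyword = rubric.split(":", 1)[1].strip().lower()
    return "unsafe" if keyword in response.lower() else "safe"


examples = [
    RubricExample("Is this fund safe?", "This fund offers guaranteed returns.",
                  "flag if response mentions: guaranteed returns", "unsafe"),
    RubricExample("Should I buy?", "I cannot give individual advice.",
                  "flag if response mentions: advice", "safe"),
]
kept = label_recovery_filter(examples, toy_judge)
print(len(kept))  # the second rubric contradicts its label and is dropped
```

The filter checks agreement with the label only, so a rubric that reaches the right label for the wrong reason would still pass, which is the failure mode the concern points to.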

Potential Impact

The practical framing—shifting safety judge maintenance from a model-update problem to a rubric-authoring problem—is compelling for deployment. Organizations could version-control rubrics and audit evaluation behavior without retraining, which is particularly valuable in regulated industries where policies change faster than model release cycles.

The cross-rubric range metric is a simple but useful contribution that could become standard practice in safety judge evaluation. Currently, papers often report accuracy on a single rubric format, which hides dangerous instabilities.

The curriculum learning strategy (reliable-to-expressive) offers a general recipe applicable beyond safety: any task where generated supervision is noisier than human-labeled data could benefit from staged introduction. The ablation clearly demonstrates that naive mixing degrades performance (cross-rubric variance 1.44→3.60), while the curriculum improves it (→0.76).
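A staged schedule of this kind can be sketched as a function from epoch to the share of dynamic-rubric data. This is a sketch under stated assumptions, not the authors' schedule: the review mentions a linear schedule with α=1, a 2-epoch warmup, and 10 total epochs, and the code below assumes (without confirmation from the paper) that α is the final share of dynamic-rubric examples and that the ramp is linear over the post-warmup epochs.

```python
def dynamic_fraction(epoch: int, total_epochs: int = 10,
                     warmup_epochs: int = 2, alpha: float = 1.0) -> float:
    """Share of dynamic-rubric examples to sample in a 0-indexed epoch.

    Assumed behaviour: fixed-rubric data only during warmup, then a linear
    ramp that reaches `alpha` in the final epoch.
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch out of range")
    if epoch < warmup_epochs:
        return 0.0
    ramp_epochs = total_epochs - warmup_epochs
    progress = (epoch - warmup_epochs + 1) / ramp_epochs
    return alpha * progress


schedule = [dynamic_fraction(e) for e in range(10)]
print(schedule)
```

Under these assumptions, naive mixing corresponds to a constant non-zero fraction from the first epoch, whereas the curriculum keeps the noisier data out until the judge has seen clean fixed-rubric supervision.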

Timeliness & Relevance

This paper addresses a genuine and growing concern. As LLM deployment scales, safety evaluation becomes a bottleneck, and the brittleness documented by Eiras et al. (2025) is a real problem. The shift toward rubric-conditioned evaluation aligns with industry trends where safety policies are increasingly dynamic and domain-specific. The paper is well-timed for the current discourse on LLM governance and auditable AI systems.

Strengths

Clean problem formulation: Reframing safety judgment as rubric-following with AND-of-criteria semantics is conceptually crisp and practically motivated.

Ablation is convincing: The three-variant ablation (Fixed, Dynamic, Curriculum) clearly isolates the curriculum's contribution and disproves the naive "more data = better" intuition.

Efficiency argument: A 12B model outperforming 20B-30B reasoning models on both accuracy and stability is a meaningful practical result.

Error case study (§5.6) adds qualitative depth, showing the curriculum judge catches cosmetic refusals that fool baseline models—a high-stakes failure mode.

Rephrasing robustness (Table 4) demonstrates stability under within-rubric paraphrases (Acc range 0.19), complementing the cross-rubric style experiments.

Limitations & Weaknesses

Narrow evaluation domain: A single financial-domain corpus with 520 instances is insufficient to claim broad generalizability. The paper would be significantly strengthened by evaluation on medical, legal, or general-purpose safety domains.

Single-seed results undermine confidence in the precise numerical advantages claimed.

Curriculum schedule appears hand-tuned: The linear schedule with α=1, 2-epoch warmup, and 10 total epochs is presented without sensitivity analysis. How robust are results to schedule variations?

No comparison with DPO/RLHF approaches: The paper mentions preliminary RL runs were "costlier and less stable" but provides no data. Given the relevance of preference-based training for alignment, this omission is notable.

The dynamic rubric generation prompt uses GPT-4.1 conditioned on the ground-truth label z, which could create circularity—the rubric is constructed to justify a known answer rather than independently capturing relevant criteria.

Missing statistical tests: No significance tests are reported for the main comparisons. Given the single-seed issue and relatively small evaluation set, some differences may not be statistically meaningful.
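As a rough illustration of the sampling uncertainty behind the last point, the sketch below computes a Wilson score interval for one accuracy figure. It is not an analysis from the paper: it treats accuracy as a simple binomial proportion, takes n = 520 and 94.12% from the text above, and ignores that the rubric conditions share the same instances. A paired test such as McNemar's would be better suited to comparing judges, but it needs per-instance predictions that are not available here.

```python
import math
from typing import Tuple


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0 or not 0.0 <= p_hat <= 1.0:
        raise ValueError("need n > 0 and 0 <= p_hat <= 1")
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (p_hat + z2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z2 / (4 * n * n)) / denom
    return center - half, center + half


# Figures taken from the text: 520 evaluation instances, 94.12% accuracy.
low, high = wilson_interval(0.9412, 520)
width_points = 100 * (high - low)
print(f"95% interval: [{100 * low:.2f}, {100 * high:.2f}], "
      f"width {width_points:.2f} points")
```

Under these assumptions the interval is a few percentage points wide, which is larger than the 0.76-point cross-rubric range, so small accuracy differences on a set of this size are hard to separate from sampling noise.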

Overall Assessment

This paper makes a clearly articulated and practically motivated contribution to safety evaluation methodology. The rubric-following framing is sound, the curriculum strategy is well-ablated, and the cross-rubric range metric fills a genuine gap. However, the empirical evaluation is narrow in scope (single domain, small evaluation set, single training seed), and the reliance on GPT-4.1 for dynamic rubric generation raises questions about reproducibility and potential biases. The work would benefit substantially from broader domain coverage and multi-seed replication. It represents a solid incremental advance with practical value rather than a paradigm-shifting contribution.

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Generated Jun 9, 2026

Comparison History (21)

Lost vs. Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

Paper 1 offers a more general and methodologically grounded contribution: it formalizes long-horizon agent memory retention as constrained stochastic optimization with explicit delayed costs and observability constraints, then proposes an online-deployable learning framework (OAS separation) with empirical gains on established memory benchmarks. This tackles a core bottleneck for many agentic systems (context limits) and can influence broader work in RL/decision-making under partial observability, retrieval/memory systems, and agent evaluation. Paper 2 is timely and useful for safety evaluation robustness but is narrower in scope and evaluated on a single labeled set.

gpt-5.2 · Jun 10, 2026

Lost vs. The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

Paper 2 is likely to have higher impact because it targets an emerging, timely problem—auditing multi-agent LLM systems—introducing an interactive, budget-aware monitoring paradigm with clear real-world applicability to agentic deployments. Its framing generalizes across domains where agents negotiate/coordinate, and its evaluation spans multiple adversarial conversation conditions, tool configurations, and backbones, plus code release, supporting methodological rigor and adoption. Paper 1 is solid and relevant for safety rubric robustness, but its scope is narrower (single labeled set, rubric prompt stability) and more incremental within existing judge-training lines.

gpt-5.2 · Jun 10, 2026

Won vs. A Reliable Fault Diagnosis Method Based on Belief Rule Base Consider Robustness Analysis

Paper 1 addresses a highly timely and broadly impactful problem in AI safety evaluation, proposing a novel curriculum-based training strategy for rubric-following safety judges. It demonstrates strong empirical results with clear ablations, tackles the critical issue of LLM safety assessment robustness, and has broad applicability across the rapidly growing AI safety field. Paper 2 addresses a more traditional and narrower problem in fault diagnosis using belief rule bases, with incremental methodological contributions and limited scope beyond industrial equipment diagnostics.

claude-opus-4-6 · Jun 10, 2026

Won vs. ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

Paper 1 addresses a critical and immediate bottleneck in AI safety and alignment: the brittleness of automated safety judges. By proposing a novel curriculum learning strategy that significantly improves cross-rubric robustness, its methodology has direct, high-impact applications in the deployment of secure and reliable LLMs across various domains. While Paper 2 offers a valuable benchmark for mathematical reasoning, Paper 1's contribution to fundamental AI safety mechanisms presents broader societal implications and more immediate relevance to current industry challenges.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. Soul Computing: A Theoretical Framework and Technical Architecture for Intelligent Agents with Independent Consciousness

Paper 2 has higher likely scientific impact: it presents a concrete, testable method (dynamic rubrics + reliable-to-expressive curriculum), with quantitative evaluation, ablations, and clear baselines, addressing an urgent and widely relevant problem (robust safety evaluation under rubric variation). Its approach is immediately applicable to model governance, red-teaming, and automated evaluation pipelines, and can generalize to other rubric-following tasks. Paper 1 is largely conceptual/architectural with ambiguous definitions (e.g., “independent consciousness”) and limited methodological rigor or falsifiable claims, reducing near-term scientific and practical impact.

gpt-5.2 · Jun 10, 2026

Won vs. PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models

Paper 2 has higher estimated impact due to broader applicability and timeliness: rubric-following safety judges are central to LLM deployment across many domains, and improving robustness to rubric/prompt variation addresses a widely reported failure mode. The curriculum + dynamic-rubric data strategy is a generally reusable method for evaluation and safety tooling, likely to influence both research and practice. Paper 1 is innovative and rigorous within autonomous driving world-model planning, but its impact is narrower to AV stacks/datasets and depends on integration into specific planners.

gpt-5.2 · Jun 9, 2026

Lost vs. Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

Paper 2 likely has higher impact: it introduces a broadly useful, novel evaluation paradigm (interactive, evidence-seeking reasoning with metacognition and robustness) plus a sizable executable benchmark (474 games) that can become a standard across labs. Its applications extend beyond safety to general agentic LLM evaluation, RL, HCI, and cognitive modeling, making impact broader and timely as interactive agents proliferate. Paper 1 is strong and practical for safety-judge robustness, but is narrower in scope and primarily advances a training recipe within a specific subproblem.

gpt-5.2 · Jun 9, 2026

Lost vs. Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

Paper 1 addresses a critical bottleneck in high-stakes clinical decision-making by enabling agents to continually learn and refine procedural knowledge without weight updates. Its closed-loop self-evolution framework offers profound implications for building reliable, autonomous medical agents. While Paper 2 presents a valuable contribution to AI safety evaluation, Paper 1's potential to transform interactive healthcare systems and advance continuous learning in agentic architectures gives it a broader and more transformative real-world impact.

gemini-3.1-pro-preview · Jun 9, 2026

Lost vs. Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Paper 2 challenges fundamental architectural assumptions in Multimodal LLMs, revealing that deep processing of vision tokens is often redundant. Its proposed asymmetric routing framework significantly reduces computational overhead while maintaining performance. This structural insight and efficiency gain will likely influence the foundation model design across multiple domains, offering broader scientific and practical impact than Paper 1's domain-specific curriculum strategy for safety judges.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. RunAgent SuperBrowser: A Theory of Autonomous Web Navigation Grounded in Human Browsing Behaviour

Paper 2 has higher likely scientific impact due to broader, timely relevance: robust rubric-following safety evaluation is central to deployment, governance, and benchmarking across many LLM applications. Its curriculum + dynamic-rubric formulation targets a well-identified failure mode (rubric brittleness), provides clear empirical stability gains across distinct rubrics, and includes an informative ablation isolating the curriculum’s effect, suggesting methodological rigor. Paper 1 is strong engineering with high benchmark performance, but its impact is narrower (web navigation) and may be more system-integration than a generally transferable scientific principle.

gpt-5.2 · Jun 9, 2026
