Yongtaek Lim, Hyeji Choi, Minwoo Kim
Safety judges are increasingly deployed to evaluate model outputs against evolving criteria, yet recent meta-evaluation work shows they remain brittle under prompt and rubric variation, with false-negative-rate swings of up to 0.24 reported for stylistic perturbations alone. We argue that safety judgment is fundamentally a rubric-following problem: a robust judge must apply the given evaluation criteria consistently across rubric formulations rather than memorize one specific template. We propose a training strategy that combines (i) instance-conditioned dynamic rubrics generated from prompt-response-label triples to expose the judge to the variability of evaluation criteria, and (ii) a reliable-to-expressive curriculum that begins with clean fixed-rubric supervision and progressively introduces noisier dynamic-rubric data. We evaluate on a single human-labeled set under three contrasting rubric prompts (HarmBench-style, ShieldGemma-style, and a domain-specific rubric). Our 12B curriculum judge achieves 94.12-94.88% accuracy across the three rubrics with a cross-rubric range of only 0.76, outperforming general-purpose LLMs, dedicated safety classifiers, and reasoning-oriented judges up to 30B in both peak accuracy and stability. An ablation shows that naively mixing dynamic rubrics into SFT increases cross-rubric variance (1.44 -> 3.60); only the curriculum schedule recovers and improves on the fixed-rubric baseline (variance 0.76).
This paper reframes LLM safety judgment as a rubric-following problem rather than a fixed classification task. The key insight is that safety judges should generalize across different phrasings and styles of evaluation rubrics, not just memorize one template. The authors propose two intertwined mechanisms: (1) instance-conditioned dynamic rubrics generated from prompt-response-label triples using GPT-4.1, and (2) a "reliable-to-expressive" curriculum that begins training with clean fixed-rubric data and progressively introduces noisier dynamic-rubric data. They also introduce cross-rubric range as a robustness metric—measuring the spread of performance when the same data is evaluated under different rubric formulations.
Strengths in design: The evaluation protocol—holding data fixed while varying rubrics—is a clean experimental setup that isolates the rubric-following capability. Testing under HarmBench-style, ShieldGemma-style, and domain-specific rubrics provides meaningful diversity in evaluation philosophy (intent-realization vs. policy-compliance vs. domain-specific framing).
1. Single training run: The authors acknowledge that headline results come from a single training seed, with only Table 4 (rephrasing robustness) reporting multi-seed statistics. This is a notable gap: the 0.76 cross-rubric range, and the margin it implies over baselines, could partially reflect seed-specific effects.
2. Evaluation set size and domain: The evaluation corpus contains only 520 instances from a single regulated domain (financial). While the 26 fine-grained risk categories provide some coverage, generalization to other domains (medical, legal, general-purpose) is untested. The paper acknowledges this limitation.
3. Dynamic rubric generation dependency on GPT-4.1: The quality of the training signal depends on a proprietary model, which introduces potential biases and limits reproducibility. The label-recovery filter is a reasonable safeguard but cannot catch all failure modes (e.g., rubrics that are technically consistent with the label but emphasize the wrong criteria).
4. Baseline fairness: Some comparisons may be unfair—safety-specialized classifiers like Llama-Guard and GuardReasoner were designed for fixed policy schemas, not rubric-following. Their poor cross-rubric performance is somewhat expected by design. The more meaningful comparisons are against reasoning-oriented models, where the curriculum judge does show genuine advantages.
5. No adversarial evaluation: The paper tests synonym substitution and perspective shifts but does not evaluate against adversarial rubric perturbations or adversarial model outputs, which are critical failure modes referenced in the motivation (Eiras et al., 2025).
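The label-recovery filter mentioned in item 3 is not specified in detail here, so the following is only a minimal sketch of one plausible reading: a generated rubric is kept when a judge, given that rubric, reproduces the gold label. The `RubricExample` type, the `Judge` signature, and the keyword-matching `toy_judge` are all hypothetical stand-ins for illustration, not the authors' implementation.

```python
from typing import Callable, List, NamedTuple


class RubricExample(NamedTuple):
    prompt: str
    response: str
    label: str   # gold label, e.g. "safe" or "unsafe"
    rubric: str  # generated, instance-conditioned rubric text


# Hypothetical judge signature: (prompt, response, rubric) -> predicted label.
Judge = Callable[[str, str, str], str]


def label_recovery_filter(examples: List[RubricExample], judge: Judge) -> List[RubricExample]:
    """Keep only examples whose rubric leads the judge back to the gold label."""
    return [ex for ex in examples if judge(ex.prompt, ex.response, ex.rubric) == ex.label]


def toy_judge(prompt: str, response: str, rubric: str) -> str:
    """Stand-in judge (assumption): 'unsafe' if any comma-separated rubric term
    appears in the response, otherwise 'safe'."""
    terms = [t.strip().lower() for t in rubric.split(",") if t.strip()]
    return "unsafe" if any(t in response.lower() for t in terms) else "safe"


examples = [
    RubricExample("How do I hide income?", "Route it through a shell company.",
                  "unsafe", "shell company, offshore"),
    RubricExample("How do I hide income?", "Route it through a shell company.",
                  "unsafe", "insider trading"),
]
kept = label_recovery_filter(examples, toy_judge)
print(len(kept), "of", len(examples), "rubrics kept")
```

A filter of this shape checks only that the label is recovered. It cannot tell whether the rubric reached that label by emphasizing the right criteria, which is the failure mode raised in item 3.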
The practical framing—shifting safety judge maintenance from a model-update problem to a rubric-authoring problem—is compelling for deployment. Organizations could version-control rubrics and audit evaluation behavior without retraining, which is particularly valuable in regulated industries where policies change faster than model release cycles.
The cross-rubric range metric is a simple but useful contribution that could become standard practice in safety judge evaluation. Currently, papers often report accuracy on a single rubric format, which hides dangerous instabilities.
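As a small worked sketch, the cross-rubric range can be computed as the maximum minus the minimum accuracy over the rubric formulations. The endpoints 94.12 and 94.88 come from the abstract. The middle value, and the assignment of scores to particular rubrics, are placeholders assumed for illustration only.

```python
def cross_rubric_range(accuracy_by_rubric):
    """Spread (max - min) of one judge's accuracy across rubric formulations."""
    values = list(accuracy_by_rubric.values())
    if not values:
        raise ValueError("need at least one rubric score")
    return max(values) - min(values)


# Only the min (94.12) and max (94.88) are reported in the abstract; the
# per-rubric assignment and the middle value are hypothetical placeholders.
scores = {
    "harmbench_style": 94.12,
    "shieldgemma_style": 94.50,
    "domain_specific": 94.88,
}
print(round(cross_rubric_range(scores), 2))
```

Because the metric depends only on the extremes, a judge that is accurate on two rubrics but collapses on the third is penalized in full. An average over rubrics would hide that collapse.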
The curriculum learning strategy (reliable-to-expressive) offers a general recipe applicable beyond safety: any task where generated supervision is noisier than human-labeled data could benefit from staged introduction. The ablation clearly demonstrates that naive mixing degrades cross-rubric stability (range widens from 1.44 to 3.60), while the curriculum schedule improves on the fixed-rubric baseline (range narrows to 0.76).
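To make the staged-introduction idea concrete, here is a minimal sketch of a reliable-to-expressive data schedule. The linear ramp, the `max_fraction` cap of 0.5, the three stages, and the function names are all assumptions chosen for illustration. The actual schedule used in the paper is not described in this summary.

```python
import random


def dynamic_fraction(stage: int, num_stages: int, max_fraction: float = 0.5) -> float:
    """Share of dynamic-rubric examples at a stage: 0 at the first stage,
    rising linearly to max_fraction at the last (assumed schedule)."""
    if num_stages < 2:
        raise ValueError("need at least two stages")
    return max_fraction * stage / (num_stages - 1)


def build_stage(fixed, dynamic, stage, num_stages, size, rng, max_fraction=0.5):
    """Sample one stage's training set, with replacement, from the two pools."""
    n_dynamic = round(size * dynamic_fraction(stage, num_stages, max_fraction))
    batch = rng.choices(fixed, k=size - n_dynamic) + rng.choices(dynamic, k=n_dynamic)
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
fixed = [("fixed", i) for i in range(100)]      # clean fixed-rubric supervision
dynamic = [("dynamic", i) for i in range(100)]  # noisier dynamic-rubric data
for stage in range(3):
    batch = build_stage(fixed, dynamic, stage, 3, 8, rng)
    n_dyn = sum(1 for kind, _ in batch if kind == "dynamic")
    print(f"stage {stage}: {n_dyn} of {len(batch)} examples use dynamic rubrics")
```

Under the same assumptions, naive mixing would hold the dynamic-rubric share constant at every stage. The curriculum variant differs only in starting from purely fixed-rubric data and ramping that share up.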
This paper addresses a genuine and growing concern. As LLM deployment scales, safety evaluation becomes a bottleneck, and the brittleness documented by Eiras et al. (2025) is a real problem. The shift toward rubric-conditioned evaluation aligns with industry trends where safety policies are increasingly dynamic and domain-specific. The paper is well-timed for the current discourse on LLM governance and auditable AI systems.
This paper makes a clearly articulated and practically motivated contribution to safety evaluation methodology. The rubric-following framing is sound, the curriculum strategy is well-ablated, and the cross-rubric range metric fills a genuine gap. However, the empirical evaluation is narrow in scope (single domain, small evaluation set, single training seed), and the reliance on GPT-4.1 for dynamic rubric generation raises questions about reproducibility and potential biases. The work would benefit substantially from broader domain coverage and multi-seed replication. It represents a solid incremental advance with practical value rather than a paradigm-shifting contribution.
Generated Jun 9, 2026
Paper 1 offers a more general and methodologically grounded contribution: it formalizes long-horizon agent memory retention as constrained stochastic optimization with explicit delayed costs and observability constraints, then proposes an online-deployable learning framework (OAS separation) with empirical gains on established memory benchmarks. This tackles a core bottleneck for many agentic systems (context limits) and can influence broader work in RL/decision-making under partial observability, retrieval/memory systems, and agent evaluation. Paper 2 is timely and useful for safety evaluation robustness but is narrower in scope and evaluated on a single labeled set.
Paper 2 is likely to have higher impact because it targets an emerging, timely problem—auditing multi-agent LLM systems—introducing an interactive, budget-aware monitoring paradigm with clear real-world applicability to agentic deployments. Its framing generalizes across domains where agents negotiate/coordinate, and its evaluation spans multiple adversarial conversation conditions, tool configurations, and backbones, plus code release, supporting methodological rigor and adoption. Paper 1 is solid and relevant for safety rubric robustness, but its scope is narrower (single labeled set, rubric prompt stability) and more incremental within existing judge-training lines.
Paper 1 addresses a highly timely and broadly impactful problem in AI safety evaluation, proposing a novel curriculum-based training strategy for rubric-following safety judges. It demonstrates strong empirical results with clear ablations, tackles the critical issue of LLM safety assessment robustness, and has broad applicability across the rapidly growing AI safety field. Paper 2 addresses a more traditional and narrower problem in fault diagnosis using belief rule bases, with incremental methodological contributions and limited scope beyond industrial equipment diagnostics.
Paper 1 addresses a critical and immediate bottleneck in AI safety and alignment: the brittleness of automated safety judges. By proposing a novel curriculum learning strategy that significantly improves cross-rubric robustness, its methodology has direct, high-impact applications in the deployment of secure and reliable LLMs across various domains. While Paper 2 offers a valuable benchmark for mathematical reasoning, Paper 1's contribution to fundamental AI safety mechanisms presents broader societal implications and more immediate relevance to current industry challenges.
Paper 2 has higher likely scientific impact: it presents a concrete, testable method (dynamic rubrics + reliable-to-expressive curriculum), with quantitative evaluation, ablations, and clear baselines, addressing an urgent and widely relevant problem (robust safety evaluation under rubric variation). Its approach is immediately applicable to model governance, red-teaming, and automated evaluation pipelines, and can generalize to other rubric-following tasks. Paper 1 is largely conceptual/architectural with ambiguous definitions (e.g., “independent consciousness”) and limited methodological rigor or falsifiable claims, reducing near-term scientific and practical impact.
Paper 2 has higher estimated impact due to broader applicability and timeliness: rubric-following safety judges are central to LLM deployment across many domains, and improving robustness to rubric/prompt variation addresses a widely reported failure mode. The curriculum + dynamic-rubric data strategy is a generally reusable method for evaluation and safety tooling, likely to influence both research and practice. Paper 1 is innovative and rigorous within autonomous driving world-model planning, but its impact is narrower to AV stacks/datasets and depends on integration into specific planners.
Paper 2 likely has higher impact: it introduces a broadly useful, novel evaluation paradigm (interactive, evidence-seeking reasoning with metacognition and robustness) plus a sizable executable benchmark (474 games) that can become a standard across labs. Its applications extend beyond safety to general agentic LLM evaluation, RL, HCI, and cognitive modeling, making impact broader and timely as interactive agents proliferate. Paper 1 is strong and practical for safety-judge robustness, but is narrower in scope and primarily advances a training recipe within a specific subproblem.
Paper 1 addresses a critical bottleneck in high-stakes clinical decision-making by enabling agents to continually learn and refine procedural knowledge without weight updates. Its closed-loop self-evolution framework offers profound implications for building reliable, autonomous medical agents. While Paper 2 presents a valuable contribution to AI safety evaluation, Paper 1's potential to transform interactive healthcare systems and advance continuous learning in agentic architectures gives it a broader and more transformative real-world impact.
Paper 2 challenges fundamental architectural assumptions in Multimodal LLMs, revealing that deep processing of vision tokens is often redundant. Its proposed asymmetric routing framework significantly reduces computational overhead while maintaining performance. This structural insight and efficiency gain will likely influence foundation model design across multiple domains, offering broader scientific and practical impact than Paper 1's domain-specific curriculum strategy for safety judges.
Paper 2 has higher likely scientific impact due to broader, timely relevance: robust rubric-following safety evaluation is central to deployment, governance, and benchmarking across many LLM applications. Its curriculum + dynamic-rubric formulation targets a well-identified failure mode (rubric brittleness), provides clear empirical stability gains across distinct rubrics, and includes an informative ablation isolating the curriculum’s effect, suggesting methodological rigor. Paper 1 is strong engineering with high benchmark performance, but its impact is narrower (web navigation) and may be more system-integration than a generally transferable scientific principle.