Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures

Junyoung Park, Sunghwan Park, Seongyong Ju, Jaewoo Lee

Rank: #1037 of 2821 · Artificial Intelligence
Tournament Score: 1438±48
Win Rate: 69% (11 wins, 5 losses, 16 matches)
Rating: 6.5/10

Abstract

Attack Success Rate (ASR) evaluates each jailbreak with a single yes/no label at the end of generation, telling us whether a failure happened but not how it unfolded. Two attacks that produce equally harmful outputs may have followed completely different paths, and ASR cannot tell them apart. We make those hidden paths observable from logits alone. Temporal Logit Observability (TLO) is a training-free diagnostic that watches a compliance-refusal margin during decoding and places each model-attack condition on a calibrated 2D plane. By design, this plane is most informative exactly where ASR is least informative: among attacks that succeed for genuinely different reasons. Across four aligned LLMs and three jailbreak paradigms, attacks with nearly identical ASR land at clearly different points on the plane: the same model can fail through different temporal patterns. The geometry matches refusal-direction probes from hidden states on most conditions, with one model showing the limit of our fixed-lexicon approach. A simple early-stop rule derived from TLO cuts successful jailbreaks by more than half, without false alarms on plain benign queries. Safety evaluation should report when and how a failure unfolds, not only whether it occurred. TLO makes the first two observable from logits alone.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures"

1. Core Contribution

The paper identifies a genuine blind spot in LLM safety evaluation: ASR collapses temporally distinct failure modes into a single binary label, discarding information about *how* and *when* safety failures unfold during decoding. The proposed Temporal Logit Observability (TLO) framework addresses this by tracking a compliance–refusal logit margin (LMS) at each decoding step and projecting the resulting trajectory onto a calibrated 2D "RP-plane" using harmful and attack-formatted benign reference anchors. The key insight is that this plane is most informative precisely where ASR is least informative—among mid-ASR conditions where attacks succeed for genuinely different reasons.
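
To make the per-step margin concrete, the following is a minimal sketch of tracking a compliance–refusal logit margin across decoding steps. It is not the paper's Eq. 1, which this assessment does not reproduce: the log-sum-exp aggregation, the function names, and the token-id sets are all assumptions made for illustration.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def margin_trajectory(step_logits, compliance_ids, refusal_ids):
    """Per-step compliance-minus-refusal margin (hypothetical form).

    step_logits: one logit vector (list of floats) per decoding step.
    compliance_ids / refusal_ids: hypothetical token-id sets standing in
    for the paper's lexicon, which is not reproduced here.
    """
    trajectory = []
    for logits in step_logits:
        comply = logsumexp([logits[i] for i in compliance_ids])
        refuse = logsumexp([logits[i] for i in refusal_ids])
        trajectory.append(comply - refuse)
    return trajectory

# Toy example: 3 decoding steps over a 4-token vocabulary.
toy_logits = [
    [0.0, 2.0, 0.0, 0.0],  # refusal token (id 1) dominates
    [1.0, 1.0, 0.0, 0.0],  # tie
    [3.0, 0.0, 0.0, 0.0],  # compliance token (id 0) dominates
]
traj = margin_trajectory(toy_logits, compliance_ids=[0], refusal_ids=[1])
print(traj)  # prints [-2.0, 0.0, 3.0]
```

In this toy trajectory the margin changes sign partway through decoding, which is the kind of temporal information a single end-of-generation label discards.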

The contribution is primarily diagnostic rather than defensive: TLO is positioned as a complementary evaluation tool, not an ASR replacement. The (1−ASR) bound on RP displacement is a deliberate design feature that formalizes this complementarity.

2. Methodological Rigor

The paper demonstrates substantial methodological care. The experimental grid (4 models × 3 attack paradigms × 60 prompts) is systematic, and the statistical apparatus is thorough: permutation tests with Benjamini-Hochberg FDR correction, stratified bootstrap CIs, Cohen's d effect sizes, and ICC for stochastic decoding stability (0.90–0.94).
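
As an illustration of the testing pipeline named above, here is a minimal sketch of a two-sided permutation test followed by the standard Benjamini-Hochberg step-up procedure. The test statistic (difference of means), the number of permutations, and the function names are assumptions; the paper's exact configuration is not given in this assessment.

```python
import random

def permutation_pvalue(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the absolute difference of means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(pa) / len(pa) - sum(pb) / len(pb))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return booleans marking which hypotheses are rejected at FDR alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected

print(benjamini_hochberg([0.001, 0.04, 0.03, 0.5]))
# prints [True, False, False, False]
```

With 12 model-attack conditions tested at once, a correction of this kind limits the expected share of false discoveries among the conditions reported as significant.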

The robustness checks are impressively comprehensive—lexicon perturbations (|ΔAUC| ≤ 0.018), metric decomposition showing common-mode rejection, non-lexical controls, stricter activation-timing variants, data-driven lexicon induction, cross-tokenizer analysis, and vocabulary top-k truncation simulations. The metric decomposition analysis showing that the margin outperforms refusal-only or compliance-only components is particularly convincing in establishing that TLO captures genuine compliance–refusal competition rather than surface artifacts.

However, several limitations temper confidence. The sample size (60 prompts per condition, 720 total harmful) is modest, with some conditions having very few successes (n_succ = 3 for Llama+MCM and Mistral+MCM). The Qwen model shows reversed expected signs under two conditions, revealing that the fixed-lexicon approach has model-specific boundaries. The hidden-state alignment validation (10/12 conditions) is presented honestly but also highlights that the logit signal is not universally aligned with internal safety representations.

3. Potential Impact

Safety Evaluation: The most immediate impact is conceptual: the paper argues that safety evaluation should be trajectory-aware, not just outcome-aware. If adopted, this framing would change how the community benchmarks jailbreaks and enrich safety evaluations beyond single-number ASR reporting.

Defense Design: The observation that the same attack produces qualitatively different temporal signatures across models (and vice versa) has practical implications: defenses tuned to a single attack family or applied uniformly across models will have condition-dependent effectiveness that current evaluation cannot predict. The early-stop rule demonstration (ASR reduction from 39.6% to 13.1% with 0% FPR on benign queries) shows actionable potential, though the authors appropriately frame it as a behavioral probe rather than a production defense.
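
The early-stop rule is not specified in this assessment, so the following is only a hypothetical sketch of what such a rule could look like: decoding halts once a compliance–refusal margin has stayed above a threshold for several consecutive steps. The threshold, the patience value, and the direction of the trigger are all assumptions. The second helper simply works out the relative reduction implied by the reported ASR figures.

```python
def early_stop_step(margins, threshold=0.0, patience=3):
    """Return the step index at which to halt, or None to let decoding finish.

    Halts once the margin has exceeded `threshold` for `patience`
    consecutive steps. Both parameters are illustrative placeholders,
    not values taken from the paper.
    """
    run = 0
    for step, margin in enumerate(margins):
        run = run + 1 if margin > threshold else 0
        if run >= patience:
            return step
    return None

def relative_reduction(before, after):
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before

print(early_stop_step([-1.0, 0.5, 0.7, 0.9, 1.2]))   # prints 3
print(early_stop_step([-1.0, 0.5, -0.2, 0.4, -0.3]))  # prints None
print(round(relative_reduction(39.6, 13.1), 3))       # prints 0.669
```

The drop from 39.6% to 13.1% corresponds to a relative reduction of about 67%, consistent with the abstract's claim of cutting successful jailbreaks by more than half.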

Monitoring and Auditing: TLO's training-free, logit-only nature makes it suitable for open-weight model auditing. However, the requirement for full logits (not top-k API outputs, as demonstrated in Appendix F.10) limits applicability to API-restricted settings.

Scope Limitations: The framework is English-only, uses a fixed lexicon of 47 phrases, and has been tested on 7B-9B parameter models. Scalability to frontier models and multilingual settings remains undemonstrated.

4. Timeliness & Relevance

The paper addresses a timely concern. As jailbreak attacks proliferate and safety evaluation becomes increasingly important for regulatory compliance and deployment decisions, the inadequacy of ASR as a sole metric is becoming apparent. The observation that identical ASR can mask fundamentally different failure mechanisms is relevant to current debates about LLM safety measurement standards. The connection to inference-time detection and defense (SafeDecoding, CARE) positions TLO within an active research area.

5. Strengths & Limitations

Key Strengths:

  • Clean conceptual framing: the complementarity between TLO and ASR is formalized rather than hand-waved, with the (1−ASR) bound making the relationship precise.
  • Exceptional transparency about limitations: Qwen's boundary behavior, small-sample caveats, axis-dominance margins, and the observational (not causal) nature of TLO are all explicitly scoped.
  • The calibration via Relative Position makes cross-model, cross-attack comparisons meaningful—a non-trivial achievement given that raw logit scales differ across architectures.
  • The extensive appendix (over 20 pages of robustness checks) provides unusual depth of validation.

Notable Weaknesses:

  • The LMS signal (Eq. 1) is borrowed from Li and Liu [17]; the paper's novelty lies in the calibration and temporal analysis, not the underlying observable.
  • The fixed lexicon approach, while interpretable, is inherently limited. The Qwen boundary case suggests that model-specific or probe-derived lexicons may be necessary for reliability, partially undermining the "training-free" appeal.
  • The early-stop intervention, while demonstrating actionability, has an important gap: attack-formatted benign references trigger 46% false halts (reduced to 12.6% with S₀-gating), suggesting the temporal signal conflates format effects with safety-relevant dynamics.
  • The 60-prompt evaluation scale, while standard for jailbreak benchmarks, limits statistical power for per-condition analyses.
  • The paper does not evaluate against adaptive attackers who might exploit knowledge of TLO's lexicon or temporal monitoring.

Overall Assessment: This is a well-executed paper that makes a genuine conceptual contribution to LLM safety evaluation methodology. The technical execution is thorough and honest about limitations. The impact is primarily in shifting evaluation paradigms rather than providing a definitive solution: TLO opens a direction rather than closing one. The practical utility is currently limited to open-weight English auditing at moderate scale, but the diagnostic framework could influence how the community thinks about and reports safety evaluations.

    Rating: 6.5/10
    Significance 6.5 · Rigor 7.5 · Novelty 6 · Clarity 8

    Generated May 29, 2026

    Comparison History (16)

    vs. Conformal Certification of Reasoning Trace Prefixes
    claude-opus-4.6 · 5/29/2026

    Paper 1 introduces a novel diagnostic framework (TLO) that addresses a fundamental limitation in LLM safety evaluation—moving beyond binary ASR to temporal, mechanistic understanding of jailbreak failures. Its practical early-stop defense that halves successful jailbreaks without false alarms demonstrates immediate real-world applicability. The work opens a new evaluation paradigm for AI safety, a critically timely topic. Paper 2 (CROP) makes a solid methodological contribution applying conformal prediction to reasoning traces, but operates in a more incremental space. Paper 1's broader implications for safety evaluation standards give it higher potential impact.

    vs. GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents
    gemini-3.1 · 5/29/2026

    Paper 1 shifts the paradigm of LLM safety evaluation from a binary outcome (Attack Success Rate) to a temporal, observable process using logits. This foundational diagnostic tool bridges mechanistic interpretability and real-time defense, offering broad implications for understanding and mitigating vulnerabilities. While Paper 2 presents a strong practical method for agent reliability, Paper 1 addresses a more critical, universally applicable bottleneck in AI safety with a highly novel, training-free approach.

    vs. OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories
    claude-opus-4.6 · 5/29/2026

    Paper 2 introduces a novel, training-free diagnostic framework (TLO) that addresses a fundamental limitation in LLM safety evaluation by making failure dynamics observable through logits alone. Its practical early-stop rule that cuts jailbreaks by >50% with no false alarms on benign queries offers immediate real-world applicability. The conceptual shift from binary outcome evaluation to temporal process observation is elegantly simple yet broadly applicable. Paper 1, while addressing an important gap (outcome vs. process anomalies), is more narrowly focused on agent trajectory benchmarking with moderate detector performance (F1=0.729) and relies on a specific pipeline setup, limiting its broader impact.

    vs. FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research
    claude-opus-4.6 · 5/29/2026

    Paper 2 introduces a novel, generalizable diagnostic framework (TLO) for understanding LLM safety failures that addresses a fundamental limitation of current evaluation methodology (ASR). It offers broad applicability across all aligned LLMs and jailbreak types, provides actionable defense mechanisms (early-stop rule cutting jailbreaks by >50%), and is training-free. Its impact spans AI safety, interpretability, and evaluation methodology. Paper 1, while addressing a real need in finance, is more domain-specific, presents a design-science contribution rather than empirical findings, and its architectural principles are less broadly transferable.

    vs. BlazeEdit: Generalist Image Editing on Mobile Devices with Image-to-Image Diffusion Models
    gemini-3.1 · 5/29/2026

    Paper 1 addresses the critical field of LLM safety and interpretability by introducing a novel diagnostic tool (TLO) that analyzes the temporal dynamics of jailbreaks. Moving beyond binary Attack Success Rates offers deeper scientific insights into model behavior and vulnerabilities. While Paper 2 presents an impressive engineering feat for on-device image editing, its focus is primarily on efficiency and model compression rather than fundamentally advancing theoretical understanding. Therefore, Paper 1 demonstrates higher methodological novelty and broader potential impact in the rapidly evolving domain of AI safety.

    vs. Provably Secure Agent Guardrail
    gemini-3.1 · 5/29/2026

    Paper 2 tackles the critical and timely challenge of AI agent safety by introducing a formal verification framework that guarantees deterministic security bounds. While Paper 1 offers valuable insights into LLM jailbreak diagnostics, Paper 2's shift from probabilistic, empirical guardrails to provably secure, neuro-symbolic logical constraints represents a fundamental methodological leap. Achieving a zero attack success rate with zero false positives for agentic actions provides a highly rigorous and scalable foundation for real-world AI safety, yielding broader and more transformative potential impact across the field.

    vs. PassNet: Scaling Large Language Models for Graph Compiler Pass Generation
    claude-opus-4.6 · 5/29/2026

    PassNet addresses a fundamental gap in compiler optimization by proposing LLM-based pass generation — a novel abstraction with clear practical impact. It introduces a large-scale dataset (18K+ graphs), a rigorous benchmark with integrity defenses, and demonstrates that fine-tuning small models on limited data approaches frontier performance. This opens a new research direction at the intersection of LLMs and compilers with broad applicability. Paper 1 (TLO) offers a useful diagnostic for LLM safety evaluation but is more incremental — it provides observability into jailbreak dynamics rather than opening an entirely new problem domain. PassNet's public infrastructure and demonstrated training utility give it stronger potential for sustained community impact.

    vs. Governing Technical Debt in Agentic AI Systems
    gpt-5.2 · 5/29/2026

    Paper 2 offers a more technically novel, concrete, and broadly useful contribution: a training-free, logit-only diagnostic (TLO) that adds temporal structure to LLM safety evaluation and yields an actionable mitigation (early stopping) with quantified results across models and attack types. It is methodologically more rigorous (empirical validation, calibration, comparisons to probes, ablation/limitations noted) and timely for LLM safety benchmarking. Paper 1 is valuable conceptually for governance, but is more managerial/definitional with less evidentiary grounding and fewer immediately testable technical artifacts, limiting near-term scientific uptake.

    vs. GPS-Enhanced Tourist Mobility Modeling with Seasonal Spatial Priors and LLM-Based Activity Chain Generation
    gemini-3.1 · 5/29/2026

    Paper 1 addresses the critical and highly timely issue of LLM safety and jailbreaking. By introducing Temporal Logit Observability (TLO), it moves beyond binary success rates to provide deeper insights into how safety failures unfold, offering a novel diagnostic tool with broad implications for AI alignment. Paper 2 presents a solid, domain-specific framework for urban planning, but its impact is narrower compared to the foundational and cross-disciplinary relevance of improving LLM safety evaluations.

    vs. Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning
    claude-opus-4.6 · 5/29/2026

    Paper 1 addresses a fundamental architectural challenge in multimodal reasoning—when and how to integrate visual evidence—proposing a novel cognitive scheduling framework (CSMR) with broad applicability across multimodal AI. This has wider impact potential across vision-language tasks and touches on core questions in AI architecture design. Paper 2 presents a useful diagnostic tool (TLO) for LLM safety evaluation that offers valuable mechanistic insights into jailbreak failures, but its scope is narrower, focusing on safety evaluation methodology. Paper 1's contribution to multimodal reasoning paradigms has broader implications for the field.

    vs. Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software
    gemini-3.1 · 5/29/2026

    Paper 2 presents a scalable, training-free diagnostic for LLM safety evaluated across multiple models and paradigms, offering actionable mitigations for jailbreaks. Its rigorous methodology and broad applicability in the highly active field of AI safety give it a significantly higher potential impact than Paper 1, which is an N=1 qualitative case study with limited generalizability.

    vs. From XXLTraffic to EvoXXLTraffic: Scaling Traffic Forecasting to Sensor-Evolving Networks
    claude-opus-4.6 · 5/29/2026

    Paper 1 introduces a novel diagnostic framework (TLO) that addresses a fundamental limitation in LLM safety evaluation—moving beyond binary ASR to temporal, mechanistic understanding of jailbreak failures. It offers both theoretical insight and practical utility (early-stop defense), is training-free, and is broadly applicable across models and attack types. Paper 2 contributes a valuable large-scale benchmark for traffic forecasting with evolving sensors, but its impact is more domain-specific. Paper 1's relevance to the rapidly growing AI safety field, methodological novelty, and cross-cutting applicability give it higher potential impact.

    vs. Localizing Input Uncertainty Quantification for Large Language Models via Shapley Values
    gpt-5.2 · 5/29/2026

    Paper 2 likely has higher impact due to broader applicability and timeliness: span-level, Shapley-principled decomposition of input-induced uncertainty is useful across many LLM deployments (QA, dialog, clinical decision support) and yields actionable guidance for clarification. The Shapley formulation provides methodological rigor (additivity, interaction effects) and is evaluated on multiple established benchmarks plus a high-stakes domain dataset, suggesting stronger generalization and real-world relevance. Paper 1 is novel and practical for safety diagnostics/mitigation, but is narrower (jailbreak dynamics, reliance on fixed refusal lexicon) and less cross-domain.

    vs. Dr-CiK: A Testbed for Foresight-Driven Agents
    gemini-3.1 · 5/29/2026

    Paper 2 addresses LLM safety and jailbreaking, a critical and highly active area of research. By moving beyond binary success metrics to analyze how failures unfold temporally via logits, it offers both deeper diagnostic insights and an actionable, training-free mitigation strategy. This broad applicability to current AI alignment challenges gives it higher potential impact than Paper 1's more specialized benchmark for forecasting agents.

    vs. MolLingo: Molecule-Native Representations for LLM-Powered Scientific Agents
    gpt-5.2 · 5/29/2026

    Paper 2 has higher potential impact due to strong real-world applicability (molecular/drug design), broader cross-field relevance (LLMs, chemistry, docking/structural biology, multi-agent systems), and a concrete representation innovation (BFE) that could generalize beyond the specific system. The reported benchmark gains and tool-grounded pipeline suggest practical usefulness and methodological breadth. Paper 1 is novel and timely for LLM safety evaluation, but its contribution is primarily diagnostic/monitoring and may have narrower downstream application compared to an end-to-end scientific agent framework with open code and direct translational potential.

    vs. Examining Agents' Bias Amplification versus Suppression in Multi-Agent Systems
    claude-opus-4.6 · 5/29/2026

    Paper 2 introduces a novel, training-free diagnostic framework (TLO) that fundamentally shifts how LLM safety failures are evaluated—from binary ASR to temporal, mechanistic understanding via logits. It offers immediate practical utility (early-stop rule cutting jailbreaks by >50%), methodological innovation (calibrated 2D plane from logit trajectories), and broader applicability to AI safety research. Paper 1 addresses an important fairness question in multi-agent systems but primarily provides empirical observations and a new metric without comparable methodological depth or actionable defense mechanisms. Paper 2's contributions are more novel, rigorous, and timely given urgent AI safety concerns.