READER: Robust Evidence-based Authorship Decoding via Extracted Representations

Jiaxu Liu, Sunnan Mu, Dong Huang, Liuyin Wang, Jing Shao, Jie Zhang

Jun 9, 2026 · arXiv:2606.10794v1

cs.AI

#2335 of 3489 · Artificial Intelligence

Tournament Score: 1355 ± 43 (scale 1050 to 1800)

Win Rate: 50%

Rating: 6.5 / 10

Significance 6.5 · Rigor 7.5 · Novelty 7 · Clarity 7

Abstract

As agentic applications increasingly route user tasks through official and third-party LLM APIs, provenance becomes an operational question: which model generated a given black-box response? We study Dynamic Black-Box LLM Provenance: identifying the source LLM from generations elicited by query-varying, non-predefined prompts rather than a fixed input set or benchmark suite. This setting is difficult because prompt semantics dominate the text, while model-specific authorship traces are weak and inconsistent at the surface level. We introduce READER (Robust Evidence-based Authorship Decoding via Extracted Representations), a lightweight provenance framework that treats a frozen proxy LLM as a reader of hidden authorship evidence. READER maps black-box outputs into proxy activation space, temporally filters token states within each response, and performs Bayesian Evidence Accumulation by summing single-response log-posterior evidence across independently sampled prompts. This avoids fragile mean-pooling of prompt-specific representations while preserving the query-wise evidence needed for calibrated confidence. On Agent500, a 50-target dataset built from agent-style prompts, READER reaches $31.0$–$42.4\%$ top-1 accuracy from a single response and $70.0$–$84.0\%$ from 50 responses, substantially outperforming sentence-encoder fingerprints. Scaling across nine proxy readers further shows that stronger LLMs expose more linearly decodable authorship structure, suggesting that authorship perception is already present in frozen LLM representations and can be converted into reliable multi-query attribution.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: READER

1. Core Contribution

READER addresses Dynamic Black-Box LLM Provenance — identifying which LLM produced a response when prompts vary naturally (as in agent pipelines), rather than being drawn from a fixed, controlled set. This is a meaningful shift from prior "static" provenance methods (LLM-DNA, PhyloLM, LLMmap, Model Provenance Testing) that rely on shared or crafted prompt distributions. The core technical insight is to use a frozen proxy LLM as a feature extractor: black-box responses are fed through an uninstrumented proxy model, and its internal hidden states are treated as a representation space where subtle authorship traces become linearly decodable. Two-stage processing — temporal low-pass filtering within each response, followed by Bayesian Evidence Accumulation (BEA) across multiple independently prompted responses — separates noise reduction from evidence aggregation.
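The two-stage processing described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the temporal filter is a causal moving average of width M over token states and that a linear probe has already produced per-response logits over the candidate models. The names `moving_average`, `log_softmax`, and `accumulate_evidence`, and all numbers, are hypothetical.

```python
import math

def moving_average(states, m):
    """Low-pass filter a sequence of token state vectors with a causal width-m window."""
    filtered = []
    for t in range(len(states)):
        window = states[max(0, t - m + 1): t + 1]
        dim = len(window[0])
        filtered.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return filtered

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]

def accumulate_evidence(per_response_logits):
    """Sum single-response log-posteriors; return (best index, normalized log-posterior)."""
    n = len(per_response_logits[0])
    total = [0.0] * n
    for logits in per_response_logits:
        for c, lp in enumerate(log_softmax(logits)):
            total[c] += lp
    return max(range(n), key=total.__getitem__), log_softmax(total)

# Stage 1 (assumed form): smooth toy one-dimensional token states within one response.
smoothed = moving_average([[0.0], [2.0], [4.0]], m=2)

# Stage 2: three responses, three candidate models (made-up probe logits).
# The third response alone favors index 0, but the accumulated evidence favors index 1.
logits = [[0.0, 0.4, 0.1], [0.2, 0.5, 0.0], [0.3, 0.2, 0.1]]
best, log_post = accumulate_evidence(logits)
```

Keeping the per-response log-posteriors separate until the final sum is the point of the design: a single misleading response shifts the total by a bounded amount instead of distorting a pooled representation.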

2. Methodological Rigor

The experimental design is thorough. The paper evaluates on Agent500, a purpose-built 50-target benchmark with 500 agent-style prompts and 25,000 total trajectories spanning 9 model families. Prompt-level 5-fold cross-validation ensures disjoint prompts between train and test, which is critical for this setting.
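Prompt-level splitting means fold membership is decided per prompt, so every response to a given prompt lands on the same side of the split. A minimal sketch of that idea, assuming each record carries a prompt identifier; `prompt_level_folds` and the `(prompt_id, model_name, response_text)` record layout are hypothetical and not taken from the paper.

```python
import random

def prompt_level_folds(records, n_folds=5, seed=0):
    """Yield (train, test) splits in which no prompt appears on both sides.

    `records` is a list of (prompt_id, model_name, response_text) tuples.
    """
    prompt_ids = sorted({r[0] for r in records})
    random.Random(seed).shuffle(prompt_ids)
    # Assign whole prompts to folds, never individual responses.
    fold_of = {pid: i % n_folds for i, pid in enumerate(prompt_ids)}
    for k in range(n_folds):
        train = [r for r in records if fold_of[r[0]] != k]
        test = [r for r in records if fold_of[r[0]] == k]
        yield train, test

# Toy data: 10 prompts, each answered by 2 models.
records = [(p, m, f"response {p}-{m}") for p in range(10) for m in ("model_a", "model_b")]
splits = list(prompt_level_folds(records))
```

Splitting at the response level instead would let the probe see other models' answers to a test prompt during training, which would blur the distinction between prompt semantics and authorship signal.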

The ablation analysis is exceptionally detailed (the appendix is extensive):

M×K sweeps across all nine proxies confirm that M=4 temporal filtering and K=50 evidence accumulation are near-optimal.

Attention pooling vs. mean pooling (Appendix D.2) is tested across three proxies with uniformly negative results, strongly justifying the simpler design.

Fisher ratio analysis and principal-angle diagnostics provide geometric insight into why the method works.

PII redaction robustness simulates realistic API masking with graceful degradation.

The proxy scaling analysis (9 proxies, 8B–122B) showing Pearson r=0.942 between proxy capability and attribution accuracy is a compelling finding.

However, there are methodological concerns. The BEA derivation assumes conditional independence of responses given the source model, which may not hold when prompts are semantically clustered. The linear probe is simple but the paper acknowledges that task-specific architectures could improve results. The static-benchmark comparison (Appendix E) shows READER's per-sample proxy method underperforms simple prefix-matching baselines (MPT, PhyloLM) in the aligned-prompt regime, contextualizing where the method's advantages actually lie.
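The independence assumption can be made explicit. The following is a sketch of the accumulation rule in notation introduced here for illustration, not copied from the paper: $m$ ranges over candidate source models, $r_1, \dots, r_K$ are the observed responses, and $p(m)$ is the prior over candidates.

```latex
\begin{align}
p(m \mid r_1, \dots, r_K)
  &\propto p(m) \prod_{k=1}^{K} p(r_k \mid m)
  && \text{(conditional independence given } m\text{)} \\
  &\propto p(m) \prod_{k=1}^{K} \frac{p(m \mid r_k)}{p(m)}
  && \text{(Bayes' rule, dropping factors } p(r_k)\text{)} \\
\log p(m \mid r_1, \dots, r_K)
  &= \sum_{k=1}^{K} \log p(m \mid r_k) - (K-1) \log p(m) + \text{const}.
\end{align}
```

With a uniform prior the second term is constant in $m$, leaving the plain sum of single-response log-posteriors that the abstract describes. The first line is where the concern bites: if responses remain correlated given $m$, as they may when prompts are semantically clustered, the product counts overlapping evidence more than once and the accumulated posterior will tend to be overconfident.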

3. Potential Impact

Practical applications are well-motivated: API auditing, license compliance verification, detection of unauthorized model wrapping, and post-incident attribution in agentic systems. As LLM-powered agents proliferate and API routing becomes more complex, the ability to determine which model produced a response from observed traffic alone has genuine value.

Scientific implications are twofold. First, the finding that frozen proxy LLMs contain linearly decodable authorship information about *other* models' outputs is an interesting observation for the mechanistic interpretability community. Second, the correlation between proxy capability and authorship decodability (Figure 8) suggests that more capable models develop richer "reading" representations that incidentally encode source-model identity.

The closed-set limitation significantly constrains real-world deployment. The system requires labeled training data from every candidate model and cannot handle unknown sources — a critical gap for actual API auditing where new models appear constantly. The authors acknowledge this but offer no solution beyond retraining.

4. Timeliness & Relevance

The paper is well-timed. The shift toward agentic LLM applications, multi-model routing, and third-party API wrappers creates genuine provenance needs that prior static methods do not address. The formalization of "Dynamic Black-Box LLM Provenance" as a distinct problem is a useful contribution to the field's vocabulary.

The target ecosystem (50 models including recent Qwen3.5/3.6, Llama 4, Gemma 4) is impressively current, though it omits closed-source API targets (GPT-4o, Claude, Gemini), which are arguably the most important in real auditing scenarios.

5. Strengths & Limitations

Key Strengths:

Novel problem formulation: Dynamic prompts are genuinely harder and more realistic than static benchmarks.

Strong empirical improvement: 31–42% single-response accuracy vs. ~12% for best sentence-encoder baseline on a 50-way task (2% chance) is substantial.

Principled aggregation: BEA consistently outperforms geometric mean-pooling, with clear theoretical motivation and empirical validation.

Exhaustive ablations: The paper leaves few design choices unjustified.

Scaling insight: The proxy capability → attribution accuracy correlation is clean and suggestive.

Notable Limitations:

Closed-set assumption: No rejection mechanism for unknown models severely limits deployment viability.

No adaptive adversary evaluation: A provider aware of READER could paraphrase or randomize style. The PII masking test is helpful but not adversarial.

Single-source assumption: Real agent pipelines often chain or mix models.

Absolute accuracy ceiling: 84% top-1 at K=50 with the best proxy, while impressive relative to baselines, may be insufficient for high-stakes attribution. Within-family confusion (e.g., Qwen3 variants) remains substantial.

Computational cost: Requiring a frozen proxy LLM (up to 122B) to read every response for feature extraction is non-trivial, though the linear probe itself is lightweight.

No closed-source targets: Omission of GPT-4o, Claude, etc. limits the practical benchmark scope.

Overall Assessment

READER makes a solid contribution by formalizing a realistic provenance setting and demonstrating that frozen proxy LLM representations carry exploitable authorship signatures. The methodology is clean, the ablations are thorough, and the scaling analysis provides genuine scientific insight. However, the closed-set constraint, lack of adversarial robustness evaluation, and absence of closed-source API targets limit the paper's immediate practical impact. The work is best understood as a strong proof-of-concept that authorship evidence exists in proxy representations and can be accumulated probabilistically, rather than a deployment-ready auditing system.

Rating: 6.5 / 10

Significance 6.5 · Rigor 7.5 · Novelty 7 · Clarity 7

Generated Jun 10, 2026

Comparison History (20)

Won vs. Automated Mediator for Human Negotiation: Pre-Mediation via a Structured LLM Pipeline

Paper 1 is more methodologically novel and broadly impactful: it formalizes a difficult “dynamic black-box provenance” setting and introduces a principled Bayesian evidence-accumulation framework using proxy-LLM representations, with clear empirical gains and implications for security, accountability, and model governance across many LLM applications. Its core idea (authorship signals latent in frozen representations, aggregatable across prompts) is general and timely. Paper 2 has strong real-world relevance and human-subject evaluation, but the structured pipeline approach is more incremental in LLM systems and its impact is narrower to negotiation/mediation domains.

gpt-5.2 · Jun 11, 2026

Won vs. SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior

Paper 1 is more novel and broadly impactful: it introduces a new setting (dynamic black-box LLM provenance) and a principled method (proxy-LLM activation mapping + temporal filtering + Bayesian evidence accumulation) that can generalize across prompts and scales with multiple proxy readers. The applications (model provenance, auditing, security, governance) are immediate and cross-cutting. While Paper 2 is timely and methodologically careful for agent evaluation, its contribution is narrower (skill organization effects) and the demonstrated outcome gains are modest, suggesting more limited near-term impact across fields.

gpt-5.2 · Jun 11, 2026

Won vs. Mind the Perspective: Let's Reason Recursively for Theory of Mind

Paper 1 addresses a timely and practically important problem—LLM provenance in agentic systems—with a novel framework (READER) combining frozen proxy LLM representations with Bayesian evidence accumulation. It introduces a new problem formulation (Dynamic Black-Box LLM Provenance), provides a scalable methodology, and has broad implications for AI security, accountability, and trust. Paper 2, while strong on Theory of Mind benchmarks, is more incremental—applying recursive prompting to an established problem—and its impact is narrower, primarily limited to ToM evaluation. Paper 1's novelty, real-world applicability, and cross-field relevance give it higher impact potential.

claude-opus-4-6 · Jun 11, 2026

Lost vs. HERO: Hindsight-Enhanced Reflection from Environment Observations for Agentic Self-Distillation

Paper 2 addresses a fundamental and pervasive challenge in AI capability development: credit assignment in multi-turn autonomous agents. By utilizing local environmental observations for hindsight self-distillation, it provides a scalable alternative to standard RLHF/GRPO methods, which is critical for the next generation of reasoning and agentic models. While Paper 1 offers a novel security and provenance tool, Paper 2's methodological advancements in agent training are likely to have a broader and more direct impact on pushing forward state-of-the-art AI capabilities.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. AutoMine Solution for AV2 2026 Scenario Mining Challenge

Paper 1 is more novel and broadly impactful: it formalizes “dynamic black-box LLM provenance” and introduces a principled Bayesian evidence-accumulation framework using proxy-LLM representation space, with clear methodological claims (multi-query calibration, temporal filtering, scaling across readers) and a new dataset setting. Its applications span security, compliance, model governance, and agentic systems across many domains, making it timely and field-crossing. Paper 2 appears more competition-driven and engineering-focused for a specific benchmark, with narrower generality beyond autonomous-driving scenario mining.

gpt-5.2 · Jun 11, 2026

Won vs. Accelerating NeurASP with vectorization and caching

Paper 2 addresses a highly timely and critical issue in AI safety and security: LLM provenance and authorship attribution in black-box settings. Its novel approach of using a frozen proxy LLM to decode authorship traces has broad applications in AI policy, copyright, and misinformation tracking. In contrast, Paper 1 offers valuable, yet narrow, engineering optimizations for a specific neurosymbolic framework (NeurASP). Given the explosive growth of LLM APIs and the urgent need for provenance tools, Paper 2 demonstrates significantly higher potential for broad real-world application and cross-disciplinary impact.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. Parthenon Law: A Self-Evolving Legal-Agent Framework

Paper 2 likely has higher scientific impact due to stronger real-world applicability and breadth: it introduces an auditable, self-evolving agent framework tailored to a high-stakes domain (law), backed by a large empirical dataset (12,510 trajectories) and a concrete improvement loop that generalizes beyond any single model. Its modular decomposition (model/harness/skills/tools/knowledge) is timely for enterprise agent deployment and governance, with implications for other regulated workflows. Paper 1 is novel and relevant for provenance, but its impact is narrower (attribution) and appears more benchmark-specific.

gpt-5.2 · Jun 10, 2026

Lost vs. From Context-Aware to Conflict-Aware: Generalizing Contrastive Decoding for Knowledge Conflict in LLMs

Paper 2 addresses knowledge conflicts in Retrieval-Augmented Generation (RAG), a critical bottleneck in LLM deployment. By shifting from context-aware to conflict-aware decoding and introducing a new benchmark (TriState-Bench) and routing method, it offers a robust solution to a fundamental reliability issue. While Paper 1 presents an innovative approach to LLM provenance, Paper 2's focus on improving factual reliability in augmented generation systems has broader immediate applicability and higher potential impact across both academic NLP research and industry applications.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. From 0-to-1 to 1-to-N: Reproducible Engineering Evidence for MetaAI Recursive Self-Design

Paper 1 presents a novel, well-defined technical contribution (READER framework) with concrete experimental results addressing a timely and practical problem—LLM provenance in agentic systems. It introduces a new methodology with rigorous evaluation, demonstrating strong performance gains. Paper 2 is primarily a survey/framework paper that maps existing systems against proposed criteria and offers a protocol without completed experimental results. Paper 1's combination of novelty, methodological rigor, practical relevance to the growing LLM API ecosystem, and empirical validation gives it substantially higher potential impact.

claude-opus-4-6 · Jun 10, 2026

Lost vs. WorldFly: A World-Model-Based Vision-Language-Action Model for UAV Navigation

Paper 2 has higher estimated impact: it advances embodied autonomy with a world-model-based VLA approach tailored to UAV navigation under partial observability, introduces a new Urban Canyon benchmark, and targets a high-stakes real-world domain (urban UAVs). The methodological contribution (joint future video prediction + action via coupled flow matching) and benchmark can drive broader follow-on work across robotics, planning, and multimodal learning. Paper 1 is timely for LLM provenance and offers a clever Bayesian evidence-accumulation scheme, but its application scope is narrower and results (needing many queries for strong accuracy) may limit immediate operational uptake.

gpt-5.2 · Jun 10, 2026
