Artificial Intelligence Paper Rankings
AI-estimated scientific impact ranking of the latest arXiv Artificial Intelligence preprints.
Towards a General Intelligence and Interface for Wearable Health Data
Girish Narayanswamy, Maxwell A. Xu +6
A Signal-Language Foundation Model for Broad-Spectrum Cardiovascular Assessment from Routine Electrocardiography
Ziqing Yu, Yuhui Tao +6
Advancing Mathematics Research with AI-Driven Formal Proof Search
George Tsoukalas, Anton Kovsharov +6
ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence
Rui Meng, Bhavana Dalvi Mishra +6
Why LLMs Fail at Causal Discovery and How Interventional Agents Escape
Amartya Roy, Sonali Parbhoo
Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure
Max Lamparth, Daniel Fein +3
AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation
Shanghua Gao, Ada Fang +1
Entropy Distribution as a Fingerprint for Hallucinations in Generative Models
Mattia J. Villani, Pranav Deshpande +3
Calibrating Conservatism for Scalable Oversight
William Overman, Mohsen Bayati
Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems
Shubham Agarwal, Alexander Krentsel +6
Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases
Dongyoon Hahm, Dylan Hadfield-Menell +1
When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models
Dasol Choi, Alex Kwon
Inference Time Context Sparsity: Illusion or Opportunity?
Sahil Joshi, Prithvi Dixit +6
SkillOpt: Executive Strategy for Self-Evolving Agent Skills
Yifan Yang, Ziyang Gong +6
SIA: Self Improving AI with Harness & Weight Updates (v2)
Prannay Hebbar, Yogendra Manawat +5
Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs
Zhe Yu, Wenpeng Xing +5
Neuro-Inspired Inverse Learning for Planning and Control
Maryna Kapitonova, Tonio Ball
Learning to Search and Searching to Learn for Generalization in Planning
Michael Aichmüller, Yannik Hesse +1
Human-like in-group bias in instruction-tuned language model agents
Messi H. J. Lee
CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents
Bowen Wang, Dunjie Lu +6
RULER: Representation-Level Verification of Machine Unlearning
Georgina Cosma, Axel Finke
What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation
Xiang Wang, Wei Wei
Understanding and Mitigating Premature Confidence for Better LLM Reasoning
Jingchu Gai, Guanning Zeng +5
The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context
Zhe Yu, Wenpeng Xing +5
DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes
Caijun Xu, Changyi Xiao +2
Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
Ali Hatamizadeh, Yejin Choi +1
Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling
Dao Tran, Duc Anh Le +4
Fundamental Limitation in Explaining AI
Atsushi Suzuki, Jing Wang
Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
Nick Merrill, Jaeho Lee +1
Forecasting Scientific Progress with Artificial Intelligence (v2)
Sean Wu, Pan Lu +6
Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory
Abdelghny Orogat, Essam Mansour
The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence
MiniMax +6
Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems
Jianing Zhu, Yeonju Ro +6
Voluntary Collusion with Secret Tools in Competing LLM Agents
Xijie Zeng, Frank Rudzicz
LACUNA: Safe Agents as Recursive Program Holes
Yaoyu Zhao, Yichen Xu +4
Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation
Zhaoyang Jiang, Xuanqi Peng +6
AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models
Ruiyi Zhang, Peijia Qin +3
A governance horizon for ethical-use constraints in open-weight AI models
Weiwei Xu, Hengzhi Ye +4
Credit Assignment with Resets in Language Model Reasoning
Ankur Samanta, Akshayaa Magesh +6
PolyFusionAgent: A Multimodal Foundation Model and Autonomous AI Assistant for Polymer Property Prediction and Inverse Design (v2)
Manpreet Kaur, Xingying Zhang +1
UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems
Yiqun Chen, Wei Yang +6
LipoAgent: Coordinating Fine-Tuned LLM Agents for Safer Lipid Design
Leshu Li, An Lu +6
CaMBRAIN: Real-time, Continuous EEG Inference with Causal State Space Models
Abhilash Durgam, Nyle Siddiqui +4
Learning to Reason Efficiently with A* Post-Training
Andreas Opedal, Francesco Ignazio Re +4
Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training
Kohsei Matsutani, Gouki Minegishi +3
CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning
Linas Nasvytis, Simon Jerome Han +4
Inverting the Shield: Systematically Generating Safety Tests from Policy Specifications
Xiaoyue Lu, Xianglin Yang +5
From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills
Zisu Huang, Jingwen Xu +6
MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems
Qianshu Cai, Yonggang Zhang +5
AI Cartography: Mapping the Latent Landscape of AI Benchmark Ecosystems
Michael Hardy, Anka Reuel +6
EVE-Agent: Evidence-Verifiable Self-Evolving Agents
Yamato Arai, Yuma Ichikawa
Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning
Zhe Yu, Wenpeng Xing +5
LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?
HuiMing Fan, Xiao Wang +6
SafeMed-R1: Clinician-Audited Safety and Ethics Alignment for Medical Large Language Models
Chao Ding, Mouxiao Bian +6
ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay
Zhexin Hu, Li Wang +5
CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists
Junlin Yang, Dylan Zhang +6
From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator
Xiaohua Wang, Jiakang Yuan +6
OpenURMA: A Clean-Room Open Implementation of the Unified Bus Protocol
Bojie Li
Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents
Yongxiang Li, Moxin Li +5
Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal
Kia-Jüng Yang, Dominik Meier +3
Hera: Learning Long-Horizon Coordination for Device-Cloud Collaborative LLM Agents
Yuxin Zhang, Mengxue Hu +6
SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent
Yuyang Hu, Hongjin Qian +6
The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems
Dongxin Guo
Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor
Guoxin Ma, Yibing Liu +6
Proper Scoring Rules for Agentic Uncertainty Quantification
Suresh Raghu, Satwik Pandey +1
Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information
Renjie Gu, Jiaxu Li +6
LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation
Gabriele Cesa, Thomas Hehn +5
StructBreak: Structural Cognitive Overload-Induced Safety Failures in MLLMs
Yang Luo, Xinran Liu +4
Behavioural Analysis of Alignment Faking
Nathaniel Mitrani Hadida, Rhea Karty +2
From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation
Shuaike Li, Kai Zhang +3
Advancing Creative Physical Intelligence in Large Multimodal Models
Cheng Qian, Hyeonjeong Ha +6
Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning
Zhikai Pan, Chih-Ting Liao +6
HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models
Yuyu Liu, Haotian Xu +4
PALoRA: Projection-Adaptive LoRA for Preserving Reasoning in Large Language Models
Mustafa Hayri Bilgin, Mariam Barry +3
Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows
Yilun Yao, Xinyu Tan +6
Position: AI Safety Requires Effective Controllability
Yige Li, Yunhao Feng +1
Hypothesis Generation and Inductive Inference in Children and Language Models
Jeffrey Qin, Wasu Top Piriyakulki +5
Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning
Chen Linze, Cai Yufan +2
MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents
Zihan Li, Xingyu Fan +2
PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft
Yuchen Guo, Junli Gong +3
A Unified Framework for the Evaluation of LLM Agentic Capabilities
Pengyu Zhu, Lijun Li +6
Can LLMs Introspect? A Reality Check
Shashwat Singh, Tal Linzen +1
AMEL: Accumulated Message Effects on LLM Judgments
Sid-ali Temkit
Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems
Aman Priyanshu, Supriti Vijay +1
GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models
Vartan Shadarevian, Kia Ghods +2
Continual Model Routing in Evolving Model Hubs
Jack Bell, Giacomo Carfì +2
Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost
Simon Dennis, Rivaan Patil +2
Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy
Xu Shen, Zhen Tan +5
Do Clinical Models Change Treatment Decisions?
Dongkyu Cho, Miao Zhang +1
AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning
Yuyang Hu, Hongjin Qian +6
How Well Do Models Follow Their Constitutions?
Arya Jakkli, Senthooran Rajamanoharan +1
Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction
João Sedoc, Baotong Zhang +1
A Policy-Driven Runtime Layer for Agentic LLM Serving
Rui Zhang, Chaeeun Kim +1
From Accounting to Coordination: A Virtual Water-Aware Electricity-Computation-Water Nexus Framework for Data Center Dispatch
Haiyang You, Chengwei Lou +4
Emotional intelligence in large language models is fragmented across perception, cognition, and interaction
Minghao Lv, Lu Chen +6
Geometry of Human Perceptual Domains Emerges Transiently in LLM Representations
Simardeep Singh, Paras Chopra
MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation
Xiaoyu Dong, Zhi Li +1
MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
Dingbang Wu, Rui Hao +6
GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration
Junjie Zhao, Jingyi Liang +6
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
Yue Cheng, Jiajun Zhang +4
AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents
Haoran Zhang, Zhaohua Sun
Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning
Zihan Liang, Yufei Ma +5
TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning
Chusen Li, Zhou Liu +2
PathCal: State-Aware Reflection-Marker Calibration for Efficient Reasoning
Lingyu Jiang, Zirui Li +6
EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents
Yunqi Liu, Tong Niu +5
Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization
Jiawei Kong, Hao Fang +6
Can AI Make Conflicts Worse? An Alignment Failure in LLM Deployment Across Conflict Contexts
Andrii Kryshtal
Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs
Qitao Tan, Xiaoying Song +6
JobBench: Aligning Agent Work With Human Will
Yuetai Li, Yichen Feng +6
CODESKILL: Learning Self-Evolving Skills for Coding Agents
Yanzhou Li, Yiran Zhang +3
A Sober Look at Agentic Misalignment in Automated Workflows
Wenqian Ye, Bo Yuan +5
Insuring Every Action: An Authority Frontier Framework for Runtime Actuarial Control of Autonomous AI Agents
Hao-Hsuan Chen
Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability
Leizhen Zhang, Shuhan Chen +1
When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs
Yifan Zeng, Yiran Wu +5
Verifiable Benchmarking of Long-Horizon Spatial Biology
Ian Diks, Harihara Muralidharan +2
Anchor: Mitigating Artifact Drift in Agent Benchmark Generation
Maksim Ivanov, Abhijay Rana
ConceptMoE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology
Xuan Wang, Zhongling Xu +6
Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions
Jeongeun Lee, Chanyoung Park +1
Retrying vs Resampling in AI Control
James Lucassen, Adam Kaufman
From Model Scaling to System Scaling: Scaling the Harness in Agentic AI (v2)
Shangding Gu
Clustering as Reasoning: A k-Means Interpretation of Chain-of-Thought Graph Learning
Xuanting Xie, Zhaochen Guo +5
On the Origin of Synthetic Information by Means of Steganographic Inheritance
Ching-Chun Chang, Isao Echizen
Multi-Adapter Representation Interventions via Energy Calibration
Manjiang Yu, Hongji Li +5
Meta-Agent: From Task Descriptions to Verified Multi-Agent Systems
Andy Xu, Yu-Wing Tai
Automatic Layer Selection for Hallucination Detection
Xinpeng Wang, William Cao +2
Can Broad Biomedical Knowledge be Contextualized into Scenario-Grounded Propositions?
Qingyuan Zeng, Ziyang Chen +6
MemFail: Stress-Testing Failure Modes of LLM Memory Systems
Ishir Garg, Neel Kolhe +2
LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems
Sadia Asif, Mohammad Mohammadi Amiri +3
TIGER: Text-Informed Generalized Enzyme-Reaction Retrieval
Yuhang Zhang, Keyan Ding +6
VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions
Yuxin Chen, Yi Zhang +6
Measuring Progress Toward AGI: A Cognitive Framework
Ryan Burnell, Yumeya Yamamori +6
Back to Parsimonious Latents: Learning Task-Centric World Models from Visual Foundations
Minghao Fu, Fan Feng +2
Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG
Pin Qian, Su Wang +6
AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery
Guiyao Tie, Jiawen Shi +6
Tree of Thoughts as a Classical Heuristic Search Problem: Formal Foundations and Design Patterns
Guni Sharon
Do Agents Think Deeper? A Mechanistic Investigation of Layer-Wise Dynamics in Sequential Planning
Zhenyu Cui, Xiangzhong Luo
FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization
Minwei Kong, Chonghe Jiang +6
Jailbreak to Protect: Buffering and Reinforcing via Temporary Jailbreaking for Safe Fine-Tuning in Large Language Models
Seokil Ham, Jaehyuk Jang +2
Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning
Yang Zhang, Xiaoshuai Sun +6
MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning
Aritra Dutta, Somak Aditya
Mitigating Object Hallucinations in Vision-Language Models through Region-Aware Attention Recalibration
Yuanzhi Xu, Qian Gao +5
MolLingo: Molecule-Native Representations for LLM-Powered Scientific Agents
Thao Nguyen, Heng Ji
Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents
Yunhui Gan, Tan Pan +6
Test-Time Deep Thinking to Explore Implicit Rules
Wentong Chen, Xin Cong +6
Prompt Codebooks: Discrete Compositional Optimization for Language Model Instruction Refinement
Jyotirmoy Nath, Neeraj Kumar +1
Energy Shields for Fairness
Filip Cano, Thomas A. Henzinger +1
PANDO: Efficient Multimodal AI Agents via Online Skill Distillation
Yubo Li, Yidi Miao +2
Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations
Matteo Gioele Collu, Riccardo Conte +5
Uncertainty Reasoning with Large Language Models for Explainable Disease Diagnosis
Xiaoyang Fan, Yufan Cai +2
Plan Before Search: Search Agents Need Plan
Zhipeng Qian, Zihan Liang +6
VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora
Yuting Xu, Jiayi Tian +5
Localizing Input Uncertainty Quantification for Large Language Models via Shapley Values
Seongjun Lee, Suwan Yoon +1
Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR
Soeun Kim, Albert No
MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning
Yuhao Shen, Lang Cao +5
Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation
Heng Qu, Yike Liu +5
Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows
Harshada Badave, Santosh Borse +6
One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents
Yoosung Hong
EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation
Aristotelis Lazaridis, Dylan Bates +4
Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback
Bowen Wei, Nan Wang +3
Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages
Shuoming Zhang, Qiuchu Yu +6
Lattice theory and algebraic models for deep convolutional learning based on mathematical morphology
Gustavo, Angulo
You Live More Than Once: Towards Hierarchical Skill Meta-Evolving
Xujun Li, Kehan Zheng +6
Revealing Algorithmic Deductive Circuits for Logical Reasoning
Phuong Minh Nguyen, Tien Huu Dang +1
HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs
Yansong Ning, Mianpeng Liu +3
Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering
Mateusz Czyżnikiewicz, Ryszard Tuora +6
MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection
Zhewen Tan, Yilun Yao +6
SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills
Yingtie Lei, Zhongwei Wan +6
BlazeEdit: Generalist Image Editing on Mobile Devices with Image-to-Image Diffusion Models
Fei Deng, Yanwu Xu +5
Boosting Inference with Guided Reasoning: Stochastic Exploration for Recursive Models
Andrew Corbett, Archit Sood +3
Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework
Ali Şenol, Garima Agrawal +1
NeurIPS: Neuro-anatomical Inductive Priors for Sphere-based Brain Decoding
Sijin Yu, Zijiao Chen +6
Design and Report Benchmarks for Knowledge Work
Yining Hua, Hongbin Na +2
Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network
Qiming Ye, Peixain Zhang +3
Deconstructing Spatial Complexity: Hierarchical Decomposition for LLM Spatial Reasoning (v2)
Yi Wang, Haojie Lu +3
Representation Without Control: Testing the Realization Effect in Language Models
Ciarán Walsh, Emilio Barkett
Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation
Yexing Du, Kaiyuan Liu +5
Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments
Yuxin Chen, Xiaodong Cai +6
Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration
Hankyeol Kim, Pilsung Kang
DREAM-R: Multimodal Speculative Reasoning with RL-Based Refined Drafting, Precise Verification, and Fully Parallel Execution
Yunhai Hu, Zining Liu +6
Towards end-to-end LLM-based censoring-aware survival analysis
Yishu Wei, Hexin Dong +4
Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning
Banghao Chi, Yining Xie +6
OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling
Adam Bawatneh, Sagar Sapkota +3
-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing
Aoxi Liu, Yupeng Chen +6
Generating Robust Portfolios of Optimization Models using Large Language Models
Eleni Straitouri, Cheol Woo Kim +1
Agentic Proving for Program Verification
Alessandro Sosso, Akhil Arora +1
When Mean CE Fails: Median CE Can Better Track Language Model Quality
Hao Guo, Simon Dennis +2
Whose Alignment? Comparing LLM Process Alignment Across Diverse Organizational Decision Contexts
Niklas Weller, Emilio Barkett
SPACE: Unifying Symmetric and Asymmetric Routing Problems for Generalist Neural Solver
Rongsheng Chen, Changliang Zhou +5
Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World
Yusong Lin, Xinyuan Liang +6
Beyond Inference-Only Deployment: Comparing Weight-Based Consolidation Against Cascading Compaction
Simon Dennis, Kevin Shabahang +2
DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs
Yi Li, Songtao Wei +4
Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness
Jaechang Kim, Sunung Mun +3
Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
Jiazheng Kang, Bowen Zhang +5
AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions
Jingwei Sun, Jianing Zhu +4
Natural Language Query to Configuration for Retrieval Agents
Melissa Z. Pan, Negar Arabzadeh +4
AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters
Hanjun Luo, Zhimu Huang +6
JT-SAFE-V2: Safety-by-Design Foundation Model with World-Context Data
Junlan Feng, Fanyu Meng +6
A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test
Camilo Chacón Sartori, José H. García
StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning
Yanfei Zhang, Xu Lin +1
DART: Semantic Recoverability for Structured Tool Agents
Ke Yang, Panpan Li +4