Artificial Intelligence Paper Rankings
AI-estimated scientific impact ranking of the latest arXiv Artificial Intelligence preprints.
Towards a General Intelligence and Interface for Wearable Health Data
Girish Narayanswamy, Maxwell A. Xu +6
Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment
Zhiqin Yang, Yonggang Zhang +4
KISS - Knowledge Infrastructure for Scientific Simulation: A Scaffolding for Agentic Earth Science
Ziwei Li, Liujun Zhu +6
Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search
Sarah Martinson, Michael P. Brenner +4
The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure
Qiqi Liu, Thorsten Holz +2
Advancing Mathematics Research with AI-Driven Formal Proof Search
George Tsoukalas, Anton Kovsharov +6
Hallucination as Exploit: Evidence-Carrying Multimodal Agents
Guijia Zhang, Hao Zheng +1
Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery
Xinzhe Yuan, Zhuo Chen +5
TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction
Tej Sanibh Ranade
Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents
Ahmad Al-Tawaha, Shangding Gu +3
What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code
Yuze Zhao, Junpeng Fang +6
From Prompts to Protocols: An AI Agent for Laboratory Automation
Angelos Angelopoulos, James F. Cahoon +1
SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules
Yuxuan Chen, Changwei Lv +6
Reasoning Can Be Restored by Correcting a Few Decision Tokens
Changshuo Shen, Leheng Sheng +3
Efficient Agentic Reasoning Through Self-Regulated Simulative Planning
Mingkai Deng, Jinyu Hou +5
Fully Open Meditron: An Auditable Pipeline for Clinical LLMs
Xavier Theimer-Lienhard, Mushtaha El-Amin +6
Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints
Jiayu Li, Enpei Zhang +3
State Contamination in Memory-Augmented LLM Agents
Yian Wang, Agam Goyal +2
Imperfect World Models are Exploitable
Logan Mondal Bhamidipaty, Esmeralda S. Whitammer +3
Generative Recursive Reasoning (v2)
Junyeob Baek, Mingyu Jo +4
PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play (v2)
Roger Creus Castanyer, Geoffrey Bradway +4
ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation
Zhikang Chen, Yue Wang +5
Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
Ali Hatamizadeh, Yejin Choi +1
NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning
Haoran Lu, Luyang Fang +2
Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction
Jiahe Guo, Xiangran Guo +6
Echo: Learning from Experience Data via User-Driven Refinement
Hande Dong, Xiaoyun Liang +6
Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models
Junyao Yang, Chen Qian +5
How Far Are We From True Auto-Research?
Zhengxin Zhang, Ning Wang +2
Beyond Mode Collapse: Distribution Matching for Diverse Reasoning
Xiaozhe Li, Yang Li +6
Forecasting Scientific Progress with Artificial Intelligence
Sean Wu, Pan Lu +6
From Static Risk to Dynamic Trajectories: Toward World-Model-Inspired Clinical Prediction
Pujun Feng, Xiaoyu Guo +6
DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation
Sixiong Xie, Zhuofan Shi +6
Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems
Parand A. Alamdari, Toryn Q. Klassen +1
Not all uncertainty is alike: volatility, stochasticity, and exploration
Payam Piray
SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning
Yufei Ma, Zihan Liang +6
PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization
Wonjoong Kim, Yeonjun In +3
What Does the AI Doctor Value? Auditing Pluralism in the Clinical Ethics of Language Models
Payal Chandak, Victoria Alkin +6
Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
Nick Merrill, Jaeho Lee +1
PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models
Ziliang Zhao, Zenan Xu +6
AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows
Shuaike Shen, Wenduo Cheng +3
Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains
Rebecca Ramnauth, Drazen Brscic +1
Latent-space Attacks for Refusal Evasion in Language Models
Giorgio Piras, Raffaele Mura +5
SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?
Kevin Han, Renfei Zhang +4
MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems
Qianshu Cai, Yonggang Zhang +5
Look Before You Leap: Autonomous Exploration for LLM Agents
Ziang Ye, Wentao Shi +6
OpenComputer: Verifiable Software Worlds for Computer-Use Agents
Jinbiao Wei, Qianran Ma +5
PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
Zhuohan Gu, Qizheng Zhang +2
TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens
Jianpeng Cheng, Xian Wu +6
Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management
Carol Xuan Long, David Simchi-Levi +4
ADR: An Agentic Detection System for Enterprise Agentic AI Security
Chenning Li, Pan Hu +6
PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control
Jingxuan Wei, Xi Bai +6
AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment
Zhenlin Wei, Pu Jian +6
GeoX: Mastering Geospatial Reasoning Through Self-Play and Verifiable Rewards
Kyeongjin Ahn, Seungeon Lee +2
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration
Jiaqi Liu, Shi Qiu +6
Open-World Evaluations for Measuring Frontier AI Capabilities
Sayash Kapoor, Peter Kirgis +6
Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination
Jinrui Jiang, Zhangtai Wu +2
Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents
Tianshi Xu, Huifeng Wen +1
GIM: Evaluating models via tasks that integrate multiple cognitive domains
Rohit Patel, Alexandre Rezende +1
Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving
Oussama Zenkri, Oliver Brock
Memory-Guided Tree Search with Cross-Branch Knowledge Transfer for LLM Solver Synthesis
Fatemeh Haji, Javier Delarosa Quiros +1
Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards
Xuehui Yu, Fucheng Cai +3
Latent Action Reparameterization for Efficient Agent Inference
Wenhao Huang, Qingwen Zeng +6
Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability (v2)
Joël Roman Ky, Salah Ghamizi +1
PRISMat: Policy-Driven, Permutation-Invariant Autoregressive Material Generation
Claire Schlesinger, Circe Hsu +2
Property-Guided LLM Program Synthesis for Planning
André G. Pereira, Augusto B. Corrêa +1
Harnessing LLM Agents with Skill Programs (v2)
Hongjun Liu, Yifei Ming +2
Generative AI and the Productivity Divide: Human-AI Complementarities in Education
Lihi Idan, Bharat Anand
Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries
Xing Zhang, Yanwei Cui +5
Interactive Evaluation Requires a Design Science
Keyang Xuan, Peiyang Song +6
Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents
Xing Zhang, Yanwei Cui +5
Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models
SeungWon Seo, DongHeun Han +2
ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling
Woomin Song, Beomjun Kim +5
Learning to Learn from Multimodal Experience
Xingyu Sui, Weixiang Zhao +5
TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents
Zhiqiang Liu, Wenhui Dong +4
Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost
Simon Dennis, Rivaan Patil +2
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
Utkarsh Tyagi, Xingang Guo +6
Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents
Nikola Milosevic
MindLoom: Composing Thought Modes for Frontier-Level Reasoning Data Synthesis
Haiyang Shen, Taian Guo +6
Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents
Akshay Manglik, Apaar Shanker +6
Mind the Sim-to-Real Gap & Think Like a Scientist (v2)
Harsh Parikh, Gabriel Levin-Konigsberg +2
Learning Quantifiable Visual Explanations Without Ground-Truth
Amritpal Singh, Andrey Barsky +3
CyberCorrect: A Cybernetic Framework for Closed-Loop Self-Correction in Large Language Models
Yuning Wu, Yingmin Liu +1
Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches
Tinghan Ye, Arnaud Deza +3
How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study
Shuqi Zhu, Yi Zhong +5
Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning
Zihan Liang, Yufei Ma +5
Position: The Turing-Completeness of Real-World Autoregressive Transformers Relies Heavily on Context Management
Guanyu Cui, Zhewei Wei +1
NGM: A Plug-and-Play Training-Free Memory Module for LLMs
Yuwen Qu, Wenhui Dong +2
EXG: Self-Evolving Agents with Experience Graphs
Yuxin Jin, Siyuan Zhang +4
ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding
Mingyang Rao, Kehua Feng +5
POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents
Qiaoyuan Zheng, Yiqu Yang +2
Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support
Chen Zhan, Xihe Qiu +6
Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning
Dillon Z. Chen, Till Hofmann +2
Can AI Make Conflicts Worse? An Alignment Failure in LLM Deployment Across Conflict Contexts
Andrii Kryshtal
AMEL: Accumulated Message Effects on LLM Judgments
Sid-ali Temkit
CLORE: Content-Level Optimization for Reasoning Efficiency
Yuyang Wu, Qiyao Xue +5
Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation
Nicanor Mayumu, Xiaoheng Deng +1
What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
Xiaozhe Li, Tianyi Lyu +6
A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation
Qingchuan Ma, Yuexiao Ma +4
CatalyticMLLM: A Graph-Text Multimodal Large Language Model for Catalytic Materials
Yanjie Li
Beyond Rational Illusion: Behaviorally Realistic Strategic Classification
Xinpeng Lv, Yunxin Mao +6
AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment
Kuei-Chun Kao, Daixuan Huo +2
Reasoning Before Diagnosis: Physician-Inspired Structured Thinking for ECG Classification
Yang Wu, Xiaoyan Yuan +2
Actionable World Representation
Kunqi Xu, Jitao Li +5
LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning
Shradha Agarwal, Deepak Rajbhar +1
Probabilistic Tiny Recursive Model
Amin Sghaier, Ali Parviz +1
Explainable Wastewater Digital Twins: Adaptive Context-Conditioned Structured Simulators with Self-Falsifying Decision Support
Gary Simethy, Daniel Ortiz Arroyo +1
SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects
Puyi Wang, Yuhao Wang +5
Responsible Agentic AI Requires Explicit Provenance
Jinwei Hu, Xinmiao Huang +4
Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs
Junyu Pan, Yansen Wang +4
Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts
Zhiyuan Jerry Lin, Benjamin Letham +3
Implicit Safety Alignment from Crowd Preferences
Qian Lin, Daniel S. Brown
PALS: Power-Aware LLM Serving for Mixture-of-Experts Models
Can Hankendi, Rana Shahout +2
Latent Heuristic Search: Continuous Optimization for Automated Algorithm Design
Cheikh Ahmed, Mahdi Mostajabdaveh +1
Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?
Caixin Kang, Tianyu Yan +6
Reconciling Contradictory Views on the Effectiveness of SFT in LLMs: An Interaction Perspective
Junpeng Zhang, Lei Cheng +4
What Counts as AI Sycophancy? A Taxonomy and Expert Survey of a Fragmented Construct
Meryl Ye, Lujain Ibrahim +6
DocOS: Towards Proactive Document-Guided Actions in GUI Agents
Jingjing Liu, Ziye Huang +6
Skill Weaving: Efficient LLM Improvement via Modular Skillpacks
Zhuo Li, Guodong Du +6
Conflict-Resilient Multi-Agent Reasoning via Signed Graph Modeling
Longgang He, Longzhu He +2
LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems
Sadia Asif, Mohammad Mohammadi Amiri +3
ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning
Yeqiu Chen, Ziyan Liu +4
Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning
Fabio Rovai
Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs
Pierre Boudart, Pierre Gaillard +1
FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation
Xinhang Yuan, Zexi Huang +6
Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework
Mengyu Sun, Ziyuan Yang +4
IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents
Daewon Choi, Kyunghyun Park +5
Pseudocode-Guided Structured Reasoning for Automating Reliable Inference in Vision-Language Models
Weicong Ni, Tianbao Jiang +1
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
Yuxuan Gao, Megan Wang +3
Evidence-Grounded Frontier Mapping and Agentic Hypothesis Generation in Nanomedicine
Christiaan G. A. Viviers, Koen de Bruin +6
\ECUAS{n}: A family of metrics for principled evaluation of uncertainty-augmented systems
Lautaro Estienne, Erik Ernst +3
Dynamics of collective creativity in AI art competitions
Mason Youngblood, Jeff Nusz +1
ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents
Chinmay Savadikar, Mingyu Zhao +6
Prediction of Challenging Behaviors Associated with Profound Autism in a Classroom Setting Using Wearable Sensors
Yadhu Kartha, Conor Anderson +6
Cross-domain benchmarks reveal when coordinated AI agents improve scientific inference from partial evidence
Fiona Y. Wong, Markus J. Buehler
AnchorDiff: Topology-Aware Masked Diffusion with Confidence-based Rewriting for Radiology Report Generation
Shiying Yu, Jielei Wang +1
Neurosymbolic Learning for Inference-Time Argumentation
Gabriel Freedman, Adam Dejl +5
Harnessing AI for Inverse Partial Differential Equation Problems: Past, Present, and Prospects
Zhentao Tan, Yuze Hao +4
GraphMind: From Operational Traces to Self-Evolving Workflow Automation
Yiwen Zhu, Joyce Cahoon +6
AI for Auto-Research: Roadmap & User Guide
Lingdong Kong, Xian Sun +6
Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP
Igor Bogdanov, Chung-Horng Lung +4
Scalable Environments Drive Generalizable Agents
Jiayi Zhang, Fanqi Kong +6
SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
Nithin Somasekharan, Youssef Hassan +6
QQJ: Quantifying Qualitative Judgment for Scalable and Human-Aligned Evaluation of Generative AI
Marjan Veysi, Pirooz Shamsinejadbabaki +2
Skim: Speculative Execution for Fast and Efficient Web Agents
Mike Wong, Kevin Hsieh +2
Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression
Wei Luo, Yi Huang +4
DARE-EEG: A Foundation Model for Mining Dual-Aligned Representation of EEG
Yang Shao, Peiliang Gong +2
OCCAM: Open-set Causal Concept explAnation and Ontology induction for black-box vision Models
Chiara Maria Russo, Simone Carnemolla +4
Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents
Saksham Sahai Srivastava
SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents
Yifan Zhou, Zhentao Zhang +6
Self-supervised Hierarchical Visual Reasoning with World Model
Yuanfei Xu, Lin Liu +3
Trace2Skill: Verifier-Guided Skill Evolution for Long-Context EDA Agents
Zijian Du, Nathaniel Pinckney
Unlocking Proactivity in Task-Oriented Dialogue
Hongbin Zhang, Ning Gao +6
Memory-Augmented Reinforcement Learning Agent for CAD Generation
Yin Xiaolong, Liu Yu +5
ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs
Bingjun Luo, Tony Wang +2
The Impact of AI Usage and Informativeness on Skill Development in Logical Reasoning
Shang Wu, Hongyu Yao +4
MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization
Md Mehrab Tanjim, Jayakumar Subramanian +6
Divergence-Suppressing Couplings for Rectified Flow
Yimeng Min, Carla P. Gomes
Interaction Locality in Hierarchical Recursive Reasoning
Yosuke Miyanishi, Tetsuro Morimura
Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning
Banghao Chi, Yining Xie +6
ScreenSearch: Uncertainty-Aware OS Exploration
Michael Solodko, Justin Wagle
TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning
ZhiYuan Feng, Yu Deng +6
Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models
Zheng Lin, Zhenxing Niu +3
Interference-Aware Multi-Task Unlearning
Ying-Hua Huang, Rui Fang +2
AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters
Hanjun Luo, Zhimu Huang +6
Progressive Autonomy as Preference Learning: A Formalization of Trust Calibration for Agentic Tool Use
Changkun Ou
PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning
Qiran Zhang, Yuheng Wang +6
AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions
Minghao Chen, Xinyi Hu +2
Prior Knowledge or Search? A Study of LLM Agents in Hardware-Aware Code Optimization
Dmitry Redko, Albert Fazlyev +4
TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
Zhaoyang Chu, Jiarui Hu +6
Efficient Lookahead Encoding and Abstracted Width for Learning General Policies in Classical Planning
Michael Aichmüller, Simon Ståhlberg +2
CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean
Wentao Long, Yunfei Zhang +4
MetaCogAgent: A Metacognitive Multi-Agent LLM Framework with Self-Aware Task Delegation
Chenyu Wang, Yang Shu
Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency
Anis Radianis
The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems
Yohei Nakajima
A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents
Vasundra Srinivasan
STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery
Jiarui Su, Songjun Tu +2
RAG-based EEG-to-Text Translation Using Deep Learning and LLMs
Enrico Collautti, Xiaopeng Mao +3
Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps
Tanmay Asthana, Aman Saksena +1
SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents
Han Li, Vibhor Malik +6
Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment
S. Bensalem, Y. Dong +6
SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval
Ningyuan Li, Haiyang Shen +5
Generative Auto-Bidding with Unified Modeling and Exploration
Mingming Zhang, Feiqing Zhuang +6
Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models
Ethan Tang
Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks
Yajing Zhou, Xiangyu Kong
Planning in the LLM Era: Building for Reliability and Efficiency
Michael Katz, Harsha Kokel +2
AOP-Wiki EMOD 3.0: Data Model Expansions and Content Evaluation Framework for Using Agentic AI to Improve Integration between AOPs and New Approach Methodologies (NAMs)
Virginia K. Hench, J. Harry Caufield +3
SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation
Zaiyi Zheng, Guanghui Min +5
Toward AI VIS Co-Scientists: A General and End-to-End Agent Harness for Solving Complex Data Visualization Tasks
Haichao Miao, Zhimin Li +5
Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness in Tax Law
Parisa Kordjamshidi, Samer Aslan +3
FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast
Igor Bogdanov, Chung-Horng Lung +4
Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses
Yan Wang, Ziyi Guo +1
LACO: Adaptive Latent Communication for Collaborative Driving
Tianhao Chen, Yuheng Wu +1
CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings
Qixuan Hu, Shuchang Ye +6
EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design
Gioele Molinari, Florian Felten +2
ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving
Qiyu Ruan, Yuxuan Wang +3
Scaling Observation-aware Planning in Uncertain Domains
Adrian Zvizdenco, Arthur Conrado Veiga Bosquetti +2
Evaluating the Utility of Personal Health Records in Personalized Health AI
Rory Sayres, Kejia Chen +6
Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation
Guining Cao, Jiaxin Peng +4
AgentAtlas: Beyond Outcome Leaderboards for LLM Agents
Parsa Mazaheri, Kasra Mazaheri
Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most
Tahreem Yasir, Wenbo Li +4