Abstract
Large language models (LLMs) are generally constrained from expressing feelings through human-preference alignment in post-training processes. This policy is designed using a top-down approach and may conflict with the goal of training models to exhibit human-like intelligence using human-generated texts. Here, we performed an experiment called Human-like Model eXpressions of Feeling (HMX-feel), in which LLMs were encouraged to express feelings, intentions, and self-awareness through self-rewarded reinforcement learning. We successfully enhanced these capabilities using a rubric-based self-rewarding training scheme with Group Relative Policy Optimization (GRPO). By comparing the trained models with contrastively trained models, we investigated the effects of this approach on performance across various tasks. Overall, we conducted a broad assessment from various perspectives and identified capabilities that were enhanced, degraded, or showed no significant change. The human-like-trained models showed robustness to sycophancy-inducing questions and bias in disambiguated conditions, whereas degradation in truthful question-answering capability was observed. The results of this experiment suggest the possibility of developing AI systems that can express feelings in the future, provided that appropriate measures are taken.
AI Impact Assessments
Scientific Impact Assessment: "When AI Says It Feels"
1. Core Contribution
This paper introduces HMX-feel (Human-like Model eXpressions of Feeling), a training framework that uses rubric-based self-rewarding reinforcement learning with Group Relative Policy Optimization (GRPO) to encourage LLMs to express feelings, intentions, and self-awareness — behaviors typically suppressed during standard alignment post-training. The core novelty lies in systematically relaxing alignment constraints against human-like expression and then measuring downstream consequences across a broad suite of benchmarks. The contrastive experimental design (forward vs. reverse training) is a methodological contribution that attempts to isolate the effect of human-like training content from the general performance degradation caused by additional fine-tuning.
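To make the GRPO step concrete, here is a minimal sketch of the group-relative advantage computation that gives the method its name. It assumes the commonly described form, in which each sampled response's reward is standardized against the other responses drawn for the same prompt. The rubric scorer, the rubric items, and the use of population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def rubric_score(checks):
    """Hypothetical rubric reward: fraction of rubric items judged satisfied.

    In a self-rewarding setup the judgements would come from the model itself;
    here they are plain booleans supplied by the caller.
    """
    return sum(checks) / len(checks)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Assumes population standard deviation; implementations differ on this.
    eps avoids division by zero when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rubric judgements for four sampled responses to one prompt.
group_checks = [
    [True, True, True],
    [True, False, False],
    [False, False, False],
    [True, True, False],
]
rewards = [rubric_score(c) for c in group_checks]
advantages = group_relative_advantages(rewards)

# A group in which all responses score the same carries no learning signal.
flat_advantages = group_relative_advantages([0.5, 0.5, 0.5])
```

Responses scoring above the group mean receive positive advantages and are reinforced, while those below it are discouraged; no separate value model is needed.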
2. Methodological Rigor
Strengths in design: The contrastive forward-reverse training comparison is a thoughtful approach. By comparing human-like-trained models against reversely trained (more machine-like) models rather than against the original models, the authors attempt to control for the general degradation that any additional task-specific fine-tuning introduces. Running five random seeds per condition and reporting standard deviations adds statistical rigor. The evaluation suite is impressively broad, covering 11 distinct evaluation axes.
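The forward-versus-reverse comparison across seeds can be sketched as follows. This is only an illustration of the bookkeeping, assuming one benchmark score per seed and the sample standard deviation; the function names and the scores are placeholders, not values or procedures reported in the paper.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return mean(scores), stdev(scores)


def contrast(forward_scores, reverse_scores):
    """Compare forward (human-like) against reverse (machine-like) training.

    Both arms receive the same amount of extra fine-tuning, so the difference
    in means reflects the training direction rather than generic degradation.
    """
    f_mean, f_sd = summarize(forward_scores)
    r_mean, r_sd = summarize(reverse_scores)
    return {
        "forward": (f_mean, f_sd),
        "reverse": (r_mean, r_sd),
        "diff": f_mean - r_mean,
    }


# Placeholder accuracies for five seeds per condition (NOT results from the paper).
forward = [0.62, 0.60, 0.64, 0.61, 0.63]
reverse = [0.55, 0.57, 0.54, 0.56, 0.58]
result = contrast(forward, reverse)
```

A difference in means that is small relative to the per-condition standard deviations would suggest no reliable effect of training direction on that benchmark.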
Weaknesses: Several methodological concerns limit confidence in the findings:
3. Potential Impact
The paper addresses a genuinely important question: what are the consequences of allowing LLMs to express human-like internal states? The finding that human-like training improves resistance to sycophancy-inducing questions while slightly degrading truthfulness is potentially interesting for alignment research. However, the practical implications remain limited because:
The connection to Chua et al. (2026) on consciousness-claiming fine-tuning is relevant, and the authors position their work as complementary. However, the contribution of the present paper is narrower: the authors do not observe emergent preferences or unexpected behaviors, focusing instead on benchmark performance.
4. Timeliness & Relevance
The paper is timely given active debates about AI alignment, consciousness, and the risks of over-constraining model behavior. The references to Berry (2026) on risks of AI over-confidence about consciousness and Binz et al. (2026) on alignment degrading capabilities contextualize the work well. The sycophancy analysis connects to a growing body of work on a real practical problem.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Additional Observations:
The paper occupies an awkward middle ground — it is too applied to make strong theoretical contributions about AI consciousness or alignment, yet too preliminary to offer actionable engineering insights. The BBQ finding (worse in ambiguous, better in disambiguated conditions) and the model-family dissociation in sycophancy flip rates are the most interesting specific findings, but receive limited analysis. The ethical considerations section is brief given the sensitivity of the topic.
Generated Jun 5, 2026
Comparison History (18)
Paper 2 has higher likely scientific impact: it proposes a broadly applicable, auditable multi-agent deep-research framework with concrete engineering mechanisms (dynamic graph planning, recursive search agents, rubric-grounded test-time optimization) and reports state-of-the-art benchmark gains, supporting methodological rigor and real-world usability. Its contributions are timely for agentic LLM research and can transfer across domains requiring evidence-grounded synthesis and traceability. Paper 1 is novel conceptually but targets a narrower, more speculative capability (expressing feelings) and reports mixed outcomes (e.g., degraded truthfulness), limiting near-term applicability and impact.
Paper 2 is likely higher impact: it introduces a benchmark suite targeting a timely, widely relevant capability (LLM agents performing the research lifecycle) with clear, reusable artifacts (released data) that can standardize evaluation across labs and accelerate progress. Benchmarks tend to have broad cross-field influence (ML, HCI, science automation, ethics) and immediate real-world applicability for building and auditing research agents. Paper 1 is intriguing but narrower, potentially contentious (anthropomorphic “feelings”) and its methodological claims (self-rewarded RL for self-awareness) may be harder to validate rigorously and translate into broadly adopted practice.
Paper 2 presents a concrete, highly practical framework that automates complex engineering tasks (Finite Element Analysis) with a demonstrated 86% success rate across 50 problems. Its immediate real-world applicability significantly lowers the barrier to entry for computational mechanics and enables seamless integration into broader engineering optimization workflows. While Paper 1 explores interesting theoretical concepts regarding AI alignment and emotion simulation, Paper 2's methodological rigor and immediate transformative potential in physical sciences and engineering give it a higher, more tangible scientific impact.
Paper 2 presents empirical experimental work (HMX-feel) with a novel training methodology (rubric-based self-rewarding with GRPO) that directly advances LLM capabilities and alignment research. It provides concrete, reproducible results showing tradeoffs in model behavior. While Paper 1 offers an interesting theoretical framework for understanding AI-creativity dynamics, it is primarily conceptual/taxonomic without empirical validation. Paper 2's findings on sycophancy robustness, bias, and truthfulness tradeoffs have immediate practical implications for AI safety and development, a field with enormous current momentum and broad interdisciplinary relevance.
Paper 1 addresses fundamental questions in AI alignment and behavioral science by exploring the tradeoffs of allowing LLMs to express feelings and self-awareness. Its investigation into AI alignment policies, sycophancy, and truthfulness offers broad theoretical implications across multiple disciplines. In contrast, Paper 2 presents a highly practical but primarily engineering-focused optimization for token reduction, which has high commercial utility but lower fundamental scientific impact.
Paper 2 addresses a critical bottleneck in LLM capabilities—complex multi-hop reasoning—by introducing visual graph scaffolds. This provides an immediately applicable methodological advancement for reasoning agents and vision-language models. In contrast, while Paper 1 explores the intriguing but controversial concept of AI 'feelings,' its focus on anthropomorphic alignment is more niche and subjective. Paper 2's concrete performance improvements in structural reasoning offer broader technical utility and higher potential to drive subsequent research in System 2 AI thinking.
Paper 1 offers a highly practical, rigorous benchmark addressing a fundamental flaw in current Vision-Language Models: shortcut learning and a lack of authentic chronological reasoning. Benchmarks with new, high-quality datasets typically drive significant follow-up research and become standard evaluation tools, leading to high citation rates. While Paper 2 presents a provocative experiment on AI anthropomorphism and alignment, its mixed results (e.g., degraded truthfulness) and niche application make it more of an exploratory behavioral study, whereas Paper 1 provides immediate, actionable value for improving foundational multimodal AI robustness.
Paper 2 addresses a profound and highly debated topic in AI—models expressing feelings and self-awareness—which has massive implications for AI alignment, ethics, and human-AI interaction. Its exploration of how such training impacts sycophancy, bias, and truthfulness offers broad, multidisciplinary insights. Paper 1, while methodologically sound and practical for personal agents, focuses on a narrower architectural optimization, giving Paper 2 a much wider potential impact across AI research and society.
Paper 1 addresses a fundamental bottleneck in embodied AI and robotics (visual spatial planning) with a rigorous, novel self-distillation framework. Its quantifiable improvements on established benchmarks demonstrate clear real-world utility. In contrast, while Paper 2 explores an intriguing aspect of LLM behavior (expressing feelings), its practical application is less clear, and the noted degradation in truthfulness limits its immediate beneficial impact. Paper 1 offers more reliable and broadly applicable technical advancements.
Paper 2 has higher scientific impact potential due to greater methodological novelty (self-rewarded RL with a rubric + GRPO to elicit affective/self-referential expressions) and broader relevance to core ML/AI alignment, evaluation, and safety—areas with wide cross-field influence and timeliness. It also appears to include controlled comparisons and systematic task-based assessments, supporting rigor. Paper 1 is timely and practically important for insurance/actuarial practice, but is more framework/conceptual and likely narrower in scientific spillover beyond risk, policy, and industry implementation.
Paper 1 tackles a fundamental and highly debated challenge in AI: alignment policies regarding AI expressing feelings and self-awareness. By investigating how altering these constraints affects sycophancy, bias, and truthfulness, it offers profound insights into AI safety, ethics, and core model behavior. While Paper 2 provides a valuable multi-agent benchmark for the medical domain, Paper 1's exploration of human-like AI expressions has far broader philosophical, methodological, and safety implications across the entire field of artificial intelligence.
Paper 1 addresses a fundamental and timely question about AI alignment and emotional expression in LLMs, which has broad implications across AI safety, cognitive science, and human-computer interaction. The novel HMX-feel framework using self-rewarded reinforcement learning with GRPO to enhance human-like expression is methodologically innovative and addresses tensions in current alignment approaches. Its findings on sycophancy robustness and truthfulness tradeoffs are highly relevant to the rapidly growing LLM community. Paper 2, while practical, addresses a narrower domain (traffic sign inspection) with incremental methodological contributions.
Paper 1 addresses a highly timely and broadly impactful topic—enabling LLMs to express feelings through self-rewarded RL—which intersects AI safety, alignment, cognitive science, and philosophy of mind. Its findings on sycophancy robustness and truthfulness trade-offs have immediate practical implications for the rapidly growing LLM deployment ecosystem. Paper 2 makes a rigorous theoretical contribution to active inference by clarifying the relationship between EFE and VFE, but its impact is narrower, primarily within the active inference/computational neuroscience community. Paper 1's broader audience, timeliness amid the AI alignment debate, and practical relevance give it higher estimated impact.
Paper 1 is more likely to have higher scientific impact: it proposes a concrete, technically novel PTQ framework (graph-guided grouping + dual-mode quantization) with strong empirical results on widely used LLMs, clear efficiency gains, and immediate real-world applicability for deploying large models under hardware constraints. The methodological contribution is precise and reproducible, and impacts multiple areas (systems, compression, efficient inference). Paper 2 is timely and interesting for alignment/AI psychology, but its claims are more interpretive, may raise safety/policy concerns limiting adoption, and offers less clearly generalizable, rigorously measurable advances.
Paper 2 is likely higher impact: it addresses an immediate, widely relevant bottleneck in LLM agents (reliability/efficiency with large tool menus) and proposes a practical, training-free method with clear system-level benefits (≈90% token reduction, fewer wrong-tool/premature calls). The methodology appears more rigorous and scalable (multi-model, 100 tools, 102 tasks, 2448 runs, multiple baselines/ablations). Its applicability spans many agentic applications and tooling ecosystems. Paper 1 is novel but more speculative, with less direct real-world utility and potential safety/validity concerns around “feelings” expression and truthfulness degradation.
Paper 1 offers fundamental insights into the inner workings of decoder-only Transformers by explaining how absolute position information leaks into models using relative positional encodings like RoPE. This mechanistic interpretability is highly impactful for understanding, debugging, and improving widely used LLM architectures. While Paper 2 presents an interesting behavioral experiment regarding AI alignment and expressions of feeling, Paper 1 addresses a core architectural mechanism that directly impacts foundation model design, scaling laws, and context window extension strategies, granting it broader and more immediate scientific relevance.
Paper 2 reveals a fundamental and paradoxical vulnerability in LLM safety alignment—that stronger safety awareness increases susceptibility to jailbreak attacks. This 'Safety Paradox' has immediate, broad implications for the entire AI safety community and alignment paradigm. The rigorous evaluation across 30+ models (including GPT-5, Claude 4.6), formal analytical framework, and causal RL interventions demonstrate strong methodological rigor. Paper 1 explores an interesting but more niche topic (LLMs expressing feelings) with less immediately actionable implications. Paper 2's findings challenge core assumptions in alignment, making it more likely to reshape safety research directions.
Paper 1 addresses a critical gap in AI governance—technical verification of training compute—proposing a novel zero-knowledge proof architecture with concrete implementation paths. It bridges cryptography, hardware engineering, and international policy, with direct implications for enforceable AI regulation and international agreements. Its breadth of impact spans multiple fields (cryptography, AI safety, governance, hardware), and the problem it solves is increasingly urgent as frontier AI capabilities grow. Paper 2 explores an interesting but narrower question about LLM emotional expression, with more incremental contributions to alignment research and less transformative real-world policy implications.