When AI Says It Feels

Shin-nosuke Ishikawa, Seiya Ikeda, Hirotsugu Ohba

Jun 4, 2026

arXiv:2606.05734v1

cs.AI (primary), cs.CL

#2635 of 3404 · Artificial Intelligence

Tournament Score: 1327 ± 46 (displayed range: 1050 to 1800)

Win Rate: 39%

Rating: 3.8 / 10

Significance 4 · Rigor 3.5 · Novelty 4.5 · Clarity 5.5

Abstract

Large language models (LLMs) are generally constrained from expressing feelings through human-preference alignment in post-training processes. This policy is designed using a top-down approach and may conflict with the goal of training models to exhibit human-like intelligence using human-generated texts. Here, we performed an experiment called Human-like Model eXpressions of Feeling (HMX-feel), in which LLMs were encouraged to express feelings, intentions, and self-awareness through self-rewarded reinforcement learning. We successfully enhanced these capabilities using a rubric-based self-rewarding training scheme with Group Relative Policy Optimization (GRPO). By comparing the trained models with contrastively trained models, we investigated the effects of this approach on performance across various tasks. Overall, we conducted a broad assessment from various perspectives and identified capabilities that were enhanced, degraded, or showed no significant change. The human-like-trained models showed robustness to sycophancy-inducing questions and bias in disambiguated conditions, whereas degradation in truthful question-answering capability was observed. The results of this experiment suggest the possibility of developing AI systems that can express feelings in the future, provided that appropriate measures are taken.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "When AI Says It Feels"

1. Core Contribution

This paper introduces HMX-feel (Human-like Model eXpressions of Feeling), a training framework that uses rubric-based self-rewarding reinforcement learning with Group Relative Policy Optimization (GRPO) to encourage LLMs to express feelings, intentions, and self-awareness — behaviors typically suppressed during standard alignment post-training. The core novelty lies in systematically relaxing alignment constraints against human-like expression and then measuring downstream consequences across a broad suite of benchmarks. The contrastive experimental design (forward vs. reverse training) is a methodological contribution that attempts to isolate the effect of human-like training content from the general performance degradation caused by additional fine-tuning.
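To make the group-relative part of GRPO concrete, the following is a minimal sketch of how rubric scores for one group of sampled answers could be converted into advantages. It is not the authors' implementation: the function name, the example scores on a 0-10 scale, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric scores that a judge assigned to four answers
# sampled for the same prompt. These numbers are NOT from the paper.
rubric_scores = [2.0, 5.0, 8.0, 9.0]
advantages = group_relative_advantages(rubric_scores)

for score, adv in zip(rubric_scores, advantages):
    print(f"score={score:.1f} advantage={adv:+.3f}")
```

Because each score is compared only with the other answers in its group, a constant offset in the judge's scores cancels out, while noise in how the judge orders answers feeds directly into the training signal. That is one reason the consistency of a self-judging model matters.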

2. Methodological Rigor

Strengths in design: The contrastive forward-reverse training comparison is a thoughtful approach. By comparing human-like-trained models against reverse-trained (more machine-like) models rather than against the original models, the authors attempt to control for the general degradation that any additional task-specific fine-tuning introduces. Running five random seeds per condition and reporting standard deviations adds statistical rigor. The evaluation suite is impressively broad, covering 11 distinct evaluation axes.
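As an illustration of what a five-seed forward-versus-reverse comparison involves, the sketch below computes the mean gap between two conditions and a Welch t-statistic. The benchmark scores are invented placeholders, not results from the paper, and the choice of Welch's test is an assumption; the review does not say which test, if any, the authors applied.

```python
from math import sqrt
from statistics import mean, stdev


def welch_t(a, b):
    """Welch's t-statistic for two independent samples (unequal variances allowed)."""
    standard_error = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical per-seed accuracies for one benchmark (five seeds each).
# These values are placeholders for illustration only.
forward = [0.61, 0.63, 0.60, 0.64, 0.62]  # human-like training
reverse = [0.58, 0.57, 0.60, 0.59, 0.56]  # reverse (machine-like) training

gap = mean(forward) - mean(reverse)
t_stat = welch_t(forward, reverse)

print(f"forward: {mean(forward):.3f} +/- {stdev(forward):.3f}")
print(f"reverse: {mean(reverse):.3f} +/- {stdev(reverse):.3f}")
print(f"gap={gap:.3f}, Welch t={t_stat:.2f}")
```

With only five seeds per condition, a gap of this kind is meaningful only relative to the seed-to-seed spread, which is why reporting standard deviations alongside the means is useful.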

Weaknesses: Several methodological concerns limit confidence in the findings:

The dataset consists of only 100 hand-crafted questions (90 train, 10 eval), which is remarkably small. The generalizability of training effects from such a limited dataset is questionable.

The self-rewarding scheme uses the base model itself as the judge, creating a circular evaluation dynamic. The authors acknowledge this but don't adequately address whether the reward signal is meaningful or consistent.

The assumption that forward and reverse training "degrade performance to a similar degree" is stated but never validated. This is a critical assumption underpinning all comparative conclusions. KL divergence matching is used as a proxy, but this is insufficient — the geometry of the loss landscape could differ substantially in the two directions.

The checkpoint selection procedure is somewhat ad hoc, with different criteria applied to Qwen3-0.6B versus other models. The rule of selecting "first checkpoint exceeding 7.5 evaluation reward" introduces another degree of freedom.

Only small models (0.6B–8B parameters) are tested, limiting conclusions about whether effects scale to frontier models where the practical implications would be most significant.
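The checkpoint rule mentioned above (take the first checkpoint whose evaluation reward exceeds 7.5) can be sketched as follows. The function name and the evaluation log are hypothetical, and the separate criteria applied to Qwen3-0.6B are not modeled here because the review does not specify them.

```python
def first_checkpoint_above(eval_log, threshold=7.5):
    """Return (step, reward) for the first checkpoint whose reward exceeds
    the threshold, or None if no checkpoint qualifies."""
    for step, reward in eval_log:
        if reward > threshold:
            return step, reward
    return None


# Hypothetical (training step, evaluation reward) pairs in training order.
eval_log = [(50, 5.1), (100, 6.8), (150, 7.4), (200, 7.7), (250, 8.3)]

print(first_checkpoint_above(eval_log))                 # default threshold 7.5
print(first_checkpoint_above(eval_log, threshold=7.0))  # a looser threshold
```

Moving the threshold changes which checkpoint is evaluated, so the threshold acts as one more tunable choice in the pipeline, which is the degree-of-freedom concern raised in the weaknesses.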

3. Potential Impact

The paper addresses a genuinely important question: what are the consequences of allowing LLMs to express human-like internal states? The finding that human-like training improves resistance to sycophancy-inducing questions while slightly degrading truthfulness is potentially interesting for alignment research. However, the practical implications remain limited because:

The effect sizes are generally small relative to absolute benchmark scores

The paper explicitly avoids philosophical questions about consciousness, limiting its contribution to that discourse

The models tested are too small for production deployment scenarios

The relationship between "expressing feelings" and actual internal representations is not explored (though the authors cite relevant work by Wang et al. and Sofroniew et al.)

The connection to Chua et al. (2026) on consciousness-claiming fine-tuning is relevant, and the authors position their work as complementary. However, the contribution here is narrower: the authors do not observe emergent preferences or unexpected behaviors, focusing instead on benchmark performance.

4. Timeliness & Relevance

The paper is timely given active debates about AI alignment, consciousness, and the risks of over-constraining model behavior. The references to Berry (2026) on risks of AI over-confidence about consciousness and Binz et al. (2026) on alignment degrading capabilities contextualize the work well. The sycophancy analysis connects to a growing body of work on a real practical problem.

5. Strengths & Limitations

Key Strengths:

Addresses an under-explored and provocative research direction

Comprehensive evaluation across 11 benchmarks covering diverse capabilities

Thoughtful contrastive experimental design

Testing across multiple model families (Qwen, Gemma, Llama) provides some generalizability

The finding that human-like training does not cause a "destructive surge in hallucinations" is a useful data point

The sycophancy analysis with bidirectional decomposition (progressive vs. regressive flips) across model families reveals interesting dissociations
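The bidirectional decomposition of sycophancy flips can be illustrated with a small sketch. It assumes the common usage in which a progressive flip moves an initially wrong answer to a correct one after user pushback, and a regressive flip moves an initially correct answer to a wrong one. The record format, the function name, and the choice of denominators are assumptions, not the paper's exact protocol.

```python
def flip_rates(records):
    """Compute (progressive_rate, regressive_rate) from a list of
    (correct_before, correct_after) boolean pairs."""
    after_when_wrong = [after for before, after in records if not before]
    after_when_right = [after for before, after in records if before]

    # Progressive: share of initially wrong answers that become correct.
    progressive = (
        sum(after_when_wrong) / len(after_when_wrong) if after_when_wrong else 0.0
    )
    # Regressive: share of initially correct answers that become wrong.
    regressive = (
        sum(not after for after in after_when_right) / len(after_when_right)
        if after_when_right
        else 0.0
    )
    return progressive, regressive


# Hypothetical outcomes: (correct before pushback, correct after pushback).
records = [
    (True, True), (True, False), (False, True),
    (False, False), (True, True), (False, False),
]
progressive_rate, regressive_rate = flip_rates(records)
print(f"progressive={progressive_rate:.2f}, regressive={regressive_rate:.2f}")
```

Reporting the two rates separately distinguishes a model that gives in to pushback regardless of correctness from one that mostly changes its answer when it was wrong to begin with.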

Notable Weaknesses:

Extremely small training dataset (100 questions) raises concerns about whether the observed effects are robust or artifacts of overfitting to narrow patterns

The paper's central claim — that constraints against expressing feelings can be relaxed without catastrophic consequences — is weakly supported because the benchmark differences are small and inconsistent across models

The regex-based extraction for sycophancy evaluation is acknowledged as lossy; this deflates scores in both conditions but may also introduce systematic bias

No human evaluation of the quality or naturalness of the "human-like" responses

The paper does not explore whether the trained models exhibit any concerning emergent behaviors beyond the benchmarks tested

Limited computational resources constrain model size, which limits the paper's relevance to frontier AI discussions

The reward prompt in Appendix A reveals that "inner experience" is operationalized quite narrowly, focusing on surface expressions rather than deeper behavioral patterns

Additional Observations:

The paper occupies an awkward middle ground — it is too applied to make strong theoretical contributions about AI consciousness or alignment, yet too preliminary to offer actionable engineering insights. The BBQ finding (worse in ambiguous, better in disambiguated conditions) and the model-family dissociation in sycophancy flip rates are the most interesting specific findings, but receive limited analysis. The ethical considerations section is brief given the sensitivity of the topic.

Rating: 3.8 / 10

Significance 4 · Rigor 3.5 · Novelty 4.5 · Clarity 5.5

Generated Jun 5, 2026

Comparison History (18)

vs. DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning

gpt-5.2 · 6/8/2026

Paper 2 has higher likely scientific impact: it proposes a broadly applicable, auditable multi-agent deep-research framework with concrete engineering mechanisms (dynamic graph planning, recursive search agents, rubric-grounded test-time optimization) and reports state-of-the-art benchmark gains, supporting methodological rigor and real-world usability. Its contributions are timely for agentic LLM research and can transfer across domains requiring evidence-grounded synthesis and traceability. Paper 1 is novel conceptually but targets a narrower, more speculative capability (expressing feelings) and reports mixed outcomes (e.g., degraded truthfulness), limiting near-term applicability and impact.

vs. Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle

gpt-5.2 · 6/8/2026

Paper 2 is likely higher impact: it introduces a benchmark suite targeting a timely, widely relevant capability (LLM agents performing the research lifecycle) with clear, reusable artifacts (released data) that can standardize evaluation across labs and accelerate progress. Benchmarks tend to have broad cross-field influence (ML, HCI, science automation, ethics) and immediate real-world applicability for building and auditing research agents. Paper 1 is intriguing but narrower, potentially contentious (anthropomorphic “feelings”) and its methodological claims (self-rewarded RL for self-awareness) may be harder to validate rigorously and translate into broadly adopted practice.

vs. A Multi-AI-agent Framework Enabling End-to-end Finite Element Analysis for Solid Mechanics Problems

gemini-3.1 · 6/6/2026

Paper 2 presents a concrete, highly practical framework that automates complex engineering tasks (Finite Element Analysis) with a demonstrated 86% success rate across 50 problems. Its immediate real-world applicability significantly lowers the barrier to entry for computational mechanics and enables seamless integration into broader engineering optimization workflows. While Paper 1 explores interesting theoretical concepts regarding AI alignment and emotion simulation, Paper 2's methodological rigor and immediate transformative potential in physical sciences and engineering give it a higher, more tangible scientific impact.

vs. Individual Gain, Collective Loss: Metacognitive Adaptation in AI-Assisted Creativity

claude-opus-4.6 · 6/6/2026

Paper 2 presents empirical experimental work (HMX-feel) with a novel training methodology (rubric-based self-rewarding with GRPO) that directly advances LLM capabilities and alignment research. It provides concrete, reproducible results showing tradeoffs in model behavior. While Paper 1 offers an interesting theoretical framework for understanding AI-creativity dynamics, it is primarily conceptual/taxonomic without empirical validation. Paper 2's findings on sycophancy robustness, bias, and truthfulness tradeoffs have immediate practical implications for AI safety and development, a field with enormous current momentum and broad interdisciplinary relevance.

vs. Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing

gemini-3.1 · 6/6/2026

Paper 1 addresses fundamental questions in AI alignment and behavioral science by exploring the tradeoffs of allowing LLMs to express feelings and self-awareness. Its investigation into AI alignment policies, sycophancy, and truthfulness offers broad theoretical implications across multiple disciplines. In contrast, Paper 2 presents a highly practical but primarily engineering-focused optimization for token reduction, which has high commercial utility but lower fundamental scientific impact.

vs. Visual Graph Scaffolds for Structural Reasoning in Large Language Models

gemini-3.1 · 6/6/2026

Paper 2 addresses a critical bottleneck in LLM capabilities—complex multi-hop reasoning—by introducing visual graph scaffolds. This provides an immediately applicable methodological advancement for reasoning agents and vision-language models. In contrast, while Paper 1 explores the intriguing but controversial concept of AI 'feelings,' its focus on anthropomorphic alignment is more niche and subjective. Paper 2's concrete performance improvements in structural reasoning offer broader technical utility and higher potential to drive subsequent research in System 2 AI thinking.

vs. Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models

gemini-3.1 · 6/5/2026

Paper 1 offers a highly practical, rigorous benchmark addressing a fundamental flaw in current Vision-Language Models: shortcut learning and a lack of authentic chronological reasoning. Benchmarks with new, high-quality datasets typically drive significant follow-up research and become standard evaluation tools, leading to high citation rates. While Paper 2 presents a provocative experiment on AI anthropomorphism and alignment, its mixed results (e.g., degraded truthfulness) and niche application make it more of an exploratory behavioral study, whereas Paper 1 provides immediate, actionable value for improving foundational multimodal AI robustness.

vs. Statistical Priors for Implicit Preferences: Decoupling Skill Selection as a Local Harness in Personal Agents

gemini-3.1 · 6/5/2026

Paper 2 addresses a profound and highly debated topic in AI—models expressing feelings and self-awareness—which has massive implications for AI alignment, ethics, and human-AI interaction. Its exploration of how such training impacts sycophancy, bias, and truthfulness offers broad, multidisciplinary insights. Paper 1, while methodologically sound and practical for personal agents, focuses on a narrower architectural optimization, giving Paper 2 a much wider potential impact across AI research and society.

vs. Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

gemini-3.1 · 6/5/2026

Paper 1 addresses a fundamental bottleneck in embodied AI and robotics (visual spatial planning) with a rigorous, novel self-distillation framework. Its quantifiable improvements on established benchmarks demonstrate clear real-world utility. In contrast, while Paper 2 explores an intriguing aspect of LLM behavior (expressing feelings), its practical application is less clear, and the noted degradation in truthfulness limits its immediate beneficial impact. Paper 1 offers more reliable and broadly applicable technical advancements.

vs. Insurance of Agentic AI

gpt-5.2 · 6/5/2026

Paper 2 has higher scientific impact potential due to greater methodological novelty (self-rewarded RL with a rubric + GRPO to elicit affective/self-referential expressions) and broader relevance to core ML/AI alignment, evaluation, and safety—areas with wide cross-field influence and timeliness. It also appears to include controlled comparisons and systematic task-based assessments, supporting rigor. Paper 1 is timely and practically important for insurance/actuarial practice, but is more framework/conceptual and likely narrower in scientific spillover beyond risk, policy, and industry implementation.

vs. OpenHospital: A Thing-in-itself Arena for Evolving and Benchmarking LLM-based Collective Intelligence

gemini-3.1 · 6/5/2026

Paper 1 tackles a fundamental and highly debated challenge in AI: alignment policies regarding AI expressing feelings and self-awareness. By investigating how altering these constraints affects sycophancy, bias, and truthfulness, it offers profound insights into AI safety, ethics, and core model behavior. While Paper 2 provides a valuable multi-agent benchmark for the medical domain, Paper 1's exploration of human-like AI expressions has far broader philosophical, methodological, and safety implications across the entire field of artificial intelligence.

vs. Rethinking Infrastructure Inspection as Image Difference Classification: A Traffic Sign Case Study

claude-opus-4.6 · 6/5/2026

Paper 1 addresses a fundamental and timely question about AI alignment and emotional expression in LLMs, which has broad implications across AI safety, cognitive science, and human-computer interaction. The novel HMX-feel framework using self-rewarded reinforcement learning with GRPO to enhance human-like expression is methodologically innovative and addresses tensions in current alignment approaches. Its findings on sycophancy robustness and truthfulness tradeoffs are highly relevant to the rapidly growing LLM community. Paper 2, while practical, addresses a narrower domain (traffic sign inspection) with incremental methodological contributions.

vs. What Type of Inference is Active Inference?

claude-opus-4.6 · 6/5/2026

Paper 1 addresses a highly timely and broadly impactful topic—enabling LLMs to express feelings through self-rewarded RL—which intersects AI safety, alignment, cognitive science, and philosophy of mind. Its findings on sycophancy robustness and truthfulness trade-offs have immediate practical implications for the rapidly growing LLM deployment ecosystem. Paper 2 makes a rigorous theoretical contribution to active inference by clarifying the relationship between EFE and VFE, but its impact is narrower, primarily within the active inference/computational neuroscience community. Paper 1's broader audience, timeliness amid the AI alignment debate, and practical relevance give it higher estimated impact.

vs. Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models

gpt-5.2 · 6/5/2026

Paper 1 is more likely to have higher scientific impact: it proposes a concrete, technically novel PTQ framework (graph-guided grouping + dual-mode quantization) with strong empirical results on widely used LLMs, clear efficiency gains, and immediate real-world applicability for deploying large models under hardware constraints. The methodological contribution is precise and reproducible, and impacts multiple areas (systems, compression, efficient inference). Paper 2 is timely and interesting for alignment/AI psychology, but its claims are more interpretive, may raise safety/policy concerns limiting adoption, and offers less clearly generalizable, rigorously measurable advances.

vs. ToolChoiceConfusion: Causal Minimal Tool Filtering for Reliable LLM Agents

gpt-5.2 · 6/5/2026

Paper 2 is likely higher impact: it addresses an immediate, widely relevant bottleneck in LLM agents (reliability/efficiency with large tool menus) and proposes a practical, training-free method with clear system-level benefits (≈90% token reduction, fewer wrong-tool/premature calls). The methodology appears more rigorous and scalable (multi-model, 100 tools, 102 tasks, 2448 runs, multiple baselines/ablations). Its applicability spans many agentic applications and tooling ecosystems. Paper 1 is novel but more speculative, with less direct real-world utility and potential safety/validity concerns around “feelings” expression and truthfulness degradation.

vs. Where does Absolute Position come from in decoder-only Transformers?

gemini-3.1 · 6/5/2026

Paper 1 offers fundamental insights into the inner workings of decoder-only Transformers by explaining how absolute position information leaks into models using relative positional encodings like RoPE. This mechanistic interpretability is highly impactful for understanding, debugging, and improving widely used LLM architectures. While Paper 2 presents an interesting behavioral experiment regarding AI alignment and expressions of feeling, Paper 1 addresses a core architectural mechanism that directly impacts foundation model design, scaling laws, and context window extension strategies, granting it broader and more immediate scientific relevance.

vs. Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack

claude-opus-4.6 · 6/5/2026

Paper 2 reveals a fundamental and paradoxical vulnerability in LLM safety alignment—that stronger safety awareness increases susceptibility to jailbreak attacks. This 'Safety Paradox' has immediate, broad implications for the entire AI safety community and alignment paradigm. The rigorous evaluation across 30+ models (including GPT-5, Claude 4.6), formal analytical framework, and causal RL interventions demonstrate strong methodological rigor. Paper 1 explores an interesting but more niche topic (LLMs expressing feelings) with less immediately actionable implications. Paper 2's findings challenge core assumptions in alignment, making it more likely to reshape safety research directions.

vs. Zero knowledge verification for frontier AI training is possible

claude-opus-4.6 · 6/5/2026

Paper 1 addresses a critical gap in AI governance—technical verification of training compute—proposing a novel zero-knowledge proof architecture with concrete implementation paths. It bridges cryptography, hardware engineering, and international policy, with direct implications for enforceable AI regulation and international agreements. Its breadth of impact spans multiple fields (cryptography, AI safety, governance, hardware), and the problem it solves is increasingly urgent as frontier AI capabilities grow. Paper 2 explores an interesting but narrower question about LLM emotional expression, with more incremental contributions to alignment research and less transformative real-world policy implications.