PerceptUI: LLM Agents as Human-Aligned Synthetic Users for UI/UX Evaluation

Nicolas Bougie, Xiaotong Ye, Gian Maria Marconi, Narimasa Watanabe

Jun 4, 2026

arXiv:2606.05697v1

cs.AI (primary)

#2151 of 3355 · Artificial Intelligence

Tournament Score: 1370 ± 43 (scale 1050 to 1800)

Win Rate: 43%

Rating: 6.8 / 10

Significance 7.5 · Rigor 6 · Novelty 7 · Clarity 7.5

Abstract

User interface (UI) and user experience (UX) evaluation is central to product development, yet reliable feedback still relies on recruiting human participants or running online A/B tests, making early-stage iteration slow and costly. In light of this, recent work has explored Multimodal Large Language Models as proxy evaluators. However, existing approaches either produce surface-level critiques or a judgment that reflects the model's own biases rather than the genuine response of a particular user. We introduce PerceptUI, a framework for persona-conditioned UI/UX evaluation that predicts how a specific user would answer interface-related questions and produces natural-language rationales. PerceptUI is trained in two stages: (i) contrastive reflection fine-tuning, which distills teacher-generated rationales by extracting lessons from human decisions, and (ii) a reflective prompt-evolution step driven by the model's own failure traces. Across multiple domains and datasets, PerceptUI achieves human-level realism, generalizes to unseen questions and personas, and yields population-level response distributions.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: PerceptUI

1. Core Contribution

PerceptUI introduces a framework for persona-conditioned UI/UX evaluation using multimodal LLMs, addressing the gap between generic model-centric UI judgments and user-specific responses. The key novelty lies in two mechanisms: (i) contrastive reflection fine-tuning (CRFT), which distills teacher-generated rationales that explain why a human's chosen answer fits better than alternatives given the UI and persona context, and (ii) reflective prompt evolution (RPE), an inference-time prompt optimization procedure that iteratively refines prompts based on failure analysis without updating model parameters.

The core problem addressed is that existing MLLM-based UI evaluators either produce surface-level critiques or reflect model biases rather than simulating how a *specific user* would respond. PerceptUI frames UI/UX evaluation as persona-conditioned question answering, predicting both an answer and a contrastive rationale. This is a meaningful reframing: the question shifts from "what does the model think is better" to "what would this particular user say."
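As a rough illustration of what an inference-time prompt-evolution loop of this kind can look like, the sketch below uses a greedy accept-if-better search. It is not the authors' procedure: `predict`, `propose_revision`, the toy development set, and the acceptance rule are all hypothetical stand-ins. The only property carried over from the description above is that the prompt changes while the model stays fixed.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, human answer)

def accuracy(prompt: str, data: List[Example],
             predict: Callable[[str, str], str]) -> float:
    """Fraction of examples where the fixed model matches the human answer."""
    return sum(predict(prompt, q) == a for q, a in data) / len(data)

def evolve_prompt(prompt: str, dev: List[Example],
                  predict: Callable[[str, str], str],
                  propose_revision: Callable[[str, List[Example]], str],
                  rounds: int = 5) -> Tuple[str, float]:
    """Greedy loop: revise the prompt from failures, keep revisions that help."""
    best, best_acc = prompt, accuracy(prompt, dev, predict)
    for _ in range(rounds):
        failures = [(q, a) for q, a in dev if predict(best, q) != a]
        if not failures:
            break
        candidate = propose_revision(best, failures)
        cand_acc = accuracy(candidate, dev, predict)
        if cand_acc > best_acc:  # only the prompt changes, never the model
            best, best_acc = candidate, cand_acc
    return best, best_acc

# Hypothetical toy "model": answers "yes" only when the question's topic
# (the text before the colon) is mentioned in the prompt.
def toy_predict(prompt: str, question: str) -> str:
    topic = question.split(":")[0]
    return "yes" if topic in prompt else "no"

# Hypothetical toy "failure analysis": name the topic of the first failure.
def toy_propose(prompt: str, failures: List[Example]) -> str:
    topic = failures[0][0].split(":")[0]
    return f"{prompt} Consider {topic}."

dev_set = [
    ("contrast: is the text readable?", "yes"),
    ("layout: is the menu easy to find?", "yes"),
    ("contrast: are buttons visible?", "yes"),
]

final_prompt, final_acc = evolve_prompt(
    "Answer as the persona.", dev_set, toy_predict, toy_propose
)
```

In a real setting the two toy functions would be replaced by calls to the evaluated model and to whatever component analyses its failures; the review notes that the paper relies on frontier LLMs for that analysis step.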

2. Methodological Rigor

Strengths in methodology:

The two-stage pipeline is well-motivated. CRFT addresses the fundamental limitation of imitation learning (learning *what* but not *why*), while RPE handles survey-specific prompt adaptation without manual tuning.

The contrastive rationale structure (UI evidence, persona relevance, option contrast) provides interpretable supervision that goes beyond standard chain-of-thought distillation.

Data separation between CRFT training, RPE development, and test sets is carefully maintained, with an LLM-based audit to prevent prompt leakage — a thoughtful safeguard.

Evaluation spans six public/semi-public datasets plus one proprietary dataset, covering design selection, quality assessment, rating prediction, critique generation, and automotive UX.
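The contrastive rationale structure named above (UI evidence, persona relevance, option contrast) can be pictured as a small record type. The sketch below is illustrative only: the class names, field names, persona attributes, example values, and the choice to serialize the rationale before the answer are assumptions, not the paper's schema.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ContrastiveRationale:
    ui_evidence: str        # what in the interface supports the answer
    persona_relevance: str  # why that matters to this particular user
    option_contrast: str    # why the chosen option beats the alternatives

@dataclass
class PersonaExample:
    persona: Dict[str, str]
    question: str
    options: List[str]
    human_answer: str
    rationale: ContrastiveRationale

    def to_target_text(self) -> str:
        """Serialize a supervision target (assumed order: rationale, then answer)."""
        r = self.rationale
        return (
            f"UI evidence: {r.ui_evidence}\n"
            f"Persona relevance: {r.persona_relevance}\n"
            f"Option contrast: {r.option_contrast}\n"
            f"Answer: {self.human_answer}"
        )

# Hypothetical example values, invented for illustration.
example = PersonaExample(
    persona={"age": "67", "tech_comfort": "low"},
    question="Which navigation screen would you prefer?",
    options=["A", "B"],
    human_answer="B",
    rationale=ContrastiveRationale(
        ui_evidence="Screen B uses larger labels and fewer menu levels.",
        persona_relevance="A user with low tech comfort benefits from fewer steps.",
        option_contrast="Screen A hides the same function behind two submenus.",
    ),
)

target = example.to_target_text()
```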

Concerns:

The reliance on a proprietary dataset (UXCar) for several key experiments (human evaluation of rationales, ablation study, population calibration, persona generalization) limits reproducibility. While the authors acknowledge this, it weakens the paper's verifiability on its most distinctive contribution — persona-conditioned prediction.

The ablation in Table 12 shows that removing contrastive reflection (w/o CR) causes a dramatic drop (62.15→31.37 accuracy), which is *below* the majority-class baseline (39.62). This is puzzling and suggests the base model without CRFT may be severely miscalibrated, raising questions about whether the architecture is fragile without this specific training signal.

Population-level calibration (Figure 7) is evaluated only on UXCar. Broader validation on public persona-bearing datasets (LabintheWild) would strengthen claims.

The human evaluation (Table 3) involves only 120 instances with three annotators — relatively small scale for assessing rationale quality across diverse UI types.
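To make the majority-class comparison concrete: a majority-class baseline always predicts the most frequent answer, so its accuracy equals that answer's share of the data. The sketch below computes that share on a made-up label list and then restates the gaps between the three accuracies quoted above. The toy labels are hypothetical; only the three percentages come from the review.

```python
from collections import Counter
from typing import Sequence

def majority_class_accuracy(labels: Sequence[str]) -> float:
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical toy labels: "A" is the majority answer with 4 of 10 responses.
toy_labels = ["A"] * 4 + ["B"] * 3 + ["C"] * 3
toy_baseline = majority_class_accuracy(toy_labels)

# Accuracies (in percent) as quoted in the review for the UXCar ablation.
full_model = 62.15
without_cr = 31.37
majority = 39.62

drop_from_full = round(full_model - without_cr, 2)   # points lost without CR
gap_to_majority = round(majority - without_cr, 2)    # points below the baseline
```

A trained classifier that scores below this baseline is doing worse than a constant guess, which is why the review treats the "w/o CR" result as needing explanation.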

3. Potential Impact

Practical applications are substantial. UI/UX evaluation is a bottleneck in product development, and a reliable synthetic user framework could accelerate design iteration, reduce participant recruitment costs, and enable early-stage subgroup analysis. The cost analysis (Table 7) demonstrates practical viability — one-time training costs of $200-500 versus linear per-persona costs for frontier models.
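The trade-off between a one-time training cost and a per-persona cost can be expressed as a simple break-even calculation. The sketch below assumes a hypothetical frontier-model price of $0.50 per simulated persona and treats the fine-tuned model's marginal cost as zero; neither figure comes from the paper, and only the $200 to $500 training range is taken from the text above.

```python
import math

def break_even_personas(one_time_cost: float,
                        frontier_cost_per_persona: float,
                        local_cost_per_persona: float = 0.0) -> int:
    """Smallest persona count at which the fine-tuned model is strictly cheaper."""
    saving = frontier_cost_per_persona - local_cost_per_persona
    if saving <= 0:
        raise ValueError("frontier model is not more expensive per persona")
    # one_time + n * local < n * frontier  <=>  n > one_time / saving
    return math.floor(one_time_cost / saving) + 1

# Hypothetical per-persona price for a frontier model (an assumption).
ASSUMED_FRONTIER_PRICE = 0.50

low = break_even_personas(200, ASSUMED_FRONTIER_PRICE)
high = break_even_personas(500, ASSUMED_FRONTIER_PRICE)
```

Under this assumed price, the one-time cost is recovered after roughly 400 to 1,000 simulated personas; a different per-persona price shifts the threshold proportionally.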

Research impact could extend to:

Human-computer interaction: enabling scalable persona-conditioned usability testing

Recommender systems and personalization: the contrastive reflection paradigm could transfer to other subjective preference prediction tasks

LLM alignment: the CRFT mechanism offers a general approach to teaching models *why* humans make particular choices

The automotive UX application (UXCar) demonstrates domain transferability beyond web/mobile, which is valuable for industry adoption.

4. Timeliness & Relevance

This work is highly timely. The explosion of AI-generated interfaces (via tools like v0, Bolt, etc.) creates urgent demand for automated evaluation. The paper explicitly addresses the emerging need for evaluating generated web interfaces (WebDevJudge experiments). The persona-conditioning aspect also aligns with growing emphasis on inclusive design and accessibility testing.

The positioning relative to concurrent work (SimAB, AgentA/B, UXAgent, Avenir-UX) is well-articulated. PerceptUI differentiates itself by focusing on *why* predictions match specific users, rather than just simulating interaction traces or predicting aggregate outcomes.

5. Strengths & Limitations

Key Strengths:

Comprehensive evaluation across diverse UI/UX tasks with consistent improvements

The contrastive reflection mechanism is a genuinely useful contribution that could generalize beyond UI evaluation

Strong ablation study demonstrating the contribution of each component

Thoughtful treatment of ethical considerations and explicit framing as a complement to (not replacement for) human evaluation

Practical cost analysis supporting real-world adoption

Notable Weaknesses:

The "w/o CR" ablation producing below-majority-class results is concerning and insufficiently explained

Heavy dependence on proprietary data for the most novel claims (persona-conditioning)

Baseline comparisons are inconsistent across experiments (acknowledged but still limiting)

The student model (Qwen3-VL-8B) is relatively small; scaling behavior is unexplored

RPE relies on frontier LLMs for evaluation/analysis, creating a dependency that somewhat undermines the cost-efficiency argument

No statistical significance testing is reported despite seed variation being mentioned

The paper uses GPT-5 and GPT-5.5 as baselines and teachers — models whose capabilities are not yet well-characterized in the community, making it harder to contextualize results

Additional observations:

The paper is well-written with extensive appendices including full prompt templates, enhancing reproducibility for the non-proprietary components

The generalization experiments (Table 11) are valuable, showing the model learns transferable patterns rather than memorizing specific users/questions

The visual evidence localization experiment (Figure 9) honestly reports limitations relative to patch-based methods

Overall Assessment

PerceptUI presents a well-engineered framework with a genuinely novel training paradigm (contrastive reflection) that addresses a real and growing need. The breadth of evaluation is impressive, though depth on the most novel claims (persona-conditioning) is limited by proprietary data constraints. The contrastive reflection mechanism has potential for broader impact beyond UI/UX evaluation. The main risks to impact are reproducibility concerns and the somewhat fragile behavior of the base model without CRFT.

Rating: 6.8 / 10

Significance 7.5 · Rigor 6 · Novelty 7 · Clarity 7.5

Generated Jun 5, 2026

Comparison History (21)

vs. Agentic Molecular Recovery via Molecule-Aware Exploration

gpt-5.2 · 6/6/2026

Paper 2 likely has higher impact due to broader applicability and timeliness: persona-conditioned synthetic users could meaningfully change UI/UX evaluation workflows across software, HCI, product design, and A/B testing, with clear real-world adoption potential. Its two-stage training (contrastive reflection fine-tuning plus failure-trace prompt evolution) suggests methodological rigor and a general framework for human-aligned agent evaluation. Paper 1 is novel and valuable for cheminformatics/LLM molecular generation, but its impact is narrower to SMILES recovery pipelines and dependent on domain-specific tooling and datasets.

vs. Evaluating Agentic Configuration Repair for Computer Networks

gemini-3.1 · 6/6/2026

Paper 1 presents a novel two-stage training methodology and framework for generating human-aligned synthetic users, offering broad applications across HCI and software development. Paper 2, while addressing a critical infrastructure problem, is primarily a benchmarking study of existing agentic architectures. The methodological innovation and wider potential applicability of Paper 1 give it a higher potential scientific impact.

vs. Answer Presence Drives RAG Rewriting Gains

claude-opus-4.6 · 6/6/2026

Paper 2 introduces a novel framework (PerceptUI) with clear practical applications in UI/UX evaluation, addressing a real industry need for scalable user testing. Its methodology combining contrastive reflection fine-tuning and reflective prompt evolution is innovative, and it demonstrates generalization across domains. Paper 1, while methodologically rigorous as an audit/diagnostic study, is narrower in scope—it primarily debunks a mechanism in RAG pipelines without proposing new methods. Paper 2 has broader cross-field impact (HCI, ML, product development) and stronger real-world applicability.

vs. Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline

claude-opus-4.6 · 6/6/2026

Paper 1 addresses a fundamental challenge in LLM agent memory systems—cross-scenario generalization—which is broadly relevant to the rapidly growing field of LLM agents. Its systematic evaluation across 8 memory systems and 5 diverse scenarios, along with the actionable insight that agent-controlled memory outperforms passive pipelines, provides a strong foundation for future memory system design. Paper 2, while valuable for UI/UX evaluation, targets a narrower application domain. Paper 1's breadth of impact across the entire LLM agent ecosystem and its timely contribution to a fast-moving research area give it higher potential scientific impact.

vs. MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution

gemini-3.1 · 6/6/2026

Paper 1 addresses a fundamental challenge in reinforcement learning (credit assignment in multi-agent environments) and demonstrates exceptional empirical results, outperforming massive proprietary models like GPT-5 with an 8B model. Its methodological breakthrough has broader implications for foundational AI training. Paper 2 presents a valuable but narrower application in HCI and UI/UX evaluation.

vs. AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety

gemini-3.1 · 6/6/2026

Paper 2 presents a novel methodological framework with massive cross-industry applications. While Paper 1 provides a timely dataset for AI safety, Paper 2's ability to reliably simulate human UI/UX evaluations has the potential to fundamentally transform software development and HCI workflows, offering broader economic and scientific impact.

vs. Neetyabhas: A Framework for Uncertainty-Aware Public Policy Optimization in Rational Agent-Based Models

gpt-5.2 · 6/6/2026

Paper 1 is more novel and timely, leveraging LLM agents for persona-conditioned, human-aligned UI/UX evaluation with a clear technical contribution (two-stage training with contrastive reflection and prompt evolution) and broad applicability across HCI, ML, product design, and A/B testing workflows. Its potential for real-world deployment is high, enabling faster iteration and scalable user research. Paper 2 addresses an important domain, but agent-based pandemic policy RL frameworks are less distinctive, often sensitive to modeling assumptions, and may have narrower, harder-to-validate real-world impact compared to a deployable LLM-based evaluation framework.

vs. Severity-Aware Curriculum Learning with Multi-Model Response Selection for Medical Text Generation

gemini-3.1 · 6/6/2026

Paper 2 introduces a highly novel paradigm bridging AI and Human-Computer Interaction by using LLMs as persona-conditioned synthetic users. This approach has massive breadth of impact, potentially revolutionizing the entire UI/UX and software development industry by replacing costly human trials. While Paper 1 addresses an important telehealth issue, its methodology (curriculum learning and ensembling) is more incremental and restricted to a specific medical text generation vertical.

vs. Boosting Brain-to-Image Decoding with TRIBE v2 Data Augmentation

gpt-5.2 · 6/6/2026

Paper 1 likely has higher scientific impact due to stronger novelty and cross-field significance: leveraging a large pretrained fMRI encoding model (TRIBE v2) for synthetic-data augmentation and potential zero-shot brain-to-image decoding directly addresses a key bottleneck (scarce labeled neural data) with measurable gains (up to 68%). If robust, it can broadly affect neuroscience, neuroimaging, machine learning, and brain–computer interfaces. Paper 2 targets an important applied problem in HCI, but LLM-based synthetic user evaluation is a crowded space and may face harder-to-validate alignment/generalization, limiting foundational impact.

vs. Synapse: Federated Tool Routing via Typed Compendium Artifacts

claude-opus-4.6 · 6/6/2026

Paper 1 (Synapse) introduces a fundamentally new abstraction—typed federated artifacts—that addresses a core limitation in federated learning: the inability to handle heterogeneous architectures without sharing weights or data. Its contributions span formal privacy guarantees, cross-architectural transfer across four LLM families, and principled merge operations, opening new research directions in federated systems. Paper 2 (PerceptUI) is a solid application of LLMs for UI/UX evaluation but operates in a narrower domain with more incremental contributions. Synapse's broader theoretical and practical implications across federated learning, privacy, and multi-model ecosystems give it higher potential impact.

vs. SciDER: Scientific Data-centric End-to-end Researcher

gemini-3.1 · 6/6/2026

Paper 1 has higher potential scientific impact because it addresses the automation of the entire scientific research lifecycle, which can accelerate discovery across numerous scientific disciplines. While Paper 2 presents a valuable tool for UI/UX design, its scope is limited to software development and product evaluation. Paper 1's open-source release of datasets and models further democratizes scientific research, yielding a broader and more profound impact on the scientific community.

vs. No Need to Train Your RDB Foundation Model

gpt-5.2 · 6/6/2026

Paper 1 likely has higher scientific impact due to its novelty in formalizing ICL-specific constraints for multi-table relational database encoding without any training, plus theoretical justification and scalable SQL primitives enabling broad deployment. The approach targets a pervasive enterprise/data-science setting (RDBs) with strong real-world applicability and potential to influence foundation-model integration with structured data across analytics, ML systems, and databases. Paper 2 is timely and useful for UI/UX evaluation, but is more application-specific and depends on fine-tuning/prompt-evolution techniques that may be less broadly generalizable than Paper 1’s principled, training-free RDB framework.

vs. Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

gemini-3.1 · 6/5/2026

Paper 2 introduces a novel paradigm of using LLMs as persona-conditioned synthetic users, bridging AI and HCI. This has immense real-world application potential by significantly reducing the cost and time of human-centric UI/UX evaluation. While Paper 1 offers a strong algorithmic improvement for web agents, Paper 2's approach represents a broader conceptual shift with wider interdisciplinary impact across software development and design.

vs. Mutation Without Variation: Convergence Dynamics in LLM-Driven Program Evolution

claude-opus-4.6 · 6/5/2026

Paper 2 reveals a fundamental limitation of LLM-driven program evolution—systematic convergence toward structural attractors—which has broad implications for the rapidly growing fields of LLM-based code generation, automated program synthesis, and evolutionary computation. This finding is highly novel, methodologically rigorous (controlled experiments across models, prompts, with GP baselines), and challenges core assumptions underlying many LLM-augmented search/optimization systems. Paper 1, while practically useful for UI/UX evaluation, addresses a narrower application domain with incremental improvements over existing LLM-based evaluation approaches.

vs. Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models

gpt-5.2 · 6/5/2026

Paper 1 has higher likely impact due to stronger real-world applicability (scalable UI/UX evaluation and persona-conditioned feedback), broader cross-field relevance (HCI, product design, LLM alignment/agents, evaluation), and methodological contribution beyond benchmarking (two-stage training with contrastive reflection fine-tuning and prompt evolution). Paper 2 is timely and rigorous as a diagnostic benchmark for chronological reasoning and shortcut biases in VLMs, but its impact is narrower (mainly evaluation) and may be subsumed as models/datasets evolve, whereas Paper 1 offers a deployable framework with immediate industry pull.

vs. When Should Memory Stay Silent: Measuring Memory-Use Boundaries in Memory-Augmented Conversational Agents

gemini-3.1 · 6/5/2026

Paper 2 addresses a critical and timely issue in AI safety and privacy: determining appropriate boundaries for memory use in personalized agents. While Paper 1 offers a highly practical tool for UI/UX design, Paper 2 tackles a fundamental challenge in conversational AI deployment, possessing broader implications for user trust, data privacy, and the foundational architecture of memory-augmented LLMs across multiple domains.

vs. Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?

gpt-5.2 · 6/5/2026

Paper 1 is more novel and timely, directly addressing an emerging, high-stakes security and AI-safety failure mode (human oversight of agent sabotage) with a rare long-horizon, large-scale human-subjects evaluation across multiple frontier models. Its findings have immediate real-world implications for software supply-chain security, developer tooling, and governance, and can influence both AI safety research and industry practices. While Paper 2 is useful for UI/UX iteration and has broad product relevance, it is closer to incremental methodology in synthetic-user evaluation and carries higher risk of domain-limited impact and benchmark overfitting.

vs. Where does Absolute Position come from in decoder-only Transformers?

claude-opus-4.6 · 6/5/2026

Paper 2 addresses a fundamental question about how transformer architectures work internally—specifically how absolute position information emerges in RoPE-based models despite only relative offsets being explicitly encoded. This mechanistic insight into attention sinks, causal masks, and residual streams has broad implications for transformer architecture design, positional encoding research, and interpretability. Paper 1, while practically useful for UI/UX evaluation, is more application-specific and incremental. Paper 2's findings are likely to influence a wider range of downstream research in NLP, architecture design, and mechanistic interpretability.

vs. Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact: it contributes a reusable, theory-informed human collaboration dataset with action-level mental model annotations—an enabling resource for broad research on human–AI/agent collaboration, evaluation, and training. Its methodology is grounded in a classic social-science task and provides benchmarks across multiple LLMs, supporting rigor and comparability. The dataset can generalize across domains (dialogue, HCI, cognitive science, agent alignment), making its cross-field reach high and timely as agents become collaborators. Paper 1 is promising for UI/UX, but is narrower and more solution-specific.

vs. Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

claude-opus-4.6 · 6/5/2026

PerceptUI addresses a high-impact practical problem (UI/UX evaluation) with a novel persona-conditioned framework combining contrastive reflection fine-tuning and reflective prompt evolution. It has clear real-world applications in product development, potentially transforming how companies conduct user research. Paper 2, while methodologically rigorous in benchmarking LLM reasoning, contributes primarily another evaluation benchmark in an already crowded space. PerceptUI's interdisciplinary impact spanning HCI, AI, and product design, combined with its immediate practical utility for reducing costly user studies, gives it broader and more transformative potential.