PerceptUI: LLM Agents as Human-Aligned Synthetic Users for UI/UX Evaluation
Nicolas Bougie, Xiaotong Ye, Gian Maria Marconi, Narimasa Watanabe
Abstract
User interface (UI) and user experience (UX) evaluation is central to product development, yet reliable feedback still relies on recruiting human participants or running online A/B tests, making early-stage iteration slow and costly. In light of this, recent work has explored Multimodal Large Language Models as proxy evaluators. However, existing approaches either produce surface-level critiques or return judgments that reflect the model's own biases rather than the genuine response of a particular user. We introduce PerceptUI, a framework for persona-conditioned UI/UX evaluation that predicts how a specific user would answer interface-related questions and produces natural-language rationales. PerceptUI is trained in two stages: (i) contrastive reflection fine-tuning distills teacher-generated rationales by extracting lessons from human decisions, and (ii) a reflective prompt-evolution step refines prompts using the model's own failure traces. Across multiple domains and datasets, PerceptUI achieves human-level realism, generalizes to unseen questions and personas, and yields population-level response distributions.
AI Impact Assessments
Scientific Impact Assessment: PerceptUI
1. Core Contribution
PerceptUI introduces a framework for persona-conditioned UI/UX evaluation using multimodal LLMs, addressing the gap between generic model-centric UI judgments and user-specific responses. The key novelty lies in two mechanisms: (i) contrastive reflection fine-tuning (CRFT), which distills teacher-generated rationales that explain why a human's chosen answer fits better than alternatives given the UI and persona context, and (ii) reflective prompt evolution (RPE), an inference-time prompt optimization procedure that iteratively refines prompts based on failure analysis without updating model parameters.
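To make the reflective prompt evolution idea concrete, here is a minimal sketch of a failure-driven prompt refinement loop. It is not the authors' implementation: the function name `evolve_prompt`, the `predict` and `revise` callables, the exact-match failure criterion, and the greedy accept-if-fewer-failures rule are all assumptions chosen for illustration. The only property taken from the description above is that prompts are refined from failure analysis while model parameters stay fixed.

```python
from typing import Callable, List, Tuple

# (question with persona context, human answer) -- hypothetical data shape
Example = Tuple[str, str]
# (question, human answer, model answer) for a wrong prediction
Failure = Tuple[str, str, str]


def evolve_prompt(
    prompt: str,
    examples: List[Example],
    predict: Callable[[str, str], str],
    revise: Callable[[str, List[Failure]], str],
    rounds: int = 3,
) -> str:
    """Greedy prompt refinement; only the prompt changes, never the model."""

    def collect_failures(candidate: str) -> List[Failure]:
        failures: List[Failure] = []
        for question, gold in examples:
            predicted = predict(candidate, question)
            if predicted != gold:
                failures.append((question, gold, predicted))
        return failures

    best = prompt
    best_failures = collect_failures(best)
    for _ in range(rounds):
        if not best_failures:
            break  # nothing left to learn from
        # Ask the reviser (e.g. an LLM call) for a prompt that addresses the failures.
        candidate = revise(best, best_failures)
        candidate_failures = collect_failures(candidate)
        # Keep the candidate only if it strictly reduces the failure count.
        if len(candidate_failures) < len(best_failures):
            best, best_failures = candidate, candidate_failures
    return best
```

In practice `predict` and `revise` would wrap model calls; keeping them as injected callables lets the loop be exercised with deterministic stubs.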
The core problem is that existing MLLM-based UI evaluators either produce surface-level critiques or reflect model biases rather than simulating how a *specific user* would respond. PerceptUI frames UI/UX evaluation as persona-conditioned question answering, predicting both an answer and a contrastive rationale. This is a meaningful reframing: it moves from "what does the model think is better" to "what would this particular user say."
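The persona-conditioned framing can be illustrated with a small data-structure sketch. The `Prediction` record and the `response_distribution` helper below are hypothetical names, not part of PerceptUI; they only show how per-persona answers (each paired with a rationale) might be aggregated into the population-level response distribution mentioned in the abstract, assuming simple unweighted counting.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Prediction:
    """One synthetic user's response to one UI question (hypothetical schema)."""
    persona_id: str
    answer: str
    rationale: str  # why this answer fits the persona better than alternatives


def response_distribution(predictions: Iterable[Prediction]) -> Dict[str, float]:
    """Share of personas choosing each answer; empty input gives an empty dict."""
    counts = Counter(p.answer for p in predictions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {answer: count / total for answer, count in counts.items()}
```

A weighted variant (for example, reweighting personas to match a target demographic mix) would be a natural extension, but nothing in the review says which aggregation the paper uses.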
2. Methodological Rigor
Strengths in methodology:
Concerns:
3. Potential Impact
Practical applications are substantial. UI/UX evaluation is a bottleneck in product development, and a reliable synthetic user framework could accelerate design iteration, reduce participant recruitment costs, and enable early-stage subgroup analysis. The cost analysis (Table 7) demonstrates practical viability — one-time training costs of $200-500 versus linear per-persona costs for frontier models.
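The cost argument amounts to a break-even calculation between a fixed training cost and a per-persona marginal cost. The sketch below assumes a simple linear cost model; the function name and the per-persona prices used in any call are hypothetical, since the review reports only the $200-500 one-time training range and not the per-persona figures from Table 7.

```python
import math
from typing import Optional


def break_even_personas(
    train_cost: float,
    frontier_cost_per_persona: float,
    local_cost_per_persona: float = 0.0,
) -> Optional[int]:
    """Smallest persona count at which a fine-tuned model is no more expensive.

    Assumes total costs are train_cost + n * local_cost_per_persona for the
    fine-tuned model and n * frontier_cost_per_persona for the frontier model.
    Returns None when the frontier model is never more expensive per persona.
    """
    saving_per_persona = frontier_cost_per_persona - local_cost_per_persona
    if saving_per_persona <= 0:
        return None
    return math.ceil(train_cost / saving_per_persona)
```

For instance, with an illustrative (not reported) frontier price of $0.50 per persona and negligible local inference cost, a $200 training run would pay for itself at 400 simulated personas.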
Research impact could extend to:
The automotive UX application (UXCar) demonstrates domain transferability beyond web/mobile, which is valuable for industry adoption.
4. Timeliness & Relevance
This work is highly timely. The explosion of AI-generated interfaces (via tools like v0, Bolt, etc.) creates urgent demand for automated evaluation. The paper explicitly addresses the emerging need for evaluating generated web interfaces (WebDevJudge experiments). The persona-conditioning aspect also aligns with growing emphasis on inclusive design and accessibility testing.
The positioning relative to concurrent work (SimAB, AgentA/B, UXAgent, Avenir-UX) is well-articulated. PerceptUI differentiates itself by focusing on *why* predictions match specific users, rather than just simulating interaction traces or predicting aggregate outcomes.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Additional observations:
Overall Assessment
PerceptUI presents a well-engineered framework with a genuinely novel training paradigm (contrastive reflection) that addresses a real and growing need. The breadth of evaluation is impressive, though depth on the most novel claims (persona-conditioning) is limited by proprietary data constraints. The contrastive reflection mechanism has potential for broader impact beyond UI/UX evaluation. The main risks to impact are reproducibility concerns and the somewhat fragile behavior of the base model without CRFT.
Generated Jun 5, 2026
Comparison History (21)
Paper 2 likely has higher impact due to broader applicability and timeliness: persona-conditioned synthetic users could meaningfully change UI/UX evaluation workflows across software, HCI, product design, and A/B testing, with clear real-world adoption potential. Its two-stage training (contrastive reflection fine-tuning plus failure-trace prompt evolution) suggests methodological rigor and a general framework for human-aligned agent evaluation. Paper 1 is novel and valuable for cheminformatics/LLM molecular generation, but its impact is narrower to SMILES recovery pipelines and dependent on domain-specific tooling and datasets.
Paper 1 presents a novel two-stage training methodology and framework for generating human-aligned synthetic users, offering broad applications across HCI and software development. Paper 2, while addressing a critical infrastructure problem, is primarily a benchmarking study of existing agentic architectures. The methodological innovation and wider potential applicability of Paper 1 give it a higher potential scientific impact.
Paper 2 introduces a novel framework (PerceptUI) with clear practical applications in UI/UX evaluation, addressing a real industry need for scalable user testing. Its methodology combining contrastive reflection fine-tuning and reflective prompt evolution is innovative, and it demonstrates generalization across domains. Paper 1, while methodologically rigorous as an audit/diagnostic study, is narrower in scope—it primarily debunks a mechanism in RAG pipelines without proposing new methods. Paper 2 has broader cross-field impact (HCI, ML, product development) and stronger real-world applicability.
Paper 1 addresses a fundamental challenge in LLM agent memory systems—cross-scenario generalization—which is broadly relevant to the rapidly growing field of LLM agents. Its systematic evaluation across 8 memory systems and 5 diverse scenarios, along with the actionable insight that agent-controlled memory outperforms passive pipelines, provides a strong foundation for future memory system design. Paper 2, while valuable for UI/UX evaluation, targets a narrower application domain. Paper 1's breadth of impact across the entire LLM agent ecosystem and its timely contribution to a fast-moving research area give it higher potential scientific impact.
Paper 1 addresses a fundamental challenge in reinforcement learning (credit assignment in multi-agent environments) and demonstrates exceptional empirical results, outperforming massive proprietary models like GPT-5 with an 8B model. Its methodological breakthrough has broader implications for foundational AI training. Paper 2 presents a valuable but narrower applied contribution in HCI and UI/UX evaluation.
Paper 2 presents a novel methodological framework with massive cross-industry applications. While Paper 1 provides a timely dataset for AI safety, Paper 2's ability to reliably simulate human UI/UX evaluations has the potential to fundamentally transform software development and HCI workflows, offering broader economic and scientific impact.
Paper 1 is more novel and timely, leveraging LLM agents for persona-conditioned, human-aligned UI/UX evaluation with a clear technical contribution (two-stage training with contrastive reflection and prompt evolution) and broad applicability across HCI, ML, product design, and A/B testing workflows. Its potential for real-world deployment is high, enabling faster iteration and scalable user research. Paper 2 addresses an important domain, but agent-based pandemic policy RL frameworks are less distinctive, often sensitive to modeling assumptions, and may have narrower, harder-to-validate real-world impact compared to a deployable LLM-based evaluation framework.
Paper 2 introduces a highly novel paradigm bridging AI and Human-Computer Interaction by using LLMs as persona-conditioned synthetic users. This approach has massive breadth of impact, potentially revolutionizing the entire UI/UX and software development industry by replacing costly human trials. While Paper 1 addresses an important telehealth issue, its methodology (curriculum learning and ensembling) is more incremental and restricted to a specific medical text generation vertical.
Paper 1 likely has higher scientific impact due to stronger novelty and cross-field significance: leveraging a large pretrained fMRI encoding model (TRIBE v2) for synthetic-data augmentation and potential zero-shot brain-to-image decoding directly addresses a key bottleneck (scarce labeled neural data) with measurable gains (up to 68%). If robust, it can broadly affect neuroscience, neuroimaging, machine learning, and brain–computer interfaces. Paper 2 targets an important applied problem in HCI, but LLM-based synthetic user evaluation is a crowded space and may face harder-to-validate alignment/generalization, limiting foundational impact.
Paper 1 (Synapse) introduces a fundamentally new abstraction—typed federated artifacts—that addresses a core limitation in federated learning: the inability to handle heterogeneous architectures without sharing weights or data. Its contributions span formal privacy guarantees, cross-architectural transfer across four LLM families, and principled merge operations, opening new research directions in federated systems. Paper 2 (PerceptUI) is a solid application of LLMs for UI/UX evaluation but operates in a narrower domain with more incremental contributions. Synapse's broader theoretical and practical implications across federated learning, privacy, and multi-model ecosystems give it higher potential impact.
Paper 1 has higher potential scientific impact because it addresses the automation of the entire scientific research lifecycle, which can accelerate discovery across numerous scientific disciplines. While Paper 2 presents a valuable tool for UI/UX design, its scope is limited to software development and product evaluation. Paper 1's open-source release of datasets and models further democratizes scientific research, yielding a broader and more profound impact on the scientific community.
Paper 1 likely has higher scientific impact due to its novelty in formalizing ICL-specific constraints for multi-table relational database encoding without any training, plus theoretical justification and scalable SQL primitives enabling broad deployment. The approach targets a pervasive enterprise/data-science setting (RDBs) with strong real-world applicability and potential to influence foundation-model integration with structured data across analytics, ML systems, and databases. Paper 2 is timely and useful for UI/UX evaluation, but is more application-specific and depends on fine-tuning/prompt-evolution techniques that may be less broadly generalizable than Paper 1’s principled, training-free RDB framework.
Paper 2 introduces a novel paradigm of using LLMs as persona-conditioned synthetic users, bridging AI and HCI. This has immense real-world application potential by significantly reducing the cost and time of human-centric UI/UX evaluation. While Paper 1 offers a strong algorithmic improvement for web agents, Paper 2's approach represents a broader conceptual shift with wider interdisciplinary impact across software development and design.
Paper 2 reveals a fundamental limitation of LLM-driven program evolution—systematic convergence toward structural attractors—which has broad implications for the rapidly growing fields of LLM-based code generation, automated program synthesis, and evolutionary computation. This finding is highly novel, methodologically rigorous (controlled experiments across models, prompts, with GP baselines), and challenges core assumptions underlying many LLM-augmented search/optimization systems. Paper 1, while practically useful for UI/UX evaluation, addresses a narrower application domain with incremental improvements over existing LLM-based evaluation approaches.
Paper 1 has higher likely impact due to stronger real-world applicability (scalable UI/UX evaluation and persona-conditioned feedback), broader cross-field relevance (HCI, product design, LLM alignment/agents, evaluation), and methodological contribution beyond benchmarking (two-stage training with contrastive reflection fine-tuning and prompt evolution). Paper 2 is timely and rigorous as a diagnostic benchmark for chronological reasoning and shortcut biases in VLMs, but its impact is narrower (mainly evaluation) and may be subsumed as models/datasets evolve, whereas Paper 1 offers a deployable framework with immediate industry pull.
Paper 2 addresses a critical and timely issue in AI safety and privacy: determining appropriate boundaries for memory use in personalized agents. While Paper 1 offers a highly practical tool for UI/UX design, Paper 2 tackles a fundamental challenge in conversational AI deployment, possessing broader implications for user trust, data privacy, and the foundational architecture of memory-augmented LLMs across multiple domains.
Paper 1 is more novel and timely, directly addressing an emerging, high-stakes security and AI-safety failure mode (human oversight of agent sabotage) with a rare long-horizon, large-scale human-subjects evaluation across multiple frontier models. Its findings have immediate real-world implications for software supply-chain security, developer tooling, and governance, and can influence both AI safety research and industry practices. While Paper 2 is useful for UI/UX iteration and has broad product relevance, it is closer to incremental methodology in synthetic-user evaluation and carries higher risk of domain-limited impact and benchmark overfitting.
Paper 2 addresses a fundamental question about how transformer architectures work internally—specifically how absolute position information emerges in RoPE-based models despite only relative offsets being explicitly encoded. This mechanistic insight into attention sinks, causal masks, and residual streams has broad implications for transformer architecture design, positional encoding research, and interpretability. Paper 1, while practically useful for UI/UX evaluation, is more application-specific and incremental. Paper 2's findings are likely to influence a wider range of downstream research in NLP, architecture design, and mechanistic interpretability.
Paper 2 likely has higher impact: it contributes a reusable, theory-informed human collaboration dataset with action-level mental model annotations—an enabling resource for broad research on human–AI/agent collaboration, evaluation, and training. Its methodology is grounded in a classic social-science task and provides benchmarks across multiple LLMs, supporting rigor and comparability. The dataset can generalize across domains (dialogue, HCI, cognitive science, agent alignment), making its cross-field reach high and timely as agents become collaborators. Paper 1 is promising for UI/UX, but is narrower and more solution-specific.
PerceptUI addresses a high-impact practical problem (UI/UX evaluation) with a novel persona-conditioned framework combining contrastive reflection fine-tuning and reflective prompt evolution. It has clear real-world applications in product development, potentially transforming how companies conduct user research. Paper 2, while methodologically rigorous in benchmarking LLM reasoning, contributes primarily another evaluation benchmark in an already crowded space. PerceptUI's interdisciplinary impact spanning HCI, AI, and product design, combined with its immediate practical utility for reducing costly user studies, gives it broader and more transformative potential.