AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters
Hanjun Luo, Zhimu Huang, Sylvia Chung, Yiran Wang, Yingbin Jin, Jialin Li, Jiang Li, Xinfeng Li
Abstract
Text-to-image (T2I) systems increasingly rely on upstream prompters, either humans or multimodal large language models (MLLMs), to translate user intent into detailed prompts. Yet current benchmarks fix the prompt and only evaluate T2I models, leaving the prompting proficiency of this upstream component entirely unmeasured. We introduce AtelierEval, the first unified benchmark that quantifies prompting proficiency across 360 expert-crafted tasks. Grounded in a cognitive view, it spans three task categories and instantiates tasks using a taxonomy of real-world challenges, with a dual interface for both humans and MLLMs. To enable scalable and reliable evaluation, we propose AtelierJudge, a skill-based, memory-augmented agentic evaluator. It produces subjective and objective scores for prompt-image pairs, achieving a Spearman correlation of 0.79 with human experts, approaching human performance. Extensive experiments benchmark 8 MLLMs against 48 human users across 4 T2I backends, validate AtelierEval as a robust diagnostic tool, and reveal the superiority of mimicry over planning, advocating for an image-augmented direction for future prompters. Our work is released to support future research.
AI Impact Assessments
Scientific Impact Assessment: AtelierEval
1. Core Contribution
AtelierEval addresses a genuine gap in the T2I evaluation ecosystem: while numerous benchmarks evaluate generative models given fixed prompts, none systematically measure the *prompter's* ability to translate intent into effective prompts. The paper contributes three things: (1) a 360-task benchmark spanning three cognitively-grounded task categories (Open-ended, Constrained, Imitation), (2) AtelierJudge, a memory-augmented agentic evaluator that decouples subjective and objective scoring, and (3) an extensive empirical study benchmarking 8 MLLMs against 48 human users across 4 T2I backends.
The problem formulation is the paper's strongest conceptual contribution. By distinguishing Paradigm 3 (prompting proficiency as capability-oriented, model-agnostic evaluation of the prompter's policy π) from model benchmarking (Paradigm 1) and prompt optimization (Paradigm 2), the authors carve out a well-defined evaluation niche. The three-category task decomposition grounded in Guilford's Structure of Intellect theory—mapping divergent production, convergent production, and cognition to OE, CO, and IM tasks—provides a principled rather than ad-hoc task taxonomy.
2. Methodological Rigor
Benchmark Design. The 360 expert-crafted tasks with challenge primitives (4 semantic + 5 constraint types) provide systematic coverage. The formal justification of task completeness (Appendix A) using information-theoretic arguments under the single-turn text-only assumption is sound, though the completeness claim is necessarily bounded by this restrictive assumption. The dual-interface design enabling direct human-MLLM comparison under identical conditions is well-executed.
AtelierJudge. The dual-process architecture (subjective RAG-based scoring + objective binary verification) is well-motivated. The meta-evaluation results are compelling: Spearman ρ=0.81 with GPT-5.4 as backbone approaches human agreement (0.83), and objective accuracy of 95.5% is strong. The ablation studies (Tables 11-13) validate design choices including similarity-based retrieval over alternatives and K=3 as optimal. However, the evaluator's reliance on frontier models (GPT-5.4) for deployment raises questions about accessibility and cost.
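For readers unfamiliar with the agreement metric quoted above, the following is a minimal sketch of how a Spearman rank correlation between judge scores and human scores can be computed. It is not the paper's evaluation code; the function names and the toy score lists are illustrative assumptions.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five prompt-image pairs (not data from the paper).
judge_scores = [3, 1, 4, 2, 5]
human_scores = [2, 1, 5, 3, 4]
rho = spearman(judge_scores, human_scores)  # 0.8 for this toy data
```

Because the metric depends only on rank order, a judge can score on a different scale from the human raters and still reach a high ρ, which is why it suits comparisons between an automated evaluator and experts.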
Human Study. The 48-participant study with balanced Latin square design, stratified novice/skilled groups, and screen recording verification demonstrates careful experimental methodology. The compensation structure (50 skilled) and prohibition of LLM tools during testing add rigor.
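As an illustration of what a balanced Latin square provides, the sketch below builds one with the standard Williams construction for an even number of conditions. This assessment does not state which factor the study counterbalanced, so the choice of four conditions and the participant-to-row assignment are assumptions made for the example only.

```python
def balanced_latin_square(n):
    """Williams-design Latin square for an even number of conditions n.

    Each condition appears once per position, and each ordered pair of
    distinct conditions appears exactly once as immediate neighbours.
    """
    if n < 2 or n % 2 != 0:
        raise ValueError("this construction requires an even n >= 2")
    # First row interleaves low and high indices: 0, 1, n-1, 2, n-2, ...
    first = [0]
    lo, hi = 1, n - 1
    while len(first) < n:
        first.append(lo)
        lo += 1
        if len(first) < n:
            first.append(hi)
            hi -= 1
    # Every later row is the first row shifted cyclically by one condition.
    return [[(c + shift) % n for c in first] for shift in range(n)]


# Assumption: four conditions, 48 participants assigned to rows in rotation.
square = balanced_latin_square(4)
orders = [square[p % 4] for p in range(48)]  # 12 participants per row
```

With four rows and 48 participants, each presentation order would be used by 12 people, so position and first-order carryover effects are spread evenly across conditions.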
Potential Concerns. The single-turn restriction, while enabling controlled evaluation, significantly limits ecological validity—real workflows involve iteration. The acknowledged demographic concentration (primarily young adults in North America and East Asia) limits generalizability of human findings. The use of very recent, potentially unstable model versions (GPT-5.2, GPT-5.4, Claude-4.5) means results may not be reproducible as APIs evolve.
3. Potential Impact
Research Infrastructure. As the first systematic benchmark for prompting proficiency, AtelierEval could become a standard evaluation tool for prompt engineering research, replacing fragmented qualitative studies. The open-source release amplifies this potential.
Practical Insights. Several findings have direct implications: (a) the "homogenization" effect of MLLM middleware compressing quality differences across prompters suggests diminishing returns for advanced prompting on platforms like ChatGPT; (b) the "logical interference" finding—where external MLLM reasoning conflicts with internal middleware on GI-1, dropping objective scores from 69.6% to 47.1%—is practically important for workflow design; (c) the superiority of imitation over planning (IM outperforming CO on objective metrics despite comparable constraint complexity) motivates image-augmented prompting as a research direction.
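To put the GI-1 objective-score drop in item (b) in perspective, a short calculation can separate the absolute change from the relative one. Only the two percentages come from the text above; the helper name is hypothetical.

```python
def drop_summary(before_pct, after_pct):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before_pct - after_pct
    relative = absolute / before_pct * 100
    return round(absolute, 1), round(relative, 1)


# Objective scores on GI-1 as quoted in the assessment.
absolute_drop, relative_drop = drop_summary(69.6, 47.1)
# absolute_drop is 22.5 percentage points; relative_drop is about 32.3 percent
```

Put differently, roughly a third of the objective score is lost when external MLLM reasoning is stacked on top of the backend's internal middleware.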
Education. The framework provides a structured diagnostic tool for prompt engineering education, with category-specific feedback enabling targeted skill development.
4. Timeliness & Relevance
The paper is highly timely. As T2I systems integrate MLLM middleware (ChatGPT, Gemini), understanding the prompter's role becomes increasingly important yet remains unmeasured. The emergence of MLLM-as-prompter workflows (both implicit and explicit integration) makes this evaluation gap urgent. The finding about middleware interference is particularly relevant as more platforms adopt this architecture.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations: The paper is thorough (60 pages with appendices) and transparent about limitations. The stability analysis (Appendix S) confirming that 1 prompt × 4 images is sufficient strengthens confidence in the experimental design. However, the sheer number of frontier models used (GPT-5.4, GPT-5.2, Claude-4.5-Sonnet, Gemini-3-Pro) makes this expensive to replicate.
Generated May 22, 2026
Comparison History (24)
Paper 1 tackles a highly complex, high-value problem (automated hardware design) by introducing an innovative test-time scaling and skill evolution framework. Overcoming the limits of long-context EDA tasks without model fine-tuning offers immense real-world applications in chip design. While Paper 2 provides a valuable benchmarking tool for text-to-image prompting, Paper 1's methodological novelty and its potential to unlock breakthroughs in specialized, high-stakes engineering domains give it a higher potential for broad scientific and industrial impact.
Paper 2 addresses a critical and universal bottleneck in LLM deployment: achieving multi-domain specialization under strict memory and inference constraints. Its framework has broad applicability across virtually all NLP and AI domains, offering significant efficiency gains (e.g., a 9B model outperforming a 32B model). While Paper 1 introduces a novel and valuable benchmark, its scope is constrained to the specific niche of text-to-image prompting, limiting its breadth of impact compared to the foundational efficiency improvements proposed in Paper 2.
Paper 1 addresses a critical and emerging security challenge in multi-agent LLM architectures—securing latent communication (KV cache sharing) against privacy leakage. Its focus on representation-level safety via adversarial training has profound implications for deploying secure, efficient multi-agent systems in real-world environments. In contrast, Paper 2 presents a benchmark for text-to-image prompting, which is valuable but narrower in scope and less likely to drive foundational shifts in AI safety and system design.
Paper 1 addresses a foundational challenge in AI: improving LLM capabilities while adhering to strict memory and inference constraints. Its modular approach (SkillWeave/SkillZip) offers broad, cross-domain applications for deploying efficient models, demonstrated by a 9B model outperforming a 32B model with a 4x speedup. While Paper 2 presents a novel and useful evaluation benchmark for text-to-image prompting, evaluation frameworks for specific prompting workflows typically have a narrower, more transient impact compared to core architectural and deployment efficiency improvements for general-purpose LLMs.
Paper 2 (AMEL) addresses a fundamental bias in LLM-as-judge evaluation pipelines that affects a vast range of applications (code review, content moderation, output scoring). Its findings—that conversational history biases subsequent judgments, with negativity asymmetry and entropy-dependent effects—have immediate, broad implications for anyone using LLMs as evaluators. The rigorous experimental design (75,898 API calls, 11 models, 4 providers) and actionable mitigation advice (fresh context per item) give it high practical impact. Paper 1 introduces a valuable but more niche benchmark for T2I prompting proficiency, with narrower applicability.
Paper 2 likely has higher impact: it introduces a general, system-level inference strategy (idle-time speculative planning) applicable across many LLM-agent settings with tool latency (web, code, robotics, assistants), offering immediate practical deployment value and broad cross-field relevance. The method is timely given agent tool-use latency and shows measurable gains on widely used benchmarks (GAIA, FRAMES, MLE-Bench). Paper 1 is novel and useful, but its impact is narrower (text-to-image prompting evaluation) and depends on adoption of a specific benchmark/judge setup, with more domain-specific applicability.
Paper 1 addresses a fundamental challenge in AI planning with LLMs, proposing a paradigm shift toward generating verified symbolic solvers rather than relying on LLMs at inference time. This has broad implications across AI agents, robotics, and automated reasoning. Its focus on reliability and efficiency addresses critical bottlenecks in deploying LLM-based agents. Paper 2, while introducing a useful benchmark for T2I prompting evaluation, addresses a narrower problem space. Paper 1's conceptual framework for categorizing planner-generation methods and its roadmap for future research are likely to influence a wider community and have longer-lasting impact.
Paper 1 addresses a critical gap in the rapidly expanding field of Generative AI by evaluating upstream prompting proficiency, an issue relevant to both human-computer interaction and multimodal LLM research. Its introduction of a unified benchmark and an agentic evaluator has broad applicability across AI disciplines. While Paper 2 offers valuable real-world regulatory impact for toxicology and alternatives to animal testing, Paper 1's generalizable methodology and relevance to a larger, highly active scientific community give it a higher potential for widespread scientific impact and citation.
Paper 1 addresses a fundamental and broadly applicable problem—systematic diagnosis of LLM agent failures at scale—which is relevant across virtually all LLM agent deployments. It formalizes a new problem (corpus-level trace diagnostics), proposes a multi-agent architecture, and demonstrates concrete downstream improvements (30.4pp). This has high practical impact for the rapidly growing LLM agent ecosystem. Paper 2, while novel in benchmarking prompter proficiency for T2I systems, addresses a narrower problem space with more limited cross-field applicability. Paper 1's infrastructure-level contribution has broader and more lasting impact potential.
AtelierEval addresses a significant gap in T2I evaluation by creating the first benchmark for prompting proficiency rather than model quality, with a comprehensive framework spanning 360 tasks, 8 MLLMs, 48 humans, and 4 backends. Its breadth of impact spans HCI, generative AI evaluation, and cognitive science. Paper 2, while technically interesting in combining Bayesian optimization with LLM-elicited embeddings for prompt tuning, addresses a narrower optimization problem with more incremental contributions. AtelierEval's benchmark and agentic evaluator have stronger potential for community-wide adoption and downstream research impact.
AtelierEval addresses a more broadly impactful gap—evaluating prompting proficiency for text-to-image systems, which affects the rapidly growing generative AI ecosystem. It introduces both a benchmark and a novel agentic evaluator (AtelierJudge) with strong human correlation, conducts extensive experiments across MLLMs and humans, and provides actionable insights (mimicry vs. planning). Its scope (360 tasks, 8 MLLMs, 48 humans, 4 T2I backends) and methodological depth give it broader relevance. SGR-Bench tackles a more niche problem (state-gated retrieval) with a smaller benchmark (100 tasks) and narrower applicability, though it provides valuable failure analysis for search agents.
Paper 1 is more scientifically impactful: it introduces a clearly novel benchmark (AtelierEval) for an unmeasured component (prompting proficiency) with a sizable, expert-designed task suite and a validated evaluator (0.79 Spearman vs experts), making it methodologically stronger and more citable as shared infrastructure for T2I and multimodal evaluation research. It also yields actionable findings about prompting strategies. Paper 2 is promising tooling, but relies on limited internal case studies and subjective preferences; impact may be more product/engineering-facing and harder to generalize or reproduce scientifically.
Paper 1 evaluates AI agents across highly diverse, impactful scientific domains (astronomy, epidemiology, chemistry, history) to answer a fundamental question: when do coordinated AI agents actually improve scientific workflows over simple baselines? This addresses a critical gap in AI-for-science methodological rigor. Paper 2, while introducing a valuable benchmark for text-to-image prompting, has a narrower scope restricted to generative AI and prompt engineering, making Paper 1's potential breadth of impact across broader scientific disciplines significantly higher.
FLUID addresses a fundamental limitation of ID-based recommender systems for ephemeral content (livestreaming), proposing a novel ID-free framework with multimodal semantic codes. Its deployment at massive scale (1B+ users) with measurable online gains demonstrates real-world impact. The paradigm shift from ID-based to semantic code-based recommendation has broad implications for recommender systems research. Paper 2, while introducing a useful benchmark for T2I prompting evaluation, addresses a narrower problem with less transformative potential and limited applicability beyond the text-to-image domain.
Paper 1 addresses a timely and practical gap in T2I evaluation by introducing the first benchmark for prompting proficiency, with rigorous methodology (360 expert tasks, 8 MLLMs, 48 humans, 4 backends) and a strong automated evaluator (0.79 Spearman with experts). It targets the rapidly growing generative AI ecosystem with clear reproducibility and broad applicability. Paper 2 tackles an interesting but niche question about coordinated AI agents across disparate scientific domains; however, its cross-domain setup feels contrived (e.g., molecular sonification), findings are mixed, and the impact is more exploratory than foundational.
Paper 1 has higher likely scientific impact due to its demonstrated large-scale, real-world deployment and measurable online gains on a billion-user industrial livestreaming recommender, addressing a core, pervasive problem (extreme item cold-start and ephemeral content) with an ID-free ranking architecture and multimodal discrete codes. This combination of methodological innovation plus proven production value suggests broad applicability to other short-lived content domains and recommender systems. Paper 2 is timely and useful as a benchmark, but its impact depends on community adoption and primarily targets evaluation within the T2I prompting niche.
Paper 2 likely has higher impact due to stronger novelty and broader relevance: it introduces a new benchmark for evaluating prompting (an under-measured but critical component in T2I pipelines) and an agentic judge correlated with expert ratings, enabling scalable, standardized assessment across humans and MLLMs. This can influence evaluation practices across generative AI, HCI, and multimodal alignment. Paper 1 is a useful, efficient plug-and-play method for video LLM token compression, but its scope is narrower and more incremental relative to existing pooling/token-reduction techniques.
AtelierEval introduces a novel benchmark addressing an unexamined gap—evaluating prompter proficiency rather than T2I models—which opens a new research direction. Its comprehensive framework (360 tasks, cognitive taxonomy, dual human/MLLM interface, agentic evaluator with 0.79 Spearman correlation) provides substantial methodological contributions. The finding that mimicry outperforms planning offers actionable insights for future prompter design. Paper 2, while useful, proposes an incremental training-free enhancement for video token compression, a more crowded area with less paradigm-shifting potential.
Paper 2 introduces a concrete, actionable benchmark (AtelierEval) addressing a clear gap—evaluating prompting proficiency for T2I systems—with empirical validation (0.79 Spearman correlation with experts), extensive experiments across MLLMs and human users, and released artifacts. It has immediate practical impact for the rapidly growing generative AI community. Paper 1 proposes a theoretical framework (ontological continuum) for knowledge graph re-engineering that, while intellectually interesting, is more abstract, primarily visionary, and relies on a single case study with five open challenges yet to be addressed. Paper 2's timeliness, empirical rigor, and broader community relevance give it higher near-term impact.
SciCore-Mol addresses a fundamental bottleneck in AI for science (molecular representation) and has profound, direct applications in drug discovery and chemical synthesis. Its interdisciplinary impact across chemistry and pharmacology gives it a higher potential for transformative scientific advancements compared to AtelierEval, which focuses on the narrower domain of text-to-image prompt evaluation.