Mingjia Li, Jin Wu, Hong Qian, Wenhao Huang, Yiyang Huang, Yiwen Zhang, Chanjin Zheng, Xiangfeng Wang
Contextualized assessment offers high ecological validity for evaluating creativity but introduces a critical challenge: observed performance may be confounded with cognitive proficiency (domain knowledge) and agency (willingness to engage). Meanwhile, in the age of generative AI, creative problem solving increasingly occurs in tool-mediated and human-AI interactive environments, making fully static assessment less aligned with contemporary creative practice. To address these issues, this paper proposes IntElicit, a framework for eliciting and assessing contextualized creativity via dialogue policy optimization. IntElicit functions as a constrained adaptive AI Interviewer: it provides non-directive knowledge and agency scaffolds in multi-turn interaction to reduce non-creative confounders, while preserving participants' responsibility for generating the creative content being evaluated. Specifically, to tackle sparse rewards and potential reward hacking (e.g., answer dictation) in open-ended educational dialogue, IntElicit introduces a decomposed process reward mechanism. This mechanism aligns the policy with pedagogical elicitation, rewarding prompts that draw out participant reasoning rather than producing optimal answers on their behalf. Extensive experiments, including participant simulation and a human subject study (N=64), show that IntElicit improves elicited creative outcomes over expert-designed baselines. Together, the results suggest that interactive elicitation can reveal creative potential that static FPSP-style assessment may miss, providing a formative and diagnostic lens for contextualized creativity assessment in AI-mediated learning contexts.
IntElicit introduces an interactive elicitation paradigm for contextualized creativity assessment, where an AI interviewer adaptively scaffolds participants through multi-turn dialogue to surface creative potential that static assessments might miss. The key insight is that observed creative performance in complex scenarios is confounded by non-creative factors (domain knowledge gaps, low confidence, unwillingness to elaborate), and that non-directive interactive scaffolding can reduce these confounders while preserving the participant as the source of creative ideas.
The primary technical contribution is a decomposed process reward mechanism that addresses two problems in open-ended dialogue optimization: (1) sparse rewards in multi-turn creative assessment, and (2) reward hacking where the AI might dictate answers rather than elicit them. The framework combines forward interaction sampling with look-ahead windows, AHP-weighted scenario-adaptive dimension scoring, turn-level process rewards that value pedagogically meaningful scaffolding, and a local reward model distilled from LLM-judge evaluations to guide PPO training.
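To make the idea of a decomposed turn-level reward concrete, the following is a minimal sketch under stated assumptions, not the authors' formulation. It combines a look-ahead gain in a weighted outcome score with a process term for elicitation quality and a penalty for answer dictation. The names `TurnScores`, `weighted_outcome`, and `turn_reward`, the coefficients, and the dimension names are all hypothetical. The weights stand in for AHP-derived dimension weights, and the per-turn scores stand in for outputs of a judge or local reward model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TurnScores:
    """Hypothetical per-turn judge outputs, each assumed to lie in [0, 1]."""
    dimension_scores: Dict[str, float]  # creativity dimensions of the participant's reply
    elicitation_quality: float          # how well the interviewer prompt draws out reasoning
    dictation: float                    # how much the interviewer supplied the answer itself


def weighted_outcome(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of dimension scores (weights play the role of AHP weights)."""
    total = sum(weights.values())
    return sum(weights[d] * scores[d] for d in weights) / total


def turn_reward(turns: List[TurnScores], t: int, weights: Dict[str, float],
                window: int = 2, process_weight: float = 0.5,
                dictation_penalty: float = 1.0) -> float:
    """Reward for interviewer turn t (assumed form, for illustration only).

    gain    : best weighted outcome inside the look-ahead window minus the
              outcome before turn t, which densifies an otherwise sparse signal
    process : credit for pedagogically meaningful elicitation at turn t
    penalty : cost for dictating the answer, which discourages reward hacking
    """
    baseline = weighted_outcome(turns[t - 1].dimension_scores, weights) if t > 0 else 0.0
    horizon = turns[t:t + window]
    best_future = max(weighted_outcome(x.dimension_scores, weights) for x in horizon)
    gain = best_future - baseline
    return (gain
            + process_weight * turns[t].elicitation_quality
            - dictation_penalty * turns[t].dictation)
```

In this sketch, a turn that raises the participant's score by handing over the answer can still receive a negative reward, because the dictation penalty outweighs the outcome gain. That is the behaviour the review attributes to the decomposed reward.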
The experimental design is relatively thorough, spanning simulated participants (three persona types across 16 scenarios), a human subject study (N=64), qualitative edge-case analysis, and dialogue strategy analysis. Several design choices demonstrate methodological awareness.
However, several methodological concerns arise. The human study is modestly sized (N=64, between-subjects with only 2 participants per scenario per condition), and the authors appropriately note that this limits inferential claims. The AHP weighting relies on only two expert psychologists, and the consistency ratio for one category (CR=0.138) exceeds the standard threshold of 0.1; the authors acknowledge this but proceed with the resulting weights. The simulated participant personas, while useful for training, represent stylized engagement patterns rather than validated psychological profiles. The reliance on LLM-as-a-Judge for both training rewards and evaluation creates a circularity concern, though cross-judge checks partially mitigate this.
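For readers unfamiliar with the AHP consistency ratio mentioned above, the sketch below shows how it is commonly computed: CI = (λ_max − n) / (n − 1) and CR = CI / RI, where λ_max is the principal eigenvalue of the pairwise comparison matrix and RI is Saaty's random index for matrices of size n. The function name `consistency_ratio` and both example matrices are hypothetical; they are not the paper's comparison matrices and do not reproduce its CR of 0.138. Power iteration is used here only to keep the example dependency-free.

```python
# Saaty's random index values for matrix sizes 1..9.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def consistency_ratio(matrix, iterations=100):
    """Consistency ratio of a positive reciprocal pairwise comparison matrix."""
    n = len(matrix)
    ri = RANDOM_INDEX[n]
    if ri == 0.0:
        return 0.0  # 1x1 and 2x2 reciprocal matrices are always consistent

    # Power iteration for the principal eigenvector (priority weights).
    w = [1.0 / n] * n
    for _ in range(iterations):
        aw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(aw)
        w = [x / total for x in aw]

    # Estimate lambda_max as the mean of (A w)_i / w_i.
    aw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / w[i] for i in range(n)) / n

    ci = (lambda_max - n) / (n - 1)  # consistency index
    return ci / ri


# Hypothetical judgments: the first matrix is perfectly consistent
# (a_ik = a_ij * a_jk), the second contains a circular preference.
consistent = [[1.0, 2.0, 4.0],
              [0.5, 1.0, 2.0],
              [0.25, 0.5, 1.0]]
circular = [[1.0, 2.0, 0.25],
            [0.5, 1.0, 2.0],
            [4.0, 0.5, 1.0]]

for name, m in (("consistent", consistent), ("circular", circular)):
    cr = consistency_ratio(m)
    print(name, round(cr, 3), "acceptable" if cr <= 0.1 else "exceeds 0.1 threshold")
```

A CR above 0.1 is conventionally read as a signal that the expert judgments should be revisited before the derived weights are used, which is why the reported value for one category is flagged as a concern.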
Educational assessment: The framework offers a conceptually important shift from static to interactive creativity assessment, which aligns with authentic assessment principles. If validated at scale, this could influence how creativity is measured in educational settings, particularly in AI-mediated learning environments.
Dialogue policy optimization: The decomposed process reward mechanism addresses a genuine challenge in educational AI: optimizing for learning outcomes without the agent simply providing answers. This "elicitation alignment" problem is relevant beyond creativity assessment and applies to tutoring systems, clinical interviews, and Socratic dialogue agents.
Creativity science: The distinction between "unaided creativity" and "elicited creative potential" is theoretically meaningful. The framework operationalizes the idea that assessment conditions shape observable performance, offering a practical tool for studying how scaffolding interacts with creative processes.
Limitations of impact: The scenarios are domain-specific (environmental/technological futures), and generalization to other creative domains (artistic, scientific, entrepreneurial) remains untested. The framework's computational requirements (multiple large LLMs for training data generation) limit accessibility.
The paper addresses a timely intersection: (1) growing recognition that AI-mediated creative work requires new assessment paradigms, (2) advances in RLHF/dialogue optimization that make adaptive educational agents feasible, and (3) longstanding concerns about ecological validity in creativity measurement. The framing around FPSP and contextualized assessment connects meaningfully to established creativity research traditions while pushing toward interactive formats.
1. Well-motivated problem formulation: The paper clearly articulates why static assessment misses creative potential and why interactive elicitation is theoretically justified, grounding claims in creativity science literature.
2. Decomposed process reward: The mechanism that rewards pedagogically meaningful elicitation while penalizing answer dictation is the paper's strongest technical contribution, addressing a real alignment challenge.
3. Multi-perspective evaluation: The combination of simulation, human study, edge-case analysis, and strategy analysis provides converging evidence, even if each individual evaluation has limitations.
4. Qualitative edge cases: The six adversarial scenarios (Tables A10–A15) compellingly demonstrate where baseline models fail and IntElicit succeeds, particularly in maintaining non-directive scaffolding.
5. Transparent methodology: Extensive appendix materials (prompts, scenarios, persona definitions) support reproducibility.
1. Small human study: N=64 with only 2 participants per condition per scenario provides limited statistical power. The descriptive-only reporting of human results is appropriately cautious but limits the strength of conclusions.
2. Construct validity concerns: The paper claims to measure "elicited creative potential" but does not validate this against external creativity criteria or longitudinal outcomes. Whether higher scores under scaffolding reflect genuine creative potential rather than scaffolding-dependent performance is an open question.
3. Evaluation circularity: Both training rewards and evaluation use LLM-based scoring. While cross-judge checks help, the fundamental concern that improvements may reflect better alignment with LLM preferences rather than genuine creativity enhancement remains.
4. Limited demographic diversity: The sample (67% male, university students from a single institution) constrains generalizability claims.
5. No comparison with human interviewers: The paper argues that human interviewers can partially address confounding, but never compares IntElicit against actual human-conducted interviews.
6. Scalability untested: The computational pipeline involving multiple frontier LLMs for data generation is resource-intensive and may not scale to practical deployment settings.
IntElicit presents a thoughtful and well-executed framework at the intersection of creativity assessment and dialogue optimization. The decomposed process reward mechanism is a genuine technical contribution, and the problem formulation is well-grounded in both educational and AI research. The paper's main limitation is that the empirical evidence, while multi-faceted, remains preliminary—particularly the human study. The work opens an interesting research direction but requires substantially more validation before the framework could be considered reliable for educational deployment.
Generated Jun 11, 2026
Paper 2 (IntElicit) introduces a novel framework at the intersection of AI, creativity assessment, and education—fields with broad interdisciplinary impact. Its dialogue policy optimization approach for eliciting creativity is methodologically innovative, combining reinforcement learning with pedagogical theory. It addresses timely concerns about human-AI interaction and has wide applicability across education, psychology, and AI alignment. Paper 1, while technically solid for BIM compliance checking, addresses a narrower domain (AEC industry) with more incremental improvements. Paper 2's cross-disciplinary relevance, human subject validation, and alignment with the growing human-AI collaboration paradigm give it higher impact potential.
Paper 2 (IntElicit) has higher likely scientific impact due to broader real-world applicability and cross-field relevance: it contributes a general framework for interactive creativity elicitation/assessment spanning education, psychometrics, HCI, and AI alignment, backed by a human-subject study and explicit mechanisms against reward hacking. Paper 1 (RecToM) is novel and rigorous within LLM Theory-of-Mind prompting, but its impact is narrower (benchmark-centric ToM reasoning) and primarily advances inference-time prompting rather than introducing a widely deployable evaluation/intervention paradigm.
Paper 2 (TreeSeeker) likely has higher scientific impact due to broader applicability and timeliness: inference-time branch-and-return control for web/deep search is a core capability for LLM agents across domains (QA, research assistants, enterprise search). Its tree-structured UCB-style selection with memory for uncertainty/conflicts generalizes beyond education and could transfer to other tool-using settings. The evaluation spans multiple established deep-search benchmarks with consistent gains, suggesting stronger methodological breadth. Paper 1 is innovative for creativity assessment but is more niche and depends on human-subject context, limiting cross-field uptake.
HORMA addresses a fundamental and broadly applicable challenge in LLM agents—efficient memory management for long-horizon tasks—with a novel hierarchical organization approach combining structured construction and RL-based retrieval. It demonstrates strong results across multiple benchmarks with significant efficiency gains (22% token usage). Paper 2, while innovative in creativity assessment via dialogue optimization, addresses a narrower niche at the intersection of educational assessment and AI. HORMA's contributions are more likely to influence the rapidly growing LLM agent ecosystem, giving it broader impact potential across multiple fields and applications.
Paper 1 demonstrates higher potential impact due to its critical focus on the reliability of AI agents in high-stakes scientific and healthcare domains. Its large-scale benchmark and novel clean-room evaluation harness address a major flaw in current LLM evaluation: data leakage. The findings audit widely used consumer-facing agents, revealing severe factual shortcomings. This provides urgent, timely, and broad implications across AI safety, medical informatics, and public health. In contrast, Paper 2, while methodologically sound, targets a narrower application in educational creativity assessment with a smaller scale of validation.
Paper 1 is more novel and broadly impactful: it introduces a dialogue policy optimization framework with decomposed process rewards to elicit creativity while reducing knowledge/agency confounds—relevant to core ML, HCI, and educational assessment. Its methodology includes both simulations and a human study, and the problem is timely given widespread human–AI interaction. Paper 2 is strong and practical for civil/structural engineering automation, but its impact is narrower (domain-specific tooling around FE workflows) and depends on integration with proprietary software, limiting breadth despite open-sourcing.
Paper 1 introduces a technically rigorous framework using dialogue policy optimization and decomposed process rewards to assess creativity. Its approach to mitigating reward hacking in educational AI has broad applications in AI-mediated learning and alignment. Paper 2, while offering an interesting behavioral experiment, lacks the algorithmic innovation and technical depth of Paper 1, making Paper 1 more likely to drive future research in both AI assessment and human-AI interaction.
Paper 2 addresses a fundamental and critical bottleneck in modern AI: LLM deployment and multi-agent orchestration under dynamic infrastructure constraints. By bridging system-level signals with multi-agent planning and routing, it offers broad, scalable applications across cloud computing and AI services. Its massive performance improvements (7x latency reduction, near-perfect SLO compliance) indicate substantial real-world and industry impact. In contrast, Paper 1 is innovative but its scope is relatively narrower, focusing primarily on educational technology and human-AI creativity assessment.
Paper 2 demonstrates higher potential scientific impact due to its conclusive results and methodological innovation. While Paper 1 addresses an important medical AI application, its findings are exploratory and statistically non-significant due to high expert-rating noise. In contrast, Paper 2 introduces a novel dialogue policy optimization framework (IntElicit) with a decomposed process reward mechanism that successfully prevents reward hacking in educational AI. Supported by a human study showing clear improvements over expert baselines, Paper 2 offers immediate, broadly applicable contributions to AI-mediated learning, human-computer interaction, and cognitive assessment.
Paper 2 addresses a highly interdisciplinary challenge spanning AI, education, and cognitive psychology. By applying dialogue policy optimization and process rewards to the novel task of eliciting human creativity, it offers a broader societal and methodological impact compared to Paper 1, which focuses on a narrower, albeit rigorous, optimization of LLM agent skill architecture.